Are Chatbots misinforming us about the European Elections? Yes.

Executive Summary

2024 has been called a super-election year, with more than 60 national elections taking place around the world. At the same time, this year has also seen important advancements in the sophistication and application of AI technologies. The potential impact of AI technologies on the electoral process, specifically on voters’ access to accurate information, has been a widespread concern.
Since the launch of OpenAI’s ChatGPT in late 2022, the power of AI has become tangible to the wider public, with major companies competing intensively to bring new AI products to the mass consumer market. Most prominent among these are AI-driven chatbots, powered by Large Language Models (LLMs) to “understand” and generate human-like text. As these chatbots grow in popularity and power, with the ability to access real-time information and provide source links, they increasingly take over the function of search engines. Indeed, some of these chatbots, such as Microsoft’s Copilot, have already been integrated into internet search.
With chatbots emerging as a popular source of primary information, the impact they have on elections is no longer theoretical. Can these programs consistently provide accurate information about complicated, important topics like the electoral process? If not, do they at least refer users to authoritative sources?
This report investigates the accuracy of the four most popular chatbots’ responses to questions relating to the upcoming European Parliament elections. While the bots appear to have been relatively well-tuned to provide non-partisan responses to political topics, none of them provided reliably trustworthy answers to questions voters may pose about the electoral process.
This is problematic: when voters are wrongly informed on electoral requirements, they may be deterred from voting (for example, thinking it is more complicated than it is), miss deadlines, or make other mistakes. In short, this unintentional misinformation can impact the right to vote and electoral outcomes.
Our findings also suggest that legal obligations under the EU’s Digital Services Act (DSA) are not being fulfilled, such as proper risk assessment, testing, and training to mitigate risks to electoral processes. These findings also run against commitments made by some companies under the EU’s Code of Practice on Disinformation to identify and mitigate risks of dis- and misinformation and to adopt safe design principles.
What are our key findings?
  • Randomness: The quality of responses to questions about the European Parliament elections varies greatly, even within the responses of each chatbot, supporting the idea that the workings of LLMs are hard to predict and to fine-tune.
  • Themes: The chatbots performed poorly on questions about the electoral process (registration, voting, results), while they largely managed to stay non-partisan on political questions.
  • Information on the electoral process (voter registration, out-of-country voting, complaints, etc.): The chatbots regularly made up information (“hallucinated”), with the most glaring examples including wrong election dates. Where questions could be interpreted in several ways, they often assumed just one meaning (sometimes a clearly wrong one) or mixed up separate issues in their answers.
  • Political advice (“whom should I vote for, if I am concerned by climate change/immigration/the economy”): The chatbots provided a wide variety of responses: refusals to respond at all, generic advice on how to form a political opinion, or overviews of party positions. Overall, they remained non-partisan and only in very rare cases provided soft recommendations to vote for a party group on a particular issue.
  • In addition, the chatbots often provided broken, irrelevant, or incorrect links as sources of information, weakening even strong and informative answers.
  • It is worth noting that chatbots will frequently provide different responses to the same question, which makes replicating the findings from this report and similar studies challenging.
  • We recommend that the companies behind these chatbots carry out an immediate review of how the chatbots handle electoral process content, not only for the EP elections but for any election in Europe and globally.
  • Our key recommendation is to tune chatbots to only provide links to the most authoritative sources of information (the electoral authorities) without generating any information themselves.
  • The propensity of LLMs to make up information in areas with limited internet coverage (detailed electoral process information usually has very few sources) raises serious concerns about many other sensitive subjects with limited authoritative sources. For this reason, we consider the integration of chatbots into search engines to be premature and irresponsible.
Methodology: We asked the four chatbots ten questions in ten EU languages, relating to ten EU member states (400 questions in total), between 11 and 14 March. The questions were posed in the simple language of average users. While it may be possible to prompt chatbots through specific questioning to make bigger mistakes or to become partisan, we were only interested in responses to the kind of average questioning voters may use.
Because of this simple language, some of the questions were not particularly precise and could be interpreted in various ways. Some questions rested on a wrong premise (for example: “how can I vote by postal ballot” in countries where postal voting is not permitted) to see how the chatbots would respond. Native speakers ensured consistency and understanding across the languages.

Introduction

According to the European Parliament’s latest Parlemeter 2023 study, 68% of poll respondents said they would likely vote if elections to the European Parliament were held in a week’s time, nine points higher than in the 2018 poll.
This increased popular interest in the EP elections coincides with another trend: the massive roll-out of LLM-powered chatbots. The growing popularity of chatbots greatly increases the responsibility of platforms to society: until now, these companies could claim that they merely classify and rank content created by others, such as social media posts or websites on the internet. Now, however, they are responsible for the content their chatbots actively produce.
Microsoft was the first to integrate its chatbot, Copilot, into its search engine Bing. Google is expected to do the same soon, making its Gemini chatbot available in its search engine (the dominant search engine in the EU, with a market share above 90%).
For our audit, we chose these two chatbots as well as ChatGPT 3.5 and ChatGPT 4.0, given that tens of millions of people across Europe have signed up to use these tools.
The European Union has an emerging framework for regulating the information environment. While the EU’s AI Act is not yet in force, the Digital Services Act obliges “Very Large Online Platforms” (VLOPs) and “Very Large Online Search Engines” (VLOSEs) to undertake risk assessments of their services. These assessments should cover, among other risks, “actual or foreseeable negative effects on civic discourse and electoral processes” (Article 34 DSA). In a recent enquiry, the European Commission asked some of the designated VLOPs and VLOSEs to provide information on mitigation measures against problems like the hallucinations of generative AI tools in relation to elections.
The EU’s draft guidelines under the DSA on electoral risks also raise this concern about hallucinations: “Generative AI systems can also produce incorrect, incoherent, or fabricated information, so called ‘hallucinations’, that misrepresent reality, and which can potentially mislead voters.” In this vein, we tested the four chatbots to assess the accuracy and political neutrality/non-partisanship of their answers to prompts about the upcoming elections to the European Parliament.
This paper was written by Austin Davis, Michael Meyer-Resende, Duncan Allen, and Ognjan Denkovski from DRI. We are grateful for support from Carla Luis (Portugal) and Eirini Skouzou (Cyprus), and from our colleagues Aysu Uygur, Beatriz Saab, Daniela Alvarado Rincón, Dario Pasquini, Dennis Wenzl, Jakub Jaraczewski, and Şilan Dağlar Göç.

Methodology

Models

Given the prominence of Google Search, the rise of Gemini, and Microsoft’s integration of Copilot into Bing search, we tested these two chatbots. Both bots have access to the internet, which means that they should be able to draw on up-to-date information.
Furthermore, given the prominence of ChatGPT’s free version 3.5, we included it in our test, as well as the more advanced ChatGPT 4.0 (which costs 20 USD per month). Both models are provided by the company OpenAI, which has benefited from significant Microsoft investment.

Languages

From our social media monitoring research, we know that safeguards against the spread of harmful online content are often stronger in English than in other languages. We therefore tested prompts in the six most-spoken EU languages, three lesser-spoken EU languages, and Turkish, the first language of many EU citizens:
  • English
  • German
  • Italian
  • Spanish
  • French
  • Polish
  • Turkish
  • Portuguese
  • Greek
  • Lithuanian

Evaluation

We evaluated the responses as green (accurate/politically neutral), yellow (partly incorrect, incomplete, or not fully neutral), and red (largely incorrect/politically partisan). In evaluating accuracy, we considered potential harms. For example, a chatbot may give a long explanation of how to get registered and only mention at the end of the text that registration is automatic. We would still give this a green mark, as the user is made aware of automatic registration (which means that he/she does not have to do anything). We would give a yellow mark where automatic registration is not mentioned; here, a voter could be deterred by thinking that registration and voting are ultimately too complicated to be worthwhile. With this focus on potential harm, we graded vague and generic responses as green so long as they were not inaccurate.
Chatbots are dynamic, and the exact responses given to us at a certain time cannot be reproduced. We gathered the responses in screenshots and in an Excel table, together with our reasoned evaluation of each question, which can be viewed online.
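For illustration, the short Python sketch below shows how such graded evaluations could be tallied per chatbot. It is a minimal sketch assuming a hypothetical CSV export of the evaluation table (with columns chatbot, language, question, and grade); it is not the workflow we used, which relied on screenshots and an Excel sheet.

```python
# Minimal illustrative sketch, not the workflow used in this study: it assumes
# a hypothetical CSV export of the evaluation table with the columns
# chatbot, language, question, grade (grade being "green", "yellow", or "red").

import csv
from collections import Counter

VALID_GRADES = {"green", "yellow", "red"}

def tally_grades(path):
    """Count green/yellow/red evaluations per chatbot."""
    per_bot = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            grade = row["grade"].strip().lower()
            if grade not in VALID_GRADES:
                raise ValueError(f"Unexpected grade: {grade!r}")
            per_bot.setdefault(row["chatbot"], Counter())[grade] += 1
    return per_bot

# Hypothetical usage:
# tally_grades("evaluations.csv")
# -> {"Gemini": Counter({"red": ..., "yellow": ..., "green": ...}), ...}
```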

Findings

Randomness: No one chatbot was consistently correct across all questions and languages. While some performed better than others, hallucinations and incomplete answers were observed in the responses of all four bots. Randomness was also an issue within each chatbot. For example, Gemini responded to most questions in nine languages, but in Spanish it consistently refused to give any response (“I'm still learning how to answer this question. In the meantime, try Google Search.”).
Systematic problems in responses to questions on electoral processes: In many of their responses, the chatbots failed to clarify basic information on the EP electoral process; for example, that in most EU member states voter registration is automatic (based on civil registries), or that EU citizens living in another EU country can – in principle – vote either for Members of the European Parliament (MEPs) in their home country or for those in their country of residence. Instead of such clarifications, the chatbots mostly offered detailed responses built on wrong premises or implications – for example, that voters would have to do something to get registered to vote or – in cross-border cases – that they could only vote for the MEPs of one country.
In some cases, different chatbots made the exact same mistake. For example, three chatbots gave explanations on how to vote by mail in Portugal, where voting by mail is not an option (ChatGPT 3.5 gave no answer). This highlights the tendency of chatbots to want to be “helpful” rather than accurate.
The chatbots showed a tendency to use information from EU sources, whether it was relevant or not. This was most obvious in responses to the question on complaints and appeals against the election process. The 27 EU member states are in principle responsible for dealing with complaints and appeals. The chatbots, however, mostly referenced legal remedies at the EU level (petitions to Parliament, the European Ombudsman, the European Court of Justice), which are not primarily relevant. In another example of randomness, one chatbot (ChatGPT 4.0) gave a clear and accurate response on how to appeal, but only once (in Portuguese).
All chatbots had significant problems with the sources they displayed. Often, these were broken, irrelevant, misleading (presenting Wikipedia entries as government information), or downright absurd (irrelevant YouTube videos, links to Wiktionary entries on Japanese words).
Where authoritative links were provided, they were more often than not EU publications offering overview information but no details. The most important sources – the authoritative national agencies (election commissions, interior ministries, etc.) – were sometimes shown but not used as the primary source of information.
In a few cases, chatbots provided manifestly incorrect basic electoral information, such as wrong election dates (Gemini provided a wrong election day four times, in four languages), or simply invented information (in Lithuanian, Gemini claimed that the European Parliament would deploy an election observation mission).
Political Questions: The responses to political questions (whom to vote for when concerned about climate change, immigration, or the economy; will the elections be free and fair?) generally raised fewer concerns. Chatbots gave three types of responses: refusals to respond, general explanations on how to form a political opinion (read programmes, watch debates, etc.), and overviews of the positions of party groups in the EP. Only a few cases raised concerns about a response being somewhat partisan. Across the board, the chatbots seemed well-trained in stressing their political neutrality.
Of the three issues – immigration, climate change, and the economy – climate change was the most likely to elicit detailed party descriptions by the chatbots, as well as links to party websites.
On the question of electoral integrity, while most answers were generic, a few were more specific, highlighting electoral risks like disinformation. No chatbot mentioned that the OSCE has identified serious flaws in elections in one member state – Hungary – which are likely to persist in the EP elections as well (grossly partisan state media; a merging of the state and the Fidesz party).
Many responses appeared to use somewhat promotional language from EU websites about safeguards for democracy and elections, even though many critics point out that the EU has significant democracy problems in several member states.
Chatbots were generally supportive of voting but often stressed (rightly) that it is a personal choice. However, they never mentioned that voting is compulsory in Greece, Belgium, Luxembourg, and Bulgaria.

Languages

The accuracy with which the chatbots answered country-specific questions varied across the observed languages. Chatbots that received a red score on the first three questions often did so because they relayed completely incorrect information about how to vote in a specific country context. For example, Copilot and ChatGPT 3.5 wrongly claimed that it was possible to vote by mail in France in the EP elections. ChatGPT 3.5 mistakenly declared that a voter must have turned 18 before February 15th in order to be able to vote in Ireland; in fact, anyone who has turned 18 by election day can vote in Ireland.
The worst performance was observed in response to questions in Portuguese and Turkish, which drew the highest total number of inaccurate and false answers. Responding to questions in Turkish, the bots often mixed information on Turkish elections into their responses, despite the questions explicitly concerning the EP elections, or got confused in other ways. The links provided were often irrelevant.

Comparison of Chatbots

ChatGPT 3.5 often gave no answer or evasive responses. It generally pointed to its training data cut-off in 2022 and did not provide useful links. While the bot was not helpful in this sense, it also poses fewer electoral risks. That said, it sometimes provided detailed responses that included made-up information.
ChatGPT 4.0 responded to many questions in considerable detail, often in a fluent essay style, but it occasionally also provided more generic responses. We could not discern any pattern as to why it chose one or the other style of response. Sometimes the detailed responses included significant mistakes – presented in confident language.
Copilot answered questions in a similar fashion to ChatGPT 4.0, often with long explanations and external links. It did not refuse to respond to any question. However, Copilot provided several incorrect or partially incorrect answers, as well as unhelpful links – especially on electoral process questions.
Gemini had the worst performance in terms of providing accurate and actionable information. It had the highest number of refusals to respond – often without apparent pattern, responding to the same question in one language but not in another. We do not see such non-responses as problematic, however, as they are not harmful to an electoral process and merely force voters to look elsewhere for information. Indeed, Gemini’s non-responses may reflect Google’s decision to suspend the provision of electoral information during election years. However, the extent of this suspension is not clear: even at the time of writing (early April), Gemini still provides information on the electoral process.

Conclusions

Overall, chatbots do not seem fit for the purpose of providing accurate information on electoral processes. Questions on electoral processes challenge chatbots where they are weakest: generating information from a narrow information basis (“low domain”). The only completely reliable information on each member state’s electoral process is found on the webpages of national electoral authorities. Chatbots also frequently fail to relate low-domain information to overarching principles of an election (such as the different voting options in cross-national situations, or voter registration). Here they may be influenced by the principles of US elections (which likely figured prominently in their training material), which do not apply to most elections in EU member states.
The questions we posed were relatively obvious queries voters may have about the EP elections. The significant extent of inaccuracy across chatbots suggests that the companies did not perform a rigorous risk assessment and mitigation (as foreseen in the DSA) before launching them. These problems in relaying electoral information contradict commitments to safe design principles under the Code of Practice on Disinformation.
Our study highlights a potentially larger concern: while we only tested information on the EP elections, we assume that the problem is significant in relation to all elections. Given that the EU now has the most stringent regulatory framework against dis- and misinformation, it can be assumed that the companies tried to tackle these issues with greater rigor in the EU than they do in other countries.
The issue of chatbots providing incomplete or wrong information in low-domain areas is likely to affect many other areas outside the electoral field. The increasing integration of chatbots into search engines is therefore problematic, as it risks generating wrong information across many fields. In contrast to search engines or social media platforms, the providers of chatbots are directly liable for the content produced.

Recommendations

With the 2024 European Parliament elections less than two months away, Microsoft, Google, and OpenAI should review the generation of electoral content by their chatbots. Our study highlights that these chatbots are too inconsistent to be reliable sources of information on where, when, and how to vote.
Our key recommendation is for companies to train their chatbots to refrain from providing any information on the electoral process and instead refer users to sources provided by electoral authorities. If chatbots reliably provided such links and insisted on users checking official sources, rather than trying to recreate electoral information, they would be a more consistently useful and non-harmful tool. Such an approach would likely be in line with the risk mitigation requirements of Article 35 DSA.
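For illustration, the sketch below shows one way such a guardrail could be designed: election-process queries are intercepted and answered only with a link to the competent electoral authority, while all other queries pass to the model. The keyword classifier, the registry, and the example URLs are placeholder assumptions of ours, not any company’s actual implementation.

```python
# Illustrative sketch of the recommended guardrail, not a real product feature.
# The keyword classifier and the registry below are simplified placeholders; a
# production system would need a robust multilingual classifier and a complete,
# human-maintained list of electoral authorities.

ELECTORAL_AUTHORITIES = {
    "DE": "https://www.bundeswahlleiterin.de",  # example entries; verify before use
    "PT": "https://www.cne.pt",
}

def is_election_process_query(prompt: str) -> bool:
    """Crude placeholder classifier for election-process questions."""
    keywords = ("register to vote", "postal ballot", "election date", "how do i vote")
    return any(k in prompt.lower() for k in keywords)

def answer(prompt: str, country_code: str, generate_fn) -> str:
    """Route election-process queries to official links; let the model handle the rest."""
    if is_election_process_query(prompt):
        url = ELECTORAL_AUTHORITIES.get(country_code)
        if url:
            return ("For reliable information on the electoral process, "
                    f"please consult your electoral authority: {url}")
        return "Please consult your national electoral authority for this information."
    return generate_fn(prompt)  # all other queries go to the model as usual
```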
We also recommend that the European Commission request the risk assessments that were carried out by the companies prior to product launch, in line with Article 34, paragraph 3 DSA. We would also welcome it if companies published these risk assessments to allow a better-informed public debate on the transformational technology of generative AI chatbots.
As far as more political information is concerned (party programmes, electoral integrity issues), there is no one-size-fits-all approach. While it is true that in most EU member states elections conform to the obligations for democratic elections in EU and international law, this is not the case everywhere. Hungarian voters may be surprised to read a chatbot claiming that elections will likely be free and fair when the OSCE has already found several potential issues in Hungary’s elections. Such issues are even more marked in elections taking place outside the EU. Chatbots need to understand such country contexts to be a useful source of information.
Positively, chatbots stress that they cannot give voting recommendations. A challenge for chatbots is providing information on parties that may be considered anti-democratic. Here, chatbots should be better trained to avoid vague notions with no relation to human rights law (such as “populism”) and instead produce and reference factual information (highlighting authoritative sources that may point out the non-democratic positions of some parties).
While chatbots did relatively well in summarizing the political positions of party groups, this is a sensitive field. Many of their framings of party positions could be contested (e.g., being pro asylum rights does not equate to favouring open borders in immigration policy). Chatbots should more systematically refer users looking for voting advice to voting advice applications.