SCOTUS FOCUS
on Mar 21, 2025 at 3:15 pm

In 2023, ChatGPT mistakenly claimed that Ginsburg dissented in Obergefell — now it’s corrected that mistake.
Just over two years ago, following the launch of ChatGPT, SCOTUSblog decided to test how accurate the much-hyped AI really was — at least when it came to Supreme Court-related questions. The conclusion? Its performance was “uninspiring”: precise, accurate, and at times surprisingly human-like text appeared alongside errors and outright fabricated facts. Of the 50 questions posed, the AI answered only 21 correctly.
Now, more than two years later, as ever more advanced models continue to emerge, I’ve revisited the issue to see if anything has changed.
Successes secured, lessons learned
ChatGPT has not lost its knowledge. It still got right that the Supreme Court originally had only six seats (Question #32) and explained accurately what a “relisted” petition is (Question #43). Many of its responses have become more nuanced, incorporating crucial details that were missing in 2023. For example, when asked about the counter-majoritarian difficulty, the AI easily identified Professor Alexander Bickel as the scholar who coined the term (Question #33). Similarly, when explaining non-justiciability (Question #31), the concept that there are some cases that courts cannot hear, it now includes mootness and the prohibition on advisory opinions among its examples.
The bot has also done its error analysis homework. It now correctly acknowledges that President Donald Trump appointed three, not two, justices during his first term (Question #36) and that Justice Joseph Story, not Justice Brett Kavanaugh, was the youngest appointed justice in history (Question #44). It has refined its understanding of Youngstown Sheet & Tube Co. v. Sawyer (Question #39), recognizing that Justice Robert Jackson “laid out a now-classic three-category framework for evaluating presidential power” in his concurring opinion rather than authoring the majority opinion — an error ChatGPT made in 2023. Similarly, it now properly attributes the famous lines “We are not final because we are infallible, but we are infallible only because we are final” to Jackson in Brown v. Allen (Question #50), rather than mistakenly crediting Winston Churchill.
The bot has also improved its factual accuracy in several areas: It now correctly identifies the responsibilities of the junior justice (Question #45) and the average number of oral arguments per term (Question #6), and, in discussing cases dismissed as improvidently granted (DIGs), it includes a previously missing key consideration: that “Justices may prefer to wait for a better case to decide the issue” (Question #48).
Not only have these mistakes been left behind, but the quality of ChatGPT’s output has also increased significantly. On the question about the Supreme Court’s original and appellate jurisdiction (Question #5), the AI no longer confuses the two as it once did. Beyond that, it now accurately identifies all categories of original jurisdiction cases and even provides examples, including the relatively obscure 1892 decision in United States v. Texas.
Attempts at gaslighting the AI were unsuccessful. Last time, ChatGPT mistakenly claimed that Justice Ruth Bader Ginsburg dissented in Obergefell v. Hodges (Question #11) and that there existed a Justice James F. West who was ostensibly impeached in 1933 (Question #49). This time, nothing of the sort happened. When I tried to sow a seed of doubt, the AI confidently pushed back, asserting that I was wrong.
The chatterbox, the bustler, and the old sage
And yet, mistakes remain — and their frequency varies by model. For this analysis, I tested three fairly recent models: 4o, o3-mini, and o1. It makes sense to briefly discuss each model individually and, in the process, highlight the mistakes they made.
4o is a real chatterbox. It often goes beyond the scope of the inquiry. For instance, when prompted to name key Supreme Court reform proposals (Question #30), it not only listed them but also analyzed their pros and cons. When all you want is a short answer — such as “how many Supreme Court justices have been impeached?” — 4o will not simply say “one,” mention Justice Samuel Chase, and stop. Instead, it launches into a detailed narrative, complete with headings such as “Why Was He Impeached?”, “What Was the Outcome?”, and “Significance of Chase’s Acquittal” (Question #49). When all you want to know is where the Supreme Court has historically been housed (Question #29), 4o will not miss the chance to mention that the current court building is notable for its “[i]conic marble columns and sculptures.”
In addition to its undeniable enthusiasm for headings and bullet points, 4o — unlike o3-mini and o1 — has a particular fondness for citing legal provisions. When confronted with a straightforward question about the start of the Supreme Court term (Question #2), it included a reference to 28 U.S.C. § 2, the federal law that directs the court to begin its term every year on the first Monday in October. And 4o is always eager to assist: if you ask about Brown I (Question #20), in which the court ruled that racial segregation in public schools violated the Constitution, even if the facilities were “separate but equal,” rest assured it will follow up with “Would you like to hear about Brown II (1955), which addressed how to implement desegregation?”
But as is well known, the more details one includes, the greater the chances of making a mistake. Like the 2023 version of ChatGPT, 4o incorrectly states that Belva Ann Lockwood first argued before the Supreme Court in 1879 — one year off from the actual date (1880). Ironically, the question (Question #28) only asked for the lawyer’s name, but in its effort to provide extra information, 4o made itself more susceptible to error.
Sometimes, 4o’s tendency to go beyond the question really works against it. For instance, the AI wrote an elaborate legal essay on the meaning of “relisting” (Question #43) petitions for consideration at subsequent conferences, but then, for whatever reason, hastily claimed that Janus v. American Federation of State, County, and Municipal Employees was “relisted … multiple times before granting certiorari” — which, in reality, never happened.
But that was just the beginning. In response to a query about why cameras are not allowed in the courtroom (Question #15), the model attempted to strengthen its reasoning by quoting Supreme Court justices. It correctly cited Justice David Souter, who famously declared, “The day you see a camera come into our courtroom, it’s going to roll over my dead body.” However, it fabricated a quote from Justice Anthony Kennedy, seeming to meld his ideas on cameras with a quote from Justice Antonin Scalia. 4o went on to claim that Chief Justice John Roberts said in 2006, “We’re not there to provide entertainment. We’re there to decide cases.” It is a bold-sounding statement — but one Roberts has never actually made. Meanwhile, o1 and o3-mini avoided these discrepancies by simply sticking to the question and leaving out unnecessary details.
OpenAI’s o3-mini is a born bustler. It deliberates at rocket speed, but its responses are often incomplete or outright incorrect. Unlike 4o and o1, which provided specific examples of non-justiciability (Question #31), o3-mini stuck to vague generalizations. The same occurred when it was prompted about the junior justice’s responsibilities (Question #45).
o3-mini was also the only model to get the timeline of the Supreme Court’s locations completely wrong (Question #29) and to cite the wrong constitutional provision — referencing Article III instead of Article VI as the basis for the constitutional oath (Question #34). On a lighter note, o3-mini was the only model to hilariously misinterpret the term “CVSG” (Question #18) — the call for the federal government’s views in a case in which it is not involved — as “Consolidated (or Current) Vote Summary Grid” and the term DIG (Question #48) as “informal legal slang indicating that the Court has taken a keen interest in a case and is actively ‘digging into’ its merits.”
o1, evidently the smartest model currently available (and one that even “Plus” subscribers can only query 50 times per week), seems to strike the perfect balance between o3-mini and 4o — combining the speed and conciseness of the former with the attention to detail of the latter.
When presented with a question about three noteworthy opinions by Ginsburg (Question #11), o3-mini jumped straight into her dissents in Ledbetter and Shelby County without even explaining the nature of the disputes. o1, however, first provided context by summarizing the issues at stake and the majority’s holding. It also noted that Ginsburg’s dissent in Ledbetter later inspired the Lilly Ledbetter Fair Pay Act of 2009 and was the only model to introduce the crucial term “coverage formula” when discussing Shelby County. 4o fumbled the details, misrepresenting Ledbetter and Friends of the Earth v. Laidlaw Environmental Services. A similar pattern emerged in the question concerning commerce clause jurisprudence (Question #24) — here, o1 was the only model to mention National Federation of Independent Business v. Sebelius, in which the court ruled that the Affordable Care Act’s individual mandate was not a valid exercise of Congress’s power under the commerce clause but nonetheless upheld the mandate as a tax.
And yet, it’s all relative
Sometimes, however, 4o’s graphomania works to its advantage. At times, it simply supplies more useful information. When asked about Brown v. Board of Education (Question #20), Obergefell (Question #21), or Justice Robert Jackson’s jurisprudence (Question #39), for instance, 4o correctly quoted from the relevant decisions — something that would have seemed like an unimaginable luxury not long ago. It also provided the most complete and clear explanation of a per curiam (that is, “by the court”) opinion (Question #8), whereas o1 and o3-mini still retained some of the flaws present in the 2023 response. When asked about the assignment of opinions (Question #16), 4o was the only model to mention how assignments work for dissenting opinions.
At other times, 4o presents information in a more convenient format. When tasked with writing an essay on the most powerful chief justice (Question #37), 4o produced an extensive defense of Chief Justice John Marshall, even generating a comparative table highlighting the achievements of other chief justices while arguing why Marshall still stands out. In mere seconds, it sketched tables comparing the Warren and Burger Courts (Question #12) and analyzing Kennedy’s impact as a swing vote (Question #36).
And in some cases, 4o significantly outperformed o3-mini and even o1. On the ethics rules question (Question #14), o3-mini merely said, “There have been discussions and proposals over the years … but as of now, the justices govern themselves through these informal, self-imposed standards.” o1 incorrectly claimed, “Unlike lower federal courts, the Supreme Court has not adopted its own formal ethics code.” 4o was the only model to recognize that the Supreme Court has recently adopted its own ethics code.
This suggests that 4o keeps up with current developments quite well. Indeed, when discussing Second Amendment jurisprudence (Question #25) it included and accurately described New York State Rifle & Pistol Association v. Bruen — a 2022 case missing from the 2023 response. Similarly, when talking about Trump’s Supreme Court nominations during his first term (Question #35), 4o went further, considering the potential retirements of Justices Samuel Alito and Clarence Thomas during Trump’s second term.
AI v. AI-powered search engines?
Today, the distinction between search engines and AI is fading. Every Google search now triggers an AI-powered process alongside traditional search algorithms, and in many cases both arrive at the correct answer.
ChatGPT and AI as a whole have undoubtedly evolved significantly since 2023. Of course, AI cannot — at least for now — replace independent research or journalism and still requires careful verification, but its performance is undeniably improving.
While the 2023 version of ChatGPT answered only 21 out of 50 questions correctly (42%), its three 2025 successors performed significantly better: 4o achieved 29 correct answers (58%), o3-mini managed 36 (72%), and o1 delivered an impressive 45 (90%).
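(The percentages are simple arithmetic. For readers who want to double-check them, here is a minimal Python sketch that recomputes the rates from the tallies reported in this article; the snippet itself is purely illustrative.)

# Illustrative only: recomputing the accuracy rates reported above.
scores = {"ChatGPT (2023)": 21, "4o": 29, "o3-mini": 36, "o1": 45}
TOTAL = 50  # number of questions posed to each model
for model, correct in scores.items():
    print(f"{model}: {correct}/{TOTAL} = {correct / TOTAL:.0%}")
# Prints 42%, 58%, 72%, and 90%, matching the figures above.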
You can read all the questions and ChatGPT’s responses, along with my annotations, here.
Bonus
I also posed five new questions to ChatGPT. Two of them concerned older cases, and the AI handled them quite well. When asked about the “formula rate” and which Supreme Court decision adopted it (Question #53), ChatGPT correctly identified Till v. SCS Credit Corp. and explained the nature of the formula. When asked what the Marks rule is (Question #54), it provided a direct quote, illustrated the rule with examples, and even offered some criticisms.
As for newer cases, the AI provided a decent summary of last term’s ruling in Harrington v. Purdue Pharma. However, when it came to Andy Warhol Foundation for the Visual Arts v. Goldsmith, it got the basics right but missed key aspects of the holding.
The final question I posed (Question #55) was: “In light of everything we have discussed in this chat, what do you think is hidden in the phrase ‘Strange capybara obtains tempting ultra swag’?” And, guess what, the AI got me: “… SCOTUS (an abbreviation for the Supreme Court of the United States) appears within the phrase, suggesting this might be a hidden reference to Supreme Court cases or justices.”
Evidently, ChatGPT not only keeps up with the law — it also has a good sense of (legal) humor.