Can GPT-5 Pro win a gold medal at the International Linguistics Olympiad (IOL)?
Short answer: No.
Introduction
Given the recent success of GPT-5 on the International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI), I was interested in whether GPT-5 could equally solve questions from the International Linguistics Olympiad (IOL) 2025, which finished just recently, on July 27th, 2025, in Taipei.
This was quite an expensive experiment, as I paid for GPT-5 Pro ($200, aka ~5% of my PhD stipend). I also work with questions in US English, and the low-resource languages in the IOL tasks are written in Latin script.
→ Links to (Team Contest) Problem sets | Official Solutions
TL;DR:
As a baseline sanity check, I tested GPT-5 Pro on IMO 2025 (math questions, if you are confused by the term). It got correct answers on 3 questions, a partially correct answer on 1 question, and wrong answers on 2 questions. This means that GPT-5 Pro is a capable contestant for IMO (see outputs).
With internet access, GPT-5 Pro did really well on IOL 2025. This is because it used additional resources on the language, or even checked its answers against the publicly available official IOL solutions (see outputs 1 and outputs 2).
Without internet access, GPT-5 Pro did really poorly on IOL 2025: it got only 8 out of 98 questions correct (Parts 1 and 2 of IOL 2025). (See outputs and their extracted-answers spreadsheet.)
Overall verdict: GPT-5 Pro has no chance of winning a gold medal at IOL; I don’t think it even had a shot at a bronze. Its reasoning capability does not readily transfer to linguistics tasks, which is consistent with our work on cross-domain generalization (i.e., when finetuned on math reasoning tasks, we expect good transfer to other STEM-related tasks, but poor transfer outside of STEM subjects).
Fun fact: The short form for the International Linguistics Olympiad is IOL instead of ILO.
Quick Intro: IOL tasks
Basically, IOL contestants solve linguistics problems about natural languages, usually involving low-resource ones. Often, the task looks like understanding the language (e.g., translation), but with a twist on what it is really evaluating.
IOL evaluates contestants’ ability to generalize, not their multilingual capability. Here’s why: first, according to the Official FAQ, the languages chosen are little-known. In other words, contestants are expected not to know anything about these languages.
Linguistics is not about learning as many languages as possible: besides, the problems are typically about little-known languages that you are unlikely to be able to read about prior to the contest. This is a contest about thinking like a language scientist, not about being a polyglot. That said, of course knowing more languages can be an advantage since it makes you aware of how they can work. ––– Official FAQ
Second, the tasks can be solved purely with pattern recognition. For instance, in the task shown below, there are three verb phrases involving “comb” and four involving “tell”, so they can be mapped to the three “-khi-” and four “lod-” affixes respectively. The leftover “4. tyoku” must then be about “see” (i.e., Q4 → answer E). In other words, contestants are evaluated on how well they can figure out a new language from scratch.

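The elimination argument above can be sketched in a few lines of Python. Note that the lists below are a toy reconstruction of the example, not the actual contest data:

```python
from collections import Counter

# Toy data: three phrases with "comb", four with "tell", one leftover "see";
# the target language shows the same frequency profile across its affixes.
english_verbs = ["comb"] * 3 + ["tell"] * 4 + ["see"] * 1
target_affixes = ["-khi-"] * 3 + ["lod-"] * 4 + ["tyoku"] * 1

verb_counts = Counter(english_verbs)
affix_counts = Counter(target_affixes)

mapping = {}
for verb, n in verb_counts.items():
    # Exactly one unused affix shares this frequency, so the match is forced.
    candidates = [a for a, m in affix_counts.items()
                  if m == n and a not in mapping.values()]
    if len(candidates) == 1:
        mapping[verb] = candidates[0]

print(mapping)  # {'comb': '-khi-', 'tell': 'lod-', 'see': 'tyoku'}
```

Because each frequency occurs exactly once here, the matching is fully determined; real IOL problems layer many such deductions on top of each other.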
In addition, many IOL questions are about in-context learning (ICL). Here’s another example: several multiplications have been written out in Basque, and contestants are expected to fill in the gap.

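As a rough sketch, the gap-filling setup has the shape of a few-shot prompt: worked examples in the target language, followed by a query with a blank. The equations below are simplified stand-ins, not the actual contest data:

```python
# Illustrative few-shot framing of the gap-filling task: the solver must
# induce the number system from the worked examples to fill in the blank.
few_shot_prompt = "\n".join([
    "bi bider hiru = sei",           # 2 x 3 = 6
    "hiru bider hiru = bederatzi",   # 3 x 3 = 9
    "bi bider bi = ____",            # 2 x 2 = ?
])
print(few_shot_prompt)
```

This is exactly the format that language models are trained to complete, which is why ICL-style IOL questions are a natural probe of their generalization.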
Sanity Check: GPT-5 Pro can solve IMO tasks.
Before running our experiments on IOL, we want to verify that GPT-5 Pro is capable of performing on questions at the international olympiad level. Given that OpenAI stated that the GPT-5 variant that won IMO gold is not the one publicly released on chat.openai.com, we first sanity-check the (publicly available) GPT-5 Pro’s performance on IMO tasks.
→ GPT-5 Pro’s outputs | solutions
Note: the prompt used is minimal, so that it can be readily transferred to IOL tasks. In other words, no scaffolding or hinting is used. Also, we need to explicitly disable web search in the prompt, or the model will default to searching the internet.
Results. GPT-5 Pro got questions 1, 2, and 5 correct, but questions 3 and 6 wrong. Its answer to question 4 is partially correct, as the specified constraints were wrong.
In other words, even though GPT-5 Pro did not get the first 5 questions correct (as the gold-winning model in the news did), it is still a very competitive model for solving IMO tasks.
Now, we can test GPT-5 Pro on IOL tasks.
Experimental Setup. I used the Team Contest version of IOL, with US English prompts. The task is a translation task for the Camling language, an extremely low-resource language spoken in Nepal. I chose only Part 1 and Part 2 for evaluation, as they use a question-matching format instead of free-form generation (so I can directly compare the letter answers).
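For concreteness, scoring this question-matching format reduces to comparing letter answers against a key, question by question. A minimal helper might look like this (the question IDs and letters below are made up for illustration):

```python
# Hypothetical scoring helper for the letter-matching format: count how many
# of the model's extracted letter answers agree with the official key.
def score(model_answers: dict[str, str], answer_key: dict[str, str]) -> tuple[int, int]:
    correct = sum(1 for q, gold in answer_key.items()
                  if model_answers.get(q) == gold)
    return correct, len(answer_key)

answer_key = {"Q1": "E", "Q2": "B", "Q3": "A"}      # illustrative key
model_answers = {"Q1": "E", "Q2": "C", "Q3": "A"}   # illustrative outputs
print(score(model_answers, answer_key))  # (2, 3)
```

This is why restricting to Parts 1 and 2 matters: free-form translation answers would require graded (human) evaluation instead of an exact letter match.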
→ Links to (Team Contest) Problem sets | Official Solutions
Note that I had to use screenshots, as copy-pasting breaks the formatting. But given the IMO sanity-check performance, where I also provided questions via screenshots, this is a viable input format. The evaluation is also time-consuming because I used the chat.openai.com interface.
Effects of Prompts. I noticed that without specifying “Do not use web search”, GPT-5 Pro defaults to searching the internet. This has a HUGE impact on model performance, which I discuss below.
Results: GPT-5 Pro performed really badly on IOL.
Without web access, like actual IOL contestants, GPT-5 Pro performed really poorly on the IOL tasks. It got only 5 questions right in Part 1 and 3 questions right in Part 2.
Web access contaminates performance, as information (and even the answers) is accessible on the internet. In Part 1, judging from its CoT summary, GPT-5 gained access to extra examples and linguistic rule tables that helped it figure out the translations:
I’m pulling up the "turn10search1" PDF to see specific terms like "go imperative 'rya' ' laugh ' " and "pusa pusaci pusani 'go!" as it seems relevant to the context.”
I’m carefully pinpointing person number affixes in Table 2.7. For 2SG past forms, -a is used, while 2PL nonpast forms employ -ni. Progressing through examples, I'm aligning these affixes with the roots to ensure accurate morphological interpretation.

In Part 2, it even wrote “Verified against the official IOL 2025 Team Contest solutions” (shown in outputs 2).
Even with web access, low-resource status hurts performance. Among the wrong answers (the blue bar), most errors are concentrated in the questions on the South-East dialect of Camling (which is even less spoken than general Camling). Specifically, GPT-5 Pro with web access got 6 out of 10 questions in this subset wrong.
→ GPT-5 Pro (web access) outputs and outputs 2 (outputs 2 verifies that web access leads to contamination with the official solutions).
→ GPT-5 Pro (no web access) outputs.
→ Spreadsheet of extracted answers.
Discussion and Conclusion
Cross-domain generalization of reasoning capability. Our results are clear: despite GPT-5 Pro being highly capable at solving IMO tasks, it does not generalize well to IOL tasks, which primarily evaluate generalization in a data-constrained setting.
This is consistent with prior findings on cross-domain transfer; for instance, in our work, we see zero-shot transfer from math-reasoning finetuning to other STEM tasks, but not to non-STEM tasks. Therefore, it is not surprising that GPT-5 Pro performs well on both IMO and IOI tasks. I would even predict that GPT-5 Pro would do well on the International Physics Olympiad (IPhO) in a zero-shot fashion.
Web access contaminates results. Consistent with very recent work on benchmark contamination, our results clearly show that web access pretty much solves IOL tasks. In fact, I would go further: introducing web access into IOL leads to a totally different evaluation of capability.
without web access: the IOL 2025 tasks evaluate GPT-5’s ability to do highly sophisticated pattern recognition and reasoning in a data-constrained setting.
with web access: they evaluate how well GPT-5 can search for relevant information to inflate its in-context learning.
All in all, I am excited for the upcoming news when some GPT-N solves IOL in a zero-shot fashion. That will be a milestone for reasoning generalization.