Gerasimos (Makis) Lampouras
Most people call me Makis (he/him).
Hello there!
I am a Principal Research Scientist and currently serve as the Team Leader of the Speech and Language group at Huawei Noah’s Ark Lab, London. If you are interested in joining us, you can find information about open positions here. Also, keep in mind that applications for Research Internships are open all year round; you can apply here.
My areas of interest generally fall somewhere between Natural Language Processing and Machine Learning, with my latest efforts focused on improving the auto-regressive reasoning capabilities of Large Language Models. Recently, this journey has led me to try to teach LLMs to generate and understand programming languages; previously, my work focused more on natural language generation, dialogue systems, and machine learning applications for NLP.
Previously, I was a research associate at the University of Cambridge, the University of Sheffield, and University College London. I completed my undergraduate studies and received my MSc and PhD at the Athens University of Economics and Business.
In my spare time, I survive my cat and struggle to keep up with too many interests, but I do enjoy reading fiction, playing games (board, video, and dungeon mastering TTRPGs), and (unfortunately rarely) painting.
selected publications
- [ICLR] A Benchmark for Deep Information Synthesis. Debjit Paul, Daniel Murphy, Milan Gritta, and 14 more authors. In Proceedings of the International Conference on Learning Representations (ICLR), 2026
Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis. However, current evaluation benchmarks do not adequately assess their ability to solve real-world tasks that require synthesizing information from multiple sources and inferring insights beyond simple fact retrieval. To address this, we introduce DeepSynth, a novel benchmark designed to evaluate agents on realistic, time-consuming problems that combine information gathering, synthesis, and structured reasoning to produce insights. DeepSynth contains 120 tasks collected across 7 domains and data sources covering 42 countries. DeepSynth is constructed using a multi-stage data collection pipeline that requires annotators to collect official data sources, create hypotheses, perform manual analysis and design tasks with verifiable answers. When evaluated on DeepSynth, 9 state-of-the-art LLMs and deep research agents achieve a maximum F1 score of 8.97. Our analysis reveals that current agents struggle with hallucinations and reasoning over large information spaces, highlighting DeepSynth as a crucial benchmark for guiding future research.
@article{dpaul2026deepsynth, title = {A Benchmark for Deep Information Synthesis}, author = {Paul, Debjit and Murphy, Daniel and Gritta, Milan and Cardenas, Ronald and Prokhorov, Victor and Bolliger, Lena Sophia and Toker, Aysim and Miles, Roy and Oncescu, Andreea-Maria and Sivakumar, Jasivan Alex and Borchert, Philipp and Elezi, Ismail and Zhang, Meiru and Lee, Ka Yiu and Zhang, Guchun and Wang, Jun and Lampouras, Gerasimos}, journal = {In Proceedings of the International Conference on Learning Representations (ICLR)}, year = {2026}, }
- [ICLR] DRIFT: Decompose, Retrieve, Illustrate, then Formalize Theorems. Meiru Zhang, Philipp Borchert, Milan Gritta, and 1 more author. In Proceedings of the International Conference on Learning Representations (ICLR), 2026
Automating the formalization of mathematical statements for theorem proving remains a major challenge for Large Language Models (LLMs). LLMs struggle to identify and utilize the prerequisite mathematical knowledge and its corresponding formal representation in languages like Lean. Current retrieval-augmented autoformalization methods query external libraries using the informal statement directly, but overlook a fundamental limitation: informal mathematical statements are often complex and offer limited context on the underlying math concepts. To address this, we introduce DRIFT, a novel framework that enables LLMs to decompose informal mathematical statements into smaller, more tractable “sub-components”. This facilitates targeted retrieval of premises from mathematical libraries such as Mathlib. Additionally, DRIFT retrieves illustrative theorems to help models use premises more effectively in formalization tasks. We evaluate DRIFT across diverse benchmarks (ProofNet, ConNF, and MiniF2F-test) and find that it consistently improves premise retrieval, nearly doubling the F1 score compared to the DPR baseline on ProofNet. Notably, DRIFT demonstrates strong performance on the out-of-distribution ConNF benchmark, with BEq+@10 improvements of 37.14% and 42.25% using GPT-4.1 and DeepSeek-V3.1, respectively. Our analysis shows that retrieval effectiveness in mathematical autoformalization depends heavily on model-specific knowledge boundaries, highlighting the need for adaptive retrieval strategies aligned with each model’s capabilities.
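To make the decompose-then-retrieve idea concrete, here is a minimal Python sketch of such a pipeline; `call_llm` and `search_mathlib` are hypothetical placeholders, and the prompts and wiring are illustrative only, not DRIFT’s actual implementation:

```python
# Minimal sketch of a decompose -> retrieve -> formalize pipeline in the spirit of DRIFT.
# `call_llm` and `search_mathlib` are hypothetical stand-ins for an LLM client and a
# premise retriever over Mathlib; prompts and wiring are illustrative only.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def search_mathlib(query: str, k: int = 5) -> list[str]:
    raise NotImplementedError("plug in your premise retriever here")

def autoformalize(informal_statement: str) -> str:
    # 1) Decompose the informal statement into smaller sub-components.
    sub_components = call_llm(
        "List the distinct mathematical concepts/claims in this statement, "
        "one per line:\n" + informal_statement
    ).splitlines()

    # 2) Retrieve premises for each sub-component instead of the full statement.
    premises = []
    for sub in sub_components:
        premises.extend(search_mathlib(sub, k=3))

    # 3) Retrieve illustrative theorems showing how similar premises are used.
    examples = search_mathlib("example theorems using: " + "; ".join(premises[:5]), k=2)

    # 4) Formalize, conditioning on the retrieved premises and examples.
    return call_llm(
        "Formalize the statement in Lean 4.\n"
        f"Premises:\n{chr(10).join(premises)}\n"
        f"Examples:\n{chr(10).join(examples)}\n"
        f"Statement: {informal_statement}"
    )
```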
@article{zhang2025driftdecomposeretrieveillustrate, title = {DRIFT: Decompose, Retrieve, Illustrate, then Formalize Theorems}, author = {Zhang, Meiru and Borchert, Philipp and Gritta, Milan and Lampouras, Gerasimos}, journal = {In Proceedings of the International Conference on Learning Representations (ICLR)}, year = {2026}, }
- [EACL] Process Evaluation for Agentic Systems. Milan Gritta, Debjit Paul, Xiaoguang Li, and 3 more authors. In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL), 2026
The significance of tasks entrusted to LLM-based assistants (agents) and the associated societal risks are increasing each year. Agents are being explored in critical domains such as medicine, finance, law, infrastructure, and other sensitive applications that require system transparency and high user trust. The quality of these agents is typically evaluated by accuracy, sometimes extended to partial correctness. In this position paper, we argue that this focus on outcomes is insufficient as it can obscure risky agent behaviours such as skipping critical steps, hallucinating tool use, relying on outdated parametric knowledge and other means of bypassing recommended processes. Our core position is that a holistic agent evaluation must include process evaluation, especially for critical applications. We conduct a small-scale study to assess the feasibility of automatic process evaluation, present a compliance score, analyse use cases of bad and good behaviours, and offer recommendations for more holistic evaluation.
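As a toy illustration of process (rather than outcome) evaluation, the sketch below scores a trace by how many required steps appear in the prescribed order; this is a deliberate simplification, not the compliance score defined in the paper:

```python
# Toy process-compliance score: fraction of required steps that appear in the
# agent's trace in the prescribed order. A simplification for illustration,
# not the metric proposed in the paper.

def compliance_score(required_steps: list[str], trace: list[str]) -> float:
    matched = 0
    pos = 0  # search position in the trace, to enforce ordering
    for step in required_steps:
        for i in range(pos, len(trace)):
            if step in trace[i]:
                matched += 1
                pos = i + 1
                break
    return matched / len(required_steps) if required_steps else 1.0

# Example: the agent skipped the "verify_sources" step.
required = ["retrieve_policy", "verify_sources", "draft_answer", "cite_evidence"]
trace = ["retrieve_policy(query='visa rules')", "draft_answer(...)", "cite_evidence(doc=3)"]
print(compliance_score(required, trace))  # 0.75
```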
@article{gritta2026processeval, title = {Process Evaluation for Agentic Systems}, author = {Gritta, Milan and Paul, Debjit and Li, Xiaoguang and Shang, Lifeng and Wang, Jun and Lampouras, Gerasimos}, journal = {In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL)}, year = {2026}, }
- [arXiv] Conjecturing: An Overlooked Step in Formal Mathematical Reasoning. Jasivan Alex Sivakumar, Philipp Borchert, Ronald Cardenas, and 1 more author. arXiv pre-print, 2025
Autoformalisation, the task of expressing informal mathematical statements in formal language, is often viewed as a direct translation process. This, however, disregards a critical preceding step: conjecturing. Many mathematical problems cannot be formalised directly without first conjecturing a conclusion such as an explicit answer, or a specific bound. Since Large Language Models (LLMs) already struggle with autoformalisation, and the evaluation of their conjecturing ability is limited and often entangled within autoformalisation or proof, it is particularly challenging to understand its effect. To address this gap, we augment existing datasets to create ConjectureBench, and redesign the evaluation framework and metric specifically to measure the conjecturing capabilities of LLMs both as a distinct task and within the autoformalisation pipeline. Our evaluation of foundational models, including GPT-4.1 and DeepSeek-V3.1, reveals that their autoformalisation performance is substantially overestimated when the conjecture is accounted for during evaluation. However, the conjecture should not be assumed to be provided. We design an inference-time method, Lean-FIRe, to improve conjecturing and autoformalisation, which, to the best of our knowledge, achieves the first successful end-to-end autoformalisation of 13 PutnamBench problems with GPT-4.1 and 7 with DeepSeek-V3.1. We demonstrate that while LLMs possess the requisite knowledge to generate accurate conjectures, improving autoformalisation performance requires treating conjecturing as an independent task, and investigating further how to correctly integrate it within autoformalisation. Finally, we provide forward-looking guidance to steer future research toward improving conjecturing, an overlooked step of formal mathematical reasoning.
@article{sivakumar2025conjecturingoverlookedstepformal, title = {Conjecturing: An Overlooked Step in Formal Mathematical Reasoning}, author = {Sivakumar, Jasivan Alex and Borchert, Philipp and Cardenas, Ronald and Lampouras, Gerasimos}, journal = {arXiv pre-print}, year = {2025}, }
- [arXiv] TopoAlign: A Framework for Aligning Code to Math via Topological Decomposition. Yupei Li, Philipp Borchert, and Gerasimos Lampouras. arXiv pre-print, 2025
Large Language Models (LLMs) excel at both informal and formal (e.g. Lean 4) mathematical reasoning but still struggle with autoformalisation, the task of transforming informal into formal mathematical statements. Autoformalisation helps pair the informal reasoning of LLMs with formal proof assistants which enable machine-verifiable generation and mitigate hallucinations. Yet, the performance of current Math LLMs is constrained by the scarcity of large-scale corpora, particularly those containing pairs of informal and formal statements. Although current models are trained to generate code from natural language instructions, structural and syntactic differences between these and formal mathematics limit effective transfer learning. We propose TopoAlign, a framework that unlocks widely available code repositories as training resources for Math LLMs. TopoAlign decomposes code into docstrings, main functions, and dependency functions, and reassembles these components into analogues that structurally mirror formal statements. This produces structurally aligned code data that can be used for training Math LLMs without requiring additional human annotation. We train two state-of-the-art models, DeepSeek-Math and Herald, and evaluate them on the minif2f, Putnam, and ProofNet benchmarks. TopoAlign provides substantial gains for DeepSeek-Math, improving performance by 17.77% on BEq@10 and 68.82% on typecheck@10. Despite introducing no new mathematical knowledge, our framework achieves gains of 0.12% and 1.09% for Herald on BEq@10 and typecheck@10, respectively, demonstrating that training on aligned code data is beneficial even for specialized models.
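The structural split can be illustrated with Python’s `ast` module: extract a main function, its docstring, and its dependency functions, then reassemble them into an (informal, formal) analogue. The pairing format below is an illustrative guess, not TopoAlign’s exact template:

```python
# Minimal sketch of decomposing a code snippet into docstring / main function /
# dependency functions, the kind of structural split TopoAlign builds on.
# The reassembled pairing format is illustrative, not the paper's exact template.
import ast

source = '''
def gcd(a, b):
    """Return the greatest common divisor of a and b."""
    while b:
        a, b = b, a % b
    return a

def lcm(a, b):
    """Return the least common multiple of a and b."""
    return a * b // gcd(a, b)
'''

tree = ast.parse(source)
functions = [n for n in tree.body if isinstance(n, ast.FunctionDef)]

main_fn = functions[-1]                      # treat the last function as the "main" one
deps = [f for f in functions if f is not main_fn]

docstring = ast.get_docstring(main_fn) or ""
main_code = ast.unparse(main_fn)
dep_code = "\n\n".join(ast.unparse(f) for f in deps)

# Reassemble into an analogue of an (informal statement, formal statement) pair:
# the docstring plays the role of the informal statement, the code the formal one.
aligned_example = {
    "informal": docstring,
    "formal": dep_code + "\n\n" + main_code,
}
print(aligned_example["informal"])
```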
@article{li2025topoalignframeworkaligningcode, title = {TopoAlign: A Framework for Aligning Code to Math via Topological Decomposition}, author = {Li, Yupei and Borchert, Philipp and Lampouras, Gerasimos}, journal = {arXiv pre-print}, year = {2025}, }
- [ACL] DReSD: Dense Retrieval for Speculative Decoding. Milan Gritta, Huiyin Xue, and Gerasimos Lampouras. In Findings of the Association for Computational Linguistics (ACL), 2025
Speculative decoding (SD) accelerates Large Language Model (LLM) generation by using an efficient draft model to propose the next few tokens, which are verified by the LLM in a single forward call, reducing latency while preserving its outputs. We focus on retrieval-based SD where the draft model retrieves the next tokens from a non-parametric datastore. Sparse retrieval (REST), which operates on the surface form of strings, is currently the dominant paradigm due to its simplicity and scalability. However, its effectiveness is limited due to the usage of short contexts and exact string matching. Instead, we introduce Dense Retrieval for Speculative Decoding (DReSD), a novel framework that uses approximate nearest neighbour search with contextualised token embeddings to retrieve the most semantically relevant token sequences for SD. Extensive experiments show that DReSD achieves (on average) 87% higher acceptance rates, 65% longer accepted tokens and 19% faster generation speeds compared to sparse retrieval (REST).
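A toy sketch of the dense-retrieval drafting step, using exact cosine search over a random datastore in place of approximate nearest-neighbour search and real contextualised embeddings:

```python
# Toy sketch of dense-retrieval drafting for speculative decoding: look up the
# nearest stored context embedding and propose its continuation as draft tokens.
# The embeddings, datastore, and verification step are all placeholders here;
# the real system uses ANN search and an actual LLM to verify the draft.
import numpy as np

rng = np.random.default_rng(0)

# Datastore: contextualised embeddings paired with the tokens that followed them.
datastore_keys = rng.normal(size=(1000, 64)).astype(np.float32)
datastore_values = [list(rng.integers(0, 32000, size=4)) for _ in range(1000)]

def propose_draft(context_embedding: np.ndarray, k_tokens: int = 4) -> list:
    # Cosine similarity against all keys (exact search here; ANN in practice).
    keys = datastore_keys / np.linalg.norm(datastore_keys, axis=1, keepdims=True)
    query = context_embedding / np.linalg.norm(context_embedding)
    best = int(np.argmax(keys @ query))
    return datastore_values[best][:k_tokens]

draft = propose_draft(rng.normal(size=64).astype(np.float32))
print(draft)  # draft tokens to be verified by the target LLM in one forward pass
```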
@article{gritta-etal-2025-dresd, title = {DReSD: Dense Retrieval for Speculative Decoding}, author = {Gritta, Milan and Xue, Huiyin and Lampouras, Gerasimos}, journal = {In Findings of the Association for Computational Linguistics (ACL)}, year = {2025}, }
- [EMNLP] SparsePO: Controlling Preference Alignment of LLMs via Sparse Token Masks. Fenia Christopoulou, Ronald Cardenas, Gerasimos Lampouras, and 2 more authors. In Findings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025
Preference Optimization (PO) has proven an effective step for aligning language models to human-desired behaviors. Current variants, following the offline Direct Preference Optimization objective, have focused on a strict setting where all tokens are contributing signals of KL divergence and rewards to the loss function. However, human preference is not affected by each word in a sequence equally but is often dependent on specific words or phrases, e.g. existence of toxic terms leads to non-preferred responses. Based on this observation, we argue that not all tokens should be weighted equally during PO and propose a flexible objective termed SparsePO, that aims to automatically learn to weight the KL divergence and reward corresponding to each token during PO training. We propose two different variants of weight-masks that can either be derived from the reference model itself or learned on the fly. Notably, our method induces sparsity in the learned masks, allowing the model to learn how to best weight reward and KL divergence contributions at the token level, learning an optimal level of mask sparsity. Extensive experiments on multiple domains, including sentiment control, dialogue, text summarization and text-to-code generation, illustrate that our approach assigns meaningful weights to tokens according to the target task, generates more responses with the desired preference and improves reasoning tasks by up to 2 percentage points compared to other token- and response-level PO methods.
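To illustrate token-level weighting in a DPO-style objective, here is a minimal PyTorch sketch in which per-token log-ratios are scaled by masks before aggregation; the shapes and mask source are placeholders, and this is not the exact SparsePO objective:

```python
# Toy token-weighted DPO-style objective: per-token log-ratios are scaled by a
# (possibly sparse) mask before being summed, instead of weighting every token
# equally. Shapes and the mask source are illustrative only.
import torch
import torch.nn.functional as F

def masked_dpo_loss(logp_chosen, ref_logp_chosen, logp_rejected, ref_logp_rejected,
                    mask_chosen, mask_rejected, beta=0.1):
    # Each input: (batch, seq_len) per-token log-probabilities; masks in [0, 1].
    chosen_margin = ((logp_chosen - ref_logp_chosen) * mask_chosen).sum(-1)
    rejected_margin = ((logp_rejected - ref_logp_rejected) * mask_rejected).sum(-1)
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Example with random tensors; in practice the masks would come from the reference
# model's activations or a small learned module, with most entries near zero.
b, t = 2, 8
loss = masked_dpo_loss(*(torch.randn(b, t) for _ in range(4)),
                       mask_chosen=torch.rand(b, t), mask_rejected=torch.rand(b, t))
print(loss)
```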
@article{christopoulou2024sparsepocontrollingpreferencealignment, title = {SparsePO: Controlling Preference Alignment of LLMs via Sparse Token Masks}, author = {Christopoulou, Fenia and Cardenas, Ronald and Lampouras, Gerasimos and Bou-Ammar, Haitham and Wang, Jun}, journal = {In Findings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)}, year = {2025}, }
- [ICLR] Human-like Episodic Memory for Infinite Context LLMs. Zafeirios Fountas, Martin A Benfeghoul, Adnan Oomerjee, and 4 more authors. In Proceedings of the International Conference on Learning Representations (ICLR), 2024
Large language models (LLMs) have shown remarkable capabilities, but still struggle with processing extensive contexts, limiting their ability to maintain coherence and accuracy over long sequences. In contrast, the human brain excels at organising and retrieving episodic experiences across vast temporal scales, spanning a lifetime. In this work, we introduce EM-LLM, a novel approach that integrates key aspects of human episodic memory and event cognition into LLMs with no fine-tuning, enabling them to handle practically infinite context lengths while maintaining computational efficiency. EM-LLM organises sequences of tokens into coherent episodic events using a combination of Bayesian surprise and graph-theoretic boundary refinement in an online fashion. When needed, these events are retrieved through a two-stage memory process, combining similarity-based and temporally contiguous retrieval for efficient and human-like access to relevant information. Experiments on the LongBench and InfiniteBench benchmarks demonstrate EM-LLM’s superior performance, consistently outperforming the state-of-the-art retrieval model InfLLM across various baseline LLMs. In addition, EM-LLM outperforms its popular counterpart, RAG, in a wide range of tasks, while requiring similar resources. Notably, EM-LLM’s performance even surpasses full-context models in most tasks, while successfully performing retrieval across 10 million tokens - a scale computationally infeasible for such models. Finally, our analysis reveals strong correlations between EM-LLM’s event segmentation and human-perceived events, suggesting a bridge between this artificial system and its biological counterpart, thereby offering a novel computational framework for exploring human memory mechanisms.
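A toy sketch of surprise-based segmentation: mark an event boundary where a token’s surprise exceeds a running threshold. The threshold rule and constants are illustrative, and the graph-theoretic boundary refinement and two-stage retrieval are omitted:

```python
# Toy surprise-based event segmentation: place a boundary where a token's
# surprise (negative log-probability under the model) spikes above a running
# threshold. Constants are illustrative; EM-LLM also refines boundaries with a
# graph-theoretic step and retrieves events with a two-stage memory process.
import numpy as np

def segment_by_surprise(token_logprobs, window=32, gamma=3.0):
    surprise = -np.asarray(token_logprobs)
    boundaries = [0]
    for i in range(window, len(surprise)):
        recent = surprise[i - window:i]
        threshold = recent.mean() + gamma * recent.std()
        if surprise[i] > threshold:
            boundaries.append(i)  # start of a new episodic event
    return boundaries

rng = np.random.default_rng(0)
fake_logprobs = rng.normal(-2.0, 0.5, size=200)
fake_logprobs[[50, 120]] -= 4.0  # two artificial "surprising" tokens
print(segment_by_surprise(fake_logprobs))  # roughly [0, 50, 120]
```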
@article{fountas2024humanlikeepisodicmemoryinfinite, title = {Human-like Episodic Memory for Infinite Context LLMs}, author = {Fountas, Zafeirios and Benfeghoul, Martin A and Oomerjee, Adnan and Christopoulou, Fenia and Lampouras, Gerasimos and Bou-Ammar, Haitham and Wang, Jun}, journal = {In Proceedings of the International Conference on Learning Representations (ICLR)}, year = {2024}, }
- [EACL] Text-to-Code Generation with Modality-relative Pre-training. Fenia Christopoulou, Guchun Zhang, and Gerasimos Lampouras. In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL), 2024
Large pre-trained language models have recently been expanded and applied to programming language tasks with great success, often through further pre-training of a strictly-natural language model–where training sequences typically contain both natural and (linearised) programming language. Such approaches effectively map both modalities of the sequence into the same embedding space. However, programming language keywords (e.g. “while”) often have very strictly defined semantics. As such, transfer learning from their natural language usage may not necessarily be beneficial to their code application and vice versa. Assuming an already pre-trained language model, in this work we investigate how sequence tokens can be adapted and represented differently, depending on which modality they belong to, and to the ultimate benefit of the downstream task. We experiment with separating embedding spaces between modalities during further model pre-training with modality-relative training objectives. We focus on text-to-code generation and observe consistent improvements across two backbone models and two test sets, measuring pass@k and a novel incremental variation.
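The separated-embedding-space idea can be sketched as two embedding tables over the same vocabulary, selected per token by a modality mask; this only illustrates the representation split, not the paper’s full set of modality-relative pre-training objectives:

```python
# Minimal sketch of modality-relative embeddings: the same vocabulary gets two
# embedding tables, and each token is embedded according to whether it occurs in
# the natural-language or the code part of the sequence. Illustrative only.
import torch
import torch.nn as nn

class ModalityRelativeEmbedding(nn.Module):
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.code_embed = nn.Embedding(vocab_size, dim)

    def forward(self, token_ids: torch.Tensor, is_code: torch.Tensor) -> torch.Tensor:
        # is_code: boolean mask with the same shape as token_ids.
        text = self.text_embed(token_ids)
        code = self.code_embed(token_ids)
        return torch.where(is_code.unsqueeze(-1), code, text)

emb = ModalityRelativeEmbedding(vocab_size=100, dim=16)
tokens = torch.tensor([[5, 7, 7, 9]])          # token "7" appears in both modalities
is_code = torch.tensor([[False, False, True, True]])
print(emb(tokens, is_code).shape)  # torch.Size([1, 4, 16])
```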
@article{christopoulou2024text, title = {Text-to-Code Generation with Modality-relative Pre-training}, author = {Christopoulou, Fenia and Zhang, Guchun and Lampouras, Gerasimos}, journal = {In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL)}, year = {2024}, }
- [arXiv] PanGu-Coder: Program synthesis with function-level language modeling. Fenia Christopoulou, Gerasimos Lampouras, Milan Gritta, and 8 more authors. arXiv pre-print, 2022
We present PanGu-Coder, a pretrained decoder-only language model adopting the PanGu-Alpha architecture for text-to-code generation, i.e. the synthesis of programming language solutions given a natural language problem description. We train PanGu-Coder using a two-stage strategy: the first stage employs Causal Language Modelling (CLM) to pre-train on raw programming language data, while the second stage uses a combination of Causal Language Modelling and Masked Language Modelling (MLM) training objectives that focus on the downstream task of text-to-code generation and train on loosely curated pairs of natural language program definitions and code functions. Finally, we discuss PanGu-Coder-FT, which is fine-tuned on a combination of competitive programming problems and code with continuous integration tests. We evaluate PanGu-Coder with a focus on whether it generates functionally correct programs and demonstrate that it achieves equivalent or better performance than similarly sized models, such as CodeX, while attending to a smaller context window and training on less data.
@article{christopoulou2022pangu, title = {PanGu-Coder: Program synthesis with function-level language modeling}, author = {Christopoulou, Fenia and Lampouras, Gerasimos and Gritta, Milan and Zhang, Guchun and Guo, Yinpeng and Li, Zhongqi and Zhang, Qi and Xiao, Meng and Shen, Bo and Li, Lin and others}, journal = {arXiv pre-print}, year = {2022}, }
- [EMNLP] Automatic Unit Test Data Generation and Actor-Critic Reinforcement Learning for Code Synthesis. Philip John Gorinski, Matthieu Zimmer, Gerasimos Lampouras, and 2 more authors. In Findings of the Association for Computational Linguistics (EMNLP), 2023
The advent of large pre-trained language models in the domain of Code Synthesis has shown remarkable performance on various benchmarks, treating the problem of Code Generation in a fashion similar to Natural Language Generation, trained with a Language Modelling (LM) objective. In addition, the property of programming language code being precisely evaluable with respect to its semantics – through the use of Unit Tests to check its functional correctness – lends itself to using Reinforcement Learning (RL) as a further training paradigm. Previous work has shown that RL can be applied as such to improve models’ coding capabilities; however, such RL-based methods rely on a reward signal based on defined Unit Tests, which are much harder to obtain compared to the huge crawled code datasets used in LM objectives. In this work, we present a novel approach to automatically obtain data consisting of function signatures and associated Unit Tests, suitable for RL training of Code Synthesis models. We also introduce a straightforward, simple yet effective Actor-Critic RL training scheme and show that it, in conjunction with automatically generated training data, improves a pre-trained code language model’s performance by up to 9.9% over the original underlying code synthesis LM, and up to 4.3% over RL-based models trained with standard PPO or CodeRL.
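A minimal sketch of turning unit tests into a scalar reward, the signal such RL training builds on; sandboxing, per-test isolation, and the actor-critic update itself are omitted:

```python
# Toy unit-test reward for RL on code synthesis: run the generated function
# against its unit tests and return a pass-rate reward. This only illustrates
# how unit tests yield a scalar reward signal; it is not the paper's pipeline.
import subprocess
import sys
import tempfile

def unit_test_reward(generated_code: str, tests: list) -> float:
    passed = 0
    for test in tests:
        program = generated_code + "\n" + test + "\n"
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(program)
            path = f.name
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=5)
        passed += int(result.returncode == 0)
    return passed / len(tests)

code = "def add(a, b):\n    return a + b"
tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]
print(unit_test_reward(code, tests))  # 1.0
```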
@article{gorinski2023automatic, title = {Automatic Unit Test Data Generation and Actor-Critic Reinforcement Learning for Code Synthesis}, author = {Gorinski, Philip John and Zimmer, Matthieu and Lampouras, Gerasimos and Deik, Derrick Goh Xin and Iacobacci, Ignacio}, journal = {In Findings of the Association for Computational Linguistics (EMNLP)}, year = {2023}, }