team publications | Gerasimos (Makis) Lampouras

2025

ACL
DReSD: Dense Retrieval for Speculative Decoding

Milan Gritta , Huiyin Xue , and Gerasimos Lampouras

In Findings of the Association for Computational Linguistics - ACL, 2025

Speculative decoding (SD) accelerates Large Language Model (LLM) generation by using an efficient draft model to propose the next few tokens, which are verified by the LLM in a single forward call, reducing latency while preserving its outputs. We focus on retrieval-based SD where the draft model retrieves the next tokens from a non-parametric datastore. Sparse retrieval (REST), which operates on the surface form of strings, is currently the dominant paradigm due to its simplicity and scalability. However, its effectiveness is limited due to the usage of short contexts and exact string matching. Instead, we introduce Dense Retrieval for Speculative Decoding (DReSD), a novel framework that uses approximate nearest neighbour search with contextualised token embeddings to retrieve the most semantically relevant token sequences for SD. Extensive experiments show that DReSD achieves (on average) 87% higher acceptance rates, 65% longer accepted tokens and 19% faster generation speeds compared to sparse retrieval (REST).
@article{gritta-etal-2025-dresd, title = {DReSD: Dense Retrieval for Speculative Decoding}, author = {Gritta, Milan and Xue, Huiyin and Lampouras, Gerasimos}, journal = {In Findings of the Association for Computational Linguistics - ACL}, year = {2025}, }
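The retrieval step described in the DReSD abstract can be illustrated with a toy sketch. This is not the authors' implementation: the datastore layout, the `retrieve_drafts` helper, the three-dimensional embeddings and the exact cosine search (standing in for approximate nearest neighbour search) are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve_drafts(query, datastore, k=2):
    """Return the token continuations of the k entries closest to the query.

    Hypothetical helper: exact search over a small list, used here as a
    stand-in for approximate nearest neighbour search over a large index.
    """
    ranked = sorted(datastore, key=lambda entry: cosine(query, entry[0]), reverse=True)
    return [tokens for _, tokens in ranked[:k]]

# Toy datastore: (context embedding, tokens that followed that context).
# The embeddings and token sequences are made up.
datastore = [
    ([1.0, 0.0, 0.0], ["the", "cat", "sat"]),
    ([0.9, 0.1, 0.0], ["the", "cat", "slept"]),
    ([0.0, 1.0, 0.0], ["stock", "prices", "fell"]),
]

# Embedding of the current context (made up); the retrieved sequences
# would then be passed to the LLM as draft tokens for verification.
drafts = retrieve_drafts([0.95, 0.0, 0.05], datastore, k=2)
print(drafts)
```

Unlike exact string matching, the two nearest entries here are selected by embedding similarity, so contexts that differ on the surface can still yield useful drafts.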

2024

arXiv
SparsePO: Controlling Preference Alignment of LLMs via Sparse Token Masks

Fenia Christopoulou , Ronald Cardenas , Gerasimos Lampouras , and 2 more authors

arXiv pre-print, 2024

Preference Optimization (PO) has proven an effective step for aligning language models to human-desired behaviors. Current variants, following the offline Direct Preference Optimization objective, have focused on a strict setting where all tokens are contributing signals of KL divergence and rewards to the loss function. However, human preference is not affected by each word in a sequence equally but is often dependent on specific words or phrases, e.g. existence of toxic terms leads to non-preferred responses. Based on this observation, we argue that not all tokens should be weighted equally during PO and propose a flexible objective termed SparsePO, that aims to automatically learn to weight the KL divergence and reward corresponding to each token during PO training. We propose two different variants of weight-masks that can either be derived from the reference model itself or learned on the fly. Notably, our method induces sparsity in the learned masks, allowing the model to learn how to best weight reward and KL divergence contributions at the token level, learning an optimal level of mask sparsity. Extensive experiments on multiple domains, including sentiment control, dialogue, text summarization and text-to-code generation, illustrate that our approach assigns meaningful weights to tokens according to the target task, generates more responses with the desired preference and improves reasoning tasks by up to 2 percentage points compared to other token- and response-level PO methods.
@article{christopoulou2024sparsepocontrollingpreferencealignment, title = {SparsePO: Controlling Preference Alignment of LLMs via Sparse Token Masks}, author = {Christopoulou, Fenia and Cardenas, Ronald and Lampouras, Gerasimos and Bou-Ammar, Haitham and Wang, Jun}, journal = {arXiv pre-print}, year = {2024}, }
ICLR
Mixture of Attentions For Speculative Decoding

Matthieu Zimmer , Milan Gritta , Gerasimos Lampouras , and 2 more authors

In Proceedings of the Thirteenth International Conference on Learning Representations (ICLR), 2024

The growth in the number of parameters of Large Language Models (LLMs) has led to a significant surge in computational requirements, making them challenging and costly to deploy. Speculative decoding (SD) leverages smaller models to efficiently propose future tokens, which are then verified by the LLM in parallel. Small models that utilise activations from the LLM currently achieve the fastest decoding speeds. However, we identify several limitations of SD models including the lack of on-policyness during training and partial observability. To address these shortcomings, we propose a more grounded architecture for small models by introducing a Mixture of Attentions for SD. Our novel architecture can be applied in two scenarios: a conventional single device deployment and a novel client-server deployment where the small model is hosted on a consumer device and the LLM on a server. In a single-device scenario, we demonstrate state-of-the-art speedups improving EAGLE-2 by 9.5% and its acceptance length by 25%. In a client-server setting, our experiments demonstrate: 1) state-of-the-art latencies with minimal calls to the server for different network conditions, and 2) in the event of a complete disconnection, our approach can maintain higher accuracy compared to other SD methods and demonstrates advantages over API calls to LLMs, which would otherwise be unable to continue the generation process.
@article{zimmer2024mixtureattentionsspeculativedecoding, title = {Mixture of Attentions For Speculative Decoding}, author = {Zimmer, Matthieu and Gritta, Milan and Lampouras, Gerasimos and Ammar, Haitham Bou and Wang, Jun}, journal = {In Proceedings of the Thirteenth International Conference on Learning Representations (ICLR)}, year = {2024}, }
ICLR
Human-like Episodic Memory for Infinite Context LLMs

Zafeirios Fountas , Martin A Benfeghoul , Adnan Oomerjee , and 4 more authors

In Proceedings of the Thirteenth International Conference on Learning Representations (ICLR), 2024

Large language models (LLMs) have shown remarkable capabilities, but still struggle with processing extensive contexts, limiting their ability to maintain coherence and accuracy over long sequences. In contrast, the human brain excels at organising and retrieving episodic experiences across vast temporal scales, spanning a lifetime. In this work, we introduce EM-LLM, a novel approach that integrates key aspects of human episodic memory and event cognition into LLMs with no fine-tuning, enabling them to handle practically infinite context lengths while maintaining computational efficiency. EM-LLM organises sequences of tokens into coherent episodic events using a combination of Bayesian surprise and graph-theoretic boundary refinement in an online fashion. When needed, these events are retrieved through a two-stage memory process, combining similarity-based and temporally contiguous retrieval for efficient and human-like access to relevant information. Experiments on the LongBench and InfiniteBench benchmarks demonstrate EM-LLM’s superior performance, consistently outperforming the state-of-the-art retrieval model InfLLM across various baseline LLMs. In addition, EM-LLM outperforms its popular counterpart, RAG, in a wide range of tasks, while requiring similar resources. Notably, EM-LLM’s performance even surpasses full-context models in most tasks, while successfully performing retrieval across 10 million tokens - a scale computationally infeasible for such models. Finally, our analysis reveals strong correlations between EM-LLM’s event segmentation and human-perceived events, suggesting a bridge between this artificial system and its biological counterpart, thereby offering a novel computational framework for exploring human memory mechanisms.
@article{fountas2024humanlikeepisodicmemoryinfinite, title = {Human-like Episodic Memory for Infinite Context LLMs}, author = {Fountas, Zafeirios and Benfeghoul, Martin A and Oomerjee, Adnan and Christopoulou, Fenia and Lampouras, Gerasimos and Bou-Ammar, Haitham and Wang, Jun}, journal = {In Proceedings of the Thirteenth International Conference on Learning Representations (ICLR)}, year = {2024}, }
NAACL
Code-Optimise: Self-Generated Preference Data for Correctness and Efficiency

Leonidas Gee , Milan Gritta , Gerasimos Lampouras , and 1 more author

In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024

Code Language Models have been trained to generate accurate solutions, typically with no regard for runtime. On the other hand, previous works that explored execution optimisation have observed corresponding drops in functional correctness. To that end, we introduce Code-Optimise, a framework that incorporates both correctness (passed, failed) and runtime (quick, slow) as learning signals via self-generated preference data. Our framework is both lightweight and robust as it dynamically selects solutions to reduce overfitting while avoiding a reliance on larger models for learning signals. Code-Optimise achieves significant improvements in pass@k while decreasing the competitive baseline runtimes by an additional 6% for in-domain data and up to 3% for out-of-domain data. As a byproduct, the average length of the generated solutions is reduced by up to 48% on MBPP and 23% on HumanEval, resulting in faster and cheaper inference. The generated data and codebase will be open-sourced at this http URL.
@article{gee2024codeoptimiseselfgeneratedpreferencedata, title = {Code-Optimise: Self-Generated Preference Data for Correctness and Efficiency}, author = {Gee, Leonidas and Gritta, Milan and Lampouras, Gerasimos and Iacobacci, Ignacio}, journal = {In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL)}, year = {2024}, }
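One way to picture how correctness and runtime could both act as preference signals is the sketch below. It is only an illustration under stated assumptions: the `build_preference_pairs` helper, the dictionary fields and the exhaustive pairing are hypothetical, and the sketch does not reproduce the dynamic solution selection mentioned in the abstract.

```python
from itertools import product

def build_preference_pairs(solutions):
    """Build (preferred, rejected) code pairs from sampled solutions.

    Each solution is a dict with hypothetical keys:
    'code' (str), 'passed' (bool) and 'runtime' (seconds, float).
    """
    passed = [s for s in solutions if s["passed"]]
    failed = [s for s in solutions if not s["passed"]]
    pairs = []
    # Correctness signal: any passing solution is preferred over a failing one.
    for good, bad in product(passed, failed):
        pairs.append((good["code"], bad["code"]))
    # Runtime signal: among passing solutions, the quicker one is preferred.
    for quick, slow in product(passed, passed):
        if quick["runtime"] < slow["runtime"]:
            pairs.append((quick["code"], slow["code"]))
    return pairs

# Made-up samples for a single problem.
samples = [
    {"code": "A", "passed": True, "runtime": 0.10},
    {"code": "B", "passed": True, "runtime": 0.25},
    {"code": "C", "passed": False, "runtime": 0.05},
]
pairs = build_preference_pairs(samples)
print(pairs)
```

Note that the failing solution C is never preferred even though it is the fastest, since runtime is only compared among solutions that pass.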
EACL
Text-to-Code Generation with Modality-relative Pre-training

Fenia Christopoulou , Guchun Zhang , and Gerasimos Lampouras

In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL), 2024

Large pre-trained language models have recently been expanded and applied to programming language tasks with great success, often through further pre-training of a strictly-natural language model, where training sequences typically contain both natural and (linearised) programming language. Such approaches effectively map both modalities of the sequence into the same embedding space. However, programming language keywords (e.g. “while”) often have very strictly defined semantics. As such, transfer learning from their natural language usage may not necessarily be beneficial to their code application and vice versa. Assuming an already pre-trained language model, in this work we investigate how sequence tokens can be adapted and represented differently, depending on which modality they belong to, and to the ultimate benefit of the downstream task. We experiment with separating embedding spaces between modalities during further model pre-training with modality-relative training objectives. We focus on text-to-code generation and observe consistent improvements across two backbone models and two test sets, measuring pass@k and a novel incremental variation.
@article{christopoulou2024text, title = {Text-to-Code Generation with Modality-relative Pre-training}, author = {Christopoulou, Fenia and Zhang, Guchun and Lampouras, Gerasimos}, journal = {In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL)}, year = {2024}, }
NAACL
HumanRankEval: Automatic Evaluation of LMs as Conversational Assistants

Milan Gritta , Gerasimos Lampouras , and Ignacio Iacobacci

In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024

Language models (LMs) as conversational assistants recently became popular tools that help people accomplish a variety of tasks. These typically result from adapting LMs pretrained on general domain text sequences through further instruction-tuning and possibly preference optimisation methods. The evaluation of such LMs would ideally be performed using human judgement; however, this is not scalable. On the other hand, automatic evaluation featuring auxiliary LMs as judges and/or knowledge-based tasks is scalable but struggles with assessing conversational ability and adherence to instructions. To help accelerate the development of LMs as conversational assistants, we propose a novel automatic evaluation task: HumanRankEval (HRE). It consists of a large-scale, diverse and high-quality set of questions, each with several answers authored and scored by humans. To perform evaluation, HRE ranks these answers based on their log-likelihood under the LM’s distribution, and subsequently calculates their correlation with the corresponding human rankings. We support HRE’s efficacy by investigating how efficiently it separates pretrained and instruction-tuned LMs of various sizes. We show that HRE correlates well with human judgements and is particularly responsive to model changes following instruction-tuning.
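The ranking-and-correlation idea behind HRE can be sketched for a single question as follows. The abstract does not say which correlation measure is used, so Spearman's rank correlation is an assumption here, as are the helper names and the made-up log-likelihoods and human scores.

```python
def ranks(scores):
    """Rank positions (1 = highest score); assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation for equal-length lists without ties."""
    n = len(xs)
    squared_diffs = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * squared_diffs / (n * (n * n - 1))

# Four answers to one question (all values are made up):
# the LM's log-likelihood of each answer and the score humans gave it.
log_likelihoods = [-12.5, -30.1, -18.7, -25.0]
human_scores = [42, 3, 7, 15]

rho = spearman(log_likelihoods, human_scores)
print(round(rho, 3))  # 0.8
```

A benchmark-level score would presumably aggregate such per-question correlations over the whole question set; how that aggregation is done is not specified in the abstract.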
SCI-CHAT
Findings of the First Workshop on Simulating Conversational Intelligence in Chat

Yvette Graham , Mohammed Rameez Qureshi , Haider Khalid , and 3 more authors

In Proceedings of the 1st Workshop on Simulating Conversational Intelligence in Chat (SCI-CHAT), 2024

The aim of this workshop is to bring together experts working on open-domain dialogue research. In this speedily advancing research area many challenges still exist, such as learning information from conversations, engaging in realistic and convincing simulation of human intelligence and reasoning. SCI-CHAT follows previous workshops on open domain dialogue but with a focus on the simulation of intelligent conversation as judged in a live human evaluation. Models aim to include the ability to follow a challenging topic over a multi-turn conversation, while positing, refuting and reasoning over arguments. The workshop included both a research track and shared task. The main goal of this paper is to provide an overview of the shared task and a link to an additional paper that will include an in depth analysis of the shared task results following presentation at the workshop.
CVPR
MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation

Petru-Daniel Tudosiu , Yongxin Yang , Shifeng Zhang , and 5 more authors

In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2024

Text-to-image generation has achieved astonishing results, yet precise spatial controllability and prompt fidelity remain highly challenging. This limitation is typically addressed through cumbersome prompt engineering, scene layout conditioning, or image editing techniques which often require hand drawn masks. Nonetheless, pre-existing works struggle to take advantage of the natural instance-level compositionality of scenes due to the typically flat nature of rasterized RGB output images. Towards addressing this challenge, we introduce MuLAn: a novel dataset comprising over 44K MUlti-Layer ANnotations of RGB images as multilayer, instance-wise RGBA decompositions, and over 100K instance images. To build MuLAn, we developed a training free pipeline which decomposes a monocular RGB image into a stack of RGBA layers comprising background and isolated instances. We achieve this through the use of pretrained general-purpose models, and by developing three modules: image decomposition for instance discovery and extraction, instance completion to reconstruct occluded areas, and image re-assembly. We use our pipeline to create MuLAn-COCO and MuLAn-LAION datasets, which contain a variety of image decompositions in terms of style, composition and complexity. With MuLAn, we provide the first photorealistic resource providing instance decomposition and occlusion information for high quality images, opening up new avenues for text-to-image generative AI research. With this, we aim to encourage the development of novel generation and editing technology, in particular layer-wise solutions. MuLAn data resources are available at this https URL.
@article{tudosiu2024mulanmultilayerannotated, title = {MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation}, author = {Tudosiu, Petru-Daniel and Yang, Yongxin and Zhang, Shifeng and Chen, Fei and McDonagh, Steven and Lampouras, Gerasimos and Iacobacci, Ignacio and Parisot, Sarah}, journal = {In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)}, year = {2024}, }
arXiv
Encoder-Decoder Framework for Interactive Free Verses with Generation with Controllable High-Quality Rhyming

Tommaso Pasini , Alejo López-Ávila , Husam Quteineh , and 5 more authors

arXiv pre-print, 2024

Composing poetry or lyrics involves several creative factors, but a challenging aspect of generation is the adherence to a more or less strict metric and rhyming pattern. To address this challenge specifically, previous work on the task has mainly focused on reverse language modeling, which brings the critical selection of each rhyming word to the forefront of each verse. On the other hand, reversing the word order requires that models be trained from scratch with this task-specific goal and cannot take advantage of transfer learning from a Pretrained Language Model (PLM). We propose a novel fine-tuning approach that prepends the rhyming word at the start of each lyric, which allows the critical rhyming decision to be made before the model commits to the content of the lyric (as during reverse language modeling), but maintains compatibility with the word order of regular PLMs as the lyric itself is still generated in left-to-right order. We conducted extensive experiments to compare this fine-tuning against the current state-of-the-art strategies for rhyming, finding that our approach generates more readable text and better rhyming capabilities. Furthermore, we furnish a high-quality dataset in English and 12 other languages, analyse the approach’s feasibility in a multilingual context, provide extensive experimental results shedding light on good and bad practices for lyrics generation, and propose metrics to compare methods in the future.
@article{pasini2024encoderdecoderframeworkinteractivefree, title = {Encoder-Decoder Framework for Interactive Free Verses with Generation with Controllable High-Quality Rhyming}, author = {Pasini, Tommaso and López-Ávila, Alejo and Quteineh, Husam and Lampouras, Gerasimos and Du, Jinhua and Wang, Yubing and Li, Ze and Sun, Yusen}, journal = {arXiv pre-print}, year = {2024}, }
IJCAI
Correct and Optimal: the Regular Expression Inference Challenge

Mojtaba Valizadeh , Philip John Gorinski , Ignacio Iacobacci , and 1 more author

In Proceedings of the International Joint Conference on Artificial Intelligence, 2024

We propose regular expression inference (REI) as a challenge for code/language modelling, and the wider machine learning community. REI is a supervised machine learning (ML) and program optimisation task, and poses the problem of finding minimal regular expressions from examples: Given two finite sets of strings P and N and a cost function cost(·), the task is to generate an expression r that accepts all strings in P and rejects all strings in N, while no other such expression r′ exists with cost(r′) < cost(r). REI has advantages as a challenge problem: (i) regular expressions are well-known, widely used, and a natural idealisation of code; (ii) REI’s asymptotic worst-case complexity is well understood; (iii) REI has a small number of easy to understand parameters (e.g. P or N cardinality, string lengths of examples, or the cost function); this lets us easily finetune REI-hardness; (iv) REI, with its emphasis on optimisation, is an unsolved problem for deep learning based ML. Recently, an REI solver was implemented on GPUs, using program synthesis techniques. This enabled, for the first time, fast generation of minimal regular expressions for complex REI instances. Building on this advance, we generate and publish the first large-scale datasets for REI, and devise and evaluate several initial heuristic and machine learning baselines. We invite the community to participate and explore ML methods that learn to solve REI problems. We believe that progress in REI directly translates to progress in code/language modelling.
@article{valizadeh2024correctoptimalregularexpression, title = {Correct and Optimal: the Regular Expression Inference Challenge}, author = {Valizadeh, Mojtaba and Gorinski, Philip John and Iacobacci, Ignacio and Berger, Martin}, journal = {In Proceedings of the International Joint Conference on Artificial Intelligence}, year = {2024}, }
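The acceptance and cost conditions in the REI definition can be checked mechanically, as in the sketch below. The cost function (pattern length), the helper names and the candidate list are assumptions for illustration, and Python's `re` syntax is used as a stand-in for whatever regular expression language the challenge defines. The sketch only picks the cheapest consistent expression from a given list; it does not search for a globally minimal one.

```python
import re

def cost(pattern):
    """Hypothetical cost function: the number of characters in the pattern."""
    return len(pattern)

def is_consistent(pattern, positives, negatives):
    """True if the pattern accepts every string in P and rejects every string in N."""
    compiled = re.compile(pattern)
    accepts_all = all(compiled.fullmatch(s) for s in positives)
    rejects_all = not any(compiled.fullmatch(s) for s in negatives)
    return accepts_all and rejects_all

def best_candidate(candidates, positives, negatives):
    """Cheapest consistent pattern among the candidates, or None if there is none."""
    consistent = [r for r in candidates if is_consistent(r, positives, negatives)]
    return min(consistent, key=cost) if consistent else None

# Made-up REI instance.
P = {"ab", "aab", "aaab"}
N = {"b", "ba", "abb"}
candidates = ["a*b", "ab|aab|aaab", "a+b"]

best = best_candidate(candidates, P, N)
print(best)  # a+b
```

Here `a*b` is rejected because it accepts the negative example `b`, while `ab|aab|aaab` is consistent but more costly than `a+b` under the assumed cost function.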
arXiv
αVIL: Learning to Leverage Auxiliary Tasks for Multitask Learning

Rafael Kourdis , Gabriel Gordon-Hall , and Philip John Gorinski

arXiv pre-print, 2024

Multitask Learning is a Machine Learning paradigm that aims to train a range of (usually related) tasks with the help of a shared model. While the goal is often to improve the joint performance of all training tasks, another approach is to focus on the performance of a specific target task, while treating the remaining ones as auxiliary data from which to possibly leverage positive transfer towards the target during training. In such settings, it becomes important to estimate the positive or negative influence auxiliary tasks will have on the target. While many ways have been proposed to estimate task weights before or during training, they typically rely on heuristics or extensive search of the weighting space. We propose a novel method called α-Variable Importance Learning (αVIL) that is able to adjust task weights dynamically during model training, by making direct use of task-specific updates of the underlying model’s parameters between training epochs. Experiments indicate that αVIL is able to outperform other Multitask Learning approaches in a variety of settings. To our knowledge, this is the first attempt at making direct use of model updates for task weight estimation.
@article{kourdis2024alphavillearningleverageauxiliary, title = {αVIL: Learning to Leverage Auxiliary Tasks for Multitask Learning}, author = {Kourdis, Rafael and Gordon-Hall, Gabriel and Gorinski, Philip John}, journal = {arXiv pre-print}, year = {2024}, }

2023

EACL
Exploring data augmentation for code generation tasks

Pinzhen Chen , and Gerasimos Lampouras

In Findings of the Association for Computational Linguistics - EACL, 2023

Advances in natural language processing, such as transfer learning from pre-trained language models, have impacted how models are trained for programming language tasks too. Previous research primarily explored code pre-training and expanded it through multi-modality and multi-tasking, yet the data for downstream tasks remain modest in size. Focusing on data utilization for downstream tasks, we propose and adapt augmentation methods that yield consistent improvements in code translation and summarization by up to 6.9% and 7.5% respectively. Further analysis suggests that our methods work orthogonally and show benefits in output code style and numeric consistency. We also discuss test data imperfections.
@article{chen2023exploring, title = {Exploring data augmentation for code generation tasks}, author = {Chen, Pinzhen and Lampouras, Gerasimos}, journal = {In Findings of the Association for Computational Linguistics - EACL}, year = {2023}, }
EMNLP
Automatic Unit Test Data Generation and Actor-Critic Reinforcement Learning for Code Synthesis

Philip John Gorinski , Matthieu Zimmer , Gerasimos Lampouras , and 2 more authors

In Findings of the Association for Computational Linguistics - EMNLP, 2023

The advent of large pre-trained language models in the domain of Code Synthesis has shown remarkable performance on various benchmarks, treating the problem of Code Generation in a fashion similar to Natural Language Generation, trained with a Language Modelling (LM) objective. In addition, the property of programming language code being precisely evaluable with respect to its semantics – through the use of Unit Tests to check its functional correctness – lends itself to using Reinforcement Learning (RL) as a further training paradigm. Previous work has shown that RL can be applied as such to improve models’ coding capabilities; however, such RL-based methods rely on a reward signal based on defined Unit Tests, which are much harder to obtain compared to the huge crawled code datasets used in LM objectives. In this work, we present a novel approach to automatically obtain data consisting of function signatures and associated Unit Tests, suitable for RL training of Code Synthesis models. We also introduce a simple yet effective Actor-Critic RL training scheme and show that it, in conjunction with automatically generated training data, improves a pre-trained code language model’s performance by up to 9.9% over the original underlying code synthesis LM, and by up to 4.3% over RL-based models trained with standard PPO or CodeRL.
@article{gorinski2023automatic, title = {Automatic Unit Test Data Generation and Actor-Critic Reinforcement Learning for Code Synthesis}, author = {Gorinski, Philip John and Zimmer, Matthieu and Lampouras, Gerasimos and Deik, Derrick Goh Xin and Iacobacci, Ignacio}, journal = {In Findings of the Association for Computational Linguistics - EMNLP}, year = {2023}, }
TACL
Multi3WOZ: A Multilingual, Multi-Domain, Multi-Parallel Dataset for Training and Evaluating Culturally Adapted Task-Oriented Dialog Systems

Songbo Hu , Han Zhou , Mete Hergul , and 5 more authors

In Transactions of the Association for Computational Linguistics, 2023

Creating high-quality annotated data for task-oriented dialog (ToD) is known to be notoriously difficult, and the challenges are amplified when the goal is to create equitable, culturally adapted, and large-scale ToD datasets for multiple languages. Therefore, the current datasets are still very scarce and suffer from limitations such as translation-based non-native dialogs with translation artefacts, small scale, or lack of cultural adaptation, among others. In this work, we first take stock of the current landscape of multilingual ToD datasets, offering a systematic overview of their properties and limitations. Aiming to reduce all the detected limitations, we then introduce Multi3WOZ, a novel multilingual, multi-domain, multi-parallel ToD dataset. It is large-scale and offers culturally adapted dialogs in 4 languages to enable training and evaluation of multilingual and cross-lingual ToD systems. We describe a complex bottom-up data collection process that yielded the final dataset, and offer the first sets of baseline scores across different ToD-related tasks for future reference, also highlighting its challenging nature.
@article{hu-etal-2023-multi-3, title = {Multi3WOZ: A Multilingual, Multi-Domain, Multi-Parallel Dataset for Training and Evaluating Culturally Adapted Task-Oriented Dialog Systems}, author = {Hu, Songbo and Zhou, Han and Hergul, Mete and Gritta, Milan and Zhang, Guchun and Iacobacci, Ignacio and Vuli{\'c}, Ivan and Korhonen, Anna}, journal = {In Transactions of the Association for Computational Linguistics}, year = {2023}, }
EMNLP
A Systematic Study of Performance Disparities in Multilingual Task-Oriented Dialogue Systems

Songbo Hu , Han Zhou , Moy Yuan , and 5 more authors

In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2023

Achieving robust language technologies that can perform well across the world’s many languages is a central goal of multilingual NLP. In this work, we take stock of and empirically analyse task performance disparities that exist between multilingual task-oriented dialogue (ToD) systems. We first define new quantitative measures of absolute and relative equivalence in system performance, capturing disparities across languages and within individual languages. Through a series of controlled experiments, we demonstrate that performance disparities depend on a number of factors: the nature of the ToD task at hand, the underlying pretrained language model, the target language, and the amount of ToD annotated data. We empirically prove the existence of the adaptation and intrinsic biases in current ToD systems: e.g., ToD systems trained for Arabic or Turkish using annotated ToD data fully parallel to English ToD data still exhibit diminished ToD task performance. Beyond providing a series of insights into the performance disparities of ToD systems in different languages, our analyses offer practical tips on how to approach ToD data collection and system development for new languages.
@article{hu-etal-2023-systematic, title = {A Systematic Study of Performance Disparities in Multilingual Task-Oriented Dialogue Systems}, author = {Hu, Songbo and Zhou, Han and Yuan, Moy and Gritta, Milan and Zhang, Guchun and Iacobacci, Ignacio and Korhonen, Anna and Vuli{\'c}, Ivan}, journal = {In Proceedings of the Conference on Empirical Methods in Natural Language Processing}, year = {2023}, }
EACL
XQA-DST: Multi-Domain and Multi-Lingual Dialogue State Tracking

Han Zhou , Ignacio Iacobacci , and Pasquale Minervini

In Findings of the Association for Computational Linguistics - EACL, 2023

Dialogue State Tracking (DST), a crucial component of task-oriented dialogue (ToD) systems, keeps track of all important information pertaining to dialogue history: filling slots with the most probable values throughout the conversation. Existing methods generally rely on a predefined set of values and struggle to generalise to previously unseen slots in new domains. To overcome these challenges, we propose a domain-agnostic extractive question answering (QA) approach with shared weights across domains. To disentangle the complex domain information in ToDs, we train our DST with a novel domain filtering strategy by excluding out-of-domain question samples. With an independent classifier that predicts the presence of multiple domains given the context, our model tackles DST by extracting spans in active domains. Empirical results demonstrate that our model can efficiently leverage domain-agnostic QA datasets by two-stage fine-tuning while being both domain-scalable and open-vocabulary in DST. It shows strong transferability by achieving zero-shot domain-adaptation results on MultiWOZ 2.1 with an average JGA of 36.7%. It further achieves cross-lingual transfer with state-of-the-art zero-shot results, 66.2% JGA from English to German and 75.7% JGA from English to Italian on WOZ 2.0.
@article{zhou2023xqadstmultidomainmultilingualdialogue, title = {XQA-DST: Multi-Domain and Multi-Lingual Dialogue State Tracking}, author = {Zhou, Han and Iacobacci, Ignacio and Minervini, Pasquale}, journal = {In Findings of the Association for Computational Linguistics - EACL}, year = {2023}, }
arXiv
Graph Attention with Hierarchies for Multi-hop Question Answering

Yunjie He , Philip John Gorinski , Ieva Staliunaite , and 1 more author

arXiv pre-print, 2023

Multi-hop QA (Question Answering) is the task of finding the answer to a question across multiple documents. In recent years, a number of Deep Learning-based approaches have been proposed to tackle this complex task, as well as a few standard benchmarks to assess models’ Multi-hop QA capabilities. In this paper, we focus on the well-established HotpotQA benchmark dataset, which requires models to perform answer span extraction as well as support sentence prediction. We present two extensions to the SOTA Graph Neural Network (GNN) based model for HotpotQA, Hierarchical Graph Network (HGN): (i) we complete the original hierarchical structure by introducing new edges between the query and context sentence nodes; (ii) in the graph propagation step, we propose a novel extension to Hierarchical Graph Attention Network, GATH (Graph ATtention with Hierarchies), that makes use of the graph hierarchy to update the node representations in a sequential fashion. Experiments on HotpotQA demonstrate the efficiency of the proposed modifications and support our assumptions about the effects of model related variables.
@article{he2023graphattentionhierarchiesmultihop, title = {Graph Attention with Hierarchies for Multi-hop Question Answering}, author = {He, Yunjie and Gorinski, Philip John and Staliunaite, Ieva and Stenetorp, Pontus}, journal = {arXiv pre-print}, year = {2023}, }

2022

ACL
Hierarchical Recurrent Aggregative Generation for Few-Shot NLG

Giulio Zhou , Gerasimos Lampouras , and Ignacio Iacobacci

In Findings of the Association for Computational Linguistics - ACL, 2022

Abs Bib PDF

Large pretrained models enable transfer learning to low-resource domains for language generation tasks. However, previous end-to-end approaches do not account for the fact that some generation sub-tasks, specifically aggregation and lexicalisation, can benefit from transfer learning to different extents. To exploit these varying potentials for transfer learning, we propose a new hierarchical approach for few-shot and zero-shot generation. Our approach consists of a jointly trained architecture with three modules: the first module independently lexicalises the distinct units of information in the input as sentence sub-units (e.g. phrases), the second module recurrently aggregates these sub-units to generate a unified intermediate output, while the third module subsequently post-edits it to generate a coherent and fluent final text. We perform extensive empirical analysis and ablation studies on few-shot and zero-shot settings across 4 datasets. Automatic and human evaluation shows that the proposed hierarchical approach is consistently capable of achieving state-of-the-art results when compared to previous work.
@article{zhou2022hierarchical, title = {Hierarchical Recurrent Aggregative Generation for Few-Shot NLG}, author = {Zhou, Giulio and Lampouras, Gerasimos and Iacobacci, Ignacio}, journal = {In Findings of the Association for Computational Linguistics - ACL}, year = {2022}, }
arXiv
PanGu-Coder: Program synthesis with function-level language modeling

Fenia Christopoulou , Gerasimos Lampouras , Milan Gritta , and 8 more authors

arXiv pre-print, 2022

Abs arXiv Bib

We present PanGu-Coder, a pretrained decoder-only language model adopting the PanGu-Alpha architecture for text-to-code generation, i.e. the synthesis of programming language solutions given a natural language problem description. We train PanGu-Coder using a two-stage strategy: the first stage employs Causal Language Modelling (CLM) to pre-train on raw programming language data, while the second stage uses a combination of Causal Language Modelling and Masked Language Modelling (MLM) training objectives that focus on the downstream task of text-to-code generation and train on loosely curated pairs of natural language program definitions and code functions. Finally, we discuss PanGu-Coder-FT, which is fine-tuned on a combination of competitive programming problems and code with continuous integration tests. We evaluate PanGu-Coder with a focus on whether it generates functionally correct programs and demonstrate that it achieves equivalent or better performance than similarly sized models, such as CodeX, while attending to a smaller context window and training on less data.
@article{christopoulou2022pangu, title = {PanGu-Coder: Program synthesis with function-level language modeling}, author = {Christopoulou, Fenia and Lampouras, Gerasimos and Gritta, Milan and Zhang, Guchun and Guo, Yinpeng and Li, Zhongqi and Zhang, Qi and Xiao, Meng and Shen, Bo and Li, Lin and others}, journal = {arXiv pre-print}, year = {2022}, }
EMNLP
Training Dynamics for Curriculum Learning: A Study on Monolingual and Cross-lingual NLU

Fenia Christopoulou , Gerasimos Lampouras , and Ignacio Iacobacci

In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2022

Abs arXiv Bib

Curriculum Learning (CL) is a technique of training models via ranking examples in a typically increasing difficulty trend with the aim of accelerating convergence and improving generalisability. Current approaches for Natural Language Understanding (NLU) tasks use CL to improve in-distribution data performance, often via heuristic-oriented or task-agnostic difficulties. In this work, instead, we employ CL for NLU by taking advantage of training dynamics as difficulty metrics, i.e., statistics that measure the behavior of the model at hand on specific task-data instances during training, and propose modifications of existing CL schedulers based on these statistics. Unlike existing work, we focus on evaluating models on in-distribution (ID), out-of-distribution (OOD) as well as zero-shot (ZS) cross-lingual transfer datasets. We show across several NLU tasks that CL with training dynamics can result in better performance, mostly in zero-shot cross-lingual transfer and OOD settings, with improvements of up to 8.5% in certain cases. Overall, experiments indicate that training dynamics can lead to better performing models with smoother training compared to other difficulty metrics while being 20% faster on average. In addition, through analysis we shed light on the correlations of task-specific versus task-agnostic metrics.
@article{christopoulou2022training, title = {Training Dynamics for Curriculum Learning: A Study on Monolingual and Cross-lingual NLU}, author = {Christopoulou, Fenia and Lampouras, Gerasimos and Iacobacci, Ignacio}, journal = {In Proceedings of the Conference on Empirical Methods in Natural Language Processing}, year = {2022}, }
EMNLP
Topic-aware response generation in task-oriented dialogue with unstructured knowledge access

Yue Feng , Gerasimos Lampouras , and Ignacio Iacobacci

In Findings of the Association for Computational Linguistics - EMNLP, 2022

Abs arXiv Bib

To alleviate the problem of structured databases’ limited coverage, recent task-oriented dialogue systems incorporate external unstructured knowledge to guide the generation of system responses. However, these usually use word or sentence level similarities to detect the relevant knowledge context, which only partially capture the topical level relevance. In this paper, we examine how to better integrate topical information in knowledge grounded task-oriented dialogue and propose “Topic-Aware Response Generation” (TARG), an end-to-end response generation model. TARG incorporates multiple topic-aware attention mechanisms to derive the importance weighting scheme over dialogue utterances and external knowledge sources towards a better understanding of the dialogue history. Experimental results indicate that TARG achieves state-of-the-art performance in knowledge selection and response generation, outperforming previous state-of-the-art by 3.2, 3.6, and 4.2 points in EM, F1 and BLEU-4 respectively on Doc2Dial, and performing comparably with previous work on DSTC9; both being knowledge-grounded task-oriented dialogue datasets.
@article{feng2022topic, title = {Topic-aware response generation in task-oriented dialogue with unstructured knowledge access}, author = {Feng, Yue and Lampouras, Gerasimos and Iacobacci, Ignacio}, journal = {In Findings of the Association for Computational Linguistics - EMNLP}, year = {2022}, }
EMNLP
EntityCS: Improving Zero-Shot Cross-lingual Transfer with Entity-Centric Code Switching

Chenxi Whitehouse , Fenia Christopoulou , and Ignacio Iacobacci

In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2022

Abs arXiv Bib

Accurate alignment between languages is fundamental for improving cross-lingual pre-trained language models (XLMs). Motivated by the natural phenomenon of code-switching (CS) in multilingual speakers, CS has been used as an effective data augmentation method that offers language alignment at the word- or phrase-level, in contrast to sentence-level via parallel instances. Existing approaches either use dictionaries or parallel sentences with word alignment to generate CS data by randomly switching words in a sentence. However, such methods can be suboptimal as dictionaries disregard semantics, and syntax might become invalid after random word switching. In this work, we propose EntityCS, a method that focuses on Entity-level Code-Switching to capture fine-grained cross-lingual semantics without corrupting syntax. We use Wikidata and English Wikipedia to construct an entity-centric CS corpus by switching entities to their counterparts in other languages. We further propose entity-oriented masking strategies during intermediate model training on the EntityCS corpus for improving entity prediction. Evaluation of the trained models on four entity-centric downstream tasks shows consistent improvements over the baseline with a notable increase of 10% in Fact Retrieval. We release the corpus and models to assist research on code-switching and enriching XLMs with external knowledge.
@article{whitehouse2023entitycsimprovingzeroshotcrosslingual, title = {EntityCS: Improving Zero-Shot Cross-lingual Transfer with Entity-Centric Code Switching}, author = {Whitehouse, Chenxi and Christopoulou, Fenia and Iacobacci, Ignacio}, journal = {In Proceedings of the Conference on Empirical Methods in Natural Language Processing}, year = {2022}, }
arXiv
Relational Graph Convolutional Neural Networks for Multihop Reasoning: A Comparative Study

Ieva Staliūnaitė , Philip John Gorinski , and Ignacio Iacobacci

arXiv pre-print, 2022

Abs arXiv Bib

Multihop Question Answering is a complex Natural Language Processing task that requires multiple steps of reasoning to find the correct answer to a given question. Previous research has explored the use of models based on Graph Neural Networks for tackling this task. Various architectures have been proposed, including Relational Graph Convolutional Networks (RGCN). For these, many node types and relations between them have been introduced, such as simple entity co-occurrences, modelling coreferences, or "reasoning paths" from questions to answers via intermediary entities. Nevertheless, a thoughtful analysis of which relations, node types, embeddings and architecture are the most beneficial for this task is still missing. In this paper we explore a number of RGCN-based Multihop QA models, graph relations, and node embeddings, and empirically explore the influence of each on Multihop QA performance on the WikiHop dataset.
@article{staliūnaitė2022relationalgraphconvolutionalneural, title = {Relational Graph Convolutional Neural Networks for Multihop Reasoning: A Comparative Study}, author = {Staliūnaitė, Ieva and Gorinski, Philip John and Iacobacci, Ignacio}, journal = {arXiv pre-print}, year = {2022}, }
ACL
CrossAligner & Co: Zero-Shot Transfer Methods for Task-Oriented Cross-lingual Natural Language Understanding

Milan Gritta , Ruoyu Hu , and Ignacio Iacobacci

In Findings of the Association for Computational Linguistics, 2022

Abs arXiv Bib

Task-oriented personal assistants enable people to interact with a host of devices and services using natural language. One of the challenges of making neural dialogue systems available to more users is the lack of training data for all but a few languages. Zero-shot methods try to solve this issue by acquiring task knowledge in a high-resource language such as English with the aim of transferring it to the low-resource language(s). To this end, we introduce CrossAligner, the principal method of a variety of effective approaches for zero-shot cross-lingual transfer based on learning alignment from unlabelled parallel data. We present a quantitative analysis of individual methods as well as their weighted combinations, several of which exceed state-of-the-art (SOTA) scores as evaluated across nine languages, fifteen test sets and three benchmark multilingual datasets. A detailed qualitative error analysis of the best methods shows that our fine-tuned language models can zero-shot transfer the task knowledge better than anticipated.
@article{gritta2022crossalignercozeroshot, title = {CrossAligner \& Co: Zero-Shot Transfer Methods for Task-Oriented Cross-lingual Natural Language Understanding}, author = {Gritta, Milan and Hu, Ruoyu and Iacobacci, Ignacio}, journal = {In Findings of the Association for Computational Linguistics}, year = {2022}, }
arXiv
Structured Q-learning For Antibody Design

Alexander I. Cowen-Rivers , Philip John Gorinski , Aivar Sootla , and 5 more authors

arXiv pre-print, 2022

Abs arXiv Bib

@article{cowenrivers2022structuredqlearningantibodydesign, title = {Structured Q-learning For Antibody Design}, author = {Cowen-Rivers, Alexander I. and Gorinski, Philip John and Sootla, Aivar and Khan, Asif and Furui, Liu and Wang, Jun and Peters, Jan and Ammar, Haitham Bou}, journal = {arXiv pre-print}, year = {2022}, }

2021

TACL
Conversation graph: Data augmentation, training, and evaluation for non-deterministic dialogue management

Milan Gritta , Gerasimos Lampouras , and Ignacio Iacobacci

In Transactions of the Association for Computational Linguistics (TACL), 2021

Abs arXiv Bib

Task-oriented dialogue systems typically rely on large amounts of high-quality training data or require complex handcrafted rules. However, existing datasets are often limited in size considering the complexity of the dialogues. Additionally, conventional training signal inference is not suitable for non-deterministic agent behaviour, i.e. considering multiple actions as valid in identical dialogue states. We propose the Conversation Graph (ConvGraph), a graph-based representation of dialogues that can be exploited for data augmentation, multi-reference training and evaluation of non-deterministic agents. ConvGraph generates novel dialogue paths to augment data volume and diversity. Intrinsic and extrinsic evaluation across three datasets shows that data augmentation and/or multi-reference training with ConvGraph can improve dialogue success rates by up to 6.4%.
@article{gritta2021conversation, title = {Conversation graph: Data augmentation, training, and evaluation for non-deterministic dialogue management}, author = {Gritta, Milan and Lampouras, Gerasimos and Iacobacci, Ignacio}, journal = {In Transactions of the Association for Computational Linguistics (TACL)}, year = {2021}, }
ACL-IJCNLP
Generalising multilingual concept-to-text NLG with language agnostic delexicalisation

Giulio Zhou , and Gerasimos Lampouras

In Proceedings of the Annual Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing, 2021

Abs arXiv Bib

Concept-to-text Natural Language Generation is the task of expressing an input meaning representation in natural language. Previous approaches in this task have been able to generalise to rare or unseen instances by relying on a delexicalisation of the input. However, this often requires that the input appears verbatim in the output text. This poses challenges in multilingual settings, where the task expands to generate the output text in multiple languages given the same input. In this paper, we explore the application of multilingual models in concept-to-text and propose Language Agnostic Delexicalisation, a novel delexicalisation method that uses multilingual pretrained embeddings, and employs a character-level post-editing model to inflect words in their correct form during relexicalisation. Our experiments across five datasets and five languages show that multilingual models outperform monolingual models in concept-to-text and that our framework outperforms previous approaches, especially for low resource languages.
@article{zhou2021generalising, title = {Generalising multilingual concept-to-text NLG with language agnostic delexicalisation}, author = {Zhou, Giulio and Lampouras, Gerasimos}, journal = {In Proceedings of the Annual Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing}, year = {2021}, }
EMNLP
Informed sampling for diversity in concept-to-text NLG

Giulio Zhou , and Gerasimos Lampouras

In Findings of the Association for Computational Linguistics - EMNLP, 2021

Abs arXiv Bib

Deep-learning models for language generation tasks tend to produce repetitive output. Various methods have been proposed to encourage lexical diversity during decoding, but this often comes at a cost to the perceived fluency and adequacy of the output. In this work, we propose to ameliorate this cost by using an Imitation Learning approach to explore the level of diversity that a language generation model can reliably produce. Specifically, we augment the decoding process with a meta-classifier trained to distinguish which words at any given timestep will lead to high-quality output. We focus our experiments on concept-to-text generation where models are sensitive to the inclusion of irrelevant words due to the strict relation between input and output. Our analysis shows that previous methods for diversity underperform in this setting, while human evaluation suggests that our proposed method achieves a high level of diversity with minimal effect on the output’s fluency and adequacy.
@article{zhou2020informed, title = {Informed sampling for diversity in concept-to-text NLG}, author = {Zhou, Giulio and Lampouras, Gerasimos}, journal = {In Findings of the Association for Computational Linguistics - EMNLP}, year = {2021}, }
arXiv
Improving Commonsense Causal Reasoning by Adversarial Training and Data Augmentation

Ieva Staliūnaitė , Philip John Gorinski , and Ignacio Iacobacci

arXiv pre-print, 2021

Abs arXiv Bib

Determining the plausibility of causal relations between clauses is a commonsense reasoning task that requires complex inference ability. The general approach to this task is to train a large pretrained language model on a specific dataset. However, the available training data for the task is often scarce, which leads to instability of model training or reliance on the shallow features of the dataset. This paper presents a number of techniques for making models more robust in the domain of causal reasoning. Firstly, we perform adversarial training by generating perturbed inputs through synonym substitution. Secondly, based on a linguistic theory of discourse connectives, we perform data augmentation using a discourse parser for detecting causally linked clauses in large text, and a generative language model for generating distractors. Both methods boost model performance on the Choice of Plausible Alternatives (COPA) dataset, as well as on a Balanced COPA dataset, which is a modified version of the original data that has been developed to avoid superficial cues, leading to a more challenging benchmark. We show a statistically significant improvement in performance and robustness on both datasets, even with only a small number of additionally generated data points.
@article{staliūnaitė2021improvingcommonsensecausalreasoning, title = {Improving Commonsense Causal Reasoning by Adversarial Training and Data Augmentation}, author = {Staliūnaitė, Ieva and Gorinski, Philip John and Iacobacci, Ignacio}, journal = {arXiv pre-print}, year = {2021}, }
ACL
Enhancing Transformers with Gradient Boosted Decision Trees for NLI Fine-Tuning

Benjamin Minixhofer , Milan Gritta , and Ignacio Iacobacci

In Findings of the Association for Computational Linguistics, 2021

Abs arXiv Bib

Transfer learning has become the dominant paradigm for many natural language processing tasks. In addition to models being pretrained on large datasets, they can be further trained on intermediate (supervised) tasks that are similar to the target task. For small Natural Language Inference (NLI) datasets, language modelling is typically followed by pretraining on a large (labelled) NLI dataset before fine-tuning with each NLI subtask. In this work, we explore Gradient Boosted Decision Trees (GBDTs) as an alternative to the commonly used Multi-Layer Perceptron (MLP) classification head. GBDTs have desirable properties such as good performance on dense, numerical features and are effective where the ratio of the number of samples w.r.t the number of features is low. We then introduce FreeGBDT, a method of fitting a GBDT head on the features computed during fine-tuning to increase performance without additional computation by the neural network. We demonstrate the effectiveness of our method on several NLI datasets using a strong baseline model (RoBERTa-large with MNLI pretraining). The FreeGBDT shows a consistent improvement over the MLP classification head.
@article{Minixhofer_2021, title = {Enhancing Transformers with Gradient Boosted Decision Trees for NLI Fine-Tuning}, author = {Minixhofer, Benjamin and Gritta, Milan and Iacobacci, Ignacio}, journal = {In Findings of the Association for Computational Linguistics}, year = {2021}, }
ACL
XeroAlign: Zero-Shot Cross-lingual Transformer Alignment

Milan Gritta , and Ignacio Iacobacci

In Findings of the Association for Computational Linguistics, 2021

Abs arXiv Bib Code

@article{gritta2021xeroalignzeroshotcrosslingualtransformer, title = {XeroAlign: Zero-Shot Cross-lingual Transformer Alignment}, author = {Gritta, Milan and Iacobacci, Ignacio}, journal = {In Findings of the Association for Computational Linguistics}, year = {2021}, }

2020

DSTC
Show us the way: Learning to manage dialog from demonstrations

Gabriel Gordon-Hall , Philip John Gorinski , Gerasimos Lampouras , and 1 more author

In Proceedings of the Eighth Dialog System Technology Challenge at AAAI, 2020

Abs arXiv Bib

We present our submission to the End-to-End Multi-Domain Dialog Challenge Track of the Eighth Dialog System Technology Challenge. Our proposed dialog system adopts a pipeline architecture, with distinct components for Natural Language Understanding, Dialog State Tracking, Dialog Management and Natural Language Generation. At the core of our system is a reinforcement learning algorithm which uses Deep Q-learning from Demonstrations to learn a dialog policy with the help of expert examples. We find that demonstrations are essential to training an accurate dialog policy where both state and action spaces are large. Evaluation of our Dialog Management component shows that our approach is effective - beating supervised and reinforcement learning baselines.
@article{gordon2020show, title = {Show us the way: Learning to manage dialog from demonstrations}, author = {Gordon-Hall, Gabriel and Gorinski, Philip John and Lampouras, Gerasimos and Iacobacci, Ignacio}, journal = {In Proceedings of the Eighth Dialog System Technology Challenge at AAAI}, year = {2020}, }
EMNLP
Compositional and Lexical Semantics in RoBERTa, BERT and DistilBERT: A Case Study on CoQA

Ieva Staliūnaitė , and Ignacio Iacobacci

In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2020

Abs arXiv Bib

Many NLP tasks have benefited from transferring knowledge from contextualized word embeddings; however, the picture of what type of knowledge is transferred is incomplete. This paper studies the types of linguistic phenomena accounted for by language models in the context of a Conversational Question Answering (CoQA) task. We identify the problematic areas for the finetuned RoBERTa, BERT and DistilBERT models through systematic error analysis: basic arithmetic (counting phrases), compositional semantics (negation and Semantic Role Labeling), and lexical semantics (surprisal and antonymy). When enhanced with the relevant linguistic knowledge through multitask learning, the models improve in performance. Ensembles of the enhanced models yield a boost between 2.2 and 2.7 points in F1 score overall, and up to 42.1 points in F1 on the hardest question classes. The results show differences in ability to represent compositional and lexical information between RoBERTa, BERT and DistilBERT.
@article{staliūnaitė2020compositionallexicalsemanticsroberta, title = {Compositional and Lexical Semantics in RoBERTa, BERT and DistilBERT: A Case Study on CoQA}, author = {Staliūnaitė, Ieva and Iacobacci, Ignacio}, journal = {In Proceedings of the Conference on Empirical Methods in Natural Language Processing}, year = {2020}, }
ICASSP
Auxiliary Capsules for Natural Language Understanding

Ieva Staliūnaitė , and Ignacio Iacobacci

In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2020

Abs Bib PDF

Lately, joint training of Intent detection and Slot filling has become the best-performing approach in the field of Natural Language Understanding (NLU). In this work we extend the newly introduced application of Capsule Networks for NLU to a multi-task learning environment, using relevant auxiliary tasks. Specifically, our models perform joint Intent classification and Slot filling with the aid of Named Entity Recognition (NER) and Part of Speech (POS) tagging tasks. This allows us to exploit the hierarchical relationships between the Intents of the utterances and the different features of input text, not only Slots but also Named Entity mentions, Parts of Speech, quantity indications, etc. The models developed in this work are evaluated on standard benchmarks, achieving state-of-the-art results on the SNIPS dataset while outperforming the best commercial systems on several low-resource datasets.
@article{9053899, title = {Auxiliary Capsules for Natural Language Understanding}, author = {Staliūnaitė, Ieva and Iacobacci, Ignacio}, journal = {In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing}, year = {2020}, }
ACL
Learning Dialog Policies from Weak Demonstrations

Gabriel Gordon-Hall , Philip John Gorinski , and Shay B. Cohen

In Proceedings of the Association for Computational Linguistics - ACL, 2020

Abs arXiv Bib

Deep reinforcement learning is a promising approach to training a dialog manager, but current methods struggle with the large state and action spaces of multi-domain dialog systems. Building upon Deep Q-learning from Demonstrations (DQfD), an algorithm that scores highly in difficult Atari games, we leverage dialog data to guide the agent to successfully respond to a user’s requests. We make progressively fewer assumptions about the data needed, using labeled, reduced-labeled, and even unlabeled data to train expert demonstrators. We introduce Reinforced Fine-tune Learning, an extension to DQfD, enabling us to overcome the domain gap between the datasets and the environment. Experiments in a challenging multi-domain dialog system framework validate our approaches, and get high success rates even when trained on out-of-domain data.
@article{gordonhall2020learningdialogpoliciesweak, title = {Learning Dialog Policies from Weak Demonstrations}, author = {Gordon-Hall, Gabriel and Gorinski, Philip John and Cohen, Shay B.}, journal = {In Proceedings of the Association for Computational Linguistics - ACL}, year = {2020}, }