Fall 2024 | The Boulder Natural Language Processing Group | University of Colorado Boulder

Date	Event	Speaker	Abstract/Details
08/28/2024	Planning, introductions, welcome!
09/04/2024	Brunch Social
09/11/2024	Watch and discuss NLP keynote "Are LLMs Narrowing our Horizon? Let's Embrace Variation in NLP!"	Barbara Plank
09/18/2024	CLASIC presentations
09/25/2024	Invited talks/discussions from Leeds and Anschutz folks	Liu Liu, Abe Handler, Yanjun Gao
10/02/2024	Testing GPT4's interpretation of the Caused-Motion Construction	Martha Palmer, Annie Zaenen, Susan Brown, Alexis Cooper	The fields of Artificial Intelligence and Natural Language Processing have been revolutionized by the advent of Large Language Models such as GPT4. They are perceived as being language experts and there is a lot of speculation about how intelligent they are, with claims being made about “Sparks of General Artificial Intelligence.” This talk will describe in detail an English linguistic construction, the Caused Motion Construction, and compare prior interpretation approaches with current LLM interpretations. The prior approaches are based on VerbNet. It’s unique contributions to prior approaches will be outlined. Then the results of a recent preliminary study probing GPT4’s analysis of the same constructions will be presented. Not surprisingly, this analysis illustrates both strengths and weaknesses of GPT4’s ability to interpret Caused Motion Constructions and to generalize this interpretation.
10/09/2024	NAACL Paper Clinic: Come get feedback on your submission drafts!
10/16/2024	Senior Thesis Proposals	Alexandra Barry	Benchmarking LLM Handling of Cross-Dialectal Spanish: This proposal introduces current issues and gaps in crossdialectal NLP in Spanish as well as the lack of resources available for Latin American dialects. The presentation will cover past work in dialect detection, translation, and benchmarking in order to build a foundation for a proposal that aims to create a benchmark that analyses LLM robustness across a series of tasks in different Spanish dialects
10/16/2024	Senior Thesis Proposals	Tavin Turner	Agreeing to Disagree: Statutory Relational Stance Modeling: Policy division deeply affects which bills get passed in legislature, and how. So far, statutory NLP has predicted voting breakdowns, interpreted stakeholder benefit, informed legal decision support systems, and much more. In practice, legislation demands compromise and concession to pass important policy, yet models often struggle to reason over the whole act. Leveraging neuro-symbolic models, we seek to intermediate this challenge with relational structures of statutes’ sectional stances – modeling stance agreement, exception, etc. Beyond supporting downstream statutory analysis tasks, these structures could help stakeholders understand how a bill impacts them, litmus the cooperation within a legislature, and reveal patterns of compromise that aid a bill through ratification.
10/23/2024	Reliable Language Technology for Classroom Dialog Understanding	Ananya Ganesh (PhD Dissertation Proposal)	In this proposal, I will lay out how NLP models can be developed to address realistic use cases in analyzing classroom dialogue. Towards this goal, I will first introduce a new task and corresponding dataset, focused on detecting off-task utterances in small-group discussions. I will then propose a method to solve this task that considers how the inherent structure in the dialog can be used to learn richer representations of the dialog context. Next, I will introduce preliminary work on applying LLMs in the in-context learning setting for a broad range of tasks pertaining to qualitative coding of classroom dialog, and discuss potential follow-up work. Finally, keeping in mind our goals of serving many independent stakeholders, I will propose a study to incorporate differing stake-holder’s subjective judgments while curating gold-standard data for classroom discourse analysis.
10/30/2024	Adapting AMR Metrics to UMR Graphs	Marie McGregor (Area Exam)	Uniform Meaning Representation (UMR) expands on the capabilities of Abstract Meaning Representation (AMR) by supporting document-level annotation, suitability for low-resource languages, and support for logical inference. As a framework for any sort of representation is developed, a way to measure the similarities or differences between two representations must be developed in tandem to support the creation of parsers and for computing inner-annotator agreement (IAA). Fortunately, there exists robust research into metrics to assess the similarity of AMR graphs. The usefulness of these metrics to UMRs depends on four key aspects: scalability, correctness, interpretability, and cross-lingual suitability. This paper investigates the applicability of AMR metrics to UMR graphs along these aspects in order to create useful and reliable UMR metrics.
11/06/2024	Short presentations / discussions	Curry Guinn, Yifu Wu, Kevin Stowe
11/13/2024	SETLEXSEM CHALLENGE: Using Set Operations to Evaluate the Lexical and Semantic Robustness of Language Models	Nick Dronen	Set theory is foundational to mathematics and, when sets are finite, to reasoning about the world. An intelligent system should perform set operations consistently, regardless of superficial variations in the operands. Initially designed for semantically-oriented NLP tasks, large language models (LLMs) are now being evaluated on algorithmic tasks. Because sets are comprised of arbitrary symbols (e.g. numbers, words), they provide an opportunity to test, systematically, the invariance of LLMs’ algorithmic abilities under simple lexical or semantic variations. To this end, we present the SETLEXSEM CHALLENGE, a synthetic benchmark that evaluates the performance of LLMs on set operations. SETLEXSEM assesses the robustness of LLMs’ instruction-following abilities under various conditions, focusing on the set operations and the nature and construction of the set members. Evaluating seven LLMs with SETLEXSEM, we find that they exhibit poor robustness to variation in both operation and operands. We show – via the framework’s systematic sampling of set members along lexical and semantic dimensions – that LLMs are not only not robust to variation along these dimensions but demonstrate unique failure modes in particular, easy-to-create semantic groupings of "deceptive" sets. We find that rigorously measuring language model robustness to variation in frequency and length is challenging and present an analysis that measures them independently.
11/20/2024	Extending Benchmarks and Multilingual Models to Truly Low-Resource Languages	Abteen Ebrahimi (PhD Dissertation Proposal)	Driven by successes in large-scale data collection and training efforts, the field of natural language processing (NLP) has seen a dramatic surge in model performance. However, the vast majority of the roughly 7,000 languages spoken across the globe do not have the necessary amounts of easily available text resources and have not been able to share in these advancements. In this proposal, we focus on how best to improve pretrained model performance for these languages, which we refer to as truly low-resource. First, we discuss model adaptation techniques which leverage unlabeled data and discuss experiments which evaluate these approaches in a realistic setting. Next, we address a limitation of prior work, and describe two data collection efforts for lowresource languages. We further present a synthetic evaluation resource which tests a model's understanding of specific linguistic phenomenon: lexical gaps. Finally, we propose additional analysis experiments we aim to address disagreements across prior work, and extend these experiments to include low-resource languages.
12/04/2024	PhD Area Exam	Enora Rice