Our lab

The Cornell Computational Linguistics Lab is a research and educational lab in the Department of Linguistics and in Computing and Information Science. It is a venue for lab sessions for classes, computational dissertation research by graduate students, undergraduate research projects, and grant-funded research.

The lab collaborates with a large group at Cornell, including faculty and students in Cognitive Science, Computer Science, Psychology, and Information Science. The Department of Computing and Information Science provides system administration support for the lab, and some computational work is done on hardware at the Department of Computer Science and the Center for Advanced Computing.

Faculty

Graduate Students

Undergraduate Research Assistants

Lab Alumni

Projects

Students and faculty are currently working on diverse projects in computational phonetics, phonology, syntax, and semantics.

Finite-state phonology

Mats Rooth • Simone Harmath-de Lemos • Shohini Bhattasali • Anna Choi

In this project we train a finite-state model to detect prosodic cues in a speech corpus. We are specifically interested in detecting stress cues in Brazilian Portuguese and Bengali and in evaluating current theoretical views of stress against this empirical evidence.
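
As a rough illustration of the general approach (a minimal sketch, not the lab's actual pipeline), a sequence model over frame-level acoustic features can be fit and then used to label regions of an utterance. Here a two-state Gaussian HMM from the hmmlearn package stands in for the finite-state model, and the per-frame feature extraction (e.g. MFCCs) is assumed to have happened upstream; the resulting state labels would then be compared against hand-annotated stress marks.

    # Minimal sketch: label frames as stressed vs. unstressed with a 2-state Gaussian HMM.
    # Assumes numpy and hmmlearn; per-frame feature extraction happens upstream.
    import numpy as np
    from hmmlearn import hmm

    def train_stress_model(feature_sequences, n_states=2):
        """Fit an HMM over per-utterance feature matrices (frames x feature dims)."""
        X = np.vstack(feature_sequences)                   # stack all utterances
        lengths = [len(seq) for seq in feature_sequences]  # one length per utterance
        model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
        model.fit(X, lengths)
        return model

    def label_frames(model, features):
        """Return the most likely state sequence (Viterbi path) for one utterance."""
        return model.predict(features)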

Xenophobia and dog whistle detection in social media

Kaelyn Lamp • Marten van Schijndel

In this collaboration with the Cornell Xenophobia Meter Project, we study the linguistic properties of social media dog whistles to better identify extremist trends before they gain traction.

Models of code-switching

Marten van Schijndel • Debasmita Bhattacharya • Vinh Nguyen • Andrew Xu

In this work, we study bilingual code-switching (that is, the use of two languages interchangeably within a single utterance). We are particularly interested in how information flows across code-switch boundaries: how information from a span in Language 1 can influence production and comprehension of spans in Language 2. We are also studying which properties influence code-switching and whether switches occur mainly to ease production for the speaker or mainly to ease comprehension for the listener.
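
One common way to quantify this kind of cross-boundary information flow (a sketch under generic assumptions, not necessarily the method used in this project) is to compute per-token surprisal from a language model and compare values just before and just after a switch point. The example below uses the Hugging Face transformers API; the model name is a placeholder, and a multilingual causal language model would be substituted in practice.

    # Sketch: per-token surprisal from a causal LM, e.g. around code-switch points.
    # Assumes the transformers and torch packages; "gpt2" is a stand-in model name.
    import math
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    def token_surprisals(sentence):
        """Return (token, surprisal in bits) pairs for every token after the first."""
        ids = tokenizer(sentence, return_tensors="pt").input_ids
        with torch.no_grad():
            log_probs = torch.log_softmax(model(ids).logits, dim=-1)
        pairs = []
        for i in range(1, ids.shape[1]):
            logp = log_probs[0, i - 1, ids[0, i]].item()   # log P(token_i | preceding tokens)
            pairs.append((tokenizer.decode(int(ids[0, i])), -logp / math.log(2)))
        return pairs

    # e.g. inspect surprisal at the tokens just after a Spanish-English switch point:
    print(token_surprisals("I told her que no venga"))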

Representation sharing in neural networks

Marten van Schijndel • Forrest Davis • Debasmita Bhattacharya • William Timkey

Much work has gone into studying which linguistic surface patterns are captured by neural networks. In this work we ask how various surface patterns are grouped into larger linguistic abstractions within the networks, and how those abstractions interact. Is each instance of a linguistic phenomenon, such as filler-gap dependencies, related to other instances of that phenomenon (i.e., do models encode a filler-gap abstraction), or is each contextual occurrence encoded as a separate phenomenon?
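
One generic way to probe this question (illustrative only, not the lab's specific method) is to train a lightweight classifier on a model's hidden states for one construction type and test whether it transfers to another: good transfer is consistent with a shared abstraction, while poor transfer suggests separately encoded instances. The arrays and labels below are assumed to have been extracted from a network beforehand.

    # Sketch: probe transfer as a proxy for shared abstractions.
    # hidden_states_A / hidden_states_B are (n_examples, hidden_dim) arrays taken from a
    # neural LM for two filler-gap construction types; labels mark gap presence.
    from sklearn.linear_model import LogisticRegression

    def probe_transfer(hidden_states_A, labels_A, hidden_states_B, labels_B):
        """Train a linear probe on construction A and evaluate it on construction B."""
        probe = LogisticRegression(max_iter=1000)
        probe.fit(hidden_states_A, labels_A)
        in_domain = probe.score(hidden_states_A, labels_A)
        transfer = probe.score(hidden_states_B, labels_B)
        return in_domain, transfer   # similar scores suggest a shared encoding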

Summarization as Linguistic Compression

Marten van Schijndel • Fangcong Yin • William Timkey

In this work, we conceptualize summarization as a linguistic compression task. We study how different levels of linguistic information are compressed during summarization and whether automatic summarization models learn similar compression functions. We also study how each aspect of linguistic compression is correlated with various measures of summary quality according to human raters.
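
Two very crude compression measures illustrate the framing (a sketch only; the project's actual measures operate over richer linguistic representations): how much shorter the summary is than its source, and how much of the source vocabulary it retains.

    # Sketch: simple compression measures comparing a source text to its summary.
    def length_compression(source, summary):
        """Ratio of summary length to source length in whitespace tokens."""
        return len(summary.split()) / len(source.split())

    def vocabulary_retention(source, summary):
        """Fraction of the source's word types that survive into the summary."""
        src_types = set(w.lower() for w in source.split())
        sum_types = set(w.lower() for w in summary.split())
        return len(src_types & sum_types) / len(src_types)

    source = "The committee met on Tuesday and voted to approve the new budget after a long debate."
    summary = "The committee approved the new budget."
    print(length_compression(source, summary), vocabulary_retention(source, summary))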

Mitigating bias in automatic speech recognition

Anna (Seo Gyeong) Choi

Automatic speech recognition (ASR) systems have seen remarkable improvements in recent years, but this success does not extend to all speaker groups. Recent research shows that performance disparities exist across different types of speech, including accented, dialectal, and impaired speech, and even cross-linguistically. To ensure that these improvements lead to inclusive speech technologies, we audit commercial speech-to-text services provided by Google, Microsoft, and Amazon, as well as toolkits such as OpenAI and Rev AI, and compare their transcription results across different speech groups. We then explore whether the crucial problem stems from the acoustic model, the language model, or the training data itself, with the goal of achieving algorithmic fairness.
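
A minimal version of such an audit (assuming group labels and transcripts are already in hand; the data below is invented for illustration) compares word error rate across speaker groups, for example with the jiwer package.

    # Sketch: compare word error rate (WER) across speaker groups with jiwer.
    # The record format and group names are illustrative, not the actual audit data.
    from collections import defaultdict
    import jiwer

    def wer_by_group(records):
        """records: iterable of (group, reference_transcript, asr_hypothesis) triples."""
        refs, hyps = defaultdict(list), defaultdict(list)
        for group, ref, hyp in records:
            refs[group].append(ref)
            hyps[group].append(hyp)
        return {g: jiwer.wer(refs[g], hyps[g]) for g in refs}

    records = [
        ("L1_English", "she sells sea shells", "she sells sea shells"),
        ("L2_English", "she sells sea shells", "she sell seashells"),
    ]
    print(wer_by_group(records))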

Fairness in speech recognition through ethical speech data collection

Anna (Seo Gyeong) Choi

Speech datasets are crucial for training speech language technologies (SLT); however, a lack of diversity in the underlying training data can lead to serious limitations in building equitable and robust SLT products, especially along the dimensions of language, accent, dialect, variety, and speech impairment, and in the intersection of speech features with socioeconomic and demographic features. Furthermore, there is often a lack of oversight of the underlying training data, which is commonly built from massive web crawls and/or publicly available speech, with regard to the ethics of such data collection. To encourage standardized documentation of such speech data components, we introduce an augmented datasheet for speech datasets that both practitioners and users can refer to, in a constructive cyclic manner, when building speech datasets.
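
For illustration only, a datasheet of this kind might record fields like the following; these field names are examples, not the contents of the augmented datasheet proposed in this project.

    # Illustrative skeleton of the kinds of fields a speech datasheet might document.
    # The field names are examples, not the proposed augmented datasheet itself.
    speech_datasheet = {
        "motivation": "Why the dataset was created and by whom",
        "speaker_demographics": ["language", "accent/dialect", "age", "gender", "speech impairment"],
        "collection_method": "e.g. web crawl, studio recording, crowd-sourcing",
        "consent_and_licensing": "How speakers consented and under what license",
        "annotation": "Transcription conventions and quality control",
        "intended_uses": "Tasks the data is (and is not) appropriate for",
        "maintenance": "Who updates the datasheet as the dataset evolves",
    }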

Vowel Duration as a predictor of primary stress placement in Brazilian Portuguese

Simone Harmath-de Lemos

I work with speech corpora to investigate how compressed models of the speech signal (specifically MFCCs) can be used to support research in phonetics and phonology. I am currently working on a project that seeks to understand whether vowel duration (as generated by forced aligners) is a better predictor of primary stress placement in Brazilian Portuguese (BP) than spectral and energy features (as represented by MFCCs). The project also aims to compare pretonic, stressed, and posttonic vowels in BP, both syntagmatically and paradigmatically. In parallel, I am using the same method to look at possible spectral, energy, and durational differences between vowels in contexts where differences in syntactic attachment in a pair of words are said to trigger stress shift in BP. In a third thread of the work, I am investigating whether different combinations of bilingual acoustic models have an impact on the forced alignment of a small speech corpus of Bororo (Bororoan, Central Brazil).
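
As a sketch of the feature extraction step (the interval format here is an invented stand-in for forced-aligner output), vowel durations and mean MFCC vectors can be pulled from a recording with librosa.

    # Sketch: pull duration and mean MFCCs for aligned vowel intervals.
    # Uses librosa; intervals are (label, start_s, end_s) triples standing in for
    # the output of a forced aligner.
    import librosa
    import numpy as np

    def vowel_features(wav_path, intervals, n_mfcc=13):
        """Return (label, duration, mean MFCC vector) for each vowel interval."""
        y, sr = librosa.load(wav_path, sr=None)
        rows = []
        for label, start, end in intervals:
            segment = y[int(start * sr):int(end * sr)]
            mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=n_mfcc)
            rows.append((label, end - start, mfcc.mean(axis=1)))
        return rows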

Recent publications (2023, 2022, 2021, and 2020)

Here are some selected publications from recent work by faculty and graduate students:

Recent Courses

If you are interested in computational linguistics, these classes are a great way to get started in this area:

LING 2264: Language, Mind, and Brain

An introduction to neurolinguistics, this course surveys topics such as aphasia, hemispheric lateralization and speech comprehension as they are studied via neuroimaging, intracranial recording and other methods. A key focus is the relationship between these data, linguistic theories, and more general conceptions of the mind. Appropriate for students from any major.

LING 4424: Computational Linguistics I

Computational models of natural languages. Topics are drawn from: tree syntax and context free grammar, finite state generative morpho-phonology, feature structure grammars, logical semantics, tabular parsing, Hidden Markov models, categorial and minimalist grammars, text corpora, information-theoretic sentence processing, discourse relations, and pronominal coreference.

LING 4434: Computational Linguistics II

An in-depth exploration of modern computational linguistic techniques; a continuation of LING 4424: Computational Linguistics I. Whereas LING 4424 covers foundational techniques in symbolic computational modeling, this course will cover a wider range of applications and introduce neural network methods. We will survey neural network techniques that are widely used in computational linguistics and natural language processing, along with techniques for probing the linguistic information and language processing strategies encoded in computational models. We will examine ways of mapping this linguistic information both to linguistic theory and to measures of human processing (e.g., neuroimaging data and human behavioral responses).

LING 4485/6485: Topics in Computational Linguistics

Current topics in computational linguistics. Recent topics include computational models for Optimality Theory and finite state models.

Resources

Access to Cornell's G2 Computing Cluster
More than 870 language corpora in 60+ languages (e.g., news text, dialogue corpora, television transcripts)

Useful Downloads (GitHub and Zenodo repositories)

Kaldi Utilities

Kaldi-alignments-matlab: Read, display, and play Kaldi phone alignments in Matlab

A Truly Cleaned and Filtered Subset of The Pile corpus

The Pudding repo contains code that creates a truly cleaned and filtered subset of the 800GB Pile corpus, parsed into CoNLL-U format.
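
For example, the CoNLL-U files can be streamed with the conllu package (the file name below is a placeholder):

    # Sketch: stream sentences from a CoNLL-U file with the conllu package.
    from conllu import parse_incr

    with open("pile_subset.conllu", encoding="utf-8") as f:
        for sentence in parse_incr(f):
            tokens = [token["form"] for token in sentence]
            pos_tags = [token["upos"] for token in sentence]
            print(list(zip(tokens, pos_tags)))
            break  # just show the first sentence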

Tools for LSTM (long short-term memory) models

LSTM (long short-term memory) toolkit that can estimate incremental processing difficulty

Left-corner parsing toolkit that can estimate incremental processing difficulty

125 pre-trained English LSTM models: this repository contains the 125 LSTM models analyzed in van Schijndel, Mueller, and Linzen (2019), "Quantity doesn't buy quality syntax with neural language models".

Useful links

Cornell Department of Linguistics
Cornell Natural Language Processing group
Cornell Cognitive Science program
Association for Computational Linguistics
Cornell Linguistics Circle