Our lab

The Cornell Computational Linguistics Lab is a research and educational lab in the Department of Linguistics and Computing and Information Science. It serves as a venue for class lab sessions, computational dissertation research by graduate students, undergraduate research projects, and grant-funded research.

The lab collaborates with a large group of researchers at Cornell, including faculty and students in Cognitive Science, Computer Science, Psychology, and Information Science. The Department of Computing and Information Science provides system administration support for the lab, and some computational work is carried out on hardware at the Department of Computer Science and the Center for Advanced Computing.

In addition to this website, we encourage you to visit the affiliated Cornell Computational Psycholinguistics Discussions (C.Psyd) website. C.Psyd is a linguistics research group that uses computational models to study the intersection of computational linguistics and psycholinguistics. By modeling human language processing behavior (e.g., reading times), one can identify linguistic features that affect human processing decisions. Relatedly, C.Psyd members use psycholinguistic techniques to study the strategies neural networks use to achieve high accuracy in different language contexts, which gives us insight into when humans might employ similar strategies.

Finally, both the Computational Linguistics Lab and the C.Psyd group are affiliated with the Cornell Natural Language Processing Group.

Faculty

Graduate Students

Undergraduate Research Assistants

Lab Alumni

Projects

Students and faculty are currently working on diverse projects in computational phonetics, phonology, syntax, and semantics.

Finite-state phonology

Mats Rooth • Simone Harmath-de Lemos

In this project we train a finite state model to detect prosodic cues in a speech corpus. We are specifically interested in detecting stress cues in Brazilian Portuguese and Bengali and finding empirical evidence for current theoretical views.
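
The details differ by project, but the core idea of scoring stress-placement hypotheses with a weighted finite-state machine can be sketched briefly. Everything in the Python toy below (the two-state acceptor, the per-syllable cue probabilities) is a hypothetical illustration, not the model trained in this project:

```python
# Toy weighted finite-state acceptor for stress detection.
# A minimal sketch (not the lab's actual model): states track whether a
# stressed syllable has been seen yet, and each arc is weighted by an
# acoustic cue score (e.g., normalized energy/duration) for that syllable.
import math

# Transition table: (state, label) -> next state.
# q0 = no stress seen yet, q1 = exactly one stress seen.  Only q1 is
# accepting, so an accepted path marks exactly one syllable as stressed.
TRANSITIONS = {
    ("q0", "U"): "q0",
    ("q0", "S"): "q1",
    ("q1", "U"): "q1",
}
ACCEPTING = {"q1"}


def path_log_score(labels, cue_scores):
    """Return the log score of a stress labeling, or None if rejected.

    labels      -- e.g. ["U", "S", "U"] (one label per syllable)
    cue_scores  -- P(label | acoustics) for each syllable (hypothetical values)
    """
    state, logp = "q0", 0.0
    for label, p in zip(labels, cue_scores):
        state = TRANSITIONS.get((state, label))
        if state is None:          # illegal move, e.g. a second stress
            return None
        logp += math.log(p)
    return logp if state in ACCEPTING else None


# Score every single-stress hypothesis for a 3-syllable word and keep the best.
syllable_stress_probs = [0.2, 0.7, 0.1]   # made-up cue strengths per syllable
best = None
for i in range(3):
    labels = ["S" if j == i else "U" for j in range(3)]
    scores = [p if l == "S" else 1 - p
              for l, p in zip(labels, syllable_stress_probs)]
    logp = path_log_score(labels, scores)
    if logp is not None and (best is None or logp > best[1]):
        best = (i, logp)

print("predicted stressed syllable:", best[0])   # -> 1 (penultimate here)
```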

Xenophobia and dog whistle detection in social media

Marten van Schijndel

In this collaboration with the Cornell Xenophobia Meter Project, we study the linguistic properties of social media dog whistles to better identify extremist trends before they gain traction.

Models of code-switching

Marten van Schijndel • Ashlyn Winship

In this work, we study bilingual code-switching (that is, the use of two languages interchangeably within a single utterance). We are particularly interested in how information flows across code-switch boundaries: how information from a span in Language 1 can influence the production and comprehension of spans in Language 2 (see the sketch below). We are also studying which properties influence code-switching and whether switches occur mainly to ease production for the speaker or mainly to ease comprehension for the listener.
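
One concrete way to quantify information flow across a code-switch boundary is to compare a language model's surprisal for the post-switch span under different pre-switch contexts. The sketch below does this with a generic causal language model from the Hugging Face transformers library; GPT-2 and the example sentences are placeholders (a multilingual model and real code-switched data would be needed in practice), and this illustrates the general technique rather than the lab's actual pipeline:

```python
# Sketch: how much does the pre-switch (Language 1) context change the
# surprisal of a post-switch (Language 2) span?  GPT-2 is only a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()


def span_surprisal(context, span):
    """Total surprisal (in bits) of `span` given `context` under the LM."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    span_ids = tok(span, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, span_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    # position i predicts the token at position i + 1
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    start = ctx_ids.shape[1] - 1            # prediction of the first span token
    span_lp = logprobs[start:].gather(1, targets[start:, None])
    return -(span_lp.sum() / torch.log(torch.tensor(2.0))).item()


# Hypothetical pair: the same Spanish span after an English vs. a Spanish context.
switch = span_surprisal("I told her that", " mañana vamos al cine")
monoling = span_surprisal("Le dije que", " mañana vamos al cine")
print(f"surprisal after code-switch: {switch:.1f} bits; monolingual: {monoling:.1f} bits")
```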

Representation sharing in neural networks

Marten van Schijndel • Jacob Matthews (Romance Languages) • John Starr

Much work has gone into studying which linguistic surface patterns are captured by neural networks. In this work, we study how various surface patterns are grouped into larger linguistic abstractions within the networks and how those abstractions interact. Is each instance of a linguistic phenomenon, such as filler-gap dependencies, related to other instances of that phenomenon (i.e., models encode a filler-gap abstraction), or is each contextual occurrence encoded as a separate phenomenon?
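
A common first step in this kind of analysis is to probe whether a classifier trained on a model's hidden states for one construction transfers to a held-out construction type: transfer suggests a shared abstraction, while failure suggests separately encoded surface patterns. The sketch below assumes hidden-state vectors have already been extracted and saved; all arrays, names, and labels are hypothetical placeholders:

```python
# Probing sketch: does a classifier trained to detect filler-gap dependencies
# on one construction transfer to another?  All arrays/labels are placeholders
# standing in for hidden states extracted from a neural language model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for saved hidden states (n_sentences x hidden_size) and binary
# labels (1 = contains a gap, 0 = no gap) for two construction types.
X_wh_questions = rng.normal(size=(200, 768))
y_wh_questions = rng.integers(0, 2, size=200)
X_relative_clauses = rng.normal(size=(200, 768))
y_relative_clauses = rng.integers(0, 2, size=200)

# Train the probe on wh-questions only...
probe = LogisticRegression(max_iter=1000).fit(X_wh_questions, y_wh_questions)

# ...then test on relative clauses.  Above-chance transfer would be (weak)
# evidence for a shared filler-gap abstraction rather than construction-
# specific surface encodings.  (With random placeholder data, expect ~0.5.)
print("in-domain accuracy:", probe.score(X_wh_questions, y_wh_questions))
print("transfer accuracy:", probe.score(X_relative_clauses, y_relative_clauses))
```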

Summarization as Linguistic Compression

Marten van Schijndel • Angelina Chen • Anna Lin • Diana Vazquez Palma

In this work, we conceptualize summarization as a linguistic compression task. We study how different levels of linguistic information are compressed during summarization and whether automatic summarization models learn similar compression functions. We also study how each aspect of linguistic compression is correlated with various measures of summary quality according to human raters.
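
As a rough illustration, one very coarse notion of linguistic compression is the fraction of the source's length retained by the summary, which can then be correlated with human quality ratings. The sketch below uses simple whitespace tokenization and made-up texts and ratings; the project's actual measures operate over richer linguistic levels than raw token counts:

```python
# Sketch: a coarse token-level compression ratio for (source, summary) pairs,
# correlated with hypothetical human quality ratings.
from scipy.stats import spearmanr

pairs = [
    # (source text, summary text) -- toy placeholder examples
    ("the committee met on tuesday and voted to approve the new budget "
     "after a long debate", "the committee approved the budget"),
    ("researchers announced that the trial produced promising early results "
     "in a small group of patients", "the trial showed promising results"),
    ("heavy rain caused flooding across the region and several roads were "
     "closed overnight", "flooding closed several roads"),
]
quality_ratings = [4.5, 3.8, 4.1]   # made-up human ratings (1-5 scale)


def compression_ratio(source, summary):
    """Fraction of the source's tokens retained (lower = more compressed)."""
    return len(summary.split()) / len(source.split())


ratios = [compression_ratio(src, summ) for src, summ in pairs]
rho, p = spearmanr(ratios, quality_ratings)
print("compression ratios:", [round(r, 2) for r in ratios])
print(f"Spearman correlation with quality: rho={rho:.2f} (p={p:.2f})")
```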

Vowel duration as a predictor of primary stress placement in Brazilian Portuguese

Simone De Lemos

  • Simone De Lemos (2021). Detecting word-level stress in continuous speech: A case study of Brazilian Portuguese. Journal of Portuguese Linguistics, 20(1), 3. DOI

  • I work with speech corpora to investigate how compressed representations of the speech signal (specifically MFCCs) can be used to support research in phonetics and phonology. I am currently working on a project that asks whether vowel duration (as generated by forced aligners) is a better predictor of primary stress placement in Brazilian Portuguese (BP) than spectral and energy features (as represented by MFCCs); a minimal illustration of these measurements appears below. The project also compares pretonic, stressed, and posttonic vowels in BP, both syntagmatically and paradigmatically. In parallel, I am using the same method to look for possible spectral, energy, and durational differences between vowels in contexts where differences in syntactic attachment in a pair of words are said to trigger stress shift in BP. In a third thread of the work, I am investigating whether different combinations of bilingual acoustic models have an impact on the forced alignment of a small speech corpus of Bororo (Bororoan, Central Brazil).
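
As a minimal illustration of the measurements involved, the sketch below pulls a vowel's duration from a forced-aligner segmentation and its mean MFCC vector using librosa. The audio file name, time stamps, phone labels, and the simple list format for the alignment are all hypothetical placeholders, not the project's actual data or pipeline:

```python
# Sketch: duration and mean MFCCs for aligned vowel intervals.
import librosa
import numpy as np

# (phone, start_sec, end_sec) triples as they might come out of a forced aligner
alignment = [("k", 0.10, 0.16), ("a1", 0.16, 0.29), ("z", 0.29, 0.35),
             ("a0", 0.35, 0.42)]   # "a1" = stressed vowel, "a0" = posttonic

y, sr = librosa.load("casa_token.wav", sr=16000)          # hypothetical file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            hop_length=160)               # 10 ms frames


def vowel_features(phone_label):
    """Duration and mean MFCC vector for the first interval with this label."""
    start, end = next((s, e) for p, s, e in alignment if p == phone_label)
    frames = mfcc[:, int(start * sr / 160):int(end * sr / 160)]
    return end - start, frames.mean(axis=1)


dur_stressed, mfcc_stressed = vowel_features("a1")
dur_posttonic, mfcc_posttonic = vowel_features("a0")
print(f"stressed vowel: {dur_stressed*1000:.0f} ms, posttonic: {dur_posttonic*1000:.0f} ms")
print("spectral distance:", np.linalg.norm(mfcc_stressed - mfcc_posttonic))
```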

Recent publications & dissertations (2024, 2023, 2022, 2021, 2020)

Here are some selected publications from recent work by faculty and graduate students:

Recent Courses

If you are interested in computational linguistics, these classes are a great way to get started in this area:

LING 1170 (Fall, Summer): Introduction to Cognitive Science

This course provides an introduction to the science of the mind. Everyone knows what it's like to think and perceive, but this subjective experience provides little insight into how minds emerge from physical entities like brains. To address this issue, cognitive science integrates work from at least five disciplines: Psychology, Neuroscience, Computer Science, Linguistics, and Philosophy. This course introduces students to the insights these disciplines offer into the workings of the mind by exploring visual perception, attention, memory, learning, problem solving, language, and consciousness.

LING 3344 (Spring): Superlinguistics: Comics, Signs and Other Sequential Images

Superlinguistics is a subfield of linguistics that applies techniques used for analyzing natural language to non-linguistic materials. This course uses linguistic tools from semantics, pragmatics, and syntax to study sequential images found in comics, films, and children's books. We will also study multimedia, gestures, and static images such as instruction signs, emoji, and paintings. Linguistic topics include anaphora, implicature, tense and aspect, attitudes and embedding, indirect discourse, and dynamic semantics. We introduce linguistic accounts of each of these topics and apply them to pictorial data.

LING 4424/6424 (Spring): Computational Linguistics I

Computational models of natural languages. Topics are drawn from: tree syntax and context-free grammar, finite-state generative morpho-phonology, feature structure grammars, logical semantics, tabular parsing, hidden Markov models, categorial and minimalist grammars, text corpora, information-theoretic sentence processing, discourse relations, and pronominal coreference.

LING 4434/6634 (Fall): Computational Linguistics II

An in-depth exploration of modern computational linguistic techniques and a continuation of LING 4424 (Computational Linguistics I). Whereas LING 4424 covers foundational techniques in symbolic computational modeling, this course covers a wider range of applications as well as neural network methods. We will survey a range of neural network techniques that are widely used in computational linguistics and natural language processing, along with techniques that can be used to probe the linguistic information and language processing strategies encoded in computational models. We will examine ways of mapping this linguistic information both to linguistic theory and to measures of human processing (e.g., neuroimaging data and human behavioral responses).

LING 4474/6674 (Spring): Natural Language Processing

This course constitutes an introduction to natural language processing (NLP), the goal of which is to enable computers to use human languages as input, output, or both. NLP is at the heart of many of today's most exciting technological achievements, including machine translation, question answering, and automatic conversational assistants. The course will introduce core problems and methodologies in NLP, including machine learning, problem design, and evaluation methods.

LING 4485/6485: Topics in Computational Linguistics

Current topics in computational linguistics.

LING 6693 (Fall, Spring): Computational Psycholinguistics Discussion

This seminar provides a venue for feedback on research projects, invited speakers, and paper discussions within the area of computational psycholinguistics.

LING 7710 (Fall): Computational Seminar

Addresses current theoretical and empirical issues in computational linguistics.

Resources

Access to Cornell's G2 Computing Cluster
More than 900 language corpora in 60+ languages (e.g., news text, dialogue corpora, television transcripts, etc.)

Useful Downloads (GitHub and Zenodo repositories)

Kaldi Utilities

Kaldi-alignments-matlab: Read, display, and play Kaldi phone alignments in MATLAB

A Truly Cleaned and Filtered Subset of The Pile corpus

The Pudding repo contains code that creates a truly cleaned and filtered subset of the 800GB Pile corpus, parsed into CONLL-U format.
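
For orientation, CONLL-U files of the kind the repo produces can be streamed with the conllu Python package; the file name below is a hypothetical placeholder:

```python
# Sketch: stream sentences from a CONLL-U file (e.g., one produced by the
# Pudding pipeline) with the `conllu` package.  The file name is a placeholder.
from conllu import parse_incr

with open("pile_subset.conllu", encoding="utf-8") as f:
    for sentence in parse_incr(f):
        # each token is a dict with CONLL-U fields like "form", "upos", "head"
        words = [tok["form"] for tok in sentence]
        nouns = [tok["form"] for tok in sentence if tok["upos"] == "NOUN"]
        print(" ".join(words))
        print("  nouns:", nouns)
        break   # just show the first sentence
```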

Tools for LSTM (long short-term memory) models

LSTM toolkit that can estimate incremental processing difficulty

Left-corner parsing toolkit that can estimate incremental processing difficulty

125 pre-trained English LSTM models: this repository contains the 125 LSTM models analyzed in van Schijndel, Mueller, and Linzen (2019), "Quantity doesn't buy quality syntax with neural language models"
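
The released toolkits provide their own interfaces, but the quantity they estimate, per-word surprisal from an LSTM language model, can be sketched generically in PyTorch. The tiny untrained model, vocabulary, and sentence below are stand-ins, not the toolkits' actual APIs:

```python
# Generic sketch of per-word surprisal from an LSTM language model in PyTorch.
# The untrained toy model and vocabulary are placeholders; the toolkits above
# provide trained models and their own command-line interfaces.
import math
import torch
import torch.nn as nn

vocab = {"<s>": 0, "the": 1, "dog": 2, "chased": 3, "cat": 4}


class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, emb=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, ids):
        h, _ = self.lstm(self.embed(ids))
        return self.out(h)          # logits for the *next* word at each step


model = LSTMLanguageModel(len(vocab))
model.eval()

sentence = ["<s>", "the", "dog", "chased", "the", "cat"]
ids = torch.tensor([[vocab[w] for w in sentence]])

with torch.no_grad():
    logits = model(ids)

# surprisal(w_t) = -log2 P(w_t | w_1..w_{t-1}); position t is predicted at t-1
logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
for t, word in enumerate(sentence[1:], start=1):
    surprisal = -logprobs[t - 1, vocab[word]].item() / math.log(2)
    print(f"{word:>7s}  {surprisal:5.2f} bits")   # untrained -> near log2(|V|)
```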



Useful links

Cornell Department of Linguistics
Cornell Natural Language Processing group
Cornell Cognitive Science program
Association for Computational Linguistics
Cornell Linguistics Circle