Students and faculty are currently working on diverse projects in computational phonetics, phonology, syntax, and semantics.
Mats Rooth • Simone Harmath-de Lemos • Shohini Bhattasali • Anna Choi
In this project we train a finite state model to detect prosodic cues in a speech corpus. We are specifically interested in detecting stress cues in Brazilian Portuguese and Bengali and finding empirical evidence for current theoretical views.
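As a toy illustration of the finite-state approach, the sketch below tags syllables as stressed (S) or unstressed (U) with a two-state probabilistic finite-state (HMM-style) model over quantized acoustic cues. All symbols and probabilities here are invented for illustration; they are not the project's trained model.

```python
# Toy two-state finite-state tagger: label syllables stressed (S) or
# unstressed (U) from quantized acoustic cues via Viterbi decoding.
# All probabilities are illustrative, not estimated from a corpus.
import math

STATES = ["S", "U"]
START = {"S": 0.5, "U": 0.5}
# P(next state | state): adjacent stresses (stress clash) are dispreferred.
TRANS = {"S": {"S": 0.1, "U": 0.9}, "U": {"S": 0.5, "U": 0.5}}
# P(cue | state): "long"/"short" stand in for quantized duration/energy cues.
EMIT = {"S": {"long": 0.8, "short": 0.2}, "U": {"long": 0.3, "short": 0.7}}

def viterbi(obs):
    """Return the most likely stress-label sequence for a cue sequence."""
    v = [{s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}]
    back = []
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: v[-1][p] + math.log(TRANS[p][s]))
            col[s] = v[-1][best] + math.log(TRANS[best][s]) + math.log(EMIT[s][o])
            ptr[s] = best
        v.append(col)
        back.append(ptr)
    state = max(STATES, key=lambda s: v[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

print(viterbi(["long", "short", "short", "long"]))  # ['S', 'U', 'U', 'S']
```

Real stress-detection models would estimate these parameters from aligned speech data rather than fix them by hand.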
Xenophobia and dog whistle detection in social media
Kaelyn Lamp • Marten van Schijndel
In this collaboration with the Cornell Xenophobia Meter Project, we study the linguistic properties of social media dog whistles to better identify extremist trends before they gain traction.
Models of code-switching
Marten van Schijndel • Debasmita Bhattacharya • Vinh Nguyen • Andrew Xu
In this work, we study bilingual code-switching (that is, where two languages are used interchangeably in a single utterance). We are particularly interested in how information flows across code-switch boundaries: how information from a span in Language 1 can influence production and comprehension of spans in Language 2. We are also studying which properties influence code-switching, and whether switches mainly ease production for the speaker or mainly ease comprehension for the listener.
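As a minimal illustration of quantifying information at a switch point, the sketch below computes bigram surprisal at a code-switch boundary. The miniature English/Spanish corpus and the unsmoothed bigram model are invented stand-ins for the large corpora and neural language models actually used in this line of work.

```python
# Toy sketch: token-level bigram surprisal around a code-switch point.
# The miniature "corpus" is invented for illustration only.
import math
from collections import Counter

corpus = [
    "i want the book".split(),
    "i want the pen".split(),
    "i want el libro".split(),  # code-switch into Spanish after "want"
]
bigrams = Counter((u, v) for sent in corpus for u, v in zip(sent, sent[1:]))
unigrams = Counter(w for sent in corpus for w in sent[:-1])

def surprisal(prev, word):
    """-log2 P(word | prev) under an unsmoothed bigram model."""
    return -math.log2(bigrams[(prev, word)] / unigrams[prev])

# Within-language continuation vs. a code-switch continuation:
print(surprisal("want", "the"))  # ~0.585 bits: English continues
print(surprisal("want", "el"))   # ~1.585 bits: switch into Spanish
```

In this toy model the switch point carries more information (higher surprisal) than the within-language continuation, which is the kind of asymmetry the project measures at scale.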
Representation sharing in neural networks
Marten van Schijndel • Forrest Davis • Debasmita Bhattacharya • William Timkey
Much work has gone into studying which linguistic surface patterns are captured by neural networks. In this work, we study how various surface patterns are grouped into larger linguistic abstractions within the networks, and how those abstractions interact. Is each instance of a linguistic phenomenon, like filler-gap, related to other instances of that phenomenon (i.e., do models encode a filler-gap abstraction), or is each contextual occurrence encoded as a separate phenomenon?
Summarization as Linguistic Compression
Marten van Schijndel • Fangcong Yin • William Timkey
In this work, we conceptualize summarization as a linguistic compression task. We study how different levels of linguistic information are compressed during summarization and whether automatic summarization models learn similar compression functions. We also study how each aspect of linguistic compression is correlated with various measures of summary quality according to human raters.
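As a schematic version of treating summarization as compression, the sketch below measures how much a summary compresses its source at two levels: raw token count and lexical diversity (type-token ratio). Both texts are invented examples, not project data.

```python
# Toy sketch: summary-as-compression statistics over invented texts.
def compression_stats(source, summary):
    src, summ = source.lower().split(), summary.lower().split()
    return {
        "token_ratio": len(summ) / len(src),     # < 1 means the summary is shorter
        "ttr_source": len(set(src)) / len(src),  # lexical diversity of the source
        "ttr_summary": len(set(summ)) / len(summ),
    }

source = ("the committee met on monday and the committee voted to approve "
          "the budget after a long debate about the budget")
summary = "the committee approved the budget on monday"
print(compression_stats(source, summary))
```

Real studies would extend this with syntactic and discourse-level measures and correlate each with human quality ratings.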
Mitigating bias in automatic speech recognition
Anna (Seo Gyeong) Choi
Automatic speech recognition (ASR) systems have seen remarkable improvements in recent years, but this success does not extend to all speaker groups. Recent research shows that performance disparities exist across different types of speech, including accented, dialectal, and impaired speech, and even cross-linguistically. To ensure these improvements yield inclusive speech technologies, we audit commercial speech-to-text services provided by Google, Microsoft, and Amazon, as well as toolkits from OpenAI and Rev AI, and compare their transcription results across speech groups. We then explore whether the crucial problem stems from the acoustic model, the language model, or the training data itself, with the aim of achieving algorithmic fairness.
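The core measurement in such an audit is word error rate (WER). The sketch below computes WER from scratch and compares it across two hypothetical speaker groups; the reference/hypothesis transcript pairs are invented, not drawn from the audited services.

```python
# Toy sketch: auditing ASR output by word error rate (WER) per speaker
# group. A real audit compares system transcripts to human references.
def wer(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(ref)

# One (reference, hypothesis) pair per hypothetical speaker group:
samples = {
    "group_a": ("she is going to the store", "she is going to the store"),
    "group_b": ("she is going to the store", "she going to this store"),
}
for group, (ref, hyp) in samples.items():
    print(group, round(wer(ref, hyp), 3))
```

A disparity like the one above (0.0 vs. 0.333), aggregated over many utterances and systems, is the signal the audit looks for.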
Fairness in speech recognition through ethical speech data collection
Anna (Seo Gyeong) Choi
Speech datasets are crucial for training Speech Language Technologies (SLT); however, the lack of diversity of the underlying training data can lead to serious limitations in building equitable and robust SLT products, especially along dimensions of language, accent, dialect, variety, and speech impairment - and the intersectionality of speech features with socioeconomic and demographic features. Furthermore, there is often a lack of oversight on the underlying training data - commonly built on massive web-crawling and/or publicly available speech - with regard to the ethics of such data collection. To encourage standardized documentation of such speech data components, we introduce an augmented datasheet for speech datasets that both practitioners and users can, in a constructive cyclic manner, refer to when building speech datasets.
Vowel Duration as a predictor of primary stress placement in Brazilian Portuguese
Simone Harmath-de Lemos
I work with speech corpora to investigate how compressed models of the speech signal (specifically MFCCs) can be used to support research in phonetics and phonology. I am currently working on a project that seeks to understand whether vowel duration (as generated by forced aligners) is a better predictor of primary stress placement in Brazilian Portuguese (BP) than spectral and energy features (as represented by MFCCs). This project also aims to further compare pretonic, stressed, and posttonic vowels in BP, both syntagmatically and paradigmatically. In parallel, I am using the same method to look at possible spectral, energy, and durational differences between vowels in contexts where differences in syntactic attachment in a pair of words are said to trigger stress shift in BP. In a third thread of the work, I am investigating whether different combinations of bilingual acoustic models have an impact on the forced alignment of a small speech corpus of Bororo (Bororoan, Central Brazil).
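As a schematic version of the duration analysis, the sketch below fits a single duration threshold that best separates stressed from unstressed vowels in toy measurements. The durations (in seconds) are invented; a real study would read them from forced-aligner output over the BP corpus.

```python
# Toy sketch: vowel duration as a stress predictor via a single
# best-separating threshold. Durations are invented for illustration.
durations = [  # (vowel duration in seconds, is_stressed)
    (0.11, True), (0.09, True), (0.12, True), (0.06, False),
    (0.05, False), (0.08, False), (0.10, True), (0.07, False),
]

def best_threshold(data):
    """Pick the duration cutoff that best classifies vowels as stressed."""
    best = (0.0, 0.0)  # (accuracy, threshold)
    for t in sorted(d for d, _ in data):
        acc = sum((d >= t) == stressed for d, stressed in data) / len(data)
        best = max(best, (acc, t))
    return best

acc, thresh = best_threshold(durations)
print(acc, thresh)  # 1.0 0.09 on this toy data
```

The project's actual comparison pits classifiers built on duration against classifiers built on MFCC spectral and energy features, rather than a single cutoff.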
Dorit Abusch and Mats Rooth. (2023), Parallel and differential contributions from language and image in the discourse representation of picturebooks
To appear in Proceedings of Sinn und Bedeutung 27 (preprint)
Dorit Abusch and Mats Rooth. (2022), Pictorial free perception
Linguistics and Philosophy, 1-52, 5 December 2022
Dorit Abusch and Mats Rooth. (2022), Temporal and intensional pictorial conflation
Proceedings of Sinn und Bedeutung 26, published 2022-12-22
Sidharth Ranjan, Marten van Schijndel, Sumeet Agarwal, and Rajakrishnan Rajkumar. (2022), Dual Mechanism Priming Effects in Hindi Word Order
Proceedings of The 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (AACL-IJCNLP 2022). arXiv:2210.13938
Sidharth Ranjan, Marten van Schijndel, Sumeet Agarwal, and Rajakrishnan Rajkumar. (2022), Discourse Context Predictability Effects in Hindi Word Order
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022). arXiv:2210.13940
Dorit Abusch and Mats Rooth. (2021), Modalized normality in pictorial narratives
Proceedings of Sinn und Bedeutung 25, published 2021-09-17
Mats Rooth. (2021), On Lewis's "Adverbs of Quantification"
A Reader's Guide to Classic Papers in Formal Semantics, pp. 295-310, first online 14 November 2021
Eric Campbell and Mats Rooth. (2021), Epistemic semantics in guarded string models
Proceedings of the Society for Computation in Linguistics (SCiL). 2021
Marten van Schijndel and Tal Linzen. (2021), Single-stage prediction models do not explain the magnitude of syntactic disambiguation difficulty
Cognitive Science, 45(6): e12988, 2021
Forrest Davis and Gerry T.M. Altmann. (2021), Finding Event Structure in Time: What Recurrent Neural Networks can tell us about Event Structure in Mind
Cognition, 213: 104651, 2021
Simone Harmath-de Lemos. (2021), Detecting word-level stress in continuous speech: A case study of Brazilian Portuguese
Journal of Portuguese Linguistics 20.1, 2021
William Timkey and Marten van Schijndel. (2021), All Bark and No Bite: Rogue Dimensions in Transformer Language Models Obscure Representational Quality
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2021
Forrest Davis and Marten van Schijndel. (2021), Uncovering Constraint-Based Behavior in Neural Models via Targeted Fine-Tuning
In Proceedings of the 2021 Annual Conference of the Association for Computational Linguistics (ACL). 2021
Matt Wilber, William Timkey, and Marten van Schijndel. (2021), To Point or Not to Point: Understanding How Abstractive Summarizers Paraphrase Text
Findings of the ACL. 2021
Samuel Ryb and Marten van Schijndel. (2021), Analytical, Symbolic and First-Order Reasoning within Neural Architectures
Proceedings of the 2021 Workshop on Computing Semantics with Types, Frames and Related Structures. 2021.
Mats Rooth. (2020), Is there reference to questions in the grammar of focus?
SALT 30, Cornell University, August 17-20, 2020
Cory Shain, Idan Blank, Marten van Schijndel, William Schuler, and Evelina Fedorenko. (2020) fMRI reveals language-specific predictive coding during naturalistic sentence comprehension.
Neuropsychologia, 138:107307. 2020
Forrest Davis and Marten van Schijndel. (2020) Recurrent neural network language models always learn English-like relative clause attachment.
Proceedings of the 2020 Annual Conference of the Association for Computational Linguistics (ACL). 2020.
Forrest Davis and Marten van Schijndel. (2020) Discourse structure interacts with reference but not syntax in neural language models.
24th Conference on Computational Natural Language Learning (CoNLL). 2020.
Debasmita Bhattacharya and Marten van Schijndel. (2020) Filler-gaps that neural networks fail to generalize.
24th Conference on Computational Natural Language Learning (CoNLL). 2020.
Forrest Davis and Marten van Schijndel. (2020) Interaction with context during recurrent neural network sentence processing.
Proceedings of the 42nd Annual Virtual Meeting of the Cognitive Science Society (CogSci). 2020.
Publications prior to 2020
Dorit Abusch and Mats Rooth. (2019) Indexing Across Media.
Proceedings of the 22nd Amsterdam Colloquium. ILLC, University of Amsterdam.
Marten van Schijndel, Aaron Mueller, and Tal Linzen. (2019) Quantity doesn't buy quality syntax with neural language models.
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.
Grusha Prasad, Marten van Schijndel, and Tal Linzen. (2019) Using Priming to Uncover the Organization of Syntactic Representations in Neural Language Models.
Proceedings of the 2019 Conference on Computational Natural Language Learning (CoNLL). 2019.
Forrest Davis and Abby Cohn. (2019) Effects of lexical frequency and compositionality on phonological reduction in English compounds.
25th Architectures and Mechanisms of Language Processing conference (AMLaP 2019)
Jacob Collard. (2018) Finite State Reasoning for Presupposition Satisfaction.
Proceedings of the First International Workshop on Language Cognition and Computational Models (COLING 2018)
Shohini Bhattasali, Murielle Fabre, John Hale. (2018) Processing MWEs: Neurocognitive Bases of Verbal MWEs and Lexical Cohesiveness within MWEs.
Proceedings of the 14th Workshop on Multiword Expressions (COLING 2018)
Simone Harmath-de Lemos. (2018) What Automatic Speech Recognition Can Tell Us About Stress and Stress Shift in Continuous Speech.
Proceedings of the 9th International Conference on Speech Prosody 2018
Jixing Li, Murielle Fabre, Wen-Ming Luh, John Hale. (2018) Modeling Brain Activity Associated with Pronoun Resolution in English and Chinese.
Proceedings of NAACL Workshop on Computational Models of Reference, Anaphora, and Coreference (CRAC 2018)
Jacob Collard. (2018) A Naturalistic Inference Learning Algorithm.
Linguistic Society of America (LSA 2018)
Shohini Bhattasali, John Hale, Christophe Pallier, Jonathan R. Brennan, Wen-Ming Luh, R. Nathan Spreng. (2018) Differentiating Phrase Structure Parsing and Memory Retrieval in the Brain.
Proceedings of the Society for Computation in Linguistics (SCiL 2018)
Mats Rooth. (2017) Finite-state intensional semantics.
12th International Conference on Computational Semantics (IWCS 2017)
Matthew Nelson, Imen El Karoui, Kristof Giber, Xiaofang Yang, Laurent Cohen, Hilda Koopman, Sydney S. Cash, Lionel Naccache, John Hale, Christophe Pallier, Stanislas Dehaene. (2017) Neurophysiological dynamics of phrase-structure building during sentence processing.
Proceedings of the National Academy of Sciences
Matthew Nelson, Stanislas Dehaene, Christophe Pallier, and John Hale. (2017). Entropy Reduction correlates with Temporal Lobe Activity.
Proceedings of the 7th Workshop on Cognitive Modelling and Computational Linguistics (CMCL 2017)
Jacob Collard. (2016) Inferring Necessary Categories in CCG.
9th International Conference on the Logical Aspects of Computational Linguistics (LACL 2016)
Jonathan Howell, Mats Rooth, and Michael Wagner. (2016). Acoustic classification of focus: on the web and in the lab. hdl:1813/42538
Jonathan R. Brennan, Edward P. Stabler, Sarah E. Van Wagenen, Wen-Ming Luh, and John T. Hale. (2016) Abstract linguistic structure correlates with temporal activity during naturalistic comprehension. Brain and Language 157, 81-94.
John Hale. (2016). Information-theoretical complexity metrics.
Language and Linguistics Compass
Jixing Li, Jonathan Brennan, Adam Mahar, and John Hale. (2016). Temporal lobes as combinatory engines for both form and meaning.
Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC 2016)
John T. Hale, David E. Lutz, Wen-Ming Luh, and Jonathan R. Brennan. (2015). Modeling fMRI time courses with linguistic structure at various grain sizes.
Proceedings of CMCL 2015
Shohini Bhattasali, Jeremy Cytryn, Elana Feldman, and Joonsuk Park. (2015). Automatic identification of rhetorical questions.
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015)
LING 2264: Language, Mind, and Brain
An introduction to neurolinguistics, this course surveys topics such as aphasia, hemispheric lateralization and speech comprehension as they are studied via neuroimaging, intracranial recording and other methods. A key focus is the relationship between these data, linguistic theories, and more general conceptions of the mind. Appropriate for students from any major.
LING 4424: Computational Linguistics I
Computational models of natural languages. Topics are drawn from: tree syntax and context free grammar, finite state generative morpho-phonology, feature structure grammars, logical semantics, tabular parsing, Hidden Markov models, categorial and minimalist grammars, text corpora, information-theoretic sentence processing, discourse relations, and pronominal coreference.
LING 4434: Computational Linguistics II
An in-depth exploration of modern computational linguistic techniques, continuing LING 4424: Computational Linguistics I. Whereas LING 4424 covers foundational techniques in symbolic computational modeling, this course covers a wider range of applications as well as neural network methods. We will survey a range of neural network techniques that are widely used in computational linguistics and natural language processing, along with techniques for probing the linguistic information and language processing strategies encoded in computational models. We will examine ways of mapping this linguistic information both to linguistic theory and to measures of human processing (e.g., neuroimaging data and human behavioral responses).
LING 4485/6485: Topics in Computational Linguistics
Current topics in computational linguistics. Recent topics include computational models for Optimality Theory and finite state models.