Anca Chereches | Cornell Linguistics

Audio file manipulation

Shell scripts.

Batch convert a list of Sphere files (e.g., from the Switchboard corpus) to mp3s. [script] [sample input]
Uses the sph2pipe tool from the Linguistic Data Consortium and lame.
Batch replace with noise any number of segments of a Sphere-format audio file, at user-provided timestamps. Convert modified audio file to mp3. [script] [sample input]
Uses SoX, lame, and the sph2pipe tool from the Linguistic Data Consortium.

Matlab code for pitch processing.

Function which takes in a wav file and plots Voicebox pitch track (with and without outliers removed) for a combination of three different values, each potentially taking three different parameters. Goal is to fine-tune Voicebox parameters to improve pitch tracking output. [code] [sample output]
Function reads in a data structure and for each element, plots waveform, Praat pitch track, Voicebox pitch track, and segmentation. Meant for comparing Praat pitch tracker output to Voicebox pitch tracker output. [code] [sample output]

Transcripts and annotations for web-harvested speech corpora with EZRA, a web application developed by David Lutz at the computational linguistics lab at Cornell.

Switchboard corpus analysis

[Python code on Github] In this project, I extract all instances of "some" in Switchboard, along with a variety of contextual information. I use lxml.etree for xml traversal.

Romanian

I am hoping this will grow into a list of useful Romanian resources for linguists interested in working on the language. Please email me with suggestions and comments! I'd especially be interested in recordings with transcripts (podcasts, interviews, etc.).

Resources and tools for Romanian NLP from Rada Mihalcea, at University of North Texas. Includes an unbalanced corpus (newspaper, literary magazine, 2 novels in translation; no diacritics except for the literary magazine), ~50 mil words.
European Parliament Proceedings Parallel Corpus, Romanian-English. ~ 10 mil. words.
HMM-based POS-tagger for Romanian.
Tagged newspaper (Adevarul Economic, business news 1998-2005, with diacritics; ~21 mil. words). Tags for morphology and syntax; available through Syddansk Universitet with Corpus Eye, with an online concordancer.
ROMBAC. Balanced corpus of Romanian containing text samples from literature, European legislation, news and medical texts. See Radu Ion, Elena Irimia, Dan Stefanescu, Dan Tufis. ROMBAC: The Romanian Balanced Annotated Corpus. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 21-27 May, 2012.
Romanian Wordnet, a lexical ontology following the Princeton WordNet (PWN) organizational principles. Tufis, D., Barbu Mititelu, V., Stefanescu, D., Ion, R. (2012). The Lexical Ontology for Romanian. In Nicoletta Calzolari and Nancy Ide (eds.): Language Resources and Evaluation, Special Issue on Wordnets, Springer 2012.
Dan Tufis's website. Yes, this is just a list of publications, but it gave me an idea of what tools were available.
Dexonline. The online version of a number of Romanian-Romanian dictionaries, created by volunteers manually typing in definitions. Their database is easily accessible, in XML format. A number of offline clients are available through their website.
MetaShare has a repository of Romanian resources. That's where I found ROMBAC and the Romanian Wordnet, and there are certainly many more interesting things there. MetaShare is part of META, aka Multilingual Europe Technology Alliance.
ConsILR is supposed to be a website of linguistic resources for Romanian, but I don't know if it's still being maintained. I've never managed to get anywhere with it. But they do seem to host a recurring conference on the topic of linguistic resources for Romanian.
SRoL: Voiced Sounds of the Romanian Language. An archive of over 500 distinct recordings, from basic sounds of Romanian to short sentences or phrase segments. See also their list of references on Romanian phonetics. It is rather extensive.
RSS database (Romanian Speech Synthesis). Contains over 3 hours of high-quality recordings of a native Romanian female speaker, originally created for the use of speech synthesis researchers. Registration required for download and the website does not seem to be responsive.
European Language Resources Association has a number of resources on Romanian, including a spoken language corpus. However, most resources are not free of charge.
CLARIN. Common Language Resources and Technology Infrastructure. Do a search for "Romanian". I don't think they host anything, but they have records for a whole number of interesting resources.

And here is a random selection of written resources, mostly books and collections.

Romanian Grammar, Dana Cojocaru, SEELRC 2003. A free, detailed, easily accessible descriptive grammar of Romanian.
Dobrovie-Sorin, Carmen. 1994. The Syntax of Romanian: Comparative Studies in Romance, Mouton de Gruyter, 316 p. (traduction roumaine : Sintaxa limbii romane, Editura Univers, Bucuresti, 2000, 317 p.) A generative approach.
Motapanyane, Virginia (ed). 2000. Comparative studies in Romanian syntax. Oxford: Elsevier. They also have a list of references on Romanian generative syntax.
The Essential Grammar of the Romanian Language. In preparation.
Chitoran, Ioana. 2002. The Phonology of Romanian: A Constraint-Based Approach. Studies in Generative Grammar, 56.

Finally, an ever-growing list of dissertations that I could track down (coming soon).

Stablerian Minimalist Grammars

I wrote two Stablerian Minimalist Grammars in Stabler's MG CKY parser and utilities in SWI Prolog. I was modelling the (in)famous German nested and Dutch crossed dependencies, taking what seemed to me to be the core, implementable idea of Koopman & Szabolcsi's Verbal Complexes. Their analysis is full of remnant movement, so the derivation trees get ugly... fast. But given the fact that my Dutch grammar needed to be able to generate "infinitely long" crossed dependencies, remnant movement was unavoidable. A more coherent summary is in the Research section. As proof that the grammars did work (albeit sluggishly), below are my fantastic derivations with exactly one level of embedding.

German "dp1 dp2 dp3 v2 v1"	Dutch "dp1 dp2 dp3 v1 v2"

Yes, all of that. Sigh... But, it works! This being said, writing these grammars was a little bit like writing code, but it wasn't immediately clear to me initially how I would go about debugging them. Here is what I learned about how I could find out where derivation crashes when the parser does not give me much information. This is very specific to Stablerian MGs and Stabler's Prolog tools for MGs, but here it goes:

Change the start symbol (this one courtesy of Kyle Grove). The parser only tries to go that high, so you'll get an idea of where it's working and where it's starting to break down. Note that this only works if you don't have a ton of recursivity that only ends under the main clause CP -- as in my case, unfortunately.
Good-ol' paper and pencil. Follow a derivation bottom-up using the rules you wrote down and keeping track of concatenated constituents that will move together.
Use the results that the parser DOES give you. More precisely, it gives you the number of chart items. Now you can go back and comment certain rules that you expect to be used lower in the tree. This should stop derivation and reduce the # of chart items. This way you know it wasn't the rule you commented that it got stuck on. Uncomment it and move forward, commenting a rule you expect to be used higher up in the tree. When you see that the # of chart items stops going down you know where the derivation crashed.

Tips about stuff that got me into trouble:

When using recursive stuff, remember you need to have a beginning and an end - so 3 cases.
When using recursive stuff meant to tackle more than one constituent in a single clause, remember the SMC (shortest move constraint). Always remember to consider any constituents left over form a lower clause as well.
Agreement stuff - can't tell you how many times i've been annoyed about licensing not happening or movement not happening. because only a head can agree with its own specifiers or complements or specs/complements below it. if the head itself gets moved, it doesn't seem like it can agree any more (no longer in head position).
Using capitals confuses prolog b/c capitals are reserved for variables. if you want DP instead of dp, you need to write your category as 'DP'.
Make sure you've loaded the right grammar!

Anca Cherecheș ac872 @ cornell . edu Cornell University Department of Linguistics About Research Resources Teaching Service Not linguistics

Audio file manipulation

Switchboard corpus analysis

Romanian

Stablerian Minimalist Grammars

Anca Cherecheș

ac872 @ cornell . edu

Cornell University
Department of Linguistics

About

Research

Resources

Teaching

Service

Not linguistics