CoLA: The Corpus of Linguistic Acceptability (with added annotations)

Warstadt, Alex; Singh, Amanpreet; Bowman, Samuel R.

Full metadata record

DC Field	Value	Language
dc.contributor.author	Warstadt, Alex	-
dc.contributor.author	Singh, Amanpreet	-
dc.contributor.author	Bowman, Samuel R.	-
dc.date.accessioned	2019-10-10T16:47:40Z	-
dc.date.available	2019-10-10T16:47:40Z	-
dc.date.issued	2019	-
dc.identifier.uri	http://hdl.handle.net/2451/60441	-
dc.description.abstract	[Primary paper:] This paper investigates the ability of artificial neural networks to judge the grammatical acceptability of a sentence, with the goal of testing their linguistic competence. We introduce the Corpus of Linguistic Acceptability (CoLA), a set of 10,657 English sentences labeled as grammatical or ungrammatical from published linguistics literature. As baselines, we train several recurrent neural network models on acceptability classification, and find that our models outperform unsupervised models by Lau et al. (2016) on CoLA. Error-analysis on specific grammatical phenomena reveals that both Lau et al.’s models and ours learn systematic generalizations like subject-verb-object order. However, all models we test perform far below human level on a wide range of grammatical constructions.	en
dc.description.sponsorship	This project has benefited from help and feedback at various stages from Chris Barker, Pablo Gonzalez, Shalom Lappin, Omer Levy, Marie-Catherine de Marneffe, Alex Wang, Alexander Clark, everyone in the Deep Learning in Semantics seminar at NYU, and three anonymous TACL reviewers. This project has benefited from financial support to SB by Google, Tencent Holdings, and Samsung Research. This material is based upon work supported by the National Science Foundation under Grant No. 1850208. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.	en
dc.language.iso	en	en
dc.title	CoLA: The Corpus of Linguistic Acceptability (with added annotations)	en
dc.title.alternative	Neural Network Acceptability Judgments	en
dc.title.alternative	Linguistic Analysis of Pretrained Sentence Encoders with Acceptability Judgments	en
dc.title.alternative	CoLA 1.1	en
dc.type	Dataset	en
Appears in Collections:	Machine Learning for Language Lab

Files in This Item:

File	Description	Size	Format
1805.12471.pdf	The paper describing the CoLA corpus.	712.7 kB	Adobe PDF
1901.03438.pdf	The paper describing the additional validation set annotations.	1.08 MB	Adobe PDF
CoLA_grammatical_annotations_major_features.txt	The annotated validation set with phenomenon-level annotations (major features).	85.47 kB	Text
CoLA_grammatical_annotations_minor_features.txt	The annotated validation set with phenomenon-level annotations (minor features).	183.74 kB	Text
Primary Corpus README.txt	Additional documentation for the CoLA 1.1 corpus. Only the raw version is distributed here.	6.42 kB	Text
in_domain_dev.txt	The primary CoLA 1.1 corpus: in-domain development set.	25.35 kB	Text
in_domain_train.txt	The primary CoLA 1.1 corpus: in-domain training set.	418.53 kB	Text
out_of_domain_dev.txt	The primary CoLA 1.1 corpus: out-of-domain development set.	27.1 kB	Text
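
The primary corpus files listed above are plain text, so they can be read with a few lines of code. The following is a minimal sketch, not an official loader. It assumes the raw layout is four tab-separated columns per line (source code, acceptability label with 1 = acceptable and 0 = unacceptable, the original author's mark, and the sentence) with no header row; check "Primary Corpus README.txt" to confirm before relying on it. The names ColaExample, parse_cola_lines, and load_cola are hypothetical, and the sample rows in the usage note are made up for illustration, not taken from the corpus.

```python
from typing import Iterable, List, NamedTuple


class ColaExample(NamedTuple):
    """One corpus line, under the ASSUMED four-column raw layout."""
    source: str       # code identifying the publication the sentence came from
    label: int        # assumed encoding: 1 = acceptable, 0 = unacceptable
    author_mark: str  # original author's notation (e.g. "*"); may be empty
    sentence: str


def parse_cola_lines(lines: Iterable[str]) -> List[ColaExample]:
    """Parse tab-separated lines, skipping any that lack exactly four fields."""
    examples = []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 4:
            continue  # blank or malformed line under the assumed layout
        source, label, mark, sentence = fields
        examples.append(ColaExample(source, int(label), mark, sentence))
    return examples


def load_cola(path: str) -> List[ColaExample]:
    """Read one corpus file (e.g. in_domain_train.txt) from a local path."""
    with open(path, encoding="utf-8") as handle:
        return parse_cola_lines(handle)
```

For example, parse_cola_lines(["src01\t1\t\tThe cat sat on the mat.\n", "src01\t0\t*\tCat the on sat mat.\n"]) yields two records, the second carrying label 0 and author mark "*". Lines that do not split into exactly four fields are skipped rather than raising an error, which is a convenience choice for a sketch; a stricter loader might prefer to fail loudly.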
