CoLA: The Corpus of Linguistic Acceptability (with added annotations)
|Other Titles:||Neural Network Acceptability Judgments; Linguistic Analysis of Pretrained Sentence Encoders with Acceptability Judgments|
|Authors:||Bowman, Samuel R.|
|Abstract:||[Primary paper:] This paper investigates the ability of artificial neural networks to judge the grammatical acceptability of a sentence, with the goal of testing their linguistic competence. We introduce the Corpus of Linguistic Acceptability (CoLA), a set of 10,657 English sentences drawn from published linguistics literature and labeled as grammatical or ungrammatical. As baselines, we train several recurrent neural network models on acceptability classification and find that our models outperform the unsupervised models of Lau et al. (2016) on CoLA. Error analysis on specific grammatical phenomena reveals that both Lau et al.’s models and ours learn systematic generalizations such as subject-verb-object order. However, all models we test perform far below human level on a wide range of grammatical constructions.|
|Appears in Collections:||Machine Learning for Language Lab|
Files in This Item:
|1805.12471.pdf||The paper describing the CoLA corpus.||712.7 kB||Adobe PDF|
|1901.03438.pdf||The paper describing the additional validation set annotations.||1.08 MB||Adobe PDF|
|CoLA_grammatical_annotations_major_features.txt||The annotated validation set with phenomenon-level annotations (major features).||85.47 kB||Text|
|CoLA_grammatical_annotations_minor_features.txt||The annotated validation set with phenomenon-level annotations (minor features).||183.74 kB||Text|
|Primary Corpus README.txt||Additional documentation for the CoLA 1.1 corpus. Only the raw version is distributed here.||6.42 kB||Text|
|in_domain_dev.txt||The primary CoLA 1.1 corpus: in-domain development set.||25.35 kB||Text|
|in_domain_train.txt||The primary CoLA 1.1 corpus: in-domain training set.||418.53 kB||Text|
|out_of_domain_dev.txt||The primary CoLA 1.1 corpus: out-of-domain development set.||27.1 kB||Text|
Items in FDA are protected by copyright, with all rights reserved, unless otherwise indicated.