Data, code, and instructions for replicating the CLAMS paper (Cross-Linguistic Syntactic Evaluation of Word Prediction Models; Mueller et al., ACL 2020).
In this repository, we provide the following:
- Our data sets for training and testing the LSTM language models
- Syntactic evaluation sets
  - English, French, German, Hebrew, Russian
- Attribute-varying grammars (AVGs)
- Code for replicating the CLAMS paper, including:
  - A modified form of Rebecca Marvin's syntactic evaluation code for testing on CLAMS evaluation sets
  - A modified form of Yoav Goldberg's BERT-Syntax code, as well as code to prepare CLAMS evaluation sets for running in it
The training, validation, and test corpora (and vocabulary files) for each language are:
- English: train / valid / test / vocab
  - From Gulordava et al. (2018)
- French: train / valid / test / vocab
- German: train / valid / test / vocab
- Hebrew: train / valid / test / vocab
  - From Gulordava et al. (2018)
- Russian: train / valid / test / vocab
To replicate the multilingual corpora, simply concatenate the training, validation, and test corpora for each language. The multilingual vocabulary is the concatenation of each language's monolingual vocabulary.
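For example, a minimal sketch of this concatenation (the `data/<lang>/` layout and the exact file names are assumptions based on the paths used elsewhere in this README; adjust them to match your local copies of the corpora):

```python
import os

# Sketch: build the multilingual corpora by concatenating the per-language
# files. The data/<lang>/ layout and these file names are assumptions.
langs = ["en", "fr", "de", "he", "ru"]
os.makedirs("data/multilingual", exist_ok=True)

for split in ["train.txt", "valid.txt", "test.txt", "vocab.txt"]:
    with open(f"data/multilingual/{split}", "w", encoding="utf-8") as out:
        for lang in langs:
            with open(f"data/{lang}/{split}", encoding="utf-8") as f:
                out.write(f.read())
```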
Attribute-varying grammars (AVGs) are used to generate the syntactic evaluation sets: by varying particular attributes, they produce sets of grammatical and ungrammatical examples in a controlled manner. The behavior of this system is defined in `grammar.py`. The idea is quite similar to context-free grammars, but with an added `vary` statement which defines which preterminals and attributes to vary to generate the desired incorrect examples. See the CLAMS paper for more detail.
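To make the idea concrete, here is a purely illustrative toy sketch (this is not the actual grammar syntax used by `grammar.py`): expand a tiny grammar to produce grammatical sentences, then vary one attribute of a chosen preterminal (here, the number of the verb) to produce matched ungrammatical counterparts:

```python
# Toy illustration of the AVG idea (hypothetical; not grammar.py's format):
# generate grammatical sentences from a tiny grammar, then swap the number
# attribute of the varied preterminal (the verb) to get ungrammatical ones.
from itertools import product

subjects = {"sg": ["the author", "the pilot"], "pl": ["the authors", "the pilots"]}
verbs = {"sg": ["laughs", "smiles"], "pl": ["laugh", "smile"]}

examples = []
for number in ("sg", "pl"):
    varied = "pl" if number == "sg" else "sg"
    for subj, verb in product(subjects[number], verbs[number]):
        examples.append((True, f"{subj} {verb}"))   # grammatical: numbers agree
    for subj, verb in product(subjects[number], verbs[varied]):
        examples.append((False, f"{subj} {verb}"))  # ungrammatical: verb number varied
```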
The generation procedure we use is defined in `generator.py`. We give the script a directory of grammars, wherein each file contains one syntactic test case. We also define a `common.avg` grammar for each language, which contains terminals shared by all other grammars in the directory. You can also check whether all tokens in your grammars are contained in your language model's vocabulary by using the `--check_vocab` argument, which takes a text file of line-separated tokens.
Example usage:
python generator.py --grammars fr_replication --check_vocab data/fr/vocab.txt
The evaluation sets we use in the CLAMS paper are present in the `*_evalset` folders. They are formatted as tab-separated tables, where the first column is a Boolean representing the grammaticality of the sentence, and the second is the sentence itself.
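For instance, one of these files can be read as follows (the path below is only a hypothetical example):

```python
import csv

# Read one evaluation-set file: column 1 is the grammaticality label,
# column 2 is the sentence. The path is a hypothetical example.
with open("en_evalset/simple_agrmt", encoding="utf-8") as f:
    examples = [(label == "True", sentence)
                for label, sentence in csv.reader(f, delimiter="\t")]
```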
Note that the AVGs generate examples with only a minimal amount of preprocessing: most tokens are lowercase, and by default, the examples contain no punctuation or end-of-sentence markers. This is meant to keep them modular. We provide a `preproc.py` script which changes the format of the examples to better fit our training domain; modify it to make the evaluation sets look more like your own training sets (if you so choose). We use the `--eos` setting to obtain the results in Table 2 of our paper; we use both the `--eos` and `--capitalize` settings to obtain the results in Table 4. The `postproc.sh` script simply renames the files generated by `preproc.py`, replacing the original unprocessed files.
Example usage:
python preproc.py --evalsets fr_evalset --eos
./postproc.sh fr_evalset
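For reference, the effect of these two settings on a raw generated example is roughly as follows (a sketch only: the exact end-of-sentence token, `<eos>` here, is an assumption and should match whatever your training corpus uses; see `preproc.py` for the actual behavior):

```python
# Rough sketch of the --capitalize and --eos options; the <eos> token is an
# assumption and should match the training data's end-of-sentence marker.
def preprocess(sentence, capitalize=False, eos=False):
    if capitalize:
        sentence = sentence[0].upper() + sentence[1:]
    if eos:
        sentence = sentence + " <eos>"
    return sentence

print(preprocess("the author laughs .", capitalize=True, eos=True))
# -> The author laughs . <eos>
```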
Requirements:
- Python 3.6.9+
- PyTorch 1.1.0
- CUDA 9.0
We modify the code of van Schijndel, Mueller & Linzen (2019), which itself is a modification of the code from Marvin & Linzen (2018). This code was written to run on a particular SLURM-enabled grid setup. We highly encourage pull requests containing code updated to run on more recent PyTorch/CUDA versions, as well as code meant to run on more types of systems.
To train an LSTM language model, run `train_{en,fr,de,ru,he}.sh` in `LM_syneval/example_scripts`.
To obtain model perplexities on a test corpus, run the following (in `LM_syneval/word-language-model`):
python main.py --test --lm_data $corpus_dir/ --save models/$model_pt --save_lm_data models/$model_bin --testfname $test_file
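As a reminder of the quantity being reported, perplexity is the exponentiated average negative log-probability per token; a minimal sketch:

```python
import math

# Perplexity from per-token log-probabilities (natural log):
# ppl = exp(-(1/N) * sum_i log p(w_i | context))
def perplexity(token_logprobs):
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```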
There is a test script in `LM_syneval/example_scripts`.
To obtain word-by-word model surprisals on the syntactic evaluation sets, run the following (in `LM_syneval/word-language-model`):
./test.sh $evalset_dir $model_dir $test_case
To evaluate on every test case in a directory of evaluation sets, pass `all` as the `$test_case` argument.
The above script outputs a series of files with the extension `.$model_dir.wordscores` in the same directory as the evaluation sets.
Then, to analyze these word-by-word scores and obtain scores per case, run the following (in `LM_syneval/word-language-model`):
python analyze_results.py --score_dir $score_dir --case $case
where `$score_dir` is a directory containing `.wordscores` files, and `$case` refers to the syntactic evaluation case (e.g., `obj_rel_across_anim`). By default, `--case` is `all`; this will give scores on every stimulus type in the specified directory.
By default, the above script compares the probabilities of the entire grammatical and ungrammatical sentences when computing accuracies. To calculate accuracies based solely on the individual varied words, pass the `--word_compare` argument to `analyze_results.py`.
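A minimal sketch of the default full-sentence comparison, assuming you have already summed each sentence's word-by-word surprisals (negative log-probabilities) from the `.wordscores` files:

```python
# Full-sentence comparison: a pair counts as correct if the grammatical
# sentence receives lower total surprisal (i.e., higher probability) than
# its ungrammatical counterpart. With --word_compare, the sums would be
# restricted to the varied words only.
def accuracy(pairs):
    """pairs: list of (grammatical_surprisal, ungrammatical_surprisal) tuples."""
    correct = sum(1 for good, bad in pairs if good < bad)
    return correct / len(pairs)
```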
We provide a very slightly modified version of Yoav Goldberg's BERT-Syntax code. Additionally, we provide scripts for pre-processing the syntactic evaluation sets generated by AVGs into the format required by BERT-Syntax.
The model loaded in `eval_bert.py` is now `bert-base-multilingual-cased`. Additionally, the script is now able to handle input other than cases from the English Marvin & Linzen evaluation set.
To pre-process an evaluation set for BERT or mBERT, copy the `make_for_bert.py` script to the folder containing the evaluation set and run it from that directory. This will produce a `forbert.tsv` file, which you can then pass as input to the `eval_bert.py` script.
Example usage:
python eval_bert.py marvin > results/${lang}_results_multiling.txt
CLAMS is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.