Work presented at STIL 2019. This repository contains:
- dataset files: data directory
- scripts for dataset generation: scripts directory
- files used for model definition + training: models, pos_tagger and postagger.py
- logs of some training experiments (test accuracies too): runs directory
- pretrained models: pretrained directory
- pdf of the paper: STIL2019.pdf
- python v3.6.3
- pytorch v1.2: https://pytorch.org/
- tqdm: https://github.com/tqdm/tqdm
- Very basic knowledge of python is needed in order to fill the pos_tagger/parameters.py file.
- MacMorpho: http://nilc.icmc.usp.br/macmorpho/
- GSD: https://github.com/UniversalDependencies/UD_Portuguese-GSD
- Bosque-UD: https://github.com/UniversalDependencies/UD_Portuguese-Bosque
- Bosque-LT: https://www.linguateca.pt/Floresta/ficheiros/Bosque_CP_8.0.ad.txt (at https://www.linguateca.pt/Floresta/corpus.html#bosque)
Script that extracts sentences and their POS tags from a file with ad formatting and writes the extracted samples to a new file, following the Mac-Morpho formatting. To execute the script, run
python ad2mm.py PATH_TO_AD_FILE PATH_TO_NEW_FILE
Script that extracts sentences and their POS tags from a file with conllu formatting and writes the extracted samples to a new file, following the Mac-Morpho formatting. To execute the script, run
python ad2mm.py PATH_TO_CONLLU_FILE PATH_TO_NEW_FILE
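As an illustration of the conversion described above, here is a minimal sketch, not the repository's script. It assumes the standard CoNLL-U layout (tab-separated columns, FORM in column 2, UPOS in column 4) and the Mac-Morpho convention of one sentence per line with token_TAG pairs. The function name conllu_to_macmorpho is hypothetical, and skipping multiword token ranges in favor of the syntactic words is an assumption about the desired behavior.

```python
def conllu_to_macmorpho(conllu_text):
    """Convert CoNLL-U text into Mac-Morpho style lines (token_TAG pairs)."""
    sentences, current = [], []
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(" ".join(current))
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment (sent_id, text, ...)
        cols = line.split("\t")
        token_id, form, upos = cols[0], cols[1], cols[3]
        if "-" in token_id or "." in token_id:
            # Assumption: skip multiword ranges (e.g. 2-3) and empty nodes (e.g. 5.1).
            continue
        current.append(f"{form}_{upos}")
    if current:
        # Handle a final sentence that is not followed by a blank line.
        sentences.append(" ".join(current))
    return sentences


sample = "\n".join([
    "# text = Vou ao mercado.",
    "1\tVou\tir\tVERB\t_\t_\t0\troot\t_\t_",
    "2-3\tao\t_\t_\t_\t_\t_\t_\t_\t_",
    "2\ta\ta\tADP\t_\t_\t4\tcase\t_\t_",
    "3\to\to\tDET\t_\t_\t4\tdet\t_\t_",
    "4\tmercado\tmercado\tNOUN\t_\t_\t1\tobl\t_\t_",
    "5\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_",
    "",
])
print(conllu_to_macmorpho(sample))
```

The sample sentence is made up for the example; real input would come from the GSD or Bosque-UD files listed above.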
Script for generating the lgtc (Linguateca) datasets. Since there can be a large overlap between different sets (e.g. train and test) of Bosque-UD and the generated Linguateca (Bosque-LT) splits, a more cautious split is needed. When the files are generated with this script, sentences are shared only between matching sets (train-train, dev-dev, test-test).
The parameters need to be hardcoded:
- DATA_PATH: path to directory with input files
- UD_TRAIN_FILE: Bosque-UD train file
- UD_DEV_FILE: Bosque-UD dev file
- UD_TEST_FILE: Bosque-UD test file
- FILE_LGTC: Linguateca file with samples at the Mac-Morpho formatting
- DEST_LGTC_TRAIN: Destination path to the Bosque-LT train file
- DEST_LGTC_DEV: Destination path to the Bosque-LT dev file
- DEST_LGTC_TEST: Destination path to the Bosque-LT test file

Then run
python build_lgtc.py
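The idea behind the cautious split can be sketched as follows. This is not the code of build_lgtc.py; it is a minimal illustration that assumes sentences are matched by their exact token sequence (tags ignored, since the two tagsets differ) and that Linguateca sentences not found in any Bosque-UD split go to train. Both choices, the names sentence_key and split_like_ud, and the sample sentences and tags are hypothetical.

```python
def sentence_key(line):
    """Return the token sequence of a Mac-Morpho line, ignoring the tags."""
    return tuple(pair.rsplit("_", 1)[0] for pair in line.split())


def split_like_ud(lgtc_lines, ud_splits, default="train"):
    """Place each Linguateca sentence in the split where Bosque-UD has it."""
    keys = {name: {sentence_key(l) for l in lines}
            for name, lines in ud_splits.items()}
    out = {name: [] for name in ud_splits}
    for line in lgtc_lines:
        key = sentence_key(line)
        # First split whose sentences contain this token sequence, else the default.
        target = next((name for name in ud_splits if key in keys[name]), default)
        out.setdefault(target, []).append(line)
    return out


# Illustrative data only: sentences and tags are made up for the example.
ud = {
    "train": ["O_DET gato_NOUN dorme_VERB"],
    "dev": ["Ela_PRON canta_VERB"],
    "test": ["Chove_VERB"],
}
lgtc = [
    "O_art gato_n dorme_v-fin",
    "Chove_v-fin",
    "Ela_pron-pers canta_v-fin",
    "Nada_pron-indef muda_v-fin",
]
result = split_like_ud(lgtc, ud)
for name, lines in result.items():
    print(name, lines)
```

With this assignment a sentence present in the Bosque-UD test set can never land in the Bosque-LT train set, which is the property the script is meant to guarantee. Exact token matching may miss sentences that the two corpora tokenize differently.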
Follow the instructions in the file to fill it in.
Execute the main file:
python postagger.py
Only log messages with rank <= LOG_LEVEL will be printed on the terminal.
- rank=0 messages: errors, warnings, train and test output, tqdm
- rank=1 messages: success messages, descriptive log
A log file with all the log messages will be generated.
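The rank filter described above can be pictured with the following sketch. It is not the logger used by postagger.py; the class RankLogger and its arguments are hypothetical, and the only behavior taken from the text is that every message goes to the log file while only messages with rank <= LOG_LEVEL reach the terminal.

```python
import io
import sys


class RankLogger:
    """Write every message to a log file; echo only low-rank ones to the terminal."""

    def __init__(self, log_level, log_file, terminal=sys.stdout):
        self.log_level = log_level
        self.log_file = log_file
        self.terminal = terminal

    def log(self, message, rank=0):
        # The log file receives all messages, whatever their rank.
        self.log_file.write(message + "\n")
        # The terminal only receives messages with rank <= LOG_LEVEL.
        if rank <= self.log_level:
            print(message, file=self.terminal)


# In-memory streams stand in for the real log file and terminal.
file_buffer, terminal_buffer = io.StringIO(), io.StringIO()
logger = RankLogger(log_level=0, log_file=file_buffer, terminal=terminal_buffer)
logger.log("warning: unknown token", rank=0)
logger.log("model saved", rank=1)
print(terminal_buffer.getvalue(), end="")
```

With LOG_LEVEL set to 0, only the warning is echoed; setting it to 1 would echo the success message as well.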
A file with the samples from the validation set, along with their predicted tags, will be generated for each dataset.
Script used for checking the intersection of sentences between all the files of a dataset. The paths of the files must be hardcoded in the variable FILES. To execute the script, run
python intersect.py
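A check of this kind can be sketched as below. This is an illustration rather than the contents of intersect.py: it assumes files in the Mac-Morpho formatting, compares sentences by token sequence with tags ignored, and uses the hypothetical names sentence_key and pairwise_intersections. The sample sentences are made up.

```python
from itertools import combinations


def sentence_key(line):
    """Return the token sequence of a Mac-Morpho line, ignoring the tags."""
    return tuple(pair.rsplit("_", 1)[0] for pair in line.split())


def pairwise_intersections(datasets):
    """Count the sentences shared by every pair of sets."""
    keys = {name: {sentence_key(l) for l in lines}
            for name, lines in datasets.items()}
    return {(a, b): len(keys[a] & keys[b]) for a, b in combinations(keys, 2)}


# Illustrative data only; real input would be read from the paths in FILES.
datasets = {
    "train": ["O_DET gato_NOUN dorme_VERB", "Chove_VERB"],
    "dev": ["Ela_PRON canta_VERB"],
    "test": ["Chove_VERB", "Ela_PRON canta_VERB"],
}
counts = pairwise_intersections(datasets)
for pair, count in counts.items():
    print(pair, count)
```

A non-zero count between train and test signals leakage between the two sets.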
If you used our model, please cite our paper:
- LINK TODO
bibtex TODO