Hidden Markov models for word representations

Learn discrete and continuous word representations with Hidden Markov models, including variants defined over unlabeled and labeled parse trees.

Author: Simon Šuster

If you use this code for any of the tree HMM types, please cite our paper:

Word Representations, Tree Models and Syntactic Functions. Simon Šuster, Gertjan van Noord and Ivan Titov. arXiv preprint arXiv:1508.07709, 2015. bibtex

Features

Architectures and training

  • Sequential HMM trained with the Baum-Welch algorithm (a minimal forward-backward sketch follows this list)
  • Tree HMM trained with sum-product message passing (belief propagation)
    • unlabeled
    • labeled (includes syntactic functions as additional observed variables, effectively an Input-Output HMM)
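
For orientation, here is a minimal sketch of the forward-backward computation (the Baum-Welch E-step) for the sequential case, in log space. It is not the repository's implementation; the function name, array layout and the numpy/scipy dependency are illustrative assumptions.

```python
# Minimal forward-backward (Baum-Welch E-step) sketch for a sequential HMM,
# working in log space. Illustrative only, not the repository's code.
import numpy as np
from scipy.special import logsumexp

def forward_backward(log_pi, log_A, log_B, obs):
    """log_pi: (K,) initial log-probs, log_A: (K, K) transition log-probs,
    log_B: (K, V) emission log-probs, obs: list of token ids."""
    T, K = len(obs), log_pi.shape[0]
    log_alpha = np.empty((T, K))
    log_beta = np.zeros((T, K))
    log_alpha[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):                        # forward pass
        log_alpha[t] = log_B[:, obs[t]] + logsumexp(
            log_alpha[t - 1][:, None] + log_A, axis=0)
    for t in range(T - 2, -1, -1):               # backward pass
        log_beta[t] = logsumexp(
            log_A + log_B[:, obs[t + 1]] + log_beta[t + 1], axis=1)
    log_Z = logsumexp(log_alpha[-1])             # sentence log-likelihood
    gamma = np.exp(log_alpha + log_beta - log_Z)  # per-token state posteriors
    return gamma, log_Z
```

In the tree models, the same quantities are obtained by passing sum-product messages up and down the dependency tree instead of left to right.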

Training regimes

  • Batch EM
  • Online (mini-batch step-wise) EM (Liang and Klein 2009); see the update sketch below
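
The step-wise regime follows Liang and Klein (2009): after each mini-batch, the running expected counts are interpolated with the counts collected on that mini-batch using a decaying step size. A hedged sketch (variable names are illustrative, not the repository's):

```python
# Step-wise (online) EM update in the style of Liang and Klein (2009).
# mu: running expected counts, s_batch: expected counts from the current
# mini-batch, k: mini-batch index, alpha in (0.5, 1].
def stepwise_update(mu, s_batch, k, alpha=0.7):
    eta = (k + 2) ** (-alpha)          # step size eta_k = (k + 2)^(-alpha)
    return (1.0 - eta) * mu + eta * s_batch
# The M-step then renormalizes the interpolated counts into transition and
# emission probabilities, exactly as in batch EM.
```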

Inference and decoding options

  • Viterbi/max-product message passing
  • Posterior decoding (also known as minimum-risk decoding)
  • Per-token posterior distribution obtained by inference (a continuous representation)
  • Posterior distribution averaged per word type
  • Maximum emission decoding (ignoring transitions); the sketch after this list contrasts some of these options
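
The sketch below illustrates the difference between a few of these options, assuming the per-token state posteriors `gamma` from the forward-backward sketch above; it is illustrative, not the repository's code.

```python
# Contrasting decoding options, given state posteriors gamma (T x K) and
# emission log-probs log_B (K x V). Illustrative only.
import numpy as np

def posterior_decode(gamma):
    # Posterior (minimum-risk) decoding: the most probable state at each
    # position, chosen independently per token.
    return gamma.argmax(axis=1)

def max_emission_decode(log_B, obs):
    # Maximum emission decoding: ignore transitions and pick the state
    # with the highest emission probability for each token.
    return np.array([log_B[:, o].argmax() for o in obs])

# Continuous representations keep gamma itself (one distribution per token),
# or average it over all occurrences of a word type.
```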

Implemented refinements

  • Approximation (regularization) of belief vectors (Grave et al. 2013)
  • Initialization of model parameters with Brown clusters
  • Splitting and merging of states for progressive introduction of complexity (Petrov et al. 2006)

Implementation

  • Logspace computation instead of Rabiner rescaling: a C extension is used for fast log-sum-exp calculation (see the sketch after this list)
  • Parallelization at the sentence level
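
The quantity the C extension accelerates is the log-sum-exp that appears in every message update; in plain Python/numpy it looks roughly like this (illustrative, not the extension's code):

```python
# Numerically stable log-sum-exp: subtracting the maximum keeps the
# exponentials in a safe range. Illustrative Python version of what the
# C extension computes.
import numpy as np

def log_sum_exp(log_values):
    m = np.max(log_values)
    if np.isneginf(m):                 # all inputs are log(0)
        return m
    return m + np.log(np.sum(np.exp(log_values - m)))
```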

Despite these optimizations, the running time is relatively slow and is especially sensitive to the number of states. A further speed-up would be possible with sparse matrices, but in several places the replacement is not trivial.

Input format

  • For the sequential HMM: plain text, one sentence per line, space-separated tokens
  • For the tree HMMs: CoNLL format (dependency-parsed input; an illustrative snippet follows)
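
For the sequential HMM an input line is simply a tokenized sentence, e.g. `the cat sleeps .`. For the tree HMMs, the same sentence in CoNLL format (one token per line, with head and dependency-relation columns) could look as follows; the snippet is illustrative, and the exact set of columns depends on the parser used:

```
1   The      the      DT   DT    _   2   det     _   _
2   cat      cat      NN   NN    _   3   nsubj   _   _
3   sleeps   sleep    VB   VBZ   _   0   root    _   _
4   .        .        .    .     _   3   punct   _   _
```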

Usage notes

The three main classes define the type of architecture that can be used:

  • hmm.py for Hidden Markov models
  • hmtm.py for Hidden Markov tree models
  • hmrtm.py for Hidden Markov tree models with syntactic relations (functions)

Models can be trained by invoking run.py. Use --tree to train a Hidden Markov tree model, and --rel to train a Hidden Markov tree model with syntactic relations. A full list of options can be found by running:

python3.3 run.py --help

Example runs

First prepare the data that you plan to use. This will replace infrequent tokens with *unk*:

python3.3 corpus_normalize.py --dataset data/sample.en --output $DATASET --freq_thresh 1
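
Conceptually, the normalization step counts token frequencies over the corpus and replaces rare tokens with *unk*. A hedged sketch (not the actual corpus_normalize.py code, and the exact frequency comparison may differ):

```python
# Conceptual sketch of frequency-based normalization: replace tokens
# occurring at most freq_thresh times with *unk*. Not the script's code.
from collections import Counter

def normalize(sentences, freq_thresh=1, unk="*unk*"):
    counts = Counter(tok for sent in sentences for tok in sent)
    return [[tok if counts[tok] > freq_thresh else unk for tok in sent]
            for sent in sentences]
```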

Then, to train a sequential Hidden Markov model:

python3.3 run.py --dataset $DATASET --desired_n_states 60 --max_iter 2 --n_proc 1 --approx

This will create an output directory starting with hmm_....

The following configuration will train the same model, but with the splitting procedure and Brown initialization:

python3.3 run.py --dataset $DATASET --start_n_states 30 --desired_n_states 60 --brown sample.en.30.paths --max_iter 2 --n_proc 1 --approx

To train a tree model, again start by normalizing the parsed data (replacing infrequent tokens with *unk*):

python3.3 corpus_normalize.py --dataset data/sample.en.conll --output $DATASET_TREE --freq_thresh 1 --conll

Then to train a Hidden Markov tree model:

python3.3 run.py --tree --dataset $DATASET_TREE --start_n_states 30 --desired_n_states 60 --max_iter 2 --n_proc 1 --approx

To train a Hidden Markov tree model with syntactic relations:

python3.3 run.py --rel --dataset $DATASET_TREE --desired_n_states 60 --max_iter 2 --n_proc 1 --approx

Evaluating the representations

To carry out NER evaluation on the CoNLL-2003 dataset for English:

python3.3 eng_ner_run.py -d posterior -rep hmm_.../ -o out --n_epochs 2

This runs an averaged structured perceptron, an adaptation of the LXMLS implementation, for 2 epochs (more are needed in practice). Training takes about 1 minute per epoch. Note that before the perceptron is trained, word representations are first inferred or decoded, which also takes a couple of minutes. Here we choose posterior (posterior decoding), which produces results comparable to Viterbi. If you choose the posterior_cont or posterior_cont_type decoding methods, keep in mind that they are very memory-intensive.
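For reference, the core of an averaged structured perceptron looks roughly like the sketch below; this is a generic illustration, not the adapted LXMLS code, and all names are hypothetical.

```python
# Generic averaged structured perceptron epoch: decode with the current
# weights, update on mistakes, and accumulate weights for averaging.
def perceptron_epoch(data, w, w_sum, decode, phi):
    """data: list of (x, y_gold) pairs; decode(x, w) returns the best-scoring
    label sequence; phi(x, y) is the joint feature vector (numpy array)."""
    for x, y_gold in data:
        y_hat = decode(x, w)
        if y_hat != y_gold:
            w += phi(x, y_gold) - phi(x, y_hat)
        w_sum += w                      # running sum for the averaged weights
    return w, w_sum
# After the final epoch, predictions are made with w_sum divided by the
# number of accumulation steps (the "averaged" weights).
```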

See the scripts under output for model introspection utilities.

Requirements