Cross-lingual structured flow model for zero-shot syntactic transfer
This is a PyTorch implementation of the paper:
Cross-Lingual Syntactic Transfer through Unsupervised Adaptation of Invertible Projections
Junxian He, Zhisong Zhang, Taylor Berg-Kirkpatrick, Graham Neubig
ACL 2019
The structured flow model is a generative model that can be trained in a supervised fashion on labeled data in another language, and can also be trained unsupervisedly to directly maximize the likelihood of the target language. In this way, it transfers shared linguistic knowledge from the source language while learning language-specific knowledge from unlabeled target-language data.
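To make the generative idea concrete, here is a minimal sketch (not the paper's actual code) of the change-of-variables computation behind an invertible projection: an observed embedding x is modeled as an invertible map applied to a latent variable z with a simple prior, and the likelihood is evaluated by inverting the map and adding the log-determinant of the Jacobian. For illustration the map is a 1-D affine function; the function names are hypothetical.

```python
import math

def log_normal(z, mu=0.0, sigma=1.0):
    """Log-density of a univariate Gaussian prior over the latent z."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (z - mu) ** 2 / (2 * sigma ** 2)

def flow_log_prob(x, a, b, mu=0.0, sigma=1.0):
    """Log-likelihood of x under the flow x = a*z + b.

    By the change of variables: log p(x) = log p_z(f^{-1}(x)) - log|a|,
    where log|a| is the log-determinant of the Jacobian of f.
    """
    z = (x - b) / a  # invert the projection
    return log_normal(z, mu, sigma) - math.log(abs(a))
```

Supervised training would fit the prior parameters on labeled source-language data; unsupervised adaptation maximizes this likelihood directly on unlabeled target-language embeddings.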
Please contact junxianh@cs.cmu.edu if you have any questions.
Requirements
- Python >= 3.6
- PyTorch >= 0.4
Additional requirements can be installed via:
pip install -r requirements.txt
Data
Download the Universal Dependencies 2.2 treebanks here (ud-treebanks-v2.2.tgz), put the file ud-treebanks-v2.2.tgz into the top-level directory of this repo, and run:
$ tar -xvzf ud-treebanks-v2.2.tgz
$ rm ud-treebanks-v2.2.tgz
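The extracted treebanks are in CoNLL-U format: sentences are separated by blank lines, comment lines start with `#`, and each token line has ten tab-separated columns (FORM is column 2, the universal POS tag is column 4). A hedged sketch of reading such a file, with an illustrative helper name not taken from this repo:

```python
def read_conllu(lines):
    """Parse CoNLL-U lines into sentences of (word, UPOS) pairs."""
    sent, sents = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # blank line ends the current sentence
            if sent:
                sents.append(sent)
                sent = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            # keep the word form (col 2) and universal POS tag (col 4)
            sent.append((cols[1], cols[3]))
    if sent:
        sents.append(sent)
    return sents
```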
Prepare Embeddings
fastText
The fastText embeddings can be downloaded from the Facebook fastText repo. Note that the fastText repo hosts several versions of pretrained embeddings; you must download the embeddings from the given link, since the alignment matrices we use (from here) were learned on this specific version of the fastText embeddings. Download the fastText model bin file and put it into the fastText_data folder.
Taking English as an example, download and preprocess the fastText embeddings:
$ cd fastText_data
$ wget https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.zip
$ unzip wiki.en.zip
$ cd ..
# create a subset of embedding dict for faster embedding loading and memory efficiency
$ python scripts/compress_vec_dict.py --lang en
$ rm wiki.en.vec
$ rm wiki.en.zip
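The compression step above keeps embedding loading fast and memory-efficient by restricting the full `.vec` file to words that are actually needed. A hedged sketch of the idea behind a script like `scripts/compress_vec_dict.py`; the function name and interface here are illustrative, not the repo's real API:

```python
def compress_embeddings(vec_lines, vocab):
    """Keep only .vec lines (word followed by floats) whose word is in vocab."""
    kept = []
    for line in vec_lines:
        # the word is the first whitespace-separated field on each line
        word = line.split(" ", 1)[0]
        if word in vocab:
            kept.append(line)
    return kept
```

In practice the vocabulary would be collected from the treebank files, so the compressed dictionary covers exactly the words the model will see.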