This repository has been archived by the owner on Jun 19, 2024. It is now read-only.
forked from THUNLP-MT/Mask-Align
train-mask-align.txt
0. Install requirements
- Python 3.9
- CUDA Driver
- Google Sentencepiece
- Install the following packages with pip:
    tensorflow==2.5.0
    six==1.15.0
    regex==2021.7.6
    numpy==1.19.5
    torch==1.8.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
    nltk==3.6.2
1. Prepare a corpus (this is an example for the europarl corpus)
- Download from https://www.statmt.org/europarl/
- Remove sentences that don't have a pair (sentences that correspond with an empty line)
- Remove sentences of length 1
- Remove sentences that contain tags (characters "<" and ">")
- Split the corpus into test, train and val sets. A good size is 500 sentences for test and 2000 for val.
- Name the files corpus.es, validation.es, test.es, corpus.en, validation.en and test.en
- Run the following commands to learn the vocabulary and tokenize the files:
    spm_train --input=corpus.en --model_prefix=en --vocab_size=32000 --character_coverage=1.0 --model_type=unigram
    spm_train --input=corpus.es --model_prefix=es --vocab_size=32000 --character_coverage=1.0 --model_type=unigram
    spm_encode --model=en.model --output_format=piece < corpus.en > corpus.32k.en
    spm_encode --model=en.model --output_format=piece < validation.en > validation.32k.en
    spm_encode --model=en.model --output_format=piece < test.en > test.32k.en
    spm_encode --model=es.model --output_format=piece < corpus.es > corpus.32k.es
    spm_encode --model=es.model --output_format=piece < validation.es > validation.32k.es
    spm_encode --model=es.model --output_format=piece < test.es > test.32k.es
- Run: python3 thualign/scripts/shuffle_corpus.py --corpus corpus.32k.es corpus.32k.en
- In the vocabulary files, replace the tags <s> and </s> with <eos> and <pad>
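
The cleaning and splitting steps in section 1 can be sketched in Python as follows. This is only an illustration, not a script shipped with the repository: the function names keep_pair and clean_and_split are hypothetical, "length 1" is assumed to mean a single whitespace-separated token, a pair is assumed to be dropped if either side contains "<" or ">", and the pairs are shuffled before the split (a choice made here, not stated above).

```python
import random


def keep_pair(src: str, tgt: str) -> bool:
    """Return True if both sides of a sentence pair pass the three filters."""
    for sentence in (src, tgt):
        text = sentence.strip()
        if not text:                    # the other side has no counterpart
            return False
        if len(text.split()) <= 1:      # assumption: "length 1" = one token
            return False
        if "<" in text or ">" in text:  # assumption: either character marks a tag
            return False
    return True


def clean_and_split(src_lines, tgt_lines, n_test=500, n_val=2000, seed=0):
    """Filter aligned sentence pairs, then split them into train, val and test."""
    pairs = [
        (s.strip(), t.strip())
        for s, t in zip(src_lines, tgt_lines)
        if keep_pair(s, t)
    ]
    # Assumption: shuffle with a fixed seed so the held-out sets are a
    # reproducible random sample rather than the first lines of the corpus.
    random.Random(seed).shuffle(pairs)
    test = pairs[:n_test]
    val = pairs[n_test:n_test + n_val]
    train = pairs[n_test + n_val:]
    return train, val, test
```

To use it, read the two downloaded Europarl files line by line, pass the lists to clean_and_split, and write each side of train, val and test to corpus.*, validation.* and test.* before running spm_train.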
2. Train the model
- In the config file thualign/configs/user/spanish.config, specify the location of the following files:
    - corpus.32k.es.shuf
    - corpus.32k.en.shuf
    - validation.32k.es
    - validation.32k.en
    - test.32k.es
    - test.32k.en
    - es.vocab
    - en.vocab
- In device_list specify the number of GPUs
- In batch_size, choose the highest value that does not make training stop by running out of memory (try different numbers)
- The value of update_cycle must be 36000 / batch_size
- To train, run: bash thualign/bin/train.sh -s spanish
- The model is saved in a folder created in the root directory of the repository
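
The batch_size / update_cycle rule in section 2 is simple arithmetic, sketched below. The helper names are hypothetical, and the sketch assumes update_cycle has to be a whole number, so it only accepts batch sizes that divide 36000 evenly.

```python
EFFECTIVE_BATCH = 36000  # from the rule: update_cycle = 36000 / batch_size


def update_cycle(batch_size: int) -> int:
    """Return the update_cycle that pairs with the given batch_size."""
    # Assumption: update_cycle must be an integer, so reject batch sizes
    # that do not divide 36000 evenly.
    if batch_size <= 0 or EFFECTIVE_BATCH % batch_size != 0:
        raise ValueError(f"batch_size {batch_size} does not divide {EFFECTIVE_BATCH}")
    return EFFECTIVE_BATCH // batch_size


def candidate_batch_sizes(max_batch_size: int) -> list:
    """List batch sizes up to max_batch_size that divide 36000, largest first."""
    return [b for b in range(max_batch_size, 0, -1) if EFFECTIVE_BATCH % b == 0]
```

For example, if training fits in memory with a batch_size of at most 10000, candidate_batch_sizes(10000) starts at 9000, and update_cycle(9000) gives 4.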
3. Test the model
- After finishing the training (50000 steps is the default), run the following command to test it:
    bash thualign/bin/generate.sh -s spanish -gvt
- The alignments are generated in a file named "alignments.txt", inside the folder named "test" where the model was saved
- To see the alignments in an interactive way, run: python thualign/scripts/visualize.py spanish/output/test/alignment_vizdata.pt