Multilingual training for NMT (NVIDIA#2160)
* mnmt on fresh main

Signed-off-by: Abhinav Khattar <[email protected]>

* push for test

Signed-off-by: Abhinav Khattar <[email protected]>

* debug

Signed-off-by: Abhinav Khattar <[email protected]>

* check

Signed-off-by: Abhinav Khattar <[email protected]>

* cleanup

Signed-off-by: Abhinav Khattar <[email protected]>

* minor fix

Signed-off-by: Abhinav Khattar <[email protected]>

* more minor fixes

Signed-off-by: Abhinav Khattar <[email protected]>

* fix for test

Signed-off-by: Abhinav Khattar <[email protected]>

* fix list size error

Signed-off-by: Abhinav Khattar <[email protected]>

* multilingual in infer

Signed-off-by: Abhinav Khattar <[email protected]>

* changes

Signed-off-by: Abhinav Khattar <[email protected]>

* tar creation with multilingual

Signed-off-by: Abhinav Khattar <[email protected]>

* fix

Signed-off-by: Abhinav Khattar <[email protected]>

* changes + parallelism + bug fix

Signed-off-by: Abhinav Khattar <[email protected]>

* small fix

Signed-off-by: Abhinav Khattar <[email protected]>

* multilingual preprocessor fix

Signed-off-by: Abhinav Khattar <[email protected]>

* globally unique fragment names in tarred dataset

Signed-off-by: Abhinav Khattar <[email protected]>

* minor changes

Signed-off-by: Abhinav Khattar <[email protected]>

* rm load_from_cached_dataset

Signed-off-by: Abhinav Khattar <[email protected]>

* minor config change

Signed-off-by: Abhinav Khattar <[email protected]>

* rm unused import

Signed-off-by: Abhinav Khattar <[email protected]>

Co-authored-by: Oleksii Kuchaiev <[email protected]>
2 people authored and mousebaiker committed Jul 8, 2021
1 parent 24f2f40 commit c95fb61
Showing 8 changed files with 469 additions and 199 deletions.
4 changes: 4 additions & 0 deletions examples/nlp/machine_translation/conf/aayn_base.yaml
@@ -5,6 +5,7 @@ do_testing: False # set to True to run evaluation on test data after training
 model:
   beam_size: 4
   len_pen: 0.6
+  multilingual: False
   max_generation_delta: 5
   label_smoothing: 0.1
   shared_tokenizer: True # train tokenizer model across src and tgt train data
@@ -36,6 +37,9 @@ model:
     drop_last: false
     pin_memory: false
     num_workers: 8
+    concat_sampling_technique: temperature # only used with ConcatTranslationDataset
+    concat_sampling_temperature: 5 # only used with ConcatTranslationDataset
+    concat_sampling_probabilities: null # only used with ConcatTranslationDataset

   validation_ds:
     src_file_name: ???
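
Note on the `concat_sampling_*` options: this excerpt does not show how ConcatTranslationDataset turns them into per-dataset sampling weights. As a rough illustration only, a common definition of temperature-based sampling draws dataset i with probability proportional to (n_i / N) ** (1 / T), where n_i is the dataset size and N the total. The sketch below assumes that definition; the function name and the example sizes are hypothetical and are not taken from NeMo.

```python
def temperature_sampling_probabilities(dataset_sizes, temperature):
    """Sketch (assumed formula, not NeMo's code): p_i ~ (n_i / N) ** (1 / T)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    total = sum(dataset_sizes)
    # Raise each dataset's share of the data to the power 1/T ...
    scaled = [(n / total) ** (1.0 / temperature) for n in dataset_sizes]
    # ... then renormalize so the probabilities sum to 1.
    norm = sum(scaled)
    return [s / norm for s in scaled]


# Hypothetical sizes: one high-resource and one low-resource language pair.
sizes = [1_000_000, 10_000]
probs_t1 = temperature_sampling_probabilities(sizes, temperature=1)
probs_t5 = temperature_sampling_probabilities(sizes, temperature=5)
```

Under this definition, T = 1 samples in proportion to dataset size, while a larger value such as the default of 5 flattens the distribution so the smaller dataset is drawn more often than its raw share of the data.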
98 changes: 0 additions & 98 deletions examples/nlp/machine_translation/preprocess_dataset.py

This file was deleted.

1 change: 1 addition & 0 deletions nemo/collections/nlp/data/__init__.py
@@ -27,6 +27,7 @@
 )
 from nemo.collections.nlp.data.language_modeling.sentence_dataset import SentenceDataset, TarredSentenceDataset
 from nemo.collections.nlp.data.machine_translation.machine_translation_dataset import (
+    ConcatTranslationDataset,
     TarredTranslationDataset,
     TranslationDataset,
 )
1 change: 1 addition & 0 deletions nemo/collections/nlp/data/machine_translation/__init__.py
@@ -13,6 +13,7 @@
 # limitations under the License.

 from nemo.collections.nlp.data.machine_translation.machine_translation_dataset import (
+    ConcatTranslationDataset,
     TarredTranslationDataset,
     TranslationDataset,
 )