I want to finetune an NLLB model on my own data. As I see it, the task is relatively simple: convert my dataset to fairseq format. So I started using the stopes pipelines. However, although the directory structure of my dataset implies the eng_Latn-rus_Cyrl language direction, the config.yaml produced by the filtering pipeline lists completely different language pairs.
My dataset consists of 2 files (FTData is a root directory for my dataset): FTData/eng_Latn-rus_Cyrl/mycorpus.eng_Latn.gz, FTData/eng_Latn-rus_Cyrl/mycorpus.rus_Cyrl.gz.
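The layout above can be sanity-checked before running any pipeline step. The following is only a sketch: it assumes the naming convention `<root>/<src>-<tgt>/<corpus>.<lang>.gz` inferred from the two paths in this post, and the helper names `find_directions` and `incomplete_corpora` are hypothetical, not part of stopes.

```python
import re
from pathlib import Path

# Assumed convention, inferred from the paths in this post (not from stopes docs):
#   <root>/<src>-<tgt>/<corpus>.<lang>.gz, e.g. FTData/eng_Latn-rus_Cyrl/mycorpus.eng_Latn.gz
LANG = r"[a-z]{3}_[A-Z][a-z]{3}"
DIRECTION_RE = re.compile(rf"^({LANG})-({LANG})$")


def find_directions(root):
    """Return {direction: {corpus: set of language codes}} for each <src>-<tgt> directory."""
    found = {}
    for d in sorted(Path(root).iterdir()):
        if not d.is_dir() or DIRECTION_RE.match(d.name) is None:
            continue
        corpora = {}
        for f in d.glob("*.gz"):
            # "mycorpus.eng_Latn.gz" -> corpus "mycorpus", language "eng_Latn"
            parts = f.name[: -len(".gz")].rsplit(".", 1)
            if len(parts) == 2:
                corpora.setdefault(parts[0], set()).add(parts[1])
        found[d.name] = corpora
    return found


def incomplete_corpora(root):
    """List (direction, corpus) pairs that do not have exactly one file per side."""
    problems = []
    for direction, corpora in find_directions(root).items():
        src, tgt = direction.split("-")
        for corpus, langs in corpora.items():
            if langs != {src, tgt}:
                problems.append((direction, corpus))
    return problems
```

Calling `find_directions("FTData")` on the dataset described here should report a single direction, eng_Latn-rus_Cyrl, with one corpus named mycorpus. This only checks the files on disk; it says nothing about which directions the stopes scripts will write into their configs.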
Then I run: python stopes/stopes/pipelines/filtering/scripts/populate_data_conf.py --bt-root bt --mined-data-root mined --primary-train-paths FTData --data-conf-dir ConfOutput train_primary,
where bt and mined are empty directories (since I initially have only my own texts, without any preprocessing),
then: python stopes/stopes/pipelines/filtering/scripts/compute_length_factors.py --data-conf-dir ConfOutput --flores-path flores,
where flores is also an empty directory (I don't need any external corpora, since my goal is to finetune only on my data, but --flores-path is a required parameter of compute_length_factors.py, so I assumed I could point it at an arbitrary directory),
and lastly: python stopes/stopes/pipelines/filtering/filter.py output_dir=FTFiltered data_conf_dir=ConfOutput.
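For easier reproduction, the three steps above can be collected into one script. This is a minimal sketch: the script paths, flags, and arguments are copied verbatim from the commands in this post, while the function names `build_commands` and `run_all` and their parameter names are hypothetical.

```python
import shlex
import subprocess

# Script location as used in the commands quoted in this post.
FILTERING = "stopes/stopes/pipelines/filtering"


def build_commands(train_root="FTData", conf_dir="ConfOutput", out_dir="FTFiltered",
                   flores_dir="flores", bt_root="bt", mined_root="mined"):
    """Return the three commands from this post as argument lists, in run order."""
    return [
        # Step 1: generate the data configs from the training data directory.
        ["python", f"{FILTERING}/scripts/populate_data_conf.py",
         "--bt-root", bt_root, "--mined-data-root", mined_root,
         "--primary-train-paths", train_root,
         "--data-conf-dir", conf_dir, "train_primary"],
        # Step 2: compute length factors (--flores-path is required).
        ["python", f"{FILTERING}/scripts/compute_length_factors.py",
         "--data-conf-dir", conf_dir, "--flores-path", flores_dir],
        # Step 3: run the filtering pipeline itself.
        ["python", f"{FILTERING}/filter.py",
         f"output_dir={out_dir}", f"data_conf_dir={conf_dir}"],
    ]


def run_all(**kwargs):
    """Run each step in order and stop at the first failure."""
    for cmd in build_commands(**kwargs):
        print("+", shlex.join(cmd))
        subprocess.run(cmd, check=True)
```

Printing `shlex.join(cmd)` for each entry reproduces the exact command lines given above, which makes it easy to paste the full sequence into a bug report.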
My FTFiltered/config.yaml file looks as follows:
As you can see, eng_Latn-lij_Latn and eng_Latn-scn_Latn appear in the config even though my dataset does not contain them. At the same time, eng_Latn-rus_Cyrl, the one language pair I actually need, is missing.
I also don't understand why nllbseed and tatoeba are listed as included corpora in my config.yaml.
Your config is incorrect: remove those two directions and add eng_Latn-rus_Cyrl (or, more likely, I think only eng-rus will work). Do the same for the opposite direction if you need it.
You need to add your corpora as well.
Download the flores dataset and use it to prepare those two configs. I ended up with something like the following: