The University of Edinburgh's Submissions to the WMT19 News Translation Task

Rachel Bawden, Nikolay Bogoychev, Ulrich Germann, Roman Grundkiewicz, Faheem Kirefu, Antonio Valerio Miceli Barone, Alexandra Birch
School of Informatics, University of Edinburgh, Scotland
rachel.bawden@ed.ac.uk

Published in: Proceedings of the Fourth Conference on Machine Translation (WMT), Volume 2: Shared Task Papers, pages 302–314, Florence, Italy, August 1–2, 2019. Association for Computational Linguistics. DOI: 10.18653/v1/W19-5304

Abstract

The University of Edinburgh participated in the WMT19 Shared Task on News Translation in six language directions: English↔Gujarati, English↔Chinese, German→English, and English→Czech. For all translation directions, we created or used back-translations of monolingual data in the target language as additional synthetic training data. For English↔Gujarati, we also explored semi-supervised MT with cross-lingual language model pre-training, and translation pivoting through Hindi. For translation to and from Chinese, we investigated character-based tokenisation vs. sub-word segmentation of Chinese text. For German→English, we studied the impact of vast amounts of back-translated training data on translation quality, gaining a few additional insights over Edunov et al. (2018). For English→Czech, we compared different pre-processing and tokenisation regimes.

1 Introduction

The University of Edinburgh participated in the WMT19 Shared Task on News Translation in six language directions: English-Gujarati (EN↔GU), English-Chinese (EN↔ZH), German-English (DE→EN) and English-Czech (EN→CS). All our systems are neural machine translation (NMT) systems trained in constrained data conditions with the Marian toolkit (Junczys-Dowmunt et al., 2018; https://marian-nmt.github.io). The different language pairs pose very different challenges, due to the characteristics of the languages involved and, arguably more importantly, due to the amount of training data available.

Pre-processing: For EN↔ZH, we investigate character-level pre-processing for Chinese compared with subword segmentation. For EN→CS, we show that it is possible in high-resource settings to simplify pre-processing by removing steps.

Exploiting non-parallel resources: For all language directions, we create additional, synthetic parallel training data. For the high-resource language pairs, we look at ways of effectively using large quantities of backtranslated data. For example, for DE→EN, we investigated the most effective way of combining genuine parallel data with larger quantities of synthetic parallel data, and for CS→EN, we filter backtranslated data by re-scoring translations using the MT model for the opposite direction. The challenge for our low-resource pair, EN↔GU, is producing sufficiently good models for backtranslation, which we achieve by training semi-supervised MT models with cross-lingual language model pre-training (Lample and Conneau, 2019). We use the same technique to translate additional data from a related language, Hindi.

NMT training settings: In all experiments, we test state-of-the-art training techniques, including using ultra-large mini-batches for DE→EN and EN↔ZH, implemented as optimiser delay.

Results summary: Automatic evaluation results for all final systems on the WMT19 test set are summarised in Table 1. Throughout the paper, BLEU is calculated using SacreBLEU (Post, 2018; https://github.com/mjpost/sacreBLEU) unless otherwise indicated. A selection of our final models and running scripts is available to download from data.statmt.org/wmt19_systems/.

Table 1: Final BLEU scores and system rankings amongst constrained systems according to automatic evaluation metrics.

Lang. direction   BLEU   Ranking
EN→GU             16.4   1
GU→EN             21.4   2
EN→ZH             34.4   7
ZH→EN             27.7   6
DE→EN             35.0   9
EN→CS             27.9   3
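Since all reported scores are computed with SacreBLEU, results such as those in Table 1 can in principle be reproduced along the following lines. This is a minimal sketch: the file names are placeholders, and detokenised system output plus the official reference (one segment per line) are assumed.

```python
# Minimal sketch of scoring a system output with SacreBLEU (Post, 2018).
# File names are hypothetical placeholders.
import sacrebleu

with open("newstest2019.en-gu.hyp", encoding="utf-8") as f:
    hypotheses = [line.rstrip("\n") for line in f]
with open("newstest2019.en-gu.ref", encoding="utf-8") as f:
    references = [line.rstrip("\n") for line in f]

# Default tokenisation; for Chinese output one would select SacreBLEU's
# "zh" tokenisation instead.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}")
```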
2 Gujarati ↔ English

One of the main challenges for translation between English↔Gujarati is that it is a low-resource language pair; there is little openly available parallel data, and much of this data is domain-specific and/or noisy (cf. Section 2.1). Our aim was therefore to experiment with how additional available data can help us to improve translation quality: large quantities of monolingual text for both English and Gujarati, and resources from Hindi (a language related to Gujarati) in the form of monolingual Hindi data and a parallel Hindi-English corpus. We applied semi-supervised translation, backtranslation and pivoting techniques to create a large synthetic parallel corpus from these resources (Section 2.2), which we used to augment the small available parallel training corpus, enabling us to train our final supervised MT models (Section 2.3).

2.1 Data and pre-processing

We trained our models using only data listed for the task (cf. Table 2). Note that we did not have access to the corpora provided by the Technology Development for Indian Languages Programme, as they were only available to Indian citizens.

Table 2: Training data used for EN↔GU. Average length is the number of tokens per sentence; for the parallel corpora, it is calculated for the first language indicated (i.e. EN, GU, then EN).

Parallel data
Lang(s)   Corpus                 #sents    Ave. len.
EN-GU     Software data          107,637   7.0
EN-GU     Wikipedia              18,033    21.1
EN-GU     Wiki titles v1         11,671    2.1
EN-GU     Govin                  10,650    17.0
EN-GU     Bilingual dictionary   9,979     1.5
EN-GU     Bible                  7,807     26.4
EN-GU     Emille                 5,083     19.1
GU-HI     Emille                 7,993     19.1
EN-HI     IIT Bombay             1.4M      13.4

Monolingual data
Lang      Corpus                 #sents    Ave. len.
EN        News                   200M      23.6
GU        Commoncrawl            3.7M      21.9
GU        Emille                 0.9M      16.6
GU        Wiki-dump              0.4M      17.7
GU        News                   0.2M      15.4
HI        IIT Bombay             45.1M     18.7
HI        News                   23.6M     17.0

We pre-processed all data using standard scripts from the Moses toolkit (Koehn et al., 2007): normalisation, tokenisation, cleaning (of training data only, with a maximum sentence length of 80 tokens) and true-casing for English data, using a model trained on all available news data. The Gujarati data was additionally pre-tokenised using the IndicNLP tokeniser (anoopkunchukuttan.github.io/indic_nlp_library/) before Moses tokenisation was applied. We also applied subword segmentation using BPE (Sennrich et al., 2016b), with joint subword vocabularies, and experimented with different numbers of BPE operations during training.
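A minimal sketch of the pre-processing just described (Moses-style normalisation and tokenisation for English, IndicNLP pre-tokenisation for Gujarati, then joint BPE), assuming the sacremoses, indic-nlp-library and subword-nmt Python packages. The file paths and the 20k-merge codes file are hypothetical, and the released WMT19 scripts may differ in detail (true-casing and length-based cleaning are omitted here).

```python
# Sketch of the described pre-processing for English and Gujarati text.
from sacremoses import MosesPunctNormalizer, MosesTokenizer
from indicnlp.tokenize import indic_tokenize
from subword_nmt.apply_bpe import BPE

norm_en = MosesPunctNormalizer(lang="en")
tok_en = MosesTokenizer(lang="en")

def preprocess_en(line: str) -> str:
    # Normalise punctuation, then Moses-tokenise (true-casing omitted).
    return tok_en.tokenize(norm_en.normalize(line), return_str=True)

def preprocess_gu(line: str) -> str:
    # Pre-tokenise with the IndicNLP tokeniser before further processing.
    return " ".join(indic_tokenize.trivial_tokenize(line, lang="gu"))

# Apply a joint BPE model (e.g. 20k merge operations) learned on both sides.
with open("joint.bpe.20k.codes", encoding="utf-8") as codes:
    bpe = BPE(codes)

en = preprocess_en("The committee met in Gandhinagar on Monday.")
print(bpe.process_line(en))
```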
2.2 Creation of synthetic parallel data

Data augmentation techniques such as backtranslation (Sennrich et al., 2016a; Edunov et al., 2018), which can be used to produce additional synthetic parallel data from monolingual data, are standard in MT. However, they require a sufficiently good intermediate MT model to produce translations that are of reasonable quality to be useful for training (Hoang et al., 2018). This is extremely hard to achieve for this language pair. Our preliminary attempt at parallel-only training yielded a very low BLEU score of 7.8 on the GU→EN development set using a Nematus-trained shallow RNN with heavy regularisation (learning rate: 5 × 10^-4, word dropout (Gal and Ghahramani, 2016): 0.3, hidden state and embedding dropout: 0.5, batch tokens: 1000, BPE vocabulary threshold: 50, label smoothing: 0.2), and similar scores were found for a Moses phrase-based translation system. Our solution was to train models for the creation of synthetic data that exploit both monolingual and parallel data during training.

2.2.1 Semi-supervised MT with cross-lingual language model pre-training

We followed the unsupervised training approach of Lample and Conneau (2019) to train two MT systems, one for EN↔GU and a second for HI→GU, using the code available at https://github.com/facebookresearch/XLM. This involves training unsupervised NMT models with an additional supervised MT training step. Initialisation of the models is done by pre-training parameters using a masked language modelling objective as in BERT (Devlin et al., 2019), individually for each language (MLM, masked language modelling) and/or cross-lingually (TLM, translation language modelling). The TLM objective is the MLM objective applied to the concatenation of parallel sentences. See Lample and Conneau (2019) for more details.

2.2.2 EN and GU backtranslation

We trained a single MT model for both language directions EN→GU and GU→EN using this approach. For pre-training we used all available data in Table 2 (both the parallel and monolingual datasets) with MLM and TLM objectives. The same data was then used to train the semi-supervised MT model, which achieved a BLEU score of 22.1 for GU→EN and 12.6 for EN→GU on the dev set (see the first row in Table 5). This model was used to backtranslate 7.3M monolingual English news sentences into Gujarati and 5.1M monolingual Gujarati sentences into English. (We were unable to translate all available monolingual data due to time constraints and limits on GPU resources.)

System and training details: We use default architectures for both pre-training and translation: 6 layers with 8 transformer heads and embedding dimensions of 1024. Training parameters are also as per the default: batch size of 32, dropout and attention dropout of 0.1, and Adam optimisation (Kingma and Ba, 2015) with a learning rate of 0.0001.

Degree of subword segmentation: We tested the impact of varying degrees of subword segmentation on translation quality (see Figure 1). Contrary to our expectation that a higher degree of segmentation (i.e. with a very small number of merge operations) would produce better results, as is often the case with very low-resource pairs, the best tested value was 20k joint BPE operations. The reason for this could be the extremely limited shared vocabulary between the two languages (except for occasional Arabic numerals and romanised proper names in Gujarati texts), or that training on large quantities of monolingual data turns the low-resource task into a higher-resource one.

[Figure 1: The effect of the number of subword merge operations (2k, 5k, 10k, 20k, 50k and 80k) on EN→GU BLEU score over training iterations, calculated on the newsdev2019 dataset.]

2.2.3 HI→GU translation

Transliteration of Hindi to Gujarati script: We first transliterated all of the Hindi characters into Gujarati characters to encourage vocabulary sharing. As there are slightly more Hindi Unicode characters than Gujarati ones, Hindi characters with no corresponding Gujarati character, as well as all non-Hindi characters, were simply copied across. Once transliterated, there is a high degree of overlap between the transliterated Hindi (HG) and the corresponding Gujarati sentence, which is demonstrated by the example in Figure 2.
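The Devanagari (U+0900–U+097F) and Gujarati (U+0A80–U+0AFF) Unicode blocks are laid out in parallel, so most Hindi characters can be mapped to their Gujarati counterparts by a fixed code-point offset, with everything else copied through as described above. The sketch below illustrates this general idea; it is not the authors' actual transliteration script, and the sample input is a placeholder.

```python
# Sketch of offset-based Devanagari → Gujarati transliteration with copy-through.
# Characters whose shifted code point is unassigned, and all non-Devanagari
# characters, are copied unchanged, as described in Section 2.2.3.
import unicodedata

DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F
OFFSET = 0x0A80 - 0x0900  # distance between the two parallel Unicode blocks

def deva_to_gujr(text: str) -> str:
    out = []
    for ch in text:
        cp = ord(ch)
        if DEVANAGARI_START <= cp <= DEVANAGARI_END:
            candidate = chr(cp + OFFSET)
            # Only map if the shifted code point is actually assigned.
            if unicodedata.name(candidate, ""):
                out.append(candidate)
                continue
        out.append(ch)  # no Gujarati counterpart or non-Hindi character: copy
    return "".join(out)

print(deva_to_gujr("नमस्ते"))  # placeholder Hindi input
```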
Our parallel Gujarati-Hindi data consisted of approximately 8,000 sentences from the Emille corpus. After transliterating the Hindi, we found that 9% of Hindi tokens (excluding punctuation and English words) were an exact match to the corresponding Gujarati tokens. However, we did have access to large quantities of monolingual data in both Gujarati and Hindi (see Table 2), which we pre-processed in the same way.

The semi-supervised HI↔GU system was trained using the MLM pre-training objective described in Section 2.2.1 and the same model architecture as the EN↔GU model in Section 2.2.2. For the MT step, we trained on 6.5k parallel sentences, reserving the remaining 1.5k as a development set. As with the EN↔GU model, we investigated the effect of different BPE settings (5k, 10k, 20k and 40k merge operations) on translation quality. Surprisingly, just as with EN↔GU, 20k BPE operations performed best (cf. Table 3), and so we used the model trained in this setting to translate the Hindi side of the IIT Bombay English-Hindi Corpus, which we refer to as HI2GU-EN.

Table 3: The influence of the number of BPE merge operations on HI→GU translation quality, measured in BLEU on the development set.

BPE operations   5k     10k    20k    40k
BLEU             15.4   16.0   16.3   14.6

2.2.4 Finalisation of training data

The final training data for each model was the concatenation of this parallel data, the HI2GU-EN
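As a rough illustration of the BPE sweeps behind Figure 1 and Table 3, joint codes with different numbers of merge operations can be learned and applied with the subword-nmt package along these lines. The file names are hypothetical and this is a sketch under those assumptions, not the pipeline actually used; one model per setting would then be trained and compared on the development set.

```python
# Sketch of learning and applying joint BPE codes with varying merge operations.
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

MERGE_OPERATIONS = [5_000, 10_000, 20_000, 40_000]  # values tested for HI→GU

for num_ops in MERGE_OPERATIONS:
    codes_path = f"hi-gu.joint.bpe.{num_ops}.codes"
    # Learn joint codes on the concatenation of both sides of the training data.
    with open("train.hi-gu.joint.tok", encoding="utf-8") as infile, \
         open(codes_path, "w", encoding="utf-8") as outfile:
        learn_bpe(infile, outfile, num_symbols=num_ops)

    # Segment one side of the training data with the learned codes.
    with open(codes_path, encoding="utf-8") as codes:
        bpe = BPE(codes)
    with open("train.gu.tok", encoding="utf-8") as src, \
         open(f"train.gu.bpe{num_ops}", "w", encoding="utf-8") as out:
        for line in src:
            out.write(bpe.process_line(line))
```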