The University of Edinburgh's Submissions to the WMT19 News Translation Task

Rachel Bawden, Nikolay Bogoychev, Ulrich Germann, Roman Grundkiewicz, Faheem Kirefu, Antonio Valerio Miceli Barone, Alexandra Birch
School of Informatics, University of Edinburgh, Scotland
rachel.bawden@ed.ac.uk

Published in: Proceedings of the Fourth Conference on Machine Translation (WMT), Volume 2: Shared Task Papers, pages 302–314, Florence, Italy, August 1–2, 2019. Association for Computational Linguistics. DOI: 10.18653/v1/W19-5304

Abstract

The University of Edinburgh participated in the WMT19 Shared Task on News Translation in six language directions: English↔Gujarati, English↔Chinese, German→English, and English→Czech. For all translation directions, we created or used back-translations of monolingual data in the target language as additional synthetic training data. For English↔Gujarati, we also explored semi-supervised MT with cross-lingual language model pre-training, and translation pivoting through Hindi. For translation to and from Chinese, we investigated character-based tokenisation vs. sub-word segmentation of Chinese text. For German→English, we studied the impact of vast amounts of back-translated training data on translation quality, gaining a few additional insights over Edunov et al. (2018). For English→Czech, we compared different pre-processing and tokenisation regimes.

1 Introduction

The University of Edinburgh participated in the WMT19 Shared Task on News Translation in six language directions: English-Gujarati (EN↔GU), English-Chinese (EN↔ZH), German-English (DE→EN) and English-Czech (EN→CS). All our systems are neural machine translation (NMT) systems trained in constrained data conditions with the Marian toolkit (Junczys-Dowmunt et al., 2018; https://marian-nmt.github.io). The different language pairs pose very different challenges, due to the characteristics of the languages involved and, arguably more importantly, due to the amount of training data available.

Pre-processing: For EN↔ZH, we investigate character-level pre-processing for Chinese compared with subword segmentation. For EN→CS, we show that it is possible in high-resource settings to simplify pre-processing by removing steps.

Exploiting non-parallel resources: For all language directions, we create additional, synthetic parallel training data. For the high-resource language pairs, we look at ways of effectively using large quantities of backtranslated data. For example, for DE→EN, we investigated the most effective way of combining genuine parallel data with larger quantities of synthetic parallel data, and for CS→EN, we filter backtranslated data by re-scoring translations using the MT model for the opposite direction. The challenge for our low-resource pair, EN↔GU, is producing sufficiently good models for backtranslation, which we achieve by training semi-supervised MT models with cross-lingual language model pre-training (Lample and Conneau, 2019). We use the same technique to translate additional data from a related language, Hindi.

NMT training settings: In all experiments, we test state-of-the-art training techniques, including using ultra-large mini-batches for DE→EN and EN↔ZH, implemented as optimiser delay.

Results summary: Automatic evaluation results for all final systems on the WMT19 test set are summarised in Table 1. Throughout the paper, BLEU is calculated using SacreBLEU (Post, 2018; https://github.com/mjpost/sacreBLEU) unless otherwise indicated. A selection of our final models and running scripts is available to download from data.statmt.org/wmt19_systems/.

Table 1: Final BLEU scores and system rankings amongst constrained systems according to automatic evaluation metrics.

Lang. direction   BLEU   Ranking
EN→GU             16.4   1
GU→EN             21.4   2
EN→ZH             34.4   7
ZH→EN             27.7   6
DE→EN             35.0   9
EN→CS             27.9   3
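Since all reported scores are computed with SacreBLEU, results such as those in Table 1 can in principle be reproduced along the following lines. This is a minimal sketch: the file names are placeholders, and detokenised system output plus the official reference (one segment per line) are assumed.

```python
# Minimal sketch of scoring a system output with SacreBLEU (Post, 2018).
# File names are hypothetical placeholders.
import sacrebleu

with open("newstest2019.en-gu.hyp", encoding="utf-8") as f:
    hypotheses = [line.rstrip("\n") for line in f]
with open("newstest2019.en-gu.ref", encoding="utf-8") as f:
    references = [line.rstrip("\n") for line in f]

# Default tokenisation; for Chinese output one would select SacreBLEU's
# "zh" tokenisation instead.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}")
```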
2 Gujarati ↔ English

One of the main challenges for translation between English↔Gujarati is that it is a low-resource language pair; there is little openly available parallel data, and much of this data is domain-specific and/or noisy (cf. Section 2.1). Our aim was therefore to experiment with how additional available data can help us to improve translation quality: large quantities of monolingual text for both English and Gujarati, and resources from Hindi (a language related to Gujarati) in the form of monolingual Hindi data and a parallel Hindi-English corpus. We applied semi-supervised translation, backtranslation and pivoting techniques to create a large synthetic parallel corpus from these resources (Section 2.2), which we used to augment the small available parallel training corpus, enabling us to train our final supervised MT models (Section 2.3).

2.1 Data and pre-processing

We trained our models using only data listed for the task (cf. Table 2). Note that we did not have access to the corpora provided by the Technology Development for Indian Languages Programme, as they were only available to Indian citizens.

Table 2: Training data used for EN↔GU. Average length is the number of tokens per sentence; for the parallel corpora, it is calculated for the first language indicated (i.e. EN, GU, then EN).

Parallel data
Lang(s)   Corpus                 #sents    Ave. len.
EN-GU     Software data          107,637   7.0
EN-GU     Wikipedia              18,033    21.1
EN-GU     Wiki titles v1         11,671    2.1
EN-GU     Govin                  10,650    17.0
EN-GU     Bilingual dictionary   9,979     1.5
EN-GU     Bible                  7,807     26.4
EN-GU     Emille                 5,083     19.1
GU-HI     Emille                 7,993     19.1
EN-HI     IIT Bombay             1.4M      13.4

Monolingual data
Lang      Corpus                 #sents    Ave. len.
EN        News                   200M      23.6
GU        Commoncrawl            3.7M      21.9
GU        Emille                 0.9M      16.6
GU        Wiki-dump              0.4M      17.7
GU        News                   0.2M      15.4
HI        IIT Bombay             45.1M     18.7
HI        News                   23.6M     17.0

We pre-processed all data using standard scripts from the Moses toolkit (Koehn et al., 2007): normalisation, tokenisation, cleaning (of training data only, with a maximum sentence length of 80 tokens) and true-casing for English data, using a model trained on all available news data. The Gujarati data was additionally pre-tokenised using the IndicNLP tokeniser (anoopkunchukuttan.github.io/indic_nlp_library/) before Moses tokenisation was applied. We also applied subword segmentation using BPE (Sennrich et al., 2016b), with joint subword vocabularies, and experimented with different numbers of BPE operations during training.
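A minimal sketch of the pre-processing just described (Moses-style normalisation and tokenisation for English, IndicNLP pre-tokenisation for Gujarati, then joint BPE), assuming the sacremoses, indic-nlp-library and subword-nmt Python packages. The file paths and the 20k-merge codes file are hypothetical, and the released WMT19 scripts may differ in detail (true-casing and length-based cleaning are omitted here).

```python
# Sketch of the described pre-processing for English and Gujarati text.
from sacremoses import MosesPunctNormalizer, MosesTokenizer
from indicnlp.tokenize import indic_tokenize
from subword_nmt.apply_bpe import BPE

norm_en = MosesPunctNormalizer(lang="en")
tok_en = MosesTokenizer(lang="en")

def preprocess_en(line: str) -> str:
    # Normalise punctuation, then Moses-tokenise (true-casing omitted).
    return tok_en.tokenize(norm_en.normalize(line), return_str=True)

def preprocess_gu(line: str) -> str:
    # Pre-tokenise with the IndicNLP tokeniser before further processing.
    return " ".join(indic_tokenize.trivial_tokenize(line, lang="gu"))

# Apply a joint BPE model (e.g. 20k merge operations) learned on both sides.
with open("joint.bpe.20k.codes", encoding="utf-8") as codes:
    bpe = BPE(codes)

en = preprocess_en("The committee met in Gandhinagar on Monday.")
print(bpe.process_line(en))
```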
2.2 Creation of synthetic parallel data

Data augmentation techniques such as backtranslation (Sennrich et al., 2016a; Edunov et al., 2018), which can be used to produce additional synthetic parallel data from monolingual data, are standard in MT. However, they require a sufficiently good intermediate MT model to produce translations that are of reasonable quality to be useful for training (Hoang et al., 2018). This is extremely hard to achieve for this language pair. Our preliminary attempt at parallel-only training yielded a very low BLEU score of 7.8 on the GU→EN development set using a Nematus-trained shallow RNN with heavy regularisation (learning rate: 5 × 10^-4, word dropout (Gal and Ghahramani, 2016): 0.3, hidden state and embedding dropout: 0.5, batch tokens: 1000, BPE vocabulary threshold: 50, label smoothing: 0.2), and similar scores were found for a Moses phrase-based translation system. Our solution was to train models for the creation of synthetic data that exploit both monolingual and parallel data during training.

2.2.1 Semi-supervised MT with cross-lingual language model pre-training

We followed the unsupervised training approach of Lample and Conneau (2019) to train two MT systems, one for EN↔GU and a second for HI→GU, using the code available at https://github.com/facebookresearch/XLM. This involves training unsupervised NMT models with an additional supervised MT training step. Initialisation of the models is done by pre-training parameters using a masked language modelling objective as in BERT (Devlin et al., 2019), individually for each language (MLM, masked language modelling) and/or cross-lingually (TLM, translation language modelling). The TLM objective is the MLM objective applied to the concatenation of parallel sentences. See Lample and Conneau (2019) for more details.

2.2.2 EN and GU backtranslation

We trained a single MT model for both language directions EN→GU and GU→EN using this approach. For pre-training we used all available data in Table 2 (both the parallel and monolingual datasets) with MLM and TLM objectives. The same data was then used to train the semi-supervised MT model, which achieved a BLEU score of 22.1 for GU→EN and 12.6 for EN→GU on the dev set (see the first row in Table 5). This model was used to backtranslate 7.3M monolingual English news sentences into Gujarati and 5.1M monolingual Gujarati sentences into English. (We were unable to translate all available monolingual data due to time constraints and limits on GPU resources.)

System and training details: We use default architectures for both pre-training and translation: 6 layers with 8 transformer heads and embedding dimensions of 1024. Training parameters are also as per the default: batch size of 32, dropout and attention dropout of 0.1, and Adam optimisation (Kingma and Ba, 2015) with a learning rate of 0.0001.

Degree of subword segmentation: We tested the impact of varying degrees of subword segmentation on translation quality (see Figure 1). Contrary to our expectation that a higher degree of segmentation (i.e. with a very small number of merge operations) would produce better results, as is often the case with very low-resource pairs, the best tested value was 20k joint BPE operations. The reason for this could be the extremely limited shared vocabulary between the two languages (except for occasional Arabic numerals and romanised proper names in Gujarati texts), or that training on large quantities of monolingual data turns the low-resource task into a higher-resource one.

[Figure 1: The effect of the number of subword merge operations (2k, 5k, 10k, 20k, 50k and 80k) on EN→GU BLEU score over training iterations, calculated on the newsdev2019 dataset.]

2.2.3 HI→GU translation

Transliteration of Hindi to Gujarati script: We first transliterated all of the Hindi characters into Gujarati characters to encourage vocabulary sharing. As there are slightly more Hindi Unicode characters than Gujarati ones, Hindi characters with no corresponding Gujarati character, as well as all non-Hindi characters, were simply copied across. Once transliterated, there is a high degree of overlap between the transliterated Hindi (HG) and the corresponding Gujarati sentence, which is demonstrated by the example in Figure 2.
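The Devanagari (U+0900–U+097F) and Gujarati (U+0A80–U+0AFF) Unicode blocks are laid out in parallel, so most Hindi characters can be mapped to their Gujarati counterparts by a fixed code-point offset, with everything else copied through as described above. The sketch below illustrates this general idea; it is not the authors' actual transliteration script, and the sample input is a placeholder.

```python
# Sketch of offset-based Devanagari → Gujarati transliteration with copy-through.
# Characters whose shifted code point is unassigned, and all non-Devanagari
# characters, are copied unchanged, as described in Section 2.2.3.
import unicodedata

DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F
OFFSET = 0x0A80 - 0x0900  # distance between the two parallel Unicode blocks

def deva_to_gujr(text: str) -> str:
    out = []
    for ch in text:
        cp = ord(ch)
        if DEVANAGARI_START <= cp <= DEVANAGARI_END:
            candidate = chr(cp + OFFSET)
            # Only map if the shifted code point is actually assigned.
            if unicodedata.name(candidate, ""):
                out.append(candidate)
                continue
        out.append(ch)  # no Gujarati counterpart or non-Hindi character: copy
    return "".join(out)

print(deva_to_gujr("नमस्ते"))  # placeholder Hindi input
```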
Our parallel Gujarati-Hindi data consisted of approximately 8,000 sentences from the Emille corpus. After transliterating the Hindi, we found that 9% of Hindi tokens (excluding punctuation and English words) were an exact match to the corresponding Gujarati tokens. However, we did have access to large quantities of monolingual data in both Gujarati and Hindi (see Table 2), which we pre-processed in the same way.

The semi-supervised HI↔GU system was trained using the MLM pre-training objective described in Section 2.2.1 and the same model architecture as the EN↔GU model in Section 2.2.2. For the MT step, we trained on 6.5k parallel sentences, reserving the remaining 1.5k as a development set. As with the EN↔GU model, we investigated the effect of different BPE settings (5k, 10k, 20k and 40k merge operations) on translation quality. Surprisingly, just as with EN↔GU, 20k BPE operations performed best (cf. Table 3), and so we used the model trained in this setting to translate the Hindi side of the IIT Bombay English-Hindi Corpus, which we refer to as HI2GU-EN.

Table 3: The influence of the number of BPE merge operations on HI→GU translation quality, measured in BLEU on the development set.

BPE operations   5k     10k    20k    40k
BLEU             15.4   16.0   16.3   14.6

2.2.4 Finalisation of training data

The final training data for each model was the concatenation of this parallel data, the HI2GU-EN
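As a rough illustration of the BPE sweeps behind Figure 1 and Table 3, joint codes with different numbers of merge operations can be learned and applied with the subword-nmt package along these lines. The file names are hypothetical and this is a sketch under those assumptions, not the pipeline actually used; one model per setting would then be trained and compared on the development set.

```python
# Sketch of learning and applying joint BPE codes with varying merge operations.
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

MERGE_OPERATIONS = [5_000, 10_000, 20_000, 40_000]  # values tested for HI→GU

for num_ops in MERGE_OPERATIONS:
    codes_path = f"hi-gu.joint.bpe.{num_ops}.codes"
    # Learn joint codes on the concatenation of both sides of the training data.
    with open("train.hi-gu.joint.tok", encoding="utf-8") as infile, \
         open(codes_path, "w", encoding="utf-8") as outfile:
        learn_bpe(infile, outfile, num_symbols=num_ops)

    # Segment one side of the training data with the learned codes.
    with open(codes_path, encoding="utf-8") as codes:
        bpe = BPE(codes)
    with open("train.gu.tok", encoding="utf-8") as src, \
         open(f"train.gu.bpe{num_ops}", "w", encoding="utf-8") as out:
        for line in src:
            out.write(bpe.process_line(line))
```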