ARAGPT2: Pre-Trained Transformer for Arabic Language Generation

Wissam Antoun, Fady Baly, and Hazem Hajj
American University of Beirut
{wfa07, fbg06, hh63}@aub.edu.lb

Abstract

Recently, pre-trained transformer-based architectures have proven to be very efficient at language modeling and understanding, given that they are trained on a large enough corpus. Applications in language generation for Arabic are still lagging in comparison to other NLP advances, primarily due to the lack of advanced Arabic language generation models. In this paper, we develop the first advanced Arabic language generation model, ARAGPT2, trained from scratch on a large Arabic corpus of internet text and news articles. Our largest model, ARAGPT2-MEGA, has 1.46 billion parameters, which makes it the largest Arabic language model available. The MEGA model was evaluated and showed success on different tasks, including synthetic news generation and zero-shot question answering. For text generation, our best model achieves a perplexity of 29.8 on held-out Wikipedia articles. A study conducted with human evaluators showed the significant success of ARAGPT2-MEGA in generating news articles that are difficult to distinguish from articles written by humans. We therefore develop and release an automatic discriminator model with 98% accuracy in detecting model-generated text. The models are also publicly available (pretrained variants of ARAGPT2 in the base, medium, large, and mega sizes, together with the discriminator, at github.com/aub-mind/arabert/tree/master/aragpt2), hoping to encourage new research directions and applications for Arabic NLP.

1 Introduction

A few years ago, natural language processing (NLP) was revolutionized by the introduction of the multi-head self-attention transformer architecture (Vaswani et al., 2017). The transformer achieved superior performance compared to recurrent neural networks on several NLP tasks, including machine translation, sentence classification with BERT (Devlin et al., 2019) and ELECTRA (Clark et al., 2020b), and sentence completion with GPT-2 (Radford et al., 2019), GROVER (Zellers et al., 2019), and CTRL (Keskar et al., 2019). Recent works have shown that larger models pre-trained on larger datasets can further improve performance, e.g. RoBERTa (Liu et al., 2019) and XLM-R (Conneau et al., 2019).

On the other hand, work on Arabic language modeling has mostly targeted natural language understanding (NLU) by pre-training transformer-based models using the Masked Language Modeling (MLM) task, e.g. ARABERT (Antoun et al., 2020a). In contrast, Arabic text generation, or causal language modeling, has not received much attention. A few works, such as hULMonA (ElJundi et al., 2019), used next-word prediction as a pre-training task for transfer learning in Arabic text classification. Khooli (2020) and Doiron (2020) leveraged the existing English GPT-2 model and adapted it for Arabic using text from the Arabic Wikipedia dumps, which is sub-optimal for Arabic.

In this paper, we develop the first advanced language generation models built from the ground up for the Arabic language. We describe the process of pre-training ARAGPT2, a GPT-2 transformer model for Arabic. The model comes in four size variants: base (135M parameters), medium (370M), large (792M), and mega (1.46B), which allows the exploration of ARAGPT2 in multiple applications with different data availability and computational constraints. The perplexity measure is used to automatically evaluate ARAGPT2. Furthermore, a human-based evaluation is provided, which highlights the ability of ARAGPT2 to deceive human evaluators. Finally, an ARAELECTRA (Antoun et al., 2020b) based detector is developed and released, and is able to consistently identify news articles written by ARAGPT2. Making such powerful models publicly available to the Arabic research community enables research in rising Arabic NLP fields, e.g. conversational agents (Naous et al., 2020) and the detection of automatically generated news (Harrag et al., 2020).

Our contributions can be summarized as follows:
• A methodology to pre-train a billion-parameter GPT-2 model on a large-scale Arabic corpus.
• An automatic discriminator that achieves 98% accuracy in detecting model-generated synthetic text.
• The four variants of ARAGPT2 are released on popular NLP libraries, along with the automatic ARAGPT2 discriminator (a minimal usage sketch is given below).
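To illustrate what releasing the models "on popular NLP libraries" means in practice, the following is a minimal, hedged generation sketch using the HuggingFace Transformers library. The hub identifier aubmindlab/aragpt2-base, the prompt, and the decoding settings are assumptions made for illustration only; the authoritative model names are listed in the repository linked in the abstract.

```python
# Minimal generation sketch (assumed hub ID and decoding settings).
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "aubmindlab/aragpt2-base"  # assumed identifier; check the release repo
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "يذكر أن"  # a short Arabic news-style prompt
inputs = tokenizer(prompt, return_tensors="pt")

# Sample a continuation; hyper-parameters here are illustrative, not the paper's.
output_ids = model.generate(
    **inputs,
    max_length=100,
    do_sample=True,
    top_p=0.95,
    repetition_penalty=3.0,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

The base and medium variants follow the standard GPT-2 architecture, so a generic causal-LM loader should suffice; the GROVER-based large and mega variants may require the model code shipped with the release.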
The rest of the paper is structured as follows. Section 2 provides a concise review of previous literature on Arabic language modeling. Section 3 details the methodology used in developing ARAGPT2. Section 4 describes the experimental setup, evaluation procedures, and results. In addition, the approach used to build a machine-generated text discriminator is presented in Section 5. Finally, a conclusion of the work and its implications are given in Section 6.

2 Related Works

2.1 English and Non-Arabic Language Modeling

GPT-1 (Radford et al., 2018) showed that causal language modeling (CLM), the regular language modeling objective in which the model learns the probability of a word given the previous context (the acronym is used to distinguish it from masked language modeling), is an effective pre-training technique that improves a model's generalization capabilities. GPT-2 then showed that using a larger model trained on a larger dataset surpasses the state of the art on many tasks in a zero-shot setting, where a model solves a task without receiving any training on that task. Taking the scaling approach to the extreme led to the creation of GPT-3 (Brown et al., 2020), a 175-billion-parameter model, also trained with CLM on terabytes of internet text. GPT-3 explored the idea of few-shot learning, where a model is given examples from a new task as a text prompt, which unlocks new capabilities at test time. It was later shown that a carefully designed GPT-3 prompt allows the model to generate website designs, scramble and unscramble words, and more.

The advantage of scaling model sizes and training datasets comes with drawbacks, particularly the high computational cost, in addition to the huge corpora required for pre-training. It was estimated that training GPT-2 and GPT-3 costs $43K and $4.6M respectively, without any hyper-parameter tuning. These drawbacks restricted the availability of large pre-trained models mainly to English and a handful of other languages, e.g. ruGPT3 (github.com/sberbank-ai/ru-gpts) for Russian and a Chinese 1.5B GPT-2 (Zhang, 2019).

2.2 Arabic Language Modeling

Work on Arabic causal language modeling has been mostly limited to automatic speech recognition (ASR) systems, since the language modeling component in an ASR system is a key module that ensures that the output text adheres to the statistical structure of the language. Work on Arabic language models in ASR systems has mostly relied on N-gram language models. Ali et al. (2014) built an N-gram language model (LM) using GALE training data transcripts of 1.4M words. More recent work in Arabic ASR implemented a recurrent neural network as an LM, using 130M tokens, and achieved a perplexity of 481 compared to 436 for a 4-gram LM (Khurana et al., 2019). Hamed et al. (2017) developed a code-switched Arabic-English language model using a tri-gram LM and achieved performance superior to two separate monolingual LMs. The code-switched LM was trained on 2.3M sentences, or 13M words, and achieved a perplexity of 275.

With the rising popularity of transfer learning in NLP, Arabic CLM was used as the pre-training task for an Arabic universal LM, hULMonA (ElJundi et al., 2019). The model was then fine-tuned on different downstream text classification tasks. hULMonA is a stack of 3 AWD-LSTM (ASGD Weight-Dropped LSTM) layers (Howard and Ruder, 2018), trained on 600K Wikipedia articles pre-segmented using the MADAMIRA Arabic morphological analyzer and disambiguator (Pasha et al., 2014).

Masked Language Modeling (MLM) has been useful as a pre-training task for several Arabic NLU models. MLM is a slightly different objective from CLM: it requires a system to predict a masked word within a sequence, whereas CLM predicts the missing word at the end of a sequence.
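To make the contrast concrete, the two objectives and the perplexity metric used later in Section 4.2 can be written in their standard textbook form (this formulation is not reproduced from the paper). For a token sequence x = (x_1, ..., x_N), CLM factorizes the sequence probability left to right, MLM predicts a set M of masked positions from the corrupted sequence, and perplexity is the exponentiated average negative log-likelihood under the causal factorization:

```latex
% Causal language modeling (GPT-style) loss
\mathcal{L}_{\mathrm{CLM}} = -\sum_{i=1}^{N} \log p_\theta\!\left(x_i \mid x_{<i}\right)

% Masked language modeling (BERT-style) loss over the masked positions M
\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in M} \log p_\theta\!\left(x_i \mid \tilde{x}\right)

% Perplexity of held-out text under a causal LM (lower is better)
\mathrm{PPL}(x) = \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \log p_\theta\!\left(x_i \mid x_{<i}\right)\right)
```

ARAGPT2 is trained with the CLM loss, and the perplexities reported in Table 2 and Section 4.2 are computed on held-out Wikipedia text under this same factorization.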
MLM was used in models such as ARABERT (Antoun et al., 2020a), Arabic-BERT (Safaya et al., 2020), Arabic-ALBERT (github.com/KUIS-AI-Lab/Arabic-ALBERT), GigaBERT (Lan et al., 2020), MarBERT (Abdul-Mageed et al., 2020), and QARiB (Chowdhury et al., 2020). Only two works have attempted to create an Arabic transformer causal language model. Khooli (2020) and Doiron (2020) fine-tuned the OpenAI GPT2-base model, which was mainly trained on English text, on Arabic Wikipedia. Doiron (2020) also continued training on a collection of dialectal Arabic datasets, in order to create a dialectal Arabic GPT-2. While this approach has shown the capability to generate Arabic text, it is sub-optimal for Arabic and is mainly useful in cases where the training data is scarce.

Our proposed model is hence the first Arabic transformer-based causal language model trained from scratch on the largest Arabic corpora available at the time of writing.

3 ARAGPT2: Methodology

ARAGPT2 is a stacked transformer-decoder model trained using the causal language modeling objective. The model is trained on 77GB of Arabic text. ARAGPT2 comes in four variants, as detailed in Table 1, with the smallest model, base, having the same size as ARABERT-base, which makes it accessible to the larger part of researchers. The larger model variants (medium, large, mega) offer improved performance but are harder to fine-tune and computationally more expensive. The ARAGPT2 detector is based on the pre-trained ARAELECTRA model fine-tuned on the synthetically generated dataset. More details on the training procedure and dataset are provided in the following sections.

3.1 Model

ARAGPT2 closely follows GPT-2's variant architectures and training procedure. Table 1 shows the model size, embedding size, number of heads, number of layers, parameter count, and optimizer used for each model variant. All models are trained with a context size of 1024 tokens. The LAMB (You et al., 2019) optimizer is used for the base and medium models only, since it allows using large batch sizes without worrying about training divergence. Using LAMB or Adam (Kingma and Ba, 2014) to train the large and mega variants is not possible on TPUv3 due to the optimizers' high memory requirements, since their memory cost scales linearly with the number of parameters. These limitations were overcome by following the training procedure of the GROVER model (Zellers et al., 2019) and using the Adafactor optimizer (Shazeer and Stern, 2018), which reduces memory requirements by factoring the second-order momentum parameters into a tensor product of two vectors.
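For reference, a simplified sketch of Adafactor's factored second-moment estimate, as described by Shazeer and Stern (2018), is given below for an n x m weight matrix with gradient G_t. The epsilon terms, update clipping, and relative step size of the original method are omitted, so this is an outline of the idea rather than the exact update used in training:

```latex
% Exponential moving averages of row and column sums of the squared gradient
R_t = \hat{\beta}_2 R_{t-1} + (1 - \hat{\beta}_2)\,(G_t \odot G_t)\,\mathbf{1}_m          % size n
C_t = \hat{\beta}_2 C_{t-1} + (1 - \hat{\beta}_2)\,\mathbf{1}_n^{\top}(G_t \odot G_t)     % size m

% Rank-1 reconstruction of the second moment and the resulting update
\hat{V}_t = \frac{R_t\, C_t}{\mathbf{1}_n^{\top} R_t}, \qquad
\theta_t = \theta_{t-1} - \alpha_t\, \frac{G_t}{\sqrt{\hat{V}_t}}
```

Storing only the n + m accumulator entries per weight matrix, instead of the n x m entries Adam or LAMB would require, is what keeps the memory footprint manageable for the large and mega variants.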
The GROVER architecture, in which the order of layer normalization within the transformer block is changed, was also used for these variants instead of GPT-2's.

Table 1: ARAGPT2 model variants with sizes, architecture, and optimizer

Model    Size    Architecture  Context Size  Emb. Size  Heads  Layers  Optimizer
Base     135M    GPT2          1024          768        12     12      LAMB
Medium   370M    GPT2          1024          1024       16     24      LAMB
Large    792M    GROVER        1024          1280       20     36      Adafactor
Mega     1.46B   GROVER        1024          1536       24     48      Adafactor

3.2 Dataset

The training dataset is a collection of the publicly available Arabic corpora listed below:
• The unshuffled OSCAR corpus (Ortiz Suárez et al., 2020).
• The Arabic Wikipedia dump from September 2020.
• The 1.5B words Arabic Corpus (El-Khair, 2016).
• The OSIAN corpus (Zeroual et al., 2019).
• News articles provided by the As-Safir newspaper.

Preprocessing. First, the corpus was filtered by removing short documents with fewer than 3 sentences, as well as documents with more than 20% repeated sentences. URLs, emails, and user mentions were replaced with special tokens. All diacritics and elongations were removed, while punctuation and non-alphabetic characters were padded with white-spaces. Moreover, the <|endoftext|> token is appended at the end of each document. The total dataset size is 77GB with 8.8B words (the word count was done after preprocessing, where white space is inserted before and after punctuation, brackets, numbers, etc., which increases the total word count). The majority of the training data consists of Arabic news articles, which are mostly written in MSA. The corpus also contains a small set of English words, e.g. named entities, which are kept without lower-casing. Subsequently, a byte-level byte-pair-encoding (BPE) tokenizer is trained with a 64,000-token vocabulary on the entire preprocessed dataset, using the optimized BPE implementation from the HuggingFace library (Wolf et al., 2020). Finally, the BPE encoding is applied to the preprocessed dataset, which results in a total of 9.7M training examples with 1024 sub-word tokens each.
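The preprocessing and tokenizer-training scripts are not reproduced in the paper; the sketch below illustrates the steps described above under stated assumptions. The regular expressions, the special-token strings, and the corpus file path are hypothetical placeholders rather than the authors' actual choices; only the HuggingFace tokenizers calls reflect that library's public API.

```python
# Sketch of the cleanup and tokenizer-training steps described in Section 3.2.
# Placeholders (special-token strings, regexes, file paths) are illustrative only.
import re
from tokenizers import ByteLevelBPETokenizer

URL_TOKEN, EMAIL_TOKEN, USER_TOKEN = "[رابط]", "[بريد]", "[مستخدم]"  # assumed tokens
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")  # harakat and tatweel (elongation)


def keep_document(text: str) -> bool:
    """Keep documents with at least 3 sentences and at most 20% repeated sentences."""
    sents = [s.strip() for s in re.split(r"[.!?؟]", text) if s.strip()]
    if len(sents) < 3:
        return False
    return (len(sents) - len(set(sents))) / len(sents) <= 0.20


def clean_document(text: str) -> str:
    """Replace URLs/emails/mentions, strip diacritics, pad punctuation with spaces."""
    text = re.sub(r"https?://\S+|www\.\S+", URL_TOKEN, text)
    text = re.sub(r"\S+@\S+\.\S+", EMAIL_TOKEN, text)
    text = re.sub(r"@\w+", USER_TOKEN, text)
    text = DIACRITICS.sub("", text)
    text = re.sub(r"([^\w\s])", r" \1 ", text)   # pad punctuation/symbols with spaces
    return re.sub(r"\s+", " ", text).strip() + " <|endoftext|>"


# Train a byte-level BPE tokenizer with a 64k vocabulary on the cleaned corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["cleaned_corpus.txt"],        # assumed path to the preprocessed text
    vocab_size=64000,
    special_tokens=["<|endoftext|>"],
)
tokenizer.save_model("aragpt2-tokenizer")
```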
4 Experiments and Evaluation

4.1 Pre-training Setup

All models were trained on a TPUv3-128 slice (2TB of HBM memory in total, 16GB per core; TPU access was freely provided by the TFRC program), with different batch sizes and total numbers of steps, as shown in Table 2. Base and mega were trained for approximately 20 epochs, while medium and large were trained for 10 and 6 epochs respectively, due to TPU access limitations.

Table 2: ARAGPT2 training details and validation perplexity. *Medium was trained on a TPUv3-8 with a small batch size, since the model was not converging with a large batch size.

Model     Batch Size  Learning Rate  Steps  Time (days)  PPL
Base      1792        1.27e-3        120K   1.5          55.8
Medium*   80          3e-4           1M     23           45.7
Large     256         1e-4           220K   3            36.6
Mega      256         1e-4           780K   9            29.8

4.2 Numerical Evaluation

For the validation dataset, Arabic Wikipedia articles published after August 2020 were used, since older articles were included in the September Wikipedia dump. The perplexity score was selected as the numerical evaluation metric, since it measures the degree of 'uncertainty' a model has when assigning probabilities to the test text. Table 2 shows that, unsurprisingly, validation perplexity keeps improving with larger model sizes. In fact, the model is still under-fitting the validation set from Wikipedia. The generation capabilities of the different variants of ARAGPT2 are illustrated through the selected examples in Appendix A.

4.3 Zero-Shot Evaluation

During zero-shot task evaluation, the model is only given a natural language instruction to motivate and ground the task, without any back-propagation taking place. The task of searching for and finding the best input prompt, also known as "prompt engineering", is hard, since the search space is practically infinite and the performance is highly sensitive to changes in the prompt. The zero-shot performance of ARAGPT2-MEGA is evaluated on two tasks: question answering and translation. ARAGPT2-MEGA correctly answers 25% of the trivia questions but fails in English-to-Arabic translation. Details on the datasets, prompts, and evaluation are presented in Appendix B.

4.4 Evaluating the Human Ability to Detect Machine-Generated Text

The gold standard for evaluating a model's language generation capability is human evaluation. We presented 74 Arabic-speaking subjects, recruited through various social media, with a survey designed to test the average human's ability to distinguish between machine-generated and human-written text, and thus the model's ability to deceive a human subject. The survey had a total of 8 news articles: 4 machine-generated using ARAGPT2-MEGA and 4 written by humans. Each category was split into long and short texts, which allows us to test long-term generation coherency. In addition, the human evaluators were allowed to add a justification for each answer.

The survey results (Figure 1) show that ARAGPT2-MEGA successfully fooled approximately 60% of the respondents, with longer passages having a higher error rate than short passages. In the provided explanations, some subjects relied on punctuation mistakes, coherence, and repetition issues, while others spotted factual inaccuracies. However, the results also show that humans misclassified human-written text 50% of the time (chance-level performance), while also citing factual inconsistencies, grammatical errors, and unusual writing styles (the survey results are available on our GitHub repository). These surprising results show that ARAGPT2 can accurately generate human-like text while