ARAGPT2: Pre-Trained Transformer for Arabic Language Generation

Wissam Antoun, Fady Baly, and Hazem Hajj
American University of Beirut
{wfa07, fbg06, hh63}@aub.edu.lb

Abstract

Recently, pre-trained transformer-based architectures have proven to be very efficient at language modeling and understanding, given that they are trained on a large enough corpus. Applications in language generation for Arabic are still lagging in comparison to other NLP advances, primarily due to the lack of advanced Arabic language generation models. In this paper, we develop the first advanced Arabic language generation model, ARAGPT2, trained from scratch on a large Arabic corpus of internet text and news articles. Our largest model, ARAGPT2-MEGA, has 1.46 billion parameters, which makes it the largest Arabic language model available. The MEGA model was evaluated and showed success on different tasks, including synthetic news generation and zero-shot question answering. For text generation, our best model achieves a perplexity of 29.8 on held-out Wikipedia articles. A study conducted with human evaluators showed the significant success of ARAGPT2-MEGA in generating news articles that are difficult to distinguish from articles written by humans. We therefore develop and release an automatic discriminator model with 98% accuracy in detecting model-generated text. The models are also publicly available (pretrained variants of ARAGPT2 in the base, medium, large, and mega sizes, together with the discriminator, at github.com/aub-mind/arabert/tree/master/aragpt2), hoping to encourage new research directions and applications for Arabic NLP.

1 Introduction

A few years ago, natural language processing (NLP) was revolutionized by the introduction of the multi-head self-attention transformer architecture (Vaswani et al., 2017). The transformer achieved superior performance compared to recurrent neural networks on several NLP tasks, including machine translation, sentence classification with BERT (Devlin et al., 2019) and ELECTRA (Clark et al., 2020b), and sentence completion with GPT-2 (Radford et al., 2019), GROVER (Zellers et al., 2019), and CTRL (Keskar et al., 2019). Recent works have shown that larger models pre-trained on larger datasets can further improve performance, e.g. RoBERTa (Liu et al., 2019) and XLM-R (Conneau et al., 2019).

On the other hand, work on Arabic language modeling has mostly targeted natural language understanding (NLU) by pre-training transformer-based models using the Masked Language Modeling (MLM) task, e.g. ARABERT (Antoun et al., 2020a). In contrast, Arabic text generation, or causal language modeling, has not received much attention. A few works, such as hULMonA (ElJundi et al., 2019), used next-word prediction as a pre-training task for transfer learning in Arabic text classification. Khooli (2020) and Doiron (2020) leveraged the existing English GPT-2 model and adapted it for Arabic using text from the Arabic Wikipedia dumps, which is sub-optimal for Arabic.

In this paper, we develop the first advanced language generation models built from the ground up for the Arabic language. We describe the process of pre-training ARAGPT2, a GPT-2 transformer model for Arabic. The model comes in four size variants: base (135M parameters), medium (370M), large (792M), and mega (1.46B), which allows the exploration of ARAGPT2 in multiple applications with different data availability and computational constraints. The perplexity measure is used to automatically evaluate ARAGPT2. Furthermore, a human-based evaluation is provided, which highlights the ability of ARAGPT2 to deceive human evaluators. Finally, an ARAELECTRA (Antoun et al., 2020b) based detector is developed and released, and is able to consistently identify news articles written by ARAGPT2. Making such powerful models publicly available to the Arabic research community enables research in rising Arabic NLP fields, e.g. conversational agents (Naous et al., 2020) and the detection of automatically generated news (Harrag et al., 2020).

Our contributions can be summarized as follows:
• A methodology to pre-train a billion-parameter GPT-2 model on a large-scale Arabic corpus.
• An automatic discriminator that achieves 98% accuracy in detecting model-generated synthetic text.
• The four variants of ARAGPT2 are released on popular NLP libraries, along with the automatic ARAGPT2 discriminator (a minimal usage sketch is given below).
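To illustrate what releasing the models "on popular NLP libraries" means in practice, the following is a minimal, hedged generation sketch using the HuggingFace Transformers library. The hub identifier aubmindlab/aragpt2-base, the prompt, and the decoding settings are assumptions made for illustration only; the authoritative model names are listed in the repository linked in the abstract.

```python
# Minimal generation sketch (assumed hub ID and decoding settings).
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "aubmindlab/aragpt2-base"  # assumed identifier; check the release repo
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "يذكر أن"  # a short Arabic news-style prompt
inputs = tokenizer(prompt, return_tensors="pt")

# Sample a continuation; hyper-parameters here are illustrative, not the paper's.
output_ids = model.generate(
    **inputs,
    max_length=100,
    do_sample=True,
    top_p=0.95,
    repetition_penalty=3.0,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

The base and medium variants follow the standard GPT-2 architecture, so a generic causal-LM loader should suffice; the GROVER-based large and mega variants may require the model code shipped with the release.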
The rest of the paper is structured as follows. Section 2 provides a concise review of previous literature on Arabic language modeling. Section 3 details the methodology used in developing ARAGPT2. Section 4 describes the experimental setup, evaluation procedures, and results. In addition, the approach used to build a machine-generated text discriminator is presented in Section 5. Finally, a conclusion of the work and its implications are given in Section 6.

2 Related Works

2.1 English and Non-Arabic Language Modeling

GPT-1 (Radford et al., 2018) showed that causal language modeling (CLM), the regular language modeling objective in which the model learns the probability of a word given the previous context (the acronym is used to distinguish it from masked language modeling), is an effective pre-training technique that improves a model's generalization capabilities. GPT-2 then showed that using a larger model trained on a larger dataset surpasses the state of the art on many tasks in a zero-shot setting, where a model solves a task without receiving any training on that task. Taking the scaling approach to the extreme led to the creation of GPT-3 (Brown et al., 2020), a 175-billion-parameter model, also trained with CLM on terabytes of internet text. GPT-3 explored the idea of few-shot learning, where a model is given examples from a new task as a text prompt, which unlocks new capabilities at test time. It was later shown that a carefully designed GPT-3 prompt allows the model to generate website designs, scramble and unscramble words, and more.

The advantage of scaling model sizes and training datasets comes with drawbacks, particularly the high computational cost, in addition to the huge corpora required for pre-training. It was estimated that training GPT-2 and GPT-3 costs $43K and $4.6M respectively, without any hyper-parameter tuning. These drawbacks restricted the availability of large pre-trained models mainly to English and a handful of other languages, e.g. ruGPT3 (github.com/sberbank-ai/ru-gpts) for Russian and a Chinese 1.5B GPT-2 (Zhang, 2019).

2.2 Arabic Language Modeling

Work on Arabic causal language modeling has been mostly limited to automatic speech recognition (ASR) systems, since the language modeling component in an ASR system is a key module that ensures that the output text adheres to the statistical structure of the language. Work on Arabic language models in ASR systems has mostly relied on N-gram language models. Ali et al. (2014) built an N-gram language model (LM) using GALE training data transcripts of 1.4M words. More recent work in Arabic ASR implemented a recurrent neural network as an LM, using 130M tokens, and achieved a perplexity of 481 compared to 436 for a 4-gram LM (Khurana et al., 2019). Hamed et al. (2017) developed a code-switched Arabic-English language model using a tri-gram LM and achieved performance superior to two separate monolingual LMs. The code-switched LM was trained on 2.3M sentences, or 13M words, and achieved a perplexity of 275.

With the rising popularity of transfer learning in NLP, Arabic CLM was used as the pre-training task for an Arabic universal LM, hULMonA (ElJundi et al., 2019). The model was then fine-tuned on different downstream text classification tasks. hULMonA is a stack of 3 AWD-LSTM (ASGD Weight-Dropped LSTM) layers (Howard and Ruder, 2018), trained on 600K Wikipedia articles pre-segmented using the MADAMIRA Arabic morphological analyzer and disambiguator (Pasha et al., 2014).

Masked Language Modeling (MLM) has been useful as a pre-training task for several Arabic NLU models. MLM is a slightly different objective from CLM: it requires a system to predict a masked word within a sequence, whereas CLM predicts the missing word at the end of a sequence.
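To make the contrast concrete, the two objectives and the perplexity metric used later in Section 4.2 can be written in their standard textbook form (this formulation is not reproduced from the paper). For a token sequence x = (x_1, ..., x_N), CLM factorizes the sequence probability left to right, MLM predicts a set M of masked positions from the corrupted sequence, and perplexity is the exponentiated average negative log-likelihood under the causal factorization:

```latex
% Causal language modeling (GPT-style) loss
\mathcal{L}_{\mathrm{CLM}} = -\sum_{i=1}^{N} \log p_\theta\!\left(x_i \mid x_{<i}\right)

% Masked language modeling (BERT-style) loss over the masked positions M
\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in M} \log p_\theta\!\left(x_i \mid \tilde{x}\right)

% Perplexity of held-out text under a causal LM (lower is better)
\mathrm{PPL}(x) = \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \log p_\theta\!\left(x_i \mid x_{<i}\right)\right)
```

ARAGPT2 is trained with the CLM loss, and the perplexities reported in Table 2 and Section 4.2 are computed on held-out Wikipedia text under this same factorization.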
MLM was used in models such as ARABERT (Antoun et al., 2020a), Arabic-BERT (Safaya et al., 2020), Arabic-ALBERT (github.com/KUIS-AI-Lab/Arabic-ALBERT), GigaBERT (Lan et al., 2020), MarBERT (Abdul-Mageed et al., 2020), and QARiB (Chowdhury et al., 2020). Only two works have attempted to create an Arabic transformer causal language model. Khooli (2020) and Doiron (2020) fine-tuned the OpenAI GPT2-base model, which was mainly trained on English text, on Arabic Wikipedia. Doiron (2020) also continued training on a collection of dialectal Arabic datasets, in order to create a dialectal Arabic GPT-2. While this approach has shown the capability to generate Arabic text, it is sub-optimal for Arabic and is mainly useful in cases where the training data is scarce.

Our proposed model is hence the first Arabic transformer-based causal language model trained from scratch on the largest Arabic corpora available at the time of writing.

3 ARAGPT2: Methodology

ARAGPT2 is a stacked transformer-decoder model trained using the causal language modeling objective. The model is trained on 77GB of Arabic text. ARAGPT2 comes in four variants, as detailed in Table 1, with the smallest model, base, having the same size as ARABERT-base, which makes it accessible to the larger part of researchers. The larger model variants (medium, large, mega) offer improved performance but are harder to fine-tune and computationally more expensive. The ARAGPT2 detector is based on the pre-trained ARAELECTRA model fine-tuned on the synthetically generated dataset. More details on the training procedure and dataset are provided in the following sections.

3.1 Model

ARAGPT2 closely follows GPT-2's variant architectures and training procedure. Table 1 shows the model size, embedding size, number of heads, number of layers, parameter count, and optimizer used for each model variant. All models are trained with a context size of 1024 tokens. The LAMB (You et al., 2019) optimizer is used for the base and medium models only, since it allows using large batch sizes without worrying about training divergence. Using LAMB or Adam (Kingma and Ba, 2014) to train the large and mega variants is not possible on TPUv3 due to the optimizers' high memory requirements, since their memory cost scales linearly with the number of parameters. These limitations were overcome by following the training procedure of the GROVER model (Zellers et al., 2019) and using the Adafactor optimizer (Shazeer and Stern, 2018), which reduces memory requirements by factoring the second-order momentum parameters into a tensor product of two vectors.
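For reference, a simplified sketch of Adafactor's factored second-moment estimate, as described by Shazeer and Stern (2018), is given below for an n x m weight matrix with gradient G_t. The epsilon terms, update clipping, and relative step size of the original method are omitted, so this is an outline of the idea rather than the exact update used in training:

```latex
% Exponential moving averages of row and column sums of the squared gradient
R_t = \hat{\beta}_2 R_{t-1} + (1 - \hat{\beta}_2)\,(G_t \odot G_t)\,\mathbf{1}_m          % size n
C_t = \hat{\beta}_2 C_{t-1} + (1 - \hat{\beta}_2)\,\mathbf{1}_n^{\top}(G_t \odot G_t)     % size m

% Rank-1 reconstruction of the second moment and the resulting update
\hat{V}_t = \frac{R_t\, C_t}{\mathbf{1}_n^{\top} R_t}, \qquad
\theta_t = \theta_{t-1} - \alpha_t\, \frac{G_t}{\sqrt{\hat{V}_t}}
```

Storing only the n + m accumulator entries per weight matrix, instead of the n x m entries Adam or LAMB would require, is what keeps the memory footprint manageable for the large and mega variants.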
The GROVER architecture, in which the order of layer normalization within the transformer block is changed, was also used for these variants instead of GPT-2's.

Table 1: ARAGPT2 model variants with sizes, architecture, and optimizer

Model    Size    Architecture  Context Size  Emb. Size  Heads  Layers  Optimizer
Base     135M    GPT2          1024          768        12     12      LAMB
Medium   370M    GPT2          1024          1024       16     24      LAMB
Large    792M    GROVER        1024          1280       20     36      Adafactor
Mega     1.46B   GROVER        1024          1536       24     48      Adafactor

3.2 Dataset

The training dataset is a collection of the publicly available Arabic corpora listed below:
• The unshuffled OSCAR corpus (Ortiz Suárez et al., 2020).
• The Arabic Wikipedia dump from September 2020.
• The 1.5B words Arabic Corpus (El-Khair, 2016).
• The OSIAN corpus (Zeroual et al., 2019).
• News articles provided by the As-Safir newspaper.

Preprocessing. First, the corpus was filtered by removing short documents with fewer than 3 sentences, as well as documents with more than 20% repeated sentences. URLs, emails, and user mentions were replaced with special tokens. All diacritics and elongations were removed, while punctuation and non-alphabetic characters were padded with white-spaces. Moreover, the <|endoftext|> token is appended at the end of each document. The total dataset size is 77GB with 8.8B words (the word count was done after preprocessing, where white space is inserted before and after punctuation, brackets, numbers, etc., which increases the total word count). The majority of the training data consists of Arabic news articles, which are mostly written in MSA. The corpus also contains a small set of English words, e.g. named entities, which are kept without lower-casing. Subsequently, a byte-level byte-pair-encoding (BPE) tokenizer is trained with a 64,000-token vocabulary on the entire preprocessed dataset, using the optimized BPE implementation from the HuggingFace library (Wolf et al., 2020). Finally, the BPE encoding is applied to the preprocessed dataset, which results in a total of 9.7M training examples with 1024 sub-word tokens each.
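The preprocessing and tokenizer-training scripts are not reproduced in the paper; the sketch below illustrates the steps described above under stated assumptions. The regular expressions, the special-token strings, and the corpus file path are hypothetical placeholders rather than the authors' actual choices; only the HuggingFace tokenizers calls reflect that library's public API.

```python
# Sketch of the cleanup and tokenizer-training steps described in Section 3.2.
# Placeholders (special-token strings, regexes, file paths) are illustrative only.
import re
from tokenizers import ByteLevelBPETokenizer

URL_TOKEN, EMAIL_TOKEN, USER_TOKEN = "[رابط]", "[بريد]", "[مستخدم]"  # assumed tokens
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")  # harakat and tatweel (elongation)


def keep_document(text: str) -> bool:
    """Keep documents with at least 3 sentences and at most 20% repeated sentences."""
    sents = [s.strip() for s in re.split(r"[.!?؟]", text) if s.strip()]
    if len(sents) < 3:
        return False
    return (len(sents) - len(set(sents))) / len(sents) <= 0.20


def clean_document(text: str) -> str:
    """Replace URLs/emails/mentions, strip diacritics, pad punctuation with spaces."""
    text = re.sub(r"https?://\S+|www\.\S+", URL_TOKEN, text)
    text = re.sub(r"\S+@\S+\.\S+", EMAIL_TOKEN, text)
    text = re.sub(r"@\w+", USER_TOKEN, text)
    text = DIACRITICS.sub("", text)
    text = re.sub(r"([^\w\s])", r" \1 ", text)   # pad punctuation/symbols with spaces
    return re.sub(r"\s+", " ", text).strip() + " <|endoftext|>"


# Train a byte-level BPE tokenizer with a 64k vocabulary on the cleaned corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["cleaned_corpus.txt"],        # assumed path to the preprocessed text
    vocab_size=64000,
    special_tokens=["<|endoftext|>"],
)
tokenizer.save_model("aragpt2-tokenizer")
```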
4 Experiments and Evaluation

4.1 Pre-training Setup

All models were trained on a TPUv3-128 slice (2TB of HBM memory in total, 16GB per core; TPU access was freely provided by the TFRC program), with different batch sizes and total numbers of steps, as shown in Table 2. Base and mega were trained for approximately 20 epochs, while medium and large were trained for 10 and 6 epochs respectively, due to TPU access limitations.

Table 2: ARAGPT2 training details and validation perplexity. *Medium was trained on a TPUv3-8 with a small batch size, since the model was not converging with a large batch size.

Model     Batch Size  Learning Rate  Steps  Time (days)  PPL
Base      1792        1.27e-3        120K   1.5          55.8
Medium*   80          3e-4           1M     23           45.7
Large     256         1e-4           220K   3            36.6
Mega      256         1e-4           780K   9            29.8

4.2 Numerical Evaluation

For the validation dataset, Arabic Wikipedia articles published after August 2020 were used, since older articles were included in the September Wikipedia dump. The perplexity score was selected as the numerical evaluation metric, since it measures the degree of 'uncertainty' a model has when assigning probabilities to the test text. Table 2 shows that, unsurprisingly, validation perplexity keeps improving with larger model sizes. In fact, the model is still under-fitting the validation set from Wikipedia. The generation capabilities of the different variants of ARAGPT2 are illustrated through the selected examples in Appendix A.

4.3 Zero-Shot Evaluation

During zero-shot task evaluation, the model is only given a natural language instruction to motivate and ground the task, without any back-propagation taking place. The task of searching for and finding the best input prompt, also known as "prompt engineering", is hard, since the search space is practically infinite and the performance is highly sensitive to changes in the prompt. The zero-shot performance of ARAGPT2-MEGA is evaluated on two tasks: question answering and translation. ARAGPT2-MEGA correctly answers 25% of the trivia questions but fails in English-to-Arabic translation. Details on the datasets, prompts, and evaluation are presented in Appendix B.

4.4 Evaluating the Human Ability to Detect Machine-Generated Text

The gold standard for evaluating a model's language generation capability is human evaluation. We presented 74 Arabic-speaking subjects, recruited through various social media, with a survey designed to test the average human's ability to distinguish between machine-generated and human-written text, and thus the model's ability to deceive a human subject. The survey had a total of 8 news articles: 4 machine-generated using ARAGPT2-MEGA and 4 written by humans. Each category was split into long and short texts, which allows us to test long-term generation coherency. In addition, the human evaluators were allowed to add a justification for each answer.

The survey results (Figure 1) show that ARAGPT2-MEGA successfully fooled approximately 60% of the respondents, with longer passages having a higher error rate than short passages. In the provided explanations, some subjects relied on punctuation mistakes, coherence, and repetition issues, while others spotted factual inaccuracies. However, the results also show that humans misclassified human-written text 50% of the time (chance-level performance), while also citing factual inconsistencies, grammatical errors, and unusual writing styles (the survey results are available on our GitHub repository). These surprising results show that ARAGPT2 can accurately generate human-like text while