Automatic Difficulty Classification of Arabic Sentences

Nouran Khallaf, Serge Sharoff
School of Languages, University of Leeds
Leeds, LS2 9JT, United Kingdom
mlnak,s.sharoff@leeds.ac.uk

Abstract

In this paper, we present a Modern Standard Arabic (MSA) sentence difficulty classifier, which predicts the difficulty of sentences for language learners using either the CEFR proficiency levels or a binary classification as simple or complex. We compare the use of sentence embeddings of different kinds (fastText, mBERT, XLM-R and Arabic-BERT), as well as traditional language features such as POS tags, dependency trees, readability scores and frequency lists for language learners. Our best results have been achieved using fine-tuned Arabic-BERT. Our 3-way CEFR classification reaches F-1 of 0.80 with Arabic-BERT and 0.75 with XLM-R, and 0.71 Spearman correlation for regression. Our binary difficulty classifier reaches F-1 of 0.94, and F-1 of 0.98 for the sentence-pair semantic similarity classifier.

1 Introduction

In the last century, measuring text readability (TR) has been undertaken in education, psychology, and linguistics. There appears to be some agreement that TR is the quality of a given text that makes it easy for its readers to comprehend in adequate time and with reasonable effort (Cavalli-Sforza et al., 2018). Research to date has tended to focus on assigning readability levels to whole texts rather than to individual sentences, despite the fact that any text is composed of a number of sentences which vary in their difficulty (Schumacher et al., 2016). Assigning a readability level to a text is a challenging task, and it is even more challenging at the sentence level, as much less information is available. Also, sentence difficulty is influenced by many parameters, such as genre or topic, as well as grammatical structures, which need to be combined in a single classifier. Difficulty assessment at the sentence level is a more challenging task in comparison to the better-researched text-level task, but the availability of a sentence readability classifier for Arabic is vital, since this is a prerequisite for research on automatic text simplification (ATS), i.e. the process of reducing the linguistic complexity of a text while maintaining its meaning (Saggion, 2017).

We focus here on experiments aimed at measuring to what extent a sentence is understandable by a reader, such as a learner of Arabic as a foreign language, and at exploring different methods for readability assessment. The main aim of this paper lies in developing and testing different sentence representation methodologies, which range from using linguistic knowledge via feature-based machine learning to modern neural methods.

In summary, the contributions of this paper are:

1. We compiled a novel dataset for training on the sentence level;
2. We developed a range of linguistic features, including POS, syntax and frequency information;
3. We evaluated a range of different sentence embedding approaches, such as fastText, BERT and XLM-R, and compared them to the linguistic features;
4. We cast the readability assessment as a regression problem as well as a classification problem;
5. Our model is the first sentence difficulty system available for Arabic.

2 Corpora and Tools

2.1 Dataset One: Sentence-level annotation

This dataset was used for Arabic sentence difficulty classification. We started building our own dataset by compiling a corpus from three available sources classified for readability on the document level, along with a large Arabic corpus obtained by web crawling.
The first corpus source is the reading section of the Gloss Corpus[1] developed by the Defense Language Institute (DLI). It has been treated as a gold standard and used in the most recent studies on document-level predictions (Forsyth, 2014; Saddiki et al., 2015; Nassiri et al., 2018a,b). Texts in Gloss have been annotated on the six-level scale of the Inter-Agency Language Roundtable (ILR), which has been matched to the CEFR levels according to the schema introduced by Tschirner et al. (2015). Gloss is divided according to four competence areas (lexical, structural, socio-cultural and discursive) and ten different genres (culture, economy, politics, environment, geography, military, science, security, society, and technology).

[1] https://gloss.dliflc.edu/

The second corpus source is the ALC, which consists of Arabic written texts produced by learners of Arabic in Saudi Arabia, collected by Alfaifi and Atwell (2013). Each text file is annotated with the proficiency level of the student. We mapped these student proficiency levels to CEFR levels.

Our third corpus source comes from the textbook "Al-Kitaab fii TaAallum al-Arabiyya" (Brustad et al., 2015): we used texts and sentences from parts one and two of the third edition, but only texts from part three of the third edition. This book is widely used for teaching Arabic as a second language. These texts were originally classified according to the American Council on the Teaching of Foreign Languages (ACTFL) guidelines, which we mapped to CEFR levels.

As these corpora have been annotated on the document level and not on the sentence level, we assigned each sentence to the level of the document in which it appears, using several filtering heuristics, such as sentence length and containment, as well as re-annotation through machine learning; see the dataset cleaning procedure below.

A counterpart corpus of texts not produced with language learners in mind is provided by I-AR, 75,630 Arabic web pages collected by wide crawling (Sharoff, 2006). A random snapshot of 8627 sentences longer than 15 words was used to overcome the limited number of C-level sentences available from the corpora for language learners.

Table 1 shows the distribution of the number of sentences and tokens used per Common European Framework of Reference (CEFR) level:

CEFR     Old: S    Old: T     New: S    New: T
A        8661      187225     9030      195343
B        5532      126805     5083      117825
C        8627      287275     8627      287275
Total    22820     601305     22740     600443

Table 1: Sentences (S) and tokens (T) available per CEFR level in the two versions of the corpus.

In principle we have data for 5-way (A1, A2, B1, etc.), 3-way (A, B or C) and binary (A+B vs C) classification tasks, but in this presentation we focus on the 3-way and binary (simple vs complex) classification tasks.

Dataset cleaning: In our initial experiments we noticed unreliable sentence-level assignments in the training corpus. Therefore, we decided to improve the quality of the training corpus by the error analysis strategy introduced by Di Bari et al. (2014), which is based on detecting agreement between classifiers belonging to different machine learning paradigms. The cases in which the majority of the classifiers agreed on predicting a label while the gold standard was different were inspected manually by a specialist in teaching Arabic. In our dataset cleaning experiment we used the following classifiers: SVM (with the RBF kernel), Random Forest, KNeighbors, Softmax and XGBoost, with the linguistic features discussed in Section 3; we trained them via cross-validation and compared their majority vote to the gold standard.

We modified the error classification tags introduced by Di Bari et al. (2014) as follows:

Wrong: the classifiers have wrongly labelled the data, and the gold standard is correct.
Modify: the classifiers are correct and we need to modify the gold standard.
Ambiguous: we consider either label possible, depending on the perspective taken.
False: an added label which represents disagreement between the gold standard and the classifiers, when neither is correct.

For each sentence, five different predictions are assigned. Compared to the gold-standard CEFR label, the classifiers agreed with the gold standard in predicting 10204 instances. What we then need to consider are the cases when all classifiers agree on the predicted label while it contradicts the gold standard; the classifiers agreed on 1943 such sentences. We manually investigated random sentences and assigned the error classification tags. We found that the main classification confusion was in Level B instances. The analysis results in Table 4 show the distribution of categories in which each error type occurred. In the end, 380 instances had to be reassigned to a lower level (usually from B to A).
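A minimal sketch of this cleaning step is given below. It is an illustration rather than the authors' released code: the feature matrix X and the integer-encoded gold labels y are assumed to be built from the features of Section 3, plain logistic regression stands in for the "Softmax" classifier, and the strict unanimity check mirrors the case described above where all five classifiers contradict the gold label.

```python
# A sketch of the agreement-based cleaning step: obtain out-of-fold
# predictions from five classifiers of different ML paradigms, then flag
# sentences where all of them agree on a label that contradicts the gold
# standard. X (feature matrix) and y (integer-encoded gold CEFR labels) are
# assumed to be built elsewhere; LogisticRegression stands in for "Softmax".
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

def flag_suspect_labels(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    classifiers = [
        SVC(kernel="rbf"),
        RandomForestClassifier(),
        KNeighborsClassifier(),
        LogisticRegression(max_iter=1000),
        XGBClassifier(),
    ]
    # Cross-validated predictions, so every sentence is labelled by models
    # that never saw it during training.
    preds = np.stack([cross_val_predict(clf, X, y, cv=5) for clf in classifiers])
    unanimous = (preds == preds[0]).all(axis=0)  # all five predict the same label
    suspects = unanimous & (preds[0] != y)       # ...and it contradicts the gold label
    return np.where(suspects)[0]                 # indices to inspect manually
```

The returned indices correspond to the sentences that a specialist in teaching Arabic would then inspect by hand.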
2.2 Dataset Two: Simplification examples

A set of simple/complex parallel sentences was compiled from the internationally acclaimed Arabic novel "Saaq al-Bambuu" (Al-Sanousi, 2013), which has an authorised simplified version for students of Arabic as a second language (Familiar, 2016). We assume that a successful classifier should be able to detect sentences in the original text that require simplification. Dataset Two consists of 2980 parallel sentence pairs (Table 2).

Level           Sentences   Tokens
Simple (A+B)    2980        34447
Complex (C)     2980        46521
Total           5960        80968

Table 2: Number of sentences and tokens per level in Dataset Two.

3 Features and extraction methods

We work with the following groups of features, listed in Table 3: part-of-speech tagging features (POS features); syntactic structure features (Syntactic features); CEFR-level lexical features; and sentence embeddings.

3.1 Linguistic features

While the sentence-level classification task is novel, we borrowed some features from previous studies of text-level readability (Forsyth, 2014; Saddiki et al., 2015; Nassiri et al., 2018a,b). We decided to exclude sentence length from the feature set, as it creates an artificial skew in understanding what is difficult: more difficult writing styles are often associated with longer sentences, but it is not the sentence length itself that makes them difficult. Specifically, many long Arabic sentences contain shorter ones connected by conjunctions such as 'و /wa/ = and'; in the experience of language teachers, such sentences do not present problems for learners.

3.1.1 The POS features

These features (Table 3, features 1-21) represent the distribution of different word categories in the sentence and the morpho-syntactic properties of these words. According to Knowles and Don (2004), Arabic lemmatisation, unlike that of English, is an essential process for analysing Arabic text, because it is a methodology for dictionary construction. Therefore, we used the Lemma/Type ratio instead of the Word/Type ratio. We added features representing the different verb types (pseudo-verbs, passive verbs, perfective verbs, imperfective verbs and 3rd-person verbs). As conjunction is one of the important features for representing sentence complexity in Arabic (Forsyth, 2014), we used the annotated discourse connectors introduced by Alsaif (2012), splitting this list into 23 simple connectors and 56 complex connectors, referring to non-discourse connectors and discourse connectors respectively. For POS feature extraction we used MADAMIRA, a robust Arabic morphological analyser and part-of-speech tagger (Pasha et al., 2014).
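As an illustration of how such rates can be computed, the sketch below derives a few of the POS features from tagged output. This is a minimal sketch under stated assumptions: the Token record and the two connector sets are hypothetical stand-ins for the MADAMIRA output format and the Alsaif (2012) connector lists, neither of which is reproduced here.

```python
# A minimal sketch of a few of the POS features (1-21 in Table 3), computed
# as token rates at the sentence level. The Token record and the connector
# sets are hypothetical stand-ins for the MADAMIRA output and the
# Alsaif (2012) lists, whose exact formats are not shown in the paper.
from dataclasses import dataclass

@dataclass
class Token:
    form: str   # surface word form
    lemma: str  # lemma from the morphological analyser
    pos: str    # coarse POS tag, e.g. "noun", "verb", "conj", "punc"

SIMPLE_CONNECTORS: set = set()   # the 23 simple connectors would go here
COMPLEX_CONNECTORS: set = set()  # the 56 complex connectors would go here

def pos_features(sent: list) -> dict:
    """Rates of selected token categories for one tagged sentence."""
    n = len(sent)

    def rate(pred):
        return sum(pred(t) for t in sent) / n

    return {
        "ttr_lemma": len({t.lemma for t in sent}) / n,                      # feature 3
        "noun_tokens": rate(lambda t: t.pos == "noun"),                     # feature 4
        "verb_tokens": rate(lambda t: t.pos == "verb"),                     # feature 5
        "conj_tokens": rate(lambda t: t.pos == "conj"),                     # feature 14
        "punc_tokens": rate(lambda t: t.pos == "punc"),                     # feature 18
        "simple_connectors": rate(lambda t: t.form in SIMPLE_CONNECTORS),   # feature 19
        "complex_connectors": rate(lambda t: t.form in COMPLEX_CONNECTORS), # feature 20
    }
```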
3.1.2 Syntactic features

Features 22-27 from Table 3 provide information about sentence structure and about the number and types of phrases. These features are derived from a dependency grammar analysis. Because dependency grammar is based on word-to-word relations, it assumes that the structure of a sentence consists of lexical items attached to each other by binary asymmetrical relations, known as dependency relations; these relations are more representative for this task. We used CamelParser (Shahrour et al., 2016), a system for Arabic syntactic dependency analysis, together with contextually disambiguated morphological features which rely on the MADAMIRA morphological analysis, for more robust results.

3.1.3 CEFR-level lexical features

Features 28-34 from Table 3 are used to assign each word in the sentence an appropriate CEFR level. For this, we created a new Arabic word list consisting of 8834 unique lemmas labelled with CEFR levels. This list is a combination of three frequency lists: 1) the Buckwalter and Parkinson 5000-word frequency list, based on a 30-million-word corpus of academic/non-academic and written/spoken texts (Buckwalter and Parkinson, 2014); 2) the KELLY list produced by the Kelly project (Kilgarriff et al., 2014), which directly mapped a frequency word list to the CEFR levels using numerous corpora and languages; 3) the word lists presented at the beginning of each chapter in "Al-Kitaab" (Brustad et al., 2015). Merging the lists and aligning them with the MADAMIRA lemmatiser led to our new wide-coverage Arabic frequency list, which can be used to predict difficulty as the entropy of the probability distribution of the labels in a sentence. The current list shows some consistency with the English Profile list in terms of the percentage of words allocated to each CEFR level.

POS features
 1  TTR of word forms
 2  Morphemes per word
 3  TTR of lemmas
 4  Noun tokens
 5  Verb tokens
 6  Adjective tokens
 7  Pseudo-verb tokens
 8  Passive verb tokens
 9  Perfective verb tokens
10  Imperfective verb tokens
11  3rd-person verbs / all verbs
12  Numeric adjective tokens
13  Comparative adjective tokens
14  Conjunction tokens
15  Subordinating conjunction tokens
16  Proper noun tokens
17  Pronoun tokens
18  Punctuation tokens
19  Simple connector tokens
20  Complex connector tokens
21  All sentence connector tokens

Syntactic features
22  Incidence of subjects
23  Incidence of objects
24  Incidence of modifier/root
25  Incidence of coordination
26  Average phrases per sentence
27  Average phrase depth

CEFR word features
28  Incidence of Level A1
29  Incidence of Level A2
30  Incidence of Level B1
31  Incidence of Level B2
32  Incidence of Level C1
33  Incidence of Level C2
34  Word entropy with respect to CEFR

35  Sentence embedding features

Table 3: The feature set (all measures are rates of tokens at the sentence level).
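The CEFR word features (28-34 in Table 3 above) can be made concrete with a short sketch. It assumes the 8834-lemma list is available as a plain lemma-to-level mapping and that the input is a non-empty list of lemmas; the name CEFR_LIST and the example entry are illustrative, not the authors' code.

```python
# A sketch of the CEFR-level lexical features (28-34 in Table 3): the rate of
# words at each CEFR level plus the entropy of the level distribution.
# CEFR_LIST stands in for the 8834-lemma list described above; its contents
# (and the romanised example) are illustrative.
import math
from collections import Counter

CEFR_LIST: dict = {}  # lemma -> CEFR level, e.g. {"kitAb": "A1", ...}
LEVELS = ("A1", "A2", "B1", "B2", "C1", "C2")

def cefr_features(lemmas: list) -> dict:
    """CEFR incidence and entropy features for a non-empty lemma list."""
    counts = Counter(CEFR_LIST[l] for l in lemmas if l in CEFR_LIST)
    covered = sum(counts.values())
    # Features 28-33: rate of tokens at each level.
    feats = {f"incidence_{lvl}": counts[lvl] / len(lemmas) for lvl in LEVELS}
    # Feature 34: entropy of the level distribution; it is higher when the
    # sentence mixes words from several difficulty levels.
    probs = [counts[lvl] / covered for lvl in LEVELS if covered and counts[lvl]]
    feats["cefr_entropy"] = -sum(p * math.log2(p) for p in probs)
    return feats
```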
3.2 Sentence embeddings

In addition to the 34 traditional features, we can represent sentences as embedding vectors using different neural models, as follows:

fastText: A straightforward way to create sentence representations is to take a weighted average of the word embeddings (WE) of the individual words, for example using fastText vectors produced with the fastText tool.[2] We used the Arabic ar.300.bin file, in which each word is represented by a 1D vector of 300 attributes, trained on Common Crawl and Wikipedia (Grave et al., 2018). We had to normalize the sentence vectors to have the same length with respect to dimensions. For this, we calculated the tf-idf weight of each word in the corpus and used these as weights. For a sentence $s = w_1 w_2 \ldots w_n$:

$\mathrm{Embed}[s] = \frac{1}{n} \sum_{i=1}^{n} \mathrm{tfidf}[w_i] \cdot \mathrm{Embed}[w_i]$   (1)

[2] https://fasttext.cc/docs/en/crawl-vectors.html

Universal Sentence Encoder (Yang et al., 2019): This model captures the meaning of word sequences rather than just individual words. It was designed mainly to be used at the sentence level: after sentence tokenization, it encodes a sentence into a 512-dimensional vector. We used the large version here.[3]

[3] https://tfhub.dev/google/universal-sentence-encoder-multilingual/1

Multilingual BERT (Devlin et al., 2018): Pre-trained transformer models have proved their ability to learn successful representations of language, inspired by the Transformer model presented in Vaswani et al. (2017).
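Returning to the fastText representation, Equation (1) can be realised in a few lines with the fastText Python bindings. This is a minimal sketch under stated assumptions: the file name follows the public crawl-vectors naming (cc.ar.300.bin), the tfidf mapping is precomputed on the training corpus, and "normalising the sentence vectors to the same length" is interpreted here as scaling to unit norm; none of this is the authors' released code.

```python
# A sketch of Equation (1): tf-idf-weighted averaging of fastText word
# vectors, followed by length normalisation. Assumes the pre-trained Arabic
# vectors have been downloaded, tfidf is a word -> weight mapping computed
# elsewhere, and words is a non-empty tokenised sentence.
import numpy as np
import fasttext

model = fasttext.load_model("cc.ar.300.bin")  # pre-trained 300-dim Arabic vectors

def sentence_embedding(words: list, tfidf: dict) -> np.ndarray:
    # Embed[s] = (1/n) * sum_i tfidf[w_i] * Embed[w_i]
    vecs = [tfidf.get(w, 1.0) * model.get_word_vector(w) for w in words]
    emb = np.mean(vecs, axis=0)
    # Scale to unit norm so every sentence vector has the same length.
    return emb / np.linalg.norm(emb)
```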