Automatic Difficulty Classification of Arabic Sentences

Nouran Khallaf, Serge Sharoff
School of Languages, University of Leeds
Leeds, LS2 9JT, United Kingdom
mlnak,s.sharoff@leeds.ac.uk

Abstract

In this paper, we present a Modern Standard Arabic (MSA) sentence difficulty classifier, which predicts the difficulty of sentences for language learners using either the CEFR proficiency levels or a binary classification as simple or complex. We compare the use of sentence embeddings of different kinds (fastText, mBERT, XLM-R and Arabic-BERT), as well as traditional language features such as POS tags, dependency trees, readability scores and frequency lists for language learners. Our best results have been achieved using fine-tuned Arabic-BERT. Our 3-way CEFR classification reaches F-1 of 0.80 with Arabic-BERT and 0.75 with XLM-R, and 0.71 Spearman correlation for regression. Our binary difficulty classifier reaches F-1 of 0.94, and F-1 of 0.98 for the sentence-pair semantic similarity classifier.

1 Introduction

In the last century, measuring text readability (TR) has been undertaken in education, psychology, and linguistics. There appears to be some agreement that TR is the quality of a given text that makes it easy for its readers to comprehend in adequate time and with reasonable effort (Cavalli-Sforza et al., 2018). Research to date has tended to focus on assigning readability levels to whole texts rather than to individual sentences, despite the fact that any text is composed of a number of sentences which vary in their difficulty (Schumacher et al., 2016). Assigning a readability level to a text is a challenging task, and it is even more challenging at the sentence level, as much less information is available. Also, sentence difficulty is influenced by many parameters, such as genre or topic, as well as grammatical structures, which need to be combined in a single classifier. Difficulty assessment at the sentence level is a more challenging task in comparison to the better-researched text-level task, but the availability of a sentence readability classifier for Arabic is vital, since this is a prerequisite for research on automatic text simplification (ATS), i.e. the process of reducing the linguistic complexity of a text while maintaining its meaning (Saggion, 2017).

We focus here on experiments aimed at measuring to what extent a sentence is understandable by a reader, such as a learner of Arabic as a foreign language, and at exploring different methods for readability assessment. The main aim of this paper lies in developing and testing different sentence representation methodologies, which range from using linguistic knowledge via feature-based machine learning to modern neural methods.

In summary, the contributions of this paper are:

1. We compiled a novel dataset for training on the sentence level;
2. We developed a range of linguistic features, including POS, syntax and frequency information;
3. We evaluated a range of different sentence embedding approaches, such as fastText, BERT and XLM-R, and compared them to the linguistic features;
4. We cast the readability assessment as a regression problem as well as a classification problem;
5. Our model is the first sentence difficulty system available for Arabic.

2 Corpora and Tools

2.1 Dataset One: Sentence-level annotation

This dataset was used for Arabic sentence difficulty classification. We started building our own dataset by compiling a corpus from three available sources classified for readability on the document level, along with a large Arabic corpus obtained by web crawling.
The first corpus source is the reading section of the Gloss Corpus[1] developed by the Defense Language Institute (DLI). It has been treated as a gold standard and used in the most recent studies on document-level predictions (Forsyth, 2014; Saddiki et al., 2015; Nassiri et al., 2018a,b). Texts in Gloss have been annotated on the six-level scale of the Inter-Agency Language Roundtable (ILR), which has been matched to the CEFR levels according to the schema introduced by Tschirner et al. (2015). Gloss is divided according to four competence areas (lexical, structural, socio-cultural and discursive) and ten different genres (culture, economy, politics, environment, geography, military, science, security, society, and technology).

[1] https://gloss.dliflc.edu/

The second corpus source is the ALC, which consists of Arabic written texts produced by learners of Arabic in Saudi Arabia, collected by Alfaifi and Atwell (2013). Each text file is annotated with the proficiency level of the student. We mapped these student proficiency levels to CEFR levels.

Our third corpus source comes from the textbook "Al-Kitaab fii TaAallum al-Arabiyya" (Brustad et al., 2015): we used texts and sentences from parts one and two of the third edition, but only texts from part three of the third edition. This book is widely used for teaching Arabic as a second language. These texts were originally classified according to the American Council on the Teaching of Foreign Languages (ACTFL) guidelines, which we mapped to CEFR levels.

As these corpora have been annotated on the document level and not on the sentence level, we assigned each sentence to the level of the document in which it appears, using several filtering heuristics, such as sentence length and containment, as well as re-annotation through machine learning; see the dataset cleaning procedure below.

A counterpart corpus of texts not produced with language learners in mind is provided by I-AR, 75,630 Arabic web pages collected by wide crawling (Sharoff, 2006). A random snapshot of 8627 sentences longer than 15 words was used to overcome the limited number of C-level sentences available from the corpora for language learners.

Table 1 shows the distribution of the number of sentences and tokens used per Common European Framework of Reference (CEFR) level:

CEFR     Old: S    Old: T     New: S    New: T
A        8661      187225     9030      195343
B        5532      126805     5083      117825
C        8627      287275     8627      287275
Total    22820     601305     22740     600443

Table 1: Sentences (S) and tokens (T) available per CEFR level in the two versions of the corpus.

In principle we have data for 5-way (A1, A2, B1, etc.), 3-way (A, B or C) and binary (A+B vs C) classification tasks, but in this presentation we focus on the 3-way and binary (simple vs complex) classification tasks.

Dataset cleaning: In our initial experiments we noticed unreliable sentence-level assignments in the training corpus. Therefore, we decided to improve the quality of the training corpus by the error analysis strategy introduced by Di Bari et al. (2014), which is based on detecting agreement between classifiers belonging to different machine learning paradigms. The cases in which the majority of the classifiers agreed on predicting a label while the gold standard was different were inspected manually by a specialist in teaching Arabic. In our dataset cleaning experiment we used the following classifiers: SVM (with the RBF kernel), Random Forest, KNeighbors, Softmax and XGBoost, with the linguistic features discussed in Section 3; we trained them via cross-validation and compared their majority vote to the gold standard.

We modified the error classification tags introduced by Di Bari et al. (2014) as follows:

Wrong: the classifiers have wrongly labelled the data, and the gold standard is correct.
Modify: the classifiers are correct and we need to modify the gold standard.
Ambiguous: we consider either label possible, depending on the perspective taken.
False: an added label which represents disagreement between the gold standard and the classifiers, when neither is correct.

For each sentence, five different predictions are assigned. Compared to the gold-standard CEFR label, the classifiers agreed with the gold standard in predicting 10204 instances. What we then need to consider are the cases when all classifiers agree on the predicted label while it contradicts the gold standard; the classifiers agreed on 1943 such sentences. We manually investigated random sentences and assigned the error classification tags. We found that the main classification confusion was in Level B instances. The analysis results in Table 4 show the distribution of categories in which each error type occurred. In the end, 380 instances had to be reassigned to a lower level (usually from B to A).
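A minimal sketch of this cleaning step is given below. It is an illustration rather than the authors' released code: the feature matrix X and the integer-encoded gold labels y are assumed to be built from the features of Section 3, plain logistic regression stands in for the "Softmax" classifier, and the strict unanimity check mirrors the case described above where all five classifiers contradict the gold label.

```python
# A sketch of the agreement-based cleaning step: obtain out-of-fold
# predictions from five classifiers of different ML paradigms, then flag
# sentences where all of them agree on a label that contradicts the gold
# standard. X (feature matrix) and y (integer-encoded gold CEFR labels) are
# assumed to be built elsewhere; LogisticRegression stands in for "Softmax".
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

def flag_suspect_labels(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    classifiers = [
        SVC(kernel="rbf"),
        RandomForestClassifier(),
        KNeighborsClassifier(),
        LogisticRegression(max_iter=1000),
        XGBClassifier(),
    ]
    # Cross-validated predictions, so every sentence is labelled by models
    # that never saw it during training.
    preds = np.stack([cross_val_predict(clf, X, y, cv=5) for clf in classifiers])
    unanimous = (preds == preds[0]).all(axis=0)  # all five predict the same label
    suspects = unanimous & (preds[0] != y)       # ...and it contradicts the gold label
    return np.where(suspects)[0]                 # indices to inspect manually
```

The returned indices correspond to the sentences that a specialist in teaching Arabic would then inspect by hand.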
2.2 Dataset Two: Simplification examples

A set of simple/complex parallel sentences was compiled from the internationally acclaimed Arabic novel "Saaq al-Bambuu" (Al-Sanousi, 2013), which has an authorised simplified version for students of Arabic as a second language (Familiar, 2016). We assume that a successful classifier should be able to detect sentences in the original text that require simplification. Dataset Two consists of 2980 parallel sentence pairs (Table 2).

Level           Sentences   Tokens
Simple (A+B)    2980        34447
Complex (C)     2980        46521
Total           5960        80968

Table 2: Number of sentences and tokens per level in Dataset Two.

3 Features and extraction methods

We work with the following groups of features, listed in Table 3: part-of-speech tagging features (POS features); syntactic structure features (Syntactic features); CEFR-level lexical features; and sentence embeddings.

3.1 Linguistic features

While the sentence-level classification task is novel, we borrowed some features from previous studies of text-level readability (Forsyth, 2014; Saddiki et al., 2015; Nassiri et al., 2018a,b). We decided to exclude sentence length from the feature set, as it creates an artificial skew in understanding what is difficult: more difficult writing styles are often associated with longer sentences, but it is not the sentence length itself that makes them difficult. Specifically, many long Arabic sentences contain shorter ones connected by conjunctions such as 'و /wa/ = and'; in the experience of language teachers, such sentences do not present problems for learners.

3.1.1 The POS features

These features (Table 3, features 1-21) represent the distribution of different word categories in the sentence and the morpho-syntactic properties of these words. According to Knowles and Don (2004), Arabic lemmatisation, unlike that of English, is an essential process for analysing Arabic text, because it is a methodology for dictionary construction. Therefore, we used the Lemma/Type ratio instead of the Word/Type ratio. We added features representing the different verb types (pseudo-verbs, passive verbs, perfective verbs, imperfective verbs and 3rd-person verbs). As conjunction is one of the important features for representing sentence complexity in Arabic (Forsyth, 2014), we used the annotated discourse connectors introduced by Alsaif (2012), splitting this list into 23 simple connectors and 56 complex connectors, referring to non-discourse connectors and discourse connectors respectively. For POS feature extraction we used MADAMIRA, a robust Arabic morphological analyser and part-of-speech tagger (Pasha et al., 2014).
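As an illustration of how such rates can be computed, the sketch below derives a few of the POS features from tagged output. This is a minimal sketch under stated assumptions: the Token record and the two connector sets are hypothetical stand-ins for the MADAMIRA output format and the Alsaif (2012) connector lists, neither of which is reproduced here.

```python
# A minimal sketch of a few of the POS features (1-21 in Table 3), computed
# as token rates at the sentence level. The Token record and the connector
# sets are hypothetical stand-ins for the MADAMIRA output and the
# Alsaif (2012) lists, whose exact formats are not shown in the paper.
from dataclasses import dataclass

@dataclass
class Token:
    form: str   # surface word form
    lemma: str  # lemma from the morphological analyser
    pos: str    # coarse POS tag, e.g. "noun", "verb", "conj", "punc"

SIMPLE_CONNECTORS: set = set()   # the 23 simple connectors would go here
COMPLEX_CONNECTORS: set = set()  # the 56 complex connectors would go here

def pos_features(sent: list) -> dict:
    """Rates of selected token categories for one tagged sentence."""
    n = len(sent)

    def rate(pred):
        return sum(pred(t) for t in sent) / n

    return {
        "ttr_lemma": len({t.lemma for t in sent}) / n,                      # feature 3
        "noun_tokens": rate(lambda t: t.pos == "noun"),                     # feature 4
        "verb_tokens": rate(lambda t: t.pos == "verb"),                     # feature 5
        "conj_tokens": rate(lambda t: t.pos == "conj"),                     # feature 14
        "punc_tokens": rate(lambda t: t.pos == "punc"),                     # feature 18
        "simple_connectors": rate(lambda t: t.form in SIMPLE_CONNECTORS),   # feature 19
        "complex_connectors": rate(lambda t: t.form in COMPLEX_CONNECTORS), # feature 20
    }
```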
3.1.2 Syntactic features

Features 22-27 from Table 3 provide information about sentence structure and about the number and types of phrases. These features are derived from a dependency grammar analysis. Because dependency grammar is based on word-to-word relations, it assumes that the structure of a sentence consists of lexical items attached to each other by binary asymmetrical relations, known as dependency relations; these relations are more representative for this task. We used CamelParser (Shahrour et al., 2016), a system for Arabic syntactic dependency analysis, together with contextually disambiguated morphological features which rely on the MADAMIRA morphological analysis, for more robust results.

3.1.3 CEFR-level lexical features

Features 28-34 from Table 3 are used to assign each word in the sentence an appropriate CEFR level. For this, we created a new Arabic word list consisting of 8834 unique lemmas labelled with CEFR levels. This list is a combination of three frequency lists: 1) the Buckwalter and Parkinson 5000-word frequency list, based on a 30-million-word corpus of academic/non-academic and written/spoken texts (Buckwalter and Parkinson, 2014); 2) the KELLY list produced by the Kelly project (Kilgarriff et al., 2014), which directly mapped a frequency word list to the CEFR levels using numerous corpora and languages; 3) the word lists presented at the beginning of each chapter in "Al-Kitaab" (Brustad et al., 2015). Merging the lists and aligning them with the MADAMIRA lemmatiser led to our new wide-coverage Arabic frequency list, which can be used to predict difficulty as the entropy of the probability distribution of the labels in a sentence. The current list shows some consistency with the English Profile list in terms of the percentage of words allocated to each CEFR level.

POS features
 1  TTR of word forms
 2  Morphemes per word
 3  TTR of lemmas
 4  Noun tokens
 5  Verb tokens
 6  Adjective tokens
 7  Pseudo-verb tokens
 8  Passive verb tokens
 9  Perfective verb tokens
10  Imperfective verb tokens
11  3rd-person verbs / all verbs
12  Numeric adjective tokens
13  Comparative adjective tokens
14  Conjunction tokens
15  Subordinating conjunction tokens
16  Proper noun tokens
17  Pronoun tokens
18  Punctuation tokens
19  Simple connector tokens
20  Complex connector tokens
21  All sentence connector tokens

Syntactic features
22  Incidence of subjects
23  Incidence of objects
24  Incidence of modifier/root
25  Incidence of coordination
26  Average phrases per sentence
27  Average phrase depth

CEFR word features
28  Incidence of Level A1
29  Incidence of Level A2
30  Incidence of Level B1
31  Incidence of Level B2
32  Incidence of Level C1
33  Incidence of Level C2
34  Word entropy with respect to CEFR

35  Sentence embedding features

Table 3: The feature set (all measures are rates of tokens at the sentence level).
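The CEFR word features (28-34 in Table 3 above) can be made concrete with a short sketch. It assumes the 8834-lemma list is available as a plain lemma-to-level mapping and that the input is a non-empty list of lemmas; the name CEFR_LIST and the example entry are illustrative, not the authors' code.

```python
# A sketch of the CEFR-level lexical features (28-34 in Table 3): the rate of
# words at each CEFR level plus the entropy of the level distribution.
# CEFR_LIST stands in for the 8834-lemma list described above; its contents
# (and the romanised example) are illustrative.
import math
from collections import Counter

CEFR_LIST: dict = {}  # lemma -> CEFR level, e.g. {"kitAb": "A1", ...}
LEVELS = ("A1", "A2", "B1", "B2", "C1", "C2")

def cefr_features(lemmas: list) -> dict:
    """CEFR incidence and entropy features for a non-empty lemma list."""
    counts = Counter(CEFR_LIST[l] for l in lemmas if l in CEFR_LIST)
    covered = sum(counts.values())
    # Features 28-33: rate of tokens at each level.
    feats = {f"incidence_{lvl}": counts[lvl] / len(lemmas) for lvl in LEVELS}
    # Feature 34: entropy of the level distribution; it is higher when the
    # sentence mixes words from several difficulty levels.
    probs = [counts[lvl] / covered for lvl in LEVELS if covered and counts[lvl]]
    feats["cefr_entropy"] = -sum(p * math.log2(p) for p in probs)
    return feats
```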
3.2 Sentence embeddings

In addition to the 34 traditional features, we can represent sentences as embedding vectors using different neural models, as follows:

fastText: A straightforward way to create sentence representations is to take a weighted average of the word embeddings (WE) of the individual words, for example using fastText vectors produced with the fastText tool.[2] We used the Arabic ar.300.bin file, in which each word is represented by a 1D vector of 300 attributes, trained on Common Crawl and Wikipedia (Grave et al., 2018). We had to normalize the sentence vectors to have the same length with respect to dimensions. For this, we calculated the tf-idf weight of each word in the corpus and used these as weights. For a sentence $s = w_1 w_2 \ldots w_n$:

$\mathrm{Embed}[s] = \frac{1}{n} \sum_{i=1}^{n} \mathrm{tfidf}[w_i] \cdot \mathrm{Embed}[w_i]$   (1)

[2] https://fasttext.cc/docs/en/crawl-vectors.html

Universal Sentence Encoder (Yang et al., 2019): This model captures the meaning of word sequences rather than just individual words. It was designed mainly to be used at the sentence level: after sentence tokenization, it encodes a sentence into a 512-dimensional vector. We used the large version here.[3]

[3] https://tfhub.dev/google/universal-sentence-encoder-multilingual/1

Multilingual BERT (Devlin et al., 2018): Pre-trained transformer models have proved their ability to learn successful representations of language, inspired by the Transformer model presented in Vaswani et al. (2017).
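Returning to the fastText representation, Equation (1) can be realised in a few lines with the fastText Python bindings. This is a minimal sketch under stated assumptions: the file name follows the public crawl-vectors naming (cc.ar.300.bin), the tfidf mapping is precomputed on the training corpus, and "normalising the sentence vectors to the same length" is interpreted here as scaling to unit norm; none of this is the authors' released code.

```python
# A sketch of Equation (1): tf-idf-weighted averaging of fastText word
# vectors, followed by length normalisation. Assumes the pre-trained Arabic
# vectors have been downloaded, tfidf is a word -> weight mapping computed
# elsewhere, and words is a non-empty tokenised sentence.
import numpy as np
import fasttext

model = fasttext.load_model("cc.ar.300.bin")  # pre-trained 300-dim Arabic vectors

def sentence_embedding(words: list, tfidf: dict) -> np.ndarray:
    # Embed[s] = (1/n) * sum_i tfidf[w_i] * Embed[w_i]
    vecs = [tfidf.get(w, 1.0) * model.get_word_vector(w) for w in words]
    emb = np.mean(vecs, axis=0)
    # Scale to unit norm so every sentence vector has the same length.
    return emb / np.linalg.norm(emb)
```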