Grammar Extraction from Treebanks for Hindi and Telugu

Prasanth Kolachina, Sudheer Kolachina, Anil Kumar Singh, Samar Husain,
Viswanatha Naidu, Rajeev Sangal and Akshar Bharati
Language Technologies Research Centre,
IIIT-Hyderabad, India
{prasanth k, sudheer.kpg08, anil, vnaidu, samar}@research.iiit.ac.in, sangal@iiit.ac.in

Abstract
Grammars play an important role in many Natural Language Processing (NLP) applications. The traditional approach of creating grammars manually, besides being labor-intensive, has several limitations. With the availability of large scale syntactically annotated treebanks, it is now possible to automatically extract an approximate grammar of a language in any of the existing formalisms from a corresponding treebank. In this paper, we present a basic approach to extract grammars from dependency treebanks of two Indian languages, Hindi and Telugu. The process of grammar extraction requires a generalization mechanism. Towards this end, we explore an approach which relies on generalization of argument structure over the verbs based on their syntactic similarity. Such a generalization counters the effect of data sparseness in the treebanks. A grammar extracted using this system can not only expand already existing knowledge bases for NLP tasks such as parsing, but also aid in the creation of grammars for languages where none exist. Further, we show that the grammar extraction process can help in identifying annotation errors and thus aid in the task of treebank validation.
1. Introduction

Large scale annotated resources such as syntactic treebanks, PropBank, FrameNet, VerbNet, etc. have been at the core of Natural Language Processing (NLP) research for quite some time. For a language like English, for which these resources were first developed, they have proved to be indispensable in advancing the state of the art for a host of applications. Following the success of efforts like the Penn TreeBank (PTB) (Marcus et al., 1994) and the Prague Dependency Treebank (Hajicova, 1998), several attempts are underway to build such NLP resources for new languages. One such ongoing effort is to create a treebank for Hindi-Urdu (Bhatt et al., 2009; Palmer et al., 2009; Begum et al., 2008a). Begum et al. describe a dependency annotation scheme based on the Computational Paninian Grammar or CPG (Bharati et al., 1995). The treebank being developed using this annotation scheme currently contains around 2500 sentences. Despite its modest size, the Hindi treebank has helped considerably improve the accuracies for a variety of NLP applications, especially parsing (Bharati et al., 2008).

The role of grammars in the development of advanced NLP systems is well known. Traditionally, the task of creating a grammar for a language involved selecting a formalism and encoding the patterns in that language as rules, constraints, etc. But with the availability of large scale syntactically annotated treebanks, it is now possible to automatically extract an approximate grammar of a language in any of the existing formalisms from a corresponding treebank, thus reducing human effort considerably. This method of extracting grammars from treebanks allows for the creation and expansion of knowledge bases for parsing. Grammars extracted through this method can be used to evaluate the coverage of existing hand-crafted grammars. The extraction process itself can help detect annotation errors. Another major advantage of extracting grammars from a treebank, as compared to the traditional approach of handcrafting grammars, is the availability of statistical information in the form of weights associated with the primitive elements in the grammar (Xia, 2001).

One of the important issues with any kind of annotated corpora is data sparseness. Sparseness of annotated data has a detrimental effect on the performance of natural language processing applications trained over such corpora. In the case of syntactic annotation, information about the argument structure of the verb is crucial for applications such as parsing. For instance, there exist differences among individual verbs in the number of their annotated instances based on the frequency of their occurrence. The number of annotated instances varies greatly from verb to verb. This sparse data also poses a challenge for grammar extraction from treebanks. One of the ways to overcome the limitation of sparse data in syntactic treebanks is through generalization of the argument structure across different verbs. Furthermore, generalization based on clustering can lead to the creation of verb classes based on the similarity of argument structure.

In this paper, we present a basic system to extract a dependency grammar in the CPG formalism from treebanks for two languages, Hindi and Telugu. Towards this end, we explore an approach which relies on generalization of argument structure over verbs based on the similarity of their syntactic contexts. A grammar extracted using this system can not only expand an already existing knowledge base for NLP tasks such as parsing, but also aid in the creation of a useful resource. Further, the grammar extraction process can help in identifying annotation errors and thus make the task of treebank validation easier.

2. Goals of the paper

The main goals of this paper are as follows:

1. To present a system that extracts grammars in the CPG formalism from the Hindi and Telugu treebanks
2. To use the extracted grammar to improve the coverage of an existing hand-crafted grammar for Hindi, which is being used for parsing (Bharati et al., 2009a)
3. To generalize verb argument structure information over the extracted verb frames to address sparsity in the annotated corpora
4. To aid in the validation of treebanks by detecting different types of annotation errors using the extracted grammars

3. Related Work

In this section we briefly survey some of the work on grammar extraction and on generalization using syntactic similarity. We also mention a few details about the two Indian language treebanks that we used. Since syntactic alternation can be an important criterion while generalizing verbs, we also briefly discuss how syntactic alternation in Hindi differs from English.

3.1. Grammar Extraction

The role of grammars in NLP is more extensive than is generally supposed. Xia (2000) points out that the task of treebanking for a language bears much similarity to the task of manually crafting a grammar. The treebank of a language contains an implicit grammar for that language. Statistical NLP systems trained over a treebank make use of this grammar implicit in the treebank. This is why grammar driven approaches and data driven or statistical approaches are not necessarily mutually exclusive. It is well known that a high quality, large coverage grammar crafted in the traditional, manual way takes tremendous human effort to build and maintain. In addition, the traditional approach does not provide for flexibility, consistency and generalization. To address these limitations of the traditional approach to grammar development, Xia (2001) presents two alternative approaches that generate grammars automatically, one from descriptions (LexOrg) and the other from treebanks (LexTract).

The LexTract system extracts explicit grammars in the TAG formalism from a treebank. It is not, however, limited to the TAG formalism, as it can also extract CFGs from a treebank. Large scale treebanks such as the English Penn Treebank (PTB) are not based on existing grammars. Instead, they were manually annotated following annotation guidelines. Since the process of creating annotation guidelines is similar to the process of building a grammar by hand, it can be assumed that an implicit grammar, hidden in the annotation guidelines, generates the structures in the treebank. This implicit grammar can be called a treebank grammar. As suggested by Xia, the task of grammar extraction using LexTract can be seen as the task of converting this implicit treebank grammar into an explicit TAG grammar. LexTract builds an LTAG grammar in two stages. First, it converts the annotated phrase structure trees in the PTB into LTAG derived trees. In the second stage, it decomposes these derived trees into a set of elementary trees which form the basic units of an LTAG grammar. It also extracts derivation trees which provide information about the order of operations necessary to build the corresponding derived trees. In her work, Xia has demonstrated the process for treebanks of three languages: English, Chinese and Korean. She also showed that grammars extracted using LexTract have several applications. They can be used as stand-alone grammars for languages that do not have existing grammars. They can be used to enhance the coverage of already existing grammars. They can be used to compare grammars of different languages. The derivation trees extracted using LexTract can be used to train statistical parsers and taggers. LexTract can also help detect certain kinds of annotation errors and thereby semi-automate the process of treebank validation. A major advantage of the LexTract approach to grammar development is that it can provide valuable statistical information in the form of weights associated with primitive elements.

The work we present in this paper is along the same lines as the LexTract approach to grammar development, but it is on a much smaller scale. It is meant to be the first step towards building a LexTract-like system for extracting CPG grammars for Indian languages. Since we worked with dependency treebanks of Hindi and Telugu, we chose a dependency grammar formalism known as Computational Paninian Grammar (CPG). In fact, the annotation guidelines followed to annotate the treebanks are based on this grammar (Bharati et al., 2009b). As such, the grammar extraction process is much more straightforward than the one in LexTract. In the next section, we give a brief outline of the CPG formalism, where we define the basic terminology and briefly discuss the components of a CPG grammar.

3.2. Generalization Based on Syntactic Similarity

The problem of sparse data in PropBank has been previously addressed using syntactic similarity based generalization of semantic roles across verbs (Gordon and Swanson, 2007). We try to address the data sparseness problem by generalizing over argument structure across syntactically similar verbs to arrive at an automatic verb classification. Gordon and Swanson (2007) define syntactic similarity for phrase structure trees using the notion of a parse tree path (Gildea and Jurafsky, 2002). Gildea and Jurafsky define a parse tree path as 'the path from the target word through the parse tree to the constituent in question, represented as a string of parse tree non-terminals linked by symbols indicating upward and downward movement through the tree'. This parse tree path feature is used to represent the syntactic relationships between a predicate and its arguments in a parse tree. The syntactic context of a verb is extracted as the set of all possible parse tree paths from the parse trees of sentences containing that verb. The syntactic context of a verb is then converted into a feature vector representation. The syntactic similarity between two verbs is calculated using different distance measures such as Euclidean distance, the Chi-square statistic, cosine similarity, etc. In our work, we present an analogous measure of syntactic similarity for the dependency structures in the Indian Language (IL) Treebanks, which is described in section 5. We characterize the syntactic context of a verb using a karaka frame representation. The notion of karakas is explained in the next section.
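To make the idea of comparing verbs by their syntactic contexts concrete, the sketch below builds count vectors over context features for two verbs and scores the pair with cosine similarity, one of the measures mentioned above. It is only a minimal illustration: the verbs, the feature strings (karaka-vibhakti patterns) and the counts are invented for the example, and the paper's own similarity measure over karaka frames is the one defined in its section 5, not this one.

```python
from collections import Counter
from math import sqrt

def context_vector(features):
    """Count how often each syntactic-context feature (e.g. a parse tree
    path or a karaka-vibhakti pattern) occurs with a verb."""
    return Counter(features)

def cosine_similarity(v1, v2):
    """Cosine similarity between two sparse count vectors."""
    shared = set(v1) & set(v2)
    dot = sum(v1[f] * v2[f] for f in shared)
    norm1 = sqrt(sum(c * c for c in v1.values()))
    norm2 = sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# Invented toy contexts: each string stands for one observed
# syntactic-context feature of the verb in the treebank.
kha_contexts = ["k1:ne", "k2:0", "k1:0", "k2:ko"]   # 'KA' (eat)
pi_contexts  = ["k1:ne", "k2:0", "k1:0"]            # 'pI' (drink), hypothetical

sim = cosine_similarity(context_vector(kha_contexts),
                        context_vector(pi_contexts))
print(f"syntactic similarity: {sim:.3f}")
```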
3.3. Syntactic Alternations in Hindi

Syntactic alternations of a verb have been claimed to reflect its underlying semantic properties. Levin's classification of English verbs (Levin, 1993), based on this assumption, demonstrates how the syntactic alternation behavior of a verb can be correlated with its semantic properties, thereby leading to a semantic classification. There have also been several attempts at automatically identifying distinct clusters of verbs that behave similarly using clustering algorithms. These empirically derived clusters were then compared against Levin's classification (Merlo and Stevenson, 2001).

The following are some linguistic aspects of verb alternation behavior that we encountered in Hindi:

• In Hindi, the inchoative-transitive alternation pattern cannot be considered an alternation of the same verb stem. The verb stems in such constructions, although morphologically related, are mostly distinct. This is illustrated in the examples below:

  Inchoative:
  darawAzA KulA
  door-3PSg-Nom open
  'The door opened.'

  Transitive:
  Atifa-ne darawAzA KolA
  Atif-3PSg-Erg door-3PSg open
  'Atif opened the door.'

• Similarly, the diathesis alternation pattern discussed by Levin is not exhibited by Hindi verbs.

• Since Hindi is a morphologically rich, free-word-order language, the alternations are not with respect to the position of the constituent, as is the case in English. In Hindi, alternations are with respect to the case-endings (or the post-positions) of the nouns, which are called vibhaktis in CPG.

• Alternation of post-positions or vibhaktis is determined by the form that the verb stem takes in a particular construction. In other words, the arguments of a verb are realized using different case-endings or vibhaktis based on the tense, aspect and modality (TAM) features of the verb. This is illustrated in the examples below:

  abhaya rotI KatA hE
  Abhay-Nom-3PSgM bread eat-pres.simp.-3PSgM
  'Abhay eats bread.'

  abhaya-ne rotI KAyI
  Abhay-Erg bread-3PSgF eat-past.simp.-3PSgF
  'Abhay ate bread.'

  abhaya-ne rotI-ko KAyA
  Abhay-Erg bread-Acc eat-past.simp.-default
  'Abhay ate bread.'

  In the above sentences, the nominal vibhaktis (case-endings or post-positions) change according to the TAM and agreement features of the verb. This co-variation of vibhaktis with the verb's inflectional features holds not only for finite verb forms but also for non-finite verb forms. All this information is exploited in the CPG formalism in a systematic way, as discussed in the next section.

3.4. Indian Language Treebanks

In this sub-section, we give a very brief overview of the treebanks used in our work. We worked with treebanks of two Indian languages, Hindi and Telugu. The treebanks for Hindi and Telugu contain 2403 and 1226 sentences respectively. The development of these treebanks is an ongoing effort. The Hindi treebank is part of a multi-level resource development project (Bhatt et al., 2009). Some of the salient features of the annotation process employed in the development of these treebanks are as follows:

• The syntactic structure of sentences is based on the dependency representation scheme.
• Dependency relations in the Hindi treebank are annotated on top of a manually POS-tagged and chunked corpus. In the Telugu treebank, the POS-tagging and chunking were not performed manually.
• Dependency relations are defined between chunk heads.
• The dependency tagset used to annotate dependency relations is based on the CPG formalism, which we discuss in section 4.

4. Computational Paninian Grammar

In this section, we give a brief overview of the Computational Paninian Grammar (CPG) formalism. We only outline details relevant to our goal of grammar extraction. See Bharati et al. (1995) for a detailed discussion of the CPG formalism and the Paninian theory on which it is based. In subsection 4.1, we introduce the basic terminology necessary for an overview of this formalism.

4.1. Terminology

• The notion of karaka relations is central to Paninian Grammar. Karaka relations are syntactico-semantic relations between the verbs and other related constituents in a sentence. Each of the participants in an activity denoted by a verbal root is assigned a distinct karaka. There are six different types of karaka relations in the Paninian grammar, as listed below:

  1. k1: karta, participant central to the action denoted by the verb
  2. k2: karma, participant central to the result of the action denoted by the verb
  3. k3: karana, instrument essential for the action to take place
  4. k4: sampradana, beneficiary/recipient of the action
  5. k5: apadana, participant which remains stationary (or is the reference point) in an action involving separation/movement
  6. k7: adhikarana, real or conceptual space/time [1]

  For example, in the following example sentence:

  samIrA-ne abhaya-ko phUla diyA
  Samira-Erg Abhay-Dat flower-Acc give.past.3PSgM
  'Samira gave a flower to Abhay.'

  Samira is the karta (k1), the flower is the karma (k2) and Abhay is the sampradana (k4). Similarly, in the following example:

  Atifa ne kueM se pAnI nikAlA
  Atif-Erg well-Abl water-Acc draw.3PSgM
  'Atif drew water from the well.'

  Atif is the karta (k1), the well is the apadana (k5) and the water is the karma (k2).

  In addition to these karaka relations, there are some additional relations in the Paninian scheme such as tadarthya (or purpose) [2].

• The notion of vibhakti relates to the notion of local word groups based on case-ending, preposition and post-position markers. For a nominal word group, the vibhakti is the post-position (also known as parsarg) occurring after the noun. Similarly, in the case of a verbal word group, a head verb may be followed by auxiliary verbs which may remain as separate words or may combine with the head verb. This information following the head verb (in other words, the verb stem) is collectively called the vibhakti of the verb. The vibhakti of a verb contains information about tense, aspect and modality (TAM) and also agreement, which are features assigned to the verb in a syntactic construction. Therefore, it can also be referred to as the TAM marker of the verb. In the previous example sentence, the nouns 'Atifa' and 'kuAM' have the vibhaktis '-ne' and '-se' respectively. The vibhakti of the verb 'nikAla' is 'yA', which is also its TAM label. Nominal vibhaktis have also been found to be important syntactic cues for identification of semantic roles in the CPG scheme (Bharati et al., 2008).

[1] In the tagset used, k7p represents spatial location, k7t represents temporal location and k7/k7v represents conceptual location.
[2] The complete tagset can be found at http://ltrc.iiit.ac.in/MachineTrans/research/tb/dep-tagset.pdf
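The terminology above can be summed up with a small, hypothetical record of the karaka analysis of the 'give' sentence. The data layout, field names and the zero vibhakti shown below are our own illustrative choices, not part of the CPG tagset or the treebank format.

```python
# Hypothetical plain-data record of the karaka analysis of
# "samIrA-ne abhaya-ko phUla diyA" ('Samira gave a flower to Abhay.').
# Keys are CPG karaka labels; values are (chunk head, vibhakti) pairs.
analysis = {
    "verb": ("diyA", "yA"),     # main verb with its TAM marker (assumed here)
    "k1":   ("samIrA", "ne"),   # karta
    "k2":   ("phUla", "0"),     # karma, no overt post-position here
    "k4":   ("abhaya", "ko"),   # sampradana
}

for label, (head, vibhakti) in analysis.items():
    print(f"{label:4s} -> {head} ({vibhakti})")
```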
Figure 1: Basic demand frame for the verb 'de' (to give)
Figure 2: 'yA' transformation frame for transitive verb

4.2. Components of CPG: Demand Frames and Transformation Frames

A key aspect of Paninian grammar (CPG) is that the verb group containing a finite verb is the most important word group (equivalent to the notion of a 'head') of a sentence. For the other word groups in the sentence dependent on this head, the vibhakti information of the word group is used to map it to an appropriate karaka relation. This karaka-vibhakti mapping is dependent on the main verb and its TAM label. The mapping is represented by two templates: the default karaka chart (also known as the basic demand frame) and the karaka chart transformation (also known as the transformation frame). The default demand frame defines the mapping for a verb or a class of verbs with respect to a basic reference TAM label. It specifies the karaka relations selected by the verb along with the vibhaktis allowed by the basic TAM label. The basic reference TAM label in CPG is chosen to be 'tA hE', which is equivalent to the Present Indefinite/Simple Present. For any other TAM label of that verb or verb class, a transformation rule is defined that can be applied to the default demand frame to obtain the appropriate karaka-vibhakti mapping for that TAM combination. The transformation rules can affect the default demand frame in three ways, each defined as an operation in CPG:

1. Insert: a new karaka relation is inserted into the demand frame along with its vibhakti mapping
2. Delete: an existing karaka relation is deleted from the default demand frame
3. Update: a karaka-vibhakti mapping entry in the default demand frame is updated by modifying the vibhakti information according to the new TAM label
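As a rough sketch of how a default demand frame and a transformation rule might be represented in code: the frame contents and the rule below are invented for illustration (the paper's actual frames for 'de' and the 'yA' transformation are the ones shown in Figures 1 and 2), and the function is ours, not part of the described system.

```python
import copy

# Invented default demand frame for an (imaginary) transitive verb class,
# keyed by karaka relation; each entry lists the vibhaktis allowed under
# the basic reference TAM label 'tA hE'.
default_frame = {
    "k1": ["0"],          # karta with no overt post-position
    "k2": ["0", "ko"],    # karma
}

def apply_transformation(frame, rule):
    """Apply one CPG transformation operation (insert / delete / update)
    to a copy of a demand frame and return the transformed frame."""
    op, karaka, vibhaktis = rule
    new_frame = copy.deepcopy(frame)
    if op == "insert":      # add a new karaka-vibhakti entry
        new_frame[karaka] = vibhaktis
    elif op == "delete":    # drop an existing karaka relation
        new_frame.pop(karaka, None)
    elif op == "update":    # modify the vibhaktis of an existing entry
        new_frame[karaka] = vibhaktis
    return new_frame

# Illustrative 'yA'-style rule: under this TAM label the karta takes
# the ergative post-position 'ne'.
print(apply_transformation(default_frame, ("update", "k1", ["ne"])))
```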