Monday, 23 January 2012

CS606 Compiler Constructions Assignment 5

CS606 Compiler Constructions
Assignment # 5, Fall 2011


IDEA Solution

Question Statement: [20]

How is Arabic parsing investigated by comparing lexicalised and unlexicalised parsers? What is the relation between the ATB and the SBJ, OBJ and PRD functional tags in this parsing?


SOLUTION:
Arabic is a Semitic language well known for its morphological richness and syntactic complexity. Parsing Arabic sentences is a difficult task for several reasons, including the relatively free word order of Arabic, the length of sentences, the omission of diacritics (vowels) in written Arabic and the frequency of pro-drop phenomena. The main objective of our research is to automatically enrich Penn Arabic Treebank (ATB) trees and ATB-trained parser output with dependency information such as ATB functional tags. Then, based on these tags and on categorial and configurational information, we automatically annotate trees with LFG f-structure information representing predicate-argument structure relations. The result is a set of ATB-based LFG resources for parsing and generation, similar to previous work on English by Cahill and van Genabith (2006) and Cahill et al. (2008).

We investigate Arabic parsing by comparing lexicalised and unlexicalised parsers. We focus on ATB grammatical function tag assignment (Habash et al., 2009) using grammar transforms for different configurations, making use of morpho-syntactic information to detect subjects, direct objects and predicates. The paper is structured as follows: Section 2 describes the general background. Section 3 presents the parsers and grammar transforms, focusing on the three most frequent functional tags in the ATB: subject, direct object and predicate. Section 4 discusses the results from a series of experiments conducted on parsers trained on the transformed ATB, measuring the quality and coverage of the output trees and the generated LFG f-structures.


Relation Between the ATB and the SBJ, OBJ and PRD Functional Tags
ATB: Penn Arabic Treebank (ATB)

The Penn Arabic Treebank (Maamouri and Bies 2004) is a corpus of 23,611 parse-annotated sentences from Arabic newswire text in Modern Standard Arabic (MSA). The ATB is a fine-grained corpus: its annotation includes 22 phrasal tags, 20 individual functional tags and 24 basic POS tags (with a total of 497 different POS tags once morphological information is included). In addition, the ATB uses empty nodes to capture pro-drop as well as non-local dependencies (NLDs). The full POS tag set with morphological information indicates case, mood, gender and definiteness.


SBJ, OBJ and PRD: As Arabic is morphologically rich, a lot of information is present at the leaves of the trees in the ATB. We percolate morphological information bottom-up in the trees to help grammatical function assignment. We focus on the three most frequent functional tags in the ATB: -SBJ, -OBJ and -PRD. Case percolation aims to improve the identification of the subject, object and predicate constituent(s) among the syntactic structures in the parse tree. Arabic has three grammatical cases: nominative, genitive and accusative. Except when they are governed by an overt copula or a subordinating conjunction, -SBJ and -PRD are nominative and -OBJ is accusative (Habash et al. 2005). Adding case information to the POS tags increases the size of the POS tag set to 40 tags (e.g. the POS tag NN is expanded to NN, NN nom and NN acc).

Figure 1 shows the (unmasked) output tree provided by the parser, trained on a version of the ATB which has undergone ATB function tag masking and case percolation, for sentence (1). Each node in the tree is assigned an f-structure equation using A3. The subject NP receives ‘↑ SUBJ = ↓’, and the predicate, which subcategorises for a copula complement, receives ‘↑ PRED = ‘null be’, ↑ XCOMP = ↓, ↑ SUBJ = ↓ SUBJ’. Resolving these equations produces the f-structure shown in Figure 2. Note that the subject of the matrix clause is coindexed with the subject of the embedded clause.
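The case percolation step described above can be illustrated with a minimal Python sketch. It is not the transform used in the cited work: the `Node` class, the `percolate_case` and `candidate_tags` functions, the underscore label format (`NN_nom`, `NP_acc`) and the "first case-bearing child is the head" rule are all assumptions made for this illustration. The toy tree is a hand-built example, not ATB data.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str                                   # phrasal category or POS tag
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None                   # set on leaves only
    case: Optional[str] = None                   # "nom", "acc", "gen" or None


def percolate_case(node: Node) -> Optional[str]:
    """Annotate labels with case bottom-up and return this node's case."""
    if not node.children:
        # Leaf: the POS tag is expanded with nominative/accusative case,
        # mirroring NN -> NN, NN nom, NN acc in the text.
        if node.case in ("nom", "acc"):
            node.label = f"{node.label}_{node.case}"
        return node.case

    child_cases = [percolate_case(child) for child in node.children]
    # Assumption: the first case-bearing child acts as the head.
    head_case = next((c for c in child_cases if c is not None), None)
    # Only NPs inherit case here, so case does not leak up through VP or S.
    if node.label == "NP" and head_case in ("nom", "acc"):
        node.case = head_case
        node.label = f"NP_{head_case}"
    return node.case


def candidate_tags(case: Optional[str]) -> List[str]:
    """Map a percolated case to the functional tags it is compatible with."""
    if case == "nom":
        return ["SBJ", "PRD"]    # -SBJ and -PRD are nominative by default
    if case == "acc":
        return ["OBJ"]           # -OBJ is accusative by default
    return []


# Toy verb-initial clause: "kataba al-waladu risalatan" (the boy wrote a letter).
tree = Node("S", [
    Node("VP", [
        Node("VBD", word="kataba"),
        Node("NP", [Node("NN", word="al-waladu", case="nom")]),
        Node("NP", [Node("NN", word="risalatan", case="acc")]),
    ]),
])
percolate_case(tree)
vp = tree.children[0]
for child in vp.children:
    print(child.label, candidate_tags(child.case))
```

The sketch only narrows down which functional tags a constituent could carry. It ignores the exceptions mentioned above (an overt copula or a subordinating conjunction governing the constituent), which a real transform would have to handle.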



