How is Arabic parsing investigated by
comparing lexicalised and unlexicalised parsers? What is the relation between the
ATB and the SBJ, OBJ and PRD functional tags in this parsing?
SOLUTION:
Arabic is a Semitic language
well-known for its morphological richness and syntactic complexity. Parsing
Arabic sentences is a difficult task for several reasons including the
relatively free word order of Arabic, the length of sentences, the omission of
diacritics (vowels) in written Arabic and the frequency of pro-drop phenomena.
The main objective of our research is to automatically enrich Penn Arabic
Treebank (ATB) trees and ATB-trained parser output with dependency information
such as ATB functional tags. Then, based on these tags together with categorial and
configurational information, we automatically annotate trees with LFG f-structure
information representing predicate-argument structure relations, producing ATB-based
LFG resources for parsing and generation, similar to previous work on English (Cahill and
van Genabith, 2006; Cahill et al., 2008).
We investigate Arabic parsing by
comparing lexicalised and unlexicalised parsers. We focus on ATB grammatical function
tag assignment (Habash et al., 2009) using grammar transforms for different
configurations, making use of morpho-syntactic information to detect subjects, direct
objects and predicates. The paper is structured as follows: Section 2 describes
the general background. Section 3 presents the parsers and grammar transforms
focusing on the three most frequent functional tags in the ATB: subject, direct
object and predicate. Section 4 discusses the results from a series of
experiments conducted on parsers trained on the transformed ATB, measuring
quality and coverage of the output trees and the generated LFG f-structures.
Relation Between the ATB and the SBJ, OBJ and PRD Functional Tags
ATB: Penn Arabic Treebank
The Penn Arabic Treebank (Maamouri and
Bies, 2004) is a corpus of 23,611 parse-annotated sentences from Arabic newswire
text in Modern Standard Arabic (MSA). The ATB is a fine-grained corpus: its
annotation includes 22 phrasal tags, 20 individual functional tags and 24 basic
POS tags (with a total of 497 different POS tags once morphological information is included).
In addition, the ATB uses empty nodes to capture pro-drop as well as
non-local dependencies (NLDs). The full POS tag set with morphological information
indicates case, mood, gender and definiteness.
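To make the label inventory concrete, here is a minimal Python sketch that splits a Penn-style node label such as NP-SBJ into its phrasal category and its functional tags. The function name split_label and the example labels are illustrative assumptions, not part of any ATB tooling. The sketch also assumes the usual hyphen-separated label convention and treats purely numeric suffixes as co-indexation indices rather than tags.

```python
def split_label(label):
    """Split a Penn-style label into (category, [functional tags])."""
    # Labels such as -NONE- (empty elements) start with a hyphen; keep them whole.
    if label.startswith("-"):
        return label, []
    category, *rest = label.split("-")
    # Drop purely numeric parts: they are co-indexation indices, not functional tags.
    tags = [part for part in rest if not part.isdigit()]
    return category, tags


print(split_label("NP-SBJ"))
print(split_label("NP-OBJ-1"))
print(split_label("VP"))
```

A helper of this kind is only a convenience for inspecting trees; it says nothing about how the parsers themselves handle functional tags.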
SBJ, OBJ and PRD: As Arabic is
morphologically rich, a lot of information is present at the leaves of the
trees in the ATB. We percolate morphological information bottom-up in the trees
to help grammatical function assignment. We focus on the three most frequent
functional tags in the ATB: -SBJ, -OBJ and -PRD. Case percolation aims to improve
the identification of subject, object and predicate constituents among the
syntactic structures in the parse tree. Arabic has three grammatical
cases: nominative, genitive and accusative. Except when they are governed by an
overt copula or a subordinating conjunction, -SBJ and -PRD are nominative and
-OBJ is accusative (Habash et al., 2005). Adding case information to the POS tags
increases the size of the POS tag set to 40 tags (e.g. the POS tag NN is expanded
to NN, NN nom and NN acc). Figure 1 shows the (unmasked) output tree provided by
the parser, trained on a version of the ATB which has undergone ATB function
tag masking and case percolation, for sentence (1). Each node in the tree is
assigned an f-structure equation using A3. The subject NP receives ‘↑SUBJ
= ↓’ and the predicate, which subcategorises for a copula complement, receives
‘↑PRED = ‘null be’, ↑XCOMP = ↓, ↑SUBJ = ↓SUBJ’. The
resolution of the equations produces the f-structure shown in Figure 2. Note
that the subject of the matrix clause is co-indexed with the subject of the
embedded clause.
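The idea of case percolation can be illustrated with a small Python sketch. It is not the authors' implementation: the Node class, the underscore in labels such as NN_nom, and the rule that an NP takes the case of its first case-marked child (standing in for a proper head-finding rule) are all assumptions made for illustration. The mapping from case to candidate tags follows the generalisation above (nominative for -SBJ and -PRD, accusative for -OBJ) and ignores the exceptions for overt copulas and subordinating conjunctions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Candidate functional tags per case, following the generalisation in the text.
CASE_TO_TAGS = {"nom": ["SBJ", "PRD"], "acc": ["OBJ"]}


@dataclass
class Node:
    label: str                                   # phrasal category or POS tag
    children: List["Node"] = field(default_factory=list)
    case: Optional[str] = None                   # "nom", "acc", "gen" or None


def percolate_case(node):
    """Percolate case bottom-up and add nom/acc to POS labels on the leaves."""
    for child in node.children:
        percolate_case(child)
    if node.children:
        # Assumption: an NP inherits the case of its first case-marked child.
        if node.label == "NP" and node.case is None:
            node.case = next((c.case for c in node.children if c.case), None)
    elif node.case in ("nom", "acc"):
        # Only nom and acc extend the POS tag, as in the NN example in the text.
        node.label = f"{node.label}_{node.case}"
    return node


def candidate_tags(node):
    """Return the functional tags compatible with the node's percolated case."""
    return CASE_TO_TAGS.get(node.case, [])


# A toy verb-initial clause: verb, nominative NP, accusative NP.
tree = Node("S", [Node("VP", [
    Node("VBD"),
    Node("NP", [Node("NN", case="nom")]),
    Node("NP", [Node("NN", case="acc")]),
])])
percolate_case(tree)
for np in tree.children[0].children[1:]:
    print(np.label, np.case, candidate_tags(np))
```

In this toy tree the nominative NP remains ambiguous between -SBJ and -PRD, which is why case alone only narrows the choice and a trained parser is still needed to decide between the candidates.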