CFGs are normally not considered UBGs, because they have atomic categories, and the process of unifying atomic entities is merely a simple equality check. However, as is shown in this section, indexing methods can be successfully applied to CFGs, with the results demonstrating significant improvement, especially for large-scale CFGs.
The index key for each daughter is the daughter's category itself (a
category index). The indexing scheme
contains only the
daughters that are guaranteed to match with a specific
(thus
creating a ``perfect'' index). This indexing scheme is preferred to a
positional index, since it is usual for CFGs to have a large number of
rules, with a significant number of daughters. As a result, a
positional index would simply need too many index entries (a number
equal to the total number of daughters in the grammar). Using the
categories as index keys limits the number of entries to the number of
different categories.
The search takes place only in the hash entry associated with that daughter's category. This increases to 100% the ratio of successful unifications7.5, representing a significant gain in terms of parsing times. Table 7.5 illustrates the significance of this gain by presenting the successful unification rate for the non-indexed CFG parser.
Fifteen CFGs with atomic categories were built from the Wall Street
Journal (Penn Tree Bank v. 2) annotated parse trees, by constructing a
rule from each sub-tree of every parse tree, and removing the
duplicates. The grammars were extracted from the first ten directories
of the Wall Street Journal collection (00-09),
consisting of the files wsj_0001.mrg to
wsj_0999.mrg, as shown in Table 7.6.
Each of the fifteen grammars contains rules extracted from the first
files in wsj_0001.mrg...wsj_0999.mrg, with
chosen as: 5, 10, 15, 30, 50, 100, 150, 200, 250, 300, 350, 400,
500, 700, 900.
The test set contained 5 sentences of lengths between 13 and 18 words. The sentences were chosen from the first 5 files, to ensure that they will be parsed by every extracted grammar. Figure 7.8 shows that even for smaller numbers of rules, the indexed parser outperforms the non-indexed version. The improvements in parsing times for the indexed parser over those for the non-indexed parser range from a minimum of 25% to a maximum of 62%. All grammars with more than 3000 rules (11 out of the total 15 grammars) exhibit improvements in parsing times of more than 50%, while only the smallest grammar shows improvements under 30%. Although ``unification'' costs are small for atomic CFGs, using an indexing method is well justified. The difference in improvements from Penn Tree Bank CFG to MERGE TFSG is explained by the large branching factor of the CFG rules, by the ``perfect'' index, and by the sheer size of the charts in number of edges. All of these, as mentioned in Section 5.3, are important factors in an indexing method's success.