next up previous contents
Next: Conclusions and Future Work Up: Evaluation on Other UBGs Previous: The Alvey Grammar   Contents


Penn Treebank CFG

CFGs are normally not considered UBGs, because they have atomic categories, and the process of unifying atomic entities is merely a simple equality check. However, as is shown in this section, indexing methods can be successfully applied to CFGs, with the results demonstrating significant improvement, especially for large-scale CFGs.

The index key for each daughter is the daughter's category itself (a category index). The indexing scheme $ L(Mother)$ contains only the daughters that are guaranteed to match with a specific $ Mother$ (thus creating a ``perfect'' index). This indexing scheme is preferred to a positional index, since it is usual for CFGs to have a large number of rules, with a significant number of daughters. As a result, a positional index would simply need too many index entries (a number equal to the total number of daughters in the grammar). Using the categories as index keys limits the number of entries to the number of different categories.


Table 7.5: Successful unification rate for the non-indexed CFG parser.
Number Successful Failed Success
of rules unifications unifications rate (%)
124 104 1,766 5.56
736 2,904 189,528 1.51
3196 25,416 3,574,138 0.71
9008 195,382 56,998,866 0.34
14193 655,403 250,918,711 0.26
20999 1,863,523 847,204,674 0.21


The search takes place only in the hash entry associated with that daughter's category. This increases to 100% the ratio of successful unifications7.5, representing a significant gain in terms of parsing times. Table 7.5 illustrates the significance of this gain by presenting the successful unification rate for the non-indexed CFG parser.

Fifteen CFGs with atomic categories were built from the Wall Street Journal (Penn Tree Bank v. 2) annotated parse trees, by constructing a rule from each sub-tree of every parse tree, and removing the duplicates. The grammars were extracted from the first ten directories of the Wall Street Journal collection (00-09), consisting of the files wsj_0001.mrg to wsj_0999.mrg, as shown in Table 7.6. Each of the fifteen grammars contains rules extracted from the first $ N$ files in wsj_0001.mrg...wsj_0999.mrg, with $ N$ chosen as: 5, 10, 15, 30, 50, 100, 150, 200, 250, 300, 350, 400, 500, 700, 900.


Table 7.6: The grammars extracted from the Wall Street Journal directories of the PTB II.
Grammar WSJ Number of Lexicon Number of
no. directories WSJ files size Rules
1 00 5 188 124
2 00 10 756 473
3 00 15 1167 736
4 00 30 2335 1369
5 00 50 5645 3196
6 00-01 100 8948 5246
7 00-01 129 11242 6853
8 00-02 200 13164 7984
9 00-02 250 14730 9008
10 00-03 300 17555 10834
11 00-03 350 18861 11750
12 00-04 400 20359 12696
13 00-05 481 20037 13159
14 00-07 700 27404 17682
15 00-09 901 32422 20999


Figure 7.8: Parsing times for EFD and EFD-indexing applied to CFGs with atomic categories.
\includegraphics[width=\textwidth]{index_cfg.eps}

The test set contained 5 sentences of lengths between 13 and 18 words. The sentences were chosen from the first 5 files, to ensure that they will be parsed by every extracted grammar. Figure 7.8 shows that even for smaller numbers of rules, the indexed parser outperforms the non-indexed version. The improvements in parsing times for the indexed parser over those for the non-indexed parser range from a minimum of 25% to a maximum of 62%. All grammars with more than 3000 rules (11 out of the total 15 grammars) exhibit improvements in parsing times of more than 50%, while only the smallest grammar shows improvements under 30%. Although ``unification'' costs are small for atomic CFGs, using an indexing method is well justified. The difference in improvements from Penn Tree Bank CFG to MERGE TFSG is explained by the large branching factor of the CFG rules, by the ``perfect'' index, and by the sheer size of the charts in number of edges. All of these, as mentioned in Section 5.3, are important factors in an indexing method's success.


next up previous contents
Next: Conclusions and Future Work Up: Evaluation on Other UBGs Previous: The Alvey Grammar   Contents