Design

3 Design

Mcfgcky is a bottom-up chart parser in the style of Shieber et al. [1995]. As such, it approaches parsing as a deduction problem, where the axia are the string terminals and a set of inference rules specifies the formulation of derived categories. Both the axia and derived items are contained within a chart which records the entirety of the bottom-up deduction and thus eliminates any need for backtracking.

3.1 Chart Items

As such, the basic data structure in Mcfgcky is the chart item, which represents some hypothesis made by the recognizer for the category corresponding to a certain string segment. In Mcfgcky, this structure is of the OCaML type item, such as follows.

type item = {prob:float;pointer:int;cat:string;pos:span list}

The items populating the chart of a MCFG recognizer in the Shieber et al. [1995] style, like that of a CFG CKY recognizer, must minimally contain a category label (cat) and a mapping from that category to the input string (pos). In a CKY recognizer, this representation of the input string segment often takes the form of a tuple of integers (x,y), where x and y denote the positions which begin and end the string. This convention succinctly and uniquely identifies the string segment which the parser has attached a category label, and determines which items are concatenable. As the yield function of an MCFG rules takes not strings but tuples of strings as arguments, and returns such, an item’s mapping of categories to the input string is more complex. In Mcfgcky, the field pos achieves this more complex mapping as a list of spans, where a span represents some maximal contiguous item contained within an MCFG tuple.

The parametrized type span implements the range of automata we wish to parse.

The Empty and Var types represent categories and portions of categories that are not directly instantiated by the input string. In the automaton, these types implement epsilon transitions which generate a symbol without consuming an input symbol. The distinction between Empty and Var in the MCFG parser repects a distinction made by the Guillamin compiler from MG. A span with empty string yield could correspond to an MG category which has no string yield, such as a functional head. It could also correspond to an ad hoc category fashioned by the compiler to fascilitate such rules as head movement and adjunction. In the former case, we term the empty span Empty and in the latter case, we term it Var.

The type Wild is a constructor of an index (implemented as integer), reflecting a self-loop in the automaton that consumes any symbol in the input while producing a fixed symbol in the output. The type Frag is a constructor of a tuple of indicies, representing a transition from one index to another in the automaton producing a symbol. The abstraction of a tuple of spans is implemented in Mcfgcky as a list of spans, as OCaML permits homogeneous lists of arbitrary length.

Overall, the record item type maps a single MCFG category to potentially more than one segment of the input string, and permits the arguments of MCFG rewrite rules to be non-concatenate.

Implementation

The field point operationalizes the backpointing system required to fashion a parser from a recognizer. Throughout parsing, a backpointing hash table is populated in tandem with the chart, and point simply identifies the key in this table which corresponds to this item.

Implementation

At the end of parsing, the pointers for items which meet the success criterion are read off, and serve as the starting points for the backpointing traversal function, checkmcfg or checkmg, which yields derivation trees corresponding to the MCFG or MG derivation tree, respectively.

Finally, the field prob represents the probability of the item in the intersection grammar, which serves probabilistic parsing.

3.2 Inference Rules

Mcfgcky can handle grammars whose rules are unary or binary. As a result of this, Mcfgcky employs three main inference rules: unaryf, which is unary; left and right, which are binary. left and right refers simply to which member of the binary rule is in the trigger and which is in the chart, and do not refer to relative positions in the string. These rules all take as their arity an agenda og, a grammar g, the pointer hashtable phash, a trigger item trigger, and the chart chart.

let left og g phash trigger chart =

for comp=0 to n in chart

for rule=0 to n in

List.filter (rule’s first rhs argument = trigger.cat) grammar

if List.length (righthandside(rule)) =2 and

trigger.cat = (righthandsides(rule).[0]) and

comp.cat = (righthandsside(rule).[1]] then f(rule,trigger,comp)::og else og

left is seen in the above pseudocode to map a given trigger item to all members of the of the chart and with all rules in the grammar containing the category of the trigger item as first argument (second argument in right, only argument in unaryf). left,right,unary simply match premises to pertinent axiom schemata, returning (rule, trigger, comp) triples (or a (rule, trigger) tuple in the unary case) where the categories of trigger and comp match the right hand side of rule in the correct order. The final inference is implemented separately as the evaluation of the yield function f on the (rule, trigger, comp) triple, by the populate, populate2, and concat functions. populate2 simply builds a new item given the (rule, trigger, comp) triple, in the binary case, and populate does likewise in the unary case. populate2 and populate both include sanity checks, and both use the concat function to determine the pos field of the new item.

Implementation

As the pos field of the item is a span list, concat simply takes as arguments lists of spans (or span list list) and returns their concatenation, a span list; e.g. concat [[Fragment(2,3);Fragment(3,4); Fragment(4,5)];[Fragment(5,6)]] will return the concatenate [Fragment (2,5);Fragment(5,6)], while concat [Fragment(2,3);Wild(4,5);Fragment(5,7)] will return [Fragment(2,3);Fragment(4,7)]. span list list is given a type definition as an anonitem; as OCaML lists are homogenous and of arbitrary length, the use of the outer list and the temporary data structure enables f to take as input an arbitrary number of arguments (or equivalently, an n-tuple of arbitrary length)

Recall that a Multiple Context Free Grammar rule has arguments and yield that are tuples of strings. Whereas the yield function of a context-free grammar rewriting rule conflates category rewriting and string concatenation, the yield function of a multiple context free grammar combines the category rewriting and string concatenation operations in a more complex mapping. Revisiting again the example of a context free grammar rewriting rule...

α → β γ (14)

In a context free grammar, β must map onto a portion of the string which is prior to (left of) the part of the string which γ maps onto. This is not so in a Multiple Context Free Grammar.

A₀ → f[A₁,A₂,...A_q] (15)

t7 → t4 t5 t6 [(2,0)][(1,0);(0,0);(1,1)] (16)

The yield function f is free to interpolate and concatenate the members of its argument tuples independently. It follows that the MCFG parser must delineate the concatenation function from the inference rules somewhat, as a given inference rule may not exhibit any instance of concatenation.

Implementation

3.3 Agenda and Exhauastion Mechanism

Implementation

3.4 Backpointing

Mcfgcky uses a separate chart to record derivational information needed to print analyses. As each items is inserted into the main chart, a key,value pair is recorded in the backpointer chart, and the backpointer key recorded in the main chart item’s backpointer field.

When the exhaustion mechanism closes the chart, and items matching the success criterion are retrieved, the pointers can be read off the success items and seed the backpointing mechanism.

Implementation

3.5 Output

In parsing modes, Mcfgcky can output a variety of derivation tree outputs in LaTeX, plain text, or dot. It can print out either MCFG derivation trees or MG derivation trees, provided the corresponding dictionary file from the Guillaumin [2004] is provided for the reverse compilation of MCFG categories back into MG feature prefixes. It does not at present output MG derived trees. Mcfgcky also will print out a product-sum graph in both parsing and statistical modes.

Implementation

In statistical modes, Mcfgcky prints out reports with pertinent information. In the verbose statistical mode, Mcfgcky prints out the treebank grammar, derivation grammars, inside probabilities, h^→, and fertility matrices, and entropies for each prefix. vech and the fertility matrix could probably be shunted off into a debugging mode. In the quiet statistical mode, Mcfgcky prints out surprisals, entropies, and inside probability of the start symbol (prefix probabilities) for each prefix.

Implementation

3.6 Training

Implementation

3.7 Testing

Nederhof and Satta [2008] show that the number of iterations to convergence is dependent on the spectral radius. Convergence can be out of reach in pathological cases with sr ≈ 0.9, but convergence is readily obtainable for sr ≤ 0.7, which generally holds on natural language grammars.

Setting n = 100 should be sufficient for almost all natural language grammars trained from corpora, but caution is warranted when grammars have large spectral radii. We ran a series of tests on gazdar.mcfg, which is a quite loopy CFG with a spectral radius of INSERT. 15 show the results of parses of three and four word prefixes being parsed as n is incremented. The X-axes depict n, with entropies on the Y-axises. Entropy scores converged out to 16 significant digits around 50 n.

Figure 15: Convergence of gazdar.cfg on three and four word prefixes

Implementation

3.8 Inside

Implementation

3.9 Surprisal

3.10 Entropy

Implementation