Mcfgcky is a bottom-up chart parser in the style of Shieber et al. [1995]. As such, it approaches parsing as a deduction problem, where the axia are the string terminals and a set of inference rules specifies the formulation of derived categories. Both the axia and derived items are contained within a chart which records the entirety of the bottom-up deduction and thus eliminates any need for backtracking.

As such, the basic data structure in Mcfgcky is the chart item, which represents some hypothesis made by the recognizer for the category corresponding to a certain string segment. In Mcfgcky, this structure is of the OCaML type *item*, such as follows.

- type item = {prob:float;pointer:int;cat:string;pos:span list}

The items populating the chart of a MCFG recognizer in the Shieber et al. [1995] style, like that of a CFG CKY recognizer, must minimally contain a category label (cat) and a mapping from that category to the input string (pos). In a CKY recognizer, this representation of the input string segment often takes the form of a tuple of integers (*x*,*y*), where x and y denote the positions which begin and end the string. This convention succinctly and uniquely identifies the string segment
which the parser has attached a category label, and determines which items are concatenable. As the yield function of an MCFG rules takes not strings but tuples of strings as arguments, and returns such, an item’s mapping of categories to the input string is more complex. In Mcfgcky, the field *pos* achieves this more complex mapping as a list of *spans*, where a span represents some maximal contiguous item contained within an MCFG tuple.

The parametrized type *span* implements the range of automata we wish to parse.

The Empty and Var types represent categories and portions of categories that are not directly instantiated by the input string. In the automaton, these types implement epsilon transitions which generate a symbol without consuming an input symbol. The distinction between Empty and Var in the MCFG parser repects a distinction made by the Guillamin compiler from MG. A span with empty string yield could correspond to an MG category which has no string yield, such as a functional head. It could also correspond to an ad hoc category fashioned by the compiler to fascilitate such rules as head movement and adjunction. In the former case, we term the empty span Empty and in the latter case, we term it Var.

The type Wild is a constructor of an index (implemented as integer), reflecting a self-loop in the automaton that consumes any symbol in the input while producing a fixed symbol in the output. The type Frag is a constructor of a tuple of indicies, representing a transition from one index to another in the automaton producing a symbol. The abstraction of a tuple of spans is implemented in Mcfgcky as a list of spans, as OCaML permits homogeneous lists of arbitrary length.

Overall, the record item type maps a single MCFG category to potentially more than one segment of the input string, and permits the arguments of MCFG rewrite rules to be non-concatenate.

The field *point* operationalizes the backpointing system required to
fashion a parser from a recognizer. Throughout parsing, a
backpointing hash table is populated in tandem with the chart, and
*point* simply identifies the key in this table which corresponds to
this item.

At the end of parsing, the pointers for items which meet the success criterion are read off, and serve as the starting points for the backpointing traversal function, *checkmcfg* or *checkmg*, which yields derivation trees corresponding to the MCFG or MG derivation tree, respectively.

Finally, the field *prob* represents the probability of the item in the intersection grammar, which serves probabilistic parsing.

Mcfgcky can handle grammars whose rules are unary or binary. As a result of this, Mcfgcky employs three main inference rules: **unaryf**, which is unary; **left** and **right**, which are binary. **left** and **right** refers simply to which member of the binary rule is in the trigger and which is in the chart, and do not refer to relative
positions in the string. These rules all take as their arity an agenda **og**, a grammar **g**, the pointer hashtable **phash**, a trigger item **trigger**, and the chart **chart**.

- let left og g phash trigger chart =
for comp=0 to n in chart

for rule=0 to n in

List.filter (rule’s first rhs argument = trigger.cat) grammar

if List.length (righthandside(rule)) =2 and

trigger.cat = (righthandsides(rule).[0]) and

comp.cat = (righthandsside(rule).[1]] then f(rule,trigger,comp)::og else og

**left** is seen in the above pseudocode to map a given trigger item to all members of the of the chart and with all rules in the grammar containing the category of the trigger item as first argument (second argument in right, only argument in unaryf). **left**,**right**,**unary** simply match premises to pertinent axiom schemata, returning (rule, trigger, comp) triples (or a (rule, trigger) tuple in the unary case) where the categories of trigger and comp match the right hand side of rule in the correct order. The final inference is implemented separately as the evaluation of the yield function *f* on the (rule, trigger, comp) triple, by the *populate*, *populate*2, and *concat* functions. *populate*2 simply builds a new item given the (rule, trigger, comp) triple, in the binary case, and *populate* does likewise in the unary case. *populate*2 and *populate* both include sanity checks, and both use the *concat* function to determine the *pos* field of the new item.

As the pos field of the item is a span list, *concat* simply takes as arguments lists of spans (or span list list) and
returns their concatenation, a span list; e.g. concat [[Fragment(2,3);Fragment(3,4); Fragment(4,5)];[Fragment(5,6)]] will return the concatenate [Fragment (2,5);Fragment(5,6)], while concat [Fragment(2,3);Wild(4,5);Fragment(5,7)] will return [Fragment(2,3);Fragment(4,7)]. *span* *list* *list* is given a type definition as an *anonitem*; as OCaML lists are homogenous and of arbitrary length, the use of the outer list and the temporary data structure enables *f* to take as input an arbitrary
number of arguments (or equivalently, an n-tuple of arbitrary length)

Recall that a Multiple Context Free Grammar rule has arguments and yield that are tuples of strings. Whereas the yield function of a context-free grammar rewriting rule conflates category rewriting and string concatenation, the yield function of a multiple context free grammar combines the category rewriting and string concatenation operations in a more complex mapping. Revisiting again the example of a context free grammar rewriting rule...

α → β γ (14) |

In a context free grammar, β must map onto a portion of the string which is prior to (left of) the part of the string which γ maps onto. This is not so in a Multiple Context Free Grammar.

A_{0} → f[A_{1},A_{2},...A_{q}]
(15) |

t7 → t4 t5 t6 [(2,0)][(1,0);(0,0);(1,1)]
(16) |

The yield function *f* is free to interpolate and concatenate the members of its argument tuples independently. It follows that the MCFG parser must delineate the concatenation function from the inference rules somewhat, as a given inference rule may not exhibit any instance of concatenation.

Mcfgcky uses a separate chart to record derivational information needed to print analyses. As each items is inserted into the main chart, a key,value pair is recorded in the backpointer chart, and the backpointer key recorded in the main chart item’s backpointer field.

When the exhaustion mechanism closes the chart, and items matching the success criterion are retrieved, the pointers can be read off the success items and seed the backpointing mechanism.

In parsing modes, Mcfgcky can output a variety of derivation tree outputs in LaTeX, plain text, or dot. It can print out either MCFG derivation trees or MG derivation trees, provided the corresponding dictionary file from the Guillaumin [2004] is provided for the reverse compilation of MCFG categories back into MG feature prefixes. It does not at present output MG derived trees. Mcfgcky also will print out a product-sum graph in both parsing and statistical modes.

In statistical modes, Mcfgcky prints out reports with pertinent
information. In the verbose statistical mode, Mcfgcky prints out the
treebank grammar, derivation grammars, inside probabilities,
*h*^{→}, and fertility matrices, and entropies for each
prefix. *vec**h* and the fertility matrix could probably be shunted
off into a debugging mode. In the quiet statistical mode, Mcfgcky
prints out surprisals, entropies, and inside probability of the start
symbol (prefix probabilities) for each prefix.

Nederhof and Satta [2008] show that the number of iterations to convergence is dependent on the spectral radius. Convergence can be out of reach in pathological cases with *sr* ≈ 0.9, but convergence is readily obtainable for *sr* ≤ 0.7, which generally holds on natural language grammars.

Setting *n* = 100 should be sufficient for almost all natural language grammars trained from corpora, but caution is warranted when grammars have large spectral radii. We ran a series of tests on gazdar.mcfg, which is a quite loopy CFG with a spectral radius of INSERT. 15 show the results of parses of three and four word prefixes being parsed as n is incremented. The X-axes depict n, with entropies on the Y-axises. Entropy scores converged out to 16 significant digits around 50 n.