Implementation

4 Implementation

Myqueue implements a functional queue structure used in the agenda mechanism, and Seq (as coded in Paulson [1996] implements lazy lists used in point.ml to retrieve individual trees from the forest. Utilities provides data manipulation utilities. Print and Output contain the string-manipulation and output utilities used in printing different outputs to the toplevel, to the standard out, and to file.

Item contains the item record type, which uses the string cat type for category labels, the span type for mapping to the input string, and the integer point type for the backpointing mechanism. Concat implements the computation of string yield functions, which perform concatenations and range restrictions, and also contains several boolean check functions which ensure that derived items are interpretable.

Grammar contains the types related to the mcfg grammar data structure. Mcfgread and Mcfglex are a lexer/parser combination used for reading in this grammar object from an mcfg file, which is output by the guillamin compiler.

Rules implements the inference rule schemeta which are used by the parsing engine, while Mcfgcky contains the parsing engine. Point implements the packed shared forest. Decompile handles reverse compilation from MCFG to MG if desired. Dictlex and Dictread provide a lexer/parser used to read in a dictionary file provided by the Guillamin compiler. This dictionary file consists of tuples of MCFG categories and their MG interpretations, and is used by Mcfgcky to set the success criterion and to provide a translation of the MCFG parse into MG.

Fileread/Filelex,Corpread/Corplex, and Pcfgread/Pcfglex all read in unweighted corpora, weighted corpora, and pcfg for statical mode parsing. Fileread/filelex is also used via batch processing in parsing mode. Pcfg implements basic structures for probabilistic parsing at training time and testtime. Train trains up a model from a corpus using Weighted Relative Frequency Estimation over a treebank generated by the parser over the corpus from the grammar. Pcfglex/Pcfgread read in an arbitrary PCFG directly. Test implements the testtime functionality of computing intersection grammars and associated data which are prerequisite for entropy computations. Dcfggraph uses ocamlgraph to topologically sort the grammar into buckets. Inside computes the inside probability of catgeories and calculates renormalized probabilities of the intersection grammars. Entropy uses Grenander’s Theorem and camlgsl to invert a fertility matrix and obtain entropies over probabilistic grammars.

Print is responsible for printing to standard out and providing string translations of data structures, while Output assembles reports that are output to a desired file. Dcfgdisplay uses dcfggraph to output product-sum graphs if desired.

4.1 Utilities, Basic Data Structures, and Input Modules

4.1.1 Module `Utilities`

Basic utilities for mcfgcky.

4.1.2 Module `MyQueue`

4.1.3 Module `Seq`

4.1.4 Module `Mcfgread`

Read in a Multiple Context Free Grammar in Albro [] notation.

4.1.5 Module `Fileread`

Read in an unweighed corpus in basic list format.

4.1.6 Module `Corpread`

Read in an weighed corpus.

4.1.7 Module `Pcfgead`

Read in an arbitrary model in probabilistic context free grammar format.

4.2 Parsing Modules

4.2.1 Module `Item`

An item in our chart is a tuple of a category and its range-restricted string yield. The string yield is tuple-valued, where each component could either correspond to an actual string, an empty string, or Kleene star.

type cat = string

Categories are simple strings.

type index = int

We represent an item’s position in the string using an index of type int.

type span =
  | Var
  | Emp
  | Wild of index
  | Fragment of index * index
  | Hyp of index
  | Fail

A span is a component of the tuple-valued range-restriction of the string yield. Each span represents an automaton transition, so that the total string yield represents the sequence of automaton transitions that are necessary to derive the decorated item. Vars and Emps represent epsilon transitions in our system; they are distinguished in MCFG parsing because they have unique status in the MG items that gave rise to them. Vars represent the empty string yields of functional head categories in the source MG, whereas Emptys represent the empty string yields of ad-hoc categories introduced solely by the Guillaumin [2004] compiler.
Wilds represent Kleene closures in string yields, and serve to reflect that a given category may be present in the right context of a prefix. A Fragment (x,y) represents an automaton transition from index x to index y where y > x. We represent a range-restriction as a Fragment when that span’s derivation maps on to at least one instance of a real string. Thus, our string concatenation rules define an algebra (a semiring) over tuples of strings. Loosely, we realize (+) as string concatenation, and (*) as a function over tuples of strings yielding tuples of strings. The concatenation of a Fragment (x,y) and a Wild (y) is Fragment (x,y), where Wild(y) is a zero-item for concatenation. A (linear, non-erasing) string yield function for a unary rules which takes the 0th member of a child’s string yield is the null-item for (*).

type item = {
  cat : cat ;
  pos : span list ;
}

An item is a category situated over (range-restricted by) an n-tuple of spans in the string. This homogenous n-tuple of spans is implemented as a list of spans. An item is implemented as a record with category (cat) and string yield (pos) fields.

4.2.2 Module `Grammar`

Grammar.ml implements the data types and functions used in Multiple Context Free Grammar. Whereas a Context Free Grammar production conflates a symbol-forming operation and a concatenation operation, MCFG divorces abstract and concrete syntax by allowing arbitrary string yield functions with more powerful operations than concatenation. The domain and range of (Linear) MCFG string yield functions is defined over yields of type tuple of strings; this is guranteed by the linearity of (L)MCFG without which, MCFG would be Turing-equivalent.

type daughterID = int

type member_of_tupleID = int

type componentSpec = daughterID * member_of_tupleID

type rewriter =
  | Func of componentSpec list list
  | Terminal

We implement our string yield functions according to the Albro notation. For a rule A → B C [(0,0);(1,0)][(0,1)], we evaluate the string yield of A as a tuple with two components (where each component is indicated in brackets), where the first component is the concatenation of respective 0th members of string yields of B and C, and the second component is the 1th component of the 0th category, B. When A is a preterminal, we indicate as such with the category Terminal. daughter_ID indicates which category is referred to; member_of_tupleID indicates which component of some category’s string yield is referred to; such that componentSpec can uniquely indicate a span in B or C. Rewriter implements the tuple-valued rewrite function as a list of componentSpec.

type mcfgrule = Item.cat list * rewriter

An mcfgrule has abstract syntax (of type ’cat list’, where we always take the 0th member of the list to refer to the parent, and the other members of the list to refer to the children) and concrete syntax (of type rewriter).

type pmcfg = mcfgrule list

A (Parallel) Multiple Context Free Grammar is a list of mcfgrules.

val tupler : ’a * rewriter -> rewriter

Given a mcfgrule, obtain the string yield function.

val cats_of_rule : ’a * ’b -> ’a

Given a mcfgrule, obtain the abstract syntax as a list of categories.

val lhs : ’a list * ’b -> ’a

Given a mcfgrule, obtain the parent category (left hand side) of the rule.

val rhs : ’a list * ’b -> ’a list

Given a mcfgrule, obtain the children categories (right hand side) of the rule.

type opt_mcfg = {
  parent_ord : (Item.cat, mcfgrule) Hashtbl.t ;
  lchild_ord : (Item.cat, mcfgrule) Hashtbl.t ;
  rchild_ord : (Item.cat, mcfgrule) Hashtbl.t ;
  onchild_ord : (Item.cat, mcfgrule) Hashtbl.t ;
  ispreterm_ord : (bool, mcfgrule) Hashtbl.t ;
}

Implement an optimized MCFG object where any quantity of interest asked for by rules.ml is indexed at training time and retirevable in constant time at testing time. Another optimization would be to index by linearization arity as well; this would be relatively easy to implement, but may not be of great use: our MCFG categories generally have constant lineaarization arity.

val orderbyparent :
  (’a list * ’b) list -> (’a, ’a list * ’b) Hashtbl.t -> unit

Populate a hash table where each rule is indexed by parent.

val orderbylchild :
  (’a list * ’b) list -> (’a, ’a list * ’b) Hashtbl.t -> unit

Populate a hash table where each rule is binary and indexed by the left child.

val orderbyrchild :
  (’a list * ’b) list -> (’a, ’a list * ’b) Hashtbl.t -> unit

Populate a hash table where each rule is binary and indexed by the right child.

val orderbyonchild :
  (’a list * ’b) list -> (’a, ’a list * ’b) Hashtbl.t -> unit

Populate a hash table where each rule is unary and indexed by the only child.

val orderbypreterm :
  (string list * rewriter) list ->
  (bool, string list * rewriter) Hashtbl.t -> unit

Populate a hash table where each rule is a preterminal production, indexed by the preterminal.

exception WeirdRule of mcfgrule

Exception for abberant mcfg rules.

class grammar : pmcfg ->

object

method get_rules_for : Item.cat -> Grammar.mcfgrule list

Get any rules where category is a parent.

method get_rules_left_child : Item.cat -> Grammar.mcfgrule list

Get any binary rules where category is a left child.

method get_rules_right_child : Item.cat -> Grammar.mcfgrule list

Get any binary rules where category is a right child.

method get_rules_only_child : Item.cat -> Grammar.mcfgrule list

Get any unary rules where category is a child.

method get_all_empty_preterm : Item.cat list

Get any preterminals whose string yield is empty.

method get_all_nonempty_preterm : Item.cat list

Get any preterminals whose string yield is non-empty.

method get_all_empty : Item.cat list

Get any preterminals whose string yield is empty (deprecated).

method get_all_head : Item.cat list

Get any preterminals whose string yield is non-empty (deprecated).

method get_all_cats : Item.cat list

Get all categories in the grammar.

method get_start_symbol : string list

- end

Get the start symbol of the grammar.

val opt_gram : Grammar.opt_mcfg

The optimized grammar object.

val unopt_gram : Grammar.pmcfg

Return the list-formatted unoptimized grammar.

method get_all_rules : Grammar.pmcfg

Get all the rules in the grammar.

4.2.3 Module `Point`

Implements the backpointer system and subsumption check, which we use to construct a packed parse forest.

type point =
  | Term
  | Unary of Item.item * Grammar.componentSpec list list
  | Binary of Item.item * Item.item * Grammar.componentSpec list list

Like in CFG, a pointer gives us the abstract syntax used to enter some category, by referring to either one or two parents or a terminal. Unlike CFG, it should also indicate the concrete syntax used to form it, by including the string yield function which was used.

type sitrule = Item.item * point

Situated rules are non-terminals decorated or range restricted with their string yields. This could be cleaned up by having one parameterized type, which is either Situated of Item.item * point, or Nonsituated of Item.cat * catlist.

type sitcfg = sitrule list

type dcfgrhs =
  | Bin of Item.cat * Item.cat
  | Un of Item.cat
  | T

A Dcfg gives us unsituated rules back from the situated parse forest, for when we need parse trees to train rules with unsituated categories. This works effectively, but is conceptually clunky.

type dcfgrule = Item.cat * dcfgrhs

type dcfg = dcfgrule list

exception StrangeChart

Exception for when our chart is strange.

val crossproduct : ’a list -> ’b list -> (’a * ’b) list

Cross product of two lists.

val string_of_span : Item.span -> string

val sitcfgOfChart : (Item.item, point) Hashtbl.t -> Item.item -> sitcfg

Obtain the dcfg from a chart given the pointer hash table and the current parent/key to work on.

val wlength : string -> int

The length of the sentence that the situated start symbol needs to span over.

val sit_start_symbol : Item.index -> Item.item

Given wlength, return the situated start symbol.

val gr_sort : (Item.item * ’a) list -> (Item.item * ’a) list

Guarantee that items are sorted longest to shortest, with start symbol first. Not strictly necessary, but should speed processing.

val sortedcfgOfChart :
  (Item.item, point) Hashtbl.t -> Item.item -> sitrule list

Sorted pcfg from chart according to the above.

val desitCFG : (Item.item * point) list -> (Item.cat * dcfgrhs) list

For a situated context free grammar, return the desituated dcfg version

val catlist_of_dcfg : (’a * dcfgrhs) list -> ’a list

val catarray_of_dcfg : (’a * dcfgrhs) list -> ’a array

Return the list of categories in a dcfg. Used to index the momentum matrix.

val abstract : Item.item -> Item.cat

Desituate a situated category.

val abstract_rule : point -> dcfgrhs

Desituate a situated rule.

val range : int -> int -> int list

A range operator, which I’m sure is implemented elsewhere

val success_items : string -> Item.item list

Given a string, return a list of successful situated start symbols according to the approp string length

class forest :

object

val point_hash : (Item.item, Point.point) Hashtbl.t Pervasives.ref

A hashtable of pointers which provides the main functionality of the packed parse forest. In trunk, this is a standalone hash table, not bundled into the forest object.

val subsumptionok : ’a -> (’a, ’b) Hashtbl.t -> bool

Boolean check for whether parent item is already in chart; if it is, include the new pointer but don’t add the item again.

val (@@) : ’a Seq.seq -> ’a Seq.seq -> ’a Seq.seq

val (@::) : ’a -> ’a Seq.seq -> ’a Seq.seq

val lcrossproduct : ’a Seq.seq -> ’b Seq.seq -> (’a * ’b) Seq.seq

val treesOfChart :
  (Item.item, point) Hashtbl.t ->
  Item.item -> Item.item Utilities.stufftree list

Operators for lazily extracting a sample of parse trees out of the forest. Could be improved.

4.2.3.1 Experimental Kilbury Parsing

val cathash : (Item.cat, Item.item) Hashtbl.t Pervasives.ref

val chart : Item.item list Pervasives.ref

val grams_of_chart : (Item.item, Point.sitcfg) Hashtbl.t

method cats : (Item.cat, Item.item) Hashtbl.t

method pointers : (Item.item, Point.point) Hashtbl.t

method get_sit_cats : Item.cat -> Item.item list

method subsume_ok : Item.item -> bool

method add_pointer : Item.item -> Point.point -> unit

method add_cat : Item.item -> unit

method add_to_chart : Item.item -> Item.item list

method get_chart : Item.item list

method incr_chart : unit

method incr_pointers : unit

method incr_cats : unit

method hash_grams : string -> unit

method sitgram : Item.item -> Point.sitcfg

- end

4.2.4 Module `Concat`

Concat.ml implements concrete syntax via string yield functions for MCFG.

type anonitem = Item.span list list

An anonitem is an intermediate representation of the string yield function.

4.2.4.1 Integrity Checks

val reasonable_ok : Item.span -> bool

Check that indices on a given range restriction component (span) are sensible. Frag(x,y) requires 0 ≤ x ≤ y. Wild(x) requires 0 ≤ x.

val concatenable_ok : Item.span list -> bool

Check that a given instance of concatenation is sensible. A concatnenation [(x,y);(x₁,y₁...] requires that y = x₁, etc.

val overlap : Item.span -> Item.span -> bool

val overlap_ok : Item.span list -> bool

Overlap_ok applies overlap to deliver T iff candidate range restrictions don’t have overlapping spans, F o/w. A range restricted string yield (Frag(x,y),Frag(x₁,y₁)) requires that ¬ ∃ z . (x ≤ z ≤ y) ∧ (x₁ ≤ z ≤ y₁)

4.2.4.2 Sanity Concatenation Checks

val wholerule_ok : ’a list list -> anonitem -> bool

T iff ensure that entire mcfg rule is used in creating item, and rule was used non-erasingly, F o/w.

val concat_fail : Item.span list -> bool

Checks to see if item contains the Fail itemlet, created by bad, incomplete concatenation.

exception ConcatFail of string

Exception thrown if concatenation fails, to be caught safely by rules.ml which will not populate the chart in this instance.

4.2.4.3 Concatenation Functions

val concat : Item.span list -> Item.span

; is the concatenation function, defined on range restrictions. Concatenation is not a total function, as things ot be concatenated need to be adjacent (for [Frag(x,y);Frag(x₁,y₁], concatenation is valued only when y = x₁). Given a span list, concatenates into a single span. Nonsense concatenations should return the Fail span type, propogating information up to the sanity check
Emp and Var are ’free’ in the string and place no restrictions on where in the string they can be concatenated in; thus they are cached out at concatenation.

exception GrabNth of string

val grab : Grammar.componentSpec -> anonitem -> Item.span

Given an anonitem, returns the nth member of the nth tuple. Throws GrabNth is member is out of range of the arity of the tuple.

val mapconcat : Grammar.componentSpec list -> anonitem -> Item.span

Map concatenation across all the gets in a single spec list– eg (0,1);(1,0). Each instance of mapconcat builds one component of the parent’s string yield tuple.

val f : Grammar.componentSpec list list -> anonitem -> Item.span list option

The yield function maps mapconcat across the anonitem with the whole rewrite–eg (0,1);(1,0)0,0.

4.2.5 Module `Rules`

Rules.ml defines the inference rules we use to populate our chart, faciliating Shieber et al. [1995]-style parsing as deduction. This modularity would allow relatively easy adaption of all the statistical stuff to say, Earley parsing. It also handles the actual population of the chart.

val unfunc : Grammar.rewriter -> Grammar.componentSpec list list

Unbox the rewriter type. Doing this sparingly instead of at runtime every time a string yield function is called actually yields a signifcant performance gain.

4.2.5.1 Chart Population Functions

val populate2 :
  Item.cat list * Grammar.rewriter ->
  Item.item ->
  Item.item ->
  < add_pointer : Item.item -> Point.point -> ’a;
    add_to_chart : Item.item -> ’b list;
    pointers : (Item.item, Point.point) Hashtbl.t;
    subsume_ok : Item.item -> bool; .. > ->
  ’b list

Populate2 factors out population, item checking, and string yield evaluation from the left and right inference rules. It applies the ’concrete syntax’ by calling concat.ml with the specified yield function. It checks as well that the string yield function is sensible before sending it off to concat.ml.

val populate :
  Item.cat list * Grammar.rewriter ->
  Item.item ->
  < add_pointer : Item.item -> Point.point -> ’a;
    add_to_chart : Item.item -> ’b list;
    pointers : (Item.item, Point.point) Hashtbl.t;
    subsume_ok : Item.item -> bool; .. > ->
  ’b list

Populate2 factors out population, item checking,a nd string yield evaluation from the unary inference rules. It applies the ’concrete syntax’ by calling concat.ml with the specified yield function. It checks as well that the string yield function is sensible before sending it off to concat.ml.

val is_nonzero_prob :
  Item.cat list * ’a -> (Item.cat, Pcfg.pcfgrule) Hashtbl.t option -> bool

Boolean check to ensure that items at training time have nonzero probability. This is well-tested, and does not effect accuracy at the expense of performance so long as inside renormalization is used.

val non_erasing_bin : ’a list -> ’b list -> Grammar.rewriter -> bool

A cheap filter to reduce calls to concat.ml, which implements a cheap approximate and conservative (false negatives but not false positives) check that the rule is non-erasing. Checks that #comps in x + #comps in y = linearzation arity of rule r.

val non_erasing_un : ’a list -> Grammar.rewriter -> bool

A cheap filter to reduce calls to concat.ml, which implements a cheap approximate and conservative (false negatives but not false positives) check that the rule is non-erasing. Checks that #comps in x + #comps in y = linearzation arity of rule r.

4.2.5.2 Inference Rules

val left :
  < get_rules_left_child : Item.cat -> (Item.cat list * Grammar.rewriter) list;
    .. > ->
  (Item.cat, Pcfg.pcfgrule) Hashtbl.t option ->
  < add_pointer : Item.item -> Point.point -> ’a;
    add_to_chart : Item.item -> ’b list;
    get_sit_cats : Item.cat -> Item.item list;
    pointers : (Item.item, Point.point) Hashtbl.t;
    subsume_ok : Item.item -> bool; .. > ->
  Item.item -> ’c -> ’b list

Given the left member of a rhs binary rule, map over all pertinent rules and all pertinent items with the correct right member.

val right :
  < get_rules_right_child : Item.cat -> (Item.cat list * Grammar.rewriter) list;
    .. > ->
  (Item.cat, Pcfg.pcfgrule) Hashtbl.t option ->
  < add_pointer : Item.item -> Point.point -> ’a;
    add_to_chart : Item.item -> ’b list;
    get_sit_cats : Item.cat -> Item.item list;
    pointers : (Item.item, Point.point) Hashtbl.t;
    subsume_ok : Item.item -> bool; .. > ->
  Item.item -> ’c -> ’b list

Given the right member of a rhs binary rule, map over all pertinent rules and all pertinent items with the correct left member

val unaryf :
  < get_rules_only_child : Item.cat -> (Item.cat list * Grammar.rewriter) list;
    .. > ->
  (Item.cat, Pcfg.pcfgrule) Hashtbl.t option ->
  < add_pointer : Item.item -> Point.point -> ’a;
    add_to_chart : Item.item -> ’b list;
    pointers : (Item.item, Point.point) Hashtbl.t;
    subsume_ok : Item.item -> bool; .. > ->
  Item.item -> ’c -> ’b list

Given the only child of a rule, map over all pertinent unary rules with this member.

4.2.5.3 Ground Axia/Initial Chart Population

val empties :
  < get_all_empty : Item.cat list; get_all_head : Item.cat list; .. > ->
  ’a -> < add_pointer : Item.item -> Point.point -> ’b; .. > -> Item.item list

Populate all empties.

val can_hypo : string -> bool

Populate the chart based on any ellipsis cues (so far these are indicated with #).

val items_of_just_string :
  < get_all_nonempty_preterm : Item.cat list;
    get_rules_only_child : string -> (Item.cat list * ’a) list; .. > ->
  string ->
  < add_pointer : Item.item -> Point.point -> ’b; .. > -> Item.item list

Populate all nonterminals for the string.

4.2.5.4 Kilbury Style Inference Rules

val items_of_token :
  < get_all_nonempty_preterm : Item.cat list;
    get_rules_only_child : string -> (Item.cat list * ’a) list; .. > ->
  string ->
  < add_pointer : Item.item -> Point.point -> ’b; .. > ->
  Item.index -> Item.item list

Inference rules for kilbury. Experimental and not accurate.

val populate2_up :
  Item.cat list * Grammar.rewriter ->
  Item.item ->
  Item.item ->
  < add_pointer : Item.item -> Point.point -> ’a;
    add_to_chart : Item.item -> ’b list;
    pointers : (Item.item, Point.point) Hashtbl.t;
    subsume_ok : Item.item -> bool; .. > ->
  ’c -> ’b list

Binary chart population for kilbury. Experimental and not accurate.

val populate_up :
  Item.cat list * Grammar.rewriter ->
  Item.item ->
  < add_pointer : Item.item -> Point.point -> ’a;
    add_to_chart : Item.item -> ’b list;
    pointers : (Item.item, Point.point) Hashtbl.t;
    subsume_ok : Item.item -> bool; .. > ->
  ’c -> ’b list

Unary chart population for kilbury. Experimental and not accurate.

val left_up :
  < get_rules_left_child : Item.cat -> (Item.cat list * Grammar.rewriter) list;
    .. > ->
  (Item.cat, Pcfg.pcfgrule) Hashtbl.t option ->
  ’a ->
  < add_pointer : Item.item -> Point.point -> ’b;
    add_to_chart : Item.item -> ’c list;
    get_sit_cats : Item.cat -> Item.item list;
    pointers : (Item.item, Point.point) Hashtbl.t;
    subsume_ok : Item.item -> bool; .. > ->
  Item.item -> ’d -> ’c list

Binary inference rule for kilbury. Experimental and not accurate.

val right_up :
  < get_rules_right_child : Item.cat -> (Item.cat list * Grammar.rewriter) list;
    .. > ->
  (Item.cat, Pcfg.pcfgrule) Hashtbl.t option ->
  ’a ->
  < add_pointer : Item.item -> Point.point -> ’b;
    add_to_chart : Item.item -> ’c list;
    get_sit_cats : Item.cat -> Item.item list;
    pointers : (Item.item, Point.point) Hashtbl.t;
    subsume_ok : Item.item -> bool; .. > ->
  Item.item -> ’d -> ’c list

Binary inference rule for kilbury. Experimental and not accurate.

val unaryf_up :
  < get_rules_only_child : Item.cat -> (Item.cat list * Grammar.rewriter) list;
    .. > ->
  (Item.cat, Pcfg.pcfgrule) Hashtbl.t option ->
  ’a ->
  < add_pointer : Item.item -> Point.point -> ’b;
    add_to_chart : Item.item -> ’c list;
    pointers : (Item.item, Point.point) Hashtbl.t;
    subsume_ok : Item.item -> bool; .. > ->
  Item.item -> ’d -> ’c list

Unary inference rule for Kilbury. Experimental and not accurate.

val delid : Item.item * Item.item -> bool

Experimental Deletion under Identity Check for Conjunction Parsing. Untested.

val noncyclic : Item.span list * Item.span list -> bool

Experimental Cyclicity Check for Conjunction Parsing. Untested.

val is_hyp : Item.item -> bool

Experimental Hypothesis Check for Conjunction Parsing. Untested.

val rel_hyp : (’a, Item.item) Hashtbl.t -> ’a -> Item.item list

val left_discharge :
  ’a ->
  ’b ->
  < add_pointer : Item.item -> Point.point -> ’c;
    cats : (Item.cat, Item.item) Hashtbl.t;
    points : (Item.item, Point.point) Hashtbl.t;
    subsume_ok : Item.item -> bool; .. > ->
  Item.item -> ’d -> Item.item list

Experimental Disharge Rule for Hypothesis/Conjunction Parsing. Untested.

val right_discharge :
  ’a ->
  ’b ->
  < add_pointer : Item.item -> Point.point -> ’c;
    cats : (Item.cat, Item.item) Hashtbl.t;
    points : (Item.item, Point.point) Hashtbl.t;
    subsume_ok : Item.item -> bool; .. > ->
  Item.item -> ’d -> Item.item list

Experimental Disharge Rule for Hypothesis/Conjunction Parsing. Untested.

val items_of_string :
  < get_all_empty : Item.cat list; get_all_head : Item.cat list;
    get_all_nonempty_preterm : Item.cat list;
    get_rules_only_child : string -> (Item.cat list * ’a) list; .. > ->
  string ->
  < add_pointer : Item.item -> Point.point -> ’b; .. > -> Item.item list

4.2.6 Module `Mcfgcky`

The main parsing module of mcfgcky.

val distribute :
  (Item.item -> ’a -> ’b list) list ->
  Item.item -> < get_chart : ’a; .. > -> ’b list

Distributes a trigger throughout the chart.

val distribute_debug :
  (’a list -> ’b) ->
  (Item.item -> ’c -> ’a list) list ->
  Item.item -> < get_chart : ’c; .. > -> ’a list

Distributes a trigger throughout the chart with debugging output.

val exhaust :
  ’a ->
  < add_cat : Item.item -> ’b; get_chart : ’c; subsume_ok : Item.item -> bool;
    .. > ->
  Item.item Myqueue.BatchedQueue.queue ->
  (Item.item -> ’c -> Item.item list) list -> ’c

Exhaustion mechanism, proceeds until the chart is empty, which to the extent the algorithm is valid and complete will gives us the closure of the ground items under the inference rules.

val exhaust_debug :
  ’a ->
  < add_cat : Item.item -> ’b; get_chart : ’c; subsume_ok : Item.item -> bool;
    .. > ->
  (Item.item list -> ’d) ->
  Item.item Myqueue.BatchedQueue.queue ->
  (Item.item -> ’c -> Item.item list) list -> ’c

Exhaustion mechanism with debugging output.

val parse_debug :
  < get_all_empty : Item.cat list; get_all_head : Item.cat list;
    get_all_nonempty_preterm : Item.cat list;
    get_rules_left_child : Item.cat -> (Item.cat list * Grammar.rewriter) list;
    get_rules_only_child : Item.cat -> (Item.cat list * Grammar.rewriter) list;
    get_rules_right_child : Item.cat -> (Item.cat list * Grammar.rewriter) list;
    .. > ->
  string ->
  (< add_cat : Item.item -> ’b; add_pointer : Item.item -> Point.point -> ’c;
     add_to_chart : Item.item -> Item.item list; get_chart : ’d;
     get_sit_cats : Item.cat -> Item.item list;
     pointers : (Item.item, Point.point) Hashtbl.t;
     subsume_ok : Item.item -> bool; .. >
   as ’a) ->
  ’a

Given a grammar g, sentence string w, and point side-table, run the parsing engine without backpointer retrieval and with debugging statements.

val parse :
  < get_all_empty : Item.cat list; get_all_head : Item.cat list;
    get_all_nonempty_preterm : Item.cat list;
    get_rules_left_child : Item.cat -> (Item.cat list * Grammar.rewriter) list;
    get_rules_only_child : Item.cat -> (Item.cat list * Grammar.rewriter) list;
    get_rules_right_child : Item.cat -> (Item.cat list * Grammar.rewriter) list;
    get_start_symbol : ’a; .. > ->
  string ->
  (< add_cat : Item.item -> ’c; add_pointer : Item.item -> Point.point -> ’d;
     add_to_chart : Item.item -> Item.item list; get_chart : ’e;
     get_sit_cats : Item.cat -> Item.item list;
     pointers : (Item.item, Point.point) Hashtbl.t;
     subsume_ok : Item.item -> bool; .. >
   as ’b) ->
  (Item.cat, Pcfg.pcfgrule) Hashtbl.t option -> ’b

Given a grammar g, sentence string w, and point side-table, parse with the appropriate decompilation of backpointers. Not used in statistical modes.

val parse_gen :
  < get_all_empty : Item.cat list; get_all_head : Item.cat list;
    get_all_nonempty_preterm : Item.cat list;
    get_rules_left_child : Item.cat -> (Item.cat list * Grammar.rewriter) list;
    get_rules_only_child : Item.cat -> (Item.cat list * Grammar.rewriter) list;
    get_rules_right_child : Item.cat -> (Item.cat list * Grammar.rewriter) list;
    .. > ->
  string ->
  (< add_cat : Item.item -> ’b; add_pointer : Item.item -> Point.point -> ’c;
     add_to_chart : Item.item -> Item.item list; get_chart : ’d;
     get_sit_cats : Item.cat -> Item.item list;
     pointers : (Item.item, Point.point) Hashtbl.t;
     subsume_ok : Item.item -> bool; .. >
   as ’a) ->
  ’e -> ’f -> (Item.cat, Pcfg.pcfgrule) Hashtbl.t option -> ’a

Given a grammar g, sentence string w, point side-table and a training pcfg, parse without appropriate decompilation of backpointers for use in statistical modes.

4.2.6.1 Experimental Kilbury Stuff

val all_points : (’a, ’b) Hashtbl.t -> (’a * ’b) list

val parse_up :
  < get_all_nonempty_preterm : Item.cat list;
    get_rules_left_child : Item.cat -> (Item.cat list * Grammar.rewriter) list;
    get_rules_only_child : Item.cat -> (Item.cat list * Grammar.rewriter) list;
    get_rules_right_child : Item.cat -> (Item.cat list * Grammar.rewriter) list;
    .. > ->
  Item.cat ->
  (< add_cat : Item.item -> ’b; add_pointer : Item.item -> Point.point -> ’c;
     add_to_chart : Item.item -> Item.item list; get_chart : ’d;
     get_sit_cats : Item.cat -> Item.item list;
     pointers : (Item.item, Point.point) Hashtbl.t;
     subsume_ok : Item.item -> bool; .. >
   as ’a) ->
  (Item.cat, Pcfg.pcfgrule) Hashtbl.t option -> Item.index -> ’a

Experimental Kilbury stuff, doesn’t work yet.

val parse_kilbury_gen :
  < get_all_empty : Item.cat list; get_all_head : Item.cat list;
    get_all_nonempty_preterm : Item.cat list;
    get_rules_left_child : Item.cat -> (Item.cat list * Grammar.rewriter) list;
    get_rules_only_child : Item.cat -> (Item.cat list * Grammar.rewriter) list;
    get_rules_right_child : Item.cat -> (Item.cat list * Grammar.rewriter) list;
    get_start_symbol : ’a; .. > ->
  string ->
  (< add_cat : Item.item -> ’c; add_pointer : Item.item -> Point.point -> ’d;
     add_to_chart : Item.item -> Item.item list; get_chart : ’e;
     get_sit_cats : Item.cat -> Item.item list; hash_grams : string -> ’f;
     incr_cats : ’g; incr_chart : ’h; incr_pointers : ’i;
     pointers : (Item.item, Point.point) Hashtbl.t;
     subsume_ok : Item.item -> bool; .. >
   as ’b) ->
  ’j -> ’k -> (Item.cat, Pcfg.pcfgrule) Hashtbl.t option -> ’b

Experimental Kilbury stuff, doesn’t work yet.

val parse_kilbury :
  < get_all_empty : Item.cat list; get_all_head : Item.cat list;
    get_all_nonempty_preterm : Item.cat list;
    get_rules_left_child : Item.cat -> (Item.cat list * Grammar.rewriter) list;
    get_rules_only_child : Item.cat -> (Item.cat list * Grammar.rewriter) list;
    get_rules_right_child : Item.cat -> (Item.cat list * Grammar.rewriter) list;
    get_start_symbol : ’a; .. > ->
  string ->
  (< add_cat : Item.item -> ’c; add_pointer : Item.item -> Point.point -> ’d;
     add_to_chart : Item.item -> Item.item list; get_chart : ’e;
     get_sit_cats : Item.cat -> Item.item list; incr_cats : ’f;
     incr_chart : ’g; incr_pointers : ’h;
     pointers : (Item.item, Point.point) Hashtbl.t;
     subsume_ok : Item.item -> bool; .. >
   as ’b) ->
  (Item.cat, Pcfg.pcfgrule) Hashtbl.t option -> ’b

Experimental killbury stuff, doesn’t work yet.

4.3 Statistical Parsing

4.3.1 Module `Pcfg`

Pcfg.ml implements basic types and functions for probabilistic parsing.

type fertility_matrix = Item.item array * float array array

A fertility matrix is an array_matrix of floats indexed by situated items, where the member of the matrix indexed by (x,y) indicates the expected number of y obtained in one rewriting of x, for a probabilistic grammar.

exception RowZero

Throw RowZero when a row of the fertility matrix is entirely zeroes.

type bucketed_key = {
  category : Item.item ;
  index : int ;
}

For inside probability functions, we have to divide the pcfg into ’buckets’, where each ’bucket’ is a strongly connected component of the pcfg’s corresponding graph structure. A bucketed key tells us which bucket of the pcfg the item of interest is located in.

type iteration = int

For inside probability functions, we want to know which ’iteration’ of the inside algorithm a value was evaluated at.

type bucketed_inside_hash = (bucketed_key, iteration * float) Hashtbl.t

Implements a memomoized data structure that permits us to record iterations of the loopy inside algorithm, and look them up in constant time when we need values from a particular iteration or bucket.

type pcfgrule = {
  rhs : Point.dcfgrhs ;
  mutable count : float ;
  mutable prob : float option ;
}

A rule type for training time. Works efficiently but is conceptually clunky, and could be combined with the test time probabilist ruletype Pcfg.renormed_probcfgrule, as a parameterized type.

type pcfg = (Item.cat, pcfgrule) Hashtbl.t

A PCFG is a hashtable of pcfgrules indexed by parent categories.

type renorm_cfgrule = {
  rrhs : Point.point ;
  mutable prior : float option ;
  mutable postr : float option ;
}

A rule type for testing time. Works efficiently but is conceptually clunky, and could be combined with the test time probabilist ruletype Pcfg.probcfgrule, as a parameterized type.

type renorm_cfg = (Item.item, renorm_cfgrule) Hashtbl.t

A renorm_cfg (sic) is a hashtable of renorm_cfgrules pcfgrules indexed by parent items, which are situated.

type t_bank = {
  test_sitcfg : Point.sitcfg ;
  mutable test_pcfg : renorm_cfg option ;
  mutable inside : bucketed_inside_hash option ;
  mutable fmatrix : fertility_matrix option ;
  mutable invert : fertility_matrix option ;
  mutable entropy : (Item.item, float) Hashtbl.t option ;
  mutable hvector : (Item.item, float) Hashtbl.t option ;
}

For a particular situated forest, keep around all of the pertinent data structures in a record.

type testbank = ((string * Item.item) * t_bank) list

List all the intersection probabilistic grammars and t_banks by prefix per sentence.

val catarray_of_pcfg : (’a, ’b) Hashtbl.t -> ’a array

Retrieve the array of categories in a probabilistic context free grammar.

val fmatrix_of_training_pcfg :
  (Item.cat, pcfgrule) Hashtbl.t -> Item.cat array * float array array

Obtain the fertility matrix of the training grammar, for eigenvalue checks and computing entropy of the training pcfg.

val fmatrix_of_pcfg :
  (Item.item, renorm_cfgrule) Hashtbl.t ->
  Item.item array * float array array

Obtain the fertility matrix of an intersection pcfg.

4.3.2 Module `Train`

Train up the corpus-minitreebank for -i mode. Also handles the reading in of pcfg, most of which is done in pcfglex.mll/pcfgread.mly

exception Logzero

Exception if log is taken of 0.0.

exception NegLog of float

Exception if log is taken of a negative number.

type s_bank = {
  weight : float ;
  training_sitcfg : Point.sitcfg ;
  mutable training_cfg : Point.dcfg option ;
}

Basic record type that bundles information we need for training on a particular sentence in a treebank.

type revprob_bankrule = {
  rcount : float ;
  rprob : float ;
}

type revprob_bankcfg = (Item.cat * Point.dcfgrhs, revprob_bankrule) Hashtbl.t option

Optimized PCFG type where a rule probability can be indexed by children for lookup in constant time. Used by inside.ml. Could be hidden behind a fancier pcfg object, but doesn’t create much clunkiness in practice (inside.ml only sees this representation of the pcfg).

type train = {
  sbank : (string, s_bank) Hashtbl.t ;
  mutable by_parent : (Item.cat, Pcfg.pcfgrule) Hashtbl.t option ;
  mutable by_rule : (Item.cat * Point.dcfgrhs, revprob_bankrule) Hashtbl.t option ;
}

Basic record type of a pcfg trained off of a single sentence. A PCFG for an entire treebank can be extracted from all of these sentences easily.

val read_pcfg : string -> train -> unit

Directly read in a training pcfg.

val parse_train :
  string ->
  (’a ->
   string ->
   (< pointers : (Item.item, Point.point) Hashtbl.t; .. > as ’b) ->
   ’c -> (’d, ’e) Hashtbl.t -> ’f option -> ’g) ->
  ’a -> ’b -> ’c -> ’h -> train -> string -> unit

Train up the corpus-minitreebank for -i mode. Requires an initialized parse forest (empty chart), an empty grammar object, and checking functions provided by decompile.ml to train on a weighted corpus. Uses simple weighted relative frequency estimation (Chi 1999).

val countpcfg_init :
  (Item.cat, Pcfg.pcfgrule) Hashtbl.t option ->
  Point.dcfg -> float -> (Item.cat, Pcfg.pcfgrule) Hashtbl.t option

Increments counts across the treebank on a treebank-level pcfg object from the sentence-level information.

val totalcatn : (’a, Pcfg.pcfgrule) Hashtbl.t -> ’a -> float

Sum across all trees in treebank to obtain a count for the number of rules with parent ’a, for the denominator in weighted relative frequency estimation.

val relcount : (’a, Pcfg.pcfgrule) Hashtbl.t -> ’a -> Point.dcfgrhs -> float

For a given rhs (point.dcfgrhs) and a given parent (’a), return the number of counts of that rule across the treebank.

val relfreq : (’a, Pcfg.pcfgrule) Hashtbl.t -> ’a -> Point.dcfgrhs -> float

Find the relative frequency of a rule in the grammar, according to the number of times the parent A (’a) → B (Point.dcfgrhs) found divided by the total count of rhs occurences of A.

val relprobpcfg_init : (’a, Pcfg.pcfgrule) Hashtbl.t -> unit

Populate the pcfg’s hash table object with one relative-frequency estimated rule.

val pcfg_of_train_init : ’a -> train -> unit

Initialize the pcfg object by finding all relative frequencies of all rules. Desituate dcfgs, compile and prob treebank rules in one step.

val trainf :
  bool ->
  string ->
  ’a ->
  ’b ->
  (’c ->
   string ->
   (< pointers : (Item.item, Point.point) Hashtbl.t; .. > as ’d) ->
   ’e -> (’f, ’g) Hashtbl.t -> ’h option -> ’i) ->
  ’c -> ’d -> ’e -> ’j -> ’k -> string -> train -> unit

A higher order training function parameterized by arguments, to pass in to treebank class. The parse can be agnostic as to which way it was trained.

class treebank : (train -> ’a) -> ’b ->

object

val optbank : Train.train

Our main unoptimized treebank structure which is hidden behind the optimized treebank object.

method prob_by_rule : Item.cat * Point.dcfgrhs -> float

method has_par : Item.cat -> bool

Treebank object method which defines a relation which is T if parent is in the grammar, F o/w.

method has_rule : Item.cat * Point.dcfgrhs -> bool

method pcfg : (Item.cat, Pcfg.pcfgrule) Hashtbl.t option

- end

An optimized pcfg object.

type parsorted_drule = {
  pindex : int ;
  size : int ;
  pdefcat : Item.item ;
  ppar : Item.item ;
  prhs : Point.point ;
}

This is for indexing rules by their right hand side to be used by inside.ml, enabling these rules to be sorted into buckets.

type parsorted_dcfg = (Item.item, parsorted_drule) Hashtbl.t

The actual representation of the pcfg which inside.ml sees. This is necessitated partially by the function in ocamlgraph which does our topological sorting (scc_array), and partially the kind of iteration desired through the sort output, where we need an index and index item for each list.

val index_dhash_init :
  (Dcfggraph.D.vertex * Point.point) list ->
  (Dcfggraph.D.vertex, parsorted_drule) Hashtbl.t -> unit

Initialize the optimized pcfg representation which inside.ml sees.

4.3.3 Module `Inside`

Using the treebank grammar and the testtime interesection grammar, computes inside probabilities for each category in the intersection grammar to renormalize that grammar to find the weighted intersection. Compute inside probabilities of all categories in the intersection grammar, and then find the renormalized, situated probability of a rule by:

Because of recursion, a category can contribute inside probability to itself, so must find the limit of the inside probability for each non-terminal. Goodman [1999] shows us this is doable–sort the grammar into strongly connected components, the inside prob of a category C in a bucket B is the suprememum of its value in the inside semiring over B. Then iterate through buckets in topological order, finding the limit of all values in the bucket before moving on.

STRATEGY:

Assuming you call with s each time, and you iterate progressively up., a memoized value is accessible if all of the following hold:

Assumption 1. Previous buckets are static. The grammar’s graph structure can be sorted into its strongly connected components. Assumption 2: You can iterate the inside computation within a given bucket to reach the supremum, then proceed on to the next bucket.

Then we proceed on a memoized, loopy version of the inside algorithm, in which we work in topological order of the buckets. In general, we memoize the inside algorithm by working top down then bottom up. For a given category A in training pcfg R and with productions A → B ₁ C₁, A → B₂ C₂ ..., β(A) depends on probabilities in R and β(B₁), β(C₁) .... If we have no appropriate inside probability value for A in cache, we recur into children and determine the inside probability of children.

For the bucketed, iterative version of this algorithm that can treat looping buckets, we only need to check the value for the most recent iteration of some category, and then determine if that value is appropriate. If a category is its own child, the correct iteration to retrieve for calculating A_i,j is A_i,j−1. If a category only depends on values outside its bucket, it necessarily depends on previous bucket, since we have topologically sorted the buckets and we are working through the buckets bottom-up. Thus, we can compute its inside probability ’statically’, because the inside probabilities of those children aere guaranteed not to change, by Assumptions 1 and Assumptions 2.

If the inside probability of a category A depends on the inside probability of a child B in the same looping bucket i as it, then we iterate through the bucket because the value of β(A) depends on β(B) and vice versa. We proceed by first computing a static value for B which depends only on categories outside i, then iterate through i so that β(A_i,j) depends on β(B_i,j), and β(B_i,j+1) depends on β(A_i,j). This can happen when B is what we term the ’entry item’ for i; if our grammar is topologically sortable, then each bucket has such an entry item. This is not guaranteed in general (it is what Goodman [1999] terms linear solvability), but in practice our grammars are sortable this way. We then only need to have two kinds of dynamic inside probability computation; one for non-entry items, which require either values from the same bucket and same iteration, or static values from outside the bucket, and one for entry items, which require values from the same bucket and the previous iteration, or static values from outside the bucket. It is then only necessary to ever keep at most two iterations of a bucket around.

We can factor out the control flow of obtaining an appropriate inside value from the inside algorithm itself, and that is exactly what we do. The current design does not use a stopping criterion other than a set number of iterations, as a more sophisticated stopping criterion caused problems, and was eliminated.

exception Logzero

Exception for logarithms of zero.

exception NegLog of float

Exception for logarithms of negative number. Both of these exceptions are redundant on training.ml.

val maxtup : (int * float) list -> int * float

Used for finding the maximum iteration number in a list of buckets.

exception FirstRecurrence

Exception for when the value we need is the entry item.

val abstract_cat : Item.item -> Item.cat

Extract a category from an item, probably redundant on operations in dcfg.ml.

val all_abstract : Item.item * Point.point -> Item.cat * Point.dcfgrhs

Extract a rule from a situated rule, probably redundant on operations in dcfg.ml.

exception SubNormal

Exception which should raise if our values are every subnormal (if they ever lose precision). They never do. Take that log probs.

exception ImpossProb of string

Exception if our value is ever greater than 1.0, which because there is so much recursion, could result in a lot of chaos. This was caused by inprecise control flow in a previous version, but now never occurs.

exception NonsenseRetrieval

Value that you wanted from hash doesn’t exists. Suggests incorrect control flow.

val inside_complexiter :
  Item.item ->
  (Pcfg.bucketed_key, int * float) Hashtbl.t ->
  < has_rule : Item.cat * Point.dcfgrhs -> bool;
    prob_by_rule : Item.cat * Point.dcfgrhs -> float; .. > ->
  (Item.item, Train.parsorted_drule) Hashtbl.t -> int -> int -> unit

Iterates the loopy inside algorithm over a particular bucket, one iteration. Control flow which proceeds in topological order over the buckets is enforced elsewhere, as is the stoping criterion.

val print_inside : (Pcfg.bucketed_key, int * float) Hashtbl.t -> unit

Print some inside values.

val list_of_terms : (’a, Train.parsorted_drule) Hashtbl.t -> (’a * int) list

Obtain a list of terminals from the hashtable. Probably redundant.

val inside_complex_bucket :
  Item.item ->
  (Pcfg.bucketed_key, int * float) Hashtbl.t ->
  < has_rule : Item.cat * Point.dcfgrhs -> bool;
    prob_by_rule : Item.cat * Point.dcfgrhs -> float; .. > ->
  (Item.item, Train.parsorted_drule) Hashtbl.t -> int -> int -> unit

Iterate as desired over a particular bucket.

val maxbucket : (’a, Train.parsorted_drule) Hashtbl.t -> int

Functions to get info about bucket structure

val bucket_indices : (’a, Train.parsorted_drule) Hashtbl.t -> int list

Returns a list of int*bool tuple, where int is the index and true indicates the item recurred in the original. Used to derive a list of bucket indices that actually exist.

val parents_of_buckets :
  (’a, Train.parsorted_drule) Hashtbl.t -> (int * Item.item) list

Return the parents of every bucket, where the parent is a ’start symbol’ for that bucket. The particular identity of the start symbol doesn’t matter so much, as we will immediately recur down to the entry item, which is readily identifiable at run time.

exception SafeNth of int

val safenth : ’a list -> int -> ’a

Exception for retrieving the nth item when n is larger than the list size, probably redundant.

val inside_complex :
  (Pcfg.bucketed_key, int * float) Hashtbl.t ->
  < has_rule : Item.cat * Point.dcfgrhs -> bool;
    prob_by_rule : Item.cat * Point.dcfgrhs -> float; .. > ->
  ’a -> (Item.item, Train.parsorted_drule) Hashtbl.t -> unit

Main algorithm that calcuates inside probability for all categories of a pcfg. Computes buckets in topological order, and iterates through each bucket n times. N > 40 is generally sufficient for values to converge. Lack of underflow and some testing suggest that iteration values much greater than 40 do not pose an accuracy problem.

val renorm_init :
  (Item.item, Pcfg.renorm_cfgrule) Hashtbl.t ->
  (Pcfg.bucketed_key, ’a * float) Hashtbl.t ->
  < has_rule : Item.cat * Point.dcfgrhs -> bool;
    prob_by_rule : Item.cat * Point.dcfgrhs -> float; .. > ->
  (Item.item, Train.parsorted_drule) Hashtbl.t -> unit

val inside_renorm :
  < has_rule : Item.cat * Point.dcfgrhs -> bool;
    prob_by_rule : Item.cat * Point.dcfgrhs -> float; .. > ->
  (Dcfggraph.D.vertex * Point.point) list ->
  (Dcfggraph.D.vertex, Pcfg.renorm_cfgrule) Hashtbl.t option *
  (Pcfg.bucketed_key, int * float) Hashtbl.t option

Find the renormalized, situated probability of a rule by::
r′(A−>B) = r(A−>B) * β_B / β_A r′(A−>BC) = r(A−>BC) * β_B * β_C / β_A (17)

where β₍A) indicates the supremum of the inside probability.

4.3.4 Module `Test`

Handles initalizing and renormalizing pcfg values across the testbank. Calculates certain values required for entropy computation, such as vech and the fertility matrix. Also handles the creation of prefix automata from strings we wish to parse.

exception UnsampledRule

Exception for when rule to calculate with was unseen at training time.

val init_priors : Pcfg.t_bank -> (Item.cat, Pcfg.pcfgrule) Hashtbl.t -> unit

Initialize the weighted intersection with values form training time. The unsampled rule case shouldnt happen unless something is wrong–because test dodesn’t parse zero prob items.

val renorm : (’a, Pcfg.renorm_cfgrule) Hashtbl.t -> unit

Higher order function which parametrizes the renormalization of pcfg, so that testing can be agnostic to renormalization. Renormalizes either naviely or according to inside probability.

val inside_renorm_init :
  Pcfg.t_bank ->
  < has_rule : Item.cat * Point.dcfgrhs -> bool;
    prob_by_rule : Item.cat * Point.dcfgrhs -> float; .. > ->
  unit

Initialize the values that inside.ml depends on.

val keylist_of_pcfg : (’a, ’b) Hashtbl.t -> ’a list

Obtain nonterminal indiced for various things in test.ml, probably redunandant.

val lg : float -> float

Binary logarithm function which treats the lg of 0 as 0.

val retrieve_bucket_probs :
  (’a, Pcfg.renorm_cfgrule) Hashtbl.t -> ’a list -> (’a, float list) Hashtbl.t

val entropy_of_bucket : float list -> float

Compute the one step entropy for a non-terminal for use in H^→.

val hvector_pcfg_init : Pcfg.t_bank -> unit

Computes the one step entropy for each non-terminal for use in H^→.

val expect : ’a -> Item.item -> (’a, Pcfg.renorm_cfgrule) Hashtbl.t -> float

Compute the fertility matrix for the pcfg.

val testbank_init :
  Pcfg.t_bank ->
  < pcfg : (Item.cat, Pcfg.pcfgrule) Hashtbl.t option; .. > -> unit

Across the test corpus, calculate all the things needed for entropy computation, according to naive renormalization.

val inside_testbank_init :
  Pcfg.t_bank ->
  < has_rule : Item.cat * Point.dcfgrhs -> bool;
    pcfg : (Item.cat, Pcfg.pcfgrule) Hashtbl.t option;
    prob_by_rule : Item.cat * Point.dcfgrhs -> float; .. > ->
  unit

Across the test corpus, calculate all the things needed for entropy computation, according to inside renormalization.

val automatize : bool -> string list -> string list

If autbool, then construct automata for all possible prefixes of a sentence. Intelligently handle any stars in the input. Enforces a policy of parsing, in order, Kleene star (the unconditioned grammar), each prefix of the sentence including the prefix spanning the sentence, and then the sentence as a sentence. The just Kleene-star portion could be turned off when entropy of the unconditioned grammar would be hard to compute where the conditioned grammar might be easy, and we don’t care about the information value of the first word.

val parse_testbank :
  (’a -> string -> Point.forest -> ’b -> (’c, ’d) Hashtbl.t -> ’e -> ’f) ->
  ’a ->
  ’g ->
  ’b ->
  ’h ->
  (< pcfg : ’e; .. > as ’i) ->
  (Pcfg.t_bank -> ’i -> ’j) ->
  string -> bool -> string -> ((string * Item.item) * Pcfg.t_bank) list

Higher order function which takes in several other higher order functions, including the renormalization function, either testbank_init or inside_testbank_init, the desired output functions, etc...and coordinates test time computation of values to hand off to entropy.ml.

val range : int -> int -> int list

A Range operator which could go in utilities, and is probably redundant.

val parse_testbank_kilbury :
  (’a ->
   string ->
   ’b ->
   ’c ->
   (’d, ’e) Hashtbl.t ->
   ’f ->
   < pointers : (Item.item, Point.point) Hashtbl.t;
     sitgram : Item.item -> Point.sitcfg; .. >) ->
  ’a ->
  ’b ->
  ’c ->
  ’g ->
  (< pcfg : ’f; .. > as ’h) ->
  (Pcfg.t_bank -> ’h -> unit) ->
  string -> ’i -> string -> ((string * Item.item) * Pcfg.t_bank) list

Experimental incremental parsing on testbanks, doesn’t work yet.

4.3.5 Module `Entropy`

Compute all the entropies.

exception NoOption

val optget : ’a option -> ’a

Unboxing option types for entropy purposes, probably redundant.

exception UnequalLengths of (int * int)

Exception is which is thrown when matrix and h-vector have different lengths (are differently indexed by unique sets of nonterminals).

4.3.5.1 Stolcke matrix cut trick

val zap : ’a array -> int -> ’a array

Delete the nth row in the array.

val zap_col : ’a array array -> int -> ’a array array

Delete the nth column in the matrix.

val ins : ’a array -> int -> ’a -> ’a array

Insert a row of all zeros into the nth position in the array.

val ins_col : ’a array array -> int -> ’a -> ’a array array

Insert a column of all zeros into the nth position in the matrix.

val reduce : ’a array * float array array -> ’a array * float array array

For any n, if both row n and column are all 0.0, eliminate row and column n and return them after inversion. Stolcke tells us this should work.

val blowup :
  ’a array * float array array -> ’a array -> ’a array * float array array

For any reduced invert matrix, put back in all the rows and columns of all zero that we took out of the preinverted matrix. Stolcke tells us that this should work.

exception NonSquare of string

Exception which is thrown when matrix is non-square.

exception ZeroDeterminant

Exception which is thrown when determinant of the matrix is equal to zero, aka the matrix is noninvertible.

exception EmptyMatrix

Exception which is thrown when matrix is entropy (probably because a parse was fail and not identified as such). It usually raises when something wrong happened at training. Before including this, parser would return a strange fragmentation error, which is supposed to be impossible for OCaML, according to Jon Harrop. Turns out it is, it was gsl (which is written in C) getting an empty matrix to invert.

exception TrainingSpectralRadius of string

Exception which is thrown when spectral radius of the matrix is greater than or equal to 1.0. This is bad, but should never happen when we traing the grammar from Treebank, according to Chi. It could potentially happen from arbitrary PCFG. It means that the set of sentences has probability greater than 1.0, ergo we fail to define a probabilistic language.

exception SpectralRadius of float

Exception which is thrown when spectral radius of the matrix is greater than or equal to 1.0. This is bad, but shouldn’t happen so long as our intersection grammar is consistent, which is suggested by Nederhof and Satta. Our intersection grammar should be consistent when both our training grammar is consistent and our automaton is consistent. It could potentially happen from arbitrary PCFG. It means that the set of sentences has probability greater than 1.0, ergo we fail to define a probabilistic language.

val eigenlist : Gsl_matrix.matrix -> float list

Give me the eigenvalues of the matrix. Used for the spectral radius check, potentially very interesting for other things.

val eigencheck : float -> bool

val infnorm : float array array -> float

The infinity norm on matrices. Used to determine a condition number on matrice formed from inf
_norm M * inf
_norm M^-1.

val cond_number : float array array -> float array array -> float

Condition number of the matrix times the input error equals total error of the inverted matrix. The condition number tells us how ’lossy’ matrix inversion is. Condition numbers range from 1 to infinity.

val entropy_of_matrix :
  ’a array * float array array ->
  (’a, float) Hashtbl.t -> (’a, float) Hashtbl.t

Compute the entropy of a fertility matrix by calculating (I−A)⁻¹ h^→. We can be polymorphic, and pass in either original or renormalized matrices/hvectors. Original values are read off of a record, renormalized values off of the matrix-particular record entry.

exception Catlength of int

val entropy_of_testbank :
  string ->
  string ->
  bool ->
  bool ->
  (Pcfg.t_bank -> Train.treebank -> ’a) ->
  (’b ->
   string ->
   Point.forest ->
   ’c -> (’d, ’e) Hashtbl.t -> (Item.cat, Pcfg.pcfgrule) Hashtbl.t option -> ’f) ->
  ’b ->
  Point.forest ->
  ’c ->
  ’g ->
  (Train.treebank -> ((string * Item.item) * Pcfg.t_bank) list -> string -> ’h) ->
  string -> ((string * Item.item) * Pcfg.t_bank) list

Compute entropies for the entire testbank.

val entropy_of_testbank_kilbury :
  string ->
  string ->
  bool ->
  ’a ->
  (Pcfg.t_bank -> Train.treebank -> unit) ->
  (’b ->
   string ->
   (< pointers : (Item.item, Point.point) Hashtbl.t; .. > as ’c) ->
   ’d -> (’e, ’f) Hashtbl.t -> ’g option -> ’h) ->
  (’b ->
   string ->
   ’c ->
   ’d ->
   (’i, ’j) Hashtbl.t ->
   (Item.cat, Pcfg.pcfgrule) Hashtbl.t option ->
   < pointers : (Item.item, Point.point) Hashtbl.t;
     sitgram : Item.item -> Point.sitcfg; .. >) ->
  ’b ->
  ’c ->
  ’d ->
  ’k ->
  (Train.treebank -> ((string * Item.item) * Pcfg.t_bank) list -> string -> ’l) ->
  string -> ((string * Item.item) * Pcfg.t_bank) list

Experimental kilbury mode doesnt work yet.

4.4 Printing and Output

4.4.1 Module `Decompile`

Given an mcfg tree labeled with points and a dictionary file from the Guillaumin [2004] compiler, decompiles into an mg tree labeled with a (string,category) tuple. This can then be passed into a print function.

val label_of_tree : ’a Utilities.stufftree -> ’a

Extract a given category label from the tree.

val rewr_of_tree :
  Item.item Utilities.stufftree ->
  (Item.item list, Grammar.componentSpec list list) Hashtbl.t ->
  Grammar.componentSpec list list

Extract a given pointer from the tree.

val mcfg_to_mcfg :
  string ->
  ’a ->
  ’b -> Item.item Utilities.stufftree -> (string * Item.cat) Utilities.disptree

Provide an mcfg disptree for a given mcfg stufftree.

val mcfg_to_mg_ation :
  string ->
  (Item.cat, string) Hashtbl.t ->
  ’a -> Item.item Utilities.stufftree -> (string * string) Utilities.disptree

Provide an MG derivation tree for a given mcfg tree.

val mcfg_to_mg_ived :
  string ->
  (Item.cat, string) Hashtbl.t ->
  (Item.item, Point.point) Hashtbl.t ->
  Item.item Utilities.stufftree -> (string * string) Utilities.disptree

Attempt to provide an MG derived tree for a given mcfg tree. Not implemented.

val successes : Item.item list -> string -> Item.cat list -> Item.item list

Return a list of successful start items spanning the string. Probably should go in point.ml if possible.

val check_mcfg :
  < get_chart : Item.item list; pointers : (Item.item, Point.point) Hashtbl.t;
    .. > ->
  string -> ’a -> Item.cat list -> (string * Item.cat) Utilities.disptree list

Higher order checking function for mcfg to pass into mcfgcky.ml and other modules requiring MCFG derivation tree output.

val check_mg :
  < get_chart : Item.item list; pointers : (Item.item, Point.point) Hashtbl.t;
    .. > ->
  string ->
  (Item.cat, string) Hashtbl.t ->
  Item.cat list -> (string * string) Utilities.disptree list

Higher order checking function for mcfg to pass into mcfgcky.ml and other modules requiring MG derivation tree output.

val check_mg_ived :
  < get_chart : Item.item list; pointers : (Item.item, Point.point) Hashtbl.t;
    .. > ->
  string ->
  (Item.cat, string) Hashtbl.t ->
  Item.cat list -> (string * string) Utilities.disptree list

Higher order checking function for mcfg to pass into mcfgcky.ml and other modules requiring MG derived tree output. Not implemented.

val check_null : ’a -> ’b -> ’c -> ’d -> ’e list

Vacuous higher order checking function for mcfg to pass into Mcfgcky.parse_gen where it is used for statistical parsing modules which do not require explicit discrete derivation or derived trees.

4.4.2 Module `Output`

Various output functions that we need. Coordinates output in a quiet and a verbose mode. Could probably stuff some verbose mode functionality into a separate debugging mode. Alternatively, could just have the command line option spit out what you need (include a seperate option for the fertility matrix, for example).

val round : float -> float

Round a floating point number intelligently to 3 places after the decimal. Used sparingly.

val rounded_string : float -> string

Provide a rounded string representation of the floating point number.

val pp_H_hash : Pcfg.t_bank -> Pervasives.out_channel -> unit

Output entropy values from the hash table for verbose output. Could in most cases be shut off.

val string_of_fmatrix :
  (Item.item array * float array array) option -> string

val pp_fmatrices : Pcfg.t_bank -> Pervasives.out_channel -> unit

Output the fertility matrix for verbose output. Could definitely be suppressed, but should be kept around for debugging purposes.

val string_of_trainrhs : Point.dcfgrhs -> Item.cat

val pp_trainpcfg :
  < pcfg : (string, Pcfg.pcfgrule) Hashtbl.t option; .. > ->
  Pervasives.out_channel -> unit

Output the treebank grammar in verbose mode. Very useful to have around.

val pp_testcfg : Pcfg.t_bank -> Pervasives.out_channel -> unit

Output the weighted intersection grammar in verbose mode. Very useful to have around.

val pp_renormedhvect : Pcfg.t_bank -> Pervasives.out_channel -> unit

Output the values of h^→. Could be suppresed but useful for debugging.

val pp_inside : Pcfg.t_bank -> Pervasives.out_channel -> unit

Output the inside probabilities of different values in verbose mode. Mostly useful to have around.

val pp_testbank_report :
  ((string * ’a) * Pcfg.t_bank) list -> Pervasives.out_channel -> unit

Output all the testtime-pretinent values provided by the above functions in a report in verbose mode.

val pp_ent_report :
  < pcfg : (string, Pcfg.pcfgrule) Hashtbl.t option; .. > ->
  ((string * ’a) * Pcfg.t_bank) list -> string -> unit

Total output for verbose mode.

val pp_ent_sum : ’a -> ((string * ’b) * Pcfg.t_bank) list -> string -> unit

Summary of entropy values for quiet mode.

val pp_surp_sum :
  ’a -> ((string * Item.item) * Pcfg.t_bank) list -> string -> unit

Summary of surprisal values for quiet mode.

val pp_surp_sum_kilbury :
  ’a -> ((string * Item.item) * Pcfg.t_bank) list -> string -> unit

Summary of surprisal values for experimental kilbury parsing in quiet mode.

val pp_short_sum :
  ’a -> ((string * Item.item) * Pcfg.t_bank) list -> string -> unit

Total output for quiet mode.

val pp_short_sum_kilbury :
  ’a -> ((string * Item.item) * Pcfg.t_bank) list -> string -> unit

Total output for quiet mode in experimental kilbury parsing.

4.4.3 Module `Print`

Low-level printing capabilities for standard out and file output.

val string_of_span : Item.span -> string

Return a string of a given span.

val string_of_spanlist : Item.span list -> string

Return a string of a given span list.

val string_of_anonitem : Item.span list list -> string

Return a string of a given span list list. Not used.

val string_of_item : Item.item -> string

Return a string of a given item.

val string_of_point : Point.point -> string

Return a string of a given pointer.

val print_item : (Item.item, Point.point) Hashtbl.t -> Item.item -> unit

Print a given item to standard out.

val string_label : string -> Item.span list -> string

val cat_label : ’a -> (’a, string) Hashtbl.t -> string

val full_label :
  string -> Item.span list -> ’a -> (’a, string) Hashtbl.t -> string

val qtree_of_tree : (string * string) Utilities.disptree -> string

For a given derivation/derived tree return a qtree-format string for LaTeX.

val dot_of_tree : (string * string) Utilities.disptree -> string

For a given derivation/derived tree return a qtree-format string for DOT.

val pp_of_tree : (’a * string) Utilities.disptree -> string

For a given derivation/derived tree return a pretty-printed string.

val print_trees :
  (’a ->
   string ->
   ’b ->
   ((< pointers : (’d, ’e) Hashtbl.t; .. > as ’c) ->
    string -> ’f -> string list -> ’g list) ->
   ’f -> ’h option -> ’c) ->
  ’a ->
  string ->
  ’b ->
  (’c -> string -> ’f -> string list -> ’g list) ->
  (’g -> string) -> ’f -> string -> unit

Given a checking function and a print function, return the appropriate display tree.

val print_trees_iter :
  (’a ->
   string ->
   (< pointers : (’c, ’d) Hashtbl.t; .. > as ’b) ->
   (’e -> string -> ’f -> string list -> ’g list) -> ’f -> ’h option -> ’e) ->
  ’a ->
  string ->
  ’b ->
  (’e -> string -> ’f -> string list -> ’g list) ->
  (’g -> string) -> ’f -> string -> unit

Same as above, but iterates over a list of sentences passed in from a file

val print_debug : (’a -> ’b -> ’c -> ’d) -> ’a -> ’b -> ’c -> ’e -> ’d

Given a checking function and a print function, return debugging statements from the stack during parsing.

4 Implementation

4.1 Utilities, Basic Data Structures, and Input Modules

4.1.1 Module Utilities

4.1.2 Module MyQueue

4.1.3 Module Seq

4.1.4 Module Mcfgread

4.1.5 Module Fileread

4.1.6 Module Corpread

4.1.7 Module Pcfgead

4.2 Parsing Modules

4.2.1 Module Item

4.2.2 Module Grammar

4.2.3 Module Point

4.2.3.1 Experimental Kilbury Parsing

4.2.4 Module Concat

4.2.4.1 Integrity Checks

4.2.4.2 Sanity Concatenation Checks

4.2.4.3 Concatenation Functions

4.2.5 Module Rules

4.2.5.1 Chart Population Functions

4.2.5.2 Inference Rules

4.2.5.3 Ground Axia/Initial Chart Population

4.2.5.4 Kilbury Style Inference Rules

4.2.6 Module Mcfgcky

4.2.6.1 Experimental Kilbury Stuff

4.3 Statistical Parsing

4.3.1 Module Pcfg

4.3.2 Module Train

4.3.3 Module Inside

4.3.4 Module Test

4.3.5 Module Entropy

4.3.5.1 Stolcke matrix cut trick

4.4 Printing and Output

4.4.1 Module Decompile

4.4.2 Module Output

4.4.3 Module Print

4.1.1 Module `Utilities`

4.1.2 Module `MyQueue`

4.1.3 Module `Seq`

4.1.4 Module `Mcfgread`

4.1.5 Module `Fileread`

4.1.6 Module `Corpread`

4.1.7 Module `Pcfgead`

4.2.1 Module `Item`

4.2.2 Module `Grammar`

4.2.3 Module `Point`

4.2.4 Module `Concat`

4.2.5 Module `Rules`

4.2.6 Module `Mcfgcky`

4.3.1 Module `Pcfg`

4.3.2 Module `Train`

4.3.3 Module `Inside`

4.3.4 Module `Test`

4.3.5 Module `Entropy`

4.4.1 Module `Decompile`

4.4.2 Module `Output`

4.4.3 Module `Print`