Myqueue implements a functional queue structure used in the agenda mechanism, and Seq (as coded in Paulson [1996] implements lazy lists used in point.ml to retrieve individual trees from the forest. Utilities provides data manipulation utilities. Print and Output contain the string-manipulation and output utilities used in printing different outputs to the toplevel, to the standard out, and to file.
Item contains the item record type, which uses the string cat type for category labels, the span type for mapping to the input string, and the integer point type for the backpointing mechanism. Concat implements the computation of string yield functions, which perform concatenations and range restrictions, and also contains several boolean check functions which ensure that derived items are interpretable.
Grammar contains the types related to the mcfg grammar data structure. Mcfgread and Mcfglex are a lexer/parser combination used for reading in this grammar object from an mcfg file, which is output by the guillamin compiler.
Rules implements the inference rule schemeta which are used by the parsing engine, while Mcfgcky contains the parsing engine. Point implements the packed shared forest. Decompile handles reverse compilation from MCFG to MG if desired. Dictlex and Dictread provide a lexer/parser used to read in a dictionary file provided by the Guillamin compiler. This dictionary file consists of tuples of MCFG categories and their MG interpretations, and is used by Mcfgcky to set the success criterion and to provide a translation of the MCFG parse into MG.
Fileread/Filelex,Corpread/Corplex, and Pcfgread/Pcfglex all read in unweighted corpora, weighted corpora, and pcfg for statical mode parsing. Fileread/filelex is also used via batch processing in parsing mode. Pcfg implements basic structures for probabilistic parsing at training time and testtime. Train trains up a model from a corpus using Weighted Relative Frequency Estimation over a treebank generated by the parser over the corpus from the grammar. Pcfglex/Pcfgread read in an arbitrary PCFG directly. Test implements the testtime functionality of computing intersection grammars and associated data which are prerequisite for entropy computations. Dcfggraph uses ocamlgraph to topologically sort the grammar into buckets. Inside computes the inside probability of catgeories and calculates renormalized probabilities of the intersection grammars. Entropy uses Grenander’s Theorem and camlgsl to invert a fertility matrix and obtain entropies over probabilistic grammars.
Print is responsible for printing to standard out and providing string translations of data structures, while Output assembles reports that are output to a desired file. Dcfgdisplay uses dcfggraph to output product-sum graphs if desired.
Basic utilities for mcfgcky.
Read in a Multiple Context Free Grammar in Albro [] notation.
Read in an unweighed corpus in basic list format.
Read in an weighed corpus.
Read in an arbitrary model in probabilistic context free grammar format.
An item in our chart is a tuple of a category and its range-restricted string yield. The string yield is tuple-valued, where each component could either correspond to an actual string, an empty string, or Kleene star.
type cat = string
Categories are simple strings.
type index = int
We represent an item’s position in the string using an index of type int.
type span = | Var | Emp | Wild of index | Fragment of index * index | Hyp of index | Fail
A span is a component of the tuple-valued range-restriction of the string yield. Each span represents an automaton transition, so that the total string yield represents the sequence of automaton transitions that are necessary to derive the decorated item. Vars and Emps represent epsilon transitions in our system; they are distinguished in MCFG parsing because they have unique status in the MG items that gave rise to them. Vars represent the empty string yields of functional head categories in the source MG, whereas Emptys represent the empty string yields of ad-hoc categories introduced solely by the Guillaumin [2004] compiler.
Wilds represent Kleene closures in string yields, and serve to reflect that a given category may be present in the right context of a prefix. A Fragment (x,y) represents an automaton transition from index x to index y where y > x. We represent a range-restriction as a Fragment when that span’s derivation maps on to at least one instance of a real string. Thus, our string concatenation rules define an algebra (a semiring) over tuples of strings. Loosely, we realize (+) as string concatenation, and (*) as a function over tuples of strings yielding tuples of strings. The concatenation of a Fragment (x,y) and a Wild (y) is Fragment (x,y), where Wild(y) is a zero-item for concatenation. A (linear, non-erasing) string yield function for a unary rules which takes the 0th member of a child’s string yield is the null-item for (*).
type item = { cat : cat ; pos : span list ; }
An item is a category situated over (range-restricted by) an n-tuple of spans in the string. This homogenous n-tuple of spans is implemented as a list of spans. An item is implemented as a record with category (cat) and string yield (pos) fields.
Grammar.ml implements the data types and functions used in Multiple Context Free Grammar. Whereas a Context Free Grammar production conflates a symbol-forming operation and a concatenation operation, MCFG divorces abstract and concrete syntax by allowing arbitrary string yield functions with more powerful operations than concatenation. The domain and range of (Linear) MCFG string yield functions is defined over yields of type tuple of strings; this is guranteed by the linearity of (L)MCFG without which, MCFG would be Turing-equivalent.
type daughterID = int
type member_of_tupleID = int
type componentSpec = daughterID * member_of_tupleID
type rewriter = | Func of componentSpec list list | Terminal
We implement our string yield functions according to the Albro notation. For a rule A → B C [(0,0);(1,0)][(0,1)], we evaluate the string yield of A as a tuple with two components (where each component is indicated in brackets), where the first component is the concatenation of respective 0th members of string yields of B and C, and the second component is the 1th component of the 0th category, B. When A is a preterminal, we indicate as such with the category Terminal. daughter_ID indicates which category is referred to; member_of_tupleID indicates which component of some category’s string yield is referred to; such that componentSpec can uniquely indicate a span in B or C. Rewriter implements the tuple-valued rewrite function as a list of componentSpec.
type mcfgrule = Item.cat list * rewriter
An mcfgrule has abstract syntax (of type ’cat list’, where we always take the 0th member of the list to refer to the parent, and the other members of the list to refer to the children) and concrete syntax (of type rewriter).
type pmcfg = mcfgrule list
A (Parallel) Multiple Context Free Grammar is a list of mcfgrules.
val tupler : ’a * rewriter -> rewriter
Given a mcfgrule, obtain the string yield function.
val cats_of_rule : ’a * ’b -> ’a
Given a mcfgrule, obtain the abstract syntax as a list of categories.
val lhs : ’a list * ’b -> ’a
Given a mcfgrule, obtain the parent category (left hand side) of the rule.
val rhs : ’a list * ’b -> ’a list
Given a mcfgrule, obtain the children categories (right hand side) of the rule.
type opt_mcfg = { parent_ord : (Item.cat, mcfgrule) Hashtbl.t ; lchild_ord : (Item.cat, mcfgrule) Hashtbl.t ; rchild_ord : (Item.cat, mcfgrule) Hashtbl.t ; onchild_ord : (Item.cat, mcfgrule) Hashtbl.t ; ispreterm_ord : (bool, mcfgrule) Hashtbl.t ; }
Implement an optimized MCFG object where any quantity of interest asked for by rules.ml is indexed at training time and retirevable in constant time at testing time. Another optimization would be to index by linearization arity as well; this would be relatively easy to implement, but may not be of great use: our MCFG categories generally have constant lineaarization arity.
val orderbyparent : (’a list * ’b) list -> (’a, ’a list * ’b) Hashtbl.t -> unit
Populate a hash table where each rule is indexed by parent.
val orderbylchild : (’a list * ’b) list -> (’a, ’a list * ’b) Hashtbl.t -> unit
Populate a hash table where each rule is binary and indexed by the left child.
val orderbyrchild : (’a list * ’b) list -> (’a, ’a list * ’b) Hashtbl.t -> unit
Populate a hash table where each rule is binary and indexed by the right child.
val orderbyonchild : (’a list * ’b) list -> (’a, ’a list * ’b) Hashtbl.t -> unit
Populate a hash table where each rule is unary and indexed by the only child.
val orderbypreterm : (string list * rewriter) list -> (bool, string list * rewriter) Hashtbl.t -> unit
Populate a hash table where each rule is a preterminal production, indexed by the preterminal.
exception WeirdRule of mcfgrule
Exception for abberant mcfg rules.
class grammar : pmcfg ->
object
method get_rules_for : Item.cat -> Grammar.mcfgrule list
Get any rules where category is a parent.
method get_rules_left_child : Item.cat -> Grammar.mcfgrule list
Get any binary rules where category is a left child.
method get_rules_right_child : Item.cat -> Grammar.mcfgrule list
Get any binary rules where category is a right child.
method get_rules_only_child : Item.cat -> Grammar.mcfgrule list
Get any unary rules where category is a child.
method get_all_empty_preterm : Item.cat list
Get any preterminals whose string yield is empty.
method get_all_nonempty_preterm : Item.cat list
Get any preterminals whose string yield is non-empty.
method get_all_empty : Item.cat list
Get any preterminals whose string yield is empty (deprecated).
method get_all_head : Item.cat list
Get any preterminals whose string yield is non-empty (deprecated).
method get_all_cats : Item.cat list
Get all categories in the grammar.
method get_start_symbol : string list
- end
Get the start symbol of the grammar.
val opt_gram : Grammar.opt_mcfg
The optimized grammar object.
val unopt_gram : Grammar.pmcfg
Return the list-formatted unoptimized grammar.
method get_all_rules : Grammar.pmcfg
Get all the rules in the grammar.
Implements the backpointer system and subsumption check, which we use to construct a packed parse forest.
type point = | Term | Unary of Item.item * Grammar.componentSpec list list | Binary of Item.item * Item.item * Grammar.componentSpec list list
Like in CFG, a pointer gives us the abstract syntax used to enter some category, by referring to either one or two parents or a terminal. Unlike CFG, it should also indicate the concrete syntax used to form it, by including the string yield function which was used.
type sitrule = Item.item * point
Situated rules are non-terminals decorated or range restricted with their string yields. This could be cleaned up by having one parameterized type, which is either Situated of Item.item * point, or Nonsituated of Item.cat * catlist.
type sitcfg = sitrule list
type dcfgrhs = | Bin of Item.cat * Item.cat | Un of Item.cat | T
A Dcfg gives us unsituated rules back from the situated parse forest, for when we need parse trees to train rules with unsituated categories. This works effectively, but is conceptually clunky.
type dcfgrule = Item.cat * dcfgrhs
type dcfg = dcfgrule list
exception StrangeChart
Exception for when our chart is strange.
val crossproduct : ’a list -> ’b list -> (’a * ’b) list
Cross product of two lists.
val string_of_span : Item.span -> string
val sitcfgOfChart : (Item.item, point) Hashtbl.t -> Item.item -> sitcfg
Obtain the dcfg from a chart given the pointer hash table and the current parent/key to work on.
val wlength : string -> int
The length of the sentence that the situated start symbol needs to span over.
val sit_start_symbol : Item.index -> Item.item
Given wlength, return the situated start symbol.
val gr_sort : (Item.item * ’a) list -> (Item.item * ’a) list
Guarantee that items are sorted longest to shortest, with start symbol first. Not strictly necessary, but should speed processing.
val sortedcfgOfChart : (Item.item, point) Hashtbl.t -> Item.item -> sitrule list
Sorted pcfg from chart according to the above.
val desitCFG : (Item.item * point) list -> (Item.cat * dcfgrhs) list
For a situated context free grammar, return the desituated dcfg version
val catlist_of_dcfg : (’a * dcfgrhs) list -> ’a list
val catarray_of_dcfg : (’a * dcfgrhs) list -> ’a array
Return the list of categories in a dcfg. Used to index the momentum matrix.
val abstract : Item.item -> Item.cat
Desituate a situated category.
val abstract_rule : point -> dcfgrhs
Desituate a situated rule.
val range : int -> int -> int list
A range operator, which I’m sure is implemented elsewhere
val success_items : string -> Item.item list
Given a string, return a list of successful situated start symbols according to the approp string length
class forest :
object
val point_hash : (Item.item, Point.point) Hashtbl.t Pervasives.ref
A hashtable of pointers which provides the main functionality of the packed parse forest. In trunk, this is a standalone hash table, not bundled into the forest object.
val subsumptionok : ’a -> (’a, ’b) Hashtbl.t -> bool
Boolean check for whether parent item is already in chart; if it is, include the new pointer but don’t add the item again.
val (@@) : ’a Seq.seq -> ’a Seq.seq -> ’a Seq.seq
val (@::) : ’a -> ’a Seq.seq -> ’a Seq.seq
val lcrossproduct : ’a Seq.seq -> ’b Seq.seq -> (’a * ’b) Seq.seq
val treesOfChart : (Item.item, point) Hashtbl.t -> Item.item -> Item.item Utilities.stufftree list
Operators for lazily extracting a sample of parse trees out of the forest. Could be improved.
val cathash : (Item.cat, Item.item) Hashtbl.t Pervasives.ref
val chart : Item.item list Pervasives.ref
val grams_of_chart : (Item.item, Point.sitcfg) Hashtbl.t
method cats : (Item.cat, Item.item) Hashtbl.t
method pointers : (Item.item, Point.point) Hashtbl.t
method get_sit_cats : Item.cat -> Item.item list
method subsume_ok : Item.item -> bool
method add_pointer : Item.item -> Point.point -> unit
method add_cat : Item.item -> unit
method add_to_chart : Item.item -> Item.item list
method get_chart : Item.item list
method incr_chart : unit
method incr_pointers : unit
method incr_cats : unit
method hash_grams : string -> unit
method sitgram : Item.item -> Point.sitcfg
- end
Concat.ml implements concrete syntax via string yield functions for MCFG.
type anonitem = Item.span list list
An anonitem is an intermediate representation of the string yield function.
val reasonable_ok : Item.span -> bool
Check that indices on a given range restriction component (span) are sensible. Frag(x,y) requires 0 ≤ x ≤ y. Wild(x) requires 0 ≤ x.
val concatenable_ok : Item.span list -> bool
Check that a given instance of concatenation is sensible. A concatnenation [(x,y);(x1,y1...] requires that y = x1, etc.
val overlap : Item.span -> Item.span -> bool
val overlap_ok : Item.span list -> bool
Overlap_ok applies overlap to deliver T iff candidate range restrictions don’t have overlapping spans, F o/w. A range restricted string yield (Frag(x,y),Frag(x1,y1)) requires that ¬ ∃ z . (x ≤ z ≤ y) ∧ (x1 ≤ z ≤ y1)
val wholerule_ok : ’a list list -> anonitem -> bool
T iff ensure that entire mcfg rule is used in creating item, and rule was used non-erasingly, F o/w.
val concat_fail : Item.span list -> bool
Checks to see if item contains the Fail itemlet, created by bad, incomplete concatenation.
exception ConcatFail of string
Exception thrown if concatenation fails, to be caught safely by rules.ml which will not populate the chart in this instance.
val concat : Item.span list -> Item.span
; is the concatenation function, defined on range restrictions. Concatenation is not a total function, as things ot be concatenated need to be adjacent (for [Frag(x,y);Frag(x1,y1], concatenation is valued only when y = x1). Given a span list, concatenates into a single span. Nonsense concatenations should return the Fail span type, propogating information up to the sanity checkEmp and Var are ’free’ in the string and place no restrictions on where in the string they can be concatenated in; thus they are cached out at concatenation.
exception GrabNth of string
val grab : Grammar.componentSpec -> anonitem -> Item.span
Given an anonitem, returns the nth member of the nth tuple. Throws GrabNth is member is out of range of the arity of the tuple.
val mapconcat : Grammar.componentSpec list -> anonitem -> Item.span
Map concatenation across all the gets in a single spec list– eg (0,1);(1,0). Each instance of mapconcat builds one component of the parent’s string yield tuple.
val f : Grammar.componentSpec list list -> anonitem -> Item.span list option
The yield function maps mapconcat across the anonitem with the whole rewrite–eg (0,1);(1,0)0,0.
Rules.ml defines the inference rules we use to populate our chart, faciliating Shieber et al. [1995]-style parsing as deduction. This modularity would allow relatively easy adaption of all the statistical stuff to say, Earley parsing. It also handles the actual population of the chart.
val unfunc : Grammar.rewriter -> Grammar.componentSpec list list
Unbox the rewriter type. Doing this sparingly instead of at runtime every time a string yield function is called actually yields a signifcant performance gain.
val populate2 : Item.cat list * Grammar.rewriter -> Item.item -> Item.item -> < add_pointer : Item.item -> Point.point -> ’a; add_to_chart : Item.item -> ’b list; pointers : (Item.item, Point.point) Hashtbl.t; subsume_ok : Item.item -> bool; .. > -> ’b list
Populate2 factors out population, item checking, and string yield evaluation from the left and right inference rules. It applies the ’concrete syntax’ by calling concat.ml with the specified yield function. It checks as well that the string yield function is sensible before sending it off to concat.ml.
val populate : Item.cat list * Grammar.rewriter -> Item.item -> < add_pointer : Item.item -> Point.point -> ’a; add_to_chart : Item.item -> ’b list; pointers : (Item.item, Point.point) Hashtbl.t; subsume_ok : Item.item -> bool; .. > -> ’b list
Populate2 factors out population, item checking,a nd string yield evaluation from the unary inference rules. It applies the ’concrete syntax’ by calling concat.ml with the specified yield function. It checks as well that the string yield function is sensible before sending it off to concat.ml.
val is_nonzero_prob : Item.cat list * ’a -> (Item.cat, Pcfg.pcfgrule) Hashtbl.t option -> bool
Boolean check to ensure that items at training time have nonzero probability. This is well-tested, and does not effect accuracy at the expense of performance so long as inside renormalization is used.
val non_erasing_bin : ’a list -> ’b list -> Grammar.rewriter -> bool
A cheap filter to reduce calls to concat.ml, which implements a cheap approximate and conservative (false negatives but not false positives) check that the rule is non-erasing. Checks that #comps in x + #comps in y = linearzation arity of rule r.
val non_erasing_un : ’a list -> Grammar.rewriter -> bool
A cheap filter to reduce calls to concat.ml, which implements a cheap approximate and conservative (false negatives but not false positives) check that the rule is non-erasing. Checks that#
comps in x +#
comps in y = linearzation arity of rule r.
val left : < get_rules_left_child : Item.cat -> (Item.cat list * Grammar.rewriter) list; .. > -> (Item.cat, Pcfg.pcfgrule) Hashtbl.t option -> < add_pointer : Item.item -> Point.point -> ’a; add_to_chart : Item.item -> ’b list; get_sit_cats : Item.cat -> Item.item list; pointers : (Item.item, Point.point) Hashtbl.t; subsume_ok : Item.item -> bool; .. > -> Item.item -> ’c -> ’b list
Given the left member of a rhs binary rule, map over all pertinent rules and all pertinent items with the correct right member.
val right : < get_rules_right_child : Item.cat -> (Item.cat list * Grammar.rewriter) list; .. > -> (Item.cat, Pcfg.pcfgrule) Hashtbl.t option -> < add_pointer : Item.item -> Point.point -> ’a; add_to_chart : Item.item -> ’b list; get_sit_cats : Item.cat -> Item.item list; pointers : (Item.item, Point.point) Hashtbl.t; subsume_ok : Item.item -> bool; .. > -> Item.item -> ’c -> ’b list
Given the right member of a rhs binary rule, map over all pertinent rules and all pertinent items with the correct left member
val unaryf : < get_rules_only_child : Item.cat -> (Item.cat list * Grammar.rewriter) list; .. > -> (Item.cat, Pcfg.pcfgrule) Hashtbl.t option -> < add_pointer : Item.item -> Point.point -> ’a; add_to_chart : Item.item -> ’b list; pointers : (Item.item, Point.point) Hashtbl.t; subsume_ok : Item.item -> bool; .. > -> Item.item -> ’c -> ’b list
Given the only child of a rule, map over all pertinent unary rules with this member.
val empties : < get_all_empty : Item.cat list; get_all_head : Item.cat list; .. > -> ’a -> < add_pointer : Item.item -> Point.point -> ’b; .. > -> Item.item list
Populate all empties.
val can_hypo : string -> bool
Populate the chart based on any ellipsis cues (so far these are indicated with #).
val items_of_just_string : < get_all_nonempty_preterm : Item.cat list; get_rules_only_child : string -> (Item.cat list * ’a) list; .. > -> string -> < add_pointer : Item.item -> Point.point -> ’b; .. > -> Item.item list
Populate all nonterminals for the string.
val items_of_token : < get_all_nonempty_preterm : Item.cat list; get_rules_only_child : string -> (Item.cat list * ’a) list; .. > -> string -> < add_pointer : Item.item -> Point.point -> ’b; .. > -> Item.index -> Item.item list
Inference rules for kilbury. Experimental and not accurate.
val populate2_up : Item.cat list * Grammar.rewriter -> Item.item -> Item.item -> < add_pointer : Item.item -> Point.point -> ’a; add_to_chart : Item.item -> ’b list; pointers : (Item.item, Point.point) Hashtbl.t; subsume_ok : Item.item -> bool; .. > -> ’c -> ’b list
Binary chart population for kilbury. Experimental and not accurate.
val populate_up : Item.cat list * Grammar.rewriter -> Item.item -> < add_pointer : Item.item -> Point.point -> ’a; add_to_chart : Item.item -> ’b list; pointers : (Item.item, Point.point) Hashtbl.t; subsume_ok : Item.item -> bool; .. > -> ’c -> ’b list
Unary chart population for kilbury. Experimental and not accurate.
val left_up : < get_rules_left_child : Item.cat -> (Item.cat list * Grammar.rewriter) list; .. > -> (Item.cat, Pcfg.pcfgrule) Hashtbl.t option -> ’a -> < add_pointer : Item.item -> Point.point -> ’b; add_to_chart : Item.item -> ’c list; get_sit_cats : Item.cat -> Item.item list; pointers : (Item.item, Point.point) Hashtbl.t; subsume_ok : Item.item -> bool; .. > -> Item.item -> ’d -> ’c list
Binary inference rule for kilbury. Experimental and not accurate.
val right_up : < get_rules_right_child : Item.cat -> (Item.cat list * Grammar.rewriter) list; .. > -> (Item.cat, Pcfg.pcfgrule) Hashtbl.t option -> ’a -> < add_pointer : Item.item -> Point.point -> ’b; add_to_chart : Item.item -> ’c list; get_sit_cats : Item.cat -> Item.item list; pointers : (Item.item, Point.point) Hashtbl.t; subsume_ok : Item.item -> bool; .. > -> Item.item -> ’d -> ’c list
Binary inference rule for kilbury. Experimental and not accurate.
val unaryf_up : < get_rules_only_child : Item.cat -> (Item.cat list * Grammar.rewriter) list; .. > -> (Item.cat, Pcfg.pcfgrule) Hashtbl.t option -> ’a -> < add_pointer : Item.item -> Point.point -> ’b; add_to_chart : Item.item -> ’c list; pointers : (Item.item, Point.point) Hashtbl.t; subsume_ok : Item.item -> bool; .. > -> Item.item -> ’d -> ’c list
Unary inference rule for Kilbury. Experimental and not accurate.
val delid : Item.item * Item.item -> bool
Experimental Deletion under Identity Check for Conjunction Parsing. Untested.
val noncyclic : Item.span list * Item.span list -> bool
Experimental Cyclicity Check for Conjunction Parsing. Untested.
val is_hyp : Item.item -> bool
Experimental Hypothesis Check for Conjunction Parsing. Untested.
val rel_hyp : (’a, Item.item) Hashtbl.t -> ’a -> Item.item list
val left_discharge : ’a -> ’b -> < add_pointer : Item.item -> Point.point -> ’c; cats : (Item.cat, Item.item) Hashtbl.t; points : (Item.item, Point.point) Hashtbl.t; subsume_ok : Item.item -> bool; .. > -> Item.item -> ’d -> Item.item list
Experimental Disharge Rule for Hypothesis/Conjunction Parsing. Untested.
val right_discharge : ’a -> ’b -> < add_pointer : Item.item -> Point.point -> ’c; cats : (Item.cat, Item.item) Hashtbl.t; points : (Item.item, Point.point) Hashtbl.t; subsume_ok : Item.item -> bool; .. > -> Item.item -> ’d -> Item.item list
Experimental Disharge Rule for Hypothesis/Conjunction Parsing. Untested.
val items_of_string : < get_all_empty : Item.cat list; get_all_head : Item.cat list; get_all_nonempty_preterm : Item.cat list; get_rules_only_child : string -> (Item.cat list * ’a) list; .. > -> string -> < add_pointer : Item.item -> Point.point -> ’b; .. > -> Item.item list
The main parsing module of mcfgcky.
val distribute : (Item.item -> ’a -> ’b list) list -> Item.item -> < get_chart : ’a; .. > -> ’b list
Distributes a trigger throughout the chart.
val distribute_debug : (’a list -> ’b) -> (Item.item -> ’c -> ’a list) list -> Item.item -> < get_chart : ’c; .. > -> ’a list
Distributes a trigger throughout the chart with debugging output.
val exhaust : ’a -> < add_cat : Item.item -> ’b; get_chart : ’c; subsume_ok : Item.item -> bool; .. > -> Item.item Myqueue.BatchedQueue.queue -> (Item.item -> ’c -> Item.item list) list -> ’c
Exhaustion mechanism, proceeds until the chart is empty, which to the extent the algorithm is valid and complete will gives us the closure of the ground items under the inference rules.
val exhaust_debug : ’a -> < add_cat : Item.item -> ’b; get_chart : ’c; subsume_ok : Item.item -> bool; .. > -> (Item.item list -> ’d) -> Item.item Myqueue.BatchedQueue.queue -> (Item.item -> ’c -> Item.item list) list -> ’c
Exhaustion mechanism with debugging output.
val parse_debug : < get_all_empty : Item.cat list; get_all_head : Item.cat list; get_all_nonempty_preterm : Item.cat list; get_rules_left_child : Item.cat -> (Item.cat list * Grammar.rewriter) list; get_rules_only_child : Item.cat -> (Item.cat list * Grammar.rewriter) list; get_rules_right_child : Item.cat -> (Item.cat list * Grammar.rewriter) list; .. > -> string -> (< add_cat : Item.item -> ’b; add_pointer : Item.item -> Point.point -> ’c; add_to_chart : Item.item -> Item.item list; get_chart : ’d; get_sit_cats : Item.cat -> Item.item list; pointers : (Item.item, Point.point) Hashtbl.t; subsume_ok : Item.item -> bool; .. > as ’a) -> ’a
Given a grammar g, sentence string w, and point side-table, run the parsing engine without backpointer retrieval and with debugging statements.
val parse : < get_all_empty : Item.cat list; get_all_head : Item.cat list; get_all_nonempty_preterm : Item.cat list; get_rules_left_child : Item.cat -> (Item.cat list * Grammar.rewriter) list; get_rules_only_child : Item.cat -> (Item.cat list * Grammar.rewriter) list; get_rules_right_child : Item.cat -> (Item.cat list * Grammar.rewriter) list; get_start_symbol : ’a; .. > -> string -> (< add_cat : Item.item -> ’c; add_pointer : Item.item -> Point.point -> ’d; add_to_chart : Item.item -> Item.item list; get_chart : ’e; get_sit_cats : Item.cat -> Item.item list; pointers : (Item.item, Point.point) Hashtbl.t; subsume_ok : Item.item -> bool; .. > as ’b) -> (Item.cat, Pcfg.pcfgrule) Hashtbl.t option -> ’b
Given a grammar g, sentence string w, and point side-table, parse with the appropriate decompilation of backpointers. Not used in statistical modes.
val parse_gen : < get_all_empty : Item.cat list; get_all_head : Item.cat list; get_all_nonempty_preterm : Item.cat list; get_rules_left_child : Item.cat -> (Item.cat list * Grammar.rewriter) list; get_rules_only_child : Item.cat -> (Item.cat list * Grammar.rewriter) list; get_rules_right_child : Item.cat -> (Item.cat list * Grammar.rewriter) list; .. > -> string -> (< add_cat : Item.item -> ’b; add_pointer : Item.item -> Point.point -> ’c; add_to_chart : Item.item -> Item.item list; get_chart : ’d; get_sit_cats : Item.cat -> Item.item list; pointers : (Item.item, Point.point) Hashtbl.t; subsume_ok : Item.item -> bool; .. > as ’a) -> ’e -> ’f -> (Item.cat, Pcfg.pcfgrule) Hashtbl.t option -> ’a
Given a grammar g, sentence string w, point side-table and a training pcfg, parse without appropriate decompilation of backpointers for use in statistical modes.
val all_points : (’a, ’b) Hashtbl.t -> (’a * ’b) list
val parse_up : < get_all_nonempty_preterm : Item.cat list; get_rules_left_child : Item.cat -> (Item.cat list * Grammar.rewriter) list; get_rules_only_child : Item.cat -> (Item.cat list * Grammar.rewriter) list; get_rules_right_child : Item.cat -> (Item.cat list * Grammar.rewriter) list; .. > -> Item.cat -> (< add_cat : Item.item -> ’b; add_pointer : Item.item -> Point.point -> ’c; add_to_chart : Item.item -> Item.item list; get_chart : ’d; get_sit_cats : Item.cat -> Item.item list; pointers : (Item.item, Point.point) Hashtbl.t; subsume_ok : Item.item -> bool; .. > as ’a) -> (Item.cat, Pcfg.pcfgrule) Hashtbl.t option -> Item.index -> ’a
Experimental Kilbury stuff, doesn’t work yet.
val parse_kilbury_gen : < get_all_empty : Item.cat list; get_all_head : Item.cat list; get_all_nonempty_preterm : Item.cat list; get_rules_left_child : Item.cat -> (Item.cat list * Grammar.rewriter) list; get_rules_only_child : Item.cat -> (Item.cat list * Grammar.rewriter) list; get_rules_right_child : Item.cat -> (Item.cat list * Grammar.rewriter) list; get_start_symbol : ’a; .. > -> string -> (< add_cat : Item.item -> ’c; add_pointer : Item.item -> Point.point -> ’d; add_to_chart : Item.item -> Item.item list; get_chart : ’e; get_sit_cats : Item.cat -> Item.item list; hash_grams : string -> ’f; incr_cats : ’g; incr_chart : ’h; incr_pointers : ’i; pointers : (Item.item, Point.point) Hashtbl.t; subsume_ok : Item.item -> bool; .. > as ’b) -> ’j -> ’k -> (Item.cat, Pcfg.pcfgrule) Hashtbl.t option -> ’b
Experimental Kilbury stuff, doesn’t work yet.
val parse_kilbury : < get_all_empty : Item.cat list; get_all_head : Item.cat list; get_all_nonempty_preterm : Item.cat list; get_rules_left_child : Item.cat -> (Item.cat list * Grammar.rewriter) list; get_rules_only_child : Item.cat -> (Item.cat list * Grammar.rewriter) list; get_rules_right_child : Item.cat -> (Item.cat list * Grammar.rewriter) list; get_start_symbol : ’a; .. > -> string -> (< add_cat : Item.item -> ’c; add_pointer : Item.item -> Point.point -> ’d; add_to_chart : Item.item -> Item.item list; get_chart : ’e; get_sit_cats : Item.cat -> Item.item list; incr_cats : ’f; incr_chart : ’g; incr_pointers : ’h; pointers : (Item.item, Point.point) Hashtbl.t; subsume_ok : Item.item -> bool; .. > as ’b) -> (Item.cat, Pcfg.pcfgrule) Hashtbl.t option -> ’b
Experimental killbury stuff, doesn’t work yet.
Pcfg.ml implements basic types and functions for probabilistic parsing.
type fertility_matrix = Item.item array * float array array
A fertility matrix is an array_matrix of floats indexed by situated items, where the member of the matrix indexed by (x,y) indicates the expected number of y obtained in one rewriting of x, for a probabilistic grammar.
exception RowZero
Throw RowZero when a row of the fertility matrix is entirely zeroes.
type bucketed_key = { category : Item.item ; index : int ; }
For inside probability functions, we have to divide the pcfg into ’buckets’, where each ’bucket’ is a strongly connected component of the pcfg’s corresponding graph structure. A bucketed key tells us which bucket of the pcfg the item of interest is located in.
type iteration = int
For inside probability functions, we want to know which ’iteration’ of the inside algorithm a value was evaluated at.
type bucketed_inside_hash = (bucketed_key, iteration * float) Hashtbl.t
Implements a memomoized data structure that permits us to record iterations of the loopy inside algorithm, and look them up in constant time when we need values from a particular iteration or bucket.
type pcfgrule = { rhs : Point.dcfgrhs ; mutable count : float ; mutable prob : float option ; }
A rule type for training time. Works efficiently but is conceptually clunky, and could be combined with the test time probabilist ruletype Pcfg.renormed_probcfgrule, as a parameterized type.
type pcfg = (Item.cat, pcfgrule) Hashtbl.t
A PCFG is a hashtable of pcfgrules indexed by parent categories.
type renorm_cfgrule = { rrhs : Point.point ; mutable prior : float option ; mutable postr : float option ; }
A rule type for testing time. Works efficiently but is conceptually clunky, and could be combined with the test time probabilist ruletype Pcfg.probcfgrule, as a parameterized type.
type renorm_cfg = (Item.item, renorm_cfgrule) Hashtbl.t
A renorm_cfg (sic) is a hashtable of renorm_cfgrules pcfgrules indexed by parent items, which are situated.
type t_bank = { test_sitcfg : Point.sitcfg ; mutable test_pcfg : renorm_cfg option ; mutable inside : bucketed_inside_hash option ; mutable fmatrix : fertility_matrix option ; mutable invert : fertility_matrix option ; mutable entropy : (Item.item, float) Hashtbl.t option ; mutable hvector : (Item.item, float) Hashtbl.t option ; }
For a particular situated forest, keep around all of the pertinent data structures in a record.
type testbank = ((string * Item.item) * t_bank) list
List all the intersection probabilistic grammars and t_banks by prefix per sentence.
val catarray_of_pcfg : (’a, ’b) Hashtbl.t -> ’a array
Retrieve the array of categories in a probabilistic context free grammar.
val fmatrix_of_training_pcfg : (Item.cat, pcfgrule) Hashtbl.t -> Item.cat array * float array array
Obtain the fertility matrix of the training grammar, for eigenvalue checks and computing entropy of the training pcfg.
val fmatrix_of_pcfg : (Item.item, renorm_cfgrule) Hashtbl.t -> Item.item array * float array array
Obtain the fertility matrix of an intersection pcfg.
Train up the corpus-minitreebank for -i mode. Also handles the reading in of pcfg, most of which is done in pcfglex.mll/pcfgread.mly
exception Logzero
Exception if log is taken of 0.0.
exception NegLog of float
Exception if log is taken of a negative number.
type s_bank = { weight : float ; training_sitcfg : Point.sitcfg ; mutable training_cfg : Point.dcfg option ; }
Basic record type that bundles information we need for training on a particular sentence in a treebank.
type revprob_bankrule = { rcount : float ; rprob : float ; }
type revprob_bankcfg = (Item.cat * Point.dcfgrhs, revprob_bankrule) Hashtbl.t option
Optimized PCFG type where a rule probability can be indexed by children for lookup in constant time. Used by inside.ml. Could be hidden behind a fancier pcfg object, but doesn’t create much clunkiness in practice (inside.ml only sees this representation of the pcfg).
type train = { sbank : (string, s_bank) Hashtbl.t ; mutable by_parent : (Item.cat, Pcfg.pcfgrule) Hashtbl.t option ; mutable by_rule : (Item.cat * Point.dcfgrhs, revprob_bankrule) Hashtbl.t option ; }
Basic record type of a pcfg trained off of a single sentence. A PCFG for an entire treebank can be extracted from all of these sentences easily.
val read_pcfg : string -> train -> unit
Directly read in a training pcfg.
val parse_train : string -> (’a -> string -> (< pointers : (Item.item, Point.point) Hashtbl.t; .. > as ’b) -> ’c -> (’d, ’e) Hashtbl.t -> ’f option -> ’g) -> ’a -> ’b -> ’c -> ’h -> train -> string -> unit
Train up the corpus-minitreebank for -i mode. Requires an initialized parse forest (empty chart), an empty grammar object, and checking functions provided by decompile.ml to train on a weighted corpus. Uses simple weighted relative frequency estimation (Chi 1999).
val countpcfg_init : (Item.cat, Pcfg.pcfgrule) Hashtbl.t option -> Point.dcfg -> float -> (Item.cat, Pcfg.pcfgrule) Hashtbl.t option
Increments counts across the treebank on a treebank-level pcfg object from the sentence-level information.
val totalcatn : (’a, Pcfg.pcfgrule) Hashtbl.t -> ’a -> float
Sum across all trees in treebank to obtain a count for the number of rules with parent ’a, for the denominator in weighted relative frequency estimation.
val relcount : (’a, Pcfg.pcfgrule) Hashtbl.t -> ’a -> Point.dcfgrhs -> float
For a given rhs (point.dcfgrhs) and a given parent (’a), return the number of counts of that rule across the treebank.
val relfreq : (’a, Pcfg.pcfgrule) Hashtbl.t -> ’a -> Point.dcfgrhs -> float
Find the relative frequency of a rule in the grammar, according to the number of times the parent A (’a) → B (Point.dcfgrhs) found divided by the total count of rhs occurences of A.
val relprobpcfg_init : (’a, Pcfg.pcfgrule) Hashtbl.t -> unit
Populate the pcfg’s hash table object with one relative-frequency estimated rule.
val pcfg_of_train_init : ’a -> train -> unit
Initialize the pcfg object by finding all relative frequencies of all rules. Desituate dcfgs, compile and prob treebank rules in one step.
val trainf : bool -> string -> ’a -> ’b -> (’c -> string -> (< pointers : (Item.item, Point.point) Hashtbl.t; .. > as ’d) -> ’e -> (’f, ’g) Hashtbl.t -> ’h option -> ’i) -> ’c -> ’d -> ’e -> ’j -> ’k -> string -> train -> unit
A higher order training function parameterized by arguments, to pass in to treebank class. The parse can be agnostic as to which way it was trained.
class treebank : (train -> ’a) -> ’b ->
object
val optbank : Train.train
Our main unoptimized treebank structure which is hidden behind the optimized treebank object.
method prob_by_rule : Item.cat * Point.dcfgrhs -> float
method has_par : Item.cat -> bool
Treebank object method which defines a relation which is T if parent is in the grammar, F o/w.
method has_rule : Item.cat * Point.dcfgrhs -> bool
method pcfg : (Item.cat, Pcfg.pcfgrule) Hashtbl.t option
- end
An optimized pcfg object.
type parsorted_drule = { pindex : int ; size : int ; pdefcat : Item.item ; ppar : Item.item ; prhs : Point.point ; }
This is for indexing rules by their right hand side to be used by inside.ml, enabling these rules to be sorted into buckets.
type parsorted_dcfg = (Item.item, parsorted_drule) Hashtbl.t
The actual representation of the pcfg which inside.ml sees. This is necessitated partially by the function in ocamlgraph which does our topological sorting (scc_array), and partially the kind of iteration desired through the sort output, where we need an index and index item for each list.
val index_dhash_init : (Dcfggraph.D.vertex * Point.point) list -> (Dcfggraph.D.vertex, parsorted_drule) Hashtbl.t -> unit
Initialize the optimized pcfg representation which inside.ml sees.
Using the treebank grammar and the testtime interesection grammar, computes inside probabilities for each category in the intersection grammar to renormalize that grammar to find the weighted intersection. Compute inside probabilities of all categories in the intersection grammar, and then find the renormalized, situated probability of a rule by:
Because of recursion, a category can contribute inside probability to itself, so must find the limit of the inside probability for each non-terminal. Goodman [1999] shows us this is doable–sort the grammar into strongly connected components, the inside prob of a category C in a bucket B is the suprememum of its value in the inside semiring over B. Then iterate through buckets in topological order, finding the limit of all values in the bucket before moving on.
STRATEGY:
Assuming you call with s each time, and you iterate progressively up., a memoized value is accessible if all of the following hold:
Assumption 1. Previous buckets are static. The grammar’s graph structure can be sorted into its strongly connected components. Assumption 2: You can iterate the inside computation within a given bucket to reach the supremum, then proceed on to the next bucket.
Then we proceed on a memoized, loopy version of the inside algorithm, in which we work in topological order of the buckets. In general, we memoize the inside algorithm by working top down then bottom up. For a given category A in training pcfg R and with productions A → B 1 C1, A → B2 C2 ..., β(A) depends on probabilities in R and β(B1), β(C1) .... If we have no appropriate inside probability value for A in cache, we recur into children and determine the inside probability of children.
For the bucketed, iterative version of this algorithm that can treat looping buckets, we only need to check the value for the most recent iteration of some category, and then determine if that value is appropriate. If a category is its own child, the correct iteration to retrieve for calculating Ai,j is Ai,j−1. If a category only depends on values outside its bucket, it necessarily depends on previous bucket, since we have topologically sorted the buckets and we are working through the buckets bottom-up. Thus, we can compute its inside probability ’statically’, because the inside probabilities of those children aere guaranteed not to change, by Assumptions 1 and Assumptions 2.
If the inside probability of a category A depends on the inside probability of a child B in the same looping bucket i as it, then we iterate through the bucket because the value of β(A) depends on β(B) and vice versa. We proceed by first computing a static value for B which depends only on categories outside i, then iterate through i so that β(Ai,j) depends on β(Bi,j), and β(Bi,j+1) depends on β(Ai,j). This can happen when B is what we term the ’entry item’ for i; if our grammar is topologically sortable, then each bucket has such an entry item. This is not guaranteed in general (it is what Goodman [1999] terms linear solvability), but in practice our grammars are sortable this way. We then only need to have two kinds of dynamic inside probability computation; one for non-entry items, which require either values from the same bucket and same iteration, or static values from outside the bucket, and one for entry items, which require values from the same bucket and the previous iteration, or static values from outside the bucket. It is then only necessary to ever keep at most two iterations of a bucket around.
We can factor out the control flow of obtaining an appropriate inside value from the inside algorithm itself, and that is exactly what we do. The current design does not use a stopping criterion other than a set number of iterations, as a more sophisticated stopping criterion caused problems, and was eliminated.
exception Logzero
Exception for logarithms of zero.
exception NegLog of float
Exception for logarithms of negative number. Both of these exceptions are redundant on training.ml.
val maxtup : (int * float) list -> int * float
Used for finding the maximum iteration number in a list of buckets.
exception FirstRecurrence
Exception for when the value we need is the entry item.
val abstract_cat : Item.item -> Item.cat
Extract a category from an item, probably redundant on operations in dcfg.ml.
val all_abstract : Item.item * Point.point -> Item.cat * Point.dcfgrhs
Extract a rule from a situated rule, probably redundant on operations in dcfg.ml.
exception SubNormal
Exception which should raise if our values are every subnormal (if they ever lose precision). They never do. Take that log probs.
exception ImpossProb of string
Exception if our value is ever greater than 1.0, which because there is so much recursion, could result in a lot of chaos. This was caused by inprecise control flow in a previous version, but now never occurs.
exception NonsenseRetrieval
Value that you wanted from hash doesn’t exists. Suggests incorrect control flow.
val inside_complexiter : Item.item -> (Pcfg.bucketed_key, int * float) Hashtbl.t -> < has_rule : Item.cat * Point.dcfgrhs -> bool; prob_by_rule : Item.cat * Point.dcfgrhs -> float; .. > -> (Item.item, Train.parsorted_drule) Hashtbl.t -> int -> int -> unit
Iterates the loopy inside algorithm over a particular bucket, one iteration. Control flow which proceeds in topological order over the buckets is enforced elsewhere, as is the stoping criterion.
val print_inside : (Pcfg.bucketed_key, int * float) Hashtbl.t -> unit
Print some inside values.
val list_of_terms : (’a, Train.parsorted_drule) Hashtbl.t -> (’a * int) list
Obtain a list of terminals from the hashtable. Probably redundant.
val inside_complex_bucket : Item.item -> (Pcfg.bucketed_key, int * float) Hashtbl.t -> < has_rule : Item.cat * Point.dcfgrhs -> bool; prob_by_rule : Item.cat * Point.dcfgrhs -> float; .. > -> (Item.item, Train.parsorted_drule) Hashtbl.t -> int -> int -> unit
Iterate as desired over a particular bucket.
val maxbucket : (’a, Train.parsorted_drule) Hashtbl.t -> int
Functions to get info about bucket structure
val bucket_indices : (’a, Train.parsorted_drule) Hashtbl.t -> int list
Returns a list of int*bool tuple, where int is the index and true indicates the item recurred in the original. Used to derive a list of bucket indices that actually exist.
val parents_of_buckets : (’a, Train.parsorted_drule) Hashtbl.t -> (int * Item.item) list
Return the parents of every bucket, where the parent is a ’start symbol’ for that bucket. The particular identity of the start symbol doesn’t matter so much, as we will immediately recur down to the entry item, which is readily identifiable at run time.
exception SafeNth of int
val safenth : ’a list -> int -> ’a
Exception for retrieving the nth item when n is larger than the list size, probably redundant.
val inside_complex : (Pcfg.bucketed_key, int * float) Hashtbl.t -> < has_rule : Item.cat * Point.dcfgrhs -> bool; prob_by_rule : Item.cat * Point.dcfgrhs -> float; .. > -> ’a -> (Item.item, Train.parsorted_drule) Hashtbl.t -> unit
Main algorithm that calcuates inside probability for all categories of a pcfg. Computes buckets in topological order, and iterates through each bucket n times. N > 40 is generally sufficient for values to converge. Lack of underflow and some testing suggest that iteration values much greater than 40 do not pose an accuracy problem.
val renorm_init : (Item.item, Pcfg.renorm_cfgrule) Hashtbl.t -> (Pcfg.bucketed_key, ’a * float) Hashtbl.t -> < has_rule : Item.cat * Point.dcfgrhs -> bool; prob_by_rule : Item.cat * Point.dcfgrhs -> float; .. > -> (Item.item, Train.parsorted_drule) Hashtbl.t -> unit
val inside_renorm : < has_rule : Item.cat * Point.dcfgrhs -> bool; prob_by_rule : Item.cat * Point.dcfgrhs -> float; .. > -> (Dcfggraph.D.vertex * Point.point) list -> (Dcfggraph.D.vertex, Pcfg.renorm_cfgrule) Hashtbl.t option * (Pcfg.bucketed_key, int * float) Hashtbl.t option
Find the renormalized, situated probability of a rule by::
r′(A−>B) = r(A−>B) * β_B / β_A r′(A−>BC) = r(A−>BC) * β_B * β_C / β_A (17) where β(A) indicates the supremum of the inside probability.
Handles initalizing and renormalizing pcfg values across the testbank. Calculates certain values required for entropy computation, such as vech and the fertility matrix. Also handles the creation of prefix automata from strings we wish to parse.
exception UnsampledRule
Exception for when rule to calculate with was unseen at training time.
val init_priors : Pcfg.t_bank -> (Item.cat, Pcfg.pcfgrule) Hashtbl.t -> unit
Initialize the weighted intersection with values form training time. The unsampled rule case shouldnt happen unless something is wrong–because test dodesn’t parse zero prob items.
val renorm : (’a, Pcfg.renorm_cfgrule) Hashtbl.t -> unit
Higher order function which parametrizes the renormalization of pcfg, so that testing can be agnostic to renormalization. Renormalizes either naviely or according to inside probability.
val inside_renorm_init : Pcfg.t_bank -> < has_rule : Item.cat * Point.dcfgrhs -> bool; prob_by_rule : Item.cat * Point.dcfgrhs -> float; .. > -> unit
Initialize the values that inside.ml depends on.
val keylist_of_pcfg : (’a, ’b) Hashtbl.t -> ’a list
Obtain nonterminal indiced for various things in test.ml, probably redunandant.
val lg : float -> float
Binary logarithm function which treats the lg of 0 as 0.
val retrieve_bucket_probs : (’a, Pcfg.renorm_cfgrule) Hashtbl.t -> ’a list -> (’a, float list) Hashtbl.t
val entropy_of_bucket : float list -> float
Compute the one step entropy for a non-terminal for use in H→.
val hvector_pcfg_init : Pcfg.t_bank -> unit
Computes the one step entropy for each non-terminal for use in H→.
val expect : ’a -> Item.item -> (’a, Pcfg.renorm_cfgrule) Hashtbl.t -> float
Compute the fertility matrix for the pcfg.
val testbank_init : Pcfg.t_bank -> < pcfg : (Item.cat, Pcfg.pcfgrule) Hashtbl.t option; .. > -> unit
Across the test corpus, calculate all the things needed for entropy computation, according to naive renormalization.
val inside_testbank_init : Pcfg.t_bank -> < has_rule : Item.cat * Point.dcfgrhs -> bool; pcfg : (Item.cat, Pcfg.pcfgrule) Hashtbl.t option; prob_by_rule : Item.cat * Point.dcfgrhs -> float; .. > -> unit
Across the test corpus, calculate all the things needed for entropy computation, according to inside renormalization.
val automatize : bool -> string list -> string list
If autbool, then construct automata for all possible prefixes of a sentence. Intelligently handle any stars in the input. Enforces a policy of parsing, in order, Kleene star (the unconditioned grammar), each prefix of the sentence including the prefix spanning the sentence, and then the sentence as a sentence. The just Kleene-star portion could be turned off when entropy of the unconditioned grammar would be hard to compute where the conditioned grammar might be easy, and we don’t care about the information value of the first word.
val parse_testbank : (’a -> string -> Point.forest -> ’b -> (’c, ’d) Hashtbl.t -> ’e -> ’f) -> ’a -> ’g -> ’b -> ’h -> (< pcfg : ’e; .. > as ’i) -> (Pcfg.t_bank -> ’i -> ’j) -> string -> bool -> string -> ((string * Item.item) * Pcfg.t_bank) list
Higher order function which takes in several other higher order functions, including the renormalization function, either testbank_init or inside_testbank_init, the desired output functions, etc...and coordinates test time computation of values to hand off to entropy.ml.
val range : int -> int -> int list
A Range operator which could go in utilities, and is probably redundant.
val parse_testbank_kilbury : (’a -> string -> ’b -> ’c -> (’d, ’e) Hashtbl.t -> ’f -> < pointers : (Item.item, Point.point) Hashtbl.t; sitgram : Item.item -> Point.sitcfg; .. >) -> ’a -> ’b -> ’c -> ’g -> (< pcfg : ’f; .. > as ’h) -> (Pcfg.t_bank -> ’h -> unit) -> string -> ’i -> string -> ((string * Item.item) * Pcfg.t_bank) list
Experimental incremental parsing on testbanks, doesn’t work yet.
Compute all the entropies.
exception NoOption
val optget : ’a option -> ’a
Unboxing option types for entropy purposes, probably redundant.
exception UnequalLengths of (int * int)
Exception is which is thrown when matrix and h-vector have different lengths (are differently indexed by unique sets of nonterminals).
val zap : ’a array -> int -> ’a array
Delete the nth row in the array.
val zap_col : ’a array array -> int -> ’a array array
Delete the nth column in the matrix.
val ins : ’a array -> int -> ’a -> ’a array
Insert a row of all zeros into the nth position in the array.
val ins_col : ’a array array -> int -> ’a -> ’a array array
Insert a column of all zeros into the nth position in the matrix.
val reduce : ’a array * float array array -> ’a array * float array array
For any n, if both row n and column are all 0.0, eliminate row and column n and return them after inversion. Stolcke tells us this should work.
val blowup : ’a array * float array array -> ’a array -> ’a array * float array array
For any reduced invert matrix, put back in all the rows and columns of all zero that we took out of the preinverted matrix. Stolcke tells us that this should work.
exception NonSquare of string
Exception which is thrown when matrix is non-square.
exception ZeroDeterminant
Exception which is thrown when determinant of the matrix is equal to zero, aka the matrix is noninvertible.
exception EmptyMatrix
Exception which is thrown when matrix is entropy (probably because a parse was fail and not identified as such). It usually raises when something wrong happened at training. Before including this, parser would return a strange fragmentation error, which is supposed to be impossible for OCaML, according to Jon Harrop. Turns out it is, it was gsl (which is written in C) getting an empty matrix to invert.
exception TrainingSpectralRadius of string
Exception which is thrown when spectral radius of the matrix is greater than or equal to 1.0. This is bad, but should never happen when we traing the grammar from Treebank, according to Chi. It could potentially happen from arbitrary PCFG. It means that the set of sentences has probability greater than 1.0, ergo we fail to define a probabilistic language.
exception SpectralRadius of float
Exception which is thrown when spectral radius of the matrix is greater than or equal to 1.0. This is bad, but shouldn’t happen so long as our intersection grammar is consistent, which is suggested by Nederhof and Satta. Our intersection grammar should be consistent when both our training grammar is consistent and our automaton is consistent. It could potentially happen from arbitrary PCFG. It means that the set of sentences has probability greater than 1.0, ergo we fail to define a probabilistic language.
val eigenlist : Gsl_matrix.matrix -> float list
Give me the eigenvalues of the matrix. Used for the spectral radius check, potentially very interesting for other things.
val eigencheck : float -> bool
val infnorm : float array array -> float
The infinity norm on matrices. Used to determine a condition number on matrice formed from inf
_norm M * inf
_norm M^-1.
val cond_number : float array array -> float array array -> float
Condition number of the matrix times the input error equals total error of the inverted matrix. The condition number tells us how ’lossy’ matrix inversion is. Condition numbers range from 1 to infinity.
val entropy_of_matrix : ’a array * float array array -> (’a, float) Hashtbl.t -> (’a, float) Hashtbl.t
Compute the entropy of a fertility matrix by calculating (I−A)−1 h→. We can be polymorphic, and pass in either original or renormalized matrices/hvectors. Original values are read off of a record, renormalized values off of the matrix-particular record entry.
exception Catlength of int
val entropy_of_testbank : string -> string -> bool -> bool -> (Pcfg.t_bank -> Train.treebank -> ’a) -> (’b -> string -> Point.forest -> ’c -> (’d, ’e) Hashtbl.t -> (Item.cat, Pcfg.pcfgrule) Hashtbl.t option -> ’f) -> ’b -> Point.forest -> ’c -> ’g -> (Train.treebank -> ((string * Item.item) * Pcfg.t_bank) list -> string -> ’h) -> string -> ((string * Item.item) * Pcfg.t_bank) list
Compute entropies for the entire testbank.
val entropy_of_testbank_kilbury : string -> string -> bool -> ’a -> (Pcfg.t_bank -> Train.treebank -> unit) -> (’b -> string -> (< pointers : (Item.item, Point.point) Hashtbl.t; .. > as ’c) -> ’d -> (’e, ’f) Hashtbl.t -> ’g option -> ’h) -> (’b -> string -> ’c -> ’d -> (’i, ’j) Hashtbl.t -> (Item.cat, Pcfg.pcfgrule) Hashtbl.t option -> < pointers : (Item.item, Point.point) Hashtbl.t; sitgram : Item.item -> Point.sitcfg; .. >) -> ’b -> ’c -> ’d -> ’k -> (Train.treebank -> ((string * Item.item) * Pcfg.t_bank) list -> string -> ’l) -> string -> ((string * Item.item) * Pcfg.t_bank) list
Experimental kilbury mode doesnt work yet.
Given an mcfg tree labeled with points and a dictionary file from the Guillaumin [2004] compiler, decompiles into an mg tree labeled with a (string,category) tuple. This can then be passed into a print function.
val label_of_tree : ’a Utilities.stufftree -> ’a
Extract a given category label from the tree.
val rewr_of_tree : Item.item Utilities.stufftree -> (Item.item list, Grammar.componentSpec list list) Hashtbl.t -> Grammar.componentSpec list list
Extract a given pointer from the tree.
val mcfg_to_mcfg : string -> ’a -> ’b -> Item.item Utilities.stufftree -> (string * Item.cat) Utilities.disptree
Provide an mcfg disptree for a given mcfg stufftree.
val mcfg_to_mg_ation : string -> (Item.cat, string) Hashtbl.t -> ’a -> Item.item Utilities.stufftree -> (string * string) Utilities.disptree
Provide an MG derivation tree for a given mcfg tree.
val mcfg_to_mg_ived : string -> (Item.cat, string) Hashtbl.t -> (Item.item, Point.point) Hashtbl.t -> Item.item Utilities.stufftree -> (string * string) Utilities.disptree
Attempt to provide an MG derived tree for a given mcfg tree. Not implemented.
val successes : Item.item list -> string -> Item.cat list -> Item.item list
Return a list of successful start items spanning the string. Probably should go in point.ml if possible.
val check_mcfg : < get_chart : Item.item list; pointers : (Item.item, Point.point) Hashtbl.t; .. > -> string -> ’a -> Item.cat list -> (string * Item.cat) Utilities.disptree list
Higher order checking function for mcfg to pass into mcfgcky.ml and other modules requiring MCFG derivation tree output.
val check_mg : < get_chart : Item.item list; pointers : (Item.item, Point.point) Hashtbl.t; .. > -> string -> (Item.cat, string) Hashtbl.t -> Item.cat list -> (string * string) Utilities.disptree list
Higher order checking function for mcfg to pass into mcfgcky.ml and other modules requiring MG derivation tree output.
val check_mg_ived : < get_chart : Item.item list; pointers : (Item.item, Point.point) Hashtbl.t; .. > -> string -> (Item.cat, string) Hashtbl.t -> Item.cat list -> (string * string) Utilities.disptree list
Higher order checking function for mcfg to pass into mcfgcky.ml and other modules requiring MG derived tree output. Not implemented.
val check_null : ’a -> ’b -> ’c -> ’d -> ’e list
Vacuous higher order checking function for mcfg to pass into Mcfgcky.parse_gen where it is used for statistical parsing modules which do not require explicit discrete derivation or derived trees.
Various output functions that we need. Coordinates output in a quiet and a verbose mode. Could probably stuff some verbose mode functionality into a separate debugging mode. Alternatively, could just have the command line option spit out what you need (include a seperate option for the fertility matrix, for example).
val round : float -> float
Round a floating point number intelligently to 3 places after the decimal. Used sparingly.
val rounded_string : float -> string
Provide a rounded string representation of the floating point number.
val pp_H_hash : Pcfg.t_bank -> Pervasives.out_channel -> unit
Output entropy values from the hash table for verbose output. Could in most cases be shut off.
val string_of_fmatrix : (Item.item array * float array array) option -> string
val pp_fmatrices : Pcfg.t_bank -> Pervasives.out_channel -> unit
Output the fertility matrix for verbose output. Could definitely be suppressed, but should be kept around for debugging purposes.
val string_of_trainrhs : Point.dcfgrhs -> Item.cat
val pp_trainpcfg : < pcfg : (string, Pcfg.pcfgrule) Hashtbl.t option; .. > -> Pervasives.out_channel -> unit
Output the treebank grammar in verbose mode. Very useful to have around.
val pp_testcfg : Pcfg.t_bank -> Pervasives.out_channel -> unit
Output the weighted intersection grammar in verbose mode. Very useful to have around.
val pp_renormedhvect : Pcfg.t_bank -> Pervasives.out_channel -> unit
Output the values of h→. Could be suppresed but useful for debugging.
val pp_inside : Pcfg.t_bank -> Pervasives.out_channel -> unit
Output the inside probabilities of different values in verbose mode. Mostly useful to have around.
val pp_testbank_report : ((string * ’a) * Pcfg.t_bank) list -> Pervasives.out_channel -> unit
Output all the testtime-pretinent values provided by the above functions in a report in verbose mode.
val pp_ent_report : < pcfg : (string, Pcfg.pcfgrule) Hashtbl.t option; .. > -> ((string * ’a) * Pcfg.t_bank) list -> string -> unit
Total output for verbose mode.
val pp_ent_sum : ’a -> ((string * ’b) * Pcfg.t_bank) list -> string -> unit
Summary of entropy values for quiet mode.
val pp_surp_sum : ’a -> ((string * Item.item) * Pcfg.t_bank) list -> string -> unit
Summary of surprisal values for quiet mode.
val pp_surp_sum_kilbury : ’a -> ((string * Item.item) * Pcfg.t_bank) list -> string -> unit
Summary of surprisal values for experimental kilbury parsing in quiet mode.
val pp_short_sum : ’a -> ((string * Item.item) * Pcfg.t_bank) list -> string -> unit
Total output for quiet mode.
val pp_short_sum_kilbury : ’a -> ((string * Item.item) * Pcfg.t_bank) list -> string -> unit
Total output for quiet mode in experimental kilbury parsing.
Low-level printing capabilities for standard out and file output.
val string_of_span : Item.span -> string
Return a string of a given span.
val string_of_spanlist : Item.span list -> string
Return a string of a given span list.
val string_of_anonitem : Item.span list list -> string
Return a string of a given span list list. Not used.
val string_of_item : Item.item -> string
Return a string of a given item.
val string_of_point : Point.point -> string
Return a string of a given pointer.
val print_item : (Item.item, Point.point) Hashtbl.t -> Item.item -> unit
Print a given item to standard out.
val string_label : string -> Item.span list -> string
val cat_label : ’a -> (’a, string) Hashtbl.t -> string
val full_label : string -> Item.span list -> ’a -> (’a, string) Hashtbl.t -> string
val qtree_of_tree : (string * string) Utilities.disptree -> string
For a given derivation/derived tree return a qtree-format string for LaTeX.
val dot_of_tree : (string * string) Utilities.disptree -> string
For a given derivation/derived tree return a qtree-format string for DOT.
val pp_of_tree : (’a * string) Utilities.disptree -> string
For a given derivation/derived tree return a pretty-printed string.
val print_trees : (’a -> string -> ’b -> ((< pointers : (’d, ’e) Hashtbl.t; .. > as ’c) -> string -> ’f -> string list -> ’g list) -> ’f -> ’h option -> ’c) -> ’a -> string -> ’b -> (’c -> string -> ’f -> string list -> ’g list) -> (’g -> string) -> ’f -> string -> unit
Given a checking function and a print function, return the appropriate display tree.
val print_trees_iter : (’a -> string -> (< pointers : (’c, ’d) Hashtbl.t; .. > as ’b) -> (’e -> string -> ’f -> string list -> ’g list) -> ’f -> ’h option -> ’e) -> ’a -> string -> ’b -> (’e -> string -> ’f -> string list -> ’g list) -> (’g -> string) -> ’f -> string -> unit
Same as above, but iterates over a list of sentences passed in from a file
val print_debug : (’a -> ’b -> ’c -> ’d) -> ’a -> ’b -> ’c -> ’e -> ’d
Given a checking function and a print function, return debugging statements from the stack during parsing.