I am trying to create a compact version of the LFs in the parses produced by OpenCCG's Parse.java class.
By default, these get unpacked into a fully flattened list of parses, but I would like to produce a single <lf> output containing a compact, disjunctive representation of the results.
I have already found some infrastructure that could be useful, namely the HyloHelper "compact" and "append" methods, but I am still struggling to figure out how to interact with the CKY table represented by parse/Chart.java, so that I can start compacting at the right moment of processing. I have seen papers online showing results from a similar CCG extension, but I have never found any trace of the code. Could anyone give me some tips?
There is code to support generating from disjunctive logical forms (DLFs), but the creation of DLFs to date has been assumed to take place externally to OpenCCG (eg in a sentence planner).
That said, it would be interesting to add the capability to create DLFs from the parse chart. I'm curious, what is your intended usage?
In principle, creating a DLF from the parse chart is conceptually straightforward - just add 'or' nodes for the 'or' branches in the and-or graph that the chart implicitly represents. In practice, there will be a number of issues to address in implementing this capability (and unfortunately it's not high on my 'to do' list right now). First, note that the LFs will have to be re-constructed bottom up, with the disjunctions in them. Also, the unifications resulting from the rule applications will have to be re-applied to the DLF under construction. Finally, the compact and convert routines will likely need to be modified to support inputs containing disjunctions.
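Schematically, the and-or view of the chart mentioned above might look like the following toy sketch. All names here are invented for illustration; this is not OpenCCG's actual Chart or edge API. An 'and' node is one rule application with its child edges, and an 'or' node groups the alternative derivations of the same chart entry:

```java
import java.util.*;

// Toy sketch of the and-or graph a parse chart implicitly represents
// (invented names; not OpenCCG's Chart API).
class AndOrChart {
    static class Derivation {              // 'and' node: one way to build an edge
        final String rule;
        final List<Edge> children;
        Derivation(String rule, List<Edge> children) {
            this.rule = rule;
            this.children = children;
        }
    }
    static class Edge {                    // 'or' node: all ways to build this span
        final List<Derivation> alternatives = new ArrayList<>();
    }
    // Counting derivations bottom-up illustrates why multiplying out the
    // alternatives explodes while the graph itself stays compact.
    static long countReadings(Edge e) {
        if (e.alternatives.isEmpty()) return 1;   // lexical edge
        long total = 0;
        for (Derivation d : e.alternatives) {
            long prod = 1;
            for (Edge c : d.children) prod *= countReadings(c);
            total += prod;
        }
        return total;
    }
}
```

Building a DLF would amount to walking this structure bottom-up, emitting 'or' nodes wherever an edge has more than one alternative, with the caveats about re-applying unifications noted above.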
If you're interested in trying to implement this capability, I can answer more specific implementation questions.
Thank you for your helpful reply! I am afraid that the implications of the disjunctions being present during the parse process are a bit too severe for me to handle in the time frame I had originally intended. As you said, working with LFs already in the chart probably requires extending many (all?) components of the parsing process.
My use case is a bit different from typical CCG usages: I am trying to parse linearized mathematical expressions into semantic trees, where the disjunctive compact form would allow me to convert the output into an explicit representation of the ambiguities present, which can in turn be further disambiguated by external tools. What is essential for my problem is that:
1. I preserve all the disambiguation my grammar can give me;
2. I don't multiply out ambiguity into multiple parses, but localize it to the greatest extent possible. Of course, localizing should not come at the expense of 1.
If I achieve this, I can directly move the ambiguous result into a standard XML language for representing mathematics (Content MathML) and almost immediately enable searching over ambiguous mathematical expressions. That is one of the targets of my Master's thesis, which I am currently working on. Any suggestions on the easiest way to get this outcome would be very welcome, as I don't think I will have the time to learn enough of OpenCCG to make all the necessary extensions to the parser components. A tentative solution I am considering is to post-process the sequence of parses with my own disjunctive compactification, but that is a highly non-trivial problem in itself.
I don't know anything else about the specific domain you are working with, but if it is a non-human grammar of mathematical expressions, OpenCCG might be overkill. Have you looked into parser generators like Antlr?
Having said that, you should look at work by Magdalena Wolska, using OpenCCG to work with mathematical expressions. Various papers available on her publications page:
I am not fully convinced it would be overkill, since mathematical expressions in full generality come back and bite you in unexpectedly complex ways. I have already invested some work in the OpenCCG approach and will try to push it further, at least for now. If it turns out to be a dead end, I will retrace my steps and start with something like Antlr indeed.
Thank you for the pointer to Magdalena; she has actually been kind enough to offer support for my work already. So I am lucky to have valuable help as far as the grammar is concerned; my bottleneck at the moment is getting this representation bit out of the way, hence I am trying to write my own compactifier for disjunctive LFs.
Thanks for the tips,
Thanks for the background on your project. I think you may be
right that it will be easier to create a disjunctive LF from
multiple parses than doing so directly from the chart, at least
in terms of avoiding getting too far into openccg internals.
I've thought about this a bit and it seems that it may not be too
hard to come up with something reasonable. The way to do it
might be to start with the flattened LFs for each parse, matching
up the elementary predications (EPs) that are equal, and creating
a disjunct for the ones that differ. I'll illustrate by example
below, though I'm sure there will be some further details to
resolve to turn this into an algorithm. If you pursue this
direction, I'd like to hear how it goes.
Note btw that as you've probably already discovered, for each
parse there is a distinct sign, whose LF is accessible via the
category (ie get the cat then get the LF). As shown in
TextCCG.java, the LF can be 'compacted' and 'converted', ie
compacted from flat form into hierarchical form, with nominals
instantiated as atoms. You can do this and then flatten the
result to get a flat list of EPs with nominal atoms. (Another
option might be to just skip the compaction step.)
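To experiment with the matching step without touching OpenCCG internals, each flattened reading can be modeled as a plain set of EP strings; matching up equal EPs then reduces to set intersection, and the parts to disjoin are the set differences. A minimal sketch, with invented class and method names:

```java
import java.util.*;

// Toy model: a reading's flat LF is a set of EP strings with nominal
// atoms, e.g. "@e(see)" or "@e(<Mod>o)".
class EpSets {
    // EPs shared by both readings (the part to keep conjoined)
    static Set<String> common(Set<String> a, Set<String> b) {
        Set<String> shared = new TreeSet<>(a);
        shared.retainAll(b);
        return shared;
    }
    // EPs occurring only in reading a (one side of the disjunction)
    static Set<String> onlyIn(Set<String> a, Set<String> b) {
        Set<String> rest = new TreeSet<>(a);
        rest.removeAll(b);
        return rest;
    }
}
```

With real OpenCCG LFs, the EPs would of course need a canonical string form (or a proper equals method) so that equal predications actually compare equal across parses.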
Ok, as an example, let's look at the classic 5-way ambiguity in
'A man saw a girl on the hill with a telescope' (ignoring
determiners and tense). In flat form, the 5 readings can be
represented as follows, where the common part is shown just once:
@e(see) ^ @e(<Arg0>m) ^ @e(<Arg1>g) ^ @m(man) ^ @g(girl) ^
@o(on) ^ @o(<Arg1>h) ^ @h(hill) ^
@w(with) ^ @w(<Arg1>t) ^ @t(telescope) ^
1. … @e(<Mod>o) ^ @h(<Mod>w)
2. … @g(<Mod>o) ^ @g(<Mod>w)
3. … @g(<Mod>o) ^ @h(<Mod>w)
4. … @e(<Mod>o) ^ @e(<Mod>w)
5. … @g(<Mod>o) ^ @e(<Mod>w)
So, if you take readings 1 and 2, recognize the common parts, and
disjoin the parts that differ, you get:
1..2: … (@e(<Mod>o) ^ @h(<Mod>w)) v (@g(<Mod>o) ^ @g(<Mod>w))
Now, when we get to the 3rd reading, we could just add a 3rd
disjunct. But, we'll get a more compact DLF if we recognize that
the final two EPs can be partially merged with the 2nd disjunct,
ie by recognizing that @g(<Mod>o) is in common, and disjoining
the rest:
1..3: … (@e(<Mod>o) ^ @h(<Mod>w)) v
(@g(<Mod>o) ^ (@g(<Mod>w) v @h(<Mod>w)))
Similarly, the 4th reading can be integrated by partially merging
with the first disjunct:
1..4: … (@e(<Mod>o) ^ (@h(<Mod>w) v @e(<Mod>w))) v
(@g(<Mod>o) ^ (@g(<Mod>w) v @h(<Mod>w)))
Finally, the 5th reading can be integrated by partially merging
with the 2nd disjunct again:
1..5: … (@e(<Mod>o) ^ (@h(<Mod>w) v @e(<Mod>w))) v
(@g(<Mod>o) ^ (@g(<Mod>w) v @h(<Mod>w) v @e(<Mod>w)))
At this point the DLF represents all five readings. The
conjoined part could then be compacted for better readability, I
imagine.
In the general case, it may be that incrementally merging in this
way does not produce an optimally compact DLF, but that may not
be too important in the scheme of things.
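The incremental merging walked through above can be sketched as a small greedy procedure. In this sketch the shared conjunction is assumed to be factored out already, so each reading is just its set of "variable" EP strings; the DLF is a top-level disjunction of branches, each a factored conjunction plus a list of alternative remainders. All names are invented for illustration (this is not OpenCCG code), and as just noted, the greedy merge-target choice need not yield an optimally compact result:

```java
import java.util.*;

// Greedy incremental construction of a DLF from flattened readings.
class DlfMerge {
    static class Branch {
        // Represents: factor ^ (alt1 v alt2 v ...)
        final Set<String> factor = new TreeSet<>();
        final List<Set<String>> alts = new ArrayList<>();
        Branch(Set<String> reading) {
            factor.addAll(reading);
            alts.add(new TreeSet<>());  // single empty remainder
        }
    }

    final List<Branch> branches = new ArrayList<>();

    // Merge a new reading into the branch whose factor it overlaps most
    // (ties go to the later branch); start a fresh branch if no overlap.
    void add(Set<String> reading) {
        Branch best = null;
        int bestOverlap = 0;
        for (Branch b : branches) {
            Set<String> overlap = new TreeSet<>(b.factor);
            overlap.retainAll(reading);
            if (!overlap.isEmpty() && overlap.size() >= bestOverlap) {
                best = b;
                bestOverlap = overlap.size();
            }
        }
        if (best == null) {
            branches.add(new Branch(reading));
            return;
        }
        // Shrink the factor to the shared EPs, pushing any leftover
        // factor EPs down into every existing alternative.
        Set<String> newFactor = new TreeSet<>(best.factor);
        newFactor.retainAll(reading);
        Set<String> leftover = new TreeSet<>(best.factor);
        leftover.removeAll(newFactor);
        for (Set<String> alt : best.alts) alt.addAll(leftover);
        best.factor.clear();
        best.factor.addAll(newFactor);
        // The new reading contributes its remainder as a new alternative.
        Set<String> remainder = new TreeSet<>(reading);
        remainder.removeAll(newFactor);
        best.alts.add(remainder);
    }

    // Expand the DLF back into the set of readings it represents.
    Set<Set<String>> expand() {
        Set<Set<String>> readings = new HashSet<>();
        for (Branch b : branches) {
            for (Set<String> alt : b.alts) {
                Set<String> r = new TreeSet<>(b.factor);
                r.addAll(alt);
                readings.add(r);
            }
        }
        return readings;
    }
}
```

Feeding the five telescope readings through this in order yields two branches, @e(<Mod>o) with remainders @h(<Mod>w) and @e(<Mod>w), and @g(<Mod>o) with remainders @g(<Mod>w), @h(<Mod>w) and @e(<Mod>w), matching the 1..5 form above, and expand() recovers exactly the original five readings. A real implementation would also need to handle nominal renaming across parses before EPs can be compared.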