Combining Lexical Resources in a Robust Broad-Coverage Semantic Parser

John Dowding (UC/Santa Cruz) and Matthew Purver (Stanford University/CSLI)

We describe an on-going effort to produce the lexicon for a robust
broad-coverage semantic parser by combining syntactic and semantic
information from several publicly available lexical resources.

This parser is motivated by a need to extract propositional content
from human-human meetings, as part of DARPA's CALO project.
Extracting this content requires a broad-coverage lexicon, since the
meeting topics are not determined in advance.  The parser is applied
to highly errorful speech recognition results (30%-40% word error
rates), so it must be robust.  These speech recognition results are
represented as Word Confusion Networks, each of which may encode a
large number of potential utterance hypotheses, so the parser must be
fast.  For these reasons, we decided on an approach that would depend
heavily on the lexicon, with a relatively impoverished set of
grammatical rules, focusing on extracting basic predicate-argument
structure, with less attention paid to more varied syntactic forms.
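
To illustrate why speed matters here, the following is a minimal Python
sketch of a Word Confusion Network as a sequence of slots, each holding
competing words with posterior probabilities.  The words, the
probabilities, and the function names are invented for illustration and
are not taken from the CALO data or from the parser itself.

```python
from itertools import product
from math import prod

# A toy word confusion network: one list per slot, each entry is
# (word, posterior probability).  All values are invented.
wcn = [
    [("we", 0.7), ("he", 0.3)],
    [("need", 0.6), ("knead", 0.3), ("neat", 0.1)],
    [("a", 0.8), ("the", 0.2)],
    [("plan", 0.5), ("plant", 0.5)],
]


def count_hypotheses(network):
    """Each path picks one word per slot, so the count is the product of slot sizes."""
    return prod(len(slot) for slot in network)


def best_hypotheses(network, n=3):
    """Brute-force enumeration of every path, returning the n most probable."""
    scored = []
    for path in product(*network):
        words = [word for word, _ in path]
        score = prod(p for _, p in path)
        scored.append((score, words))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:n]


if __name__ == "__main__":
    print(count_hypotheses(wcn))
    for score, words in best_hypotheses(wcn):
        print(round(score, 3), " ".join(words))
```

Even this four-slot network encodes 24 utterance hypotheses, and the
count grows multiplicatively with every additional slot, so enumerating
paths as done above is only workable for toy inputs.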

The resources we are currently using are COMLEX, VerbNet, WordNet,
and NOMLEX.  These resources each provide unique types of syntactic
and semantic information:

- COMLEX intends to provide detailed syntactic information for the
  40,000 most common words of English.  We extract from COMLEX lexical
  information for 4,200 adjectives (gradability and
  subcategorization), 5,665 verbs (subcategorization), 23,195 nouns
  (mass/count and temporality), and 3,120 adverbs (syntactic
  distribution), as well as most closed-class lexical categories.
  COMLEX also provides morphological variants for irregular forms.

- VerbNet provides semantic information for 5,000 verbs.  This
  information includes the verb class, verb frames, thematic roles,
  syntax-semantic mapping, and selectional restrictions.

- From WordNet we identify another 15,539 nouns, and the semantic
  class information for all nouns.  These semantic classes are
  hand-aligned to the selectional classes used in VerbNet, based on
  the upper ontology of EuroWordNet.

- NOMLEX (and NOMLEXPLUS) provide syntactic information for
  nominalizations, and information for mapping the noun arguments to
  the corresponding verb syntactic positions.  When combined with
  VerbNet's selectional restrictions on thematic roles, this provides
  additional selectional constraints for nominalizations.
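
The interaction between the last three resources can be sketched as
follows.  This is a hedged Python illustration only: the dictionary
layouts, the entries for "destroy" and "destruction", the noun classes,
and the function name are all hypothetical simplifications, and do not
reproduce the actual NOMLEX, VerbNet, or WordNet formats.

```python
# Hypothetical VerbNet-style frame: syntactic position -> thematic role
# plus a selectional restriction on the filler.
VERB_FRAMES = {
    "destroy": {
        "subject": {"role": "Agent", "restriction": "animate"},
        "object": {"role": "Patient", "restriction": "concrete"},
    },
}

# Hypothetical NOMLEX-style entry: noun argument position -> verb position.
NOMINALIZATIONS = {
    "destruction": {
        "verb": "destroy",
        "arg_map": {"pp_by": "subject", "pp_of": "object", "det_poss": "subject"},
    },
}

# Hypothetical WordNet-derived semantic classes for nouns.
NOUN_CLASSES = {"army": {"animate"}, "city": {"concrete"}, "idea": {"abstract"}}


def map_nominal_args(noun, args):
    """Map a nominalization's arguments onto the thematic roles of its verb.

    Returns the underlying verb and, per role, the filler together with a
    flag saying whether the filler satisfies the selectional restriction.
    """
    entry = NOMINALIZATIONS[noun]
    frame = VERB_FRAMES[entry["verb"]]
    roles = {}
    for position, filler in args.items():
        slot = frame[entry["arg_map"][position]]
        satisfied = slot["restriction"] in NOUN_CLASSES.get(filler, set())
        roles[slot["role"]] = (filler, satisfied)
    return entry["verb"], roles


if __name__ == "__main__":
    # "the destruction of the city by the army"
    print(map_nominal_args("destruction", {"pp_of": "city", "pp_by": "army"}))
```

Under these assumptions, the "of" phrase is mapped to the verb's object
position and the "by" phrase to its subject position, after which the
verb's selectional restrictions can be checked against the noun classes
of the fillers.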

These lexical entries and grammar rules are converted to the
Prolog-based format used in the Gemini framework, which includes a
fast bottom-up robust parser in which syntactic and semantic
information are applied in an interleaved fashion.  The semantic rules
in this grammar produce a Minimal Recursion Semantics representation,
motivated by a desire to make the
semantic features extracted by the parser available as inputs to
further machine learning algorithms for identifying higher-level
semantic content, such as the action items that have been assigned, or
decisions that have been made.
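
As a rough illustration of why a flat semantic representation is
convenient for downstream learning, the Python sketch below models an
MRS-like analysis as a list of elementary predications and turns it into
string features.  The class, the predicate names, the use of thematic
roles as argument labels, and the feature scheme are assumptions made
for this example; they are not the parser's actual output format.

```python
from dataclasses import dataclass, field


@dataclass
class EP:
    """One elementary predication: a labelled relation with named arguments."""
    label: str
    predicate: str
    args: dict = field(default_factory=dict)


# Invented, simplified analysis of "Kim will send the report".
eps = [
    EP("h1", "send_v", {"ARG0": "e1", "Agent": "x1", "Theme": "x2"}),
    EP("h2", "named", {"ARG0": "x1", "CARG": "Kim"}),
    EP("h3", "report_n", {"ARG0": "x2"}),
]


def to_features(predications):
    """Flatten predications into a set of string features for a classifier."""
    # Map each variable to the name or predicate that introduces it.
    intro = {
        ep.args["ARG0"]: ep.args.get("CARG", ep.predicate)
        for ep in predications
    }
    feats = set()
    for ep in predications:
        feats.add(f"pred={ep.predicate}")
        for role, value in ep.args.items():
            if role not in ("ARG0", "CARG") and value in intro:
                feats.add(f"{ep.predicate}:{role}={intro[value]}")
    return feats


if __name__ == "__main__":
    for feature in sorted(to_features(eps)):
        print(feature)
```

Because the predications form an unordered, flat collection, a partial
parse of a noisy utterance still yields usable features for whichever
predications were recovered.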

This work is similar to prior work in SPOT, Xerox, and Swift.  It
differs from prior work primarily in the inclusion of NOMLEX, and the
mapping of nominalizations to verb frames.

References

VerbNet
COMLEX
WordNet
NOMLEX
Gemini
MRS
CALO
ICSI Meeting Corpus
SPOT
Xerox
Swift