This project is a word sense disambiguation (WSD) task that involves some preliminary work importing a WordNet database into Soar's Semantic Memory. It contains a set of PHP scripts that convert the data to a format Soar can use, and an agent that uses that knowledge to disambiguate words in various sentences.

Download Links

Associated Agents
  • Pending
Documentation
  • Pending
Data Source

We are using data from the WN-LEXICAL project. From the project page, download the wn-lexical-data archive for the version of WordNet of interest. Once extracted, WNChunks.data is the file of interest.

Conversion

The data files in this project contain ACT-R chunks. The conversion to Soar's semantic memory declarative add is basically string replacement. We have a PHP script in our git repository to perform this conversion. The script reads a WN-LEXICAL data file via standard input and outputs a valid semantic memory declarative add statement. For example:
cat WNChunks.data | php wnlexical.php smem > wn.soar
This conversion is very fast (~30s) and produces about 84MB of text data. Sourcing the resulting data into Soar takes a few minutes and a good amount of memory (~1GB). To make this a one-time cost, switch semantic memory to write to a file before sourcing the data:

smem --set database file
smem --set path /path/to/wn.db

This script can also create SQL statements to populate a "sqlite" or "mysql" database (substitute either quoted string for "smem" in the conversion command above). This may be useful to search, analyze, and understand the WordNet chunks.
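Once the "sqlite" output has been loaded into a database (as in step 1 of the SemCor sequence below), candidate senses can be queried directly. The following is a sketch assuming the wn_chunk_s and wn_chunk_g tables and the columns referenced by the index statements below; the gloss column name is an assumption:

-- List the candidate noun senses of "bank" (gloss column name assumed):
SELECT s.synset_id, s.sense_number, g.gloss
FROM wn_chunk_s s
JOIN wn_chunk_g g ON g.synset_id = s.synset_id
WHERE s.word_lower = 'bank' AND s.ss_type = 'n'
ORDER BY s.sense_number;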

SemCor

We have done some preliminary work with SemCor (http://web.eecs.umich.edu/~mihalcea/...mcor3.0.tar.gz), a set of texts semantically annotated with WordNet senses. The scripts described here (available on GitHub) parse the SemCor format (cleaning as necessary), cross-reference the SQL output of the WN conversion above, and create WordNet-independent WSD test sets.
  • semcor.php:
    • input=semcor
    • output=cleaned, non-sgml format
  • semcor-tags.php:
    • input=semcor taglist directory
    • output=output of semcor.php run on all files in directory
  • wn-semcor.php:
    • inputs=SQLite3 database produced by wnlexical.php, output of semcor.php
    • output=SQL for WSD database
Example sequence (assumes WN-LEXICAL 3 and SemCor 3):
  1. php wnlexical.php sqlite WNChunks.data | sqlite3 wn.db
  2. php semcor.php < br-a01 > br-a01.txt
  3. php wn-semcor.php wn.db sqlite 1 y br-a01.txt | sqlite3 br-a01.db
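As a quick sanity check on the resulting database (the tables are described below; corpus id 1 is the numeric argument from step 3):

sqlite3 br-a01.db "SELECT COUNT(*) FROM wsd_sentences WHERE c_id = 1;"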
Note that for all but the very smallest SemCor data files, adding the following indexes to the WordNet database will dramatically reduce the time needed to produce the WSD database (step 3):
  • CREATE INDEX sk_sk ON wn_chunk_sk (sense_key);
  • CREATE INDEX s_syn_w_sn_st_wl ON wn_chunk_s (synset_id,w_num,sense_number,ss_type,word_lower);
  • CREATE INDEX s_wl_st ON wn_chunk_s (word_lower,ss_type);
  • CREATE INDEX g_syn ON wn_chunk_g (synset_id);
The output will contain four tables (an example query over them follows this list):
  • wsd_sentences:
    • c_id=corpus id (from step 3)
    • s_id=sentence id (from SemCor)
    • w_id=word id (from SemCor)
    • w_lex=lexical word to search (from WordNet)
    • w_pos=part-of-speech (from SemCor/WordNet; note "a" covers both adjectives and satellites in WN)
  • wsd_word_options:
    • c_id, s_id, w_id = from wsd_sentences
    • w_synset = possible synonym set (from WordNet)
    • w_tag_count = corpus frequency of synset (from WordNet)
    • w_gloss = synset definition (from WordNet)
  • wsd_word_assignments:
    • c_id, s_id, w_id = from wsd_sentences
    • w_synset = correct assignment (from SemCor; there may be multiple)
  • wsd_ambiguity:
    • c_id, w_pos = from wsd_sentences
    • ambig = (# rows in wsd_word_assignments)/(# rows in wsd_word_options)
    • ambig_prop = proportion of words with this w_pos in c_id that have this ambig value
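As an illustration of how the tables fit together, the sketch below computes a most-frequent-sense baseline for corpus 1: for each word, take the candidate synset with the highest w_tag_count and check it against the SemCor assignment. This assumes the schema above and ignores w_tag_count ties and multiple correct assignments:

-- Most-frequent-sense baseline accuracy for corpus 1 (assumed schema above)
SELECT AVG(correct) AS mfs_accuracy
FROM (
  SELECT CASE WHEN EXISTS (
           -- does the highest-tag-count option match a SemCor assignment?
           SELECT 1
           FROM wsd_word_assignments a
           WHERE a.c_id = s.c_id AND a.s_id = s.s_id AND a.w_id = s.w_id
             AND a.w_synset = (SELECT o.w_synset
                               FROM wsd_word_options o
                               WHERE o.c_id = s.c_id
                                 AND o.s_id = s.s_id
                                 AND o.w_id = s.w_id
                               ORDER BY o.w_tag_count DESC
                               LIMIT 1)
         ) THEN 1.0 ELSE 0.0 END AS correct
  FROM wsd_sentences s
  WHERE s.c_id = 1
);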
Associated Publications
  • Pending
Developer
  • Nate Derbinsky
Soar Versions
  • Any
Language
  • PHP