Download Links
- Pending

Associated Agents
- Pending
We are using data from the WN-LEXICAL project:
- Project: http://sourceforge.net/projects/actr-wn-lexical/
- Paper: http://act-r.psy.cmu.edu/publication...nfo.php?id=648
Conversion
The data files in this project contain ACT-R chunks. Converting them to Soar's semantic memory declarative-add format is essentially string replacement. Our git repository includes a PHP script that performs this conversion: it reads a WN-LEXICAL data file on standard input and writes a valid semantic memory declarative add statement on standard output. For example:
cat WNChunks.data | php wnlexical.php smem > wn.soar
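To give a feel for the kind of rewriting involved, here is a minimal Python sketch. The chunk and attribute names below are hypothetical, not the actual WN-LEXICAL format: it simply shows how an ACT-R-style chunk of the form (name ISA type slot value ...) can be rewritten into a Soar smem declarative-add entry.

```python
# Illustrative sketch only: the chunk syntax here is hypothetical, not the
# actual WN-LEXICAL data format. It demonstrates the string-replacement idea:
# an ACT-R-style chunk (name ISA type slot value ...) becomes a Soar smem
# entry of the form (<name> ^isa type ^slot value ...).

def chunk_to_smem(chunk: str) -> str:
    """Rewrite one hypothetical ACT-R chunk into an smem add line."""
    tokens = chunk.strip().lstrip("(").rstrip(")").split()
    name, _isa, ctype, *slots = tokens  # e.g. ["s1", "ISA", "synset", "id", "100002137"]
    pairs = [f"^{slots[i].lower()} {slots[i + 1]}" for i in range(0, len(slots), 2)]
    return f"(<{name}> ^isa {ctype} " + " ".join(pairs) + ")"

lines = ["(s1 ISA synset id 100002137 word entity)"]
body = "\n".join(chunk_to_smem(line) for line in lines)
print("smem --add {\n" + body + "\n}")
```

Running this prints a single-entry declarative add; the real script applies the same kind of rewrite to every chunk in the data file.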
This conversion is very fast (~30s) and produces about 84MB of text data. Sourcing the resulting data into Soar takes a few minutes and a good amount of memory (~1GB). To make this a one-time cost, switch semantic memory to write to a file before sourcing the data:
smem --set database file
smem --set path /path/to/wn.db

The script can also emit SQL statements to populate a "sqlite" or "mysql" database (swap either quoted string for "smem" in the command above). This can be useful for searching, analyzing, and understanding the WordNet chunks.
SemCor
We have done some preliminary work with SemCor (http://web.eecs.umich.edu/~mihalcea/...mcor3.0.tar.gz), a set of texts semantically annotated with WordNet senses. The scripts described here (available on GitHub) parse the SemCor format (cleaning it as necessary), cross-reference the SQL output of the WordNet conversion above, and create WordNet-independent word-sense disambiguation (WSD) test sets.
- semcor.php:
- input=semcor
- output=cleaned, non-sgml format
- semcor-tags.php:
- input=semcor taglist directory
- output=output of semcor.php run on all files in directory
- wn-semcor.php:
- inputs=SQLite3 database produced by wnlexical.php, plus the output of semcor.php
- output=SQL for WSD database
For example, to build a WSD database for the SemCor file br-a01:
1. php wnlexical.php sqlite WNChunks.data | sqlite3 wn.db
2. php semcor.php < br-a01 > br-a01.txt
3. php wn-semcor.php wn.db sqlite 1 y br-a01.txt | sqlite3 br-a01.db
To speed up the cross-referencing queries, create the following indexes on wn.db:
- CREATE INDEX sk_sk ON wn_chunk_sk (sense_key);
- CREATE INDEX s_syn_w_sn_st_wl ON wn_chunk_s (synset_id,w_num,sense_number,ss_type,word_lower);
- CREATE INDEX s_wl_st ON wn_chunk_s (word_lower,ss_type);
- CREATE INDEX g_syn ON wn_chunk_g (synset_id);
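The index statements above can also be applied programmatically. Below is a minimal sketch using Python's sqlite3 module; the demo tables contain only the indexed columns, so they are stand-ins, not the full schema produced by wnlexical.php.

```python
import sqlite3

# The CREATE INDEX statements listed above. Apply them to wn.db before
# running wn-semcor.php to speed up the cross-referencing queries.
INDEXES = [
    "CREATE INDEX sk_sk ON wn_chunk_sk (sense_key)",
    "CREATE INDEX s_syn_w_sn_st_wl ON wn_chunk_s "
    "(synset_id,w_num,sense_number,ss_type,word_lower)",
    "CREATE INDEX s_wl_st ON wn_chunk_s (word_lower,ss_type)",
    "CREATE INDEX g_syn ON wn_chunk_g (synset_id)",
]

def create_indexes(db_path: str) -> None:
    """Create all WordNet-chunk indexes on the given SQLite database."""
    with sqlite3.connect(db_path) as conn:
        for stmt in INDEXES:
            conn.execute(stmt)

# Demo on an in-memory database with minimal stand-in tables (only the
# indexed columns; the real wnlexical.php tables have more).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wn_chunk_sk (sense_key)")
conn.execute("CREATE TABLE wn_chunk_s (synset_id, w_num, sense_number, ss_type, word_lower)")
conn.execute("CREATE TABLE wn_chunk_g (synset_id)")
for stmt in INDEXES:
    conn.execute(stmt)
names = {r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type='index'")}
print(sorted(names))
```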
- wsd_sentences:
- c_id=corpus id (from step 3)
- s_id=sentence id (from SemCor)
- w_id=word id (from SemCor)
- w_lex=lexical word to search (from WordNet)
- w_pos=part-of-speech (from SemCor/WordNet; note "a" covers both adjectives and satellites in WN)
- wsd_word_options:
- c_id, s_id, w_id = from wsd_sentences
- w_synset = possible synonym set (from WordNet)
- w_tag_count = corpus frequency of synset (from WordNet)
- w_gloss = synset definition (from WordNet)
- wsd_word_assignments:
- c_id, s_id, w_id = from wsd_sentences
- w_synset = correct assignment (from SemCor; there may be multiple rows per word)
- wsd_ambiguity:
- c_id, w_pos = from wsd_sentences
- ambig = (# rows in wsd_word_assignments) / (# rows in wsd_word_options)
- ambig_prop = proportion of words with this w_pos in c_id that have this ambig value
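To illustrate how these tables fit together, here is a toy sketch using Python's sqlite3 module. The columns are reduced to those described above and the rows (word, synset ids, glosses) are made up for illustration; it computes the ambig ratio for a single word as defined above.

```python
import sqlite3

# Toy illustration of the WSD schema described above: minimal columns,
# made-up example rows (the synset ids and glosses here are fabricated).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wsd_sentences        (c_id, s_id, w_id, w_lex, w_pos);
CREATE TABLE wsd_word_options     (c_id, s_id, w_id, w_synset, w_tag_count, w_gloss);
CREATE TABLE wsd_word_assignments (c_id, s_id, w_id, w_synset);

-- One word ("bank", noun) with three candidate synsets and one correct sense.
INSERT INTO wsd_sentences        VALUES (1, 1, 1, 'bank', 'n');
INSERT INTO wsd_word_options     VALUES (1, 1, 1, 108420278, 25, 'a financial institution'),
                                        (1, 1, 1, 109213565, 10, 'sloping land beside water'),
                                        (1, 1, 1, 102787622, 2,  'a container for coins');
INSERT INTO wsd_word_assignments VALUES (1, 1, 1, 108420278);
""")

# ambig = (# rows in wsd_word_assignments) / (# rows in wsd_word_options),
# computed here per word (c_id, s_id, w_id).
row = conn.execute("""
    SELECT CAST((SELECT COUNT(*) FROM wsd_word_assignments a
                 WHERE a.c_id = s.c_id AND a.s_id = s.s_id AND a.w_id = s.w_id) AS REAL)
         / (SELECT COUNT(*) FROM wsd_word_options o
            WHERE o.c_id = s.c_id AND o.s_id = s.s_id AND o.w_id = s.w_id) AS ambig
    FROM wsd_sentences s
""").fetchone()
print(row[0])
```

With one correct assignment out of three candidate synsets, the word's ambig value is 1/3; wsd_ambiguity aggregates such values by corpus and part of speech.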
Documentation
- Pending

Developer
- Nate Derbinsky

Soar Versions
- Any

Language
- PHP