Wordnet

Maintained at Princeton, though no longer updated.

We started with the 3.1 version of the wordnet data, in the file named:
wn3.1.dict.tar.gz 16MB

and loaded the data into a SQL database.

Documentation
https://wordnet.princeton.edu/documentation

A topic map implementation
https://www.visual-thesaurus.com/wordnet.php

concordance - alphabetical list of principal words of a book
index - alphabetical list of topics
lemma - word that stands at the head of a definition in a dictionary. In English, the lemma is usually the singular form of a noun or the infinitive of a verb.
lexeme - a unit of meaning: the lemma and its forms. run, runs, and running are all one lexeme; the lemma is run.
stem - the part that never changes. produce, producer, producing, production; the stem is produc
corpus
lexicon

example:
lemma: produce
lexeme: produce, produces, produced, producing
stem: produc
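
As a rough illustration of the stem idea, the shared leading characters of the four words used in the stem definition can be computed with the standard library. This is only a sketch: a common prefix is not a real stemming algorithm and breaks down on irregular forms.

```python
import os.path

# The word list from the stem definition above.
words = ["produce", "producer", "producing", "production"]

# commonprefix compares strings character by character,
# so it works on any list of strings, not just file paths.
stem = os.path.commonprefix(words)
print(stem)
```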

synset - A synonym set; a set of words that are interchangeable.
hypernym - superordinate
hyponym - subordinate
unique beginner - a synset with no hypernym
word - word
collocation - word combination, e.g. “fountain pen”, “take in”

pos - part of speech: noun, verb, adj, adv

Adjectives may be organized into clusters containing head synsets and satellite synsets.

Adverbs generally point to the adjectives from which they are derived.

wordnet pointers

wordnet sense

wordnet adjective clusters

wordnet reflexive adjectives

notes

nouns have tops
adjectives have heads and satellites
verbs have frames

lexicon - dictionary
lexicography - the practice of making dictionary files
lexis - Greek for word or speech
lexical - relating to words or vocabulary
sense = meaning = definition = gloss

Two sets of lexicography files
1. lexicographer files - created manually by lexicographers, input to grind
2. db files - index + data, output from grind, used in apps

Two additional files, used by both lexicography and db file sets
cntlist - frequency counts per word
verb.Framestext - generic sentence frames, frame = grammar pattern

lexicographer files
stored in wordnet/dict/dbfiles/  (note: this foldername dbfiles is misleading.)
https://wordnet.princeton.edu/documentation/wninput5wn
input to grind
pos.category

format
{} - each synset is wrapped in braces, one synset per line, terminated with a newline
[] - a word together with its own (lexical) pointers is wrapped in square brackets
() - each definition is wrapped in parens
fields are separated by a single space

simplest structure
{ entity, (that which is perceived...distinct existence) }
two fields, word and definition, separated by a space
the word has a trailing comma because it could potentially be part of a list
the definition is wrapped in parens ()

first complication
{ physical_entity, entity,@ (an entity that has physical existence) }
two words in a list
word comma symbol
symbols: 
no symbol, comma only
@ - hypernym
+ - derivationally related form
! - antonym
#m - member holonym

digit appended to word - the lex_id, which tells apart senses of the same word within one lexicographer file
{ nutrient1, substance1,@ (any substance (such as a chemical element or inorganic compound) that can be taken in by a green plant and used in organic synthesis) }
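
The structure described so far (words, pointers, gloss) can be pulled apart with a few string operations. The sketch below is a hypothetical helper, not part of WordNet or grind; it assumes a noun synset without square-bracket groups, and that the first opening paren on the line starts the gloss.

```python
import re

def parse_simple_synset(line):
    """Parse a lexicographer line that has no square-bracket groups."""
    line = line.strip()
    if not (line.startswith("{") and line.endswith("}")):
        raise ValueError("synset must be wrapped in braces")
    body = line[1:-1].strip()
    if not body.endswith(")"):
        raise ValueError("synset must end with a gloss in parens")

    # Assumption: the first "(" starts the gloss (true for these noun examples).
    paren = body.index("(")
    gloss = body[paren + 1:-1]

    words, pointers = [], []
    for token in body[:paren].split():
        text, _, symbol = token.partition(",")
        if symbol:
            # word comma symbol: a pointer to another synset
            pointers.append((text, symbol))
        else:
            # word comma: a member of this synset, with an optional lex_id digit
            m = re.fullmatch(r"(.*?)(\d*)", text)
            words.append((m.group(1), int(m.group(2) or 0)))
    return {"words": words, "pointers": pointers, "gloss": gloss}

example = "{ physical_entity, entity,@ (an entity that has physical existence) }"
print(parse_simple_synset(example))
```

Pointer targets are kept verbatim (including any digit), since resolving them requires looking at the other synsets.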

brackets
list of words and bracketed sets, commas after words but not after bracketed set
{ [ animal, verb.creation:animalize,+ verb.creation:animalise,+ ] animate_being, [ beast, adj.all:inhumane^beastly,+ adj.all:inhumane^bestial,+ ] brute, [ creature, verb.creation:create,+ ] fauna, organism,@ noun.animal:kingdom_Animalia,#m (a living organism characterized by voluntary movement) }

filename colon word comma symbol
verb.creation:animalize,+

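^ caret: head^satellite, used when the target word is in a satellite synset of an adjective cluster (below, the satellite nutrient in the cluster headed by wholesome)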
{ food, [ nutrient, adj.all:wholesome^nutrient,+ verb.consumption:nutrify,+ ] substance1,@ (any substance that can be metabolized by an animal to give energy and build tissue) }

{ [ substance, adj.all:substantial,+ ] matter,@  noun.relation:part,@ (the real physical matter of which a person or thing consists; "DNA is the substance of our genes") }
{ substance1, matter,@ (a particular kind or species of matter with uniform properties; "shigella is one of the most toxic substances known to man") }
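
The bracketed sets and the filename:word and head^word pointer forms can be taken apart as follows. This is a minimal sketch using hypothetical helper names (split_pointer, bracket_groups); it assumes tokens are separated by single spaces and that brackets never appear inside a gloss.

```python
import re

line = ("{ [ animal, verb.creation:animalize,+ verb.creation:animalise,+ ] "
        "animate_being, [ beast, adj.all:inhumane^beastly,+ "
        "adj.all:inhumane^bestial,+ ] brute, [ creature, verb.creation:create,+ ] "
        "fauna, organism,@ noun.animal:kingdom_Animalia,#m "
        "(a living organism characterized by voluntary movement) }")

def split_pointer(token):
    """Split 'file:head^word,symbol' into its parts; missing parts are None."""
    target, _, symbol = token.partition(",")
    lex_file, sep, rest = target.partition(":")
    if not sep:                      # no filename prefix: same file
        lex_file, rest = None, target
    head, sep, word = rest.partition("^")
    if not sep:                      # no caret: target is not a satellite
        head, word = None, rest
    return {"lex_file": lex_file, "head": head, "word": word, "symbol": symbol}

def bracket_groups(text):
    """Return (word, [pointers]) for every square-bracket group in a synset."""
    groups = []
    for inner in re.findall(r"\[\s*(.*?)\s*\]", text):
        tokens = inner.split()
        word = tokens[0].rstrip(",")
        groups.append((word, [split_pointer(t) for t in tokens[1:]]))
    return groups

for word, ptrs in bracket_groups(line):
    print(word, [p["word"] for p in ptrs])
```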

------
db files
stored in wordnet/dict/
https://wordnet.princeton.edu/documentation/wndb5wn
output from grind
index.pos + data.pos - the database, 8 files
pos.exc, cousin.exc - morphology exception lists
sentidx.vrb, sents.vrb - used to display example sentences for some verbs
index.sense - an alternative index into a synset

format
index
Each index file is an alphabetized list of all the words found in WordNet in the corresponding part of speech. On each line, following the word, is a list of byte offsets (synset_offset s) in the corresponding data file, one for each synset containing the word. Words in the index file are in lower case only, regardless of how they were entered in the lexicographer files. This folds various orthographic representations of the word into one line enabling database searches to be case insensitive. See wninput(5WN) for a detailed description of the lexicographer files.

Each index file begins with several lines containing a copyright notice, version number, and license agreement. These lines all begin with two spaces and the line number so they do not interfere with the binary search algorithm that is used to look up entries in the index files. All other lines are in the following format. In the field descriptions, number always refers to a decimal integer unless otherwise defined.

lemma  pos  synset_cnt  p_cnt  [ptr_symbol...]  sense_cnt  tagsense_cnt   synset_offset  [synset_offset...] 

lemma
lower case ASCII text of word or collocation. Collocations are formed by joining individual words with an underscore (_ ) character.
pos
Syntactic category: n for noun files, v for verb files, a for adjective files, r for adverb files.
All remaining fields are with respect to senses of lemma in pos .

synset_cnt

Number of synsets that lemma is in. This is the number of senses of the word in WordNet. See Sense Numbers below for a discussion of how sense numbers are assigned and the order of synset_offset s in the index files.
p_cnt
Number of different pointers that lemma has in all synsets containing it.
ptr_symbol
A space separated list of p_cnt different types of pointers that lemma has in all synsets containing it. See wninput(5WN) for a list of pointer_symbol s. If all senses of lemma have no pointers, this field is omitted and p_cnt is 0 .
sense_cnt
Same as synset_cnt above. This is redundant, but the field was preserved for compatibility reasons.
tagsense_cnt
Number of senses of lemma that are ranked according to their frequency of occurrence in semantic concordance texts.
synset_offset
Byte offset in data.pos file of a synset containing lemma . Each synset_offset in the list corresponds to a different sense of lemma in WordNet. synset_offset is an 8 digit, zero-filled decimal integer that can be used with fseek(3) to read a synset from the data file. When passed to read_synset(3WN) along with the syntactic category, a data structure containing the parsed synset is returned.
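
The index line layout and the binary search can be sketched together. The helper names below are hypothetical, and the sample lines are illustrative stand-ins built from the offsets used elsewhere in these notes, not copied from index.noun. Header lines start with two spaces, so their first space-separated field is the empty string, which sorts before every lemma.

```python
from bisect import bisect_left

def parse_index_line(line):
    """Split one index.pos line into its named fields."""
    f = line.split()
    p_cnt = int(f[3])
    i = 4 + p_cnt                      # first field after the ptr_symbol list
    return {
        "lemma": f[0],
        "pos": f[1],
        "synset_cnt": int(f[2]),
        "p_cnt": p_cnt,
        "ptr_symbols": f[4:i],
        "sense_cnt": int(f[i]),
        "tagsense_cnt": int(f[i + 1]),
        "synset_offsets": f[i + 2:],   # kept as 8-digit strings
    }

def lookup(lines, word):
    """Binary-search sorted index lines for a word; return parsed fields or None."""
    key = word.lower().replace(" ", "_")
    keys = [ln.split(" ", 1)[0] for ln in lines]   # header lines give ""
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return parse_index_line(lines[i])
    return None

# Illustrative lines only (assumed, not copied from the real file).
sample = [
    "  1 (license text)",
    "  2 (license text)",
    "entity n 1 1 ~ 1 0 00001740",
    "physical_entity n 1 2 @ ~ 1 0 00001930",
]
print(lookup(sample, "Entity"))
```

A real reader would search the file in place rather than loading every line, but the ordering argument is the same.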


data
one line per synset, with relational pointers resolved to synset_offset s.  
Pointers are followed and hierarchies traversed by moving from one synset to another via the synset_offset s.
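
Following a pointer is just a seek to the target offset followed by reading one line. The sketch below builds a tiny in-memory stand-in for a data file (two made-up synsets, with offsets computed rather than taken from WordNet) and follows a hypernym pointer from the second synset to the first.

```python
import io

def build_data_file(bodies):
    """Join line bodies, prefixing each with its own 8-digit byte offset."""
    out = bytearray()
    offsets = []
    for body in bodies:
        offsets.append(len(out))
        out += f"{len(out):08d} {body}\n".encode("ascii")
    return io.BytesIO(bytes(out)), offsets

def read_synset_line(fh, offset):
    """Seek to a synset_offset and return the line that starts there."""
    fh.seek(int(offset))
    return fh.readline().decode("ascii").rstrip("\n")

# Toy data: the second synset has one hypernym pointer back to offset 0.
fh, offsets = build_data_file([
    "03 n 01 entity 0 000 | toy root synset",
    "03 n 01 physical_entity 0 001 @ 00000000 n 0000 | toy child synset",
])

child = read_synset_line(fh, offsets[1])
fields = child.split()
target = fields[fields.index("@") + 1]   # offset that follows the @ symbol
parent = read_synset_line(fh, target)
print(parent)
```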

synset_offset  lex_filenum  ss_type  w_cnt  word  lex_id  [word  lex_id...]  p_cnt  [ptr...]  [frames...]  |   gloss 

synset_offset - Current byte offset in the file represented as an 8 digit decimal integer.

lex_filenum - Two digit decimal integer corresponding to the lexicographer file name containing the synset. See lexnames(5WN).

ss_type - One character code indicating the synset type:
n    NOUN 
v    VERB 
a    ADJECTIVE 
s    ADJECTIVE SATELLITE 
r    ADVERB 

w_cnt - Two digit hexadecimal integer indicating the number of words in the synset.
word - or collocation, case sensitive.  In data.adj , a word is followed by a syntactic marker if one was specified in the lexicographer file. A syntactic marker is appended, in parentheses, onto word without any intervening spaces. See wninput(5WN) for a list of the syntactic markers for adjectives.

lex_id - One digit hexadecimal integer that, when appended onto lemma , uniquely identifies a sense within a lexicographer file.

p_cnt - Three digit decimal integer indicating the number of pointers from this synset to other synsets. If p_cnt is 000 the synset has no pointers.

ptr - A pointer from this synset to another. (wninput(5WN)) ptr is of the form:
	pointer_symbol  synset_offset  pos  source/target 

	where synset_offset is the byte offset of the target synset in the data file corresponding to pos .

	The source/target field distinguishes lexical and semantic pointers. It is a four byte field, containing two two-digit hexadecimal integers. The first two digits indicate the word number in the current (source) synset, the last two digits indicate the word number in the target synset. A value of 0000 means that pointer_symbol represents a semantic relation between the current (source) synset and the target synset indicated by synset_offset .

	A lexical relation between two words in different synsets is represented by non-zero values in the source and target word numbers. The first and last two bytes of this field indicate the word numbers in the source and target synsets, respectively, between which the relation holds. Word numbers are assigned to the word fields in a synset, from left to right, beginning with 1 .

frames - In data.verb only, a list of numbers corresponding to the generic verb sentence frames for word s in the synset. frames is of the form:
	f_cnt   +   f_num  w_num  [ +   f_num  w_num...] 

	where f_cnt is a two digit decimal integer indicating the number of generic frames listed, f_num is a two digit decimal integer frame number, and w_num is a two digit hexadecimal integer indicating the word in the synset that the frame applies to. As with pointers, if this number is 00 , f_num applies to all word s in the synset. If non-zero, it is applicable only to the word indicated. Word numbers are assigned as described for pointers. Each f_num  w_num pair is preceded by a + . See wninput(5WN) for the text of the generic sentence frames.

gloss - Each synset contains a gloss. A gloss is represented as a vertical bar (| ), followed by a text string that continues until the end of the line. The gloss may contain a definition, one or more example sentences, or both.
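
The field descriptions above translate fairly directly into a parser. This is a minimal sketch with a hypothetical function name; it assumes well-formed input and keeps offsets and the source/target field as strings. Note the mixed bases: w_cnt, lex_id and w_num are hexadecimal, while p_cnt, f_cnt and f_num are decimal.

```python
def parse_data_line(line):
    """Parse one data.pos line into a dict of its fields."""
    head, _, gloss = line.partition("|")
    f = head.split()

    w_cnt = int(f[3], 16)                       # hexadecimal
    i = 4
    words = []
    for _ in range(w_cnt):
        words.append((f[i], int(f[i + 1], 16))) # (word, lex_id)
        i += 2

    p_cnt = int(f[i])                           # decimal
    i += 1
    pointers = []
    for _ in range(p_cnt):
        symbol, offset, pos, source_target = f[i:i + 4]
        pointers.append({"symbol": symbol, "offset": offset,
                         "pos": pos, "source_target": source_target})
        i += 4

    frames = []
    if i < len(f):                              # data.verb only
        f_cnt = int(f[i])
        i += 1
        for _ in range(f_cnt):
            # f[i] is the "+" marker in front of each f_num w_num pair
            frames.append((int(f[i + 1]), int(f[i + 2], 16)))
            i += 3

    return {"synset_offset": f[0], "lex_filenum": f[1], "ss_type": f[2],
            "words": words, "pointers": pointers, "frames": frames,
            "gloss": gloss.strip()}

line = ("00001740 03 n 01 entity 0 003 ~ 00001930 n 0000 ~ 00002137 n 0000 "
        "~ 04431553 n 0000 | that which is perceived or known or inferred "
        "to have its own distinct existence (living or nonliving)  ")
synset = parse_data_line(line)
print(synset["words"], len(synset["pointers"]))
```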

example 1
00001740 03 n 01 entity 0 003 ~ 00001930 n 0000 ~ 00002137 n 0000 ~ 04431553 n 0000 | that which is perceived or known or inferred to have its own distinct existence (living or nonliving)  

offset 00001740
filenum 03 noun.Tops
ss_type n  noun
w_cnt 01  1 word in the synset
word entity 
lex_id 0   1st sense of this word
p_cnt 003  three relation pointers
ptr1 ~ 00001930 n 0000   hyponym physical_entity noun semantic
ptr2 ~ 00002137 n 0000   hyponym abstraction, abstract_entity noun semantic
ptr3 ~ 04431553 n 0000   hyponym thing noun semantic
gloss "| that which is perceived..."

example 2
00002137 03 n 02 abstraction 0 abstract_entity 0 010 @ 00001740 n 0000 + 00694095 v 0101 ~ 00023280 n 0000 ~ 00024444 n 0000 ~ 00031563 n 0000 ~ 00032220 n 0000 ~ 00033319 n 0000 ~ 00033914 n 0000 ~ 05818169 n 0000 ~ 08016141 n 0000 | a general concept formed by extracting common features from specific examples  

offset 00002137
filenum 03  noun.Tops
ss_type  n  noun
w_cnt 02  two words in this synset
word abstraction 
lex_id 0  1st sense of this word
word abstract_entity
lex_id 0 1st sense of this word
p_cnt 010 ten relation pointers
@ 00001740 n 0000  hypernym noun semantic
+ 00694095 v 0101  derivationally related form, verb lexical "to abstract"
~ 00023280 n 0000  hyponym noun semantic
~ 00024444 n 0000  hyponym noun semantic 
...
gloss "| a general concept formed by..."
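
The source/target values in the pointers above (0000 and 0101) can be decoded with a small helper. This is a sketch with a hypothetical function name; word numbers are 1-based, so they are shifted by one to index a Python list.

```python
def decode_source_target(field):
    """Return ('semantic', None, None) or ('lexical', source_num, target_num)."""
    if len(field) != 4:
        raise ValueError("source/target must be four hex digits")
    source = int(field[:2], 16)
    target = int(field[2:], 16)
    if source == 0 and target == 0:
        return ("semantic", None, None)
    return ("lexical", source, target)

# Words of the synset in example 2, in left-to-right order.
words = ["abstraction", "abstract_entity"]

kind, source, target = decode_source_target("0101")
print(kind, words[source - 1])      # the pointer starts at this word
print(decode_source_target("0000"))
```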
projects/wordnet/wordnet.txt · Last modified: 2023/01/12 08:38 by jhagstrand

Except where otherwise noted, content on this wiki is licensed under the following license: Public Domain