Editing your vocabulary object

The Nephosem module includes a Vocab class based on a dictionary: the keys are the items (lemmas, or whatever unit you have chosen) and the values are their frequencies in a corpus. It is one of the first things to be computed. In these examples, myvocab stands for an instance of the Vocab class.

[1]:
import sys
nephosemdir = "../../nephosem/"
sys.path.append(nephosemdir)  # make the nephosem package importable
mydir = "./"
from nephosem import ConfigLoader, Vocab
conf = ConfigLoader()
settings = conf.update_config('config.ini')  # settings based on config.ini

Get a Vocab

You can load an existing Vocab with the .load() method, which is paired with a .save() method.

[2]:
myvocab = Vocab.load('output/Toy.nfreq')
myvocab
# myvocab.save('output/Toy.nfreq')
[2]:
[('the/D', 53),('boy/N', 25),('eat/V', 22) ... ('ten/C', 1),('ask/V', 1),('about/I', 1)]

You can inspect your vocabulary with myvocab.describe(). In addition, len(myvocab) returns the number of types, while myvocab.sum() returns the number of tokens.

[3]:
print(myvocab.describe())
Total items: 55
Total freqs: 264
count  55.000000
mean    4.800000
std     8.722895
min     1.000000
25%     1.000000
50%     1.000000
75%     4.000000
max    53.000000
[4]:
len(myvocab)
[4]:
55
[5]:
myvocab.sum()
[5]:
264
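
To make the type/token distinction concrete, here is a minimal sketch that uses a plain Python dictionary as a stand-in for a Vocab. The name toy_freqs and its values are made up for illustration; this is not how nephosem computes these figures internally.

```python
# Plain-dict stand-in for a Vocab: item -> frequency (toy values, not the Toy corpus).
toy_freqs = {"the/D": 53, "boy/N": 25, "eat/V": 22, "ask/V": 1}

n_types = len(toy_freqs)            # number of distinct items (types)
n_tokens = sum(toy_freqs.values())  # total number of occurrences (tokens)

print(n_types, n_tokens)  # 4 101
```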

Filtering

You can filter a vocabulary with a frequency threshold (e.g. keeping the items with a frequency above 10), or select the most frequent items with a slice:

[6]:
over10 = myvocab[myvocab.freq > 10]
over10
[6]:
[('the/D', 53),('boy/N', 25),('girl/N', 21),('eat/V', 22),('apple/N', 21),('be/V', 11)]
[7]:
top3 = myvocab[:3]
top3
[7]:
[('the/D', 53),('boy/N', 25),('eat/V', 22)]

You can also use regular expressions on the names of the items to filter your vocabulary:

[8]:
nouns = myvocab[myvocab.match('item', '^..+/N.*')]
nouns
[8]:
[('boy/N', 25),('girl/N', 21),('apple/N', 21),('baby/N', 2),('food/N', 4),('house/N', 1),('year/N', 3)]
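
If you want to check what a pattern such as '^..+/N.*' selects before applying it to your vocabulary, you can try it on a few item names with Python's re module. The sketch below assumes the pattern is applied from the start of the item name (which the ^ anchor enforces anyway); the list of items is a toy selection and 'a/N' is a hypothetical item added to show a non-match.

```python
import re

# Item names in the 'lemma/POS' format used in this tutorial (toy selection).
items = ["the/D", "boy/N", "eat/V", "apple/N", "a/N", "year/N"]

# '^..+/N.*' reads: at least two characters, then '/N', then anything.
pattern = re.compile(r"^..+/N.*")

nouns = [item for item in items if pattern.match(item)]
print(nouns)  # ['boy/N', 'apple/N', 'year/N']
```

Note that the hypothetical 'a/N' is excluded because the pattern requires at least two characters before '/N'.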

If you have a list of items, such as the one returned by myvocab.get_item_list(), you can use it to get a subset with myvocab.subvocab(list_of_items). Any item that is not present in the vocabulary is still included in the subset, with a frequency of 0.

[9]:
myvocab.subvocab(['boy/N', 'girl/N', 'vector/N'])
[9]:
[('boy/N', 25),('girl/N', 21),('vector/N', 0)]
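
The behaviour with missing items can be sketched with a plain dictionary. This is only an illustration of the idea, not nephosem's implementation; toy_freqs, wanted and present_only are hypothetical names.

```python
# Plain-dict stand-in for a Vocab (toy values).
toy_freqs = {"the/D": 53, "boy/N": 25, "girl/N": 21}
wanted = ["boy/N", "girl/N", "vector/N"]

# Items missing from the vocabulary are kept, with frequency 0.
subset = {item: toy_freqs.get(item, 0) for item in wanted}
print(subset)  # {'boy/N': 25, 'girl/N': 21, 'vector/N': 0}

# If you only want attested items, drop the zero-frequency entries afterwards.
present_only = {item: freq for item, freq in subset.items() if freq > 0}
print(present_only)  # {'boy/N': 25, 'girl/N': 21}
```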

If you feel comfortable manipulating Pandas dataframes, you can transform the vocabulary into a dataframe with myvocab.dataframe and then feed the result back to Vocab() to turn it into a Vocab object again.

[10]:
vocab_df = myvocab.dataframe
vocab_df.head()
[10]:
item freq
0 the/D 53
1 boy/N 25
2 eat/V 22
3 apple/N 21
4 girl/N 21
[11]:
Vocab(vocab_df)
[11]:
[('the/D', 53),('boy/N', 25),('eat/V', 22) ... ('very/R', 1),('which/W', 1),('without/I', 1)]

Dictionaries and lists

Even though the Vocab class is based on a dictionary, it can also be sliced with indices that follow descending frequency order: myvocab['word'] will return the ‘word’ item in the vocabulary, and myvocab[3:5] will return the fourth and fifth most frequent items (indices start at 0).

However, when you obtain the list of items with myvocab.get_item_list(), by default they are sorted in ascending alphabetical order:

[12]:
myvocab[3:5].get_item_list()
[12]:
['apple/N', 'girl/N']
[13]:
myvocab.get_item_list()[3:5]
[13]:
['about/I', 'about/R']

If you want to obtain the item list in descending frequency order, you must set the sorting and descending arguments of the get_item_list() method:

[14]:
myvocab.get_item_list(sorting = 'freq', descending=True)[3:5]
[14]:
['apple/N', 'girl/N']
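
The difference between the two orders can be reproduced with a plain dictionary and Python's built-in sorted(). This is a sketch of the two orderings only, with made-up values in toy_freqs; it does not show how get_item_list() is implemented.

```python
# Plain-dict stand-in for a Vocab (toy values without frequency ties).
toy_freqs = {"the/D": 53, "boy/N": 25, "eat/V": 22, "apple/N": 21, "about/I": 1}

# Ascending alphabetical order of the item names.
alphabetical = sorted(toy_freqs)

# Descending frequency order.
by_freq = sorted(toy_freqs, key=toy_freqs.get, reverse=True)

print(alphabetical[:2])  # ['about/I', 'apple/N']
print(by_freq[:2])       # ['the/D', 'boy/N']
```

The same indices therefore pick different items depending on the order in which the list was built.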

Finally, remember that membership tests on a dictionary are much more efficient than on a list. For example, if you want to identify the items in a list A that are present in myvocab, the list comprehension [x for x in A if x in myvocab] is faster than [x for x in A if x in myvocab.get_item_list()], especially if the vocabulary is very large.
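
You can measure the difference yourself with timeit. The sketch below uses a plain dictionary and a plain list as stand-ins for myvocab and its item list; the sizes and the names vocab_dict, item_list, via_dict and via_list are arbitrary choices for illustration, and the timings will vary from machine to machine.

```python
import timeit

# Stand-ins: a dict of 5000 items and the corresponding list of item names.
vocab_dict = {f"item{i}/N": 1 for i in range(5000)}
item_list = list(vocab_dict)

# Items to look up: 100 that are in the vocabulary and one that is not.
A = [f"item{i}/N" for i in range(4900, 5000)] + ["missing/N"]

def via_dict():
    # Hash lookup: roughly constant time per item.
    return [x for x in A if x in vocab_dict]

def via_list():
    # Linear scan: each test may walk through the whole list.
    return [x for x in A if x in item_list]

# Both approaches select the same items; only the cost differs.
same_result = via_dict() == via_list()

t_dict = timeit.timeit(via_dict, number=3)
t_list = timeit.timeit(via_list, number=3)
print(f"dict: {t_dict:.5f}s  list: {t_list:.5f}s")
```

Keep in mind that in [x for x in A if x in myvocab.get_item_list()] the method call sits inside the condition, so Python evaluates it again for every x. If you do need the list, build it once before the comprehension.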