Editing your vocabulary object

The Nephosem module includes a Vocab class based on a dictionary: the keys are the items (lemmas, or whatever unit you have chosen) and the values are their frequencies in a corpus. It is one of the first objects you compute. In these examples, myvocab stands for an instance of the Vocab class.
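
To make the idea of a frequency dictionary concrete, here is a minimal sketch using only the standard library. The token list is a made-up toy corpus and collections.Counter is a stand-in chosen for illustration; this is not how nephosem builds a Vocab.

```python
from collections import Counter

# Hypothetical toy corpus, already lemmatized and tagged as "lemma/POS" strings.
tokens = ["the/D", "boy/N", "eat/V", "the/D", "apple/N", "the/D", "girl/N", "eat/V"]

# Map each type (distinct item) to the number of tokens it accounts for.
freqs = Counter(tokens)

print(freqs["the/D"])        # 3
print(freqs.most_common(2))  # [('the/D', 3), ('eat/V', 2)]
```

The keys play the role of the vocabulary items and the values are their corpus frequencies, which is the shape of data that a Vocab wraps.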

[1]:

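import sys

# Make the local copy of nephosem importable (adjust the path to your setup)
nephosemdir = "../../nephosem/"
sys.path.append(nephosemdir)
mydir = "./"

from nephosem import ConfigLoader, Vocab

# Load the settings, updated with the values in config.ini
conf = ConfigLoader()
settings = conf.update_config('config.ini')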

Get a Vocab

You can load an existing Vocab with the .load() method, which is paired with a .save() method.

[2]:

myvocab = Vocab.load('output/Toy.nfreq')
myvocab
# myvocab.save('output/Toy.nfreq')

[2]:

[('the/D', 53),('boy/N', 25),('eat/V', 22) ... ('ten/C', 1),('ask/V', 1),('about/I', 1)]

You can inspect your vocabulary with myvocab.describe(). In addition, len(myvocab) returns the number of types, and myvocab.sum() returns the number of tokens.

[3]:

print(myvocab.describe())

Total items: 55
Total freqs: 264
count  55.000000
mean    4.800000
std     8.722895
min     1.000000
25%     1.000000
50%     1.000000
75%     4.000000
max    53.000000

[4]:

len(myvocab)

[4]:

55

[5]:

myvocab.sum()

[5]:

264
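
As a rough illustration of what these numbers mean, the same quantities can be computed on a plain dictionary. The sketch below is a stand-in rather than the Vocab API; its entries are a handful of items copied from the outputs on this page, not the full 55-item Toy vocabulary.

```python
# Plain dict standing in for a vocabulary: item -> frequency.
# Only six of the Toy items are included here.
freqs = {"the/D": 53, "boy/N": 25, "eat/V": 22, "apple/N": 21, "girl/N": 21, "be/V": 11}

n_types = len(freqs)             # number of distinct items (types)
n_tokens = sum(freqs.values())   # total number of occurrences (tokens)
mean_freq = n_tokens / n_types   # average number of tokens per type

print(n_types, n_tokens, mean_freq)  # 6 153 25.5
```

In the describe() output above, the mean is this same ratio computed over the whole vocabulary: 264 tokens divided by 55 types gives 4.8.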

Filtering

You can filter a vocabulary with a frequency threshold (e.g. 10) like so:

[6]:

over10 = myvocab[myvocab.freq > 10]
over10

[6]:

[('the/D', 53),('boy/N', 25),('girl/N', 21),('eat/V', 22),('apple/N', 21),('be/V', 11)]

An integer slice selects items by frequency rank, so you can also keep only the most frequent items:

[7]:

top3 = myvocab[:3]
top3

[7]:

[('the/D', 53),('boy/N', 25),('eat/V', 22)]
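
For readers who want to see the logic behind the two filters above, here is a sketch of the same operations on a plain dictionary. It is an illustration under the assumption that a vocabulary behaves like an item-to-frequency mapping; it does not use or reproduce the Vocab implementation.

```python
# Plain dict standing in for a vocabulary (entries copied from this page).
freqs = {"the/D": 53, "boy/N": 25, "eat/V": 22, "apple/N": 21,
         "girl/N": 21, "be/V": 11, "food/N": 4, "year/N": 3}

# Frequency threshold: keep the items that occur more than 10 times.
over10 = {item: f for item, f in freqs.items() if f > 10}

# Top n: rank the items by descending frequency, then keep the first three.
ranked = sorted(freqs.items(), key=lambda pair: pair[1], reverse=True)
top3 = dict(ranked[:3])

print(sorted(over10))  # ['apple/N', 'be/V', 'boy/N', 'eat/V', 'girl/N', 'the/D']
print(list(top3))      # ['the/D', 'boy/N', 'eat/V']
```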

You can also use regular expressions on names of the items to filter your vocabulary:

[8]:

nouns = myvocab[myvocab.match('item', '^..+/N.*')]
nouns

[8]:

[('boy/N', 25),('girl/N', 21),('apple/N', 21),('baby/N', 2),('food/N', 4),('house/N', 1),('year/N', 3)]
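
The pattern '^..+/N.*' reads as: at least two characters, then "/N", then anything. The sketch below applies it with Python's re module to a plain dictionary, as a way of checking what a pattern selects before using it on a real vocabulary. The entry 'x/N' is a hypothetical one-character lemma added to show that the pattern excludes it.

```python
import re

# Plain dict standing in for a vocabulary; "x/N" is a made-up item.
freqs = {"the/D": 53, "boy/N": 25, "eat/V": 22, "apple/N": 21,
         "girl/N": 21, "be/V": 11, "baby/N": 2, "x/N": 1}

# Same pattern as in the cell above.
noun_pattern = re.compile(r"^..+/N.*")

# Keep the items whose name matches the pattern from the start of the string.
nouns = {item: f for item, f in freqs.items() if noun_pattern.match(item)}

print(list(nouns))  # ['boy/N', 'apple/N', 'girl/N', 'baby/N']
```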

If you have a list of items, such as the one returned by myvocab.get_item_list(), you can use it to get a subset with myvocab.subvocab(list_of_items). If one of the items is not present in the vocabulary, it is still included in the subset, with frequency 0.

[9]:

myvocab.subvocab(['boy/N', 'girl/N', 'vector/N'])

[9]:

[('boy/N', 25),('girl/N', 21),('vector/N', 0)]
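
The zero-filling behaviour shown above can be pictured with dict.get() on a plain dictionary. This is only a sketch of the idea, not the subvocab() implementation; the second step shows one way to drop the unattested items afterwards if you do not want them.

```python
# Plain dict standing in for a vocabulary (entries copied from this page).
freqs = {"the/D": 53, "boy/N": 25, "eat/V": 22, "apple/N": 21, "girl/N": 21}

wanted = ["boy/N", "girl/N", "vector/N"]

# Items missing from the vocabulary are kept, with frequency 0.
subset = {item: freqs.get(item, 0) for item in wanted}
print(subset)    # {'boy/N': 25, 'girl/N': 21, 'vector/N': 0}

# Optional follow-up: keep only the items that are actually attested.
attested = {item: f for item, f in subset.items() if f > 0}
print(attested)  # {'boy/N': 25, 'girl/N': 21}
```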

If you feel comfortable manipulating pandas dataframes, you can transform the vocabulary into a dataframe with myvocab.dataframe, edit it, and then feed the result back to Vocab() to turn it into a Vocab object again.

[10]:

vocab_df = myvocab.dataframe
vocab_df.head()

[10]:

	item	freq
0	the/D	53
1	boy/N	25
2	eat/V	22
3	apple/N	21
4	girl/N	21

[11]:

Vocab(vocab_df)

[11]:

[('the/D', 53),('boy/N', 25),('eat/V', 22) ... ('very/R', 1),('which/W', 1),('without/I', 1)]

Dictionaries and lists

Even though the Vocab class is based on a dictionary, it can also be sliced by position, with the items ranked in descending order of frequency: myvocab['word'] returns the 'word' item in the vocabulary, and myvocab[3:5] returns the fourth and fifth most frequent items (indices are zero-based).

However, when you obtain the list of items with myvocab.get_item_list(), by default they are sorted in ascending alphabetical order:

[12]:

myvocab[3:5].get_item_list()

[12]:

['apple/N', 'girl/N']

[13]:

myvocab.get_item_list()[3:5]

[13]:

['about/I', 'about/R']
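
The two ways of indexing described at the start of this section, by key and by rank, can be pictured with a small dict subclass. RankedCounts is a hypothetical class written for this illustration; it is not part of nephosem and makes no claim about how Vocab is implemented. Breaking frequency ties alphabetically is also an assumption of the sketch.

```python
class RankedCounts(dict):
    """Hypothetical dict subclass: string keys return a frequency,
    slices select items by descending-frequency rank."""

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Rank by descending frequency; ties are broken alphabetically (assumption).
            ranked = sorted(self.items(), key=lambda pair: (-pair[1], pair[0]))
            return RankedCounts(ranked[key])
        return super().__getitem__(key)


counts = RankedCounts({"girl/N": 21, "the/D": 53, "apple/N": 21, "eat/V": 22, "boy/N": 25})

print(counts["boy/N"])    # 25
print(list(counts[3:5]))  # ['apple/N', 'girl/N']
```

Because indices are zero-based, the slice 3:5 covers ranks four and five, which matches the output of myvocab[3:5] above.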

If you want to obtain the item list in descending frequency order, you must set the sorting and descending arguments of the get_item_list() method explicitly:

[14]:

myvocab.get_item_list(sorting='freq', descending=True)[3:5]

[14]:

['apple/N', 'girl/N']
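
The difference between the two orderings can be reproduced on a plain dictionary with sorted(). This sketch uses a few entries copied from this page and assumes that frequency ties are broken alphabetically, which is consistent with the outputs above but is not confirmed as the library's rule.

```python
# Plain dict standing in for a vocabulary (entries copied from this page).
freqs = {"the/D": 53, "boy/N": 25, "eat/V": 22, "apple/N": 21, "girl/N": 21,
         "be/V": 11, "ask/V": 1, "ten/C": 1, "about/I": 1}

# Ascending alphabetical order of the item names.
alphabetical = sorted(freqs)

# Descending frequency; ties broken alphabetically (assumption).
by_freq = sorted(freqs, key=lambda item: (-freqs[item], item))

print(alphabetical[3:5])  # ['be/V', 'boy/N']
print(by_freq[3:5])       # ['apple/N', 'girl/N']
```

The same slice picks different items depending on the ordering, so it is worth being explicit about the order whenever you index into an item list.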

Finally, remember that membership tests on a dictionary are much more efficient than on a list. For example, if you want to identify the items in list A that are present in myvocab, the list comprehension [x for x in A if x in myvocab] is faster than [x for x in A if x in myvocab.get_item_list()], especially if the vocabulary is very large.
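
The gap can be measured with a small timing sketch. It uses a plain dictionary and a list of 20,000 made-up items as stand-ins for a vocabulary and its item list, so the absolute numbers say nothing about nephosem itself; only the relative difference is of interest.

```python
import timeit

# Hypothetical large vocabulary: 20,000 made-up items, all with frequency 1.
freqs = {f"item{i}/N": 1 for i in range(20_000)}
item_list = list(freqs)

# 200 candidate items, half of which are in the vocabulary.
candidates = [f"item{i}/N" for i in range(0, 40_000, 200)]

# Both approaches select the same items.
in_dict = [x for x in candidates if x in freqs]
in_list = [x for x in candidates if x in item_list]
assert in_dict == in_list

# Hash lookup in the dict versus a linear scan of the list.
dict_time = timeit.timeit(lambda: [x for x in candidates if x in freqs], number=3)
list_time = timeit.timeit(lambda: [x for x in candidates if x in item_list], number=3)

print(f"dict: {dict_time:.4f}s  list: {list_time:.4f}s")
```

Note that in a comprehension the condition is evaluated once per element, so writing myvocab.get_item_list() inside it also rebuilds the list on every iteration. If you do need the list, store it in a variable first.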