Editing your vocabulary object
The Nephosem module includes a Vocab class based on a dictionary, where the keys are the lemmas (in whatever unit you have decided on) and the values are their frequencies in a corpus. It is one of the first things to be computed. In these examples, myvocab stands for an instance of the Vocab class.
[1]:
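import sys

# Make the local copy of Nephosem importable (adjust the path to your setup).
nephosemdir = "../../nephosem/"
sys.path.append(nephosemdir)
mydir = "./"

from nephosem import ConfigLoader, Vocab

# Read the settings from the local configuration file.
conf = ConfigLoader()
settings = conf.update_config('config.ini')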
Get a Vocab
You can load an existing Vocab with the .load() method, which is paired with a .save() method.
[2]:
myvocab = Vocab.load('output/Toy.nfreq')
myvocab
# myvocab.save('output/Toy.nfreq')
[2]:
[('the/D', 53),('boy/N', 25),('eat/V', 22) ... ('ten/C', 1),('ask/V', 1),('about/I', 1)]
You can inspect your vocabulary with myvocab.describe(). In addition, len(myvocab) returns the number of types, while myvocab.sum() returns the number of tokens.
[3]:
print(myvocab.describe())
Total items: 55
Total freqs: 264
count 55.000000
mean 4.800000
std 8.722895
min 1.000000
25% 1.000000
50% 1.000000
75% 4.000000
max 53.000000
[4]:
len(myvocab)
[4]:
55
[5]:
myvocab.sum()
[5]:
264
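To make the distinction between types and tokens concrete, here is a minimal sketch that uses a plain dictionary as a stand-in for a Vocab. It does not call Nephosem; toy_freqs is a hypothetical name that holds only five of the items shown above, so the totals differ from those of the full toy vocabulary.

```python
# Plain-dict stand-in for a Vocab (an assumption for illustration, not the Nephosem API).
toy_freqs = {'the/D': 53, 'boy/N': 25, 'eat/V': 22, 'apple/N': 21, 'girl/N': 21}

n_types = len(toy_freqs)             # number of distinct items (types)
n_tokens = sum(toy_freqs.values())   # total number of occurrences (tokens)

print(n_types, n_tokens)
```

The number of types counts each key once, whereas the number of tokens adds up the frequencies stored as values.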
Filtering
You can filter a vocabulary with a frequency threshold (e.g. 10) like so:
[6]:
over10 = myvocab[myvocab.freq > 10]
over10
[6]:
[('the/D', 53),('boy/N', 25),('girl/N', 21),('eat/V', 22),('apple/N', 21),('be/V', 11)]
Slicing by position selects items in descending-frequency order, so the first three entries are the three most frequent items:
[7]:
top3 = myvocab[:3]
top3
[7]:
[('the/D', 53),('boy/N', 25),('eat/V', 22)]
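The threshold filter and the positional slice above can be approximated with a plain dictionary, which may help clarify what each selection does. This is only a sketch under the assumption that a Vocab behaves like a frequency dictionary; toy_freqs is a hypothetical stand-in and no Nephosem calls are involved.

```python
# Plain-dict stand-in for a Vocab (hypothetical data taken from the outputs above).
toy_freqs = {'the/D': 53, 'boy/N': 25, 'eat/V': 22, 'apple/N': 21,
             'girl/N': 21, 'be/V': 11, 'food/N': 4}

# Keep only the items whose frequency is strictly above the threshold.
over10 = {item: freq for item, freq in toy_freqs.items() if freq > 10}

# Rank the (item, frequency) pairs by descending frequency and keep the first three.
top3 = sorted(toy_freqs.items(), key=lambda pair: pair[1], reverse=True)[:3]

print(over10)
print(top3)
```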
You can also filter your vocabulary by applying a regular expression to the names of the items:
[8]:
nouns = myvocab[myvocab.match('item', '^..+/N.*')]
nouns
[8]:
[('boy/N', 25),('girl/N', 21),('apple/N', 21),('baby/N', 2),('food/N', 4),('house/N', 1),('year/N', 3)]
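To see what the pattern '^..+/N.*' selects, the following sketch applies it to a list of item names with Python's re module. It assumes the pattern is interpreted as a standard Python regular expression anchored at the start of the name; how myvocab.match() applies it internally is not shown here. The item 'x/N' is hypothetical and is included only to illustrate that the pattern requires at least two characters before the tag.

```python
import re

# Item names from the toy vocabulary, plus the hypothetical 'x/N'.
items = ['the/D', 'boy/N', 'girl/N', 'eat/V', 'apple/N', 'be/V', 'year/N', 'x/N']

# '^..+' requires two or more characters before '/N'; '.*' allows anything after it.
pattern = re.compile(r'^..+/N.*')

nouns = [item for item in items if pattern.match(item)]
print(nouns)
```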
If you have a list of items, such as the one returned by myvocab.get_item_list(), you can use it to get a subset with myvocab.subvocab(list_of_items). If one of the items is not present in the vocabulary, it is still included in the subset, with a frequency of 0.
[9]:
myvocab.subvocab(['boy/N', 'girl/N', 'vector/N'])
[9]:
[('boy/N', 25),('girl/N', 21),('vector/N', 0)]
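The zero-frequency behaviour can be mimicked with a dictionary lookup that falls back to a default value. The sketch below uses a plain dictionary (toy_freqs, a hypothetical stand-in for a Vocab) and also shows one way to drop the unattested items afterwards, should you not want them.

```python
# Plain-dict stand-in for a Vocab (not the Nephosem API).
toy_freqs = {'the/D': 53, 'boy/N': 25, 'girl/N': 21}
wanted = ['boy/N', 'girl/N', 'vector/N']

# Items missing from the vocabulary are kept, with a frequency of 0.
subset = {item: toy_freqs.get(item, 0) for item in wanted}

# Optional follow-up: keep only the items that actually occur.
attested = {item: freq for item, freq in subset.items() if freq > 0}

print(subset)
print(attested)
```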
If you feel comfortable manipulating pandas dataframes, you can transform the vocabulary into a dataframe with myvocab.dataframe and then feed the output back to Vocab() to turn it into a Vocab object again.
[10]:
vocab_df = myvocab.dataframe
vocab_df.head()
[10]:
| | item | freq |
|---|---|---|
| 0 | the/D | 53 |
| 1 | boy/N | 25 |
| 2 | eat/V | 22 |
| 3 | apple/N | 21 |
| 4 | girl/N | 21 |
[11]:
Vocab(vocab_df)
[11]:
[('the/D', 53),('boy/N', 25),('eat/V', 22) ... ('very/R', 1),('which/W', 1),('without/I', 1)]
Dictionaries and lists
Even though the Vocab class is based on a dictionary, it is possible to slice it with indices based on descending-frequency order: myvocab['word'] will return the 'word' item in the vocabulary, and myvocab[3:5] will return the fourth and fifth most frequent items (indices are zero-based, so the slice covers positions 3 and 4).
However, when you obtain the list of items with myvocab.get_item_list(), by default they are sorted in ascending alphabetical order:
[12]:
myvocab[3:5].get_item_list()
[12]:
['apple/N', 'girl/N']
[13]:
myvocab.get_item_list()[3:5]
[13]:
['about/I', 'about/R']
If you want to obtain the item list in descending frequency order, you must pass non-default values for the sorting and descending arguments of the get_item_list() method:
[14]:
myvocab.get_item_list(sorting='freq', descending=True)[3:5]
[14]:
['apple/N', 'girl/N']
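The difference between the two orderings can be reproduced with the built-in sorted() function on a plain dictionary. This is a sketch, not the Nephosem implementation: toy_freqs is a hypothetical stand-in, and breaking frequency ties alphabetically is an assumption made here so that the result is deterministic.

```python
# Plain-dict stand-in for a Vocab (not the Nephosem API).
toy_freqs = {'the/D': 53, 'boy/N': 25, 'eat/V': 22,
             'apple/N': 21, 'girl/N': 21, 'about/I': 1}

# Ascending alphabetical order of the item names.
alphabetical = sorted(toy_freqs)

# Descending frequency; ties are broken alphabetically (an assumption of this sketch).
by_freq = sorted(toy_freqs, key=lambda item: (-toy_freqs[item], item))

print(alphabetical[3:5])
print(by_freq[3:5])
```

The same slice [3:5] picks different items depending on the order of the list it is applied to, which is why the sorting arguments matter.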
Finally, remember that testing membership in a dictionary is much more efficient than testing membership in a list, because a dictionary lookup does not need to scan every element. For example, if you want to identify the items in list A that are present in myvocab, the list comprehension [x for x in A if x in myvocab] is faster than [x for x in A if x in myvocab.get_item_list()] (especially if the vocabulary is very large).
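The following sketch compares the two membership tests on a synthetic vocabulary, using a plain dictionary and a list of its keys in place of myvocab and myvocab.get_item_list(). The data, the names vocab and item_list, and the sizes are all hypothetical; actual timings depend on your machine and on the size of the vocabulary.

```python
import timeit

# Synthetic stand-ins: a dictionary of 10,000 items and the list of its keys.
vocab = {f'word{i}/N': i + 1 for i in range(10_000)}
item_list = list(vocab)

# 200 candidate items, half of which are present in the vocabulary.
A = [f'word{i}/N' for i in range(0, 20_000, 100)]

# Both comprehensions select the same items.
in_dict = [x for x in A if x in vocab]
in_list = [x for x in A if x in item_list]
assert in_dict == in_list

# Time each comprehension; the dictionary version is expected to be much faster.
t_dict = timeit.timeit(lambda: [x for x in A if x in vocab], number=5)
t_list = timeit.timeit(lambda: [x for x in A if x in item_list], number=5)
print(f'dict: {t_dict:.4f} s, list: {t_list:.4f} s')
```

Note that the second version in the text also rebuilds the item list on every call to get_item_list() if the call sits inside the comprehension, so storing the list in a variable first avoids at least that repeated cost.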