{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Editing your vocabulary object\n", "\n", "The Nephosem module includes a `Vocab` class based on a dictionary, where the keys are the lemmas (in whatever unit you have decided) and the values, their frequencies in a corpus. It is one of the first things [to be computed](all-in-one.ipynb#2.-Frequency-lists).\n", "In these examples, `myvocab` will stand for an instance of the `Vocab` class." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import sys\n", "nephosemdir = \"../../nephosem/\"\n", "sys.path.append(nephosemdir)\n", "mydir = f\"./\"\n", "from nephosem import ConfigLoader, Vocab\n", "conf = ConfigLoader()\n", "settings = conf.update_config('config.ini')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get a Vocab\n", "\n", "You can load an existing Vocab with the `.load()` method, which is paired with a `.save()` method." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('the/D', 53),('boy/N', 25),('eat/V', 22) ... ('ten/C', 1),('ask/V', 1),('about/I', 1)]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myvocab = Vocab.load('output/Toy.nfreq')\n", "myvocab\n", "# myvocab.save('output/Toy.nfreq')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can inspect your vocab with `myvocab.describe()`, but also `len(myvocab)` will return the number of types, while `myvocab.sum()` will show the number of tokens." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total items: 55\n", "Total freqs: 264\n", "count 55.000000\n", "mean 4.800000\n", "std 8.722895\n", "min 1.000000\n", "25% 1.000000\n", "50% 1.000000\n", "75% 4.000000\n", "max 53.000000\n" ] } ], "source": [ "print(myvocab.describe())" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "55" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(myvocab)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "264" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myvocab.sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filtering\n", "\n", "You can filter a vocabulary with a frequency threshold (e.g. 10) like so:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('the/D', 53),('boy/N', 25),('girl/N', 21),('eat/V', 22),('apple/N', 21),('be/V', 11)]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "over10 = myvocab[myvocab.freq > 10]\n", "over10" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('the/D', 53),('boy/N', 25),('eat/V', 22)]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "top3 = myvocab[:3]\n", "top3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also use [regular expressions](https://regexr.com/) on names of the items to filter your vocabulary:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('boy/N', 25),('girl/N', 21),('apple/N', 21),('baby/N', 2),('food/N', 4),('house/N', 1),('year/N', 3)]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nouns = myvocab[myvocab.match('item', '^..+/N.*')]\n", "nouns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have a list of items, as you could receive from `myvocab.get_item_list()`, you can use it to get a subset with `myvocab.subvocab(list_of_items)`. If one of the items is not present in the vocabulary, it *will* be included in the subset, with frequency 0." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('boy/N', 25),('girl/N', 21),('vector/N', 0)]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myvocab.subvocab(['boy/N', 'girl/N', 'vector/N'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you feel comfortable manipulating Pandas dataframe, you can transform the vocabulary list into a dataframe with `myvocab.dataframe` and then feed the output back to `Vocab()` to turn it into a `Vocab` object again." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
itemfreq
0the/D53
1boy/N25
2eat/V22
3apple/N21
4girl/N21
\n", "
" ], "text/plain": [ " item freq\n", "0 the/D 53\n", "1 boy/N 25\n", "2 eat/V 22\n", "3 apple/N 21\n", "4 girl/N 21" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vocab_df = myvocab.dataframe\n", "vocab_df.head()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('the/D', 53),('boy/N', 25),('eat/V', 22) ... ('very/R', 1),('which/W', 1),('without/I', 1)]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Vocab(vocab_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dictionaries and lists\n", "\n", "Even though the `Vocab` class is based on a dictionary, it is possible to slice it with indices based on a descending-frequency order: `myvocab['word']` will return the 'word' item in the vocabulary, and `myvocab[3:5]` will return the third and fourth most frequent items.\n", "\n", "However, when you obtain the list of items with `myvocab.get_item_list()`, by default they are sorted in ascending *alphabetical* order:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['apple/N', 'girl/N']" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myvocab[3:5].get_item_list()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['about/I', 'about/R']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myvocab.get_item_list()[3:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to obtain the item list in descending frequency order, you must specify different values for the arguments of the `get_item_list()` method:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['apple/N', 'girl/N']" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myvocab.get_item_list(sorting = 'freq', descending=True)[3:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, remember that selecting from a dictionary is much more efficient than selecting from a list. For example, if you want to identify the items in list `A` that are present in `myvocab`, the comprehension list `[x for x in A if x in myvocab]` is faster than `[x for x in A if x in myvocab.get_item_list()]` (especially if the vocabulary is very large)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }