{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Editing your vocabulary object\n", "\n", "The Nephosem module includes a `Vocab` class based on a dictionary, where the keys are the items (lemmas, or whatever unit you have chosen) and the values are their frequencies in a corpus. The vocabulary is one of the first things [to be computed](all-in-one.ipynb#2.-Frequency-lists).\n", "In these examples, `myvocab` stands for an instance of the `Vocab` class." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import sys\n", "# Add the directory holding the Nephosem code to the import path\n", "nephosemdir = \"../../nephosem/\"\n", "sys.path.append(nephosemdir)\n", "mydir = \"./\"\n", "from nephosem import ConfigLoader, Vocab\n", "conf = ConfigLoader()\n", "settings = conf.update_config('config.ini')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get a Vocab\n", "\n", "You can load an existing Vocab with the `.load()` method, which is paired with a `.save()` method." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('the/D', 53),('boy/N', 25),('eat/V', 22) ... ('ten/C', 1),('ask/V', 1),('about/I', 1)]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myvocab = Vocab.load('output/Toy.nfreq')\n", "myvocab\n", "# myvocab.save('output/Toy.nfreq')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can inspect your vocab with `myvocab.describe()`. In addition, `len(myvocab)` returns the number of types, while `myvocab.sum()` returns the number of tokens." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total items: 55\n", "Total freqs: 264\n", "count 55.000000\n", "mean 4.800000\n", "std 8.722895\n", "min 1.000000\n", "25% 1.000000\n", "50% 1.000000\n", "75% 4.000000\n", "max 53.000000\n" ] } ], "source": [ "print(myvocab.describe())" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "55" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(myvocab)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "264" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myvocab.sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filtering\n", "\n", "You can filter a vocabulary with a frequency threshold, e.g. keeping only the items with a frequency above 10. Slicing by position also works: `myvocab[:3]` returns the first three items, which here are the three most frequent ones." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('the/D', 53),('boy/N', 25),('girl/N', 21),('eat/V', 22),('apple/N', 21),('be/V', 11)]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "over10 = myvocab[myvocab.freq > 10]\n", "over10" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('the/D', 53),('boy/N', 25),('eat/V', 22)]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "top3 = myvocab[:3]\n", "top3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also use [regular expressions](https://regexr.com/) on the names of the items to filter your vocabulary:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('boy/N', 25),('girl/N', 21),('apple/N', 21),('baby/N', 2),('food/N', 4),('house/N', 1),('year/N', 3)]" ] }, "execution_count": 8, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "nouns = myvocab[myvocab.match('item', '^..+/N.*')]\n", "nouns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have a list of items, as you could receive from `myvocab.get_item_list()`, you can use it to get a subset with `myvocab.subvocab(list_of_items)`. If one of the items is not present in the vocabulary, it *will* be included in the subset, with frequency 0." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('boy/N', 25),('girl/N', 21),('vector/N', 0)]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myvocab.subvocab(['boy/N', 'girl/N', 'vector/N'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you feel comfortable manipulating Pandas dataframe, you can transform the vocabulary list into a dataframe with `myvocab.dataframe` and then feed the output back to `Vocab()` to turn it into a `Vocab` object again." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | item | \n", "freq | \n", "
---|---|---|
0 | \n", "the/D | \n", "53 | \n", "
1 | \n", "boy/N | \n", "25 | \n", "
2 | \n", "eat/V | \n", "22 | \n", "
3 | \n", "apple/N | \n", "21 | \n", "
4 | \n", "girl/N | \n", "21 | \n", "