{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# All-in-one nephosem" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial shows the main tasks you can perform with the `nephosem` library, with the following steps:\n", "\n", "0. [**Inital setup**](#0.-Initial-setup): load required libraries\n", "1. [**Configuration**](#1.-Configuration): define settings specific for your case study, related mostly to how to read your corpus.\n", "2. [**Frequency lists**](#2.-Frequency-lists)\n", "3. [**Collocation matrices**](#3.-Co-occurrence-matrix): create co-occurrence matrices with and without dependency information\n", "4. [**Association measures**](#4.-Association-measures)\n", "5. [**Basic token level**](#5.-Basic-token-level): create token-level vectors, with and without dependency information\n", "6. [**Full token level**](#6.-Full-token-level): weights and replaces first-order context words with their type-level vectors\n", "7. [**Cosine distances**](#7.-Cosine-distances)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "NOTE: \n", "Tips on manipulation of the different objects will be given in their respective tutorials.\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 0. Initial setup " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np # for \"booleanize()\"\n", "from scipy import sparse # for \"booleanize()\"\n", "import logging # to keep debugging log\n", "import sys\n", "nephosemdir = \"../../nephosem/\"\n", "sys.path.append(nephosemdir)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once `nephosem` is in our path, we can import different classes and functions from the library, depending on the specific tasks you need to do." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from nephosem.conf import ConfigLoader # to setup the configuration\n", "from nephosem import Vocab, TypeTokenMatrix # to manage frequency lists and matrices\n", "from nephosem import ItemFreqHandler, ColFreqHandler, TokenHandler # to generate frequency lists and matrices\n", "from nephosem import compute_association, compute_distance # to compute PPMI and distances\n", "from nephosem.specutils.mxcalc import compute_token_weights, compute_token_vectors # for token level\n", "from nephosem.models.typetoken import build_tc_weight_matrix # for weighting at token level\n", "\n", "# For dependencies\n", "from nephosem.core.graph import SentenceGraph, MacroGraph, PatternGraph\n", "from nephosem.models.deprel import DepRelHandler, read_sentence" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Configuration " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Depending on what you need, you will have to set up some useful paths as variables for your future filenames." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "mydir = f\"./\"\n", "output_path = f\"{mydir}/output/\"\n", "corpus_name = 'Toy'\n", "logging.basicConfig(filename = f'{mydir}/log.log', level = logging.DEBUG)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most important concrete step is to adapt the configuration file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "WARNING: \n", "You need to run the appropriate settings at the beginning of every script/notebook you run.\n", "Every part of the code of one same project has to use the same settings.\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "conf = ConfigLoader()\n", "settings = conf.settings\n", "# If you already have your settings in a config file, you can load them:\n", "# settings = conf.update_config('config.ini')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this notebook, we will use a dataset of toy sentences in English annotated with Stanford dependencies, stored in 'data'. This is what part of one of the files looks like:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "The\tDT\tthe\t1\t2\tdet\n", "\n", "girl\tNNS\tgirl\t2\t3\tnsubj\n", "\n", "looks\tVBZ\tlook\t3\t0\tROOT\n", "\n", "healthy\tJJ\thealthy\t4\t3\tacomp\n", "\n", "\n", "\n" ] } ], "source": [ "with open('data/StanfDepSents.1.conll', 'r') as f:\n", " lines = f.readlines()\n", "for line in lines[:6]:\n", " print(line)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On the one hand, we have token lines: each token is in a line with tab-separated attributes. In this case, they are: word form, part of speech, lemma, index in sentence, index of dependency head, dependency relation.\n", "On the other hand, we have lines with other information, in this case sentence delimiters.\n", "\n", "Now we need to define the settings so that the code knows how to read the corpus: which lines count as tokens, where sentences end, and which are the different attributes of the corpus. In addition, we will specify what attributes we want for the definition of a type.\n", "\n", "The `line-machine` setting is a regular expression that shoudl *only* match the lines that count as tokens, and in which the different attributes are captured by groups. In this case, we indicate that we have six sequences of non-tab characters (`[^\\t]`), which are captured by parentheses and separated by tab characters.\n", "\n", "The `global-columns` settings labels the different groups that `line-machine` has captured." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "settings['line-machine'] = '([^\\t]+)\\t([^\\t])[^\\t]*\\t([^\\t]+)\\t([^\\t]+)\\t([^\\t]+)\\t([^\\t]+)' #Stanford corpus\n", "settings['global-columns'] = 'word,pos,lemma,id,head,deprel'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `type`, `colloc` and `token` settings indicate the format of the type, collocate and token ID's, reusing the labels set in `settings['global-columns']`.\n", "Here the target and collocate types are set to the default (combination of 'lemma' and 'pos'), and the token ID uses the values of the 'lemma' and 'pos' fields along with the file name/ID ('fid') and the line number starting from 1 ('lid'), all separated by slashes. The 'fid' is computed as the basename of the filename, without extension." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "settings['type'] = 'lemma/pos'\n", "settings['colloc'] = 'lemma/pos'\n", "settings['token'] = 'lemma/pos/fid/lid'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have dependency-models, you will need to define a few extra settings: the format of the node and edges in the dependency graph and the labels of the index and head information. In other words, you map the labels given in `settings['global-columns']` to specific roles in the dependencies." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "settings['node-attr'] = 'lemma,pos'\n", "settings['edge-attr'] = 'deprel'\n", "settings['currID'] = 'id'\n", "settings['headID'] = 'head'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, you can set up the file encoding and the paths for corpus and output. The code will run on all the files found in `settings['corpus-path']`, so if you only want to work on a subset, you can **create a list of filenames** (with full paths), store in a file, and input either the list or the filenames-path as `fnames` argument of any function that scans the corpus." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "settings['file-encoding'] = 'utf-8'\n", "settings['outfile-encoding'] = 'utf-8'\n", "\n", "settings['output-path'] = output_path\n", "settings['corpus-path'] = f\"{mydir}/data/\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 Maleable settings\n", "\n", "The previous settings must be defined **only once** at the beginning of the project and not be changed, since they indicate how to read the corpus.\n", "\n", "The next two settings may be changed at different stages of the workflow as hyper-parameters:\n", "\n", "- The `separator-line-machine` setting is optional for bag-of-words models (it will exclude context words in a different sentence) but necessary for dependency-models: it tells the code where sentences end.\n", "\n", "- The `left-span` and `right-span` values specify the size of the bag-of-words window spans." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "settings['separator-line-machine'] = ''\n", "settings['left-span'] = 4\n", "settings['right-span'] = 4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Frequency lists" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Vocab class is based on dictionaries. The steps to create one are twofold:\n", "\n", "- Set up an ItemFreqHandler class with the settings\n", "- Build the frequency list with its `.build_item_freq()` method. The `fnames` argument can be a list of paths or a path to a file with a list of paths. If it is not provided, the full content of `corpus-path` will be used." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING: Not provide the temporary path!\n", "WARNING: Use the default tmp directory: '~/tmp'!\n", "Building item frequency list...\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "b68b1fa0fdb8413b95cd5beda4a19292", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(IntProgress(value=0, max=12), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "data": { "text/plain": [ "[('the/D', 53),('boy/N', 25),('eat/V', 22) ... ('ten/C', 1),('ask/V', 1),('about/I', 1)]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ifhan = ItemFreqHandler(settings = settings)\n", "vocab = ifhan.build_item_freq() # by default it uses multiprocessor, which is overkill with the toy corpus\n", "vocab" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Saving frequency list (vocabulary)... 
{ "cell_type": "markdown", "metadata": {}, "source": [ "This only needs to be done once per corpus. Once the main vocabulary list is compiled and stored, it can be further filtered (see [here](vocab.ipynb)). It can be used for the following purposes:\n", "\n", "- to simply extract the frequency of a lemma\n", "- to create co-occurrence matrices, as the compulsory `row_vocab` and optional `col_vocab` arguments of the `build_col_freq` method [below](#3.-Co-occurrence-matrix)\n", "- to select the target types at [token level](#5.-Basic-token-level)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Co-occurrence matrix\n", "\n", "The main class for matrices in `nephosem` is the `TypeTokenMatrix`. You will want to create a co-occurrence matrix between all the types in your corpus, using all the items in your vocabulary. Note that if your corpus is large this can take a long time." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1. Bag-of-words" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Like creating the vocabulary list, creating a co-occurrence frequency matrix has two steps: setting up the `ColFreqHandler` object and running its `.build_col_freq()` method." ] },
{ "cell_type": "code", "execution_count": 13, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING: Not provide the temporary path!\n", "WARNING: Use the default tmp directory: '~/tmp'!\n" ] } ], "source": [ "cfhan = ColFreqHandler(settings=settings, row_vocab = vocab, col_vocab = vocab)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The `.build_col_freq()` method also has an optional `fnames` argument taking a list of paths or a path to a file with a list of paths.\n", "In addition, `row_vocab` and `col_vocab` ask for `Vocab` objects like `vocab` above.\n", "The `row_vocab` argument, which indicates the node types to get co-occurrence information on, is compulsory." ] },
{ "cell_type": "code", "execution_count": 14, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Building collocate frequency matrix...\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "6f6b95ff39c74d5cb5588dc166957cbf", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(IntProgress(value=0, max=47), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "data": { "text/plain": [ "[55, 55] 's/P ,/, a/D about/I about/R all/P an/D ...\n", "'s/P NaN NaN NaN NaN NaN NaN NaN ...\n", ",/, NaN 2 2 NaN NaN NaN NaN ...\n", "a/D NaN 2 NaN NaN NaN NaN NaN ...\n", "about/I NaN NaN NaN NaN NaN NaN NaN ...\n", "about/R NaN NaN NaN NaN NaN NaN NaN ...\n", "all/P NaN NaN NaN NaN NaN NaN NaN ...\n", "an/D NaN NaN NaN NaN NaN NaN NaN ...\n", "... ... ... ... ... ... ... ... ..."
] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "freqMTX = cfhan.build_col_freq()\n", "freqMTX" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[This notebook](matrices.ipynb) shows you how you can play with TypeTokenMatrix objects.\n", "\n", "This is what it looks like if you only subset the row for 'girl/N' and remove the empty columns:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1, 39] 's/P ,/, a/D about/I about/R and/C apple/N ...\n", "girl/N 1 1 5 1 1 3 10 ..." ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "freqMTX.submatrix(row = ['girl/N']).drop(axis = 1)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Saving matrix...\n", "Stored in file:\n", " .//output//Toy.bow.wcmx.pac\n" ] } ], "source": [ "freqMTX.save(f\"{output_path}/{corpus_name}.bow.wcmx.pac\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Dependency-based\n", "\n", "Dependency-based models require yet another piece of information: templates. You can learn all about these templates in [this notebook](dependencies.ipynb).\n", "\n", "On the one hand, we have .graphml files that indicate relationships between elements. On the other, we have .xml files which specify the role of the node and features in the relationships. For example, in .graphml you would say that you want the relationship between a verb and its direct object; in the .xml, you would clarify that the verb is your feature (or context item) and the object is your target. You would also specify whether you want the lemma of the verb as the feature or, instead, the full path (just *eat/V* or `eat/V->dobj:#T`, with `#T` filling in the role of the target.\n", "\n", "For the type-level, we will exemplify with patterns that do not specify the *kind* of relationships but the number of steps, only selecting paths with one step between the target and the context word.\n", "\n", "First, we have to upload the files." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "path_graphml_fname = f\"{mydir}/templates/LEMMAPATH.template.graphml\"\n", "path_patterns = PatternGraph.read_graphml(path_graphml_fname)\n", "\n", "path_macro_fname = f\"{mydir}/templates/LEMMAPATH.target-feature-macro.xml\"\n", "path_macros = MacroGraph.read_xml(path_macro_fname, path_patterns)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING: Not provide the temporary path!\n", "WARNING: Use the default tmp directory: '~/tmp'!\n", "Building dependency features...\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "b2b9c6bbc19644e991e255965f9c73a3", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(IntProgress(value=0, max=12), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Building matrix...\n" ] }, { "data": { "text/plain": [ "[54, 54] 's/P a/D about/I about/R all/P an/D and/C ...\n", "'s/P NaN NaN NaN NaN NaN NaN NaN ...\n", "a/D NaN NaN NaN NaN NaN NaN NaN ...\n", "about/I NaN NaN NaN NaN NaN NaN NaN ...\n", "about/R NaN NaN NaN NaN NaN NaN NaN ...\n", "all/P NaN NaN NaN NaN NaN NaN NaN ...\n", "an/D NaN NaN NaN NaN NaN NaN NaN ...\n", "and/C NaN NaN NaN NaN NaN NaN NaN ...\n", "... ... ... ... ... ... ... ... ..." ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The only difference between type- and token-level here is the \"mode\" argument\n", "path_dephan_type = DepRelHandler(settings, workers=4, targets = vocab, mode='type')\n", "path_dephan_type.read_templates(macros=path_macros)\n", "pathMTX = path_dephan_type.build_dependency()\n", "pathMTX" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is what it looks like if you only subset the row for 'girl/N' and remove the empty columns:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[1, 12] 's/P apple/N ask/V at/I boy/N by/I eat/V ...\n", "girl/N 1 1 1 1 2 1 6 ..." ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pathMTX.submatrix(row = [\"girl/N\"]).drop(axis = 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following template asks the full dependency relation to be the feature, instead of just its lemma." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING: Not provide the temporary path!\n", "WARNING: Use the default tmp directory: '~/tmp'!\n", "Building dependency features...\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "f2d550d490a04c7a94b26ea535cea5c0", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(IntProgress(value=0, max=12), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Building matrix...\n" ] }, { "data": { "text/plain": [ "[1, 12] #T#->*:'s/P #T#->*:old/J #T#->*:the/D apple/N->*:#T# ask/V->*:#T# at/I->*:#T# boy/N->*:#T# ...\n", "girl/N 1 1 21 1 1 1 2 ..." 
] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pathfull_macro_fname = f\"{mydir}/templates/LEMMAPATHfull.target-feature-macro.xml\"\n", "pathfull_macros = MacroGraph.read_xml(pathfull_macro_fname, path_patterns)\n", "\n", "pathfull_dephan_type = DepRelHandler(settings, workers=4, targets = vocab, mode='type')\n", "pathfull_dephan_type.read_templates(macros=pathfull_macros)\n", "\n", "pathfullMTX = pathfull_dephan_type.build_dependency()\n", "pathfullMTX.submatrix(row = [\"girl/N\"]).drop(axis = 1)\n", "# Note that the dependency itself is replaced by \"*\" because the regex in the patterns file does not capture it\n", "# (it doesn't have parentheses)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In practice, we will be more interested in the transposed counterpart of this matrix: when we obtain token-level matrices of this kind, the patterns will be the columns and will need to be multiplied by a SOCC matrix where the patterns are the rows :)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Association measures" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the things you will want to do is **compute association measures**, which will be the actual values of your vectors, either for type or token level matrices. This is done with `compute_association()`, a function that takes a TypeTokenMatrix, row and column Vocab objects and the kind of measure (check the documentation to find the possibilities).\n", "\n", "First we obtain the marginal frequencies of your reference matrix and convert them to Vocab object." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "scrolled": true }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "f7fcd948df7748c9aca2bd9ef2581efe", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(IntProgress(value=0, max=606), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "************************************\n", "function = compute_association\n", " time = 0.01725 sec\n", "************************************\n", "\n" ] }, { "data": { "text/plain": [ "[55, 55] 's/P ,/, a/D about/I about/R all/P an/D ...\n", "'s/P NaN NaN NaN NaN NaN NaN NaN ...\n", ",/, NaN 2.0384464 1.2940059 NaN NaN NaN NaN ...\n", "a/D NaN 1.2940059 NaN NaN NaN NaN NaN ...\n", "about/I NaN NaN NaN NaN NaN NaN NaN ...\n", "about/R NaN NaN NaN NaN NaN NaN NaN ...\n", "all/P NaN NaN NaN NaN NaN NaN NaN ...\n", "an/D NaN NaN NaN NaN NaN NaN NaN ...\n", "... ... ... ... ... ... ... ... ..." ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nfreq = Vocab(freqMTX.sum(axis=1))\n", "cfreq = Vocab(freqMTX.sum(axis=0))\n", "ppmiMTX = compute_association(freqMTX, nfreq=nfreq, cfreq=cfreq, meas = 'ppmi')\n", "ppmiMTX" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should compute the marginal frequencies on your *full* reference matrix, but you may use a submatrix for `compute_association()` to just compute the values for selected items." 
] }, { "cell_type": "code", "execution_count": 22, "metadata": { "scrolled": false }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "d658e9a06d044b57af122cd41218fed2", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(IntProgress(value=0, max=53), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "************************************\n", "function = compute_association\n", " time = 0.01555 sec\n", "************************************\n", "\n" ] }, { "data": { "text/plain": [ "[1, 53] 's/P ,/, a/D about/I about/R all/P an/D ...\n", "the/D 0.030027887 -0.611826 -0.43997574 -0.15229367 0.030027887 0.25317144 -1.0685844 ..." ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "subMTX = freqMTX.submatrix(row = [\"the/D\"]).drop(axis = 1, n_nonzero = 0)\n", "pmi_the = compute_association(subMTX, nfreq=nfreq, cfreq=cfreq, meas = 'pmi')\n", "pmi_the" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Basic token level\n", "\n", "The first step to collecting tokens is selecting the types from which you will collect them. The lines below set a query just for 'girl/N'; if you wanted to use more lemmas you can just include them in the list, e.g. `vocab.subvocab(['girl/N', 'boy/N'])`." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "scrolled": true }, "outputs": [], "source": [ "query = vocab.subvocab([\"girl/N\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will first look at a [bag-of-words method](#5.1.-Bag-of-words), and then at three different dependency-based methods:\n", "\n", "- [Lemmarel](#5.2.-Lemmarel), where the dependency is selected by a specific set of relationships and the context feature is a lemma.\n", "- [Lemmapath](#5.3.-Lemmapath), where the dependency is selected based on the number of steps on the dependency path (like above) and the context feature is a lemma.\n", "- [Deppath](#5.4.-Deppath), where the dependency is selected based on the number of steps on the dependency path but the full dependency relation is the context feature. (Deprel is of course also possible but I will not show it.)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.1. Bag-of-words" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As always, collecting tokens go into two steps: setting up the `TokenHandler` class and then running the `retrieve_tokens()` method (there are other alternative methods too, I understand). The `query` argument of the class is a Vocab object with the types from which we want the tokens.\n", "Among the important settings you might want to reconfigure are the **window span** (`settings['left-span']` and `settings['right-span']`) and `settings['single-boundary-machine']`, a regular expression to match lines that correspond to sentence (or whatever) boundaries, such as '' in this case.\n", "\n", "Next to `fnames`, the method (as well as the class itself) includes a `col_vocab` argument, which takes a `Vocab` object, to select which context words can be captured (rather than, by default, all context words).\n", "The `fnames` argument can be particularly useful here to avoid scanning all of a huge corpus if you only want a few hundred tokens." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING: Not provide the temporary path!\n", "WARNING: Use the default tmp directory: '~/tmp'!\n", "Scanning tokens of queries in corpus...\n" ] }, { "data": { "text/plain": [ "[21, 39] which/W say/V she/P boy/N this/D about/I give/V ...\n", "girl/N/StanfDepSents.10/13 NaN NaN NaN NaN NaN NaN NaN ...\n", "girl/N/StanfDepSents.10/19 NaN NaN NaN NaN NaN NaN NaN ...\n", "girl/N/StanfDepSents.11/3 NaN NaN NaN 4 NaN NaN NaN ...\n", "girl/N/StanfDepSents.11/19 -2 NaN NaN NaN NaN NaN 1 ...\n", "girl/N/StanfDepSents.11/28 NaN NaN NaN 4 -4 NaN NaN ...\n", "girl/N/StanfDepSents.7/7 NaN NaN NaN -3 NaN NaN -2 ...\n", "girl/N/StanfDepSents.7/25 NaN NaN NaN -3 NaN 1 NaN ...\n", "... ... ... ... ... ... ... ... ..." ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokhan = TokenHandler(query, settings=settings)\n", "tokens = tokhan.retrieve_tokens()\n", "tokens" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.2. Lemmarel\n", "\n", "In the first dependency-based model, we will look at templates where a noun is the target and the features could be the verb of which it is subject or direct object, its modifier or an item from which it depends via a preposition." ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "scrolled": true }, "outputs": [], "source": [ "rel_graphml_fname = f\"{mydir}/templates/LEMMAREL.template.graphml\"\n", "rel_patterns = PatternGraph.read_graphml(rel_graphml_fname)\n", "\n", "rel_macro_fname = f\"{mydir}/templates/LEMMAREL.target-feature-macro.xml\"\n", "rel_macros = MacroGraph.read_xml(rel_macro_fname, rel_patterns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like in all other dependency-based models, we first create an object of the `DepRelHandler` class (with either mode 'type' or, in this case, 'token') and then we give the macros to the `.read_templates()` method." ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING: Not provide the temporary path!\n", "WARNING: Use the default tmp directory: '~/tmp'!\n", "Building dependency features...\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "7885747a3cf1414bbbdff17421fbfdde", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(IntProgress(value=0, max=12), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Building matrix...\n" ] }, { "data": { "text/plain": [ "[15, 6] ask/V at/I eat/V give/V look/V sit/V\n", "girl/N/StanfDepSents.1/13 NaN 1 NaN NaN NaN NaN\n", "girl/N/StanfDepSents.1/20 NaN NaN 1 NaN NaN NaN\n", "girl/N/StanfDepSents.1/3 NaN NaN NaN NaN 1 NaN\n", "girl/N/StanfDepSents.10/13 NaN NaN NaN NaN NaN 1\n", "girl/N/StanfDepSents.10/19 NaN NaN 1 NaN NaN NaN\n", "girl/N/StanfDepSents.11/19 NaN NaN NaN 1 NaN NaN\n", "girl/N/StanfDepSents.11/28 NaN NaN NaN NaN 1 NaN\n", "... ... ... ... ... ... ..." 
] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rel_dephan = DepRelHandler(settings, workers=4, targets=query, mode='token')\n", "rel_dephan.read_templates(macros=rel_macros)\n", "rel_tokens = rel_dephan.build_dependency()\n", "rel_tokens" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.3. Lemmapath\n", "\n", "This is the token-level counterpart of the type-level model shown above: the rows are individual instances instead of type-level vectors." ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING: Not provide the temporary path!\n", "WARNING: Use the default tmp directory: '~/tmp'!\n", "Building dependency features...\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "220f4ef8c2a64af3a24514ce15d3e634", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(IntProgress(value=0, max=12), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Building matrix...\n" ] }, { "data": { "text/plain": [ "[21, 12] 's/P apple/N ask/V at/I boy/N by/I eat/V ...\n", "girl/N/StanfDepSents.1/13 NaN NaN NaN 1 NaN NaN NaN ...\n", "girl/N/StanfDepSents.1/20 NaN NaN NaN NaN NaN NaN 1 ...\n", "girl/N/StanfDepSents.1/3 NaN NaN NaN NaN NaN NaN NaN ...\n", "girl/N/StanfDepSents.10/13 NaN NaN NaN NaN NaN NaN NaN ...\n", "girl/N/StanfDepSents.10/19 NaN NaN NaN NaN NaN NaN 1 ...\n", "girl/N/StanfDepSents.11/19 NaN NaN NaN NaN NaN NaN NaN ...\n", "girl/N/StanfDepSents.11/28 NaN NaN NaN NaN NaN NaN NaN ...\n", "... ... ... ... ... ... ... ... ..." ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path_dephan = DepRelHandler(settings, workers=4, targets=query, mode='token')\n", "path_dephan.read_templates(macros=path_macros)\n", "path_tokens = path_dephan.build_dependency()\n", "path_tokens" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.4. Deppath\n", "\n", "This is the token-level counterpart of the second type-level model shown above; that (transposed) type-level model would serve as the second-order matrix for this token-level matrix." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING: Not provide the temporary path!\n", "WARNING: Use the default tmp directory: '~/tmp'!\n", "Building dependency features...\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "c3d4740f9529444c9b6e647a316618a3", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(IntProgress(value=0, max=12), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Building matrix...\n" ] }, { "data": { "text/plain": [ "[21, 12] #T#->*:'s/P #T#->*:old/J #T#->*:the/D apple/N->*:#T# ask/V->*:#T# at/I->*:#T# boy/N->*:#T# ...\n", "girl/N/StanfDepSents.1/13 NaN NaN 1 NaN NaN 1 NaN ...\n", "girl/N/StanfDepSents.1/20 NaN NaN 1 NaN NaN NaN NaN ...\n", "girl/N/StanfDepSents.1/3 NaN NaN 1 NaN NaN NaN NaN ...\n", "girl/N/StanfDepSents.10/13 NaN NaN 1 NaN NaN NaN NaN ...\n", "girl/N/StanfDepSents.10/19 NaN NaN 1 NaN NaN NaN NaN ...\n", "girl/N/StanfDepSents.11/19 NaN NaN 1 NaN NaN NaN NaN ...\n", "girl/N/StanfDepSents.11/28 NaN NaN 1 NaN NaN NaN NaN ...\n", "... ... ... 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Full token level\n", "\n", "### 6.1. Weight context words" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The matrices from the previous step have positions or counts as values. Before replacing the context words with their type-level vectors, we might want to weight them with some association measure, so that context words that are more attracted to the target have a larger influence on the final position of the token with which they co-occur. For that purpose we use `compute_token_weights()` with a weight matrix (e.g. with positive PMI) that includes the target type in its rows and the context words of the tokens in its columns." ] },
{ "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "c77fa0f9606f411b8a1eba9cf898f681", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(IntProgress(value=0, max=39), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "************************************\n", "function = compute_association\n", " time = 0.01584 sec\n", "************************************\n", "\n" ] }, { "data": { "text/plain": [ "[1, 39] which/W say/V she/P boy/N this/D about/I give/V ...\n", "girl/N 0.3507146 0.6383967 0.12757105 0.0 1.0438617 0.6383967 0.23293155 ..." ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "subMTX = freqMTX.submatrix(row = query.get_item_list(), col = tokens.col_items).drop(axis = 1) # Of course, it's best to check for the intersection...\n", "weighter = compute_association(subMTX, nfreq=nfreq, cfreq=cfreq, meas = 'ppmi')\n", "weighter" ] },
{ "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "weighted = compute_token_weights(tokens, weighter)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 6.2. Second-order dimensions" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The final step to obtain token-level vectors is to replace the (weighted) context words with their type-level vectors, by means of the `compute_token_vectors()` function.\n", "\n", "The first two arguments of this function, `tcWeightMTX` and `soccMTX`, are the token-level and second-order type-level matrices involved. Next to them, there is an `operation` argument to decide how to merge the type-level vectors of the context words into the token-level vector: by default it is addition, but it could also be multiplication or a weighted mean. In addition, the `normalization` argument, with L1 as default, sets whether and how the vectors should be normalized.\n", "\n", "The second-order matrix has to have, as rows, the columns of the token-level matrix, while its columns will be the final dimensions." ] },
" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[39, 55] 's/P ,/, a/D about/I about/R all/P an/D ...\n", "which/W NaN NaN NaN NaN NaN NaN NaN ...\n", "say/V NaN NaN NaN NaN NaN NaN NaN ...\n", "she/P NaN NaN NaN NaN NaN NaN NaN ...\n", "boy/N NaN NaN 0.07956182 0.59038746 NaN 0.99585253 0.772709 ...\n", "this/D NaN 2.9034438 NaN NaN NaN NaN NaN ...\n", "about/I NaN NaN NaN NaN NaN NaN NaN ...\n", "give/V NaN NaN 0.65492594 NaN NaN NaN 0.94260806 ...\n", "... ... ... ... ... ... ... ... ..." ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "socMTX = ppmiMTX.submatrix(row = weighted.col_items).drop(axis = 1)\n", "socMTX" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Operation: weighted mean 'token-feature weight matrix' X 'socc matrix'...\n" ] }, { "data": { "text/plain": [ "[21, 55] 's/P ,/, a/D about/I about/R all/P an/D ...\n", "girl/N/StanfDepSents.10/13 0.0008 NaN NaN NaN 0.0008 0.0066 NaN ...\n", "girl/N/StanfDepSents.10/19 0.0119 NaN 0.0041 0.0118 0.2003 0.0035 0.0130 ...\n", "girl/N/StanfDepSents.11/3 0.1028 0.0279 0.0251 NaN 0.0006 0.0048 NaN ...\n", "girl/N/StanfDepSents.11/19 0.0110 NaN 0.0117 0.0088 0.0110 0.0032 0.0220 ...\n", "girl/N/StanfDepSents.11/28 0.0533 0.1544 0.0130 NaN 0.0003 0.0025 NaN ...\n", "girl/N/StanfDepSents.7/7 0.0141 0.0302 0.0197 0.0113 0.0141 0.0283 0.0412 ...\n", "girl/N/StanfDepSents.7/25 0.0145 NaN 0.0050 0.1741 0.0179 0.0042 0.0158 ...\n", "... ... ... ... ... ... ... ... ..." ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokvecs = compute_token_vectors(weighted, socMTX, operation='weightedmean')\n", "tokvecs" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Saving matrix...\n", "Stored in file:\n", " .//output//Toy.ttmx.ppmi.pac\n" ] } ], "source": [ "tokvecs.save(f\"{output_path}/{corpus_name}.ttmx.ppmi.pac\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. Cosine distances" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The final element, `tokvecs`, is the actual token-level matrix we are interested in. We could use it to average vectors over a set of tokens or directly compute the distances or similarities between the vectors. 
{ "cell_type": "code", "execution_count": 34, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "************************************\n", "function = compute_distance\n", " time = 0.02593 sec\n", "************************************\n", "\n" ] }, { "data": { "text/plain": [ "[21, 21] girl/N/StanfDepSents.10/13 girl/N/StanfDepSents.10/19 girl/N/StanfDepSents.11/3 girl/N/StanfDepSents.11/19 girl/N/StanfDepSents.11/28 girl/N/StanfDepSents.7/7 girl/N/StanfDepSents.7/25 ...\n", "girl/N/StanfDepSents.10/13 0.0000 0.8721 0.8736 0.8654 0.8920 0.7566 0.8227 ...\n", "girl/N/StanfDepSents.10/19 0.8721 0.0000 0.9064 0.8111 0.8930 0.7274 0.6725 ...\n", "girl/N/StanfDepSents.11/3 0.8736 0.9064 0.0000 0.8789 0.3427 0.6394 0.8348 ...\n", "girl/N/StanfDepSents.11/19 0.8654 0.8111 0.8789 0.0000 0.8652 0.4694 0.7922 ...\n", "girl/N/StanfDepSents.11/28 0.8920 0.8930 0.3427 0.8652 0.0000 0.5428 0.8596 ...\n", "girl/N/StanfDepSents.7/7 0.7566 0.7274 0.6394 0.4694 0.5428 0.0000 0.6970 ...\n", "girl/N/StanfDepSents.7/25 0.8227 0.6725 0.8348 0.7922 0.8596 0.6970 0.0000 ...\n", "... ... ... ... ... ... ... ... ..." ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokdists = compute_distance(tokvecs)\n", "tokdists" ] },
{ "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Saving matrix...\n", "Stored in file:\n", " .//output//Toy.ttmx.dist.pac\n" ] } ], "source": [ "tokdists.save(f\"{output_path}/{corpus_name}.ttmx.dist.pac\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Of course, `compute_distance()` can also be used to compute the distances between the context words themselves!" ] },
{ "cell_type": "code", "execution_count": 36, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "************************************\n", "function = compute_distance\n", " time = 0.003047 sec\n", "************************************\n", "\n" ] }, { "data": { "text/plain": [ "[39, 39] which/W say/V she/P boy/N this/D about/I give/V ...\n", "which/W 0.0000 0.9714 0.9932 0.9494 0.9750 0.9251 0.5509 ...\n", "say/V 0.9714 0.0000 0.9934 0.8941 0.9592 0.9578 0.9752 ...\n", "she/P 0.9932 0.9934 0.0000 0.7166 0.9929 0.9878 0.9968 ...\n", "boy/N 0.9494 0.8941 0.7166 0.0000 0.9954 0.8361 0.7142 ...\n", "this/D 0.9750 0.9592 0.9929 0.9954 0.0000 0.9594 0.9838 ...\n", "about/I 0.9251 0.9578 0.9878 0.8361 0.9594 0.0000 0.9780 ...\n", "give/V 0.5509 0.9752 0.9968 0.7142 0.9838 0.9780 0.0000 ...\n", "... ... ... ... ... ... ... ... ..." ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "focdists = compute_distance(socMTX)\n", "focdists" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }