All-in-one nephosem¶
This tutorial shows the main tasks you can perform with the nephosem
library, with the following steps:
Initial setup: load required libraries
Configuration: define settings specific to your case study, related mostly to how to read your corpus.
Collocation matrices: create co-occurrence matrices with and without dependency information
Basic token level: create token-level vectors, with and without dependency information
Full token level: weight first-order context words and replace them with their type-level vectors
NOTE: Tips on manipulation of the different objects will be given in their respective tutorials.
0. Initial setup¶
[1]:
import numpy as np # for "booleanize()"
from scipy import sparse # for "booleanize()"
import logging # to keep debugging log
import sys
nephosemdir = "../../nephosem/"
sys.path.append(nephosemdir)
Once nephosem is in your path, you can import different classes and functions from the library, depending on the specific tasks you need to perform.
[2]:
from nephosem.conf import ConfigLoader # to setup the configuration
from nephosem import Vocab, TypeTokenMatrix # to manage frequency lists and matrices
from nephosem import ItemFreqHandler, ColFreqHandler, TokenHandler # to generate frequency lists and matrices
from nephosem import compute_association, compute_distance # to compute PPMI and distances
from nephosem.specutils.mxcalc import compute_token_weights, compute_token_vectors # for token level
from nephosem.models.typetoken import build_tc_weight_matrix # for weighting at token level
# For dependencies
from nephosem.core.graph import SentenceGraph, MacroGraph, PatternGraph
from nephosem.models.deprel import DepRelHandler, read_sentence
1. Configuration¶
Depending on what you need, you will want to set up some paths as variables to reuse in your future filenames.
[3]:
mydir = f"./"
output_path = f"{mydir}/output/"
corpus_name = 'Toy'
logging.basicConfig(filename = f'{mydir}/log.log', level = logging.DEBUG)
The most important concrete step is to adapt the configuration file.
WARNING: You need to run the appropriate settings at the beginning of every script/notebook you run. Every part of the code of one same project has to use the same settings.
[4]:
conf = ConfigLoader()
settings = conf.settings
# If you already have your settings in a config file, you can load them:
# settings = conf.update_config('config.ini')
For this notebook, we will use a dataset of toy sentences in English annotated with Stanford dependencies, stored in ‘data’. This is what part of one of the files looks like:
[5]:
with open('data/StanfDepSents.1.conll', 'r') as f:
lines = f.readlines()
for line in lines[:6]:
print(line)
<s id="1">
The DT the 1 2 det
girl NNS girl 2 3 nsubj
looks VBZ look 3 0 ROOT
healthy JJ healthy 4 3 acomp
</s>
On the one hand, we have token lines: each token is in a line with tab-separated attributes. In this case, they are: word form, part of speech, lemma, index in sentence, index of dependency head, dependency relation. On the other hand, we have lines with other information, in this case sentence delimiters.
Now we need to define the settings so that the code knows how to read the corpus: which lines count as tokens, where sentences end, and which are the different attributes of the corpus. In addition, we will specify what attributes we want for the definition of a type.
The line-machine setting is a regular expression that should only match the lines that count as tokens, and in which the different attributes are captured by groups. In this case, we indicate that we have six sequences of non-tab characters ([^\t]+), each captured by parentheses and separated by tab characters.
The global-columns setting labels the different groups that line-machine has captured.
[6]:
settings['line-machine'] = '([^\t]+)\t([^\t])[^\t]*\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)' #Stanford corpus
settings['global-columns'] = 'word,pos,lemma,id,head,deprel'
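As a quick sanity check, you can test the regular expression on one of the token lines shown above with the standard library re module; note that the second capture group deliberately keeps only the first character of the part-of-speech tag. A minimal sketch, not part of the original workflow:
import re  # standard library, only used for this quick check
sample = "girl\tNNS\tgirl\t2\t3\tnsubj"  # a token line from the sample file above
print(re.match(settings['line-machine'], sample).groups())
# should print: ('girl', 'N', 'girl', '2', '3', 'nsubj')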
The type, colloc and token settings indicate the format of the type, collocate and token IDs, reusing the labels set in settings['global-columns']. Here the target and collocate types are set to the default (combination of ‘lemma’ and ‘pos’), and the token ID uses the values of the ‘lemma’ and ‘pos’ fields along with the file name/ID (‘fid’) and the line number starting from 1 (‘lid’), all separated by slashes. The ‘fid’ is computed as the basename of the filename, without extension.
[7]:
settings['type'] = 'lemma/pos'
settings['colloc'] = 'lemma/pos'
settings['token'] = 'lemma/pos/fid/lid'
If you use dependency-based models, you will need to define a few extra settings: the format of the nodes and edges in the dependency graph and the labels of the index and head information. In other words, you map the labels given in settings['global-columns'] to specific roles in the dependencies.
[8]:
settings['node-attr'] = 'lemma,pos'
settings['edge-attr'] = 'deprel'
settings['currID'] = 'id'
settings['headID'] = 'head'
Finally, you can set up the file encoding and the paths for corpus and output. The code will run on all the files found in settings['corpus-path'], so if you only want to work on a subset, you can create a list of filenames (with full paths), store it in a file, and provide either the list or the path to that file as the fnames argument of any function that scans the corpus.
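For example, a minimal sketch of building such a subset (the variable and file names are just illustrative):
import os
import glob
my_fnames = sorted(os.path.abspath(p) for p in glob.glob(f"{mydir}/data/*.conll"))[:2]  # e.g. only two files
# Either pass `my_fnames` directly as `fnames` to the functions below,
# or store it in a file and pass that file's path instead:
with open(f"{mydir}/my_filenames.txt", "w") as f:
    f.write("\n".join(my_fnames))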
[9]:
settings['file-encoding'] = 'utf-8'
settings['outfile-encoding'] = 'utf-8'
settings['output-path'] = output_path
settings['corpus-path'] = f"{mydir}/data/"
1.1 Malleable settings¶
The previous settings must be defined only once at the beginning of the project and not be changed, since they indicate how to read the corpus.
The next two settings may be changed at different stages of the workflow as hyper-parameters:
The separator-line-machine setting is optional for bag-of-words models (it will exclude context words in a different sentence) but necessary for dependency models: it tells the code where sentences end.
The left-span and right-span values specify the size of the bag-of-words window spans.
[10]:
settings['separator-line-machine'] = '</s>'
settings['left-span'] = 4
settings['right-span'] = 4
2. Frequency lists¶
The Vocab class is based on dictionaries. Creating one takes two steps:
Set up an ItemFreqHandler class with the settings
Build the frequency list with its .build_item_freq() method. The fnames argument can be a list of paths or a path to a file with a list of paths. If it is not provided, the full content of corpus-path will be used.
[11]:
ifhan = ItemFreqHandler(settings = settings)
vocab = ifhan.build_item_freq() # by default it uses multiprocessing, which is overkill for the toy corpus
vocab
WARNING: Not provide the temporary path!
WARNING: Use the default tmp directory: '~/tmp'!
Building item frequency list...
[11]:
[('the/D', 53),('boy/N', 25),('eat/V', 22) ... ('ten/C', 1),('ask/V', 1),('about/I', 1)]
[12]:
vocab_fname = f"{output_path}/{corpus_name}.nfreq"
vocab.save(vocab_fname)
Saving frequency list (vocabulary)... (in 'utf-8')
Stored in .//output//Toy.nfreq
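In a later script or session you can then read the list back in instead of rebuilding it; a minimal sketch, assuming a Vocab.load() counterpart to .save() (check the vocabulary tutorial for the exact call):
# vocab = Vocab.load(vocab_fname)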
This only needs to be done once per corpus. Once the main vocabulary list is compiled and stored, it can be further filtered (see here). It can be used for the following purposes:
to simply extract the frequency of a lemma (see the sketch below)
to create co-occurrence matrices, as the compulsory row_vocab and optional col_vocab arguments of the build_col_freq method below
to select the target types at token level
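A minimal sketch of the first use, assuming that, being dictionary-based, a Vocab can be indexed with a type string (adjust if your version exposes a different accessor):
print(vocab['girl/N'])  # raw frequency of 'girl/N' in the toy corpus
# a filtered copy keeping only nouns, using subvocab() as in section 5 below:
noun_vocab = vocab.subvocab([item for item in vocab.get_item_list() if item.endswith('/N')])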
3. Co-occurrence matrix¶
The main class for matrices in nephosem is the TypeTokenMatrix. You will want to create a co-occurrence matrix between all the types in your corpus, using all the items in your vocabulary. Note that if your corpus is large this can take a long time.
3.1. Bag-of-words¶
Like creating the vocabulary list, creating a co-occurrence frequency matrix takes two steps: setting up the ColFreqHandler object and running its .build_col_freq() method.
[13]:
cfhan = ColFreqHandler(settings=settings, row_vocab = vocab, col_vocab = vocab)
WARNING: Not provide the temporary path!
WARNING: Use the default tmp directory: '~/tmp'!
The .build_col_freq() method also has an optional fname argument that takes a list of paths or a path to a list of paths. In addition, row_vocab and col_vocab take Vocab objects like vocab above. The row_vocab argument, which indicates the node types to collect co-occurrence information for, is compulsory.
[14]:
freqMTX = cfhan.build_col_freq()
freqMTX
Building collocate frequency matrix...
[14]:
[55, 55] 's/P ,/, a/D about/I about/R all/P an/D ...
's/P NaN NaN NaN NaN NaN NaN NaN ...
,/, NaN 2 2 NaN NaN NaN NaN ...
a/D NaN 2 NaN NaN NaN NaN NaN ...
about/I NaN NaN NaN NaN NaN NaN NaN ...
about/R NaN NaN NaN NaN NaN NaN NaN ...
all/P NaN NaN NaN NaN NaN NaN NaN ...
an/D NaN NaN NaN NaN NaN NaN NaN ...
... ... ... ... ... ... ... ... ...
This notebook shows you how you can play with TypeTokenMatrix objects.
This is what it looks like if you only subset the row for ‘girl/N’ and remove the empty columns:
[15]:
freqMTX.submatrix(row = ['girl/N']).drop(axis = 1)
[15]:
[1, 39] 's/P ,/, a/D about/I about/R and/C apple/N ...
girl/N 1 1 5 1 1 3 10 ...
[16]:
freqMTX.save(f"{output_path}/{corpus_name}.bow.wcmx.pac")
Saving matrix...
Stored in file:
.//output//Toy.bow.wcmx.pac
3.2 Dependency-based¶
Dependency-based models require yet another piece of information: templates. You can learn all about these templates in this notebook.
On the one hand, we have .graphml files that indicate relationships between elements. On the other, we have .xml files which specify the role of the node and features in the relationships. For example, in the .graphml file you would say that you want the relationship between a verb and its direct object; in the .xml file, you would clarify that the verb is your feature (or context item) and the object is your target. You would also specify whether you want the lemma of the verb as the feature or, instead, the full path (just eat/V or eat/V->dobj:#T, with #T filling in the role of the target).
For the type level, we will exemplify with patterns that do not specify the kind of relationship but only the number of steps, selecting paths with one step between the target and the context word.
First, we have to load the files.
[17]:
path_graphml_fname = f"{mydir}/templates/LEMMAPATH.template.graphml"
path_patterns = PatternGraph.read_graphml(path_graphml_fname)
path_macro_fname = f"{mydir}/templates/LEMMAPATH.target-feature-macro.xml"
path_macros = MacroGraph.read_xml(path_macro_fname, path_patterns)
[18]:
# The only difference between type- and token-level here is the "mode" argument
path_dephan_type = DepRelHandler(settings, workers=4, targets = vocab, mode='type')
path_dephan_type.read_templates(macros=path_macros)
pathMTX = path_dephan_type.build_dependency()
pathMTX
WARNING: Not provide the temporary path!
WARNING: Use the default tmp directory: '~/tmp'!
Building dependency features...
Building matrix...
[18]:
[54, 54] 's/P a/D about/I about/R all/P an/D and/C ...
's/P NaN NaN NaN NaN NaN NaN NaN ...
a/D NaN NaN NaN NaN NaN NaN NaN ...
about/I NaN NaN NaN NaN NaN NaN NaN ...
about/R NaN NaN NaN NaN NaN NaN NaN ...
all/P NaN NaN NaN NaN NaN NaN NaN ...
an/D NaN NaN NaN NaN NaN NaN NaN ...
and/C NaN NaN NaN NaN NaN NaN NaN ...
... ... ... ... ... ... ... ... ...
This is what it looks like if you only subset the row for ‘girl/N’ and remove the empty columns:
[19]:
pathMTX.submatrix(row = ["girl/N"]).drop(axis = 1)
[19]:
[1, 12] 's/P apple/N ask/V at/I boy/N by/I eat/V ...
girl/N 1 1 1 1 2 1 6 ...
The following template asks for the full dependency relation to be the feature, instead of just the lemma.
[20]:
pathfull_macro_fname = f"{mydir}/templates/LEMMAPATHfull.target-feature-macro.xml"
pathfull_macros = MacroGraph.read_xml(pathfull_macro_fname, path_patterns)
pathfull_dephan_type = DepRelHandler(settings, workers=4, targets = vocab, mode='type')
pathfull_dephan_type.read_templates(macros=pathfull_macros)
pathfullMTX = pathfull_dephan_type.build_dependency()
pathfullMTX.submatrix(row = ["girl/N"]).drop(axis = 1)
# Note that the dependency itself is replaced by "*" because the regex in the patterns file does not capture it
# (it doesn't have parentheses)
WARNING: Not provide the temporary path!
WARNING: Use the default tmp directory: '~/tmp'!
Building dependency features...
Building matrix...
[20]:
[1, 12] #T#->*:'s/P #T#->*:old/J #T#->*:the/D apple/N->*:#T# ask/V->*:#T# at/I->*:#T# boy/N->*:#T# ...
girl/N 1 1 21 1 1 1 2 ...
In practice, we will be more interested in the transposed counterpart of this matrix: when we obtain token-level matrices of this kind, the patterns will be the columns and will need to be multiplied by a SOCC matrix where the patterns are the rows :).
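A minimal sketch of that transposition; the .transpose() method is an assumption here, so check the matrix tutorial for the call your version provides:
# pathfullSOCC = pathfullMTX.transpose()  # patterns become rows, ready to act as a second-order matrix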
4. Association measures¶
One of the things you will want to do is compute association measures, which will be the actual values of your vectors, for either type- or token-level matrices. This is done with compute_association(), a function that takes a TypeTokenMatrix, row and column Vocab objects and the kind of measure (check the documentation to find the possibilities).
First we obtain the marginal frequencies of the reference matrix and convert them to Vocab objects.
[21]:
nfreq = Vocab(freqMTX.sum(axis=1))
cfreq = Vocab(freqMTX.sum(axis=0))
ppmiMTX = compute_association(freqMTX, nfreq=nfreq, cfreq=cfreq, meas = 'ppmi')
ppmiMTX
************************************
function = compute_association
time = 0.01725 sec
************************************
[21]:
[55, 55] 's/P ,/, a/D about/I about/R all/P an/D ...
's/P NaN NaN NaN NaN NaN NaN NaN ...
,/, NaN 2.0384464 1.2940059 NaN NaN NaN NaN ...
a/D NaN 1.2940059 NaN NaN NaN NaN NaN ...
about/I NaN NaN NaN NaN NaN NaN NaN ...
about/R NaN NaN NaN NaN NaN NaN NaN ...
all/P NaN NaN NaN NaN NaN NaN NaN ...
an/D NaN NaN NaN NaN NaN NaN NaN ...
... ... ... ... ... ... ... ... ...
You should compute the marginal frequencies on your full reference matrix, but you may use a submatrix for compute_association()
to just compute the values for selected items.
[22]:
subMTX = freqMTX.submatrix(row = ["the/D"]).drop(axis = 1, n_nonzero = 0)
pmi_the = compute_association(subMTX, nfreq=nfreq, cfreq=cfreq, meas = 'pmi')
pmi_the
************************************
function = compute_association
time = 0.01555 sec
************************************
[22]:
[1, 53] 's/P ,/, a/D about/I about/R all/P an/D ...
the/D 0.030027887 -0.611826 -0.43997574 -0.15229367 0.030027887 0.25317144 -1.0685844 ...
5. Basic token level¶
The first step to collecting tokens is selecting the types from which you will collect them. The lines below set a query just for ‘girl/N’; if you want to use more lemmas, just include them in the list, e.g. vocab.subvocab(['girl/N', 'boy/N']).
[23]:
query = vocab.subvocab(["girl/N"])
We will first look at a bag-of-words method, and then at three different dependency-based methods:
Lemmarel, where the dependency is selected by a specific set of relationships and the context feature is a lemma.
Lemmapath, where the dependency is selected based on the number of steps on the dependency path (like above) and the context feature is a lemma.
Deppath, where the dependency is selected based on the number of steps on the dependency path but the full dependency relation is the context feature. (Deprel is of course also possible but I will not show it.)
5.1. Bag-of-words¶
As always, collecting tokens takes two steps: setting up the TokenHandler class and then running the retrieve_tokens() method (there are other alternative methods too). The query argument of the class is a Vocab object with the types from which we want the tokens. Among the important settings you might want to reconfigure are the window span (settings['left-span'] and settings['right-span']) and settings['single-boundary-machine'], a regular expression to match lines that correspond to sentence (or other) boundaries, such as ‘</s>’ in this case.
Next to fnames
, the method (as well as the class itself) includes a col_vocab
argument, which takes a Vocab
object, to select which context words can be captured (rather than, by default, all context words). The fnames
argument can be particularly useful here to avoid scanning all of a huge corpus if you only want a few hundred tokens.
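A minimal sketch of those options (left commented out so it does not interfere with the cells below; the variable names are just illustrative):
# settings['left-span'] = 2   # narrower window
# settings['right-span'] = 2
# nouns_only = vocab.subvocab([i for i in vocab.get_item_list() if i.endswith('/N')])
# tokhan_nouns = TokenHandler(query, settings=settings, col_vocab=nouns_only)
# noun_tokens = tokhan_nouns.retrieve_tokens()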
[24]:
tokhan = TokenHandler(query, settings=settings)
tokens = tokhan.retrieve_tokens()
tokens
WARNING: Not provide the temporary path!
WARNING: Use the default tmp directory: '~/tmp'!
Scanning tokens of queries in corpus...
[24]:
[21, 39] which/W say/V she/P boy/N this/D about/I give/V ...
girl/N/StanfDepSents.10/13 NaN NaN NaN NaN NaN NaN NaN ...
girl/N/StanfDepSents.10/19 NaN NaN NaN NaN NaN NaN NaN ...
girl/N/StanfDepSents.11/3 NaN NaN NaN 4 NaN NaN NaN ...
girl/N/StanfDepSents.11/19 -2 NaN NaN NaN NaN NaN 1 ...
girl/N/StanfDepSents.11/28 NaN NaN NaN 4 -4 NaN NaN ...
girl/N/StanfDepSents.7/7 NaN NaN NaN -3 NaN NaN -2 ...
girl/N/StanfDepSents.7/25 NaN NaN NaN -3 NaN 1 NaN ...
... ... ... ... ... ... ... ... ...
5.2. Lemmarel¶
In the first dependency-based model, we will look at templates where a noun is the target and the features can be the verb of which it is the subject or direct object, its modifier, or an item on which it depends via a preposition.
[25]:
rel_graphml_fname = f"{mydir}/templates/LEMMAREL.template.graphml"
rel_patterns = PatternGraph.read_graphml(rel_graphml_fname)
rel_macro_fname = f"{mydir}/templates/LEMMAREL.target-feature-macro.xml"
rel_macros = MacroGraph.read_xml(rel_macro_fname, rel_patterns)
Like in all other dependency-based models, we first create an object of the DepRelHandler
class (with either mode ‘type’ or, in this case, ‘token’) and then we give the macros to the .read_templates()
method.
[26]:
rel_dephan = DepRelHandler(settings, workers=4, targets=query, mode='token')
rel_dephan.read_templates(macros=rel_macros)
rel_tokens = rel_dephan.build_dependency()
rel_tokens
WARNING: Not provide the temporary path!
WARNING: Use the default tmp directory: '~/tmp'!
Building dependency features...
Building matrix...
[26]:
[15, 6] ask/V at/I eat/V give/V look/V sit/V
girl/N/StanfDepSents.1/13 NaN 1 NaN NaN NaN NaN
girl/N/StanfDepSents.1/20 NaN NaN 1 NaN NaN NaN
girl/N/StanfDepSents.1/3 NaN NaN NaN NaN 1 NaN
girl/N/StanfDepSents.10/13 NaN NaN NaN NaN NaN 1
girl/N/StanfDepSents.10/19 NaN NaN 1 NaN NaN NaN
girl/N/StanfDepSents.11/19 NaN NaN NaN 1 NaN NaN
girl/N/StanfDepSents.11/28 NaN NaN NaN NaN 1 NaN
... ... ... ... ... ... ...
5.3. Lemmapath¶
This is the token-level counterpart of the type-level model shown above: the rows are individual instances instead of type-level vectors.
[27]:
path_dephan = DepRelHandler(settings, workers=4, targets=query, mode='token')
path_dephan.read_templates(macros=path_macros)
path_tokens = path_dephan.build_dependency()
path_tokens
WARNING: Not provide the temporary path!
WARNING: Use the default tmp directory: '~/tmp'!
Building dependency features...
Building matrix...
[27]:
[21, 12] 's/P apple/N ask/V at/I boy/N by/I eat/V ...
girl/N/StanfDepSents.1/13 NaN NaN NaN 1 NaN NaN NaN ...
girl/N/StanfDepSents.1/20 NaN NaN NaN NaN NaN NaN 1 ...
girl/N/StanfDepSents.1/3 NaN NaN NaN NaN NaN NaN NaN ...
girl/N/StanfDepSents.10/13 NaN NaN NaN NaN NaN NaN NaN ...
girl/N/StanfDepSents.10/19 NaN NaN NaN NaN NaN NaN 1 ...
girl/N/StanfDepSents.11/19 NaN NaN NaN NaN NaN NaN NaN ...
girl/N/StanfDepSents.11/28 NaN NaN NaN NaN NaN NaN NaN ...
... ... ... ... ... ... ... ... ...
5.4. Deppath¶
This is the token-level counterpart of the second type-level model shown above; that (transposed) type-level model would serve as the second-order matrix for this token-level matrix.
[28]:
pathfull_dephan = DepRelHandler(settings, workers=4, targets=query, mode='token')
pathfull_dephan.read_templates(macros=pathfull_macros)
pathfull_tokens = pathfull_dephan.build_dependency()
pathfull_tokens
WARNING: Not provide the temporary path!
WARNING: Use the default tmp directory: '~/tmp'!
Building dependency features...
Building matrix...
[28]:
[21, 12] #T#->*:'s/P #T#->*:old/J #T#->*:the/D apple/N->*:#T# ask/V->*:#T# at/I->*:#T# boy/N->*:#T# ...
girl/N/StanfDepSents.1/13 NaN NaN 1 NaN NaN 1 NaN ...
girl/N/StanfDepSents.1/20 NaN NaN 1 NaN NaN NaN NaN ...
girl/N/StanfDepSents.1/3 NaN NaN 1 NaN NaN NaN NaN ...
girl/N/StanfDepSents.10/13 NaN NaN 1 NaN NaN NaN NaN ...
girl/N/StanfDepSents.10/19 NaN NaN 1 NaN NaN NaN NaN ...
girl/N/StanfDepSents.11/19 NaN NaN 1 NaN NaN NaN NaN ...
girl/N/StanfDepSents.11/28 NaN NaN 1 NaN NaN NaN NaN ...
... ... ... ... ... ... ... ... ...
6. Full token level¶
6.1. Weight context words¶
The matrices from the previous step have positions or counts as values. Before replacing the context words with their type-level vectors, we might want to weight them with some association measure, so that context words that are more attracted to the target have a larger influence on the final position of the token with which they co-occur. For that purpose we use compute_token_weights() with a weight matrix (e.g. with positive PMI) that includes the target type in its rows and the context words of the tokens in its columns.
[29]:
subMTX = freqMTX.submatrix(row = query.get_item_list(), col = tokens.col_items).drop(axis = 1) #Of course, it's best to check for the intersection...
weighter = compute_association(subMTX, nfreq=nfreq, cfreq=cfreq, meas = 'ppmi')
weighter
************************************
function = compute_association
time = 0.01584 sec
************************************
[29]:
[1, 39] which/W say/V she/P boy/N this/D about/I give/V ...
girl/N 0.3507146 0.6383967 0.12757105 0.0 1.0438617 0.6383967 0.23293155 ...
[30]:
weighted = compute_token_weights(tokens, weighter)
6.2. Second-order dimensions¶
The final step to obtain token-level vectors is to replace the (weighted) context words with their type-level vectors, by means of the compute_token_vectors() function.
The first two arguments of this function, tcWeightMTX and soccMTX, are the token-level and second-order type-level matrices involved. Next to them there is an operation argument to decide how to merge the type-level vectors of the context words to form the token-level vector: by default it is addition, but it could also be multiplication or a weighted mean. In addition, the normalization argument, with L1 as default, sets whether and how the vectors should be normalized.
The second-order matrix has to have, as rows, the columns of the token-level matrix, while the columns will be the final dimensions.
[31]:
socMTX = ppmiMTX.submatrix(row = weighted.col_items).drop(axis = 1)
socMTX
[31]:
[39, 55] 's/P ,/, a/D about/I about/R all/P an/D ...
which/W NaN NaN NaN NaN NaN NaN NaN ...
say/V NaN NaN NaN NaN NaN NaN NaN ...
she/P NaN NaN NaN NaN NaN NaN NaN ...
boy/N NaN NaN 0.07956182 0.59038746 NaN 0.99585253 0.772709 ...
this/D NaN 2.9034438 NaN NaN NaN NaN NaN ...
about/I NaN NaN NaN NaN NaN NaN NaN ...
give/V NaN NaN 0.65492594 NaN NaN NaN 0.94260806 ...
... ... ... ... ... ... ... ... ...
[32]:
tokvecs = compute_token_vectors(weighted, socMTX, operation='weightedmean')
tokvecs
Operation: weighted mean 'token-feature weight matrix' X 'socc matrix'...
[32]:
[21, 55] 's/P ,/, a/D about/I about/R all/P an/D ...
girl/N/StanfDepSents.10/13 0.0008 NaN NaN NaN 0.0008 0.0066 NaN ...
girl/N/StanfDepSents.10/19 0.0119 NaN 0.0041 0.0118 0.2003 0.0035 0.0130 ...
girl/N/StanfDepSents.11/3 0.1028 0.0279 0.0251 NaN 0.0006 0.0048 NaN ...
girl/N/StanfDepSents.11/19 0.0110 NaN 0.0117 0.0088 0.0110 0.0032 0.0220 ...
girl/N/StanfDepSents.11/28 0.0533 0.1544 0.0130 NaN 0.0003 0.0025 NaN ...
girl/N/StanfDepSents.7/7 0.0141 0.0302 0.0197 0.0113 0.0141 0.0283 0.0412 ...
girl/N/StanfDepSents.7/25 0.0145 NaN 0.0050 0.1741 0.0179 0.0042 0.0158 ...
... ... ... ... ... ... ... ... ...
[33]:
tokvecs.save(f"{output_path}/{corpus_name}.ttmx.ppmi.pac")
Saving matrix...
Stored in file:
.//output//Toy.ttmx.ppmi.pac
7. Cosine distances¶
The final element, tokvecs, is the actual token-level matrix we are interested in. We could use it to average vectors over a set of tokens or directly compute the distances or similarities between the vectors. See the documentation for options on different measures.
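For instance, a summed ‘prototype’ vector over all the tokens can be obtained by reusing the .sum(axis=0) call from the marginal-frequency step above; a minimal sketch (for cosine comparisons the missing division by the number of tokens is irrelevant, and the exact return type may vary between versions):
girl_prototype = tokvecs.sum(axis=0)  # one summed vector over all 'girl/N' tokens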
[34]:
tokdists = compute_distance(tokvecs)
tokdists
************************************
function = compute_distance
time = 0.02593 sec
************************************
[34]:
[21, 21] girl/N/StanfDepSents.10/13 girl/N/StanfDepSents.10/19 girl/N/StanfDepSents.11/3 girl/N/StanfDepSents.11/19 girl/N/StanfDepSents.11/28 girl/N/StanfDepSents.7/7 girl/N/StanfDepSents.7/25 ...
girl/N/StanfDepSents.10/13 0.0000 0.8721 0.8736 0.8654 0.8920 0.7566 0.8227 ...
girl/N/StanfDepSents.10/19 0.8721 0.0000 0.9064 0.8111 0.8930 0.7274 0.6725 ...
girl/N/StanfDepSents.11/3 0.8736 0.9064 0.0000 0.8789 0.3427 0.6394 0.8348 ...
girl/N/StanfDepSents.11/19 0.8654 0.8111 0.8789 0.0000 0.8652 0.4694 0.7922 ...
girl/N/StanfDepSents.11/28 0.8920 0.8930 0.3427 0.8652 0.0000 0.5428 0.8596 ...
girl/N/StanfDepSents.7/7 0.7566 0.7274 0.6394 0.4694 0.5428 0.0000 0.6970 ...
girl/N/StanfDepSents.7/25 0.8227 0.6725 0.8348 0.7922 0.8596 0.6970 0.0000 ...
... ... ... ... ... ... ... ... ...
[35]:
tokdists.save(f"{output_path}/{corpus_name}.ttmx.dist.pac")
Saving matrix...
Stored in file:
.//output//Toy.ttmx.dist.pac
Of course, this could be used to compute the distances between the context words themselves!
[36]:
focdists = compute_distance(socMTX)
focdists
************************************
function = compute_distance
time = 0.003047 sec
************************************
[36]:
[39, 39] which/W say/V she/P boy/N this/D about/I give/V ...
which/W 0.0000 0.9714 0.9932 0.9494 0.9750 0.9251 0.5509 ...
say/V 0.9714 0.0000 0.9934 0.8941 0.9592 0.9578 0.9752 ...
she/P 0.9932 0.9934 0.0000 0.7166 0.9929 0.9878 0.9968 ...
boy/N 0.9494 0.8941 0.7166 0.0000 0.9954 0.8361 0.7142 ...
this/D 0.9750 0.9592 0.9929 0.9954 0.0000 0.9594 0.9838 ...
about/I 0.9251 0.9578 0.9878 0.8361 0.9594 0.0000 0.9780 ...
give/V 0.5509 0.9752 0.9968 0.7142 0.9838 0.9780 0.0000 ...
... ... ... ... ... ... ... ... ...