All-in-one nephosem¶
This tutorial shows the main tasks you can perform with the nephosem
library, with the following steps:
Initial setup: load required libraries
Configuration: define settings specific to your case study, related mostly to how to read your corpus.
Collocation matrices: create co-occurrence matrices with and without dependency information
Basic token level: create token-level vectors, with and without dependency information
Full token level: weight first-order context words and replace them with their type-level vectors
NOTE: Tips on manipulation of the different objects will be given in their respective tutorials.
0. Initial setup¶
[1]:
import numpy as np # for "booleanize()"
from scipy import sparse # for "booleanize()"
import logging # to keep debugging log
import sys
nephosemdir = "../../nephosem/"
sys.path.append(nephosemdir)
Once nephosem is in your path, you can import different classes and functions from the library, depending on the specific tasks you need to perform.
[2]:
from nephosem.conf import ConfigLoader # to setup the configuration
from nephosem import Vocab, TypeTokenMatrix # to manage frequency lists and matrices
from nephosem import ItemFreqHandler, ColFreqHandler, TokenHandler # to generate frequency lists and matrices
from nephosem import compute_association, compute_distance # to compute PPMI and distances
from nephosem.specutils.mxcalc import compute_token_weights, compute_token_vectors # for token level
from nephosem.models.typetoken import build_tc_weight_matrix # for weighting at token level
# For dependencies
from nephosem.core.graph import SentenceGraph, MacroGraph, PatternGraph
from nephosem.models.deprel import DepRelHandler, read_sentence
1. Configuration¶
Depending on what you need, you will want to set up some paths as variables to reuse in your future filenames.
[3]:
mydir = f"./"
output_path = f"{mydir}/output/"
corpus_name = 'Toy'
logging.basicConfig(filename = f'{mydir}/log.log', level = logging.DEBUG)
The most important concrete step is to adapt the configuration file.
WARNING: You need to run the appropriate settings at the beginning of every script/notebook you run. Every part of the code of one same project has to use the same settings.
[4]:
conf = ConfigLoader()
settings = conf.settings
# If you already have your settings in a config file, you can load them:
# settings = conf.update_config('config.ini')
For this notebook, we will use a dataset of toy sentences in English annotated with Stanford dependencies, stored in ‘data’. This is what part of one of the files looks like:
[5]:
with open('data/StanfDepSents.1.conll', 'r') as f:
lines = f.readlines()
for line in lines[:6]:
print(line)
<s id="1">
The DT the 1 2 det
girl NNS girl 2 3 nsubj
looks VBZ look 3 0 ROOT
healthy JJ healthy 4 3 acomp
</s>
On the one hand, we have token lines: each token is in a line with tab-separated attributes. In this case, they are: word form, part of speech, lemma, index in sentence, index of dependency head, dependency relation. On the other hand, we have lines with other information, in this case sentence delimiters.
Now we need to define the settings so that the code knows how to read the corpus: which lines count as tokens, where sentences end, and which are the different attributes of the corpus. In addition, we will specify what attributes we want for the definition of a type.
The line-machine setting is a regular expression that should only match the lines that count as tokens, and in which the different attributes are captured by groups. In this case, we indicate that we have six sequences of non-tab characters ([^\t]+), each captured by parentheses and separated by tab characters.
The global-columns setting labels the different groups that line-machine has captured.
[6]:
settings['line-machine'] = '([^\t]+)\t([^\t])[^\t]*\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)' #Stanford corpus
settings['global-columns'] = 'word,pos,lemma,id,head,deprel'
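As a quick sanity check, you can test the regular expression on one of the token lines shown above with the standard library re module; note that the second capture group deliberately keeps only the first character of the part-of-speech tag. A minimal sketch, not part of the original workflow:
import re  # standard library, only used for this quick check
sample = "girl\tNNS\tgirl\t2\t3\tnsubj"  # a token line from the sample file above
print(re.match(settings['line-machine'], sample).groups())
# should print: ('girl', 'N', 'girl', '2', '3', 'nsubj')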
The type, colloc and token settings indicate the format of the type, collocate and token IDs, reusing the labels set in settings['global-columns']. Here the target and collocate types are set to the default (combination of ‘lemma’ and ‘pos’), and the token ID uses the values of the ‘lemma’ and ‘pos’ fields along with the file name/ID (‘fid’) and the line number starting from 1 (‘lid’), all separated by slashes. The ‘fid’ is computed as the basename of the filename, without extension.
[7]:
settings['type'] = 'lemma/pos'
settings['colloc'] = 'lemma/pos'
settings['token'] = 'lemma/pos/fid/lid'
If you use dependency-based models, you will need to define a few extra settings: the format of the nodes and edges in the dependency graph and the labels of the index and head information. In other words, you map the labels given in settings['global-columns'] to specific roles in the dependencies.
[8]:
settings['node-attr'] = 'lemma,pos'
settings['edge-attr'] = 'deprel'
settings['currID'] = 'id'
settings['headID'] = 'head'
Finally, you can set up the file encoding and the paths for corpus and output. The code will run on all the files found in settings['corpus-path'], so if you only want to work on a subset, you can create a list of filenames (with full paths), store it in a file, and provide either the list or the path to that file as the fnames argument of any function that scans the corpus.
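For example, a minimal sketch of building such a subset (the variable and file names are just illustrative):
import os
import glob
my_fnames = sorted(os.path.abspath(p) for p in glob.glob(f"{mydir}/data/*.conll"))[:2]  # e.g. only two files
# Either pass `my_fnames` directly as `fnames` to the functions below,
# or store it in a file and pass that file's path instead:
with open(f"{mydir}/my_filenames.txt", "w") as f:
    f.write("\n".join(my_fnames))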
[9]:
settings['file-encoding'] = 'utf-8'
settings['outfile-encoding'] = 'utf-8'
settings['output-path'] = output_path
settings['corpus-path'] = f"{mydir}/data/"
1.1 Malleable settings¶
The previous settings must be defined only once at the beginning of the project and not be changed, since they indicate how to read the corpus.
The next two settings may be changed at different stages of the workflow as hyper-parameters:
The separator-line-machine setting is optional for bag-of-words models (it will exclude context words in a different sentence) but necessary for dependency models: it tells the code where sentences end.
The left-span and right-span values specify the size of the bag-of-words window spans.
[10]:
settings['separator-line-machine'] = '</s>'
settings['left-span'] = 4
settings['right-span'] = 4
2. Frequency lists¶
The Vocab class is based on dictionaries. Creating one takes two steps:
Set up an ItemFreqHandler class with the settings
Build the frequency list with its .build_item_freq() method. The fnames argument can be a list of paths or a path to a file with a list of paths. If it is not provided, the full content of corpus-path will be used.
[11]:
ifhan = ItemFreqHandler(settings = settings)
vocab = ifhan.build_item_freq() # by default it uses multiprocessing, which is overkill for the toy corpus
vocab
WARNING: Not provide the temporary path!
WARNING: Use the default tmp directory: '~/tmp'!
Building item frequency list...
[11]:
[('the/D', 53),('boy/N', 25),('eat/V', 22) ... ('ten/C', 1),('ask/V', 1),('about/I', 1)]
[12]:
vocab_fname = f"{output_path}/{corpus_name}.nfreq"
vocab.save(vocab_fname)
Saving frequency list (vocabulary)... (in 'utf-8')
Stored in .//output//Toy.nfreq
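In a later script or session you can then read the list back in instead of rebuilding it; a minimal sketch, assuming a Vocab.load() counterpart to .save() (check the vocabulary tutorial for the exact call):
# vocab = Vocab.load(vocab_fname)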
This only needs to be done once per corpus. Once the main vocabulary list is compiled and stored, it can be further filtered (see here). It can be used for the following purposes:
to simply extract the frequency of a lemma (see the sketch below)
to create co-occurrence matrices, as the compulsory row_vocab and optional col_vocab arguments of the build_col_freq method below
to select the target types at token level
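A minimal sketch of the first use, assuming that, being dictionary-based, a Vocab can be indexed with a type string (adjust if your version exposes a different accessor):
print(vocab['girl/N'])  # raw frequency of 'girl/N' in the toy corpus
# a filtered copy keeping only nouns, using subvocab() as in section 5 below:
noun_vocab = vocab.subvocab([item for item in vocab.get_item_list() if item.endswith('/N')])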
3. Co-occurrence matrix¶
The main class for matrices in nephosem is the TypeTokenMatrix. You will want to create a co-occurrence matrix between all the types in your corpus, using all the items in your vocabulary. Note that if your corpus is large this can take a long time.
3.1. Bag-of-words¶
Like creating the vocabulary list, creating a co-occurrence frequency matrix takes two steps: setting up the ColFreqHandler object and running its .build_col_freq() method.
[13]:
cfhan = ColFreqHandler(settings=settings, row_vocab = vocab, col_vocab = vocab)
WARNING: Not provide the temporary path!
WARNING: Use the default tmp directory: '~/tmp'!
The .build_col_freq() method also has an optional fname argument that takes a list of paths or a path to a list of paths. In addition, row_vocab and col_vocab take Vocab objects like vocab above. The row_vocab argument, which indicates the node types to collect co-occurrence information for, is compulsory.
[14]:
freqMTX = cfhan.build_col_freq()
freqMTX
Building collocate frequency matrix...
[14]:
[55, 55] 's/P ,/, a/D about/I about/R all/P an/D ...
's/P NaN NaN NaN NaN NaN NaN NaN ...
,/, NaN 2 2 NaN NaN NaN NaN ...
a/D NaN 2 NaN NaN NaN NaN NaN ...
about/I NaN NaN NaN NaN NaN NaN NaN ...
about/R NaN NaN NaN NaN NaN NaN NaN ...
all/P NaN NaN NaN NaN NaN NaN NaN ...
an/D NaN NaN NaN NaN NaN NaN NaN ...
... ... ... ... ... ... ... ... ...
This notebook shows you how you can play with TypeTokenMatrix objects.
This is what it looks like if you only subset the row for ‘girl/N’ and remove the empty columns:
[15]:
freqMTX.submatrix(row = ['girl/N']).drop(axis = 1)
[15]:
[1, 39] 's/P ,/, a/D about/I about/R and/C apple/N ...
girl/N 1 1 5 1 1 3 10 ...
[16]:
freqMTX.save(f"{output_path}/{corpus_name}.bow.wcmx.pac")
Saving matrix...
Stored in file:
.//output//Toy.bow.wcmx.pac
3.2 Dependency-based¶
Dependency-based models require yet another piece of information: templates. You can learn all about these templates in this notebook.
On the one hand, we have .graphml files that indicate relationships between elements. On the other, we have .xml files which specify the role of the node and features in the relationships. For example, in the .graphml file you would say that you want the relationship between a verb and its direct object; in the .xml file, you would clarify that the verb is your feature (or context item) and the object is your target. You would also specify whether you want the lemma of the verb as the feature or, instead, the full path (just eat/V or eat/V->dobj:#T, with #T filling in the role of the target).
For the type level, we will exemplify with patterns that do not specify the kind of relationship but only the number of steps, selecting paths with one step between the target and the context word.
First, we have to load the files.
[17]:
path_graphml_fname = f"{mydir}/templates/LEMMAPATH.template.graphml"
path_patterns = PatternGraph.read_graphml(path_graphml_fname)
path_macro_fname = f"{mydir}/templates/LEMMAPATH.target-feature-macro.xml"
path_macros = MacroGraph.read_xml(path_macro_fname, path_patterns)
[18]:
# The only difference between type- and token-level here is the "mode" argument
path_dephan_type = DepRelHandler(settings, workers=4, targets = vocab, mode='type')
path_dephan_type.read_templates(macros=path_macros)
pathMTX = path_dephan_type.build_dependency()
pathMTX
WARNING: Not provide the temporary path!
WARNING: Use the default tmp directory: '~/tmp'!
Building dependency features...
Building matrix...
[18]:
[54, 54] 's/P a/D about/I about/R all/P an/D and/C ...
's/P NaN NaN NaN NaN NaN NaN NaN ...
a/D NaN NaN NaN NaN NaN NaN NaN ...
about/I NaN NaN NaN NaN NaN NaN NaN ...
about/R NaN NaN NaN NaN NaN NaN NaN ...
all/P NaN NaN NaN NaN NaN NaN NaN ...
an/D NaN NaN NaN NaN NaN NaN NaN ...
and/C NaN NaN NaN NaN NaN NaN NaN ...
... ... ... ... ... ... ... ... ...
This is what it looks like if you only subset the row for ‘girl/N’ and remove the empty columns:
[19]:
pathMTX.submatrix(row = ["girl/N"]).drop(axis = 1)
[19]:
[1, 12] 's/P apple/N ask/V at/I boy/N by/I eat/V ...
girl/N 1 1 1 1 2 1 6 ...
The following template asks for the full dependency relation to be the feature, instead of just the lemma.
[20]:
pathfull_macro_fname = f"{mydir}/templates/LEMMAPATHfull.target-feature-macro.xml"
pathfull_macros = MacroGraph.read_xml(pathfull_macro_fname, path_patterns)
pathfull_dephan_type = DepRelHandler(settings, workers=4, targets = vocab, mode='type')
pathfull_dephan_type.read_templates(macros=pathfull_macros)
pathfullMTX = pathfull_dephan_type.build_dependency()
pathfullMTX.submatrix(row = ["girl/N"]).drop(axis = 1)
# Note that the dependency itself is replaced by "*" because the regex in the patterns file does not capture it
# (it doesn't have parentheses)
WARNING: Not provide the temporary path!
WARNING: Use the default tmp directory: '~/tmp'!
Building dependency features...
Building matrix...
[20]:
[1, 12] #T#->*:'s/P #T#->*:old/J #T#->*:the/D apple/N->*:#T# ask/V->*:#T# at/I->*:#T# boy/N->*:#T# ...
girl/N 1 1 21 1 1 1 2 ...
In practice, we will be more interested in the transposed counterpart of this matrix: when we obtain token-level matrices of this kind, the patterns will be the columns and will need to be multiplied by a SOCC matrix where the patterns are the rows :).
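A minimal sketch of that transposition; the .transpose() method is an assumption here, so check the matrix tutorial for the call your version provides:
# pathfullSOCC = pathfullMTX.transpose()  # patterns become rows, ready to act as a second-order matrix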
4. Association measures¶
One of the things you will want to do is compute association measures, which will be the actual values of your vectors, for either type- or token-level matrices. This is done with compute_association(), a function that takes a TypeTokenMatrix, row and column Vocab objects and the kind of measure (check the documentation to find the possibilities).
First we obtain the marginal frequencies of the reference matrix and convert them to Vocab objects.
[21]:
nfreq = Vocab(freqMTX.sum(axis=1))
cfreq = Vocab(freqMTX.sum(axis=0))
ppmiMTX = compute_association(freqMTX, nfreq=nfreq, cfreq=cfreq, meas = 'ppmi')
ppmiMTX
************************************
function = compute_association
time = 0.01725 sec
************************************
[21]:
[55, 55] 's/P ,/, a/D about/I about/R all/P an/D ...
's/P NaN NaN NaN NaN NaN NaN NaN ...
,/, NaN 2.0384464 1.2940059 NaN NaN NaN NaN ...
a/D NaN 1.2940059 NaN NaN NaN NaN NaN ...
about/I NaN NaN NaN NaN NaN NaN NaN ...
about/R NaN NaN NaN NaN NaN NaN NaN ...
all/P NaN NaN NaN NaN NaN NaN NaN ...
an/D NaN NaN NaN NaN NaN NaN NaN ...
... ... ... ... ... ... ... ... ...
You should compute the marginal frequencies on your full reference matrix, but you may use a submatrix for compute_association()
to just compute the values for selected items.
[22]:
subMTX = freqMTX.submatrix(row = ["the/D"]).drop(axis = 1, n_nonzero = 0)
pmi_the = compute_association(subMTX, nfreq=nfreq, cfreq=cfreq, meas = 'pmi')
pmi_the
************************************
function = compute_association
time = 0.01555 sec
************************************
[22]:
[1, 53] 's/P ,/, a/D about/I about/R all/P an/D ...
the/D 0.030027887 -0.611826 -0.43997574 -0.15229367 0.030027887 0.25317144 -1.0685844 ...
5. Basic token level¶
The first step to collecting tokens is selecting the types from which you will collect them. The lines below set a query just for ‘girl/N’; if you want to use more lemmas, just include them in the list, e.g. vocab.subvocab(['girl/N', 'boy/N']).
[23]:
query = vocab.subvocab(["girl/N"])
We will first look at a bag-of-words method, and then at three different dependency-based methods:
Lemmarel, where the dependency is selected by a specific set of relationships and the context feature is a lemma.
Lemmapath, where the dependency is selected based on the number of steps on the dependency path (like above) and the context feature is a lemma.
Deppath, where the dependency is selected based on the number of steps on the dependency path but the full dependency relation is the context feature. (Deprel is of course also possible but I will not show it.)
5.1. Bag-of-words¶
As always, collecting tokens takes two steps: setting up the TokenHandler class and then running the retrieve_tokens() method (there are other alternative methods too). The query argument of the class is a Vocab object with the types from which we want the tokens. Among the important settings you might want to reconfigure are the window span (settings['left-span'] and settings['right-span']) and settings['single-boundary-machine'], a regular expression to match lines that correspond to sentence (or other) boundaries, such as ‘</s>’ in this case.
Next to fnames
, the method (as well as the class itself) includes a col_vocab
argument, which takes a Vocab
object, to select which context words can be captured (rather than, by default, all context words). The fnames
argument can be particularly useful here to avoid scanning all of a huge corpus if you only want a few hundred tokens.
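A minimal sketch of those options (left commented out so it does not interfere with the cells below; the variable names are just illustrative):
# settings['left-span'] = 2   # narrower window
# settings['right-span'] = 2
# nouns_only = vocab.subvocab([i for i in vocab.get_item_list() if i.endswith('/N')])
# tokhan_nouns = TokenHandler(query, settings=settings, col_vocab=nouns_only)
# noun_tokens = tokhan_nouns.retrieve_tokens()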
[24]:
tokhan = TokenHandler(query, settings=settings)
tokens = tokhan.retrieve_tokens()
tokens
WARNING: Not provide the temporary path!
WARNING: Use the default tmp directory: '~/tmp'!
Scanning tokens of queries in corpus...
[24]:
[21, 39] which/W say/V she/P boy/N this/D about/I give/V ...
girl/N/StanfDepSents.10/13 NaN NaN NaN NaN NaN NaN NaN ...
girl/N/StanfDepSents.10/19 NaN NaN NaN NaN NaN NaN NaN ...
girl/N/StanfDepSents.11/3 NaN NaN NaN 4 NaN NaN NaN ...
girl/N/StanfDepSents.11/19 -2 NaN NaN NaN NaN NaN 1 ...
girl/N/StanfDepSents.11/28 NaN NaN NaN 4 -4 NaN NaN ...
girl/N/StanfDepSents.7/7 NaN NaN NaN -3 NaN NaN -2 ...
girl/N/StanfDepSents.7/25 NaN NaN NaN -3 NaN 1 NaN ...
... ... ... ... ... ... ... ... ...
5.2. Lemmarel¶
In the first dependency-based model, we will look at templates where a noun is the target and the features can be the verb of which it is the subject or direct object, its modifier, or an item on which it depends via a preposition.
[25]:
rel_graphml_fname = f"{mydir}/templates/LEMMAREL.template.graphml"
rel_patterns = PatternGraph.read_graphml(rel_graphml_fname)
rel_macro_fname = f"{mydir}/templates/LEMMAREL.target-feature-macro.xml"
rel_macros = MacroGraph.read_xml(rel_macro_fname, rel_patterns)
Like in all other dependency-based models, we first create an object of the DepRelHandler
class (with either mode ‘type’ or, in this case, ‘token’) and then we give the macros to the .read_templates()
method.
[26]:
rel_dephan = DepRelHandler(settings, workers=4, targets=query, mode='token')
rel_dephan.read_templates(macros=rel_macros)
rel_tokens = rel_dephan.build_dependency()
rel_tokens
WARNING: Not provide the temporary path!
WARNING: Use the default tmp directory: '~/tmp'!
Building dependency features...
Building matrix...
[26]:
[15, 6] ask/V at/I eat/V give/V look/V sit/V
girl/N/StanfDepSents.1/13 NaN 1 NaN NaN NaN NaN
girl/N/StanfDepSents.1/20 NaN NaN 1 NaN NaN NaN
girl/N/StanfDepSents.1/3 NaN NaN NaN NaN 1 NaN
girl/N/StanfDepSents.10/13 NaN NaN NaN NaN NaN 1
girl/N/StanfDepSents.10/19 NaN NaN 1 NaN NaN NaN
girl/N/StanfDepSents.11/19 NaN NaN NaN 1 NaN NaN
girl/N/StanfDepSents.11/28 NaN NaN NaN NaN 1 NaN
... ... ... ... ... ... ...
5.3. Lemmapath¶
This is the token-level counterpart of the type-level model shown above: the rows are individual instances instead of type-level vectors.
[27]:
path_dephan = DepRelHandler(settings, workers=4, targets=query, mode='token')
path_dephan.read_templates(macros=path_macros)
path_tokens = path_dephan.build_dependency()
path_tokens
WARNING: Not provide the temporary path!
WARNING: Use the default tmp directory: '~/tmp'!
Building dependency features...
Building matrix...
[27]:
[21, 12] 's/P apple/N ask/V at/I boy/N by/I eat/V ...
girl/N/StanfDepSents.1/13 NaN NaN NaN 1 NaN NaN NaN ...
girl/N/StanfDepSents.1/20 NaN NaN NaN NaN NaN NaN 1 ...
girl/N/StanfDepSents.1/3 NaN NaN NaN NaN NaN NaN NaN ...
girl/N/StanfDepSents.10/13 NaN NaN NaN NaN NaN NaN NaN ...
girl/N/StanfDepSents.10/19 NaN NaN NaN NaN NaN NaN 1 ...
girl/N/StanfDepSents.11/19 NaN NaN NaN NaN NaN NaN NaN ...
girl/N/StanfDepSents.11/28 NaN NaN NaN NaN NaN NaN NaN ...
... ... ... ... ... ... ... ... ...
5.4. Deppath¶
This is the token-level counterpart of the second type-level model shown above; that (transposed) type-level model would serve as the second-order matrix for this token-level matrix.
[28]:
pathfull_dephan = DepRelHandler(settings, workers=4, targets=query, mode='token')
pathfull_dephan.read_templates(macros=pathfull_macros)
pathfull_tokens = pathfull_dephan.build_dependency()
pathfull_tokens
WARNING: Not provide the temporary path!
WARNING: Use the default tmp directory: '~/tmp'!
Building dependency features...
Building matrix...
[28]:
[21, 12] #T#->*:'s/P #T#->*:old/J #T#->*:the/D apple/N->*:#T# ask/V->*:#T# at/I->*:#T# boy/N->*:#T# ...
girl/N/StanfDepSents.1/13 NaN NaN 1 NaN NaN 1 NaN ...
girl/N/StanfDepSents.1/20 NaN NaN 1 NaN NaN NaN NaN ...
girl/N/StanfDepSents.1/3 NaN NaN 1 NaN NaN NaN NaN ...
girl/N/StanfDepSents.10/13 NaN NaN 1 NaN NaN NaN NaN ...
girl/N/StanfDepSents.10/19 NaN NaN 1 NaN NaN NaN NaN ...
girl/N/StanfDepSents.11/19 NaN NaN 1 NaN NaN NaN NaN ...
girl/N/StanfDepSents.11/28 NaN NaN 1 NaN NaN NaN NaN ...
... ... ... ... ... ... ... ... ...
6. Full token level¶
6.1. Weight context words¶
The matrices from the previous step have positions or counts as values. Before replacing the context words with their type-level vectors, we might want to weight them with some association measure, so that context words that are more attracted to the target have a larger influence on the final position of the token with which they co-occur. For that purpose we use compute_token_weights() with a weight matrix (e.g. with positive PMI) that includes the target type in its rows and the context words of the tokens in its columns.
[29]:
subMTX = freqMTX.submatrix(row = query.get_item_list(), col = tokens.col_items).drop(axis = 1) #Of course, it's best to check for the intersection...
weighter = compute_association(subMTX, nfreq=nfreq, cfreq=cfreq, meas = 'ppmi')
weighter
************************************
function = compute_association
time = 0.01584 sec
************************************
[29]:
[1, 39] which/W say/V she/P boy/N this/D about/I give/V ...
girl/N 0.3507146 0.6383967 0.12757105 0.0 1.0438617 0.6383967 0.23293155 ...
[30]:
weighted = compute_token_weights(tokens, weighter)
6.2. Second-order dimensions¶
The final step to obtain token-level vectors is to replace the (weighted) context words with their type-level vectors, by means of the compute_token_vectors() function.
The first two arguments of this function, tcWeightMTX and soccMTX, are the token-level and second-order type-level matrices involved. Next to them there is an operation argument to decide how to merge the type-level vectors of the context words to form the token-level vector: by default it is addition, but it could also be multiplication or a weighted mean. In addition, the normalization argument, with L1 as default, sets whether and how the vectors should be normalized.
The second-order matrix has to have, as rows, the columns of the token-level matrix, while the columns will be the final dimensions.
[31]:
socMTX = ppmiMTX.submatrix(row = weighted.col_items).drop(axis = 1)
socMTX
[31]:
[39, 55] 's/P ,/, a/D about/I about/R all/P an/D ...
which/W NaN NaN NaN NaN NaN NaN NaN ...
say/V NaN NaN NaN NaN NaN NaN NaN ...
she/P NaN NaN NaN NaN NaN NaN NaN ...
boy/N NaN NaN 0.07956182 0.59038746 NaN 0.99585253 0.772709 ...
this/D NaN 2.9034438 NaN NaN NaN NaN NaN ...
about/I NaN NaN NaN NaN NaN NaN NaN ...
give/V NaN NaN 0.65492594 NaN NaN NaN 0.94260806 ...
... ... ... ... ... ... ... ... ...
[32]:
tokvecs = compute_token_vectors(weighted, socMTX, operation='weightedmean')
tokvecs
Operation: weighted mean 'token-feature weight matrix' X 'socc matrix'...
[32]:
[21, 55] 's/P ,/, a/D about/I about/R all/P an/D ...
girl/N/StanfDepSents.10/13 0.0008 NaN NaN NaN 0.0008 0.0066 NaN ...
girl/N/StanfDepSents.10/19 0.0119 NaN 0.0041 0.0118 0.2003 0.0035 0.0130 ...
girl/N/StanfDepSents.11/3 0.1028 0.0279 0.0251 NaN 0.0006 0.0048 NaN ...
girl/N/StanfDepSents.11/19 0.0110 NaN 0.0117 0.0088 0.0110 0.0032 0.0220 ...
girl/N/StanfDepSents.11/28 0.0533 0.1544 0.0130 NaN 0.0003 0.0025 NaN ...
girl/N/StanfDepSents.7/7 0.0141 0.0302 0.0197 0.0113 0.0141 0.0283 0.0412 ...
girl/N/StanfDepSents.7/25 0.0145 NaN 0.0050 0.1741 0.0179 0.0042 0.0158 ...
... ... ... ... ... ... ... ... ...
[33]:
tokvecs.save(f"{output_path}/{corpus_name}.ttmx.ppmi.pac")
Saving matrix...
Stored in file:
.//output//Toy.ttmx.ppmi.pac
7. Cosine distances¶
The final element, tokvecs, is the actual token-level matrix we are interested in. We could use it to average vectors over a set of tokens or directly compute the distances or similarities between the vectors. See the documentation for options on different measures.
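For instance, a summed ‘prototype’ vector over all the tokens can be obtained by reusing the .sum(axis=0) call from the marginal-frequency step above; a minimal sketch (for cosine comparisons the missing division by the number of tokens is irrelevant, and the exact return type may vary between versions):
girl_prototype = tokvecs.sum(axis=0)  # one summed vector over all 'girl/N' tokens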
[34]:
tokdists = compute_distance(tokvecs)
tokdists
************************************
function = compute_distance
time = 0.02593 sec
************************************
[34]:
[21, 21] girl/N/StanfDepSents.10/13 girl/N/StanfDepSents.10/19 girl/N/StanfDepSents.11/3 girl/N/StanfDepSents.11/19 girl/N/StanfDepSents.11/28 girl/N/StanfDepSents.7/7 girl/N/StanfDepSents.7/25 ...
girl/N/StanfDepSents.10/13 0.0000 0.8721 0.8736 0.8654 0.8920 0.7566 0.8227 ...
girl/N/StanfDepSents.10/19 0.8721 0.0000 0.9064 0.8111 0.8930 0.7274 0.6725 ...
girl/N/StanfDepSents.11/3 0.8736 0.9064 0.0000 0.8789 0.3427 0.6394 0.8348 ...
girl/N/StanfDepSents.11/19 0.8654 0.8111 0.8789 0.0000 0.8652 0.4694 0.7922 ...
girl/N/StanfDepSents.11/28 0.8920 0.8930 0.3427 0.8652 0.0000 0.5428 0.8596 ...
girl/N/StanfDepSents.7/7 0.7566 0.7274 0.6394 0.4694 0.5428 0.0000 0.6970 ...
girl/N/StanfDepSents.7/25 0.8227 0.6725 0.8348 0.7922 0.8596 0.6970 0.0000 ...
... ... ... ... ... ... ... ... ...
[35]:
tokdists.save(f"{output_path}/{corpus_name}.ttmx.dist.pac")
Saving matrix...
Stored in file:
.//output//Toy.ttmx.dist.pac
Of course, this could be used to compute the distances between the context words themselves!
[36]:
focdists = compute_distance(socMTX)
focdists
************************************
function = compute_distance
time = 0.003047 sec
************************************
[36]:
[39, 39] which/W say/V she/P boy/N this/D about/I give/V ...
which/W 0.0000 0.9714 0.9932 0.9494 0.9750 0.9251 0.5509 ...
say/V 0.9714 0.0000 0.9934 0.8941 0.9592 0.9578 0.9752 ...
she/P 0.9932 0.9934 0.0000 0.7166 0.9929 0.9878 0.9968 ...
boy/N 0.9494 0.8941 0.7166 0.0000 0.9954 0.8361 0.7142 ...
this/D 0.9750 0.9592 0.9929 0.9954 0.0000 0.9594 0.9838 ...
about/I 0.9251 0.9578 0.9878 0.8361 0.9594 0.0000 0.9780 ...
give/V 0.5509 0.9752 0.9968 0.7142 0.9838 0.9780 0.0000 ...
... ... ... ... ... ... ... ... ...