The full Python workflow for semasiological token-level clouds

This notebook walks you through all the steps, from setting up the code based on your corpus to getting token-level distance matrices. They can be based on individual lemmas (like here) or on groups of related lemmas (like here).

0. Initial setup

The first step is to load some packages, including both nephosem and semasioFlow. Then we import functions directly from semasioFlow (although ConfigLoader is extracted from nephosem).

import os
import sys
import logging
sys.path.append("../../nephosem/") # my path to nephosem
sys.path.append("../semasioFlow/") # my path to semasioFlow
from semasioFlow import ConfigLoader
from semasioFlow.load import loadVocab, loadMacro, loadColloc, loadFocRegisters
from semasioFlow.sample import sampleTypes
from semasioFlow.focmodels import createBow, createRel, createPath
from semasioFlow.socmodels import targetPPMI, weightTokens, createSoc
from semasioFlow.utils import plotPatterns

1. Configuration

Depending on what you need, you will have to set up some useful path settings. I like to have at least the path to my project (mydir), an output path within it (mydir + "output") and a GitHub path for the datasets that I will use in the visualization. There is no real reason not to have everything together, except that I did not think of it at the moment. (Actually, there is: the GitHub stuff will be public and huge data would not be included. How much do we want to have public?)

mydir = "./"
output_path = f"{mydir}/output/create-clouds/"
nephovis_path = f"{mydir}/for-nephovis/"
logging.basicConfig(filename = f'{mydir}/testlog.log', level = logging.DEBUG)
necessary_subfolders = ['vocab', 'cws', 'registers', 'tokens']
for sf in necessary_subfolders:
    if not os.path.exists(output_path + sf):
        os.makedirs(output_path + sf)
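
If you prefer not to glue paths together by hand, the same folder setup could be sketched with os.path.join and the exist_ok option of os.makedirs. This is only an illustration: ensure_subfolders is a hypothetical helper, not part of semasioFlow, and the demo runs in a temporary directory instead of the real project folder.

```python
import os
import tempfile

def ensure_subfolders(base, subfolders):
    """Create base/<subfolder> for each name, skipping the ones that already exist."""
    created = []
    for name in subfolders:
        path = os.path.join(base, name)
        os.makedirs(path, exist_ok=True)  # no error if the folder is already there
        created.append(path)
    return created

# Demonstration in a throwaway directory (assumption: you would use mydir instead)
demo_root = tempfile.mkdtemp()
demo_output = os.path.join(demo_root, "output", "create-clouds")
paths = ensure_subfolders(demo_output, ["vocab", "cws", "registers", "tokens"])
```

Running it a second time is harmless, which is the point of exist_ok.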

The variables with paths are just meant to make it easier to manipulate filenames. The most important concrete step is to adapt the configuration file.

conf = ConfigLoader()
settings = conf.update_config('config.ini')
settings['output-path'] = output_path

corpus_name = 'Toy'
print(settings['type'], settings['colloc'], settings['token'])
lemma/pos lemma/pos lemma/pos/fid/lid
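
The last value printed above gives the shape of the token IDs used in the rest of the notebook, e.g. boy/N/StanfDepSents.1/9. As a minimal sketch, assuming that fid and lid stand for file and line identifiers, such an ID could be taken apart like this (split_token_id is a hypothetical helper, not a nephosem function):

```python
def split_token_id(token_id):
    """Split an ID shaped like 'lemma/pos/fid/lid' into its four parts.

    Splitting from the right keeps the result stable even if a lemma
    happened to contain a slash.
    """
    type_part, fid, lid = token_id.rsplit("/", 2)
    lemma, pos = type_part.rsplit("/", 1)
    return {"lemma": lemma, "pos": pos, "fid": fid, "lid": lid}

parts = split_token_id("boy/N/StanfDepSents.1/9")
```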

2. Frequency lists

The frequency lists are the first thing to create, but once you have them, you only need to load them. So what we are going to do here is define the filename where the frequency list is stored (or will be stored): if the file exists, the list is loaded from it; if it doesn’t, the list is created and stored. After generating a full frequency list, we might want to filter it and store different versions in the vocab folder.

full_name = f"{output_path}/vocab/{corpus_name}.nodefreq"
full = loadVocab(full_name, settings)
[('the/D', 53),('boy/N', 25),('eat/V', 22) ... ('ten/C', 1),('ask/V', 1),('about/I', 1)]
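
The load-it-if-it-exists logic described above can be sketched in plain Python. This is not how loadVocab is implemented; load_or_create and count_types are hypothetical names, the toy counts are made up, and JSON is used only to keep the example self-contained.

```python
import json
import os
import tempfile
from collections import Counter

def load_or_create(path, build):
    """Return the object cached at `path`, building and storing it first if needed."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    result = build()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(result, f)
    return result

def count_types():
    # Hypothetical stand-in for the expensive pass over the corpus
    toy_tokens = ["the/D", "boy/N", "eat/V", "the/D", "girl/N", "the/D"]
    return dict(Counter(toy_tokens))

cache_file = os.path.join(tempfile.mkdtemp(), "Toy.nodefreq.json")
first = load_or_create(cache_file, count_types)   # computed and written to disk
second = load_or_create(cache_file, count_types)  # read back from disk

# A filtered version that could be stored under a different name
frequent = {item: freq for item, freq in first.items() if freq > 1}
```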

3. Boolean token-level matrices

Even though we first think of the type level and only afterwards of the token level, with this workflow we don’t really need to touch the type level until after we obtain the boolean token-level matrices, that is, until we need to use PPMI values to select or weight the context words.
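
Since PPMI is mentioned here before it is computed, a quick reminder may help: it is pointwise mutual information with negative values set to zero. The sketch below works from raw counts; the function name, the argument names and the use of the natural logarithm are assumptions for illustration and do not describe nephosem's own computation.

```python
import math

def ppmi(joint, target_freq, context_freq, total):
    """Positive PMI from raw counts.

    joint: co-occurrence count of target and context word
    target_freq, context_freq: marginal counts of each item
    total: sum of all co-occurrence counts
    """
    if joint == 0:
        return 0.0  # no co-occurrence, so no positive association
    pmi = math.log((joint * total) / (target_freq * context_freq))
    return max(0.0, pmi)

attracted = ppmi(10, 20, 50, 1000)   # observed more often than expected
repelled = ppmi(1, 100, 100, 1000)   # observed less often than expected, clipped to 0
```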

As a first step, we need the type or list of types we want to run; for example "heet/adj" or ["vernietig/verb", "verniel/verb"], and we subset the vocabulary for that query.

query = full.subvocab(["girl/N", "boy/N"])
type_name = "child"
[('girl/N', 21),('boy/N', 25)]

We could collect all the tokens of the query (which in a real sample could be in the thousands), or draw a random selection of a certain size and then only use the files that contain it. The output of semasioFlow.sampleTypes() includes a list of token IDs as well as the list of filenames that suffices to extract those tokens. We can then use the new list of filenames when we collect tokens, and the list of tokens to subset the resulting matrices. In addition, the optional argument concordance can take a filename to which a raw concordance based on the current settings can be stored (see this tutorial).

Of course, to keep the sample fixed it would be more useful to generate the list, store it and then retrieve it in future runs.

NOTE: By default, semasioFlow.sampleTypes() will only sample one token of each type per file. To override this, add oneperfile = False.

import json
import os.path

fnames = [settings['corpus-path'] + '/' + x for x in os.listdir(settings['corpus-path'])]
tokenlist_fname = f"{mydir}/filelist.json"
if os.path.exists(tokenlist_fname):
    # reuse the stored sample so that it stays fixed across runs
    with open(tokenlist_fname, "r") as f:
        tokenlist, fnameSample = json.load(f).values()
else:
    # draw a new sample and store it for future runs
    tokenlist, fnameSample = sampleTypes({'girl/N' : 20, 'boy/N' : 20}, fnames, settings, oneperfile = False)
    with open(tokenlist_fname, "w") as f:
        json.dump({"tokenlist" : tokenlist, "fnames" : fnameSample}, f)
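
To see what the one-token-per-file restriction amounts to, here is a toy sampler in plain Python. It is a sketch under assumptions, not the code behind sampleTypes: sample_tokens, its arguments and the fixed seed are hypothetical, and the token IDs are invented but follow the lemma/pos/fid/lid shape.

```python
import random
from collections import defaultdict

def sample_tokens(token_ids, n_per_type, one_per_file=True, seed=42):
    """Pick up to n token IDs per type and report the files they come from."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_type = defaultdict(list)
    for token_id in token_ids:
        type_name, fid, _lid = token_id.rsplit("/", 2)
        by_type[type_name].append((fid, token_id))

    sample, files = [], set()
    for type_name, n in n_per_type.items():
        candidates = by_type.get(type_name, [])
        if one_per_file:
            # keep only the first token of this type found in each file
            seen, unique = set(), []
            for fid, token_id in candidates:
                if fid not in seen:
                    seen.add(fid)
                    unique.append((fid, token_id))
            candidates = unique
        chosen = rng.sample(candidates, min(n, len(candidates)))
        sample.extend(token_id for _fid, token_id in chosen)
        files.update(fid for fid, _token_id in chosen)
    return sorted(sample), sorted(files)

ids = ["boy/N/file1/3", "boy/N/file1/9", "boy/N/file2/4", "girl/N/file2/1"]
tokens, files = sample_tokens(ids, {"boy/N": 5, "girl/N": 5})
all_tokens, _ = sample_tokens(ids, {"boy/N": 5, "girl/N": 5}, one_per_file=False)
```

With the restriction on, the second boy token of file1 can never be selected; with it off, all four tokens are available.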

3.1 Bag-of-words

The code to generate one matrix is very straightforward, but what if we want to use different combinations of parameter settings to create multiple matrices?

The code below assumes that the boolean BOW matrices may vary across three parameters:

  • foc_win: window size, which is set with numbers for the left and right windows. By default, it takes the values from the settings above.

  • foc_pos: part-of-speech filter, which will actually be set as a previously filtered list of context words. By default, all context words are included.

  • bound: the match for sentence boundaries and whether the models respond to them or not. By default, sentence boundaries are ignored.

nounverbs = [x for x in full.get_item_list() if x.rsplit("/", 1)[1] in ["N", "V"]]
foc_win = [(3, 3), (5, 5), (5, 3)] # optional window sizes
foc_pos = {
    "all" : full[full.freq > 2].get_item_list(), # only frequency filter
    "nounverbs" : nounverbs # only nouns and verbs
}
bound = { "match" : "^</s>$", "values" : [True, False]}
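
With three window sizes, two part-of-speech filters and two boundary values, these settings describe 12 models. The combinations can be enumerated with itertools.product, as in the sketch below. The model names are a hypothetical naming scheme for illustration and may not match the names that semasioFlow generates.

```python
from itertools import product

foc_win = [(3, 3), (5, 5), (5, 3)]
foc_pos_labels = ["all", "nounverbs"]  # only the labels matter for naming
bound_values = [True, False]

models = []
for (left, right), pos, bound in product(foc_win, foc_pos_labels, bound_values):
    # Hypothetical naming scheme: boundary setting, window, part-of-speech filter
    name = f"{'bound' if bound else 'nobound'}{left}-{right}{pos}"
    models.append({
        "_model": name,
        "bound": bound,
        "foc_win": f"{left}-{right}",
        "foc_pos": pos,
    })

model_names = [m["_model"] for m in models]
```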

The function below combines a number of necessary functions:

  • it creates a loop over the different combinations of parameter settings specified

  • it collects the tokens and computes and filters the corresponding matrices

  • it transforms the matrices into “boolean” integer matrices, with only 0’s and 1’s

  • it stores the matrices in their respective files

  • it records the combinations of parameter settings and which values are taken by each model

  • it records the context words captured by each model for each token

  • it returns both records to be stored wherever you want
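
To make the middle steps of this list more concrete, here is a toy sketch of how one model could capture the context words of one token and turn them into a row of 0's and 1's. It does not reproduce createBow: captured_context and boolean_row are hypothetical helpers, the context lists are invented, and letting a boundary marker take up a window position is an assumption.

```python
import re

BOUNDARY = re.compile(r"^</s>$")  # same pattern as the "match" value above

def captured_context(left, right, window, allowed, bound):
    """Context words of one token captured by one model.

    left and right list the context items outward from the target.
    """
    def take(side, size):
        kept = []
        for item in side[:size]:
            if BOUNDARY.match(item):
                if bound:
                    break     # the model respects the boundary: stop here
                continue      # the model ignores the boundary: skip the marker
            if item in allowed:
                kept.append(item)
        return kept
    return set(take(left, window[0]) + take(right, window[1]))

def boolean_row(captured, vocabulary):
    """One row of a 0/1 token-by-context-word matrix."""
    return [1 if cw in captured else 0 for cw in vocabulary]

left = ["the/D", "</s>", "healthy/J", "look/V"]
right = ["look/V", "at/I", "the/D", "girl/N"]
vocabulary = ["at/I", "girl/N", "healthy/J", "look/V", "the/D"]

with_bound = captured_context(left, right, (3, 3), set(vocabulary), bound=True)
without_bound = captured_context(left, right, (3, 3), set(vocabulary), bound=False)
row_bound = boolean_row(with_bound, vocabulary)
row_nobound = boolean_row(without_bound, vocabulary)
```

The two rows differ only in healthy/J, which sits on the other side of the sentence boundary.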

bowdata = createBow(query, settings, type_name = type_name, fnames = fnameSample, tokenlist = tokenlist,
        foc_win = foc_win, foc_pos = foc_pos, bound = bound)
bowdata.to_csv(f"{output_path}/registers/{type_name}.bow-models.tsv", sep = "\t", index_label = "_model")

For the rest, we go to R!

The R code is in the processClouds notebook, which uses the semcloud package.

Bonus: context word detail

from semasioFlow.contextwords import listContextwords
cws = listContextwords(type_name, tokenlist, fnameSample, settings, left_win=15, right_win = 15)
100%|██████████| 11/11 [00:00<00:00, 417.83it/s]
cw deprel distance head id lemma path pos position rep_path same_sentence side steps target_lemma token_id word
boy/N/StanfDepSents.1/9/L0 the/D det 1 2 1 the #T->det:the D L0 #T->det:Cw True L 1 child boy/N/StanfDepSents.1/9 The
boy/N/StanfDepSents.1/9/L1 NaN <s id="2"> 2 <s id="2"> <s id="2"> <s id="2"> NaN <s id="2"> L1 NaN False L NaN child boy/N/StanfDepSents.1/9 <s id="2">
boy/N/StanfDepSents.1/9/L2 NaN </s> 3 </s> </s> </s> NaN </s> L2 NaN False L NaN child boy/N/StanfDepSents.1/9 </s>
boy/N/StanfDepSents.1/9/L3 healthy/J acomp 4 3 4 healthy NaN J L3 NaN False L NaN child boy/N/StanfDepSents.1/9 healthy
boy/N/StanfDepSents.1/9/L4 look/V ROOT 5 0 3 look NaN V L4 NaN False L NaN child boy/N/StanfDepSents.1/9 looks
boy/N/StanfDepSents.1/9/L5 girl/N nsubj 6 3 2 girl NaN N L5 NaN False L NaN child boy/N/StanfDepSents.1/9 girl
boy/N/StanfDepSents.1/9/L6 the/D det 7 2 1 the NaN D L6 NaN False L NaN child boy/N/StanfDepSents.1/9 The
boy/N/StanfDepSents.1/9/L7 NaN <s id="1"> 8 <s id="1"> <s id="1"> <s id="1"> NaN <s id="1"> L7 NaN False L NaN child boy/N/StanfDepSents.1/9 <s id="1">
boy/N/StanfDepSents.1/9/R0 look/V ROOT 1 0 3 look look->nsubj:#T V R0 Cw->nsubj:#T True R 1 child boy/N/StanfDepSents.1/9 looks
boy/N/StanfDepSents.1/9/R1 at/I prep 2 3 4 at look->[nsubj:#T,prep:at] I R1 look->[nsubj:#T,prep:Cw] True R 2 child boy/N/StanfDepSents.1/9 at
boy/N/StanfDepSents.1/9/R10 girl/N nsubj 11 3 2 girl NaN N R10 NaN False R NaN child boy/N/StanfDepSents.1/9 girl
boy/N/StanfDepSents.1/9/R11 eat/V ROOT 12 0 3 eat NaN V R11 NaN False R NaN child boy/N/StanfDepSents.1/9 eats
boy/N/StanfDepSents.1/9/R12 less/R advmod 13 5 4 less NaN R R12 NaN False R NaN child boy/N/StanfDepSents.1/9 less
boy/N/StanfDepSents.1/9/R13 healthy/J amod 14 6 5 healthy NaN J R13 NaN False R NaN child boy/N/StanfDepSents.1/9 healthy
boy/N/StanfDepSents.1/9/R14 food/N dobj 15 3 6 food NaN N R14 NaN False R NaN child boy/N/StanfDepSents.1/9 food
boy/N/StanfDepSents.1/9/R2 the/D det 3 6 5 the look->[nsubj:#T,prep:at->pobj:girl->det:the] D R2 look->[nsubj:#T,prep:at->pobj:girl->det:Cw] True R 4 child boy/N/StanfDepSents.1/9 the
boy/N/StanfDepSents.1/9/R3 girl/N pobj 4 4 6 girl look->[nsubj:#T,prep:at->pobj:girl] N R3 look->[nsubj:#T,prep:at->pobj:Cw] True R 3 child boy/N/StanfDepSents.1/9 girl
boy/N/StanfDepSents.1/9/R4 as/I mark 5 9 7 as look->[nsubj:#T,advcl:eat->mark:as] I R4 look->[nsubj:#T,advcl:eat->mark:Cw] True R 3 child boy/N/StanfDepSents.1/9 as
boy/N/StanfDepSents.1/9/R5 she/P nsubj 6 9 8 she look->[nsubj:#T,advcl:eat->nsubj:she] P R5 look->[nsubj:#T,advcl:eat->nsubj:Cw] True R 3 child boy/N/StanfDepSents.1/9 she
boy/N/StanfDepSents.1/9/R6 eat/V advcl 7 3 9 eat look->[nsubj:#T,advcl:eat] V R6 look->[nsubj:#T,advcl:Cw] True R 2 child boy/N/StanfDepSents.1/9 eats
boy/N/StanfDepSents.1/9/R7 NaN </s> 8 </s> </s> </s> NaN </s> R7 NaN False R NaN child boy/N/StanfDepSents.1/9 </s>
boy/N/StanfDepSents.1/9/R8 NaN <s id="3"> 9 <s id="3"> <s id="3"> <s id="3"> NaN <s id="3"> R8 NaN False R NaN child boy/N/StanfDepSents.1/9 <s id="3">
boy/N/StanfDepSents.1/9/R9 the/D det 10 2 1 the NaN D R9 NaN False R NaN child boy/N/StanfDepSents.1/9 The
boy/N/StanfDepSents.1/9/target boy/N nsubj 0 3 2 boy #T N target #T True target 0 child boy/N/StanfDepSents.1/9 boy
boy/N/StanfDepSents.12/12/L0 the/D det 1 2 1 the #T->det:the D L0 #T->det:Cw True L 1 child boy/N/StanfDepSents.12/12 The
boy/N/StanfDepSents.12/12/L1 NaN <s id="35"> 2 <s id="35"> <s id="35"> <s id="35"> NaN <s id="35"> L1 NaN False L NaN child boy/N/StanfDepSents.12/12 <s id="35">
boy/N/StanfDepSents.12/12/L10 NaN <s id="34"> 11 <s id="34"> <s id="34"> <s id="34"> NaN <s id="34"> L10 NaN False L NaN child boy/N/StanfDepSents.12/12 <s id="34">
boy/N/StanfDepSents.12/12/L2 NaN </s> 3 </s> </s> </s> NaN </s> L2 NaN False L NaN child boy/N/StanfDepSents.12/12 </s>
boy/N/StanfDepSents.12/12/L3 tasty/J acomp 4 6 7 tasty NaN J L3 NaN False L NaN child boy/N/StanfDepSents.12/12 tasty
boy/N/StanfDepSents.12/12/L4 be/V ROOT 5 0 6 be NaN V L4 NaN False L NaN child boy/N/StanfDepSents.12/12 are
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
girl/N/StanfDepSents.8/3/target girl/N nsubj 0 3 2 girl #T N target #T True target 0 child girl/N/StanfDepSents.8/3 girl
girl/N/StanfDepSents.9/14/L0 old/J amod 1 3 2 old #T->amod:old J L0 #T->amod:Cw True L 1 child girl/N/StanfDepSents.9/14 older
girl/N/StanfDepSents.9/14/L1 the/D det 2 3 1 the #T->det:the D L1 #T->det:Cw True L 1 child girl/N/StanfDepSents.9/14 The
girl/N/StanfDepSents.9/14/L10 the/D det 11 4 2 the NaN D L10 NaN False L NaN child girl/N/StanfDepSents.9/14 the
girl/N/StanfDepSents.9/14/L11 all/P predet 12 4 1 all NaN P L11 NaN False L NaN child girl/N/StanfDepSents.9/14 All
girl/N/StanfDepSents.9/14/L12 NaN <s id="25"> 13 <s id="25"> <s id="25"> <s id="25"> NaN <s id="25"> L12 NaN False L NaN child girl/N/StanfDepSents.9/14 <s id="25">
girl/N/StanfDepSents.9/14/L2 NaN <s id="26"> 3 <s id="26"> <s id="26"> <s id="26"> NaN <s id="26"> L2 NaN False L NaN child girl/N/StanfDepSents.9/14 <s id="26">
girl/N/StanfDepSents.9/14/L3 NaN </s> 4 </s> </s> </s> NaN </s> L3 NaN False L NaN child girl/N/StanfDepSents.9/14 </s>
girl/N/StanfDepSents.9/14/L4 apple/N dobj 5 6 8 apple NaN N L4 NaN False L NaN child girl/N/StanfDepSents.9/14 apple
girl/N/StanfDepSents.9/14/L5 an/D det 6 8 7 an NaN D L5 NaN False L NaN child girl/N/StanfDepSents.9/14 an
girl/N/StanfDepSents.9/14/L6 give/V ROOT 7 0 6 give NaN V L6 NaN False L NaN child girl/N/StanfDepSents.9/14 given
girl/N/StanfDepSents.9/14/L7 be/V auxpass 8 6 5 be NaN V L7 NaN False L NaN child girl/N/StanfDepSents.9/14 were
girl/N/StanfDepSents.9/14/L8 boy/N nsubjpass 9 6 4 boy NaN N L8 NaN False L NaN child girl/N/StanfDepSents.9/14 boys
girl/N/StanfDepSents.9/14/L9 old/J amod 10 4 3 old NaN J L9 NaN False L NaN child girl/N/StanfDepSents.9/14 old
girl/N/StanfDepSents.9/14/R0 look/V ROOT 1 0 4 look look->nsubj:#T V R0 Cw->nsubj:#T True R 1 child girl/N/StanfDepSents.9/14 looks
girl/N/StanfDepSents.9/14/R1 at/I prep 2 4 5 at look->[nsubj:#T,prep:at] I R1 look->[nsubj:#T,prep:Cw] True R 2 child girl/N/StanfDepSents.9/14 at
girl/N/StanfDepSents.9/14/R10 be/V ROOT 11 0 2 be NaN V R10 NaN False R NaN child girl/N/StanfDepSents.9/14 are
girl/N/StanfDepSents.9/14/R11 healthy/J acomp 12 2 3 healthy NaN J R11 NaN False R NaN child girl/N/StanfDepSents.9/14 healthy
girl/N/StanfDepSents.9/14/R12 for/I prep 13 3 4 for NaN I R12 NaN False R NaN child girl/N/StanfDepSents.9/14 for
girl/N/StanfDepSents.9/14/R13 boy/N pobj 14 4 5 boy NaN N R13 NaN False R NaN child girl/N/StanfDepSents.9/14 boys
girl/N/StanfDepSents.9/14/R14 NaN </s> 15 </s> </s> </s> NaN </s> R14 NaN False R NaN child girl/N/StanfDepSents.9/14 </s>
girl/N/StanfDepSents.9/14/R2 a/D det 3 7 6 a look->[nsubj:#T,prep:at->pobj:boy->det:a] D R2 look->[nsubj:#T,prep:at->pobj:boy->det:Cw] True R 4 child girl/N/StanfDepSents.9/14 a
girl/N/StanfDepSents.9/14/R3 boy/N pobj 4 5 7 boy look->[nsubj:#T,prep:at->pobj:boy] N R3 look->[nsubj:#T,prep:at->pobj:Cw] True R 3 child girl/N/StanfDepSents.9/14 boy
girl/N/StanfDepSents.9/14/R4 in/I prep 5 7 8 in look->[nsubj:#T,prep:at->pobj:boy->prep:in] I R4 look->[nsubj:#T,prep:at->pobj:boy->prep:Cw] True R 4 child girl/N/StanfDepSents.9/14 in
girl/N/StanfDepSents.9/14/R5 the/D det 6 10 9 the look->[nsubj:#T,prep:at->pobj:boy->prep:in->po... D R5 look->[nsubj:#T,prep:at->pobj:boy->prep:in->po... True R 6 child girl/N/StanfDepSents.9/14 the
girl/N/StanfDepSents.9/14/R6 house/N pobj 7 8 10 house look->[nsubj:#T,prep:at->pobj:boy->prep:in->po... N R6 look->[nsubj:#T,prep:at->pobj:boy->prep:in->po... True R 5 child girl/N/StanfDepSents.9/14 house
girl/N/StanfDepSents.9/14/R7 NaN </s> 8 </s> </s> </s> NaN </s> R7 NaN False R NaN child girl/N/StanfDepSents.9/14 </s>
girl/N/StanfDepSents.9/14/R8 NaN <s id="27"> 9 <s id="27"> <s id="27"> <s id="27"> NaN <s id="27"> R8 NaN False R NaN child girl/N/StanfDepSents.9/14 <s id="27">
girl/N/StanfDepSents.9/14/R9 apple/N nsubj 10 2 1 apple NaN N R9 NaN False R NaN child girl/N/StanfDepSents.9/14 Apples
girl/N/StanfDepSents.9/14/target girl/N nsubj 0 4 3 girl #T N target #T True target 0 child girl/N/StanfDepSents.9/14 girl

917 rows × 16 columns

cw_fname = f"{nephovis_path}/{type_name}/{type_name}.cws.detail.tsv"
os.makedirs(os.path.dirname(cw_fname), exist_ok = True) # to_csv does not create missing folders
cws.to_csv(cw_fname, sep = "\t", index_label = "cw_id")

From this table, it is relatively straightforward to extract concordances and highlight the context words that match certain filters. Note that by default the left contexts are in reverse order.
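
As a minimal sketch of that extraction, the function below rebuilds a concordance line from a few rows of one token. The rows are copied by hand from the table above (only the columns that are needed), concordance is a hypothetical helper, and the highlighting marks are an arbitrary choice. Left-hand rows get a negative offset, which undoes their reverse order.

```python
rows = [
    {"side": "L", "distance": 1, "word": "The", "pos": "D", "same_sentence": True},
    {"side": "L", "distance": 4, "word": "healthy", "pos": "J", "same_sentence": False},
    {"side": "R", "distance": 1, "word": "looks", "pos": "V", "same_sentence": True},
    {"side": "R", "distance": 2, "word": "at", "pos": "I", "same_sentence": True},
    {"side": "R", "distance": 3, "word": "the", "pos": "D", "same_sentence": True},
    {"side": "R", "distance": 4, "word": "girl", "pos": "N", "same_sentence": True},
    {"side": "target", "distance": 0, "word": "boy", "pos": "N", "same_sentence": True},
]

def concordance(rows, highlight_pos=("N", "V"), same_sentence_only=True):
    """Rebuild the text around a token and mark the selected context words."""
    def offset(row):
        if row["side"] == "L":
            return -row["distance"]  # left context is listed outward from the target
        return row["distance"]       # the target is 0, right context is positive
    if same_sentence_only:
        rows = [r for r in rows if r["same_sentence"]]
    pieces = []
    for row in sorted(rows, key=offset):
        if row["side"] == "target":
            pieces.append(f"[{row['word']}]")
        elif row["pos"] in highlight_pos:
            pieces.append(f"*{row['word']}*")
        else:
            pieces.append(row["word"])
    return " ".join(pieces)

line = concordance(rows)
wider = concordance(rows, same_sentence_only=False)
```

With the real table, the rows of one token could be obtained with something like cws[cws.token_id == "boy/N/StanfDepSents.1/9"].to_dict("records"), and the sentence boundary rows would need to be filtered out first.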