nephosem.models package

Submodules

nephosem.models.cbc module

Algorithms

Usage examples

Initialize a vocabulary with a Python dict e.g.

>>> from nephosem.algos import cbc

class nephosem.models.cbc.CBC(elements, freqmx, measmx=None, distmx=None, k=100, theta1=0.35, theta2=0.25, prune_method='distance', t='median', score_metric='without_size', highest_score=False, num_iter=1000, workers=-1)

Bases: object

cluster(num_eles=-1, multicore=True)

Main method of Cluster by Committee.

Parameters
  • num_eles (int, optional) – Cluster only the first num_eles elements. Note: normally used for testing.

  • multicore (bool, optional) – Use multicore method or not. Default True.

Returns

(Cs, Rs) – A list of committees and a list of residues (of each recursion).

Return type

tuple

nephosem.models.deprel module

class nephosem.models.deprel.DepRelHandler(settings, workers=0, targets=None, mode='type', features=None, **kwargs)

Bases: nephosem.core.handler.BaseHandler

Handler Class for processing dependency relations

build_dependency(fnames=None)

Build a dependency frequency matrix for corpus files provided.

Parameters
  • fnames (str, optional) – Path of file recording corpus file names (‘fnames’ file of a corpus). If this is provided, only the files recorded in this fnames file would be processed. Else, all files and folders inside the ‘corpus-path’ of settings would be processed.

  • targets (list of str or Vocab, optional) – Target types/words to process. If this is provided, only process these targets when matching the sentence with macros. Else, all possible targets would be checked when matching sentences.

Returns

features

Return type

iterable of TemplateGraph

Examples

# create a DepRelHandler instance
>>> freqMTX = dephan.build_dependency()
>>> freqMTX
[11, 24]  ->[agent_/->[nsubjpass_boy/N],_apple/N]  ->[agent_/->[nsubjpass_girl/N],_apple/N]  ->[nsubj_apple/N]  ->[nsubj_boy/N]  ->[nsubj_girl/N]  <-nsubj_/V->[acomp_healthy/JJ]  <-nsubj_/V->[acomp_old/JJ]  …
apple/N   NaN                                      NaN                                       NaN                NaN              NaN               3                               NaN                         …
ask/V     NaN                                      NaN                                       NaN                1                NaN               NaN                             NaN                         …
be/V      NaN                                      NaN                                       4                  2                NaN               NaN                             NaN                         …
boy/N     NaN                                      NaN                                       NaN                NaN              NaN               1                               1                           …
eat/V     1                                        1                                         NaN                7                6                 NaN                             NaN                         …
girl/N    NaN                                      NaN                                       NaN                NaN              NaN               1                               NaN                         …
give/V    NaN                                      NaN                                       NaN                3                1                 NaN                             NaN                         …
…         …                                        …                                         …                  …                …                 …                               …                           …

build_matrix_by_matches()

Build a frequency matrix by the matches.

process(fnames, **kwargs)
Parameters
  • queue_factor (int, optional) – Multiplier for the size of the queue: size = number of workers * queue_factor.
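
The sizing rule can be illustrated with the standard library. This is a minimal sketch, not the handler's own code: the helper name make_task_queue and the guard for workers=0 are assumptions made for the example.

```python
import queue


def make_task_queue(workers: int, queue_factor: int = 2) -> queue.Queue:
    """Create a bounded queue whose size is workers * queue_factor."""
    # Assumption: treat workers=0 as a single worker, because
    # queue.Queue(maxsize=0) would be unbounded.
    size = max(1, workers) * queue_factor
    return queue.Queue(maxsize=size)


q = make_task_queue(workers=4, queue_factor=3)
print(q.maxsize)
```

A larger queue_factor lets the producer read further ahead of the workers at the cost of holding more items in memory.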

read_templates(fname=None, macros=None, encoding='utf-8')

Read the templates from a CSV/TSV file. The file has lines of content like the following (including a header):

ID  Target Regex                  Feature Regex                                      Target Description  Feature Description  ID
1   (?P<LEMMA>\w+)/(?P<POS>N)\w*  <-(?P<DEPREL>nsubj)$ (?P<LEMMA>\w+)/(?P<POS>V)\w*  noun                subject of verb      1

Parameters
  • fname (str, optional) – File name of the templates file.

  • macros (iterable of TemplateGraph, optional) – TemplateGraph instances when not passing the file name.

  • encoding (str, default 'utf-8') – File encoding of the template file.

Raises

ValueError – If neither fname nor macros is provided.

Examples

>>> from copy import deepcopy
>>> import nephosem
>>> from nephosem.models.deprel import DepRelHandler
>>> # `settings` is assumed to have been prepared beforehand
>>> dephan = DepRelHandler(settings)
>>> template_fname = "{}/tests/data/DependencyFeatureTemplates.subgroup.tsv".format(nephosem.rootdir)
>>> dephan.read_templates(fname=template_fname)
>>> dephan.templates[0]
<-(?P<DEPREL>nsubj)$ (?P<LEMMA>\w+)/(?P<POS>V)\w*
>>> templates = deepcopy(dephan.templates)
>>> dephan.read_templates(macros=templates)
>>> dephan.templates[0]
<-(?P<DEPREL>nsubj)$ (?P<LEMMA>\w+)/(?P<POS>V)\w*
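
To show what the named groups in such a template capture, here is a minimal sketch using only Python's re module. It covers just the node part of the patterns; how the library handles the edge part (`<-(?P<DEPREL>nsubj)$`) is not shown here, and parse_node is a hypothetical helper, not part of nephosem.

```python
import re

# Node patterns taken from the template example; the backslashes of \w are written out.
TARGET_RE = re.compile(r"(?P<LEMMA>\w+)/(?P<POS>N)\w*")
FEATURE_RE = re.compile(r"(?P<LEMMA>\w+)/(?P<POS>V)\w*")


def parse_node(pattern, node):
    """Return the named groups if the whole node string matches, else None."""
    match = pattern.fullmatch(node)
    return match.groupdict() if match else None


print(parse_node(TARGET_RE, "apple/NN"))   # a noun satisfies the target pattern
print(parse_node(FEATURE_RE, "eat/VBD"))   # a verb satisfies the feature pattern
print(parse_node(TARGET_RE, "eat/VBD"))    # a verb does not satisfy the target pattern
```

Because POS captures only the first letter of the tag, fine-grained tags such as NN and NNS collapse into one coarse category.
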
update_dep_rel(fname, macros, **kwargs)
update_dep_rel_token(fname, macros, **kwargs)

This is the method that actually performs the processing. Procedure:

1. Read sentences from the corpus file.
2. For each sentence, match every template/pattern.
   2.1 If targets are provided, only match sentences that satisfy the targets.
   2.2 The matching is therefore a target-feature matching.
   2.3 This speeds up processing when targets are provided.

update_dep_rel_type(fname, macros, **kwargs)

This is the method that actually performs the processing. Procedure:

1. Read sentences from the corpus file.
2. For each sentence, match every template/pattern.
   2.1 If targets are provided, only match sentences that satisfy the targets.
   2.2 The matching is therefore a target-feature matching.
   2.3 This speeds up processing when targets are provided.
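
The procedure described above can be sketched in plain Python. Everything here is an assumption made for illustration: sentences are given as lists of (dependent, relation, head) edges, a template is a (target regex, relation, feature regex) triple, and the feature label format is invented. The library works on TemplateGraph objects instead.

```python
import re

# Hypothetical in-memory corpus: one list of dependency edges per sentence.
SENTENCES = [
    [("boy/NN", "nsubj", "eat/VBD"), ("apple/NN", "dobj", "eat/VBD")],
    [("girl/NN", "nsubj", "ask/VBD")],
]
# Hypothetical template: (target regex, relation name, feature regex).
TEMPLATES = [
    (re.compile(r"(?P<LEMMA>\w+)/(?P<POS>N)\w*"), "nsubj",
     re.compile(r"(?P<LEMMA>\w+)/(?P<POS>V)\w*")),
]


def count_matches(sentences, templates, targets=None):
    """Count (target, feature) pairs matched by the templates."""
    counts = {}
    for edges in sentences:                       # step 1: one sentence at a time
        for tgt_re, rel, feat_re in templates:    # step 2: try every template
            for dep, edge_rel, head in edges:
                tgt = tgt_re.fullmatch(dep)
                if edge_rel != rel or tgt is None:
                    continue
                target = "{LEMMA}/{POS}".format(**tgt.groupdict())
                # step 2.1: with targets given, skip before matching the feature
                if targets is not None and target not in targets:
                    continue
                feat = feat_re.fullmatch(head)
                if feat is None:
                    continue
                feature = "<-{}_{}/{}".format(rel, feat["LEMMA"], feat["POS"])
                counts[(target, feature)] = counts.get((target, feature), 0) + 1
    return counts


print(count_matches(SENTENCES, TEMPLATES))
print(count_matches(SENTENCES, TEMPLATES, targets={"girl/N"}))
```

Checking the target before the feature is what makes a target list a speed-up: non-target edges are dropped before the second regex runs.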

nephosem.models.deprel.read_sentence(filename, formatter=None, encoding='utf-8')

Read sentences from corpus file.

Parameters
  • filename (str) –

  • formatter (nephosem.CorpusFormatter) –

  • encoding (str) – default ‘utf-8’

Returns

Return type

generator of sentences (tuple(int, string))
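
A generator of (int, string) tuples could look like the following minimal sketch. The corpus layout is an assumption: sentences are taken to be delimited by `<s>` and `</s>` lines, and the integer is taken to be the 1-based line number of the opening marker. The real function delegates format details to the formatter argument, which is not modelled here.

```python
import os
import tempfile


def iter_sentences(filename, encoding="utf-8"):
    """Lazily yield (start_line, sentence_text) pairs from a corpus file."""
    start, lines = None, []
    with open(filename, encoding=encoding) as fh:
        for lineno, line in enumerate(fh, start=1):
            line = line.rstrip("\n")
            if line == "<s>":          # hypothetical sentence-open marker
                start, lines = lineno, []
            elif line == "</s>":       # hypothetical sentence-close marker
                if start is not None:
                    yield start, "\n".join(lines)
                start = None
            elif start is not None:
                lines.append(line)


# Demo on a small temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("<s>\nThe\nboy\n</s>\n<s>\neats\n</s>\n")
result = list(iter_sentences(tmp.name))
os.remove(tmp.name)
print(result)
```

Yielding sentences one by one keeps memory use flat even for large corpus files.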

nephosem.models.typetoken module

This module implements the Bag-of-words models.

Other models

Usage examples

Initialize a model with e.g.

>>> from nephosem import ItemFreqHandler, ColFreqHandler, TokenHandler
>>> from nephosem.tests.utils import common_texts, get_tmpfile
>>> from nephosem.models import TypeToken
>>>
>>> path = get_tmpfile("models.model")
>>>
>>> model = TypeToken(common_texts, window=(5,5), min_count=1, workers=4)
>>> model.save("models.model")

class nephosem.models.typetoken.ColFreqHandler(settings, workers=0, row_vocab=None, col_vocab=None, **kwargs)

Bases: nephosem.core.handler.BaseHandler

build_col_freq(fnames=None, row_vocab=None, col_vocab=None)

The function will treat all different word types as possible target or context words.

Parameters
  • fnames (str or list of str, optional) – Path of a file that lists the corpus files the user wants to process, or a list of corpus file names. Format of the file name: corpus_name + settings[“fnames-ext”]

  • row_vocab (Vocab) – If it is not provided here or when initializing the class, the code will stop.

  • col_vocab (Vocab) –

Returns

Return type

TypeTokenMatrix

chunksize = 1000000
static dict2matrix(mtx_dict, row_items, col_items, classname=<class 'nephosem.core.matrix.TypeTokenMatrix'>)
property nocolvocab
process(fnames, **kwargs)

Process files in the fnames list.

Parameters

fnames (iterable) – List of file names to process.

Returns

For different tasks, this method returns different objects:
  • ItemFreqHandler: returns a Vocab object,

  • ColFreqHandler: returns a TypeTokenMatrix co-occurrence frequency matrix,

  • TokenHandler: returns a Python dict mapping the type strings to their lists of tokens (TokenNode objects).

Return type

object

process_right_window(matrix, win, **kwargs)

Process matches in the right window.

tmpindicator = 'col.freq'
update_one_file(filename, data, **kwargs)

This is the template method updating data of one corpus file.

update_one_match(matrix, win, lid=0, **kwargs)

Update co-occurrence frequency matrix with current window.

Parameters
  • matrix (3-tuple) – Includes dict of dict, row item list and column item list.

  • match (Match) – A regular expression match object.

  • lid (int) – Line number (1-based).

  • win (Window) – This is a Window object which records current items in span. The center item in window is the target word. And it has context words of left span and right span stored in two queues.
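
The window mechanism described for win can be sketched with two queues from the standard library. This is only an illustration under assumptions: the function name cooccurrences and the dict-of-dict layout are chosen for the example, and it also shows the end-of-text flush that a right-window step has to perform.

```python
from collections import defaultdict, deque


def cooccurrences(tokens, left=2, right=2):
    """Count target/context co-occurrences with a sliding window."""
    counts = defaultdict(lambda: defaultdict(int))
    left_q = deque(maxlen=left)   # items already passed, left of the target
    pending = deque()             # the target followed by its right context

    def update():
        target = pending.popleft()
        for ctx in left_q:        # left span
            counts[target][ctx] += 1
        for ctx in pending:       # right span (at most `right` items)
            counts[target][ctx] += 1
        left_q.append(target)

    for tok in tokens:
        pending.append(tok)
        if len(pending) == right + 1:   # right span is full: process the target
            update()
    while pending:                      # end of text: right spans are incomplete
        update()
    return counts


counts = cooccurrences("the boy eats the apple".split(), left=1, right=1)
print(dict(counts["the"]))
```

Without the final loop, the last `right` items of every text would never be treated as targets.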

class nephosem.models.typetoken.ItemFreqHandler(settings, workers=0, **kwargs)

Bases: nephosem.core.handler.BaseHandler

build_item_freq(fnames=None)

Make a list of all word types that occur in the corpus and write it in JSON format.

Parameters

fnames (str or list of str, optional) – Path of file recording corpus file names (‘fnames’ file of a corpus) or list of file names. If this is provided, only the files recorded in this fnames file would be processed. Else, all files and folders inside the ‘corpus-path’ of settings would be processed.

Returns

vocabulary

Return type

Vocab
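
As a rough illustration of an item frequency list, the sketch below counts whitespace-separated word types and serialises the result as JSON. It uses collections.Counter in place of the Vocab class, and the whitespace tokenisation is an assumption; real corpora are parsed according to the settings.

```python
import json
from collections import Counter


def count_items(lines):
    """Count how often each word type occurs in the given lines."""
    freq = Counter()
    for line in lines:
        freq.update(line.split())
    return freq


vocab = count_items(["the boy eats", "the girl eats the apple"])
print(json.dumps(vocab, sort_keys=True))
```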

process(fnames, **kwargs)

Process files in the fnames list.

Parameters

fnames (iterable) – List of file names to process.

Returns

For different tasks, this method returns different objects:
  • ItemFreqHandler: returns a Vocab object,

  • ColFreqHandler: returns a TypeTokenMatrix co-occurrence frequency matrix,

  • TokenHandler: returns a Python dict mapping the type strings to their lists of tokens (TokenNode objects).

Return type

object

tmpindicator = 'item.freq'
update_one_file(filename, data, **kwargs)

Process the lines in the file (filename) and add their frequencies to the vocabulary. Note: this function modifies the Vocab object passed as data in place.

Parameters
  • data (Vocab) – The Vocab object to be updated.

  • filename (str) – The corpus file name to process

class nephosem.models.typetoken.TokenHandler(queries, settings=None, workers=0, row_vocab=None, col_vocab=None, **kwargs)

Bases: nephosem.core.handler.BaseHandler

Handler Class for retrieving tokens

chunksize = 1000000
process(fnames, **kwargs)

Process files in the fnames list.

Parameters

fnames (iterable) – List of file names to process.

Returns

For different tasks, this method returns different objects:
  • ItemFreqHandler: returns a Vocab object,

  • ColFreqHandler: returns a TypeTokenMatrix co-occurrence frequency matrix,

  • TokenHandler: returns a Python dict mapping the type strings to their lists of tokens (TokenNode objects).

Return type

object

process_right_window(type2toks, win, fid='fname', **kwargs)

When the end of an article / block is reached, the remaining nodes that are still in the right window have to be checked as well.

retrieve_tokens(fnames=None)

Scan/Retrieve tokens from corpus files.

Parameters

fnames (str, optional) – Path of a file that lists the corpus files the user wants to process. Format of the file name: corpus_name + settings[“fnames-ext”]

Returns

Return type

TypeTokenMatrix

tmpindicator = 'tok.app'
update_one_file(filename, data, **kwargs)

This is the template method updating data of one corpus file.

update_one_match(type2toks, win, fid='fname', **kwargs)
Parameters
  • type2toks (dict) – A dict mapping from a type string to the token nodes of it.

  • win

  • fid (str) – The file id in a token string.

  • kwargs
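
The type2toks mapping can be pictured with the following minimal sketch. The token id format (type/fid/line/position), the plain tuples used instead of TokenNode objects, and the helper name collect_tokens are all assumptions for illustration.

```python
from collections import defaultdict


def collect_tokens(lines, queries, fid="fname", span=1):
    """Map each queried type to its tokens and their context words."""
    type2toks = defaultdict(list)
    for lid, line in enumerate(lines, start=1):
        words = line.split()
        for pos, word in enumerate(words):
            if word not in queries:
                continue
            # Hypothetical token id: type / file id / line / position (1-based).
            token_id = "{}/{}/{}/{}".format(word, fid, lid, pos + 1)
            context = words[max(0, pos - span):pos] + words[pos + 1:pos + 1 + span]
            type2toks[word].append((token_id, context))
    return dict(type2toks)


tokens = collect_tokens(["the boy eats", "a boy sleeps"], {"boy"}, fid="doc1")
print(tokens)
```

Including the file id in the token string keeps tokens from different corpus files distinguishable once they are merged into one dict.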

class nephosem.models.typetoken.TypeToken(settings=None, corpus_name=None)

Bases: object

Train, use and evaluate a type-token model.

fv
Type

collocate frequencies matrix

tv
Type

token vectors

vocabulary

This object represents the (all items) vocabulary of a corpus.

row_vocab

Row vocabulary.

col_vocab

Column vocabulary.

build_col_freq(fnames=None, row_vocab=None, col_vocab=None, multicore=True, prog_bar=True)
build_frequency_list(fnames=None, multicore=True, prog_bar=True)

Alias method of build_vocab().

build_token_vectors(tcWeightMTX=None, soccMTX=None, operation='addition')

Build token vectors.

Parameters
  • soccMTX (TypeTokenMatrix) – Second order collocate matrix.

  • tcWeightMTX (TypeTokenMatrix) – Token-Context weight matrix.

  • operation (str) – ‘addition’, ‘multiplication’
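
One plausible reading of the two operations, sketched for a single token with plain Python lists: with 'addition' the token vector is the weighted sum of the second-order vectors of its context words, with 'multiplication' it is their element-wise product. This is an assumption about the semantics, not a description of the library's implementation, and token_vector is a hypothetical helper.

```python
def token_vector(weights, socc, operation="addition"):
    """Combine second-order vectors of one token's context words.

    weights: {context word: weight} for the token
    socc:    {context word: second-order vector (list of floats)}
    """
    if operation not in ("addition", "multiplication"):
        raise ValueError("unsupported operation: {}".format(operation))
    dims = len(next(iter(socc.values())))
    # Neutral element: 0 for sums, 1 for products.
    vec = [0.0] * dims if operation == "addition" else [1.0] * dims
    for word, weight in weights.items():
        for i, value in enumerate(socc[word]):
            if operation == "addition":
                vec[i] += weight * value
            else:
                vec[i] *= weight * value
    return vec


weights = {"eat": 2.0, "red": 1.0}
socc = {"eat": [1.0, 0.0], "red": [0.5, 2.0]}
print(token_vector(weights, socc, "addition"))
print(token_vector(weights, socc, "multiplication"))
```

Under this reading, multiplication sets a dimension to zero as soon as one context word has a zero there, so it is much stricter than addition.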

build_token_weights(tcPositionMTX=None, twMTX=None)

Build token-context weight matrix.

Parameters
  • tcPositionMTX (token context position matrix) – Layout: tokens as rows, target words as columns.

  • twMTX (type weight matrix, ex. 'pmi' (transposed)) – Layout: context features (types) as rows, target words as columns.

Returns

Return type

TypeTokenMatrix

build_vocab(fnames=None, multicore=True, prog_bar=True)

A caller method of TypeToken model. It calls the same method ItemFreqHandler.build_vocab.

Parameters
  • fnames (str) – Path of file recording corpus file names (‘fnames’ file of a corpus). If this is provided, only the files recorded in this fnames file would be processed. Else, all files and folders inside the ‘corpus-path’ of settings would be processed.

  • multicore (bool) – Use multicore processing or not.

  • prog_bar (bool) – Show progress bar or not.

Returns

Return type

Vocab

See also

build_vocab

compute_association(freqmx=None, meas='ppmi')

Compute association measures matrix.

Parameters
  • freqmx

  • meas (str) – Could be: ‘pmi’, ‘ppmi’, ‘lik’, ‘chisq’, ‘zscore’, ‘dice’, ‘deltap’, ‘logratio’.
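
For the default measure, PPMI is PMI with negative values clipped to zero, where PMI(r, c) = log(f(r, c) · N / (f(r) · f(c))). The sketch below computes it on a dict-of-dict frequency table; the natural logarithm and the dict layout are assumptions, and the library's own matrix classes are not used.

```python
import math


def ppmi(freq):
    """Positive pointwise mutual information for a {row: {col: count}} table."""
    total = sum(sum(cols.values()) for cols in freq.values())
    row_sums = {row: sum(cols.values()) for row, cols in freq.items()}
    col_sums = {}
    for cols in freq.values():
        for col, n in cols.items():
            col_sums[col] = col_sums.get(col, 0) + n
    result = {}
    for row, cols in freq.items():
        result[row] = {}
        for col, n in cols.items():
            if n == 0:
                continue                      # log(0) is undefined; leave the cell out
            pmi = math.log(n * total / (row_sums[row] * col_sums[col]))
            result[row][col] = max(0.0, pmi)  # clip negative associations to zero
    return result


freq = {"eat": {"apple": 3, "boy": 1}, "ask": {"boy": 1}}
print(ppmi(freq))
```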

compute_distance(measmx=None, metric='cosine')

Compute distance matrix.

Parameters
  • measmx (TypeTokenMatrix) –

  • metric (str) – Could be: ‘cos’, ‘rank’
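
Cosine distance between two rows of a measure matrix is one minus their cosine similarity. A minimal sketch on plain lists (the function name cosine_distance is chosen for the example):

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine_distance([1.0, 2.0], [2.0, 4.0]))  # same direction, close to 0
```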

compute_similarity(measmx=None, metric='cosine', rank=False, axis=0)

Compute similarity matrix.

Parameters
  • measmx (TypeTokenMatrix) –

  • metric (str) – Could be: ‘cos’, ‘rank’

compute_simrank(simmx=None, distance=False, reverse=False)

Compute similarity rank matrix.

Parameters
  • simmx

  • distance

  • reverse
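
A similarity rank matrix replaces each similarity value with its rank within the row. The sketch below is one possible interpretation, with rank 1 for the most similar other item; the exact effect of the distance and reverse arguments is not modelled, and similarity_ranks is a hypothetical helper.

```python
def similarity_ranks(sim):
    """Rank the other items of each row by decreasing similarity (1 = closest)."""
    ranks = {}
    for row, cols in sim.items():
        others = [col for col in cols if col != row]   # ignore self-similarity
        ordered = sorted(others, key=lambda col: cols[col], reverse=True)
        ranks[row] = {col: rank for rank, col in enumerate(ordered, start=1)}
    return ranks


sim = {
    "a": {"a": 1.0, "b": 0.9, "c": 0.2},
    "b": {"a": 0.9, "b": 1.0, "c": 0.5},
    "c": {"a": 0.2, "b": 0.5, "c": 1.0},
}
print(similarity_ranks(sim))
```

Ranks are not symmetric: in this example "b" is the closest item to "c", while "c" is only the second closest item to "b".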

fetch_tokens(queries, fnames=None, multicore=True, prog_bar=True)
get_settings()
get_vocab()
make_token_colloc(type_nodes=None, colloc_vocab=None, prog_bar=True)
read_queries(queries)
retrieve_tokens(fnames=None, queries=None, row_vocab=None, col_vocab=None, multicore=True, prog_bar=True)
sample_tokens(n=300, method='random')
select_types(type_list)
Parameters

type_list (list of str) –

Module contents