nephosem.models package¶

Submodules¶

nephosem.models.cbc module¶

Algorithms

Usage examples¶

Import the module with e.g.

>>> from nephosem.models import cbc

class nephosem.models.cbc.CBC(elements, freqmx, measmx=None, distmx=None, k=100, theta1=0.35, theta2=0.25, prune_method='distance', t='median', score_metric='without_size', highest_score=False, num_iter=1000, workers=-1)¶

Bases: object

cluster(num_eles=-1, multicore=True)¶

Main method of Cluster by Committee.

Parameters

num_eles (int, optional) – Cluster only the first num_eles elements. Note: normally used for testing.
multicore (bool, optional) – Use multicore method or not. Default True.

Returns

(Cs, Rs) – A list of committees and a list of residues (of each recursion).

Return type

tuple

nephosem.models.deprel module¶

class nephosem.models.deprel.DepRelHandler(settings, workers=0, targets=None, mode='type', features=None, **kwargs)¶

Bases: nephosem.core.handler.BaseHandler

Handler Class for processing dependency relations

build_dependency(fnames=None)¶

Build a dependency frequency matrix for corpus files provided.

Parameters

fnames (str, optional) – Path of the file recording corpus file names (the ‘fnames’ file of a corpus). If provided, only the files listed in it are processed; otherwise, all files and folders inside the ‘corpus-path’ of settings are processed.
targets (list of str or Vocab, optional) – Target types/words to process. If provided, only these targets are processed when matching sentences with macros; otherwise, all possible targets are checked.

Returns

features

Return type

iterable of TemplateGraph

Examples

>>> # create a DepRelHandler instance
>>> freqMTX = dephan.build_dependency()
>>> freqMTX
[11, 24]  ->[agent_/->[nsubjpass_boy/N],_apple/N]  ->[agent_/->[nsubjpass_girl/N],_apple/N]  ->[nsubj_apple/N]  ->[nsubj_boy/N]  ->[nsubj_girl/N]  <-nsubj_/V->[acomp_healthy/JJ]  <-nsubj_/V->[acomp_old/JJ]  …
apple/N   NaN                                      NaN                                       NaN                NaN              NaN               3                               NaN                         …
ask/V     NaN                                      NaN                                       NaN                1                NaN               NaN                             NaN                         …
be/V      NaN                                      NaN                                       4                  2                NaN               NaN                             NaN                         …
boy/N     NaN                                      NaN                                       NaN                NaN              NaN               1                               1                           …
eat/V     1                                        1                                         NaN                7                6                 NaN                             NaN                         …
girl/N    NaN                                      NaN                                       NaN                NaN              NaN               1                               NaN                         …
give/V    NaN                                      NaN                                       NaN                3                1                 NaN                             NaN                         …
…         …                                        …                                         …                  …                …                 …                               …                           …

build_matrix_by_matches()¶: Build a frequency matrix by the matches.

process(fnames, **kwargs)¶

Parameters

queue_factor (int, optional) – Multiplier for the queue size: size = number of workers * queue_factor.
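As a quick illustration of the sizing rule, the sketch below bounds a standard-library queue at number of workers * queue_factor. The helper name make_task_queue and the default of 2 are hypothetical and not part of nephosem; how the handler builds its own queue is not shown in this reference.

```python
import queue


def make_task_queue(workers, queue_factor=2):
    """Return a bounded queue sized as number of workers * queue_factor."""
    # Hypothetical helper: it only mirrors the documented sizing rule.
    size = workers * queue_factor
    return queue.Queue(maxsize=size)


q = make_task_queue(workers=4, queue_factor=3)
print(q.maxsize)  # prints 12
```

A larger queue_factor lets producers run further ahead of the workers at the cost of more memory held in the queue.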

read_templates(fname=None, macros=None, encoding='utf-8')¶

Read the templates from a CSV/TSV file. The file has lines of content like the following (including a header):

ID  Target Regex                  Feature Regex                                      Target Description  Feature Description  ID
1   (?P<LEMMA>\w+)/(?P<POS>N)\w*  <-(?P<DEPREL>nsubj)$ (?P<LEMMA>\w+)/(?P<POS>V)\w*  noun                subject of verb      1

Parameters

fname (str, optional) – File name of the templates file.
macros (iterable of TemplateGraph, optional) – TemplateGraph instances when not passing the file name.
encoding (str, default 'utf-8') – File encoding of the template file.

Raises

ValueError – If neither fname nor macros is provided.

Examples

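>>> from copy import deepcopy
>>> import nephosem
>>> from nephosem.models.deprel import DepRelHandler
>>> dephan = DepRelHandler(settings)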
>>> template_fname = "{}/tests/data/DependencyFeatureTemplates.subgroup.tsv".format(nephosem.rootdir)
>>> dephan.read_templates(fname=template_fname)
>>> dephan.templates[0]
<-(?P<DEPREL>nsubj)$ (?P<LEMMA>\w+)/(?P<POS>V)\w*
>>> templates = deepcopy(dephan.templates)
>>> dephan.read_templates(macros=templates)
>>> dephan.templates[0]
<-(?P<DEPREL>nsubj)$ (?P<LEMMA>\w+)/(?P<POS>V)\w*
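To get a feel for the template file format without running nephosem, the following sketch parses a one-row TSV with the standard library and applies only the target regex. The sample row, the column names used as dictionary keys, and the split of the descriptions into "noun" and "subject of verb" are assumptions made for illustration; the way DepRelHandler itself parses and applies the feature regex is not shown here.

```python
import csv
import io
import re

# Hypothetical one-row template file, tab-separated, with a header line.
TSV = (
    "ID\tTarget Regex\tFeature Regex\tTarget Description\tFeature Description\n"
    "1\t(?P<LEMMA>\\w+)/(?P<POS>N)\\w*\t"
    "<-(?P<DEPREL>nsubj)$ (?P<LEMMA>\\w+)/(?P<POS>V)\\w*\t"
    "noun\tsubject of verb\n"
)

rows = list(csv.DictReader(io.StringIO(TSV), delimiter="\t"))
target_pattern = re.compile(rows[0]["Target Regex"])

# The target regex captures the lemma and the first letter of the POS tag.
match = target_pattern.fullmatch("apple/NN")
print(match.group("LEMMA"), match.group("POS"))  # prints: apple N
```

Named groups such as LEMMA and POS make it easy to pull the pieces of a matched item back out after matching.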

update_dep_rel(fname, macros, **kwargs)¶

update_dep_rel_token(fname, macros, **kwargs)¶

This is the method actually used for processing. Procedure:

1. Read sentences from the corpus file.
2. For each sentence, match every template/pattern.
   2.1 If targets are provided, only match the sentences that satisfy the targets.
   2.2 The matching is therefore a target-feature matching.
   2.3 This speeds up processing when targets are provided.

update_dep_rel_type(fname, macros, **kwargs)¶

This is the method actually used for processing. Procedure:

1. Read sentences from the corpus file.
2. For each sentence, match every template/pattern.
   2.1 If targets are provided, only match the sentences that satisfy the targets.
   2.2 The matching is therefore a target-feature matching.
   2.3 This speeds up processing when targets are provided.
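The procedure above can be sketched in plain Python. This is only an illustration of the target pre-filtering idea, not the handler's implementation: sentences are assumed to be lists of "lemma/POS" strings, targets a set of lemmas, and match_sentences is a hypothetical name.

```python
import re


def match_sentences(sentences, target_patterns, targets=None):
    """Count target-regex matches per lemma, optionally restricted to `targets`."""
    counts = {}
    for sentence in sentences:  # step 1: go through the sentences
        # Step 2.1: skip sentences without any requested target before regex work.
        if targets is not None and not any(
            tok.split("/")[0] in targets for tok in sentence
        ):
            continue
        for pattern in target_patterns:  # step 2: try every template
            for tok in sentence:
                m = pattern.fullmatch(tok)
                if m is None:
                    continue
                lemma = m.group("LEMMA")
                if targets is not None and lemma not in targets:
                    continue
                counts[lemma] = counts.get(lemma, 0) + 1
    return counts


sentences = [["girl/NN", "eat/VB", "apple/NN"], ["boy/NN", "ask/VB", "girl/NN"]]
noun = re.compile(r"(?P<LEMMA>\w+)/(?P<POS>N)\w*")
print(match_sentences(sentences, [noun], targets={"apple"}))  # prints {'apple': 1}
```

With targets given, the second sentence is never matched against any template, which is where the speed-up comes from.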

nephosem.models.deprel.read_sentence(filename, formatter=None, encoding='utf-8')¶

Read sentences from corpus file.

Parameters

filename (str) –
formatter (nephosem.CorpusFormatter) –
encoding (str) – default ‘utf-8’

Returns

Return type

generator of sentences (tuple(int, string))

nephosem.models.typetoken module¶

This module implements the Bag-of-words models.

Other models¶

Usage examples¶

Initialize a model with e.g.

>>> from nephosem import ItemFreqHandler, ColFreqHandler, TokenHandler
>>> from nephosem.tests.utils import common_texts, get_tmpfile
>>> from nephosem.models import TypeToken
>>>
>>> path = get_tmpfile("models.model")
>>>
>>> model = TypeToken(common_texts, window=(5,5), min_count=1, workers=4)
>>> model.save("models.model")

class nephosem.models.typetoken.ColFreqHandler(settings, workers=0, row_vocab=None, col_vocab=None, **kwargs)¶

Bases: nephosem.core.handler.BaseHandler

build_col_freq(fnames=None, row_vocab=None, col_vocab=None)¶

The function will treat all different word types as possible target or context words.

Parameters

fnames (str or list of str, optional) – Path of a file listing the corpus file names the user wants to process, or a list of corpus file names. Format: corpus_name + settings[“fnames-ext”]
row_vocab (Vocab) – Must be provided either here or when initializing the class; otherwise the code stops.
col_vocab (Vocab) –

Returns

Return type

TypeTokenMatrix

chunksize = 1000000¶

static dict2matrix(mtx_dict, row_items, col_items, classname=<class 'nephosem.core.matrix.TypeTokenMatrix'>)¶

property nocolvocab¶

process(fnames, **kwargs)¶

Process files in the fnames list.

Parameters

fnames (iterable) – A list of file names.

Returns

For different tasks, this method returns different objects:

ItemFreqHandler: returns a Vocab object,
ColFreqHandler: returns a TypeTokenMatrix co-occurrence frequency matrix,
TokenHandler: returns a Python dict mapping the type strings to their lists of tokens (TokenNode objects).

Return type

object

process_right_window(matrix, win, **kwargs)¶: Process matches in the right window.

tmpindicator = 'col.freq'¶

update_one_file(filename, data, **kwargs)¶: This is the template method updating data of one corpus file.

update_one_match(matrix, win, lid=0, **kwargs)¶

Update co-occurrence frequency matrix with current window.

Parameters

matrix (3-tuple) – Includes dict of dict, row item list and column item list.
match (Match) – A regular expression match object.
lid (int) – Line number (1-based).
win (Window) – A Window object recording the items currently in the span. The center item of the window is the target word; the context words of the left span and the right span are stored in two queues.
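The window idea described above can be illustrated with two standard-library deques. This is a simplified sketch and does not use the nephosem Window or TypeTokenMatrix classes; window_counts and its parameters are hypothetical names, and the nested dict stands in for the "dict of dict" part of the matrix tuple.

```python
from collections import defaultdict, deque
from itertools import islice

_END = object()  # sentinel marking the end of the token stream


def window_counts(tokens, left_span=1, right_span=1):
    """Count co-occurrences with a sliding window held in two queues."""
    counts = defaultdict(lambda: defaultdict(int))  # target -> context -> frequency
    stream = iter(tokens)
    left = deque(maxlen=left_span)  # context words to the left of the target
    ahead = deque(islice(stream, right_span + 1))  # next target plus right context
    while ahead:
        target = ahead.popleft()
        for context in left:
            counts[target][context] += 1
        for context in ahead:
            counts[target][context] += 1
        left.append(target)
        nxt = next(stream, _END)
        if nxt is not _END:
            ahead.append(nxt)
    return counts


counts = window_counts(["the", "girl", "eat", "the", "apple"])
print(dict(counts["eat"]))  # prints {'girl': 1, 'the': 1}
```

Once the stream is exhausted the loop keeps draining the items still waiting on the right, which is the situation a method such as process_right_window has to handle at the end of an article or block.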

class nephosem.models.typetoken.ItemFreqHandler(settings, workers=0, **kwargs)¶

Bases: nephosem.core.handler.BaseHandler

build_item_freq(fnames=None)¶

Make a list of all word types that occur in the corpus and write it in JSON format.

Parameters: fnames (str or list of str, optional) – Path of the file recording corpus file names (the ‘fnames’ file of a corpus) or a list of file names. If provided, only the files listed are processed; otherwise, all files and folders inside the ‘corpus-path’ of settings are processed.
Returns: vocabulary
Return type: Vocab
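Conceptually, an item frequency list is a counter over word types that is then serialized. The sketch below shows that idea with the standard library only; it assumes whitespace-separated "lemma/POS" items and does not use the nephosem Vocab class or the corpus format defined in settings. The name count_items is hypothetical.

```python
import json
from collections import Counter


def count_items(lines):
    """Count how often each whitespace-separated item occurs."""
    freq = Counter()
    for line in lines:
        freq.update(line.split())
    return freq


corpus = ["girl/N eat/V apple/N", "boy/N eat/V apple/N"]
vocab = count_items(corpus)
print(json.dumps(vocab, sort_keys=True))
# prints {"apple/N": 2, "boy/N": 1, "eat/V": 2, "girl/N": 1}
```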

process(fnames, **kwargs)¶

Process files in the fnames list.

Parameters

fnames (iterable) – A list of file names.

Returns

For different tasks, this method returns different objects:

ItemFreqHandler: returns a Vocab object,
ColFreqHandler: returns a TypeTokenMatrix co-occurrence frequency matrix,
TokenHandler: returns a Python dict mapping the type strings to their lists of tokens (TokenNode objects).

Return type

object

tmpindicator = 'item.freq'¶

update_one_file(filename, data, **kwargs)¶

Process the lines in the file (filename) and add their frequencies to the vocab. Note: this function modifies the vocab in place.

Parameters

data (Vocab) – The Vocab object to be updated.
filename (str) – The corpus file name to process

class nephosem.models.typetoken.TokenHandler(queries, settings=None, workers=0, row_vocab=None, col_vocab=None, **kwargs)¶

Bases: nephosem.core.handler.BaseHandler

Handler Class for retrieving tokens

chunksize = 1000000¶

process(fnames, **kwargs)¶

Process files in the fnames list.

Parameters

fnames (iterable) – A list of file names.

Returns

For different tasks, this method returns different objects:

ItemFreqHandler: returns a Vocab object,
ColFreqHandler: returns a TypeTokenMatrix co-occurrence frequency matrix,
TokenHandler: returns a Python dict mapping the type strings to their lists of tokens (TokenNode objects).

Return type

object

process_right_window(type2toks, win, fid='fname', **kwargs)¶: When the end of an article or block is reached, check the remaining nodes that are still in the right window.

retrieve_tokens(fnames=None)¶

Scan/Retrieve tokens from corpus files.

Parameters: fnames (str, optional) – Path of a file listing the corpus file names the user wants to process. Format: corpus_name + settings[“fnames-ext”]
Returns
Return type: TypeTokenMatrix

tmpindicator = 'tok.app'¶

update_one_file(filename, data, **kwargs)¶: This is the template method updating data of one corpus file.

update_one_match(type2toks, win, fid='fname', **kwargs)¶

Parameters

type2toks (dict) – A dict mapping from a type string to the token nodes of it.
win –
fid (str) – The file id in a token string.
kwargs –

class nephosem.models.typetoken.TypeToken(settings=None, corpus_name=None)¶

Bases: object

Train, use and evaluate a TypeToken model.

fv¶

Type: collocate frequencies matrix

tv¶

Type: token vectors

vocabulary¶: This object represents the (all items) vocabulary of a corpus.

row_vocab¶: Row vocabulary.

col_vocab¶: Column vocabulary.

build_col_freq(fnames=None, row_vocab=None, col_vocab=None, multicore=True, prog_bar=True)¶

build_frequency_list(fnames=None, multicore=True, prog_bar=True)¶: Alias method of build_vocab().

build_token_vectors(tcWeightMTX=None, soccMTX=None, operation='addition')¶

Build token vectors.

Parameters

soccMTX (TypeTokenMatrix) – Second order collocate matrix.
tcWeightMTX (TypeTokenMatrix) – Token-Context weight matrix.
operation (str) – Either ‘addition’ or ‘multiplication’.
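One plausible reading of the two operations is sketched below in plain Python: a token vector is built from the second-order vectors of the token's context words, scaled by the token-context weights, and combined either by summing or by element-wise multiplication. This is an assumption about the general technique, not a description of what build_token_vectors does internally; token_vector and its dict-based inputs are hypothetical.

```python
def token_vector(weights, context_vectors, operation="addition"):
    """Combine the weighted vectors of a token's context words.

    weights: dict mapping context word -> weight for this token
    context_vectors: dict mapping context word -> list of floats
    """
    if operation not in ("addition", "multiplication"):
        raise ValueError("operation must be 'addition' or 'multiplication'")
    # Only context words with a non-zero weight and a known vector contribute.
    used = [w for w, value in weights.items() if value != 0 and w in context_vectors]
    if not used:
        return []
    dim = len(context_vectors[used[0]])
    result = [0.0] * dim if operation == "addition" else [1.0] * dim
    for word in used:
        scaled = [weights[word] * x for x in context_vectors[word]]
        if operation == "addition":
            result = [r + s for r, s in zip(result, scaled)]
        else:
            result = [r * s for r, s in zip(result, scaled)]
    return result


weights = {"girl": 1.0, "apple": 2.0}
vectors = {"girl": [1.0, 0.0, 2.0], "apple": [0.5, 1.0, 1.0]}
print(token_vector(weights, vectors))                    # prints [2.0, 2.0, 4.0]
print(token_vector(weights, vectors, "multiplication"))  # prints [1.0, 0.0, 4.0]
```

Addition keeps every dimension that any context word contributes to, while multiplication zeroes a dimension as soon as one context word has a zero there.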

build_token_weights(tcPositionMTX=None, twMTX=None)¶

Build token-context weight matrix.

Parameters

tcPositionMTX (token context position matrix) – Layout: rows are tokens, columns are target words.
twMTX (type weight matrix, e.g. 'pmi' (transposed)) – Layout: rows are context features (types), columns are target words.

Returns

Return type

TypeTokenMatrix

build_vocab(fnames=None, multicore=True, prog_bar=True)¶

A wrapper method of the TypeToken model. It calls the corresponding vocabulary-building method of ItemFreqHandler.

Parameters

fnames (str) – Path of the file recording corpus file names (the ‘fnames’ file of a corpus). If provided, only the files listed in it are processed; otherwise, all files and folders inside the ‘corpus-path’ of settings are processed.
multicore (bool) – Use multicore processing or not.
prog_bar (bool) – Show progress bar or not.

Returns

Return type

Vocab

Module contents¶