nephosem.models package¶
Submodules¶
nephosem.models.cbc module¶
Algorithms
Usage examples¶
Initialize a vocabulary with a Python dict e.g.
>>> from nephosem.models import cbc
- class nephosem.models.cbc.CBC(elements, freqmx, measmx=None, distmx=None, k=100, theta1=0.35, theta2=0.25, prune_method='distance', t='median', score_metric='without_size', highest_score=False, num_iter=1000, workers=-1)¶
Bases:
object
- cluster(num_eles=-1, multicore=True)¶
Main method of Clustering by Committee (CBC).
- Parameters
num_eles (int, optional) – Cluster only the first num_eles elements. Note: normally used for testing.
multicore (bool, optional) – Use multicore method or not. Default True.
- Returns
(Cs, Rs) – A list of committees and a list of residues (of each recursion).
- Return type
tuple
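The role of the two thresholds can be illustrated with a minimal sketch. It follows the published Clustering by Committee procedure (a candidate committee is kept only if its similarity to every committee already kept is below theta1; an element is a residue if its similarity to every committee is below theta2) and is not taken from nephosem's source. The helper names `cosine`, `select_committees` and `find_residues` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_committees(candidates, theta1=0.35):
    """Keep a candidate centroid only if it is dissimilar to all kept ones.

    `candidates` is assumed to be sorted by descending score already.
    """
    kept = []
    for centroid in candidates:
        if all(cosine(centroid, c) < theta1 for c in kept):
            kept.append(centroid)
    return kept

def find_residues(elements, committees, theta2=0.25):
    """Return the names of elements that no committee covers."""
    return [name for name, vec in elements.items()
            if all(cosine(vec, c) < theta2 for c in committees)]

candidates = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
committees = select_committees(candidates)
elements = {"a": [1.0, 0.1], "b": [0.1, 1.0], "c": [0.0, 0.0]}
residues = find_residues(elements, committees)
```

Here the second candidate is dropped because it nearly duplicates the first, and only the all-zero element "c" remains as a residue.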
nephosem.models.deprel module¶
- class nephosem.models.deprel.DepRelHandler(settings, workers=0, targets=None, mode='type', features=None, **kwargs)¶
Bases:
nephosem.core.handler.BaseHandler
Handler Class for processing dependency relations
- build_dependency(fnames=None)¶
Build a dependency frequency matrix for corpus files provided.
- Parameters
fnames (str, optional) – Path of file recording corpus file names (‘fnames’ file of a corpus). If this is provided, only the files recorded in this fnames file would be processed. Else, all files and folders inside the ‘corpus-path’ of settings would be processed.
targets (list of str or Vocab, optional) – Target types/words to process. If this is provided, only process these targets when matching the sentence with macros. Else, all possible targets would be checked when matching sentences.
- Returns
features
- Return type
iterable of TemplateGraph
Examples
# create a DepRelHandler instance
>>> freqMTX = dephan.build_dependency()
>>> freqMTX
[11, 24]  ->[agent_/->[nsubjpass_boy/N],_apple/N]  ->[agent_/->[nsubjpass_girl/N],_apple/N]  ->[nsubj_apple/N]  ->[nsubj_boy/N]  ->[nsubj_girl/N]  <-nsubj_/V->[acomp_healthy/JJ]  <-nsubj_/V->[acomp_old/JJ]  …
apple/N   NaN                                      NaN                                       NaN                NaN              NaN               3                               NaN                         …
ask/V     NaN                                      NaN                                       NaN                1                NaN               NaN                             NaN                         …
be/V      NaN                                      NaN                                       4                  2                NaN               NaN                             NaN                         …
boy/N     NaN                                      NaN                                       NaN                NaN              NaN               1                               1                           …
eat/V     1                                        1                                         NaN                7                6                 NaN                             NaN                         …
girl/N    NaN                                      NaN                                       NaN                NaN              NaN               1                               NaN                         …
give/V    NaN                                      NaN                                       NaN                3                1                 NaN                             NaN                         …
…         …                                        …                                         …                  …                …                 …                               …                           …
- build_matrix_by_matches()¶
Build a frequency matrix by the matches.
- process(fnames, **kwargs)¶
- Parameters
queue_factor (int, optional) – Multiplier for the size of the queue: size = number of workers * queue_factor.
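The sizing rule can be illustrated with Python's standard `queue` module. This is only a sketch of the arithmetic; how nephosem actually creates its queue is not shown in this reference, and `make_task_queue` is a hypothetical helper.

```python
import queue

def make_task_queue(workers, queue_factor=2):
    """Create a bounded queue whose capacity scales with the worker count."""
    return queue.Queue(maxsize=workers * queue_factor)

# 4 workers with a factor of 3 give room for 12 pending items.
tasks = make_task_queue(workers=4, queue_factor=3)
```

Note that in Python's `queue.Queue` a `maxsize` of zero or less means an unbounded queue.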
- read_templates(fname=None, macros=None, encoding='utf-8')¶
Read the templates from a CSV/TSV file. The file has lines of content like the following (including a header):
ID  Target Regex                  Feature Regex                                      Target Description  Feature Description  ID
1   (?P<LEMMA>\w+)/(?P<POS>N)\w*  <-(?P<DEPREL>nsubj)$ (?P<LEMMA>\w+)/(?P<POS>V)\w*  noun                subject of verb      1
- Parameters
fname (str, optional) – File name of the templates file.
macros (iterable of TemplateGraph, optional) – TemplateGraph instances when not passing the file name.
encoding (str, default 'utf-8') – File encoding of the template file.
- Raises
ValueError – If neither fname nor macros is provided.
Examples
>>> from copy import deepcopy
>>> import nephosem
>>> from nephosem.models.deprel import DepRelHandler
>>> dephan = DepRelHandler(settings)
>>> template_fname = "{}/tests/data/DependencyFeatureTemplates.subgroup.tsv".format(nephosem.rootdir)
>>> dephan.read_templates(fname=template_fname)
>>> dephan.templates[0]
<-(?P<DEPREL>nsubj)$ (?P<LEMMA>\w+)/(?P<POS>V)\w*
>>> templates = deepcopy(dephan.templates)
>>> dephan.read_templates(macros=templates)
>>> dephan.templates[0]
<-(?P<DEPREL>nsubj)$ (?P<LEMMA>\w+)/(?P<POS>V)\w*
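The template regexes rely on named groups (LEMMA, POS, DEPREL). A short sketch with Python's `re` module shows how such a pattern captures the parts of a `lemma/POS` item. The sample items are made up for illustration, and the way nephosem applies the patterns internally may differ.

```python
import re

# Target pattern in the style of the template file: a noun lemma with its POS tag.
target_regex = re.compile(r"(?P<LEMMA>\w+)/(?P<POS>N)\w*")

match = target_regex.fullmatch("apple/N")
groups = match.groupdict() if match else {}

# A verb does not satisfy the noun pattern, so there is no match object.
no_match = target_regex.fullmatch("eat/V")
```

`groups` then holds the lemma and the POS tag under their group names, which is what makes the same pattern reusable across items.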
- update_dep_rel(fname, macros, **kwargs)¶
- update_dep_rel_token(fname, macros, **kwargs)¶
This is the method that is actually used for processing. Procedure:
1. Read sentences from the corpus file.
2. For each sentence, match every template/pattern.
2.1 If targets are provided, only match the sentences which satisfy the targets.
2.2 The matching is therefore a target-feature matching.
2.3 This speeds up processing when targets are provided.
- update_dep_rel_type(fname, macros, **kwargs)¶
This is the method that is actually used for processing. Procedure:
1. Read sentences from the corpus file.
2. For each sentence, match every template/pattern.
2.1 If targets are provided, only match the sentences which satisfy the targets.
2.2 The matching is therefore a target-feature matching.
2.3 This speeds up processing when targets are provided.
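The effect of restricting the matching to targets can be sketched as follows. This is a strong simplification: sentences are flat lists of `lemma/POS` strings rather than dependency graphs, and `match_sentences` is a hypothetical helper, not part of nephosem.

```python
import re

def match_sentences(sentences, patterns, targets=None):
    """Yield (sentence_id, pattern_index, item) for every matching item.

    `sentences` is an iterable of (id, list of 'lemma/POS' items).
    If `targets` is given, items outside that set are skipped before
    any regex is tried, which is where the speed-up comes from.
    """
    for sid, items in sentences:
        for item in items:
            if targets is not None and item not in targets:
                continue
            for idx, pattern in enumerate(patterns):
                if pattern.fullmatch(item):
                    yield sid, idx, item

patterns = [re.compile(r"(?P<LEMMA>\w+)/(?P<POS>N)\w*")]
sentences = [(1, ["boy/N", "eat/V", "apple/N"]),
             (2, ["girl/N", "be/V", "old/JJ"])]

all_hits = list(match_sentences(sentences, patterns))
target_hits = list(match_sentences(sentences, patterns, targets={"apple/N"}))
```

Without targets every noun is matched; with `targets={"apple/N"}` only the one relevant item reaches the regex step.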
- nephosem.models.deprel.read_sentence(filename, formatter=None, encoding='utf-8')¶
Read sentences from corpus file.
- Parameters
filename (str) –
formatter (nephosem.CorpusFormatter) –
encoding (str) – default ‘utf-8’
- Returns
- Return type
generator of sentences (tuple(int, string))
nephosem.models.typetoken module¶
This module implements bag-of-words models.
Other models¶
Usage examples¶
Initialize a model with e.g.
>>> from nephosem import ItemFreqHandler, ColFreqHandler, TokenHandler
>>> from nephosem.tests.utils import common_texts, get_tmpfile
>>> from nephosem.models import TypeToken
>>>
>>> path = get_tmpfile("models.model")
>>>
>>> model = TypeToken(common_texts, window=(5,5), min_count=1, workers=4)
>>> model.save("models.model")
- class nephosem.models.typetoken.ColFreqHandler(settings, workers=0, row_vocab=None, col_vocab=None, **kwargs)¶
Bases:
nephosem.core.handler.BaseHandler
- build_col_freq(fnames=None, row_vocab=None, col_vocab=None)¶
The function will treat all different word types as possible target or context words.
- Parameters
fnames (str or list of str, optional) – Path of a file that lists the corpus file names to process, or a list of corpus file names. Format: corpus_name + settings[“fnames-ext”]
row_vocab (Vocab) – If it is not provided here or when initializing the class, the code will stop.
col_vocab (Vocab) –
- Returns
- Return type
TypeTokenMatrix
- chunksize = 1000000¶
- static dict2matrix(mtx_dict, row_items, col_items, classname=<class 'nephosem.core.matrix.TypeTokenMatrix'>)¶
- property nocolvocab¶
- process(fnames, **kwargs)¶
Process files in the fnames list.
- Parameters
fnames (iterable) – List of corpus file names to process.
- Returns
- For different tasks, this method returns different objects:
ItemFreqHandler: returns a Vocab object,
ColFreqHandler: returns a TypeTokenMatrix co-occurrence frequency matrix,
TokenHandler: returns a Python dict mapping the type strings to their lists of tokens (TokenNode objects).
- Return type
object
- process_right_window(matrix, win, **kwargs)¶
Process matches in the right window.
- tmpindicator = 'col.freq'¶
- update_one_file(filename, data, **kwargs)¶
This is the template method updating data of one corpus file.
- update_one_match(matrix, win, lid=0, **kwargs)¶
Update co-occurrence frequency matrix with current window.
- Parameters
matrix (3-tuple) – Includes dict of dict, row item list and column item list.
match (Match) – A regular expression match object.
lid (int) – Line number (1-based).
win (Window) – This is a Window object which records current items in span. The center item in window is the target word. And it has context words of left span and right span stored in two queues.
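The window-based counting can be sketched in plain Python. The sketch below works on a token list with index arithmetic instead of a Window object with two queues, and it returns the same kind of triple as described for `matrix` (dict of dict, row items, column items). `count_cooccurrences` is a hypothetical name and the sample sentence is invented.

```python
from collections import defaultdict

def count_cooccurrences(tokens, left=1, right=1):
    """Count context words within `left`/`right` positions of each target."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, target in enumerate(tokens):
        start = max(0, i - left)
        # Left span, then right span; the target itself is excluded.
        context = tokens[start:i] + tokens[i + 1:i + 1 + right]
        for word in context:
            counts[target][word] += 1
    return counts

tokens = ["the", "boy", "eats", "the", "apple"]
freq = count_cooccurrences(tokens, left=1, right=1)
rows = sorted(freq)
cols = sorted({word for ctx in freq.values() for word in ctx})
```

The nested dict stays sparse, so only pairs that actually co-occur take up memory until the counts are converted to a matrix.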
- class nephosem.models.typetoken.ItemFreqHandler(settings, workers=0, **kwargs)¶
Bases:
nephosem.core.handler.BaseHandler
- build_item_freq(fnames=None)¶
Make a list of all word types that occur in the corpus and write it in JSON format.
- Parameters
fnames (str or list of str, optional) – Path of file recording corpus file names (‘fnames’ file of a corpus) or list of file names. If this is provided, only the files recorded in this fnames file would be processed. Else, all files and folders inside the ‘corpus-path’ of settings would be processed.
- Returns
vocabulary
- Return type
Vocab
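As a rough illustration of what an item frequency list contains, the following sketch counts whitespace-separated items and serializes the result as JSON. It uses `collections.Counter` instead of a Vocab object, assumes one sentence per line (real corpus formats are controlled by the settings), and `count_items` is a hypothetical helper.

```python
import json
from collections import Counter

def count_items(lines):
    """Count every whitespace-separated item across the given lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

corpus = ["boy/N eat/V apple/N", "girl/N eat/V apple/N"]
vocab = count_items(corpus)

# Sorted keys make the JSON output stable between runs.
as_json = json.dumps(vocab, sort_keys=True)
```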
- process(fnames, **kwargs)¶
Process files in the fnames list.
- Parameters
fnames (iterable) – List of corpus file names to process.
- Returns
- For different tasks, this method returns different objects:
ItemFreqHandler: returns a Vocab object,
ColFreqHandler: returns a TypeTokenMatrix co-occurrence frequency matrix,
TokenHandler: returns a Python dict mapping the type strings to their lists of tokens (TokenNode objects).
- Return type
object
- tmpindicator = 'item.freq'¶
- update_one_file(filename, data, **kwargs)¶
Process lines in file (filename), and add frequencies to vocab. Note: this function modifies vocab in place.
- Parameters
data (Vocab) – The Vocab object to be updated.
filename (str) – The corpus file name to process.
- class nephosem.models.typetoken.TokenHandler(queries, settings=None, workers=0, row_vocab=None, col_vocab=None, **kwargs)¶
Bases:
nephosem.core.handler.BaseHandler
Handler Class for retrieving tokens
- chunksize = 1000000¶
- process(fnames, **kwargs)¶
Process files in the fnames list.
- Parameters
fnames (iterable) – List of corpus file names to process.
- Returns
- For different tasks, this method returns different objects:
ItemFreqHandler: returns a Vocab object,
ColFreqHandler: returns a TypeTokenMatrix co-occurrence frequency matrix,
TokenHandler: returns a Python dict mapping the type strings to their lists of tokens (TokenNode objects).
- Return type
object
- process_right_window(type2toks, win, fid='fname', **kwargs)¶
When the end of an article / block is reached, the remaining nodes that are still in the right window have to be checked.
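Why the end of a block needs special handling can be seen in a streaming sketch: a target found near the end is still waiting for right-hand context when the input runs out, so it has to be flushed explicitly. The function below is an illustration under that assumption, not nephosem's implementation. `collect_contexts` is a hypothetical name and `right` is assumed to be at least 1.

```python
from collections import deque

def collect_contexts(stream, target, left=2, right=2):
    """Return one (left_context, right_context) pair per occurrence of `target`."""
    history = deque(maxlen=left)   # the last `left` items seen, oldest first
    pending = []                   # targets still waiting for right context
    done = []
    for item in stream:
        for entry in pending:
            entry[1].append(item)
        # Targets whose right window is now full are finished.
        done.extend(tuple(e) for e in pending if len(e[1]) == right)
        pending = [e for e in pending if len(e[1]) < right]
        if item == target:
            pending.append([list(history), []])
        history.append(item)
    # End of article/block: flush targets with a partially filled right window.
    done.extend(tuple(e) for e in pending)
    return done

contexts = collect_contexts(["a", "b", "x", "c", "d", "e", "x", "f"], "x")
```

The second occurrence of "x" only sees one item to its right, and without the final flush it would be lost.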
- retrieve_tokens(fnames=None)¶
Scan/Retrieve tokens from corpus files.
- Parameters
fnames (str, optional) – Path of a file that lists the corpus file names to process. Format: corpus_name + settings[“fnames-ext”]
- Returns
- Return type
TypeTokenMatrix
- tmpindicator = 'tok.app'¶
- update_one_file(filename, data, **kwargs)¶
This is the template method updating data of one corpus file.
- update_one_match(type2toks, win, fid='fname', **kwargs)¶
- Parameters
type2toks (dict) – A dict mapping from a type string to the token nodes of it.
win –
fid (str) – The file id in a token string.
kwargs –
- class nephosem.models.typetoken.TypeToken(settings=None, corpus_name=None)¶
Bases:
object
Train, use and evaluate models.
- fv¶
- Type
collocate frequencies matrix
- tv¶
- Type
token vectors
- vocabulary¶
This object represents the (all items) vocabulary of a corpus.
- row_vocab¶
Row vocabulary.
- col_vocab¶
Column vocabulary.
- build_col_freq(fnames=None, row_vocab=None, col_vocab=None, multicore=True, prog_bar=True)¶
- build_frequency_list(fnames=None, multicore=True, prog_bar=True)¶
Alias method of build_vocab().
- build_token_vectors(tcWeightMTX=None, soccMTX=None, operation='addition')¶
Build token vectors.
- Parameters
soccMTX (TypeTokenMatrix) – Second order collocate matrix.
tcWeightMTX (TypeTokenMatrix) – Token-Context weight matrix.
operation (str) – ‘addition’, ‘multiplication’
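One plausible reading of the two operations is sketched below: each context word of a token contributes its second-order (type) vector scaled by the token-context weight, and the contributions are combined either by summing or by element-wise multiplication. This interpretation is an assumption, not confirmed by this reference. `token_vector` and the toy data are hypothetical.

```python
def token_vector(weights, type_vectors, operation="addition"):
    """Combine the weighted type vectors of one token's context words.

    `weights` maps context word -> weight for a single token;
    `type_vectors` maps context word -> list of floats of equal length.
    """
    if operation not in ("addition", "multiplication"):
        raise ValueError("unknown operation: " + operation)
    dim = len(next(iter(type_vectors.values())))
    # Neutral start value: 0 for sums, 1 for products.
    result = [0.0] * dim if operation == "addition" else [1.0] * dim
    for word, weight in weights.items():
        scaled = [weight * x for x in type_vectors[word]]
        if operation == "addition":
            result = [r + s for r, s in zip(result, scaled)]
        else:
            result = [r * s for r, s in zip(result, scaled)]
    return result

type_vectors = {"eat": [1.0, 2.0], "red": [3.0, 0.5]}
weights = {"eat": 2.0, "red": 1.0}
added = token_vector(weights, type_vectors, "addition")
multiplied = token_vector(weights, type_vectors, "multiplication")
```

With addition, the whole computation is equivalent to a matrix product of the weight matrix and the second-order matrix, which is why it scales well to many tokens.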
- build_token_weights(tcPositionMTX=None, twMTX=None)¶
Build token-context weight matrix.
- Parameters
tcPositionMTX (token context position matrix) – Rows are tokens, columns are target words.
twMTX (type weight matrix, e.g. 'pmi' (transposed)) – Rows are context features (types), columns are target words.
- Returns
- Return type
TypeTokenMatrix
- build_vocab(fnames=None, multicore=True, prog_bar=True)¶
A caller method of TypeToken model. It calls the same method ItemFreqHandler.build_vocab.
- Parameters
fnames (str) – Path of file recording corpus file names (‘fnames’ file of a corpus). If this is provided, only the files recorded in this fnames file would be processed. Else, all files and folders inside the ‘corpus-path’ of settings would be processed.
multicore (bool) – Use multicore processing or not.
prog_bar (bool) – Show progress bar or not.
- Returns
- Return type
Vocab
See also
build_vocab
- compute_association(freqmx=None, meas='ppmi')¶
Compute association measures matrix.
- Parameters
freqmx –
meas (str) – Could be: ‘pmi’, ‘ppmi’, ‘lik’, ‘chisq’, ‘zscore’, ‘dice’, ‘deltap’, ‘logratio’.
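For the default measure, a minimal sketch of positive pointwise mutual information on a sparse frequency table may help. It uses the textbook definition PMI = log(p(x, y) / (p(x) p(y))) with the natural logarithm and clips negative values to zero. The logarithm base and any smoothing used by nephosem are not stated here, so treat those as assumptions. `ppmi` is a hypothetical helper.

```python
import math

def ppmi(freq):
    """Positive PMI for a nested dict freq[row][col] of non-zero counts."""
    total = sum(sum(cols.values()) for cols in freq.values())
    row_sums = {r: sum(cols.values()) for r, cols in freq.items()}
    col_sums = {}
    for cols in freq.values():
        for c, n in cols.items():
            col_sums[c] = col_sums.get(c, 0) + n
    result = {}
    for r, cols in freq.items():
        result[r] = {}
        for c, n in cols.items():
            # p(x, y) / (p(x) * p(y)) simplifies to n * total / (row * col).
            pmi = math.log((n * total) / (row_sums[r] * col_sums[c]))
            result[r][c] = max(0.0, pmi)
    return result

freq = {"apple": {"eat": 3, "old": 1}, "boy": {"eat": 1, "old": 3}}
scores = ppmi(freq)
```

Pairs that co-occur less often than chance, such as "apple" with "old" in this toy table, end up with a score of exactly zero.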
- compute_distance(measmx=None, metric='cosine')¶
Compute distance matrix.
- Parameters
measmx (TypeTokenMatrix) –
metric (str) – Could be: ‘cos’, ‘rank’
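For the cosine metric, distance is commonly defined as one minus cosine similarity. The sketch below shows that definition on plain lists. Whether nephosem uses exactly this form is an assumption, and `cosine_distance` is a hypothetical helper.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Vectors pointing the same way are at distance ~0, orthogonal ones at 1.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])
```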
- compute_similarity(measmx=None, metric='cosine', rank=False, axis=0)¶
Compute similarity matrix.
- Parameters
measmx (TypeTokenMatrix) –
metric (str) – Could be: ‘cos’, ‘rank’
- compute_simrank(simmx=None, distance=False, reverse=False)¶
Compute similarity rank matrix.
- Parameters
simmx –
distance –
reverse –
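A similarity rank matrix replaces each similarity value by its position in the row's ordering. The sketch below ranks one row, with rank 1 for the most similar item. The handling of ties, of the item itself, and of the `distance` and `reverse` flags in nephosem is not documented here, so this is only an illustration with a hypothetical helper name.

```python
def similarity_ranks(sim_row):
    """Map each item to its rank (1 = most similar) within one row."""
    ordered = sorted(sim_row, key=sim_row.get, reverse=True)
    return {item: rank for rank, item in enumerate(ordered, start=1)}

row = {"apple": 1.0, "pear": 0.8, "boy": 0.1, "girl": 0.3}
ranks = similarity_ranks(row)
```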
- fetch_tokens(queries, fnames=None, multicore=True, prog_bar=True)¶
- get_settings()¶
- get_vocab()¶
- make_token_colloc(type_nodes=None, colloc_vocab=None, prog_bar=True)¶
- read_queries(queries)¶
- retrieve_tokens(fnames=None, queries=None, row_vocab=None, col_vocab=None, multicore=True, prog_bar=True)¶
- sample_tokens(n=300, method='random')¶
- select_types(type_list)¶
- Parameters
type_list (list of str) –