nephosem.core package

Submodules

nephosem.core.graph module

class nephosem.core.graph.DiGraph

Bases: object

graph
Type

DiGraph

add_edge(e_id, from_v, to_v, **kwargs)

Add an edge to graph.

add_node(v_id, **kwargs)

Add a node with id and label (optional) to graph.

property edges
in_degree(v)
property istree
property nodes
out_degree(v)
predecessors(v)
successors(v)
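The interface above can be illustrated with a minimal pure-Python sketch. This is only an illustration of the API shape (add_node/add_edge with degrees, predecessors, and successors), not nephosem’s actual implementation; the class and storage layout here are hypothetical.

```python
# Minimal illustration of a directed-graph API shaped like DiGraph above.
# Hypothetical class: not nephosem's implementation.
class TinyDiGraph:
    def __init__(self):
        self.nodes = {}   # v_id -> attribute dict
        self.edges = {}   # e_id -> (from_v, to_v)

    def add_node(self, v_id, **kwargs):
        self.nodes[v_id] = kwargs

    def add_edge(self, e_id, from_v, to_v, **kwargs):
        self.edges[e_id] = (from_v, to_v)

    def in_degree(self, v):
        return sum(1 for f, t in self.edges.values() if t == v)

    def out_degree(self, v):
        return sum(1 for f, t in self.edges.values() if f == v)

    def predecessors(self, v):
        return [f for f, t in self.edges.values() if t == v]

    def successors(self, v):
        return [t for f, t in self.edges.values() if f == v]


g = TinyDiGraph()
g.add_node(1, label='give/V')
g.add_node(2, label='boy/NN')
g.add_edge((1, 2), 1, 2, rel='nsubj')
print(g.out_degree(1), g.in_degree(2), g.successors(1))
```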
class nephosem.core.graph.Graph(sentence=None, id2node=None)

Bases: object

add_edge(e_from_node, e_to_node, e_label)

Add an edge to graph.

add_node(v_id, v_label=None)

Add a node with id and label (optional) to graph.

build_graph(sentence=None, id2node=None)

Build a graph

Parameters
  • sentence (iterable) – A list of dependency relations.

  • id2node (dict) – Node id to node string mapping.

build_graph_raw(sentence)

Build a graph from raw text (of a sentence)

Parameters

sentence (iterable) – A list of strings

property edges
match(path)

Match a graph with path.

Parameters

path (PathTemplate) –

Returns

valid matches

Return type

iterable

property nodes
class nephosem.core.graph.MacroGraph(pattern, target_idx=-1, feature_idx=-1, target_filter={}, feature_filter={})

Bases: nephosem.core.graph.PatternGraph

Class representing a feature graph, inherited from the class PatternGraph. So it will have the same structure as the template from which it is generated. The generating process of a feature object would be:

  1. replicate the (tree) structure of the template

  2. set the target node index

  3. set feature properties for each node (except for the target) and each edge.

The feature properties (i.e. True or False) would be stored in attributes of nodes and edges

add_match(matched_nodes, matched_edges)

Add matched nodes and edges

Parameters
  • matched_nodes (dict) – mapping from node index to item string

  • matched_edges (dict) – mapping from edge index to relation string

connector = '/'

Class representing a dependency template tree/graph

feature(index=0)

Transform a matched node and edge to a feature string

feature_full(index=0)
feature_old(index=0)

Deprecated

feature_simple(index=0)

The resulting feature string is a normal type string. We only have one feature node (without an edge) in the resulting feature representation.

get_edge_repr(eid, edge_attrs)

Get representation of an (matched) edge.

get_node_repr(nid, node_attrs)

Get representation of a (matched) node.

classmethod parse_macro_xml(macroxml, id2patt)

Example XML:

```xml
<target-feature-macro id="1">
  <sub-graph-pattern id="1"/>
  <target nodeID="2">
    <description>Empty</description>
  </target>
  <feature nodeID="1">
    <description>Words that depend directly on the target.</description>
  </feature>
</target-feature-macro>
```

Parameters
  • macroxml (Element) – XML element of a macro

  • id2patt (dict) – The dict mapping index to pattern

Returns

Return type

MacroGraph

preorder_recur(node_dict, edge_dict, curr=0, reprs=None)
classmethod read_csv(fname, target_colname='Target Regex', feature_colname='Feature Regex', sep='\t', header=0, **kwargs)

Read feature patterns from a CSV/TSV file. This method uses pandas.read_csv(). If there is any reading error, please refer to the documentation of pandas.

Parameters
  • fname (str) – Filename of feature patterns in Kris’ notation.

  • target_colname (str, default 'Target Regex') – Column name of the target regular expression in the CSV/TSV file.

  • feature_colname (str, default 'Feature Regex') – Column name of the feature regular expression in the CSV/TSV file.

  • sep (str, default '\t') – Delimiter to use.
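The expected file layout can be sketched with the standard library (the actual method delegates to pandas.read_csv; the column names below follow the defaults above, and the regex contents are placeholders, not real patterns in Kris’ notation):

```python
import csv
from io import StringIO

# A tab-separated file with one pattern per row; regexes are placeholders.
tsv = ("Target Regex\tFeature Regex\n"
       "\\w+/N\\w*\t\\w+/V\\w*\n")

rows = list(csv.DictReader(StringIO(tsv), delimiter='\t'))
print(rows[0]['Target Regex'])
```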

classmethod read_xml(fname, patterns)

Example XML:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<target-feature-list>
  <target-fmt>
    <node-fmt>
      <LEMMA group="1"/>
      <POS group="1"/>
      <string connector="/">LEMMA/POS</string>
    </node-fmt>
  </target-fmt>
  <feature-fmt>
    <node-fmt>
      <LEMMA group="1"/>
      <POS group="1"/>
      <string connector="/">LEMMA/POS</string>
    </node-fmt>
    <edge-fmt>DEPREL</edge-fmt>
  </feature-fmt>
  <target-feature-macro id="1">
  </target-feature-macro>
  ...
</target-feature-list>
```

Parameters
Returns

Return type

list of MacroGraph

set_feature(feature_filter={})
set_target(target)
show(v_label='label', e_label='rel', figsize=(5.0, 5.0))
show_match(index=1, v_label='label', e_label='rel', figsize=(5.0, 5.0))
property size
target(index=0, mode='type')

Return the target of the index match.

class nephosem.core.graph.PatternGraph(nodes=None, edges=None, graph=None)

Bases: nephosem.core.graph.DiGraph

connector = '/'

Class representing a dependency template tree/graph

static islinear(template)
match_edge(edge, idx=0)

Match the passing sentence edge with the corresponding edge in the pattern graph.

Parameters
  • edge (dict) – A dict of edge attributes of the sentence graph. For most cases, there is only one attribute DEPREL.

  • idx (int or tuple of int) – An integer index of the corresponding edge of the pattern graph.

match_node(node, idx=0)

Match the passing sentence node with the corresponding node in the pattern graph.

Parameters
  • node (dict) – A dict of node attributes of the sentence graph. e.g. {‘FORM’: ‘The’, ‘POS’: ‘DT’, ‘LEMMA’: ‘the’}

  • idx (int) – An integer index of the corresponding node of the pattern graph.

classmethod read_csv(fname, target_colname='Target Regex', feature_colname='Feature Regex', sep='\t', header=0, **kwargs)

Read feature patterns from a CSV/TSV file. This method uses pandas.read_csv(). If there is any reading error, please refer to the documentation of pandas.

Parameters
  • fname (str) – Filename of feature patterns in Kris’ notation.

  • target_colname (str, default 'Target Regex') – Column name of the target regular expression in the CSV/TSV file.

  • feature_colname (str, default 'Feature Regex') – Column name of the feature regular expression in the CSV/TSV file.

  • sep (str, default '\t') – Delimiter to use.

classmethod read_graphml(fname)
show(v_label='label', e_label='rel', figsize=(5.0, 5.0))
class nephosem.core.graph.SentenceGraph(nodes=None, edges=None, sentence=None, formatter=None, mode='type', fname='', settings=None)

Bases: nephosem.core.graph.DiGraph

build_graph(sentence)

Build a graph from raw text (of a sentence)

Parameters

sentence (iterable) – A list of strings

generate_graph(nodes, edges)
match_pattern(macro)

Match a sentence with a feature pattern (and target pair). Append the results to the attribute lists matched_nodes and matched_edges of the feature object. A matched node is a dict mapping node index to type string. A matched edge is a dict mapping edge index to dependency relation. e.g. one match would be:

  • nodes: {1: ‘boy/NN’, 2: ‘give/V’, 3: ‘girl/NN’}

  • edges: {(2, 1): ‘nsubj’, (2, 3): ‘iobj’}

Parameters

macro (MacroGraph) – The feature pattern to be matched to the sentence.

match_target_feature(feature)

Match a graph with a path (a tree/graph object).

Parameters

feature (MacroGraph) –

Returns

valid matches

Return type

iterable

show(v_label='label', e_label='rel', figsize=(5.0, 5.0))
nephosem.core.graph.match_sub_template(sentence=None, feature=None, valid_nodes=None, valid_edges=None)

nephosem.core.handler module

Handler Class

Example

>>>
class nephosem.core.handler.BaseHandler(settings, workers=0, **kwargs)

Bases: nephosem.core.handler.Paralleler

This is a base class of all handler classes.

Contains framework for multicore method. The purpose of this class is to provide a reference interface for concrete handler implementations. At the same time, functionality that we expect to be common for those implementations is provided here to avoid code duplication.

A typical procedure of processing a corpus would be

tmpindicator

A string indicator for the temporary folder name used by all methods of this Class.

Type

str

settings
Type

dict

corpus_path
Type

str

output_path
Type

str

encoding

Default ‘utf-8’

Type

str

input_encoding

File encoding of input corpus files. Default ‘utf-8’

Type

str

output_encoding

File encoding of output files. May differ from input_encoding: e.g., if input_encoding is ‘latin-1’ but we do not want to use it for the output files, we could use ‘utf-8’ for them instead. Default ‘utf-8’.

Type

str

Notes

A subclass should initialize the following attributes:

  • self.settings - settings dict

prepare_fnames(fnames=None)

Prepare corpus file names based on the fnames file path or the corpus path. If a valid fnames is passed, read all file names recorded inside this file. If not, use the corpus directory self.corpus_path and read all file names inside this folder.

Parameters

fnames (str or list, optional) – If str, then it is the filename of a file which records all (a user wants to process) file names of a corpus. If list, then it contains a list of filenames.
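The resolution order described above (an explicit list, a file of names, or a fallback to the corpus directory) can be sketched as follows. The helper name `resolve_fnames` is hypothetical; this only mirrors the documented logic, not the actual method.

```python
import os

def resolve_fnames(fnames=None, corpus_path='.'):
    """Sketch of prepare_fnames' resolution order (hypothetical helper)."""
    if isinstance(fnames, list):          # already a list of file names
        return fnames
    if isinstance(fnames, str) and os.path.isfile(fnames):
        with open(fnames, encoding='utf-8') as f:  # one file name per line
            return [line.strip() for line in f if line.strip()]
    # fall back: collect all files inside the corpus directory
    return sorted(
        os.path.join(root, name)
        for root, _, names in os.walk(corpus_path)
        for name in names
    )

print(resolve_fnames(['a.txt', 'b.txt']))
```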

process(fnames, **kwargs)

Process files in the fnames list.

Parameters

fnames (iterable) – This fnames is a list of file names.

Returns

For different tasks, this method returns different objects:
  • ItemFreqHandler: returns a Vocab object,

  • ColFreqHandler: returns a TypeTokenMatrix co-occurrence frequency matrix,

  • TokenHandler: returns a Python dict mapping the type strings to their lists of tokens (TokenNode objects).

Return type

object

process_right_window(*args, **kwargs)
tmpindicator = ''
update_one_file(filename, data, **kwargs)

This is the template method updating data of one corpus file.

update_one_match(*args, **kwargs)
class nephosem.core.handler.Paralleler(workers=0, **kwargs)

Bases: object

This is a base class of all classes that work parallel. This class contains a framework for parallel tasks. Here is the description of the procedures:

  • The main entrance is the method process(). It creates a job queue (for tasks sent to all sub-processes) and a result queue (for results returned by all sub-processes). It also creates a number of worker processes, each of which executes the method _worker_loop(). After creating the worker processes, it fills the job queue with the input fnames (data). The _worker_loop() processes the tasks in the job queue and puts the results in the result queue. Finally, process() uses the _process_results() method to post-process (merge) the results of each worker process into one final Python object.

  • The default _job_producer() method fills the job queue with the input fnames. One fname is fetched and processed by a worker process (_worker_loop()). A subclass inheriting Paralleler could override this method and feed other data into the job queue; just ensure that one piece of data can be consumed by the _do_process_job() method inside _worker_loop().

  • The _worker_loop() method fetches one piece of data (by default, one fname), passes it to the method _do_process_job(), and sends the corresponding result to the result queue. Any subclass should implement this method for its task.

  • The _do_process_job() method gets one piece of data, sends it to a function for processing, and returns the result back to _worker_loop(). Any subclass should implement this method.
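The job-queue/result-queue control flow above can be sketched with threads. The library itself uses worker processes; this sketch only illustrates the producer/worker/merge pattern, with `len()` standing in for _do_process_job().

```python
import queue
import threading

def worker_loop(jobs, results):
    """Fetch one piece of data at a time and push its result (_worker_loop)."""
    while True:
        fname = jobs.get()
        if fname is None:          # sentinel: no more jobs
            break
        results.put(len(fname))    # stand-in for _do_process_job()

jobs, results = queue.Queue(), queue.Queue()
workers = [threading.Thread(target=worker_loop, args=(jobs, results))
           for _ in range(2)]
for w in workers:
    w.start()
for fname in ['a.txt', 'bb.txt', 'ccc.txt']:   # _job_producer()
    jobs.put(fname)
for _ in workers:                               # one sentinel per worker
    jobs.put(None)
for w in workers:
    w.join()
# _process_results(): merge per-job results into one final object
total = sum(results.get() for _ in range(3))
print(total)
```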

workers

Number of CPU cores to be used parallel.

Type

integer

tmpdir

The path of the temporary directory used by the Class

Type

str

Notes

property pid
process(fnames, **kwargs)

Process files in the fnames list.

Parameters

fnames (iterable) – This fnames is a list of file names.

Returns

For different tasks, this method returns different objects:
  • ItemFreqHandler: returns a Vocab object,

  • ColFreqHandler: returns a TypeTokenMatrix co-occurrence frequency matrix,

  • TokenHandler: returns a Python dict mapping the type strings to their lists of tokens (TokenNode objects).

Return type

object

property subtmpdir

When the class is instanced in a subprocess, a temporary folder for this subprocess will be created for its temporary files.

nephosem.core.matrix module

Matrix Classes

Usage examples

Construct a TypeTokenMatrix with a Python dict e.g.

>>> from nephosem.tests.utils import common_texts
>>> from nephosem import TypeTokenMatrix
>>>
>>>
class nephosem.core.matrix.TypeTokenMatrix(matrix, row_items, col_items, deep=True, **kwargs)

Bases: nephosem.core.matrix.BaseMatrix

Examples

Construction of a toy TypeTokenMatrix object:

>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> row_items = ['row0', 'row1', 'row2']
>>> col_items = ['col0', 'col1', 'col2', 'col3']
>>> sparr = np.array([[-5, 0, -3, -2], [-1, 0, 0, 1], [2, 0, 4, 5]])
>>> spmx = csr_matrix(sparr)  # sparse matrix
>>> spMTX = TypeTokenMatrix(spmx, row_items, col_items)
>>> print(spMTX)
>>> dsmx = np.array([[-5, -4, -3, -2], [-1, 0, 0, 1], [2, 3, 4, 5]])
>>> nmMTX = TypeTokenMatrix(dsmx, row_items, col_items)
>>> print(nmMTX)
>>> sqarr = np.array([1, 2, 3, 4, 5, 6])
>>> from scipy.spatial.distance import squareform
>>> sqmx = squareform(sqarr)
>>> sqMTX = TypeTokenMatrix(sqmx, col_items, col_items)
>>> print(sqMTX)
property colid2item
concatenate(targetmx, axis=0)

Concatenate target matrix with self.

Parameters
  • targetmx (TypeTokenMatrix) – Target matrix

  • axis (int) – Axis of concatenation. If axis = 0, concatenate the targetmx as the new rows of self matrix. If axis = 1, concatenate the targetmx as the new columns of self matrix.

copy()
count_nonzero(axis=0)

Count the number of nonzero values for each row or each column.

property dataframe
deepcopy()
describe()

Generates descriptive information of the matrix TODO: improve

drop(axis=0, n_nonzero=0, **kwargs)

Drop rows (or columns, depending on axis) that have n_nonzero or fewer nonzero values.

Parameters
  • axis (int) – If axis is 0, drop rows that satisfy the given criteria. If axis is 1, drop columns that satisfy the given criteria.

  • n_nonzero (int) – The number of nonzero values in each row. If n_nonzero is 0, drop all empty rows. If n_nonzero is 1, drop all rows that only have 1 nonzero value or less. …

Returns

Dropped matrix

Return type

TypeTokenMatrix
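The dropping criterion can be sketched with numpy boolean masking. This is an illustration of the row-count logic only; the method itself operates on and returns a TypeTokenMatrix.

```python
import numpy as np

mx = np.array([[0, 0, 0],
               [1, 0, 0],
               [1, 2, 0]])
n_nonzero = 1
# keep rows with strictly more than n_nonzero nonzero values (axis=0 case)
keep = (mx != 0).sum(axis=1) > n_nonzero
dropped = mx[keep]
print(dropped)
```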

drop_empty(axis=0, explicit=False)

Drop empty rows and return a dropped matrix.

Parameters
  • axis (int) – 0 or 1

  • explicit (bool) – If True, rows that only have explicit zeros are also empty. Else, only rows that have no stored values are empty.

drop_empty_rows(explicit=False)

Drop empty rows and return a dropped matrix.

Parameters

explicit (bool) – If True, rows that only have explicit zeros are also empty. Else, only rows that have no stored values are empty.

drop_zero_rows()

Drop rows with only zero values and return the dropped matrix

empty_rows(explicit=False)

Show types/tokens which have empty rows. If this matrix is a token-context weight matrix, the row of a token would have all zero ppmi values.

Parameters

explicit (bool) – If True, rows that only have explicit zeros are also empty. Else, rows that have no stored values are empty.

Returns

Return type

a list of row indices

equal(othermx)
classmethod from_dataframe(df, issparse=True)
get_colloc_contexts(item)

Get collocate context features.

Parameters

item

get_matrix()

Get matrix.

Returns

Return type

numpy.ndarray or scipy.sparse.csr_matrix

Raises

NotImplementedError

property item2colid

Return a dict mapping from (column) items to their corresponding indices. If self._item2colid has already been generated, just return it; else, create the self._item2colid dict and return it.

Returns

self._item2colid

Return type

dict

property item2rowid

Return a dict mapping from (row) items to their corresponding indices. If self._item2rowid has already been generated, just return it; else, create the self._item2rowid dict and return it.

Returns

self._item2rowid

Return type

dict

classmethod load(filename, encoding='utf-8', pack=True)
Parameters
  • filename (".../xx.wcmx.freq.pac") –

  • encoding (str) – Default ‘utf-8’.

  • pack (bool) – Indicates whether the file is packed or not.

Returns

Return type

meta data and matrix

merge(targetmx)

Merge two TypeTokenMatrix objects.

Parameters

targetmx (TypeTokenMatrix) – Target matrix

property meta_data

Meta data of the matrix.

most_similar(item, k=10, descending=False)

Get most similar items of the target item.

Parameters
  • item (str) – Row item (word)

  • k (int) – Number of returned similar items.

  • descending (bool) – If descending is True, sort the elements in descending order of the values; else, sort in ascending order. The values would be distance or similarity: for a similarity matrix, set descending to True, as we want the elements with the largest values (similarities); for a distance matrix, set descending to False; for a similarity rank matrix, same as a similarity matrix.

Returns

Return type

a list of elements
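The ordering logic can be sketched with numpy argsort (illustration only; for a distance row, ascending order puts the most similar items first, while descending=True fits a similarity matrix):

```python
import numpy as np

items = ['apple', 'pear', 'car', 'truck']
dist_row = np.array([0.0, 0.2, 0.9, 0.8])   # distances from 'apple'
k, descending = 2, False
order = np.argsort(dist_row)
if descending:                # similarity matrix: largest values first
    order = order[::-1]
top = [items[i] for i in order[:k]]
print(top)
```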

multiply(other)
print_matrix(n_rows=7, n_cols=7)

Prints n_rows rows and n_cols columns of the matrix. If either is set to None, prints the default amount.

Parameters
  • n_rows

  • n_cols

classmethod read_csv(filename, sep='\t', index_col=None, header='infer', issparse=False, encoding='utf-8')

Read a comma(tab)-separated values (csv/tsv) file.

Parameters
  • filename (str) – Filename of the csv file.

  • sep (str, default '\t') – Field delimiter to use.

  • header (int or list of ints, default 'infer') – Row number(s) to use as the column names, and the start of the data.

  • index_col (int or sequence or False, default None) – Column to use as the row labels of the DataFrame.

  • issparse (bool) – True for sparse matrix (i.e. frequency matrix). False for dense matrix (i.e. distance matrix).

  • encoding (str, default 'utf-8') – Encoding to use for UTF when reading/writing (ex. ‘utf-8’).

reorder(item_list, axis=0)

Reorder the matrix based on a new item list.

Parameters
  • item_list (list of str) – A list of (string) items.

  • axis (int, optional) – 0 for row, 1 for column

property rowid2item
sample(percent=0.1, seed=-1, replace=False)

Sample the matrix based on row

Parameters
  • percent (float) – percentage of row dimension

  • seed (int, default -1) – Random seed for sampling. When the seed is set to a non-default value (non-negative), the method uses this seed for the numpy random sampling operation.

save(filename, encoding='utf-8', pack=True, verbose=True)
property shape
spmatrix_to_dict()

Only for sparse matrix

submatrix(row=None, col=None)

Select a submatrix. If self is a sparse matrix (i.e. word-context frequency matrix), you can either specify only row or only col or both. If self is a square matrix (i.e. word-word distance matrix), normally you should select a square submatrix, therefore specify both row and col with the same list.

Parameters
  • row (iterable (list of str)) – Only support a list of str

  • col (iterable (list of str)) – Only support a list of str

Returns

submatrix

Return type

TypeTokenMatrix

sum(axis=None)

Sum the matrix over the given axis. If the axis is None, sum over both rows and columns, returning a scalar.

axis = 1:

    | 1, 0, 2 |  ->  [3]
    | 0, 3, 4 |  ->  [7]
    -----------
    [1, 3, 6]    <-  axis = 0

These sums are typically the marginal frequencies of a co-occurrence contingency table:

                   Collocate present   Collocate absent   Totals
    Node present   c_a_b               c_a_nb             R1
    Node absent    c_na_b              c_na_nb            R2
    Totals         C1                  C2                 N

Parameters

axis (int) – If axis == 1, sum over rows. If axis == 0, sum over columns. If axis is None, return the total sum value (scalar).

Returns

A python dict with row/column items as keys and sum of that row/column as values.

Return type

dict
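The marginal sums in the table above can be computed as a plain numpy sketch of the same operation (the method itself returns a dict keyed by row/column items):

```python
import numpy as np

mx = np.array([[1, 0, 2],
               [0, 3, 4]])
row_sums = mx.sum(axis=1)   # axis=1: one value per row -> R1, R2
col_sums = mx.sum(axis=0)   # axis=0: one value per column -> C1, C2, ...
total = mx.sum()            # axis=None: scalar N
print(row_sums, col_sums, total)
```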

to_csv(filename, sep='\t', index=True, header=True, encoding='utf-8', verbose=True)

Write DataFrame to a comma-separated values (csv) file.

Parameters
  • filename (str) –

  • sep (character, default '\t') – Field delimiter for the output file.

  • index (boolean, default True) – Write row names (index).

  • header (boolean, default True) – Write out the column names.

  • encoding (string, optional) – A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3.

  • verbose (boolean) –

todense()
transpose()

Computes the transpose of a matrix.

Returns

The transpose.

Return type

numpy.ndarray

nephosem.core.terms module

class nephosem.core.terms.CorpusFormatter(settings)

Bases: object

connector = '/'
get(match, column, fid=None, lid=None)

Get the content of the corresponding column from a corpus line.

get_colloc(match)

Get token string from match object.

get_token(match, fid, lid)

Get token string from match object.

get_type(match)

Get type string from match object.

left_bound_machine(line)
match_line(line, form=None)
right_bound_machine(line)
separator_line_machine(line)
single_bound_machine(line)
class nephosem.core.terms.Getter(settings)

Bases: object

static get_func(get_form_string)
Parameters

get_form_string (str) –

get_item(match, form)
get_token(match, fid, lid)
get_type(match)
init_machine(line)
left_bound_machine(line)
property lemma
property pos
right_bound_machine(line)
separator_line_machine(line)
single_bound_machine(line)
token_line_machine(line, fid, lid)
property word
word_line_machine(line)
class nephosem.core.terms.ItemNode(match=None, formatter=None, word=None, lemma=None, pos=None, type_fmt=None, colloc_fmt=None, **kwargs)

Bases: object

This class represents an item node parsed by line-machine regular expression. The parsed item node consists of ‘word’, ‘lemma’ and ‘pos’ (if a file line has them).

connector = '/'
to_colloc(colloc_fmt=None)

Get a collocate string based on the item node.

Parameters

colloc_fmt (str) – colloc format string, i.e. ‘lemma/pos’

to_type(type_fmt=None)

Get a type string based on the item node.

Parameters

type_fmt (str) – type format string, i.e. ‘lemma/pos’

class nephosem.core.terms.TokenNode(token_str=None, token_fmt=None, match=None, formatter=None, fid='unknown', lid='-1', word=None, pos=None, lemma=None, lcollocs=None, rcollocs=None, **kwargs)

Bases: nephosem.core.terms.ItemNode

A TokenNode normally has its left and right context/collocate ItemNodes.

connector

For connecting ‘word’, ‘lemma’, ‘pos’, ‘fid’ and ‘lid’. default ‘/’

Type

str

lcollocs
Type

a list of left collocates

rcollocs

each collocate is an ItemNode object

Type

a list of right collocates

connector = '/'
classmethod gen_token_from_json_data(json_data)
property json_data
property lspan
property rspan
property token
class nephosem.core.terms.TypeNode(match=None, formatter=None, type_fmt=None, type_str=None, word=None, lemma=None, pos=None, tokens=None, **kwargs)

Bases: nephosem.core.terms.ItemNode

This Class represents a type node which in the token level contains all its token appearances.

The following are some important attributes.

lemma
Type

str

pos
Type

str

tokens

an appearance includes the token and its collocate types

Type

a list of appearances / tokens

append_token(token)
property collocs
connector = '/'
property freq

frequency of the type

get_collocs()

Get the collocates of all tokens

classmethod load(filename, encoding='utf-8')
classmethod merge(tns)

Merge a list of TypeNode instances into one

Parameters

tns (a list of TypeNode instances) –

sample(n=300, method='random')

Select n tokens/appearances from all.

Parameters
  • n (int) – default is 300

  • method (str) – ‘random’, …

Returns

Return type

A new TypeNode object

save(filename, fmt='json', encoding='utf-8', verbose=True)
property type
class nephosem.core.terms.Window(lspan=10, rspan=10)

Bases: object

left_span

left span, window size of left collocates

Type

int

right_span

right span, window size of right collocates

Type

int

left

left window

Type

deque

right

right window

Type

deque

node

center node

static init_span(size)
update(cur)

Current window: [l1, …] [node] [r1, …]. After update(cur) it becomes [l2, …, node] [r1] [r2, …, cur]: the node shifts into the left window, the first right collocate becomes the new node, and cur is appended to the right window.

Parameters

cur
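The update step can be sketched with collections.deque. This is an illustration of the mechanism under the assumption of fixed-size left/right windows, not the actual Window class.

```python
from collections import deque

lspan, rspan = 2, 2
left = deque(maxlen=lspan)      # old items fall off the left automatically
right = deque(['r1', 'r2'], maxlen=rspan)
node = 'node'

def update(cur):
    """Shift the window one position to the right."""
    global node
    left.append(node)           # node moves into the left window
    node = right.popleft()      # first right collocate becomes the node
    right.append(cur)           # cur enters the right window

update('r3')
print(list(left), node, list(right))
```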

nephosem.core.vocab module

Vocabulary Class

Usage examples

Initialize a vocabulary with a Python dict e.g.

>>> from nephosem.tests.utils import common_texts
>>> from nephosem import Vocab
>>>
>>>
class nephosem.core.vocab.Vocab(data=None, encoding='utf-8')

Bases: object

copy()

Just to have a better name for deepcopy().

property dataframe

Generate dataframe dynamically every time it is called. Sort items first by frequency (descending) and then by alphabetic ascending order.

deepcopy()
describe()

Give a description of Vocab.

equal(vocab2)

Check whether two vocabularies are equal.

get_dict()
get_item_list(sorting='alpha', descending=False)

Get a sorted list of items based on a sorting order. Calls utils.sort_dict().

Parameters
  • sorting (str) – ‘freq’ for frequency order, ‘alpha’ for alphabetic order.

  • descending (bool) – If True, sort dict by descending order of ‘sorting’. Else, sort dict by ascending order of ‘sorting’.

Returns

sorted list of items in the vocabulary

Return type

list
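The two sorting orders can be sketched on a plain frequency dict (an illustration of the orderings, not the method or utils.sort_dict() itself; the frequency/alphabetic tie-break follows the dataframe property's description):

```python
freqs = {'pear': 3, 'apple': 3, 'car': 7}
# alphabetic ascending ('alpha', descending=False)
alpha = sorted(freqs)
# frequency descending, ties broken by alphabetic ascending order
by_freq = sorted(freqs, key=lambda it: (-freqs[it], it))
print(alpha, by_freq)
```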

increment(key, inc=1)

Increment the value of a key by ‘inc’.

isEmpty()
items()

Same as Python dict.items()

keys()
classmethod load(filename, encoding='utf-8', fmt='json')

Load vocabulary (frequency list) from file. The default file format to load the vocabulary is ‘json’.

Parameters
  • filename (str) –

  • encoding (str) – ‘utf-8’, ‘latin-1’, …

  • fmt (str) – ‘json’, ‘plain’, ‘txt’ (same as plain)

Returns

Return type

Vocab

make_type_file(type_list, out_fname, encoding='utf-8')

This method could be used in the token-level workflow for generating a typeSelection file.

Parameters
  • type_list (a list of types) –

  • out_fname (output file name) –

  • encoding

match(column_name='item', pattern='.')

Match items by a given regular expression pattern.

Parameters
  • pattern (str) – Regular expression pattern

  • column_name (str) – ‘item’ or ‘freq’, normally only use ‘item’

Returns

Return type

list

static regex_item(item, pattern)

Match an item by a given regular expression pattern.

save(filename, encoding=None, fmt='json', verbose=True)

Save vocabulary to file.

Parameters
  • filename (str) –

  • encoding (str) – Encoding format: ‘utf-8’, ‘latin-1’ … If not provided, use encoding of Vocab.

  • fmt (str) –

    File format: ‘json’, ‘plain’. The default file format is ‘json’. The ‘plain’ format would save frequency dict in the following format:

    type-string[TAB]frequency

    One type per line.

  • verbose (bool) – Show information or not.

select_items(word)

This method takes a word (or lemma) as input and returns a Vocab object. Whether the provided word matches the items in the vocab depends on the type format of the items: if the item is ‘lemma/pos’, a ‘lemma’ string should be provided; if it is ‘word/pos’, a ‘word’ string should be provided.

select_subsets(specif_words, n=300, method='random', indent='')

Select subsets of n appearances. Here we select, for each word, which n items (appearances) will be retrieved from the corpus.

Parameters
  • specif_words – a list of (specified) words

  • n – number of selected appearances. If n > the frequency of an item, select all appearances; else, randomly (by default) select n appearances.

  • method – selecting methods: ‘random’, …

  • indent – indentation

setFILTER(value)
subvocab(items)

Select a sub vocab by a list of items. If an item is not in the vocab, its frequency is zero.

sum()

Get total sum of all frequencies. Just a slightly better method name.

sum_freq()

Get total sum of all frequencies.

values()

Module contents