nephosem.core package¶
Submodules¶
nephosem.core.graph module¶
- class nephosem.core.graph.DiGraph¶
Bases:
object
- graph¶
- Type
DiGraph
- add_edge(e_id, from_v, to_v, **kwargs)¶
Add an edge to graph.
- add_node(v_id, **kwargs)¶
Add a node with id and label (optional) to graph.
- property edges¶
- in_degree(v)¶
- property istree¶
- property nodes¶
- out_degree(v)¶
- predecessors(v)¶
- successors(v)¶
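The bookkeeping that DiGraph exposes (nodes, edges, in/out degree, predecessors, successors) can be sketched with plain dictionaries. This is an illustrative stand-in, not nephosem's implementation:

```python
# Minimal sketch of the bookkeeping a directed graph such as DiGraph
# maintains; class and variable names here are illustrative only.
from collections import defaultdict

class TinyDiGraph:
    def __init__(self):
        self.nodes = {}               # node id -> attributes
        self.succ = defaultdict(set)  # node id -> successor ids
        self.pred = defaultdict(set)  # node id -> predecessor ids

    def add_node(self, v_id, **attrs):
        self.nodes[v_id] = attrs

    def add_edge(self, from_v, to_v):
        self.succ[from_v].add(to_v)
        self.pred[to_v].add(from_v)

    def out_degree(self, v):
        return len(self.succ[v])

    def in_degree(self, v):
        return len(self.pred[v])

g = TinyDiGraph()
g.add_node(1, label="give/V")
g.add_node(2, label="boy/NN")
g.add_edge(1, 2)  # e.g. an nsubj dependency from the verb to its subject
```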
- class nephosem.core.graph.Graph(sentence=None, id2node=None)¶
Bases:
object
- add_edge(e_from_node, e_to_node, e_label)¶
Add an edge to graph.
- add_node(v_id, v_label=None)¶
Add a node with id and label (optional) to graph.
- build_graph(sentence=None, id2node=None)¶
Build a graph
- Parameters
sentence (iterable) – A list of dependency relations.
id2node (dict) – Node id to node string mapping.
- build_graph_raw(sentence)¶
Build a graph from raw text (of a sentence)
- Parameters
sentence (iterable) – A list of strings
- property edges¶
- match(path)¶
Match a graph with path.
- Parameters
path (PathTemplate) –
- Returns
valid matches
- Return type
iterable
- property nodes¶
- class nephosem.core.graph.MacroGraph(pattern, target_idx=-1, feature_idx=-1, target_filter={}, feature_filter={})¶
Bases:
nephosem.core.graph.PatternGraph
Class representing a feature graph, inherited from the class PatternGraph, so it has the same structure as the template from which it is generated. Generating a feature object proceeds in three steps:
1. replicate the (tree) structure of the template;
2. set the target node index;
3. set the feature properties for each node (except the target) and each edge.
The feature properties (i.e. True or False) are stored in the attributes of nodes and edges.
- add_match(matched_nodes, matched_edges)¶
Add matched nodes and edges
- Parameters
matched_nodes (dict) – mapping from node index to item string
matched_edges (dict) – mapping from edge index to relation string
- connector = '/'¶
- feature(index=0)¶
Transform a matched node and edge to a feature string
- feature_full(index=0)¶
- feature_old(index=0)¶
Deprecated
- feature_simple(index=0)¶
The resulting feature string is a plain type string: only one feature node (without an edge) appears in the resulting feature representation.
- get_edge_repr(eid, edge_attrs)¶
Get representation of a (matched) edge.
- get_node_repr(nid, node_attrs)¶
Get representation of a (matched) node.
- classmethod parse_macro_xml(macroxml, id2patt)¶
Example XML:
```xml
<target-feature-macro id="1">
  <sub-graph-pattern id="1"/>
  <target nodeID="2">
    <description>Empty</description>
  </target>
  <feature nodeID="1">
    <description>Words that depend directly on the target.</description>
  </feature>
</target-feature-macro>
```
- Parameters
macroxml (Element) – XML element of a macro.
id2patt (dict) – The dict mapping index to pattern.
- Returns
- Return type
MacroGraph
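The macro XML shown above can be inspected with the standard library; this sketch illustrates the structure that parse_macro_xml receives (the parsing shown here is generic ElementTree code, not nephosem's own logic):

```python
# Parse a macro definition like the documented example to show its shape.
import xml.etree.ElementTree as ET

macro_xml = """
<target-feature-macro id="1">
  <sub-graph-pattern id="1"/>
  <target nodeID="2">
    <description>Empty</description>
  </target>
  <feature nodeID="1">
    <description>Words that depend directly on the target.</description>
  </feature>
</target-feature-macro>
"""

macro = ET.fromstring(macro_xml)
pattern_id = macro.find("sub-graph-pattern").get("id")   # which pattern to use
target_node = macro.find("target").get("nodeID")          # target node index
feature_node = macro.find("feature").get("nodeID")        # feature node index
```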
- preorder_recur(node_dict, edge_dict, curr=0, reprs=None)¶
- classmethod read_csv(fname, target_colname='Target Regex', feature_colname='Feature Regex', sep='\t', header=0, **kwargs)¶
Read feature patterns from a CSV/TSV file. This method uses pandas.read_csv(). If there is any reading error, please refer to the documentation of pandas.
- Parameters
fname (str) – Filename of feature patterns in Kris’ notation.
target_colname (str, default 'Target Regex') – Column name of the target regular expression in the CSV/TSV file.
feature_colname (str, default 'Feature Regex') – Column name of the feature regular expression in the CSV/TSV file.
sep (str, default '\t') – Delimiter to use.
- classmethod read_xml(fname, patterns)¶
Example XML:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<target-feature-list>
  <target-fmt>
    <node-fmt>
      <LEMMA group="1"/>
      <POS group="1"/>
      <string connector="/">LEMMA/POS</string>
    </node-fmt>
  </target-fmt>
  <feature-fmt>
    <node-fmt>
      <LEMMA group="1"/>
      <POS group="1"/>
      <string connector="/">LEMMA/POS</string>
    </node-fmt>
    <edge-fmt>DEPREL</edge-fmt>
  </feature-fmt>
  <target-feature-macro id="1">
    ...
  </target-feature-macro>
  ...
</target-feature-list>
```
- Parameters
fname (str) –
patterns (list of PatternGraph) –
- Returns
- Return type
list of MacroGraph
- set_feature(feature_filter={})¶
- set_target(target)¶
- show(v_label='label', e_label='rel', figsize=(5.0, 5.0))¶
- show_match(index=1, v_label='label', e_label='rel', figsize=(5.0, 5.0))¶
- property size¶
- target(index=0, mode='type')¶
Return the target of the index match.
- class nephosem.core.graph.PatternGraph(nodes=None, edges=None, graph=None)¶
Bases:
nephosem.core.graph.DiGraph
Class representing a dependency template tree/graph.
- connector = '/'¶
- static islinear(template)¶
- match_edge(edge, idx=0)¶
Match the passing sentence edge with the corresponding edge in the pattern graph.
- Parameters
edge (dict) – A dict of edge attributes of the sentence graph. For most cases, there is only one attribute DEPREL.
idx (int or tuple of int) – An integer index of the corresponding edge of the pattern graph.
- match_node(node, idx=0)¶
Match the passing sentence node with the corresponding node in the pattern graph.
- Parameters
node (dict) – A dict of node attributes of the sentence graph. e.g. {‘FORM’: ‘The’, ‘POS’: ‘DT’, ‘LEMMA’: ‘the’}
idx (int) – An integer index of the corresponding node of the pattern graph.
- classmethod read_csv(fname, target_colname='Target Regex', feature_colname='Feature Regex', sep='\t', header=0, **kwargs)¶
Read feature patterns from a CSV/TSV file. This method uses pandas.read_csv(). If there is any reading error, please refer to the documentation of pandas.
- Parameters
fname (str) – Filename of feature patterns in Kris’ notation.
target_colname (str, default 'Target Regex') – Column name of the target regular expression in the CSV/TSV file.
feature_colname (str, default 'Feature Regex') – Column name of the feature regular expression in the CSV/TSV file.
sep (str, default '\t') – Delimiter to use.
- classmethod read_graphml(fname)¶
- show(v_label='label', e_label='rel', figsize=(5.0, 5.0))¶
- class nephosem.core.graph.SentenceGraph(nodes=None, edges=None, sentence=None, formatter=None, mode='type', fname='', settings=None)¶
Bases:
nephosem.core.graph.DiGraph
- build_graph(sentence)¶
Build a graph from raw text (of a sentence)
- Parameters
sentence (iterable) – A list of strings
- generate_graph(nodes, edges)¶
- match_pattern(macro)¶
Match a sentence with a feature pattern (and target pair). Append the results to the attribute lists matched_nodes and matched_edges of the feature object. A matched node is a dict mapping node index to type string. A matched edge is a dict mapping edge index to dependency relation. e.g. one match would be:
nodes: {1: 'boy/NN', 2: 'give/V', 3: 'girl/NN'}
edges: {(2, 1): 'nsubj', (2, 3): 'iobj'}
- Parameters
macro (MacroGraph) – The feature pattern to be matched to the sentence.
- match_target_feature(feature)¶
Match a graph with a path (a tree/graph object).
- Parameters
feature (MacroGraph) –
- Returns
valid matches
- Return type
iterable
- show(v_label='label', e_label='rel', figsize=(5.0, 5.0))¶
- nephosem.core.graph.match_sub_template(sentence=None, feature=None, valid_nodes=None, valid_edges=None)¶
nephosem.core.handler module¶
Handler Class
- class nephosem.core.handler.BaseHandler(settings, workers=0, **kwargs)¶
Bases:
nephosem.core.handler.Paralleler
This is a base class of all handler classes.
It contains the framework for the multicore method. The purpose of this class is to provide a reference interface for concrete handler implementations. At the same time, functionality that we expect to be common to those implementations is provided here to avoid code duplication.
A typical procedure distributes the corpus files over worker processes and merges their results.
- tmpindicator¶
A string indicator for the temporary folder name used by all methods of this Class.
- Type
str
- settings¶
- Type
dict
- corpus_path¶
- Type
str
- output_path¶
- Type
str
- encoding¶
Default ‘utf-8’
- Type
str
- input_encoding¶
File encoding of input corpus files. Default ‘utf-8’
- Type
str
- output_encoding¶
File encoding of output files. It can differ from input_encoding: e.g. if input_encoding is ‘latin-1’ but we do not want to use it for the output files, we can use ‘utf-8’ instead. Default ‘utf-8’.
- Type
str
Notes
A subclass should initialize the following attributes:
self.settings - settings dict
- prepare_fnames(fnames=None)¶
Prepare corpus file names based on the fnames file path or the corpus path. If a valid fnames is passed, read all file names recorded inside this file. If not, use the corpus directory self.corpus_path and read all file names inside this folder.
- Parameters
fnames (str or list, optional) – If str, then it is the filename of a file which records all (a user wants to process) file names of a corpus. If list, then it contains a list of filenames.
- process(fnames, **kwargs)¶
Process files in the fnames list.
- Parameters
fnames (iterable) – This fnames is a list of file names.
- Returns
- For different tasks, this method returns different objects:
ItemFreqHandler: returns a Vocab object,
ColFreqHandler: returns a TypeTokenMatrix co-occurrence frequency matrix,
TokenHandler: returns a Python dict mapping the type strings to their lists of tokens (TokenNode objects).
- Return type
object
- process_right_window(*args, **kwargs)¶
- tmpindicator = ''¶
- update_one_file(filename, data, **kwargs)¶
This is the template method updating data of one corpus file.
- update_one_match(*args, **kwargs)¶
- class nephosem.core.handler.Paralleler(workers=0, **kwargs)¶
Bases:
object
This is a base class of all classes that work in parallel. It contains a framework for parallel tasks. The procedure is as follows:
- The main entrance is the method process(). It creates a job queue (for tasks sent to all sub-processes) and a result queue (for results returned by all sub-processes). It also creates a number of worker processes, each of which executes the method _worker_loop(). After creating the worker processes, it fills the job queue from the input fnames (data). The _worker_loop() processes the tasks in the job queue and puts the results in the result queue. Finally, the _process_results() method post-processes (merges) the results of each worker process into one final Python object.
- The default _job_producer() method fills the job queue from the input fnames. Each fname is fetched and processed by a worker process (_worker_loop()). A sub-class inheriting Paralleler can override this method and feed other data into the job queue, as long as each piece of data can be consumed by the _do_process_job() method inside the _worker_loop().
- The _worker_loop() method fetches one piece of data (by default, one fname), passes it to the method _do_process_job(), and puts the corresponding result in the result queue. Any sub-class should implement this method for its task.
- The _do_process_job() method takes one piece of data, passes it to a function for processing, and returns the result to the _worker_loop() method. Any sub-class should implement this method.
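The job-queue / result-queue pattern described above can be sketched as follows. For brevity this sketch uses threads and a trivial stand-in job (Paralleler itself spawns worker processes and processes corpus files); all names besides the pattern are illustrative:

```python
# Sketch of the producer / worker-loop / merge pattern used by Paralleler.
import queue
import threading

def worker_loop(jobs, results):
    """Fetch jobs until a None sentinel arrives (stand-in for _worker_loop)."""
    while True:
        fname = jobs.get()
        if fname is None:          # sentinel: no more jobs for this worker
            break
        # stand-in for _do_process_job: here, just measure the name length
        results.put((fname, len(fname)))

jobs, results = queue.Queue(), queue.Queue()
workers = [threading.Thread(target=worker_loop, args=(jobs, results))
           for _ in range(2)]
for w in workers:
    w.start()

for fname in ["a.txt", "bb.txt", "ccc.txt"]:  # stand-in for _job_producer
    jobs.put(fname)
for _ in workers:                              # one sentinel per worker
    jobs.put(None)
for w in workers:
    w.join()

merged = {}                                    # stand-in for _process_results
while not results.empty():
    fname, res = results.get()
    merged[fname] = res
```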
- workers¶
Number of CPU cores to be used in parallel.
- Type
integer
- tmpdir¶
The path of the temporary directory used by the Class
- Type
str
Notes
- property pid¶
- process(fnames, **kwargs)¶
Process files in the fnames list.
- Parameters
fnames (iterable) – This fnames is a list of file names.
- Returns
- For different tasks, this method returns different objects:
ItemFreqHandler: returns a Vocab object,
ColFreqHandler: returns a TypeTokenMatrix co-occurrence frequency matrix,
TokenHandler: returns a Python dict mapping the type strings to their lists of tokens (TokenNode objects).
- Return type
object
- property subtmpdir¶
When the class is instanced in a subprocess, a temporary folder for this subprocess will be created for its temporary files.
nephosem.core.matrix module¶
Matrix Classes
Usage examples¶
Construct a TypeTokenMatrix with a Python dict e.g.
>>> from nephosem.tests.utils import common_texts
>>> from nephosem import TypeTokenMatrix
- class nephosem.core.matrix.TypeTokenMatrix(matrix, row_items, col_items, deep=True, **kwargs)¶
Bases:
nephosem.core.matrix.BaseMatrix
Examples
Construction of a toy TypeTokenMatrix object:
>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> row_items = ['row0', 'row1', 'row2']
>>> col_items = ['col0', 'col1', 'col2', 'col3']
>>> sparr = np.array([[-5, 0, -3, -2], [-1, 0, 0, 1], [2, 0, 4, 5]])
>>> spmx = csr_matrix(sparr)
>>> spMTX = TypeTokenMatrix(spmx, row_items, col_items)
>>> print(spMTX)
>>> dsmx = np.array([[-5, -4, -3, -2], [-1, 0, 0, 1], [2, 3, 4, 5]])
>>> nmMTX = TypeTokenMatrix(dsmx, row_items, col_items)
>>> print(nmMTX)
>>> sqarr = np.array([1, 2, 3, 4, 5, 6])
>>> from scipy.spatial.distance import squareform
>>> sqmx = squareform(sqarr)
>>> sqMTX = TypeTokenMatrix(sqmx, col_items, col_items)
>>> print(sqMTX)
- property colid2item¶
- concatenate(targetmx, axis=0)¶
Concatenate target matrix with self.
- Parameters
targetmx (TypeTokenMatrix) – Target matrix.
axis (int) – Axis of concatenation. If axis = 0, concatenate the targetmx as new rows of the matrix. If axis = 1, concatenate the targetmx as new columns of the matrix.
- copy()¶
- count_nonzero(axis=0)¶
Count the number of nonzero values for each row or each column.
- property dataframe¶
- deepcopy()¶
- describe()¶
Generates descriptive information of the matrix. TODO: improve.
- drop(axis=0, n_nonzero=0, **kwargs)¶
Drop rows that have at most n_nonzero nonzero values.
- Parameters
axis (int) – If axis is 0, drop rows that satisfy the given criteria. If axis is 1, drop columns that satisfy the given criteria.
n_nonzero (int) – The number of nonzero values in each row. If n_nonzero is 0, drop all empty rows. If n_nonzero is 1, drop all rows that only have 1 nonzero value or less. …
- Returns
Dropped matrix
- Return type
TypeTokenMatrix
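The drop criterion can be illustrated with plain lists standing in for matrix rows (names here are illustrative, not the actual implementation):

```python
# Keep only rows with strictly more than n_nonzero nonzero values,
# mirroring the documented drop(axis=0, n_nonzero=...) behavior.
rows = {"row0": [-5, 0, -3, -2],
        "row1": [-1, 0, 0, 1],
        "row2": [0, 0, 0, 0]}
n_nonzero = 1
kept = {name: vals for name, vals in rows.items()
        if sum(v != 0 for v in vals) > n_nonzero}
# row0 has 3 nonzeros, row1 has 2, row2 has 0, so row2 is dropped.
```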
- drop_empty(axis=0, explicit=False)¶
Drop empty rows and return a dropped matrix.
- Parameters
axis (int) – 0 or 1
explicit (bool) – If True, rows that only contain explicit zeros are also considered empty. Else, only rows with no stored values are considered empty.
- drop_empty_rows(explicit=False)¶
Drop empty rows and return a dropped matrix.
- Parameters
explicit (bool) – If True, rows that only contain explicit zeros are also considered empty. Else, only rows with no stored values are considered empty.
- drop_zero_rows()¶
Drop rows with only zero values and return the dropped matrix
- empty_rows(explicit=False)¶
Show types/tokens which have empty rows. If this matrix is a token-context weight matrix, the row of such a token would have all zero ppmi values.
- Parameters
explicit (bool) – If True, rows that only have explicit zeros are also empty. Else, rows that have no stored values are empty.
- Returns
- Return type
a list of row indices
- equal(othermx)¶
- classmethod from_dataframe(df, issparse=True)¶
- get_colloc_contexts(item)¶
Get collocate context features.
- Parameters
item –
- get_matrix()¶
Get matrix.
- Returns
- Return type
numpy.ndarray or scipy.sparse.csr_matrix
- Raises
NotImplementedError –
- property item2colid¶
Return a dict mapping (column) items to their corresponding indices. If self._item2colid has already been generated, return it; otherwise, create the self._item2colid dict and return it.
- Returns
self._item2colid
- Return type
dict
- property item2rowid¶
Return a dict mapping (row) items to their corresponding indices. If self._item2rowid has already been generated, return it; otherwise, create the self._item2rowid dict and return it.
- Returns
self._item2rowid
- Return type
dict
- classmethod load(filename, encoding='utf-8', pack=True)¶
- Parameters
filename (".../xx.wcmx.freq.pac") –
encoding (str) – Default ‘utf-8’.
pack (bool) – Indicates whether the file is packed or not.
- Returns
- Return type
meta data and matrix
- merge(targetmx)¶
Merge two TypeTokenMatrix objects.
- Parameters
targetmx (
TypeTokenMatrix
) – Target matrix
- property meta_data¶
Meta data of the matrix.
- most_similar(item, k=10, descending=False)¶
Get most similar items of the target item.
- Parameters
item (str) – Row item (word)
k (int) – Number of returned similar items.
descending (bool) – If descending is True, sort the elements in descending order of the values; otherwise, sort them in ascending order. The values are distances or similarities: for a similarity matrix, set descending to True, since we want the elements with the largest values (similarities); for a distance matrix, set descending to False. A similarity rank matrix behaves like a similarity matrix.
- Returns
- Return type
a list of elements
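The core of most_similar for a distance matrix can be sketched with a plain dict standing in for one matrix row (illustrative names, not the actual implementation): with descending=False, the k items with the smallest distances come first.

```python
# One row of a hypothetical word-word distance matrix for the item "cat":
distances = {"dog": 0.25, "car": 0.90, "kitten": 0.05, "mouse": 0.40}
k = 2
# Ascending sort by distance: the nearest (most similar) items first.
nearest = sorted(distances, key=distances.get)[:k]
```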
- multiply(other)¶
- print_matrix(n_rows=7, n_cols=7)¶
Prints n_rows and n_cols of the matrix. If either is set to None, print the standard amount.
- Parameters
n_rows –
n_cols –
- classmethod read_csv(filename, sep='\t', index_col=None, header='infer', issparse=False, encoding='utf-8')¶
Read a comma(tab)-separated values (csv/tsv) file.
- Parameters
filename (str) – Filename of the csv file.
sep (str, default '\t') – Field delimiter to use.
header (int or list of ints, default 'infer') – Row number(s) to use as the column names, and the start of the data.
index_col (int or sequence or False, default None) – Column to use as the row labels of the DataFrame.
issparse (bool) – True for sparse matrix (i.e. frequency matrix). False for dense matrix (i.e. distance matrix).
encoding (str, default 'utf-8') – Encoding to use for UTF when reading/writing (ex. ‘utf-8’).
- reorder(item_list, axis=0)¶
Reorder the matrix based on a new item list.
- Parameters
item_list (list of str) – A list of (string) items.
axis (int, optional) – 0 for row, 1 for column
- property rowid2item¶
- sample(percent=0.1, seed=-1, replace=False)¶
Sample the matrix based on row
- Parameters
percent (float) – Percentage of the row dimension to sample.
seed (int, default -1) – Random seed for sampling. When the seed is set to a non-default (non-negative) value, the method uses it for the numpy random sampling operation.
- save(filename, encoding='utf-8', pack=True, verbose=True)¶
- property shape¶
- spmatrix_to_dict()¶
Only for sparse matrix
- submatrix(row=None, col=None)¶
Select a submatrix. If self is a sparse matrix (i.e. word-context frequency matrix), you can either specify only row or only col or both. If self is a square matrix (i.e. word-word distance matrix), normally you should select a square submatrix, therefore specify both row and col with the same list.
- Parameters
row (iterable (list of str)) – Only support a list of str
col (iterable (list of str)) – Only support a list of str
- Returns
submatrix
- Return type
TypeTokenMatrix
- sum(axis=None)¶
Sum the matrix over the given axis. If the axis is None, sum over both rows and columns, returning a scalar.
For a matrix
    | 1, 0, 2 |
    | 0, 3, 4 |
axis = 1 gives the row sums [3, 7], and axis = 0 gives the column sums [1, 3, 6].
In a co-occurrence contingency table, these sums correspond to the marginals:
                   Collocate present   Collocate absent   Totals
    Node present   c_a_b               c_a_nb             R1
    Node absent    c_na_b              c_na_nb            R2
    Totals         C1                  C2                 N
- Parameters
axis (int) – If axis == 1, sum over rows. If axis == 0, sum over columns. If axis is None, return the total sum value (scalar).
- Returns
A python dict with row/column items as keys and sum of that row/column as values.
- Return type
dict
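The axis semantics above can be checked with plain lists (this sketch is not the method itself, which returns a dict keyed by row/column items):

```python
# Row sums (axis=1), column sums (axis=0), and total sum (axis=None)
# for the small example matrix from the docstring.
matrix = [[1, 0, 2],
          [0, 3, 4]]
row_sums = [sum(row) for row in matrix]          # axis = 1 -> [3, 7]
col_sums = [sum(col) for col in zip(*matrix)]    # axis = 0 -> [1, 3, 6]
total = sum(row_sums)                            # axis = None -> 10
```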
- to_csv(filename, sep='\t', index=True, header=True, encoding='utf-8', verbose=True)¶
Write DataFrame to a comma-separated values (csv) file.
- Parameters
filename (str) –
sep (character, default '\t') – Field delimiter for the output file.
index (boolean, default True) – Write row names (index).
header (boolean, default True) – Write out the column names.
encoding (string, optional) – A string representing the encoding to use in the output file; defaults to ‘utf-8’.
verbose (boolean) –
- todense()¶
- transpose()¶
Computes the transpose of a matrix.
- Returns
The transpose.
- Return type
numpy.ndarray
nephosem.core.terms module¶
- class nephosem.core.terms.CorpusFormatter(settings)¶
Bases:
object
- connector = '/'¶
- get(match, column, fid=None, lid=None)¶
Get the content of the corresponding column from a corpus line.
- get_colloc(match)¶
Get token string from match object.
- get_token(match, fid, lid)¶
Get token string from match object.
- get_type(match)¶
Get type string from match object.
- left_bound_machine(line)¶
- match_line(line, form=None)¶
- right_bound_machine(line)¶
- separator_line_machine(line)¶
- single_bound_machine(line)¶
- class nephosem.core.terms.Getter(settings)¶
Bases:
object
- static get_func(get_form_string)¶
- Parameters
get_form_string (str) –
- get_item(match, form)¶
- get_token(match, fid, lid)¶
- get_type(match)¶
- init_machine(line)¶
- left_bound_machine(line)¶
- property lemma¶
- property pos¶
- right_bound_machine(line)¶
- separator_line_machine(line)¶
- single_bound_machine(line)¶
- token_line_machine(line, fid, lid)¶
- property word¶
- word_line_machine(line)¶
- class nephosem.core.terms.ItemNode(match=None, formatter=None, word=None, lemma=None, pos=None, type_fmt=None, colloc_fmt=None, **kwargs)¶
Bases:
object
This class represents an item node parsed by line-machine regular expression. The parsed item node consists of ‘word’, ‘lemma’ and ‘pos’ (if a file line has them).
- connector = '/'¶
- to_colloc(colloc_fmt=None)¶
Get a collocate string based on the item node.
- Parameters
colloc_fmt (str) – colloc format string, i.e. ‘lemma/pos’
- to_type(type_fmt=None)¶
Get a type string based on the item node.
- Parameters
type_fmt (str) – type format string, i.e. ‘lemma/pos’
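How a type string is assembled from an item node's fields can be sketched as follows, assuming the default connector '/' and a 'lemma/pos' format (the field dict and names are illustrative):

```python
# Compose a type string such as 'boy/NN' from parsed item fields,
# as to_type does with a type format string like 'lemma/pos'.
connector = "/"
fields = {"word": "boys", "lemma": "boy", "pos": "NN"}
type_fmt = "lemma/pos"
type_str = connector.join(fields[f] for f in type_fmt.split(connector))
```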
- class nephosem.core.terms.TokenNode(token_str=None, token_fmt=None, match=None, formatter=None, fid='unknown', lid='-1', word=None, pos=None, lemma=None, lcollocs=None, rcollocs=None, **kwargs)¶
Bases:
nephosem.core.terms.ItemNode
A TokenNode normally has left and right context/collocate ItemNodes.
- connector¶
For connecting ‘word’, ‘lemma’, ‘pos’, ‘fid’ and ‘lid’. default ‘/’
- Type
str
- lcollocs¶
- Type
a list of left collocates
- rcollocs¶
Each collocate is an ItemNode object.
- Type
a list of right collocates
- connector = '/'¶
- classmethod gen_token_from_json_data(json_data)¶
- property json_data¶
- property lspan¶
- property rspan¶
- property token¶
- class nephosem.core.terms.TypeNode(match=None, formatter=None, type_fmt=None, type_str=None, word=None, lemma=None, pos=None, tokens=None, **kwargs)¶
Bases:
nephosem.core.terms.ItemNode
This class represents a type node which, at the token level, contains all its token appearances.
The following are some important attributes.
- lemma¶
- Type
str
- pos¶
- Type
str
- tokens¶
an appearance includes the token and its collocate types
- Type
a list of appearances / tokens
- append_token(token)¶
- property collocs¶
- connector = '/'¶
- property freq¶
frequency of the type
- get_collocs()¶
Get the collocates of all tokens
- classmethod load(filename, encoding='utf-8')¶
- classmethod merge(tns)¶
Merge a list of TypeNode instances into one
- Parameters
tns (a list of TypeNode instances) –
- sample(n=300, method='random')¶
Select n tokens/appearances from all.
- Parameters
n (int) – default is 300
method (str) – ‘random’, …
- Returns
- Return type
A new TypeNode object
- save(filename, fmt='json', encoding='utf-8', verbose=True)¶
- property type¶
- class nephosem.core.terms.Window(lspan=10, rspan=10)¶
Bases:
object
- left_span¶
left span, window size of left collocates
- Type
int
- right_span¶
right span, window size of right collocates
- Type
int
- left¶
left window
- Type
deque
- right¶
right window
- Type
deque
- node¶
center node
- static init_span(size)¶
- update(cur)¶
Current window: [l1, …] [node] [r1, …]. After update(cur), it becomes [l2, …, node] [r1] [r2, …, cur]: the old node enters the left window, r1 becomes the new center node, and cur enters the right window.
- Parameters
cur –
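The sliding-window update can be sketched with collections.deque, which Window uses for its left and right windows (the variable names and spans here are illustrative):

```python
# Sketch of Window.update: push the old node into the left window,
# promote the head of the right window to center node, and append the
# incoming token to the right window.
from collections import deque

left = deque(["l1", "l2"], maxlen=2)    # left window, span 2
right = deque(["r1", "r2"], maxlen=2)   # right window, span 2
node = "node"                           # center node

def update(cur):
    global node
    left.append(node)        # old node enters the left window; l1 drops off
    node = right.popleft()   # r1 becomes the new center node
    right.append(cur)        # incoming token enters the right window

update("cur")
```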
nephosem.core.vocab module¶
Vocabulary Class
Usage examples¶
Initialize a vocabulary with a Python dict e.g.
>>> from nephosem.tests.utils import common_texts
>>> from nephosem import Vocab
- class nephosem.core.vocab.Vocab(data=None, encoding='utf-8')¶
Bases:
object
- copy()¶
Just to have a better name for deepcopy().
- property dataframe¶
Generate the dataframe dynamically every time it is called. Items are sorted first by frequency (descending) and then in ascending alphabetical order.
- deepcopy()¶
- describe()¶
Give a description of Vocab.
- equal(vocab2)¶
Check whether two vocabularies are equal.
- get_dict()¶
- get_item_list(sorting='alpha', descending=False)¶
Get a sorted list of items based on a sorting order. Calls utils.sort_dict().
- Parameters
sorting (str) – ‘freq’ for frequency order, ‘alpha’ for alphabetic order.
descending (bool) – If True, sort dict by descending order of ‘sorting’. Else, sort dict by ascending order of ‘sorting’.
- Returns
sorted list of items in the vocabulary
- Return type
list
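The documented sort order for sorting='freq' with descending=True (frequency descending, ties broken alphabetically) can be sketched with a plain frequency dict (illustrative data, not the method itself):

```python
# Sort vocabulary items by frequency (descending), breaking ties
# alphabetically, as get_item_list does.
freqs = {"the": 10, "boy": 4, "apple": 4, "give": 7}
items = sorted(freqs, key=lambda w: (-freqs[w], w))
# "apple" and "boy" share frequency 4, so they appear in alphabetical order.
```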
- increment(key, inc=1)¶
Increment the value of a key by ‘inc’.
- isEmpty()¶
- items()¶
Same as Python dict.items()
- keys()¶
- classmethod load(filename, encoding='utf-8', fmt='json')¶
Load vocabulary (frequency list) from file. The default file format to load the vocabulary is ‘json’.
- Parameters
filename (str) –
encoding (str) – ‘utf-8’, ‘latin-1’, …
fmt (str) – ‘json’, ‘plain’, ‘txt’ (same as plain)
- Returns
- Return type
class: ~nephosem.Vocab
- make_type_file(type_list, out_fname, encoding='utf-8')¶
This method could be used in the token level workflow for generating a typeSelection file
- Parameters
type_list (a list of types) –
out_fname (output file name) –
encoding –
- match(column_name='item', pattern='.')¶
Match items by a given regular expression pattern.
- Parameters
pattern (str) – Regular expression pattern
column_name (str) – ‘item’ or ‘freq’, normally only use ‘item’
- Returns
- Return type
list
- static regex_item(item, pattern)¶
Match an item by a given regular expression pattern.
- save(filename, encoding=None, fmt='json', verbose=True)¶
Save vocabulary to file.
- Parameters
filename (str) –
encoding (str) – Encoding format: ‘utf-8’, ‘latin-1’ … If not provided, use encoding of Vocab.
fmt (str) –
File format: ‘json’, ‘plain’. The default file format is ‘json’. The ‘plain’ format would save frequency dict in the following format:
type-string[TAB]frequency
One type per line.
verbose (bool) – Show information or not.
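The 'plain' format described above is one "type-string[TAB]frequency" pair per line. A round-trip sketch with an in-memory buffer (the data is illustrative; Vocab.save writes to a file):

```python
# Write and read back the 'plain' vocabulary format:
# one type string and its frequency per line, tab-separated.
import io

freqs = {"boy/NN": 4, "give/V": 7}
buf = io.StringIO()
for type_str, freq in freqs.items():
    buf.write(f"{type_str}\t{freq}\n")

buf.seek(0)
loaded = {}
for line in buf:
    type_str, freq = line.rstrip("\n").split("\t")
    loaded[type_str] = int(freq)
```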
- select_items(word)¶
This method takes a word (or lemma) as input and returns a Vocab object. Whether the provided word matches the items in the vocab depends on the type format of the items: if the item format is ‘lemma/pos’, a ‘lemma’ string should be provided; if it is ‘word/pos’, a ‘word’ string should be provided.
- select_subsets(specif_words, n=300, method='random', indent='')¶
Select subsets of n appearances. Here we select, for each word, which n items (appearances) will be retrieved from the corpus.
- Parameters
specif_words – A list of (specified) words.
n – Number of selected appearances. If n > the frequency of an item, select all appearances; otherwise, randomly (by default) select n appearances.
method – Selection method: ‘random’, …
indent – Indentation.
- setFILTER(value)¶
- subvocab(items)¶
Select a sub vocab by a list of items. If an item is not in the vocab, its frequency is zero.
- sum()¶
Get total sum of all frequencies. Just a slightly better method name.
- sum_freq()¶
Get total sum of all frequencies.
- values()¶