nephosem.core package

Submodules

nephosem.core.graph module

class nephosem.core.graph.DiGraph

Bases: object

graph
Type

DiGraph

add_edge(e_id, from_v, to_v, **kwargs)

Add an edge to graph.

add_node(v_id, **kwargs)

Add a node with id and label (optional) to graph.

property edges
in_degree(v)
property istree
property nodes
out_degree(v)
predecessors(v)
successors(v)
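The interface above can be illustrated with a minimal pure-Python sketch. This is only an illustration of the API shape (add_node/add_edge with degrees, predecessors, and successors), not nephosem’s actual implementation; the class and storage layout here are hypothetical.

```python
# Minimal illustration of a directed-graph API shaped like DiGraph above.
# Hypothetical class: not nephosem's implementation.
class TinyDiGraph:
    def __init__(self):
        self.nodes = {}   # v_id -> attribute dict
        self.edges = {}   # e_id -> (from_v, to_v)

    def add_node(self, v_id, **kwargs):
        self.nodes[v_id] = kwargs

    def add_edge(self, e_id, from_v, to_v, **kwargs):
        self.edges[e_id] = (from_v, to_v)

    def in_degree(self, v):
        return sum(1 for f, t in self.edges.values() if t == v)

    def out_degree(self, v):
        return sum(1 for f, t in self.edges.values() if f == v)

    def predecessors(self, v):
        return [f for f, t in self.edges.values() if t == v]

    def successors(self, v):
        return [t for f, t in self.edges.values() if f == v]


g = TinyDiGraph()
g.add_node(1, label='give/V')
g.add_node(2, label='boy/NN')
g.add_edge((1, 2), 1, 2, rel='nsubj')
print(g.out_degree(1), g.in_degree(2), g.successors(1))
```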
class nephosem.core.graph.Graph(sentence=None, id2node=None)

Bases: object

add_edge(e_from_node, e_to_node, e_label)

Add an edge to graph.

add_node(v_id, v_label=None)

Add a node with id and label (optional) to graph.

build_graph(sentence=None, id2node=None)

Build a graph

Parameters
  • sentence (iterable) – A list of dependency relations.

  • id2node (dict) – Node id to node string mapping.

build_graph_raw(sentence)

Build a graph from raw text (of a sentence)

Parameters

sentence (iterable) – A list of strings

property edges
match(path)

Match a graph with path.

Parameters

path (PathTemplate) –

Returns

valid matches

Return type

iterable

property nodes
class nephosem.core.graph.MacroGraph(pattern, target_idx=-1, feature_idx=-1, target_filter={}, feature_filter={})

Bases: nephosem.core.graph.PatternGraph

Class representing a feature graph, inherited from the class PatternGraph. So it will have the same structure as the template from which it is generated. The generating process of a feature object would be:

  1. replicate the (tree) structure of the template

  2. set the target node index

  3. set feature properties for each node (except for the target) and each edge.

The feature properties (i.e. True or False) would be stored in attributes of nodes and edges

add_match(matched_nodes, matched_edges)

Add matched nodes and edges

Parameters
  • matched_nodes (dict) – mapping from node index to item string

  • matched_edges (dict) – mapping from edge index to relation string

connector = '/'

Class representing a dependency template tree/graph

feature(index=0)

Transform a matched node and edge to a feature string

feature_full(index=0)
feature_old(index=0)

Deprecated

feature_simple(index=0)

The resulting feature string is a normal type string. We only have one feature node (without an edge) in the resulting feature representation.

get_edge_repr(eid, edge_attrs)

Get representation of an (matched) edge.

get_node_repr(nid, node_attrs)

Get representation of a (matched) node.

classmethod parse_macro_xml(macroxml, id2patt)

Example XML:

```xml
<target-feature-macro id="1">
  <sub-graph-pattern id="1"/>
  <target nodeID="2">
    <description>Empty</description>
  </target>
  <feature nodeID="1">
    <description>Words that depend directly on the target.</description>
  </feature>
</target-feature-macro>
```

Parameters
  • macroxml (Element) – XML element of a macro

  • id2patt (dict) – The dict mapping index to pattern

Returns

Return type

MacroGraph

preorder_recur(node_dict, edge_dict, curr=0, reprs=None)
classmethod read_csv(fname, target_colname='Target Regex', feature_colname='Feature Regex', sep='\t', header=0, **kwargs)

Read feature patterns from a CSV/TSV file. This method uses pandas.read_csv(). If there is any reading error, please refer to the documentation of pandas.

Parameters
  • fname (str) – Filename of feature patterns in Kris’ notation.

  • target_colname (str, default 'Target Regex') – Column name of the target regular expression in the CSV/TSV file.

  • feature_colname (str, default 'Feature Regex') – Column name of the feature regular expression in the CSV/TSV file.

  • sep (str, default '\t') – Delimiter to use.
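The expected file layout can be sketched with the standard library (the actual method delegates to pandas.read_csv; the column names below follow the defaults above, and the regex contents are placeholders, not real patterns in Kris’ notation):

```python
import csv
from io import StringIO

# A tab-separated file with one pattern per row; regexes are placeholders.
tsv = ("Target Regex\tFeature Regex\n"
       "\\w+/N\\w*\t\\w+/V\\w*\n")

rows = list(csv.DictReader(StringIO(tsv), delimiter='\t'))
print(rows[0]['Target Regex'])
```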

classmethod read_xml(fname, patterns)

Example XML:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<target-feature-list>
  <target-fmt>
    <node-fmt>
      <LEMMA group="1"/>
      <POS group="1"/>
      <string connector="/">LEMMA/POS</string>
    </node-fmt>
  </target-fmt>
  <feature-fmt>
    <node-fmt>
      <LEMMA group="1"/>
      <POS group="1"/>
      <string connector="/">LEMMA/POS</string>
    </node-fmt>
    <edge-fmt>DEPREL</edge-fmt>
  </feature-fmt>
  <target-feature-macro id="1">
  </target-feature-macro>
  ...
</target-feature-list>
```

Parameters
Returns

Return type

list of MacroGraph

set_feature(feature_filter={})
set_target(target)
show(v_label='label', e_label='rel', figsize=(5.0, 5.0))
show_match(index=1, v_label='label', e_label='rel', figsize=(5.0, 5.0))
property size
target(index=0, mode='type')

Return the target of the index match.

class nephosem.core.graph.PatternGraph(nodes=None, edges=None, graph=None)

Bases: nephosem.core.graph.DiGraph

connector = '/'

Class representing a dependency template tree/graph

static islinear(template)
match_edge(edge, idx=0)

Match the passing sentence edge with the corresponding edge in the pattern graph.

Parameters
  • edge (dict) – A dict of edge attributes of the sentence graph. For most cases, there is only one attribute DEPREL.

  • idx (int or tuple of int) – An integer index of the corresponding edge of the pattern graph.

match_node(node, idx=0)

Match the passing sentence node with the corresponding node in the pattern graph.

Parameters
  • node (dict) – A dict of node attributes of the sentence graph. e.g. {‘FORM’: ‘The’, ‘POS’: ‘DT’, ‘LEMMA’: ‘the’}

  • idx (int) – An integer index of the corresponding node of the pattern graph.

classmethod read_csv(fname, target_colname='Target Regex', feature_colname='Feature Regex', sep='\t', header=0, **kwargs)

Read feature patterns from a CSV/TSV file. This method uses pandas.read_csv(). If there is any reading error, please refer to the documentation of pandas.

Parameters
  • fname (str) – Filename of feature patterns in Kris’ notation.

  • target_colname (str, default 'Target Regex') – Column name of the target regular expression in the CSV/TSV file.

  • feature_colname (str, default 'Feature Regex') – Column name of the feature regular expression in the CSV/TSV file.

  • sep (str, default '\t') – Delimiter to use.

classmethod read_graphml(fname)
show(v_label='label', e_label='rel', figsize=(5.0, 5.0))
class nephosem.core.graph.SentenceGraph(nodes=None, edges=None, sentence=None, formatter=None, mode='type', fname='', settings=None)

Bases: nephosem.core.graph.DiGraph

build_graph(sentence)

Build a graph from raw text (of a sentence)

Parameters

sentence (iterable) – A list of strings

generate_graph(nodes, edges)
match_pattern(macro)

Match a sentence with a feature pattern (and target pair). Append the results to the attribute lists matched_nodes and matched_edges of the feature object. A matched node is a dict mapping node index to type string. A matched edge is a dict mapping edge index to dependency relation. e.g. one match would be:

  • nodes: {1: ‘boy/NN’, 2: ‘give/V’, 3: ‘girl/NN’}

  • edges: {(2, 1): ‘nsubj’, (2, 3): ‘iobj’}

Parameters

macro (MacroGraph) – The feature pattern to be matched to the sentence.

match_target_feature(feature)

Match a graph with a path (a tree/graph object).

Parameters

feature (MacroGraph) –

Returns

valid matches

Return type

iterable

show(v_label='label', e_label='rel', figsize=(5.0, 5.0))
nephosem.core.graph.match_sub_template(sentence=None, feature=None, valid_nodes=None, valid_edges=None)

nephosem.core.handler module

Handler Class

Example

>>>
class nephosem.core.handler.BaseHandler(settings, workers=0, **kwargs)

Bases: nephosem.core.handler.Paralleler

This is a base class of all handler classes.

Contains framework for multicore method. The purpose of this class is to provide a reference interface for concrete handler implementations. At the same time, functionality that we expect to be common for those implementations is provided here to avoid code duplication.

A typical procedure of processing a corpus would be

tmpindicator

A string indicator for the temporary folder name used by all methods of this Class.

Type

str

settings
Type

dict

corpus_path
Type

str

output_path
Type

str

encoding

Default ‘utf-8’

Type

str

input_encoding

File encoding of input corpus files. Default ‘utf-8’

Type

str

output_encoding

File encoding of output files. May differ from input_encoding: e.g., if input_encoding is ‘latin-1’ but we do not want to use it for the output files, we could use ‘utf-8’ for them instead. Default ‘utf-8’.

Type

str

Notes

A subclass should initialize the following attributes:

  • self.settings - settings dict

prepare_fnames(fnames=None)

Prepare corpus file names based on the fnames file path or the corpus path. If a valid fnames is passed, read all file names recorded inside this file. If not, use the corpus directory self.corpus_path and read all file names inside this folder.

Parameters

fnames (str or list, optional) – If str, then it is the filename of a file which records all (a user wants to process) file names of a corpus. If list, then it contains a list of filenames.
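The resolution order described above (an explicit list, a file of names, or a fallback to the corpus directory) can be sketched as follows. The helper name `resolve_fnames` is hypothetical; this only mirrors the documented logic, not the actual method.

```python
import os

def resolve_fnames(fnames=None, corpus_path='.'):
    """Sketch of prepare_fnames' resolution order (hypothetical helper)."""
    if isinstance(fnames, list):          # already a list of file names
        return fnames
    if isinstance(fnames, str) and os.path.isfile(fnames):
        with open(fnames, encoding='utf-8') as f:  # one file name per line
            return [line.strip() for line in f if line.strip()]
    # fall back: collect all files inside the corpus directory
    return sorted(
        os.path.join(root, name)
        for root, _, names in os.walk(corpus_path)
        for name in names
    )

print(resolve_fnames(['a.txt', 'b.txt']))
```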

process(fnames, **kwargs)

Process files in the fnames list.

Parameters

fnames (iterable) – This fnames is a list of file names.

Returns

For different tasks, this method returns different objects:
  • ItemFreqHandler: returns a Vocab object,

  • ColFreqHandler: returns a TypeTokenMatrix co-occurrence frequency matrix,

  • TokenHandler: returns a Python dict mapping the type strings to their lists of tokens (TokenNode objects).

Return type

object

process_right_window(*args, **kwargs)
tmpindicator = ''
update_one_file(filename, data, **kwargs)

This is the template method updating data of one corpus file.

update_one_match(*args, **kwargs)
class nephosem.core.handler.Paralleler(workers=0, **kwargs)

Bases: object

This is a base class of all classes that work parallel. This class contains a framework for parallel tasks. Here is the description of the procedures:

  • The main entrance is the method process(). It creates a job queue (for tasks sent to all sub-processes) and a result queue (for results returned by all sub-processes). It also creates a number of worker processes, each of which executes the method _worker_loop(). After creating the worker processes, it fills the job queue with the input fnames (data). The _worker_loop() processes the tasks in the job queue and puts the results in the result queue. Finally, process() uses the _process_results() method to post-process (merge) the results of each worker process into one final Python object.

  • The default _job_producer() method fills the job queue with the input fnames. One fname is fetched and processed by a worker process (_worker_loop()). A subclass inheriting Paralleler could override this method and feed other data into the job queue; just ensure that one piece of data can be consumed by the _do_process_job() method inside _worker_loop().

  • The _worker_loop() method fetches one piece of data (by default, one fname), passes it to the method _do_process_job(), and sends the corresponding result to the result queue. Any subclass should implement this method for its task.

  • The _do_process_job() method gets one piece of data, sends it to a function for processing, and returns the result back to _worker_loop(). Any subclass should implement this method.
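The job-queue/result-queue control flow above can be sketched with threads. The library itself uses worker processes; this sketch only illustrates the producer/worker/merge pattern, with `len()` standing in for _do_process_job().

```python
import queue
import threading

def worker_loop(jobs, results):
    """Fetch one piece of data at a time and push its result (_worker_loop)."""
    while True:
        fname = jobs.get()
        if fname is None:          # sentinel: no more jobs
            break
        results.put(len(fname))    # stand-in for _do_process_job()

jobs, results = queue.Queue(), queue.Queue()
workers = [threading.Thread(target=worker_loop, args=(jobs, results))
           for _ in range(2)]
for w in workers:
    w.start()
for fname in ['a.txt', 'bb.txt', 'ccc.txt']:   # _job_producer()
    jobs.put(fname)
for _ in workers:                               # one sentinel per worker
    jobs.put(None)
for w in workers:
    w.join()
# _process_results(): merge per-job results into one final object
total = sum(results.get() for _ in range(3))
print(total)
```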

workers

Number of CPU cores to be used parallel.

Type

integer

tmpdir

The path of the temporary directory used by the Class

Type

str

Notes

property pid
process(fnames, **kwargs)

Process files in the fnames list.

Parameters

fnames (iterable) – This fnames is a list of file names.

Returns

For different tasks, this method returns different objects:
  • ItemFreqHandler: returns a Vocab object,

  • ColFreqHandler: returns a TypeTokenMatrix co-occurrence frequency matrix,

  • TokenHandler: returns a Python dict mapping the type strings to their lists of tokens (TokenNode objects).

Return type

object

property subtmpdir

When the class is instanced in a subprocess, a temporary folder for this subprocess will be created for its temporary files.

nephosem.core.matrix module

Matrix Classes

Usage examples

Construct a TypeTokenMatrix with a Python dict e.g.

>>> from nephosem.tests.utils import common_texts
>>> from nephosem import TypeTokenMatrix
>>>
>>>
class nephosem.core.matrix.TypeTokenMatrix(matrix, row_items, col_items, deep=True, **kwargs)

Bases: nephosem.core.matrix.BaseMatrix

Examples

Construction of a toy TypeTokenMatrix object:

>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> row_items = ['row0', 'row1', 'row2']
>>> col_items = ['col0', 'col1', 'col2', 'col3']
>>> sparr = np.array([[-5, 0, -3, -2], [-1, 0, 0, 1], [2, 0, 4, 5]])
>>> spmx = csr_matrix(sparr)  # sparse matrix
>>> spMTX = TypeTokenMatrix(spmx, row_items, col_items)
>>> print(spMTX)
>>> dsmx = np.array([[-5, -4, -3, -2], [-1, 0, 0, 1], [2, 3, 4, 5]])
>>> nmMTX = TypeTokenMatrix(dsmx, row_items, col_items)
>>> print(nmMTX)
>>> sqarr = np.array([1, 2, 3, 4, 5, 6])
>>> from scipy.spatial.distance import squareform
>>> sqmx = squareform(sqarr)
>>> sqMTX = TypeTokenMatrix(sqmx, col_items, col_items)
>>> print(sqMTX)
property colid2item
concatenate(targetmx, axis=0)

Concatenate target matrix with self.

Parameters
  • targetmx (TypeTokenMatrix) – Target matrix

  • axis (int) – Axis of concatenation. If axis = 0, concatenate the targetmx as the new rows of self matrix. If axis = 1, concatenate the targetmx as the new columns of self matrix.

copy()
count_nonzero(axis=0)

Count the number of nonzero values for each row or each column.

property dataframe
deepcopy()
describe()

Generates descriptive information of the matrix TODO: improve

drop(axis=0, n_nonzero=0, **kwargs)

Drop rows (or columns, depending on axis) that have n_nonzero or fewer nonzero values.

Parameters
  • axis (int) – If axis is 0, drop rows that satisfy the given criteria. If axis is 1, drop columns that satisfy the given criteria.

  • n_nonzero (int) – The number of nonzero values in each row. If n_nonzero is 0, drop all empty rows. If n_nonzero is 1, drop all rows that only have 1 nonzero value or less. …

Returns

Dropped matrix

Return type

TypeTokenMatrix
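The dropping criterion can be sketched with numpy boolean masking. This is an illustration of the row-count logic only; the method itself operates on and returns a TypeTokenMatrix.

```python
import numpy as np

mx = np.array([[0, 0, 0],
               [1, 0, 0],
               [1, 2, 0]])
n_nonzero = 1
# keep rows with strictly more than n_nonzero nonzero values (axis=0 case)
keep = (mx != 0).sum(axis=1) > n_nonzero
dropped = mx[keep]
print(dropped)
```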

drop_empty(axis=0, explicit=False)

Drop empty rows and return a dropped matrix.

Parameters
  • axis (int) – 0 or 1

  • explicit (bool) – If True, rows that only have explicit zeros are also empty. Else, only rows that have no stored values are empty.

drop_empty_rows(explicit=False)

Drop empty rows and return a dropped matrix.

Parameters

explicit (bool) – If True, rows that only have explicit zeros are also empty. Else, only rows that have no stored values are empty.

drop_zero_rows()

Drop rows with only zero values and return the dropped matrix

empty_rows(explicit=False)

Show types/tokens which have empty rows. If this matrix is a token-context weight matrix, the row of a token would have all zero ppmi values.

Parameters

explicit (bool) – If True, rows that only have explicit zeros are also empty. Else, rows that have no stored values are empty.

Returns

Return type

a list of row indices

equal(othermx)
classmethod from_dataframe(df, issparse=True)
get_colloc_contexts(item)

Get collocate context features.

Parameters

item

get_matrix()

Get matrix.

Returns

Return type

numpy.ndarray or scipy.sparse.csr_matrix

Raises

NotImplementedError

property item2colid

Return a dict mapping from (column) items to their corresponding indices. If self._item2colid has already been generated, just return it; else, create the self._item2colid dict and return it.

Returns

self._item2colid

Return type

dict

property item2rowid

Return a dict mapping from (row) items to their corresponding indices. If self._item2rowid has already been generated, just return it; else, create the self._item2rowid dict and return it.

Returns

self._item2rowid

Return type

dict

classmethod load(filename, encoding='utf-8', pack=True)
Parameters
  • filename (".../xx.wcmx.freq.pac") –

  • encoding (str) – Default ‘utf-8’.

  • pack (bool) – Indicates whether the file is packed or not.

Returns

Return type

meta data and matrix

merge(targetmx)

Merge two TypeTokenMatrix objects.

Parameters

targetmx (TypeTokenMatrix) – Target matrix

property meta_data

Meta data of the matrix.

most_similar(item, k=10, descending=False)

Get most similar items of the target item.

Parameters
  • item (str) – Row item (word)

  • k (int) – Number of returned similar items.

  • descending (bool) – If descending is True, sort the elements in descending order of the values; else, sort in ascending order. The values would be distance or similarity: for a similarity matrix, set descending to True, as we want the elements with the largest values (similarities); for a distance matrix, set descending to False; for a similarity rank matrix, same as a similarity matrix.

Returns

Return type

a list of elements
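The ordering logic can be sketched with numpy argsort (illustration only; for a distance row, ascending order puts the most similar items first, while descending=True fits a similarity matrix):

```python
import numpy as np

items = ['apple', 'pear', 'car', 'truck']
dist_row = np.array([0.0, 0.2, 0.9, 0.8])   # distances from 'apple'
k, descending = 2, False
order = np.argsort(dist_row)
if descending:                # similarity matrix: largest values first
    order = order[::-1]
top = [items[i] for i in order[:k]]
print(top)
```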

multiply(other)
print_matrix(n_rows=7, n_cols=7)

Prints n_rows rows and n_cols columns of the matrix. If either is set to None, prints the default amount.

Parameters
  • n_rows

  • n_cols

classmethod read_csv(filename, sep='\t', index_col=None, header='infer', issparse=False, encoding='utf-8')

Read a comma(tab)-separated values (csv/tsv) file.

Parameters
  • filename (str) – Filename of the csv file.

  • sep (str, default '\t') – Field delimiter to use.

  • header (int or list of ints, default 'infer') – Row number(s) to use as the column names, and the start of the data.

  • index_col (int or sequence or False, default None) – Column to use as the row labels of the DataFrame.

  • issparse (bool) – True for sparse matrix (i.e. frequency matrix). False for dense matrix (i.e. distance matrix).

  • encoding (str, default 'utf-8') – Encoding to use for UTF when reading/writing (ex. ‘utf-8’).

reorder(item_list, axis=0)

Reorder the matrix based on a new item list.

Parameters
  • item_list (list of str) – A list of (string) items.

  • axis (int, optional) – 0 for row, 1 for column

property rowid2item
sample(percent=0.1, seed=-1, replace=False)

Sample the matrix based on row

Parameters
  • percent (float) – percentage of row dimension

  • seed (int, default -1) – Random seed for sampling. When the seed is set to a non-default value (non-negative), the method uses this seed for the numpy random sampling operation.

save(filename, encoding='utf-8', pack=True, verbose=True)
property shape
spmatrix_to_dict()

Only for sparse matrix

submatrix(row=None, col=None)

Select a submatrix. If self is a sparse matrix (i.e. word-context frequency matrix), you can either specify only row or only col or both. If self is a square matrix (i.e. word-word distance matrix), normally you should select a square submatrix, therefore specify both row and col with the same list.

Parameters
  • row (iterable (list of str)) – Only support a list of str

  • col (iterable (list of str)) – Only support a list of str

Returns

submatrix

Return type

TypeTokenMatrix

sum(axis=None)

Sum the matrix over the given axis. If the axis is None, sum over both rows and columns, returning a scalar.

axis = 1:

    | 1, 0, 2 |  ->  [3]
    | 0, 3, 4 |  ->  [7]
    -----------
    [1, 3, 6]    <-  axis = 0

These sums are typically the marginal frequencies of a co-occurrence contingency table:

                   Collocate present   Collocate absent   Totals
    Node present   c_a_b               c_a_nb             R1
    Node absent    c_na_b              c_na_nb            R2
    Totals         C1                  C2                 N

Parameters

axis (int) – If axis == 1, sum over rows. If axis == 0, sum over columns. If axis is None, return the total sum value (scalar).

Returns

A python dict with row/column items as keys and sum of that row/column as values.

Return type

dict
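The marginal sums in the table above can be computed as a plain numpy sketch of the same operation (the method itself returns a dict keyed by row/column items):

```python
import numpy as np

mx = np.array([[1, 0, 2],
               [0, 3, 4]])
row_sums = mx.sum(axis=1)   # axis=1: one value per row -> R1, R2
col_sums = mx.sum(axis=0)   # axis=0: one value per column -> C1, C2, ...
total = mx.sum()            # axis=None: scalar N
print(row_sums, col_sums, total)
```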

to_csv(filename, sep='\t', index=True, header=True, encoding='utf-8', verbose=True)

Write DataFrame to a comma-separated values (csv) file.

Parameters
  • filename (str) –

  • sep (character, default '\t') – Field delimiter for the output file.

  • index (boolean, default True) – Write row names (index).

  • header (boolean, default True) – Write out the column names.

  • encoding (string, optional) – A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3.

  • verbose (boolean) –

todense()
transpose()

Computes the transpose of a matrix.

Returns

The transpose.

Return type

numpy.ndarray

nephosem.core.terms module

class nephosem.core.terms.CorpusFormatter(settings)

Bases: object

connector = '/'
get(match, column, fid=None, lid=None)

Get the content of the corresponding column from a corpus line.

get_colloc(match)

Get token string from match object.

get_token(match, fid, lid)

Get token string from match object.

get_type(match)

Get type string from match object.

left_bound_machine(line)
match_line(line, form=None)
right_bound_machine(line)
separator_line_machine(line)
single_bound_machine(line)
class nephosem.core.terms.Getter(settings)

Bases: object

static get_func(get_form_string)
Parameters

get_form_string (str) –

get_item(match, form)
get_token(match, fid, lid)
get_type(match)
init_machine(line)
left_bound_machine(line)
property lemma
property pos
right_bound_machine(line)
separator_line_machine(line)
single_bound_machine(line)
token_line_machine(line, fid, lid)
property word
word_line_machine(line)
class nephosem.core.terms.ItemNode(match=None, formatter=None, word=None, lemma=None, pos=None, type_fmt=None, colloc_fmt=None, **kwargs)

Bases: object

This class represents an item node parsed by line-machine regular expression. The parsed item node consists of ‘word’, ‘lemma’ and ‘pos’ (if a file line has them).

connector = '/'
to_colloc(colloc_fmt=None)

Get a collocate string based on the item node.

Parameters

colloc_fmt (str) – colloc format string, i.e. ‘lemma/pos’

to_type(type_fmt=None)

Get a type string based on the item node.

Parameters

type_fmt (str) – type format string, i.e. ‘lemma/pos’

class nephosem.core.terms.TokenNode(token_str=None, token_fmt=None, match=None, formatter=None, fid='unknown', lid='-1', word=None, pos=None, lemma=None, lcollocs=None, rcollocs=None, **kwargs)

Bases: nephosem.core.terms.ItemNode

A TokenNode normally has its left and right context/collocate ItemNodes.

connector

For connecting ‘word’, ‘lemma’, ‘pos’, ‘fid’ and ‘lid’. default ‘/’

Type

str

lcollocs
Type

a list of left collocates

rcollocs

each collocate is an ItemNode object

Type

a list of right collocates

connector = '/'
classmethod gen_token_from_json_data(json_data)
property json_data
property lspan
property rspan
property token
class nephosem.core.terms.TypeNode(match=None, formatter=None, type_fmt=None, type_str=None, word=None, lemma=None, pos=None, tokens=None, **kwargs)

Bases: nephosem.core.terms.ItemNode

This Class represents a type node which in the token level contains all its token appearances.

The following are some important attributes.

lemma
Type

str

pos
Type

str

tokens

an appearance includes the token and its collocate types

Type

a list of appearances / tokens

append_token(token)
property collocs
connector = '/'
property freq

frequency of the type

get_collocs()

Get the collocates of all tokens

classmethod load(filename, encoding='utf-8')
classmethod merge(tns)

Merge a list of TypeNode instances into one

Parameters

tns (a list of TypeNode instances) –

sample(n=300, method='random')

Select n tokens/appearances from all.

Parameters
  • n (int) – default is 300

  • method (str) – ‘random’, …

Returns

Return type

A new TypeNode object

save(filename, fmt='json', encoding='utf-8', verbose=True)
property type
class nephosem.core.terms.Window(lspan=10, rspan=10)

Bases: object

left_span

left span, window size of left collocates

Type

int

right_span

right span, window size of right collocates

Type

int

left

left window

Type

deque

right

right window

Type

deque

node

center node

static init_span(size)
update(cur)

Current window: [l1, …] [node] [r1, …]. After update(cur) it becomes [l2, …, node] [r1] [r2, …, cur]: the node shifts into the left window, the first right collocate becomes the new node, and cur is appended to the right window.

Parameters

cur
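The update step can be sketched with collections.deque. This is an illustration of the mechanism under the assumption of fixed-size left/right windows, not the actual Window class.

```python
from collections import deque

lspan, rspan = 2, 2
left = deque(maxlen=lspan)      # old items fall off the left automatically
right = deque(['r1', 'r2'], maxlen=rspan)
node = 'node'

def update(cur):
    """Shift the window one position to the right."""
    global node
    left.append(node)           # node moves into the left window
    node = right.popleft()      # first right collocate becomes the node
    right.append(cur)           # cur enters the right window

update('r3')
print(list(left), node, list(right))
```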

nephosem.core.vocab module

Vocabulary Class

Usage examples

Initialize a vocabulary with a Python dict e.g.

>>> from nephosem.tests.utils import common_texts
>>> from nephosem import Vocab
>>>
>>>
class nephosem.core.vocab.Vocab(data=None, encoding='utf-8')

Bases: object

copy()

Just to have a better name for deepcopy().

property dataframe

Generate dataframe dynamically every time it is called. Sort items first by frequency (descending) and then by alphabetic ascending order.

deepcopy()
describe()

Give a description of Vocab.

equal(vocab2)

Check whether two vocabularies are equal.

get_dict()
get_item_list(sorting='alpha', descending=False)

Get a sorted list of items based on a sorting order. Calls utils.sort_dict().

Parameters
  • sorting (str) – ‘freq’ for frequency order, ‘alpha’ for alphabetic order.

  • descending (bool) – If True, sort dict by descending order of ‘sorting’. Else, sort dict by ascending order of ‘sorting’.

Returns

sorted list of items in the vocabulary

Return type

list
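The two sorting orders can be sketched on a plain frequency dict (an illustration of the orderings, not the method or utils.sort_dict() itself; the frequency/alphabetic tie-break follows the dataframe property's description):

```python
freqs = {'pear': 3, 'apple': 3, 'car': 7}
# alphabetic ascending ('alpha', descending=False)
alpha = sorted(freqs)
# frequency descending, ties broken by alphabetic ascending order
by_freq = sorted(freqs, key=lambda it: (-freqs[it], it))
print(alpha, by_freq)
```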

increment(key, inc=1)

Increment the value of a key by ‘inc’.

isEmpty()
items()

Same as Python dict.items()

keys()
classmethod load(filename, encoding='utf-8', fmt='json')

Load vocabulary (frequency list) from file. The default file format to load the vocabulary is ‘json’.

Parameters
  • filename (str) –

  • encoding (str) – ‘utf-8’, ‘latin-1’, …

  • fmt (str) – ‘json’, ‘plain’, ‘txt’ (same as plain)

Returns

Return type

Vocab

make_type_file(type_list, out_fname, encoding='utf-8')

This method could be used in the token-level workflow for generating a typeSelection file.

Parameters
  • type_list (a list of types) –

  • out_fname (output file name) –

  • encoding

match(column_name='item', pattern='.')

Match items by a given regular expression pattern.

Parameters
  • pattern (str) – Regular expression pattern

  • column_name (str) – ‘item’ or ‘freq’, normally only use ‘item’

Returns

Return type

list

static regex_item(item, pattern)

Match an item by a given regular expression pattern.

save(filename, encoding=None, fmt='json', verbose=True)

Save vocabulary to file.

Parameters
  • filename (str) –

  • encoding (str) – Encoding format: ‘utf-8’, ‘latin-1’ … If not provided, use encoding of Vocab.

  • fmt (str) –

    File format: ‘json’, ‘plain’. The default file format is ‘json’. The ‘plain’ format would save frequency dict in the following format:

    type-string[TAB]frequency

    One type per line.

  • verbose (bool) – Show information or not.

select_items(word)

This method takes a word (or lemma) as input and returns a Vocab object. Whether the provided word matches the items in the vocab depends on the type format of the items: if the item is ‘lemma/pos’, a ‘lemma’ string should be provided; if it is ‘word/pos’, a ‘word’ string should be provided.

select_subsets(specif_words, n=300, method='random', indent='')

Select subsets of n appearances. Here we select, for each word, which n items (appearances) will be retrieved from the corpus.

Parameters
  • specif_words – a list of (specified) words

  • n – number of selected appearances. If n > the frequency of an item, select all appearances; else, randomly (by default) select n appearances.

  • method – selecting methods: ‘random’, …

  • indent – indentation

setFILTER(value)
subvocab(items)

Select a sub vocab by a list of items. If an item is not in the vocab, its frequency is zero.

sum()

Get total sum of all frequencies. Just a slightly better method name.

sum_freq()

Get total sum of all frequencies.

values()

Module contents