nephosem package

Submodules

nephosem.conf module

class nephosem.conf.ConfigLoader(filename=None)

Bases: object

Examples

>>> from nephosem.conf import ConfigLoader
>>> conf = ConfigLoader()  # will read default settings
>>> new_config_fname = "/path/of/new/config/file"
>>> settings = conf.update_config(new_config_fname)  # -> new settings based on your config file and default settings
classmethod load_config(config_file)

Read settings in a config file.

classmethod read_params(config, opt, sect, sett)
property settings
update_config(config_file)

Update settings based on a config file.

nephosem.logging module

class nephosem.logging.DefaultFormatter

Bases: logging.Formatter

This class inherits logging.Formatter and is used for setting the format of a handler.

dbg_fmt = 'DEBUG: %(module)s: %(lineno)d: %(msg)s'
err_fmt = 'ERROR: %(msg)s'
format(record)

Format the specified record as text.

The record’s attribute dictionary is used as the operand to a string formatting operation which yields the returned string. Before formatting the dictionary, a couple of preparatory steps are carried out. The message attribute of the record is computed using LogRecord.getMessage(). If the formatting string uses the time (as determined by a call to usesTime()), formatTime() is called to format the event time. If there is exception information, it is formatted using formatException() and appended to the message.

info_fmt = '%(msg)s'
wrn_fmt = 'WARNING: %(msg)s'
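
The four format attributes suggest that the formatter chooses a format string according to the level of each record. The following is a minimal sketch of that idea using only the standard library, not the nephosem implementation: the class name LevelFormatter and the helper format_message are hypothetical, and only the format strings are taken from the attributes listed above.

```python
import logging


class LevelFormatter(logging.Formatter):
    """Hypothetical stand-in that picks a format string per log level."""

    # Format strings copied from the attributes documented above.
    dbg_fmt = "DEBUG: %(module)s: %(lineno)d: %(msg)s"
    info_fmt = "%(msg)s"
    wrn_fmt = "WARNING: %(msg)s"
    err_fmt = "ERROR: %(msg)s"

    def __init__(self):
        super().__init__(self.info_fmt)
        # One delegate Formatter per level (assumption: other levels use info_fmt).
        self._by_level = {
            logging.DEBUG: logging.Formatter(self.dbg_fmt),
            logging.INFO: logging.Formatter(self.info_fmt),
            logging.WARNING: logging.Formatter(self.wrn_fmt),
            logging.ERROR: logging.Formatter(self.err_fmt),
        }

    def format(self, record):
        formatter = self._by_level.get(record.levelno)
        if formatter is None:
            return super().format(record)
        return formatter.format(record)


def format_message(level, message):
    """Build a LogRecord by hand and format it, without any handler."""
    record = logging.LogRecord(
        name="demo", level=level, pathname="demo.py", lineno=1,
        msg=message, args=None, exc_info=None,
    )
    return LevelFormatter().format(record)
```

In practice such a formatter would be attached to a handler with handler.setFormatter(LevelFormatter()); the helper above only makes the per-level output easy to inspect.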

nephosem.utils module

class nephosem.utils.SaveLoad

Bases: object

Serialize/deserialize object from disk, by equipping objects with the save()/load() methods.

Warning

This uses pickle internally (among other techniques), so objects must not contain unpicklable attributes such as lambda functions etc.

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)

Save the object to a file.

Parameters
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.

  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.

  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()

Load object from file.
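
The save()/load() pairing and the pickle warning can be illustrated with a simplified sketch. It is an assumption-laden stand-in, not the SaveLoad implementation: it pickles only the instance's attribute dictionary, takes a plain path, and does no separate array storage, memory-mapping, or compression handling. The names PickleSaveLoad, FrequencyList, roundtrip, and is_saveable are hypothetical.

```python
import os
import pickle
import tempfile


class PickleSaveLoad:
    """Hypothetical mixin: plain pickle only, no separate array files or mmap."""

    def save(self, fname, pickle_protocol=2):
        # Store the attribute dictionary of the instance.
        with open(fname, "wb") as handle:
            pickle.dump(self.__dict__, handle, protocol=pickle_protocol)

    @classmethod
    def load(cls, fname):
        # Rebuild an instance of cls without calling __init__.
        with open(fname, "rb") as handle:
            state = pickle.load(handle)
        obj = cls.__new__(cls)
        obj.__dict__.update(state)
        return obj


class FrequencyList(PickleSaveLoad):
    """Hypothetical example class holding picklable state."""

    def __init__(self, freqs):
        self.freqs = dict(freqs)


def roundtrip(obj):
    """Save obj to a temporary file and load it back through the class method."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "obj.pkl")
        obj.save(path)
        return type(obj).load(path)


def is_saveable(obj):
    """Return False if an attribute (for example a lambda) cannot be pickled."""
    try:
        pickle.dumps(obj.__dict__, protocol=2)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True
```

Note that load() is called on the class (FrequencyList.load(path)), which matches the class-method signature documented above. An object carrying a lambda attribute fails the is_saveable() check, which is the situation the warning describes.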

nephosem.utils.clean_dir(dirname)

Clean directory. If there are files in this directory, remove them all.

nephosem.utils.count_values(d)

Count the number of values in the nested dict.

nephosem.utils.get_word_str(wquery, specific=False, corpus_name='', def_key=None)
Parameters
  • wquery – A string or a dictionary. The result of get_word_str() is always a string.

  • specific – refers to whether or not a corpus specific word representation is requested.

  • corpus_name – only used if ‘specific=True’

  • def_key – settings[‘wqueries-default-key’]
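
As an illustration of how these parameters could interact, here is a hedged sketch under explicit assumptions; it is not the library code. The helper name resolve_word_query is hypothetical, and the fallback to the default key when the corpus has no entry of its own is an assumption that the documentation above does not confirm.

```python
def resolve_word_query(wquery, specific=False, corpus_name="", def_key="_DEFAULT_"):
    """Hypothetical helper mirroring the documented parameters."""
    # A plain string is valid for every corpus and is returned unchanged.
    if isinstance(wquery, str):
        return wquery
    # Corpus-specific form, only consulted when specific=True.
    if specific and corpus_name in wquery:
        return wquery[corpus_name]
    # Assumption: otherwise use the entry stored under the default key.
    return wquery[def_key]
```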

nephosem.utils.is_string(text)

Check if the passed text is an instance of str / unicode (Python 2) or str / bytes (Python 3).

nephosem.utils.load_dict_json(filename, encoding='utf-8')

Load dict from json file.

nephosem.utils.load_dict_plain(filename, encoding='utf-8')

Load a dict from a plain text file with one word per line (word[TAB]freq).
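
The word[TAB]freq layout can be parsed as in the sketch below. It is not the library function: parse_freq_lines is a hypothetical name, it works on any iterable of lines rather than a file name, and it assumes that frequencies are integers and that empty lines can be skipped.

```python
import io


def parse_freq_lines(lines):
    """Hypothetical parser for lines of the form word[TAB]freq."""
    freqs = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # assumption: empty lines carry no entry
        word, freq = line.split("\t")
        freqs[word] = int(freq)  # assumption: frequencies are integers
    return freqs


sample = io.StringIO("appel/noun\t3\npeer/noun\t1\n")
frequencies = parse_freq_lines(sample)
```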

nephosem.utils.make_dir(dirname)

Create the directory if it does not exist.

nephosem.utils.pickle(obj, fname, protocol=2)

Pickle object obj to file fname, using smart_open so that fname can be on S3, HDFS, compressed etc.

Parameters
  • obj (object) – Any python object.

  • fname (str) – Path to pickle file.

  • protocol (int, optional) – Pickle protocol number. Default is 2 in order to support compatibility across python 2.x and 3.x.

nephosem.utils.read_fnames(filename, dirname='', encoding='utf-8')

Read all file names listed in the provided file. Assumes ‘filename’ is a string.

Parameters
  • filename (str) – Path to the file that lists the corpus file names.

  • dirname (str) – Corpus directory.

  • encoding (str) – default ‘utf-8’

Returns

A list of the file names listed in this file.

Return type

list

nephosem.utils.read_fnames_of_corpus(corpus_path)

Read the names of all files in the corpus_path folder.

Parameters

corpus_path (str) – The corpus path where all corpus files are located.

nephosem.utils.read_word_queries(fname, encoding='utf-8', wquery_default_key='_DEFAULT_')

Reads a list of ‘word queries’ from the file ‘fname’. These ‘word queries’ are used in other functions as search terms for retrieving tokens in the corpora.

The function read_word_queries() assumes that ‘fname’ is a string that contains a file name.

The expected file format for that file is as follows:

  • lines that start with # are ignored

  • other lines, if they don’t contain tabs, are assumed to contain a single word representation

    e.g.:

    appel/noun

    such single word representations are represented as a string in the output of this function.

  • other lines, if they do contain tabs, are assumed to have the following format:

    corpusname:wordstr TAB corpusname:wordstr TAB etc.

    e.g.:

    LeNC:rustoord/noun TAB TwNC:rust_oord/noun

    This example indicates that what we consider to be one and the same word in our study has the format “rustoord/noun” in the corpus LeNC and the format “rust_oord/noun” in the corpus TwNC. Such information is represented in the output of read_word_queries() as a dictionary:

    e.g.:

    {"LeNC": "rustoord/noun", "TwNC": "rust_oord/noun", "_DEFAULT_": "rustoord/noun"}

    In this example, it is assumed that the value of ‘settings[“wqueries-default-key”]’ is “_DEFAULT_”. There always is a key ‘settings[“wqueries-default-key”]’ in these output dictionaries. If the input line does not explicitly contain a corpus name equal to ‘settings[“wqueries-default-key”]’, then the first value/word in the input line is also used as the value/word for ‘settings[“wqueries-default-key”]’.

[format of result] - The result is a list of items, with each of these items being either a string (which indicates that the same search term can be used in all corpora) or a dictionary that maps corpus names onto queries/words and that maps ‘settings[“wqueries-default-key”]’ onto the default query/word.

Parameters
  • fname (str) – Name of the file that contains the word queries.

  • encoding (str) – default ‘utf-8’

Returns

A list in which each item is either a string or a dict, as described above.

Return type

list of str or dict
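
The file format described above can be parsed as in the following sketch. It follows the documented rules (comment lines, single-word lines, tab-separated corpusname:wordstr fields, default key) but is not the library code: parse_word_queries is a hypothetical name, it takes an iterable of lines instead of a file name, and skipping blank lines is an assumption.

```python
def parse_word_queries(lines, default_key="_DEFAULT_"):
    """Hypothetical parser for the word-query format described above."""
    queries = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Lines starting with '#' are ignored (blank lines too, by assumption).
        if not line.strip() or line.startswith("#"):
            continue
        # No tab: a single word representation, kept as a string.
        if "\t" not in line:
            queries.append(line.strip())
            continue
        # Tabs: corpusname:wordstr fields collected into one dictionary.
        entry = {}
        first_word = None
        for field in line.split("\t"):
            corpus, _, word = field.strip().partition(":")
            entry[corpus] = word
            if first_word is None:
                first_word = word
        # The first word is the default unless the line names the default key.
        entry.setdefault(default_key, first_word)
        queries.append(entry)
    return queries


example = [
    "# comment line\n",
    "appel/noun\n",
    "LeNC:rustoord/noun\tTwNC:rust_oord/noun\n",
]
parsed = parse_word_queries(example)
```

Here parsed holds the string "appel/noun" followed by a dictionary with the keys "LeNC", "TwNC", and "_DEFAULT_", which matches the dictionary example given in the format description.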

nephosem.utils.save_concordance(fname, typenodes, colloc_fmt='lemma', encoding='utf-8')

Write out a concordance for types/concepts.

Parameters
  • fname (str) – filename to save

  • colloc_fmt (str) – Options: ‘lemma’, ‘word’, ‘lemma/pos’, …

  • encoding (str) – default ‘utf-8’

nephosem.utils.save_dict_json(freq_dict, filename, encoding='utf-8')

Save frequency dict to json file.

nephosem.utils.save_dict_plain(freq_dict, filename, encoding='utf-8', order='freq')

Save frequency dict to plain txt file.

Parameters
  • freq_dict (dict) –

  • filename (str) –

  • encoding (str) – ‘utf-8’, ‘latin-1’, …

  • order (str) – ‘freq’, ‘alpha’

nephosem.utils.sizeof(filename, suffix='B')
nephosem.utils.sizeof_fmt(num, suffix='B')
nephosem.utils.sort_dict(freq_dict, sorting='freq', descending=True)

Sort the keys of a dict. If ‘sorting’ is ‘freq’, the keys are sorted first by descending frequency, then in ascending alphabetical order. If ‘sorting’ is ‘alpha’, the keys are sorted in ascending alphabetical order.

Parameters
  • freq_dict (dict) – Python dict of item to frequency pair.

  • sorting (str) – ‘freq’ for frequency order, ‘alpha’ for alphabetic order.

  • descending (bool) – If True, sort dict by descending order of ‘sorting’. Else, sort dict by ascending order of ‘sorting’.

Returns

sorted_keys, a list of sorted keys of the dict

Return type

list
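
The two orderings can be sketched with the built-in sorted() and a composite key. This is an illustration rather than the library code: sorted_keys is a hypothetical name, and the sketch assumes that ‘descending’ only affects the frequency ordering while alphabetical ties stay ascending.

```python
def sorted_keys(freq_dict, sorting="freq", descending=True):
    """Hypothetical sketch of the two documented orderings."""
    if sorting == "alpha":
        # Alphabetical ascending order of the keys.
        return sorted(freq_dict)
    # Frequency order; ties are broken alphabetically (ascending).
    sign = -1 if descending else 1
    return sorted(freq_dict, key=lambda item: (sign * freq_dict[item], item))


counts = {"b": 2, "a": 2, "c": 5, "d": 1}
by_freq = sorted_keys(counts)
by_alpha = sorted_keys(counts, sorting="alpha")
```

Negating the frequency in the sort key keeps the alphabetical tie-break ascending, which a plain reverse=True would invert.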

nephosem.utils.timeit(fn)
nephosem.utils.unpickle(fname)

Load object from fname, using smart_open so that fname can be on S3, HDFS, compressed etc.

Parameters

fname (str) – Path to pickle file.

Returns

Python object loaded from fname.

Return type

object

Module contents