nephosem package
Subpackages
Submodules
nephosem.conf module
- class nephosem.conf.ConfigLoader(filename=None)
Bases:
object
Examples
>>> from nephosem.conf import ConfigLoader
>>> conf = ConfigLoader()  # will read default settings
>>> new_config_fname = "/path/of/new/config/file"
>>> settings = conf.update_config(new_config_fname)  # -> new settings based on your config file and default settings
- classmethod load_config(config_file)
Read settings in a config file.
- classmethod read_params(config, opt, sect, sett)
- property settings
- update_config(config_file)
Update settings based on a config file.
nephosem.logging module
- class nephosem.logging.DefaultFormatter
Bases:
logging.Formatter
This class inherits logging.Formatter and is used for setting the format of a handler.
- dbg_fmt = 'DEBUG: %(module)s: %(lineno)d: %(msg)s'
- err_fmt = 'ERROR: %(msg)s'
- format(record)
Format the specified record as text.
The record’s attribute dictionary is used as the operand to a string formatting operation which yields the returned string. Before formatting the dictionary, a couple of preparatory steps are carried out. The message attribute of the record is computed using LogRecord.getMessage(). If the formatting string uses the time (as determined by a call to usesTime()), formatTime() is called to format the event time. If there is exception information, it is formatted using formatException() and appended to the message.
- info_fmt = '%(msg)s'
- wrn_fmt = 'WARNING: %(msg)s'
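The four format attributes suggest that the formatter picks a format string according to the level of each record. The following is a minimal sketch of that idea using only the standard library. The class name LevelFormatter, the level-to-format mapping, and the fallback to info_fmt for other levels are assumptions for illustration, not the nephosem implementation.

```python
import logging


class LevelFormatter(logging.Formatter):
    """Hypothetical stand-in for DefaultFormatter (not the nephosem source)."""

    dbg_fmt = "DEBUG: %(module)s: %(lineno)d: %(msg)s"
    info_fmt = "%(msg)s"
    wrn_fmt = "WARNING: %(msg)s"
    err_fmt = "ERROR: %(msg)s"

    def __init__(self):
        super().__init__(self.info_fmt)
        # One pre-built formatter per level (assumed mapping).
        self._by_level = {
            logging.DEBUG: logging.Formatter(self.dbg_fmt),
            logging.INFO: logging.Formatter(self.info_fmt),
            logging.WARNING: logging.Formatter(self.wrn_fmt),
            logging.ERROR: logging.Formatter(self.err_fmt),
        }

    def format(self, record):
        formatter = self._by_level.get(record.levelno)
        if formatter is None:
            # Levels without a dedicated format fall back to info_fmt.
            return super().format(record)
        return formatter.format(record)


# Build a record by hand so the example does not depend on logger configuration.
record = logging.LogRecord(
    "demo", logging.WARNING, "demo.py", 10, "disk almost full", None, None
)
formatted = LevelFormatter().format(record)
```

To use such a formatter in practice, attach it to a handler with handler.setFormatter(LevelFormatter()).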
nephosem.utils module
- class nephosem.utils.SaveLoad
Bases:
object
Serialize/deserialize object from disk, by equipping objects with the save()/load() methods.
Warning
This uses pickle internally (among other techniques), so objects must not contain unpicklable attributes such as lambda functions etc.
- classmethod load(fname, mmap=None)
Load an object previously saved using save() from a file.
- Parameters
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.
See also
save()
Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
- save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)
Save the object to a file.
- Parameters
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
load()
Load object from file.
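The save()/load() pairing can be illustrated with a much smaller mixin. The sketch below is a simplified, hypothetical version: it writes a single pickle file, supports only the ignore and pickle_protocol options, and does no separate array storage, memory-mapping, or compression handling. The names PickleSaveLoad and FreqTable are invented for the example.

```python
import os
import pickle
import tempfile


class PickleSaveLoad:
    """Hypothetical, simplified mixin: one pickle file, no separate arrays."""

    def save(self, fname, ignore=frozenset(), pickle_protocol=2):
        # Persist only the instance attributes that are not listed in `ignore`.
        state = {k: v for k, v in self.__dict__.items() if k not in ignore}
        with open(fname, "wb") as fout:
            pickle.dump(state, fout, protocol=pickle_protocol)

    @classmethod
    def load(cls, fname):
        with open(fname, "rb") as fin:
            state = pickle.load(fin)
        # Rebuild the object without calling __init__, then restore attributes.
        obj = cls.__new__(cls)
        obj.__dict__.update(state)
        return obj


class FreqTable(PickleSaveLoad):
    def __init__(self, counts):
        self.counts = counts
        self.cache = {}  # scratch data that should not be persisted


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "freq.pkl")
    FreqTable({"appel/noun": 3}).save(path, ignore=frozenset({"cache"}))
    restored = FreqTable.load(path)
```

Because load() is a class method, it is called on the class (FreqTable.load(path)) and returns a new instance.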
- nephosem.utils.clean_dir(dirname)
Clean directory. If there are files in this directory, remove them all.
- nephosem.utils.count_values(d)
Count the number of values in the nested dict.
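One plausible reading of "number of values in the nested dict" is the number of non-dict leaf values at any depth. The sketch below implements that reading. The function name count_leaf_values is hypothetical, and whether count_values() counts leaves in exactly this way is an assumption.

```python
def count_leaf_values(d):
    """Count non-dict values at any depth of a nested dict (assumed semantics)."""
    total = 0
    for value in d.values():
        if isinstance(value, dict):
            # Descend into nested dicts instead of counting them as values.
            total += count_leaf_values(value)
        else:
            total += 1
    return total


nested = {
    "appel/noun": {"rood/adj": 2, "eten/verb": 5},
    "peer/noun": {"rijp/adj": 1},
}
n_values = count_leaf_values(nested)
```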
- nephosem.utils.get_word_str(wquery, specific=False, corpus_name='', def_key=None)
- Parameters
wquery – can be a string or a dictionary. The result of get_word_str() is always a string.
specific – refers to whether or not a corpus specific word representation is requested.
corpus_name – only used if ‘specific=True’
def_key – settings[‘wqueries-default-key’]
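A rough sketch of how these parameters might interact is given below. The function name resolve_word_query is hypothetical, and falling back to the default key when the corpus name is missing from the dictionary is an assumption, not documented behaviour.

```python
def resolve_word_query(wquery, specific=False, corpus_name="", def_key="_DEFAULT_"):
    """Return a plain string for a word query (hypothetical sketch)."""
    if isinstance(wquery, str):
        # A string query is the same in every corpus.
        return wquery
    if specific and corpus_name in wquery:
        # Corpus-specific representation requested and available.
        return wquery[corpus_name]
    # Assumed fallback: use the default representation.
    return wquery[def_key]


query = {
    "LeNC": "rustoord/noun",
    "TwNC": "rust_oord/noun",
    "_DEFAULT_": "rustoord/noun",
}
twnc_form = resolve_word_query(query, specific=True, corpus_name="TwNC")
default_form = resolve_word_query(query)
```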
- nephosem.utils.is_string(text)
Check if the passed text is an instance of str / unicode (Python2) or unicode / bytes (Python3)
- nephosem.utils.load_dict_json(filename, encoding='utf-8')
Load dict from json file.
- nephosem.utils.load_dict_plain(filename, encoding='utf-8')
Load dict from plain txt file. One word per line (word[TAB]freq).
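The word[TAB]freq layout can be read with a few lines of standard-library Python. The sketch below uses a hypothetical function name, load_freq_plain, and assumes that frequencies are integers and that empty lines are skipped.

```python
import os
import tempfile


def load_freq_plain(filename, encoding="utf-8"):
    """Read 'word<TAB>freq' lines into a dict (assumes integer frequencies)."""
    freq = {}
    with open(filename, encoding=encoding) as fin:
        for line in fin:
            line = line.rstrip("\r\n")
            if not line:
                continue  # assumption: blank lines carry no entry
            word, count = line.split("\t")
            freq[word] = int(count)
    return freq


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "freq.txt")
    with open(path, "w", encoding="utf-8") as fout:
        fout.write("appel/noun\t12\npeer/noun\t7\n")
    freqs = load_freq_plain(path)
```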
- nephosem.utils.make_dir(dirname)
Create the directory if it does not exist.
- nephosem.utils.pickle(obj, fname, protocol=2)
Pickle object obj to file fname, using smart_open so that fname can be on S3, HDFS, compressed etc.
- Parameters
obj (object) – Any python object.
fname (str) – Path to pickle file.
protocol (int, optional) – Pickle protocol number. Default is 2 in order to support compatibility across python 2.x and 3.x.
- nephosem.utils.read_fnames(filename, dirname='', encoding='utf-8')
Read all filenames listed in the provided file. It assumes ‘filename’ is a string.
- Parameters
filename – Name of the file that records the corpus filenames.
dirname – Corpus directory.
encoding –
- Returns
A list of the filenames recorded in this file.
- Return type
list
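A minimal sketch of reading such a file list follows. It assumes one filename per line and that each name is joined onto dirname; neither detail is confirmed by the documentation, and the function name read_file_list is hypothetical.

```python
import os
import tempfile


def read_file_list(filename, dirname="", encoding="utf-8"):
    """Return the filenames listed in `filename`, prefixed with `dirname`."""
    with open(filename, encoding=encoding) as fin:
        # Assumption: one filename per line, blank lines ignored.
        names = [line.strip() for line in fin if line.strip()]
    return [os.path.join(dirname, name) for name in names]


with tempfile.TemporaryDirectory() as tmp:
    list_path = os.path.join(tmp, "fnames.txt")
    with open(list_path, "w", encoding="utf-8") as fout:
        fout.write("doc1.conll\n\ndoc2.conll\n")
    fnames = read_file_list(list_path, dirname="corpus")
    bare_fnames = read_file_list(list_path)
```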
- nephosem.utils.read_fnames_of_corpus(corpus_path)
Read all file names of all files in corpus_path folder.
- Parameters
corpus_path (str) – The corpus path where all corpus files are located.
- nephosem.utils.read_word_queries(fname, encoding='utf-8', wquery_default_key='_DEFAULT_')
Reads a list of ‘word queries’ from the file ‘fname’. These ‘word queries’ are used in other functions as search terms for retrieving tokens in the corpora.
The function read_word_queries() assumes that ‘fname’ is a string that contains a file name.
The expected file format for that file is as follows:
lines that start with # are ignored
other lines, if they don’t contain tabs, are assumed to contain a single word representation
- e.g.:
appel/noun
such single word representations are represented as a string in the output of this function.
other lines, if they do contain tabs, are assumed to have the following format:
corpusname:wordstr TAB corpusname:wordstr TAB etc.
- e.g.:
LeNC:rustoord/noun TAB TwNC:rust_oord/noun
This example indicates that what we consider to be one and the same word in our study has the format “rustoord/noun” in the corpus LeNC and the format “rust_oord/noun” in the corpus TwNC. Such information will be represented, in the output of read_word_queries(), in a dictionary:
- e.g.:
- {"LeNC": "rustoord/noun", "TwNC": "rust_oord/noun", "_DEFAULT_": "rustoord/noun"}
In this example, it is assumed that the value of ‘settings[“wqueries-default-key”]’ is “_DEFAULT_”. There always is a key ‘settings[“wqueries-default-key”]’ in these output dictionaries. If the input line does not explicitly contain a corpus name equal to ‘settings[“wqueries-default-key”]’, then the first value/word in the input line is also used as the value/word for ‘settings[“wqueries-default-key”]’.
[format of result] - The result is a list of items, with each of these items either being a string (which indicates that the same search term can be used in all corpora) or a dictionary that maps corpus names onto queries/words and that maps ‘settings[“wqueries-default-key”]’ onto the default query/word.
- Parameters
fname –
encoding (str) – default ‘utf-8’
- Returns
List of word queries, each either a string or a dictionary as described above.
- Return type
list of str or dict
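The file format described above can be parsed as in the following sketch. It works on an iterable of lines rather than a file name, the function name parse_word_queries is hypothetical, and skipping blank lines is an assumption. It also assumes well-formed corpusname:wordstr fields.

```python
def parse_word_queries(lines, default_key="_DEFAULT_"):
    """Parse word-query lines into a list of strings and dicts (sketch)."""
    queries = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue  # comment lines (and, by assumption, blank lines) are ignored
        if "\t" not in line:
            # A single word representation, valid for all corpora.
            queries.append(line.strip())
            continue
        entry = {}
        first_word = None
        for field in line.split("\t"):
            # Split "corpusname:wordstr" at the first colon only.
            corpus, _, word = field.strip().partition(":")
            entry[corpus] = word
            if first_word is None:
                first_word = word
        # The first word doubles as the default unless one is given explicitly.
        entry.setdefault(default_key, first_word)
        queries.append(entry)
    return queries


sample = [
    "# fruit and rest homes",
    "appel/noun",
    "LeNC:rustoord/noun\tTwNC:rust_oord/noun",
]
parsed = parse_word_queries(sample)
```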
- nephosem.utils.save_concordance(fname, typenodes, colloc_fmt='lemma', encoding='utf-8')
Write out a concordance for types/concepts.
- Parameters
fname (str) – filename to save
colloc_fmt (str) – Options: ‘lemma’, ‘word’, ‘lemma/pos’, …
encoding (str) – default ‘utf-8’
- nephosem.utils.save_dict_json(freq_dict, filename, encoding='utf-8')
Save frequency dict to json file.
- nephosem.utils.save_dict_plain(freq_dict, filename, encoding='utf-8', order='freq')
Save frequency dict to plain txt file.
- Parameters
freq_dict (dict) –
filename (str) –
encoding (str) – ‘utf-8’, ‘latin-1’, …
order (str) – ‘freq’, ‘alpha’
- nephosem.utils.sizeof(filename, suffix='B')
- nephosem.utils.sizeof_fmt(num, suffix='B')
- nephosem.utils.sort_dict(freq_dict, sorting='freq', descending=True)
Sort a dict by order. Normally if the ‘sorting’ is ‘freq’, sort the dict first by frequency descending order, then by alphabetic ascending order. If the ‘sorting’ is ‘alpha’, sort the dict by alphabetic ascending order.
- Parameters
freq_dict (dict) – Python dict of item to frequency pair.
sorting (str) – ‘freq’ for frequency order, ‘alpha’ for alphabetic order.
descending (bool) – If True, sort dict by descending order of ‘sorting’. Else, sort dict by ascending order of ‘sorting’.
- Returns
sorted_keys, a list of sorted keys of the dict
- Return type
list
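The two sort orders can be expressed with a composite sort key, as in the sketch below. The function name sort_keys is hypothetical. How descending interacts with ‘alpha’ is not fully specified, so this sketch always sorts alphabetically ascending in that mode, which is an assumption.

```python
def sort_keys(freq_dict, sorting="freq", descending=True):
    """Return the keys of freq_dict in the requested order (sketch)."""
    if sorting == "alpha":
        # Assumption: alphabetic order is always ascending here.
        return sorted(freq_dict)
    if sorting != "freq":
        raise ValueError("sorting must be 'freq' or 'alpha'")
    sign = -1 if descending else 1
    # Primary key: frequency (negated for descending); tie-break: the key itself.
    return sorted(freq_dict, key=lambda word: (sign * freq_dict[word], word))


freqs = {"peer/noun": 2, "appel/noun": 2, "kers/noun": 5}
by_freq = sort_keys(freqs)
by_alpha = sort_keys(freqs, sorting="alpha")
```

Ties in frequency are broken alphabetically, so the result is deterministic regardless of the insertion order of the dict.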
- nephosem.utils.timeit(fn)
- nephosem.utils.unpickle(fname)
Load object from fname, using smart_open so that fname can be on S3, HDFS, compressed etc.
- Parameters
fname (str) – Path to pickle file.
- Returns
Python object loaded from fname.
- Return type
object