{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# All about matrices" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the most common classes you will deal with when using the `nephosem` module (and by extension `semasioFlow`) is `TypeTokenMatrix`, which covers both type-level and token-level matrices, whether they hold raw co-occurrences, association values or even square distance/similarity values. Here you can learn a bit about what you can do with them." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import sys\n", "nephosemdir = \"../../nephosem/\"\n", "sys.path.append(nephosemdir)\n", "mydir = \"./\"\n", "from nephosem import ConfigLoader, TypeTokenMatrix\n", "conf = ConfigLoader()\n", "settings = conf.update_config('config.ini')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Matrices and files\n", "\n", "Like `Vocab` objects (see [here](vocab.ipynb)), `TypeTokenMatrix` objects have `load()` and `save()` methods:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[55, 55] 's/P ,/, a/D about/I about/R all/P an/D ...\n", "'s/P NaN NaN NaN NaN NaN NaN NaN ...\n", ",/, NaN 2 2 NaN NaN NaN NaN ...\n", "a/D NaN 2 NaN NaN NaN NaN NaN ...\n", "about/I NaN NaN NaN NaN NaN NaN NaN ...\n", "about/R NaN NaN NaN NaN NaN NaN NaN ...\n", "all/P NaN NaN NaN NaN NaN NaN NaN ...\n", "an/D NaN NaN NaN NaN NaN NaN NaN ...\n", "... ... ... ... ... ... ... ... ..." ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "filename = 'output/Toy.bow.wcmx.pac'\n", "mtx = TypeTokenMatrix.load(filename) # opens a matrix\n", "mtx\n", "# mtx.save(filename) # saves the matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Saving creates an `.npy` file and a `.meta` file and compresses them together in a `.pac` file (like `.zip`, basically). 
You can also store a matrix as comma-separated values with `mtx.to_csv(filename)`.\n", "Why would you do that? A _type_-level matrix can only be opened in R if it is stored as `.csv`.\n", "\n", "\n", "## Matrix components\n", "\n", "A `TypeTokenMatrix` has row names, column names and a matrix element (stored as a sparse matrix in the example below), all retrievable with the corresponding attributes:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<55x55 sparse matrix of type ''\n", "\twith 606 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mymatrix = mtx.matrix # returns the underlying matrix (sparse, as the output shows)\n", "mymatrix" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[\"'s/P\", ',/,', 'a/D', 'about/I', 'about/R']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rows = mtx.row_items # returns a list\n", "rows[:5]" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[\"'s/P\", ',/,', 'a/D', 'about/I', 'about/R']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "columns = mtx.col_items # returns a list\n", "columns[:5]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[55, 55] 's/P ,/, a/D about/I about/R all/P an/D ...\n", "'s/P NaN NaN NaN NaN NaN NaN NaN ...\n", ",/, NaN 2 2 NaN NaN NaN NaN ...\n", "a/D NaN 2 NaN NaN NaN NaN NaN ...\n", "about/I NaN NaN NaN NaN NaN NaN NaN ...\n", "about/R NaN NaN NaN NaN NaN NaN NaN ...\n", "all/P NaN NaN NaN NaN NaN NaN NaN ...\n", "an/D NaN NaN NaN NaN NaN NaN NaN ...\n", "... ... ... ... ... ... ... ... ..." 
] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mtx2 = TypeTokenMatrix(matrix = mymatrix, row_items = rows, col_items = columns) # creates a matrix again\n", "mtx2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition, the `sum()` method gives you marginal frequencies: `mtx.sum(axis=1)` for the rows, `mtx.sum(axis=0)` for the columns, and the grand total otherwise. You can use them, transformed to `Vocab` objects, as marginal frequencies for `compute_association()` (when [weighting matrices](all-in-one.ipynb#4.-Association-measures)):" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(mtx.sum(axis=1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Subsetting a matrix\n", "\n", "You can subset a matrix with the `submatrix()` method, specifying a list of rows and/or a list of columns. Non-existing items will simply be ignored." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "rows = ['girl/N', 'boy/N', 'apple/N']\n", "cols = ['give/V', 'eat/V', 'ask/V']" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[3, 55] 's/P ,/, a/D about/I about/R all/P an/D ...\n", "girl/N 1 1 5 1 1 NaN NaN ...\n", "boy/N NaN NaN 4 1 NaN 1 3 ...\n", "apple/N 1 1 4 1 1 NaN 3 ..." ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "subset_rows = mtx.submatrix(row = rows)\n", "subset_rows" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[55, 3] give/V eat/V ask/V\n", "'s/P NaN NaN NaN\n", ",/, NaN 1 NaN\n", "a/D 2 1 NaN\n", "about/I NaN 1 1\n", "about/R NaN 1 NaN\n", "all/P NaN NaN NaN\n", "an/D 1 2 NaN\n", "... ... ... ..." 
] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "subset_cols = mtx.submatrix(col = cols)\n", "subset_cols" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[3, 3] give/V eat/V ask/V\n", "girl/N 4 11 1\n", "boy/N 5 10 1\n", "apple/N 2 12 NaN" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "subset_both = mtx.submatrix(row = rows, col = cols)\n", "subset_both" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also easily drop all empty rows/columns with the `drop()` method. So, for example, if you have subsetted your matrix by rows and now some columns are empty in all remaining rows, you can clean them out like so:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1, 5] apple/N eat/V girl/N ten/J the/D\n", "about/R 1 1 1 1 1" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mtx.submatrix(row = ['about/R']).drop(axis = 1)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[43, 3] give/V eat/V ask/V\n", ",/, NaN 1 NaN\n", "a/D 2 1 NaN\n", "about/I NaN 1 1\n", "about/R NaN 1 NaN\n", "an/D 1 2 NaN\n", "and/C NaN 4 NaN\n", "apple/N 2 12 NaN\n", "... ... ... ..." ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "subset_cols.drop(axis = 0) #drops empty rows" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[13, 3] give/V eat/V ask/V\n", "a/D 2 1 NaN\n", "about/I NaN 1 1\n", "an/D 1 2 NaN\n", "apple/N 2 12 NaN\n", "be/V 2 6 NaN\n", "boy/N 5 10 1\n", "by/I 1 3 NaN\n", "... ... ... ..." 
] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "subset_cols.drop(axis = 0, n_nonzero = 1) # drops rows that have one non-zero value or fewer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you just want to obtain the value for a given row-column combination, you can simply subset with square brackets." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "10" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mtx['apple/N','girl/N']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Boolean filter\n", "\n", "You can use any boolean matrix with the same dimensions as a filter, for example to select cells with certain values. The code below returns a boolean matrix with `True` where the values are larger than 1 and `False` where they are not (cells that were `NaN` stay `NaN`)." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[55, 55] 's/P ,/, a/D about/I about/R all/P an/D ...\n", "'s/P NaN NaN NaN NaN NaN NaN NaN ...\n", ",/, NaN True True NaN NaN NaN NaN ...\n", "a/D NaN True NaN NaN NaN NaN NaN ...\n", "about/I NaN NaN NaN NaN NaN NaN NaN ...\n", "about/R NaN NaN NaN NaN NaN NaN NaN ...\n", "all/P NaN NaN NaN NaN NaN NaN NaN ...\n", "an/D NaN NaN NaN NaN NaN NaN NaN ...\n", "... ... ... ... ... ... ... ... ..." ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mtx > 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Such a matrix can be used to turn all the values that do not exceed that threshold into 0s with the `multiply()` method." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[35, 35] ,/, a/D an/D and/C apple/N ask/V at/I ...\n", ",/, 2 2 NaN NaN NaN NaN NaN ...\n", "a/D 2 NaN NaN 2 4 NaN 2 ...\n", "an/D NaN NaN NaN NaN 3 NaN NaN ...\n", "and/C NaN 2 NaN NaN 3 NaN NaN ...\n", "apple/N NaN 4 3 3 NaN NaN NaN ...\n", "ask/V NaN NaN NaN NaN NaN NaN NaN ...\n", "at/I NaN 2 NaN NaN NaN NaN NaN ...\n", "... ... ... ... ... ... ... ... ..." ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mtx.multiply(mtx > 1).drop(axis = 1).drop(axis = 0)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }