{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# All about matrices"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One of the most common classes you will deal with when using the `nephosem` module (and by extension `semasioFlow`) is `TypeTokenMatrix`, which covers both type-level and token-level matrices, either for raw co-occurrences, association matrices or even square distance/similarity matrices. Here you can learn a bit about what you can do with them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "nephosemdir = \"../../nephosem/\"\n",
    "sys.path.append(nephosemdir)\n",
    "mydir = f\"./\"\n",
    "from nephosem import ConfigLoader, TypeTokenMatrix\n",
    "conf = ConfigLoader()\n",
    "settings = conf.update_config('config.ini')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Matrices and files\n",
    "\n",
    "Like `Vocab` objects (see [here](vocab.ipynb)), `TypeTokenMatrix` objects have `load()` and `save()` methods:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[55, 55]  's/P  ,/,  a/D  about/I  about/R  all/P  an/D  ...\n",
       "'s/P      NaN   NaN  NaN  NaN      NaN      NaN    NaN   ...\n",
       ",/,       NaN   2    2    NaN      NaN      NaN    NaN   ...\n",
       "a/D       NaN   2    NaN  NaN      NaN      NaN    NaN   ...\n",
       "about/I   NaN   NaN  NaN  NaN      NaN      NaN    NaN   ...\n",
       "about/R   NaN   NaN  NaN  NaN      NaN      NaN    NaN   ...\n",
       "all/P     NaN   NaN  NaN  NaN      NaN      NaN    NaN   ...\n",
       "an/D      NaN   NaN  NaN  NaN      NaN      NaN    NaN   ...\n",
       "...       ...   ...  ...  ...      ...      ...    ...   ..."
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "filename = 'output/Toy.bow.wcmx.pac'\n",
    "mtx = TypeTokenMatrix.load(filename) # opens a matrix\n",
    "mtx\n",
    "# mtx.save(filename) # saves the matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This will create an `.npy` and a `.meta` files and compress them together in a `.pac` file (like `.zip`, basically). You can also store them as comma-separated values with `mtx.to_csv(filename)`.\n",
    "Why would you do that? If you want to open a _type_-level matrix in R, it doesn't work unless it's stored as `.csv`.\n",
    "\n",
    "\n",
    "## Matrix components\n",
    "\n",
    "A `TypeTokenMatrix` has row names, column names and a `numpy` matrix element, all retrievable with corresponding attributes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<55x55 sparse matrix of type '<class 'numpy.int64'>'\n",
       "\twith 606 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mymatrix = mtx.matrix # returns a numpy 2D array\n",
    "mymatrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[\"'s/P\", ',/,', 'a/D', 'about/I', 'about/R']"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rows = mtx.row_items # returns a list\n",
    "rows[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[\"'s/P\", ',/,', 'a/D', 'about/I', 'about/R']"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "columns = mtx.col_items # returns a list\n",
    "columns[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[55, 55]  's/P  ,/,  a/D  about/I  about/R  all/P  an/D  ...\n",
       "'s/P      NaN   NaN  NaN  NaN      NaN      NaN    NaN   ...\n",
       ",/,       NaN   2    2    NaN      NaN      NaN    NaN   ...\n",
       "a/D       NaN   2    NaN  NaN      NaN      NaN    NaN   ...\n",
       "about/I   NaN   NaN  NaN  NaN      NaN      NaN    NaN   ...\n",
       "about/R   NaN   NaN  NaN  NaN      NaN      NaN    NaN   ...\n",
       "all/P     NaN   NaN  NaN  NaN      NaN      NaN    NaN   ...\n",
       "an/D      NaN   NaN  NaN  NaN      NaN      NaN    NaN   ...\n",
       "...       ...   ...  ...  ...      ...      ...    ...   ..."
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mtx2 = TypeTokenMatrix(matrix = mymatrix, row_items = rows, col_items = columns) # creates a matrix again\n",
    "mtx2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In addition, the `sum()` method gives you marginal frequencies: of the rows for `mtx.sum(axis=1)`, of the columns for `mtx.sum(axis=2)` and the full sum otherwise. You can use them, transformed to `Vocab` objects, as marginal frequencies for `compute_association()` (when [weighting matrices](all-in-one.ipynb#4.-Association-measures)):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dict"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(mtx.sum(axis=1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Subsetting a matrix\n",
    "\n",
    "You can subset a matrix with the `submatrix()` method, specifying a list of rows and/or a list of columns. Non-existing items will simply be ignored."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "rows = ['girl/N', 'boy/N', 'apple/N']\n",
    "cols = ['give/V', 'eat/V', 'ask/V']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[3, 55]  's/P  ,/,  a/D  about/I  about/R  all/P  an/D  ...\n",
       "girl/N   1     1    5    1        1        NaN    NaN   ...\n",
       "boy/N    NaN   NaN  4    1        NaN      1      3     ...\n",
       "apple/N  1     1    4    1        1        NaN    3     ..."
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "subset_rows = mtx.submatrix(row = rows)\n",
    "subset_rows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[55, 3]  give/V  eat/V  ask/V\n",
       "'s/P     NaN     NaN    NaN\n",
       ",/,      NaN     1      NaN\n",
       "a/D      2       1      NaN\n",
       "about/I  NaN     1      1\n",
       "about/R  NaN     1      NaN\n",
       "all/P    NaN     NaN    NaN\n",
       "an/D     1       2      NaN\n",
       "...      ...     ...    ..."
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "subset_cols = mtx.submatrix(col = cols)\n",
    "subset_cols"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[3, 3]   give/V  eat/V  ask/V\n",
       "girl/N   4       11     1\n",
       "boy/N    5       10     1\n",
       "apple/N  2       12     NaN"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "subset_both = mtx.submatrix(row = rows, col = cols)\n",
    "subset_both"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also easily drop all empty rows/columns with the `drop()` method. So, for example, if you have subsetted your matrix by rows and now some columns are empty in all remaining rows, you can clean them out like so:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[1, 5]   apple/N  eat/V  girl/N  ten/J  the/D\n",
       "about/R  1        1      1       1      1"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mtx.submatrix(row = ['about/R']).drop(axis = 1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[43, 3]  give/V  eat/V  ask/V\n",
       ",/,      NaN     1      NaN\n",
       "a/D      2       1      NaN\n",
       "about/I  NaN     1      1\n",
       "about/R  NaN     1      NaN\n",
       "an/D     1       2      NaN\n",
       "and/C    NaN     4      NaN\n",
       "apple/N  2       12     NaN\n",
       "...      ...     ...    ..."
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "subset_cols.drop(axis = 0) #drops empty rows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[13, 3]  give/V  eat/V  ask/V\n",
       "a/D      2       1      NaN\n",
       "about/I  NaN     1      1\n",
       "an/D     1       2      NaN\n",
       "apple/N  2       12     NaN\n",
       "be/V     2       6      NaN\n",
       "boy/N    5       10     1\n",
       "by/I     1       3      NaN\n",
       "...      ...     ...    ..."
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "subset_cols.drop(axis = 0, n_nonzero = 1) # drops rows that only have one non-zero value or less"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you just want to obtain the value for a given row-column combination, you can simply subset with square brackets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "10"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mtx['apple/N','girl/N']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Boolean filter\n",
    "\n",
    "You can use any boolean matrix with the same dimensions to filter, for example, cells with certain values. The code below returns a boolean matrix with `True` where the values are larger than 1 and `False` where they are not (but `NaN` where it was `NaN`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[55, 55]  's/P  ,/,   a/D   about/I  about/R  all/P  an/D  ...\n",
       "'s/P      NaN   NaN   NaN   NaN      NaN      NaN    NaN   ...\n",
       ",/,       NaN   True  True  NaN      NaN      NaN    NaN   ...\n",
       "a/D       NaN   True  NaN   NaN      NaN      NaN    NaN   ...\n",
       "about/I   NaN   NaN   NaN   NaN      NaN      NaN    NaN   ...\n",
       "about/R   NaN   NaN   NaN   NaN      NaN      NaN    NaN   ...\n",
       "all/P     NaN   NaN   NaN   NaN      NaN      NaN    NaN   ...\n",
       "an/D      NaN   NaN   NaN   NaN      NaN      NaN    NaN   ...\n",
       "...       ...   ...   ...   ...      ...      ...    ...   ..."
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mtx > 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Such a matrix can be used to turn all the values below that threshold into 0s with the `multiply()` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[35, 35]  ,/,  a/D  an/D  and/C  apple/N  ask/V  at/I  ...\n",
       ",/,       2    2    NaN   NaN    NaN      NaN    NaN   ...\n",
       "a/D       2    NaN  NaN   2      4        NaN    2     ...\n",
       "an/D      NaN  NaN  NaN   NaN    3        NaN    NaN   ...\n",
       "and/C     NaN  2    NaN   NaN    3        NaN    NaN   ...\n",
       "apple/N   NaN  4    3     3      NaN      NaN    NaN   ...\n",
       "ask/V     NaN  NaN  NaN   NaN    NaN      NaN    NaN   ...\n",
       "at/I      NaN  2    NaN   NaN    NaN      NaN    NaN   ...\n",
       "...       ...  ...  ...   ...    ...      ...    ...   ..."
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mtx.multiply(mtx > 1).drop(axis = 1).drop(axis = 0)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}