Linear Algebra

Misc

  • Packages
    • {gghinton} - A {ggplot2} extension for drawing Hinton diagrams
      • A visualization technique for numerical matrices in which the area of each square is proportional to the magnitude of the corresponding entry
      • Useful for visualizing PCA weight matrices, correlation matrices, and transition matrices.
    • {GPUmatrix} (GitHub) - Mimics the behavior of {Matrix} and extends R to use the GPU for computations. It includes single (FP32) and double (FP64) precision data types, and provides support for sparse matrices. Able to use TensorFlow or Torch.
    • {Matrix} - A rich hierarchy of sparse and dense matrix classes, including general, symmetric, triangular, and diagonal matrices with numeric, logical, or pattern entries.
      • Efficient methods for operating on such matrices, often wrapping the ‘BLAS’, ‘LAPACK’, and ‘SuiteSparse’ libraries (see the sparse-storage sketch at the end of this list)
    • {sparsevctrs} - Sparse Vectors for Use in Data Frames or Tibbles
      • Sparse matrices are not great for general data manipulation; they mainly pay off at the very end of a pipeline, when the mathematical calculations occur.
      • Some computational tools use sparse matrices, notably the {Matrix} package and some modeling packages (e.g., xgboost, glmnet).
      • Provides a sparse representation of data that allows us to use modern data manipulation interfaces, keeps memory overhead low, and can be efficiently converted to a more primitive matrix format so that we can let {Matrix} and other packages do what they do best.
    • {quickr} - R to Fortran Transpiler
      • Only atomic vectors, matrices, and arrays are currently supported: integer, double, logical, and complex.

      • The return value must be an atomic array (e.g., not a list)

      • Only a subset of R’s vocabulary is currently supported.

        #>  [1] !=        %%        %/%       &         &&        (         *        
        #>  [8] +         -         /         :         <         <-        <=       
        #> [15] =         ==        >         >=        Fortran   [         [<-      
        #> [22] ^         c         cat       cbind     character declare   double   
        #> [29] for       if        ifelse    integer   length    logical   matrix   
        #> [36] max       min       numeric   print     prod      raw       seq      
        #> [43] sum       which.max which.min {         |         ||
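    • A minimal sketch of the sparse storage {Matrix} provides (the dimensions and entries here are arbitrary):

      library(Matrix)

      # 1000 x 1000 matrix with only two non-zero entries, stored sparsely
      m <- sparseMatrix(i = c(1, 2), j = c(1, 7), x = c(5, 3),
                        dims = c(1000, 1000))
      class(m)          # "dgCMatrix": general numeric sparse, column-compressed
      object.size(m)    # a few KB, vs ~8 MB for the dense equivalent
      mm <- m %*% t(m)  # arithmetic dispatches to sparse-aware methods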

Resources

Matrix Multiplication

Matrix Algebra

  • Multiplying a random vector \(X\) by a constant matrix \(C\) inside an expected value equation (VC stands for variance-covariance in the example):

    \[ \text{VC}(CX) = \mathbb{E}\left[(CX - \mathbb{E}[CX])(CX - \mathbb{E}[CX])^T\right] = C\,\text{VC}(X)\,C^T \]

    • \(C\) is factored out of the expected value as \(C\) (the left factor)
    • \(C\) is factored out of the transpose as \(C^T\) (the right factor)
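  • A quick numeric check of this identity by simulation (a sketch; the particular \(C\) and distribution are arbitrary):

    set.seed(1)
    X <- matrix(rnorm(3e5), ncol = 3)           # 1e5 draws of a 3-dim random vector
    C <- matrix(c(1, 0, 2, 1, 1, 0), nrow = 2)  # constant 2 x 3 matrix

    lhs <- cov(X %*% t(C))        # sample VC of the transformed draws, CX
    rhs <- C %*% cov(X) %*% t(C)  # C VC(X) C^T
    max(abs(lhs - rhs))           # ~0, up to sampling noise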

Factorization

  • \(A = CR\) (Column-Row)
    • Factors \(A\) into a matrix \(C\) of linearly independent columns of \(A\), and \(R\), the row echelon form of \(A\)
    • Proves that row rank equals column rank
    • Use cases:
      • Theoretical and pedagogical tool to understand rank and linear independence
  • \(A = LU\) (LU Decomposition)
    • Splits \(A\) into a lower triangular matrix \(L\) and upper triangular matrix \(U\) via Gaussian elimination
    • Use cases:
      • Solving systems of linear equations \(Ax = b\)
      • Computing determinants
      • Matrix inversion
      • Numerical software libraries such as LAPACK
  • \(A = QR\) (QR Decomposition)
    • Factors \(A\) into an orthogonal matrix \(Q\) and upper triangular matrix \(R\) via Gram-Schmidt orthogonalization
    • Use cases:
      • Solving least squares problems
      • Computing eigenvalues iteratively via the QR algorithm
      • Numerically stable solutions to linear systems
  • \(S = Q\Lambda Q^T\) (Eigenvalue Decomposition)
    • Decomposes a symmetric matrix \(S\) into its eigenvectors (columns of \(Q\)) and a diagonal matrix \(\Lambda\) of eigenvalues
    • Use cases:
      • Principal Component Analysis (PCA)
      • Quantum mechanics
      • Solving differential equations
      • Understanding the geometry of linear transformations
  • \(A = U\Sigma V^T\) (Singular Value Decomposition (SVD))
    • The most general factorization, works on any matrix; \(U\) and \(V\) are orthogonal matrices of singular vectors, \(\Sigma\) is diagonal with singular values
    • Use cases:
      • Dimensionality reduction
      • Image compression
      • Recommender systems
      • Pseudoinverse computation
      • Low-rank approximations
  • \(A = WH\) (Non-Negative Matrix Factorization (NMF))
    • Also see Algorithms, Recommender >> Collaborative Filtering >> Non-Negative Matrix Factorization (NMF)
    • Factors a non-negative matrix \(A\) into two non-negative matrices \(W\) (basis or contribution matrix) and \(H\) (coefficient or composition matrix)
    • Unlike other factorizations, the non-negativity constraint produces additive, parts-based representations that are often more interpretable
    • No closed-form solution; computed iteratively via multiplicative update rules or gradient descent
    • Use cases:
      • Topic modeling in natural language processing (identifying themes in document collections)
      • Image decomposition and facial feature recognition
      • Audio source separation (e.g. isolating instruments in a mixed recording)
      • Bioinformatics (e.g. gene expression analysis)
      • Recommender systems (e.g. user-item matrix factorization)
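  • Base R computes several of these factorizations directly; a minimal sketch with an arbitrary \(A\) (LU additionally needs {Matrix}, NMF a package such as {NMF}):

    set.seed(1)
    A <- matrix(rnorm(9), nrow = 3)

    # QR: Q orthogonal, R upper triangular
    qr_fac <- qr(A)
    Q <- qr.Q(qr_fac)
    R <- qr.R(qr_fac)
    max(abs(Q %*% R - A))  # ~0, up to floating point

    # Eigenvalue decomposition of a symmetric matrix: S = Q Lambda Q^T
    S <- crossprod(A)      # t(A) %*% A is symmetric
    eig <- eigen(S)
    max(abs(eig$vectors %*% diag(eig$values) %*% t(eig$vectors) - S))

    # SVD: A = U Sigma V^T
    sv <- svd(A)
    max(abs(sv$u %*% diag(sv$d) %*% t(sv$v) - A))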

Methods

The Moore-Penrose Inverse

  • The Moore-Penrose inverse, often denoted as \(A^+\), is a generalization of the ordinary matrix inverse that applies to any matrix, even those that are singular (non-invertible) or rectangular. It was independently developed by E.H. Moore and Roger Penrose.
  • Generalized Inverse
    • Unlike a regular inverse, which only exists for square, non-singular matrices, the Moore-Penrose inverse exists for all matrices. This is incredibly useful in situations where you have more equations than unknowns (overdetermined systems) or fewer equations than unknowns (underdetermined systems), or when your matrix is singular.
  • Four Penrose Conditions
    • The Moore-Penrose inverse \(A^+\) of a matrix \(A\) is uniquely defined by four conditions:
      1. \(A A^+ A = A\)
      2. \(A^+ A A^+ = A^+\)
      3. \((A A^+)^T = A A^+\)
      4. \((A^+ A)^T = A^+ A\)
  • Least Squares Solution
    • One of its most important applications is in finding the “best fit” (least squares) solution to a system of linear equations \(Ax = b\). When an exact solution doesn’t exist (e.g., in overdetermined systems), the Moore-Penrose inverse provides the \(x\) that minimizes the Euclidean norm of the residual error, \(\|Ax - b\|^2\). The solution is given by \(x = A^+ b\). If there are multiple solutions (e.g., in underdetermined systems), it provides the solution with the minimum Euclidean norm \(\|x\|^2\).
  • Computation
    • The most common and robust method for computing the Moore-Penrose inverse is through Singular Value Decomposition (SVD). If \(A = U \Sigma V^T\) is the SVD of \(A\), then \(A^+ = V \Sigma^+ U^T\), where \(\Sigma^+\) is obtained by taking the reciprocal of the non-zero singular values in \(\Sigma\) and transposing the resulting matrix.
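  • A minimal implementation following the SVD formula above, checked against MASS::ginv() ({MASS} ships with R; the tolerance choice here is an assumption):

    pinv <- function(A, tol = 1e-8) {
      sv <- svd(A)
      # reciprocals of the non-zero singular values; zeros elsewhere
      d_plus <- ifelse(sv$d > tol * max(sv$d), 1 / sv$d, 0)
      sv$v %*% diag(d_plus, nrow = length(d_plus)) %*% t(sv$u)
    }

    A <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3)  # rectangular 3 x 2
    A_plus <- pinv(A)

    all.equal(A %*% A_plus %*% A, A)            # Penrose condition 1
    all.equal(A_plus %*% A %*% A_plus, A_plus)  # Penrose condition 2
    isSymmetric(A %*% A_plus)                   # Penrose condition 3
    isSymmetric(A_plus %*% A)                   # Penrose condition 4

    b <- c(1, 0, 2)
    x <- A_plus %*% b                           # least-squares solution to Ax = b
    all.equal(c(x), c(MASS::ginv(A) %*% b))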

Unmixing

  • The process of decomposing a mixed signal or observation into its constituent pure components (called endmembers) and determining how much each component contributes to the mixture (called abundances).
  • It relates naturally to NMF because the non-negativity constraint maps directly onto physical reality — most real-world mixtures (light, sound, chemicals, etc.) cannot have negative components.
    • The basis matrix \(W\) holds the pure endmember profiles and the coefficient matrix \(H\) holds the mixing proportions.
    • Because both are constrained to be non-negative, the result is interpretable as a physically meaningful additive mixture — something that methods like PCA or SVD cannot guarantee, since they allow negative values that have no clear real-world interpretation.
  • Use Cases:
    • Hyperspectral imaging - A pixel in a satellite image might contain a mixture of soil, vegetation, and water. Unmixing identifies the pure spectral signatures (W) and their proportions (H) in each pixel
    • Audio - A recording of multiple instruments is a mixture; unmixing recovers the individual sources
    • Chemometrics - A chemical sample measured by spectroscopy may contain multiple substances; unmixing recovers the pure spectra and concentrations
    • Neuroscience - Brain imaging signals are mixtures of activity from different neural populations; unmixing separates them
  • Adjustments must be made to the \(W\) and \(H\) matrices:
    • Row-standardize the \(W\) matrix: divide each entry in the matrix by the corresponding row sum.
    • Multiply the \(H\) matrix by the ratio of the column sums of the \(W\) matrix before and after row-standardization.
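  • A hedged sketch of these adjustments using {NMF} (an assumed dependency; the synthetic non-negative data stands in for a real mixture):

    library(NMF)

    set.seed(42)
    A <- matrix(runif(200), nrow = 20)  # 20 bands x 10 pixels, non-negative
    res <- nmf(A, rank = 3)
    W <- basis(res)                     # 20 x 3 endmember profiles
    H <- coef(res)                      # 3 x 10 abundances

    cs_before <- colSums(W)
    W_std <- W / rowSums(W)             # row-standardize: each row now sums to 1
    H_adj <- H * (cs_before / colSums(W_std))  # rescale each row of H by the column-sum ratio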

Specialty Matrices

  • Notes from Derivatives, Gradients, Jacobians and Hessians – Oh My!
  • Jacobian

    \[ \begin{align} &v, w = f(x, y, z) \\ &\mathbb{J} = \begin{bmatrix} \frac{\partial v}{\partial x} & \frac{\partial v}{\partial y} & \frac{\partial v}{\partial z} \\ \frac{\partial w}{\partial x} & \frac{\partial w}{\partial y} & \frac{\partial w}{\partial z} \end{bmatrix} \end{align} \]
    • Also see (Video) The Jacobian : Data Science Basics for its usage in ML
    • The gradient is calculated for \(v\) and the gradient is calculated for \(w\). The results are then stacked as the rows of the matrix.
    • At a specific point in space (of whatever space the input parameters are in), it tells you how the space is warped in that location – like how much it is rotated and squished.
    • The determinant of the Jacobian:
      • \(\gt 1\) : Things get bigger
      • \(\lt 1\) but \(\gt 0\) : Things get smaller
      • \(\lt 0\) : Things get flipped
      • \(0\) : Things get squished to a point, and the matrix is not invertible
  • Hessian
    \[ \begin{align} &w = f(x, y, z) \\ &\mathbb{H} = \begin{bmatrix} \frac{\partial^2 w}{\partial x^2} & \frac{\partial^2 w}{\partial x \partial y} & \frac{\partial^2 w}{\partial x \partial z} \\ \frac{\partial^2 w}{\partial y \partial x} & \frac{\partial^2 w}{\partial y^2} & \frac{\partial^2 w}{\partial y \partial z} \\ \frac{\partial^2 w}{\partial z \partial x} & \frac{\partial^2 w}{\partial z \partial y} & \frac{\partial^2 w}{\partial z^2} \end{bmatrix} \end{align} \]
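  • Numerical versions of both are available via {numDeriv} (an assumed dependency; f_vec and f_scal are made-up illustrative functions):

    library(numDeriv)

    f_vec <- function(p) {       # (v, w) = f(x, y, z)
      x <- p[1]; y <- p[2]; z <- p[3]
      c(v = x * y + z, w = x^2 - sin(z))
    }
    jacobian(f_vec, c(1, 2, 3))  # 2 x 3 matrix of first partials

    f_scal <- function(p) {      # w = f(x, y, z)
      x <- p[1]; y <- p[2]; z <- p[3]
      x^2 * y + y * z^2
    }
    hessian(f_scal, c(1, 2, 3))  # 3 x 3 symmetric matrix of second partials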

R

  • drop = TRUE (default): If TRUE, the result of extracting from a matrix is coerced to the lowest possible dimension

    • This only applies to extraction, not to replacement

    • Example:

      (m <- matrix(1:3, nrow = 3))
      #>      [,1]
      #> [1,]    1
      #> [2,]    2
      #> [3,]    3
      
      m[1:2,]
      #> [1] 1 2
      
      class(m[1:2,])
      #> [1] "integer"
      
      m[1:2, , drop = FALSE]
      #>      [,1]
      #> [1,]    1
      #> [2,]    2
      
      class(m[1:2, , drop = FALSE])
      #> [1] "matrix" "array" 
      
      rowSums(m[1:2,])
      #> Error in rowSums(m[1:2, ]) : 
      #>   'x' must be an array of at least two dimensions
      
      rowSums(m[1:2, , drop = FALSE])
      #> [1] 1 2
  • Example: Difference between a column and its lag

    from blah blah

    Each column is a time step (e.g. day 1, day 2, day 3, etc.)

    set.seed(2026)
    
    (Y <- matrix(sample(1:18, 18), nrow = 3))
    #>      [,1] [,2] [,3] [,4] [,5] [,6]
    #> [1,]    1   15    4    2   14    9
    #> [2,]    6   11    5    8   18    7
    #> [3,]   13   12   10    3   16   17
    
    T <- ncol(Y)
    u <- 1 # lag 1
    
    # cols 1:5
    Y[, 1:(T - u), drop = FALSE] 
    #>      [,1] [,2] [,3] [,4] [,5]
    #> [1,]    1   15    4    2   14
    #> [2,]    6   11    5    8   18
    #> [3,]   13   12   10    3   16
    
    # cols 2:6
    Y[, (1 + u):T, drop = FALSE]
    #>      [,1] [,2] [,3] [,4] [,5]
    #> [1,]   15    4    2   14    9
    #> [2,]   11    5    8   18    7
    #> [3,]   12   10    3   16   17
    
    # difference
    Y[, 1:(T - u), drop = FALSE] -
      Y[, (1 + u):T, drop = FALSE]
    #>      [,1] [,2] [,3] [,4] [,5]
    #> [1,]  -14   11    2  -12    5
    #> [2,]   -5    6   -3  -10   11
    #> [3,]    1    2    7  -13   -1
    • Column 2 is subtracted from column 1, column 3 from column 2, and so on.
    • T is the number of columns; a lag of u leaves T - u columns, so the largest lag we can difference at is T - 1.
    u <- 2 # lag 2
    
    # cols 1:4
    Y[, 1:(T - u), drop = FALSE] 
    #>      [,1] [,2] [,3] [,4]
    #> [1,]    1   15    4    2
    #> [2,]    6   11    5    8
    #> [3,]   13   12   10    3
    
    # cols 3:6
    Y[, (1 + u):T, drop = FALSE]
    #>      [,1] [,2] [,3] [,4]
    #> [1,]    4    2   14    9
    #> [2,]    5    8   18    7
    #> [3,]   10    3   16   17
    
    # difference
    Y[, 1:(T - u), drop = FALSE] -
      Y[, (1 + u):T, drop = FALSE]
    #>      [,1] [,2] [,3] [,4]
    #> [1,]   -3   13  -10   -7
    #> [2,]    1    3  -13    1
    #> [3,]    3    9   -6  -14
    • With the second lag, each submatrix has 2 fewer columns than the original total. (It was 1 fewer for lag 1)
      • This explains why T - 1 is the largest usable lag: lag T - 1 leaves a single column to difference
    u <- 3 # lag 3
    
    # cols 1:3
    Y[, 1:(T - u), drop = FALSE] 
    #>      [,1] [,2] [,3]
    #> [1,]    1   15    4
    #> [2,]    6   11    5
    #> [3,]   13   12   10
    
    # cols 4:6
    Y[, (1 + u):T, drop = FALSE]
    #>      [,1] [,2] [,3]
    #> [1,]    2   14    9
    #> [2,]    8   18    7
    #> [3,]    3   16   17
    
    # difference
    Y[, 1:(T - u), drop = FALSE] -
      Y[, (1 + u):T, drop = FALSE]
    #>      [,1] [,2] [,3]
    #> [1,]   -1    1   -5
    #> [2,]   -2   -7   -2
    #> [3,]   10   -4   -7
    Y[2,2] <- NA
    Y
    #>      [,1] [,2] [,3] [,4] [,5] [,6]
    #> [1,]    1   15    4    2   14    9
    #> [2,]    6   NA    5    8   18    7
    #> [3,]   13   12   10    3   16   17 
    
    u <- 1 # lag 1
    
    Y[, 1:(T - u), drop = FALSE] 
    #>      [,1] [,2] [,3] [,4] [,5]
    #> [1,]    1   15    4    2   14
    #> [2,]    6   NA    5    8   18
    #> [3,]   13   12   10    3   16
    
    Y[, (1 + u):T, drop = FALSE]
    #>      [,1] [,2] [,3] [,4] [,5]
    #> [1,]   15    4    2   14    9
    #> [2,]   NA    5    8   18    7
    #> [3,]   12   10    3   16   17
    
    (diff <- Y[, 1:(T - u), drop = FALSE] - 
      Y[, (1 + u):T, drop = FALSE])
    #>      [,1] [,2] [,3] [,4] [,5]
    #> [1,]  -14   11    2  -12    5
    #> [2,]   NA   NA   -3  -10   11
    #> [3,]    1    2    7  -13   -1
    
    diff^2 
    #>      [,1] [,2] [,3] [,4] [,5]
    #> [1,]  196  121    4  144   25
    #> [2,]   NA   NA    9  100  121
    #> [3,]    1    4   49  169    1
    • So NA minus a number (or vice versa) is NA, and squaring elementwise (diff^2) shows that NA^2 is also NA.