
I’m starting this series in order to consolidate everything I’ve learned during the week with regard to AI safety research engineering. I think it could be useful for me, as there’s a lot of support out there for the idea that you only truly understand something once you try to explain it to someone else. So, let’s go.

I wanted to start with an area that I feel particularly weak in: linear algebra. I had a really unfortunate experience with it in college. My professor was so bad that, one week when a tutor substituted for him, she told the class she’d gotten a lot of requests to just take over teaching the course. I eventually stopped going to class and binged Khan Academy in order to do the homework and tests, but I retained absolutely nothing from it. And now that’s biting me in the ass, since so much of machine learning is built on linear algebra.

Thankfully, there’s one resource out there that I found made everything click: 3blue1brown’s Essence of Linear Algebra series. I cross-referenced the ARENA 3.0 prereqs in order to focus on only the topics that are directly relevant to machine learning.

Span and Basis Vectors

For the sake of simplicity, I’m going to stick to 2D space. There are two special unit vectors: \(\hat{i} = \begin{bmatrix}1 \\ 0\end{bmatrix}\), which represents moving by 1 on the x axis, and \(\hat{j} = \begin{bmatrix}0 \\ 1\end{bmatrix}\), which represents moving by 1 on the y axis. $\hat{i}$ and $\hat{j}$ are said to span 2D space, which means that you can get to any point by simply taking some linear combination of the two. For example, if you want to get to the point $(2, 3)$, you can think of it as moving some number of $\hat{i}$s and $\hat{j}$s, in this case:

\(\begin{bmatrix}2 \\ 3\end{bmatrix} = 2\hat{i} + 3\hat{j} = 2\begin{bmatrix}1 \\ 0\end{bmatrix} + 3\begin{bmatrix}0 \\ 1\end{bmatrix}\).
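
Here’s a quick sanity check of that linear combination in numpy (the library choice and the snippet are mine, not something from the series):

```python
import numpy as np

i_hat = np.array([1, 0])  # unit vector along the x axis
j_hat = np.array([0, 1])  # unit vector along the y axis

# Reaching the point (2, 3) as a linear combination of the basis vectors
point = 2 * i_hat + 3 * j_hat
print(point)  # [2 3]
```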

These two vectors also form a basis of 2D space because not only do they span that space, but they are also linearly independent of each other.

Okay so we have two basis unit vectors $\hat{i}$ and $\hat{j}$, now what?

Linear Transformations

A linear transformation is just a function that maps vectors to vectors (or matrices to matrices, but I’m going to stick to vectors). The word transformation suggests that the vector moves through space from the input to the output. Linear transformations have two important properties:

  1. Grid lines remain straight and evenly-spaced. No curvy grid lines or weird spacing allowed!
  2. The origin $\vec{0} = \begin{bmatrix} 0 \\ 0\end{bmatrix}$ remains fixed in place. So $f(\vec{0}) = \vec{0}$. This means that the grid isn’t getting shifted in any direction.

Here’s an example of a linear transformation in 2D:

\[A = \begin{bmatrix} 1 & 1 \\ -1 & 2 \end{bmatrix}\]

One way to think about a linear transformation is that the columns of its matrix represent where the vectors $\hat{i}$ and $\hat{j}$ get moved to in the new space. The first column represents where $\hat{i}$ goes, and the second represents where $\hat{j}$ goes.

\[\hat{i}:\begin{bmatrix} 1 \\ 0 \end{bmatrix} \to \begin{bmatrix} 1 \\ -1 \end{bmatrix}, \hat{j}:\begin{bmatrix} 0 \\ 1 \end{bmatrix} \to \begin{bmatrix} 1 \\ 2 \end{bmatrix}\]

You can figure this out with matrix multiplication:

\[\begin{bmatrix} 1 & 1 \\ -1 & 2 \end{bmatrix}\begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix}(1)(1) + (0)(1) \\ (1)(-1) + (0)(2)\end{bmatrix} = \begin{bmatrix} 1 \\ -1 \end{bmatrix}\] \[\begin{bmatrix} 1 & 1 \\ -1 & 2 \end{bmatrix}\begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix}(0)(1) + (1)(1) \\ (0)(-1) + (1)(2)\end{bmatrix} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}\]
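
If you want to convince yourself numerically, here’s a small numpy sketch (again, my own addition) showing that multiplying $A$ by $\hat{i}$ and $\hat{j}$ just reads off its columns:

```python
import numpy as np

A = np.array([[1, 1],
              [-1, 2]])
i_hat = np.array([1, 0])
j_hat = np.array([0, 1])

# The products pick out the columns of A: where i_hat and j_hat land
print(A @ i_hat)  # [ 1 -1]  -- the first column
print(A @ j_hat)  # [1 2]    -- the second column
```
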
How does this relate to ML?

The foundation of almost every neural network involves matrix-vector multiplication, and you can think of these layers as performing linear transformations on the input vector. For example, the linear layer has the form:

\[W\vec{x} + \vec{b}\]

Where $W$ is a matrix containing the weights for the layer, $\vec{x}$ is the input feature vector, and $\vec{b}$ is the bias vector. So you can think of a neural network linear layer as taking some feature vector and performing a linear transformation on it (strictly speaking an affine one, since adding the bias shifts the origin) to map it into some new space!
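
Here’s a minimal sketch of a linear layer as plain matrix math, assuming made-up sizes (4 features in, 3 out) and random weights rather than anything from a real network:

```python
import numpy as np

rng = np.random.default_rng(0)

W = rng.standard_normal((3, 4))  # weight matrix: one row per output unit (made-up size)
b = rng.standard_normal(3)       # bias vector
x = rng.standard_normal(4)       # input feature vector

out = W @ x + b                  # the linear layer: W x + b
print(out.shape)                 # (3,)
```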

Dot Products

Say we have two vectors:

\[\vec{v} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \vec{w} = \begin{bmatrix} 3 \\ -1 \end{bmatrix}\]

The dot product is computed as:

\[\begin{bmatrix} 1 \\ 2 \end{bmatrix} \cdot \begin{bmatrix} 3 \\ -1 \end{bmatrix} = (1)(3) + (2)(-1) = 3 - 2 = 1\]

This looks similar to a linear transformation, right? If we just take the transpose (swap the rows and columns) of the first vector, we get:

\[\begin{bmatrix} 1 & 2 \end{bmatrix} \begin{bmatrix} 3 \\ -1 \end{bmatrix} = (1)(3) + (2)(-1) = 3 - 2 = 1\]
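
In numpy (again my choice of tool), both views give the same number:

```python
import numpy as np

v = np.array([1, 2])
w = np.array([3, -1])

print(np.dot(v, w))  # 1 -- pairwise products summed: 1*3 + 2*(-1)
print(v @ w)         # 1 -- same thing written as a matrix product
```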

The geometric way to think about the dot product is you first project $\vec{w}$ onto $\vec{v}$, then multiply the lengths of $\vec{v}$ and the projected $\vec{w}$ together. The dot product has the following properties:

  • If it’s positive, then the vectors are pointing in similar directions (the angle between them is less than $90^\circ$).
  • If it’s zero, then the vectors are perpendicular.
  • If it’s negative, then the vectors are pointing in roughly opposite directions (the angle between them is more than $90^\circ$).
How does this relate to ML?

Basically the same point as in the linear transformations section: $W\vec{x}$ is really a stack of dot products, where each output element is the dot product of one row of $W$ with $\vec{x}$.
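
A tiny numpy check of that claim, with arbitrary numbers of my own:

```python
import numpy as np

W = np.array([[1, 1],
              [-1, 2]])
x = np.array([2, 3])

print(W @ x)               # [5 4]
for row in W:
    print(np.dot(row, x))  # 5, then 4 -- each output entry is a row of W dotted with x
```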

Determinants

The determinant of a linear transformation is the factor by which it scales any area in the original space. A determinant of 0 means that space gets squished onto a line or a point by the transformation. A related term is the null space of a linear transformation, which is the set of vectors that get squished onto the origin; when the determinant is 0, the null space contains more than just the zero vector. This isn’t directly useful for ML, but it sets up how to understand eigenvectors and eigenvalues in the next section.
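
If you want to poke at this numerically, here’s a small sketch (numpy and the example matrices are my own):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [-1.0, 2.0]])
print(np.linalg.det(A))   # about 3.0 -- areas get scaled by a factor of 3

B = np.array([[1.0, 2.0],
              [2.0, 4.0]])  # the second row is a multiple of the first
print(np.linalg.det(B))   # about 0.0 -- the plane gets squished onto a line
```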

Eigenvectors and Eigenvalues

An eigenvector of a linear transformation is any vector that stays on its own line (span) after the transformation is applied. Two extreme examples:

  • A clockwise rotation by $90^\circ$ has no (real) eigenvectors because every vector gets rotated off its original line.
  • A transformation that doubles in all directions has every vector as an eigenvector because the length of the original vector just gets doubled.

An eigenvalue is the amount by which an eigenvector gets scaled after doing the transformation.

  • An eigenvalue >1 stretches the eigenvector
  • An eigenvalue >0 and <1 shrinks the eigenvector
  • An eigenvalue <0 flips the direction of the eigenvector (and stretches or shrinks it depending on whether its absolute value is greater or less than 1)

Given a linear transformation $A$ and the geometric definitions above you can express eigenvalues $\lambda$ and eigenvectors $\vec{v}$ with the following relationship:

\[A\vec{v} = \lambda\vec{v}\]

Say $A = \begin{bmatrix} 1 & 0 \\ -1 & 2 \end{bmatrix}$; we can solve for the eigenvalues and eigenvectors like so:

\[A\vec{v} = \lambda\vec{v} \\ A\vec{v} = \lambda I\vec{v} \\ A\vec{v} - \lambda I\vec{v} = \vec{0} \\ (A - \lambda I)\vec{v} = \vec{0} \\ (\begin{bmatrix} 1 & 0 \\ -1 & 2 \end{bmatrix} - \lambda \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix})\vec{v} = \vec{0} \\ \begin{bmatrix} 1 - \lambda & 0 \\ -1 & 2-\lambda \end{bmatrix}\vec{v} = \vec{0}\]

Now one last leap: a nonzero $\vec{v}$ can only get sent to $\vec{0}$ if the matrix squishes space, i.e. its determinant is 0. So we are interested in finding $\lambda$ such that $\det\left(\begin{bmatrix} 1 - \lambda & 0 \\ -1 & 2-\lambda \end{bmatrix}\right) = 0$

\[(1-\lambda)(2-\lambda) - (0)(-1) = 0 \\ (1-\lambda)(2-\lambda) = 0\]

This gives us two values for $\lambda$: 1 and 2. Now just go back, plug in $\lambda$ and solve for the x and y values of $\vec{v}$ to get the eigenvectors.

\(\begin{bmatrix} 1 - 1 & 0 \\ -1 & 2-1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \vec{0} \\ -x + y = 0\\ y = x\) So one such eigenvector on the line $y=x$ is $\begin{bmatrix}1 \\ 1\end{bmatrix}$ for $\lambda = 1$.

For $\lambda = 2$:

\[\begin{bmatrix} 1 - 2 & 0 \\ -1 & 2-2 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \vec{0} \\ -x = 0 \\ x = 0\]

So one such eigenvector on the line $x=0$ is $\begin{bmatrix}0 \\ 1\end{bmatrix}$ for $\lambda = 2$.
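
We can double-check both answers with numpy’s eigensolver (my own addition, not part of the series):

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [-1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)   # [1. 2.] -- the same lambdas we solved for (the order isn't guaranteed)
print(eigenvectors)  # columns are normalized eigenvectors along the lines y = x and x = 0
```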

There’s a neat trick where you use the eigenvectors as the columns of a change of basis matrix. Given the original transformation, the eigenvectors of that transformation, and the eigenvalues of those eigenvectors, you get this cool relationship:

\[\begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}^{-1} \begin{bmatrix} 1 & 0 \\ -1 & 2 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}\]

The left side you can think of as saying “express the linear transformation in terms of this (eigen)basis”. The result is guaranteed to be diagonal (with the associated eigenvalues on the diagonal), because the eigenvectors just get scaled by the linear transformation! This is useful when, say, you want to apply the same linear transformation a lot of times (take its power): you can find the eigenbasis and eigenvalues, change basis into the eigenbasis, apply the repeated transformation all at once (just raise each diagonal entry to the power), then change basis back.
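
Here’s a sketch of that trick in numpy, using the eigenvectors we just found as the change of basis matrix:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [-1.0, 2.0]])
P = np.array([[1.0, 0.0],    # columns are the eigenvectors [1, 1] and [0, 1]
              [1.0, 1.0]])
P_inv = np.linalg.inv(P)

D = P_inv @ A @ P            # A expressed in the eigenbasis
print(D)                     # [[1. 0.], [0. 2.]] -- diagonal, with the eigenvalues on it

# Taking a power via the eigenbasis: A^5 = P D^5 P^-1
A_to_5 = P @ np.diag(np.diag(D) ** 5) @ P_inv
print(np.allclose(A_to_5, np.linalg.matrix_power(A, 5)))  # True
```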

How does this relate to ML?

Eigenvectors and eigenvalues are important parts of principal component analysis (PCA), where a high-dimensional feature space gets projected into a lower-dimensional space. They have other uses in clustering and computer vision, but I haven’t touched them.
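
Since PCA came up, here’s a bare-bones sketch of it via eigendecomposition of the covariance matrix, on random made-up data (my own illustration, not from the series):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))   # made-up data: 100 samples, 5 features
X = X - X.mean(axis=0)              # center each feature

cov = np.cov(X, rowvar=False)       # 5x5 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh because the covariance matrix is symmetric

# Keep the 2 eigenvectors with the largest eigenvalues: the principal components
top2 = eigenvectors[:, np.argsort(eigenvalues)[::-1][:2]]
X_reduced = X @ top2                # project the 5D features down to 2D
print(X_reduced.shape)              # (100, 2)
```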

Conclusion

I feel like I have a decent grasp of the important parts of linear algebra, at least as far as I need for mastering transformers and backpropagation (which are where I want to go next).