06 Sep 2021
The Stardew Valley Grange Display - Basic Web Scraping
Lately, I’ve been pretty obsessed with Stardew Valley. It’s a lovely little farming game, with a lot more to do besides grow crops. For the next couple of posts, I’d like to focus on an event that happens every fall: the Stardew Valley Fair. Specifically, you have the option of creating a Grange Display, and winning gets you 1,000 star tokens. The star tokens aren’t the most useful thing in the world, but like many things in the game, that’s hardly the point. I’d just like to figure out how to get a perfect score with the least amount of work.
Step one, which I’ll describe in this post, is some super basic web scraping of the Stardew Valley Wiki. My goal is to grab the tables on the page about the fair to find the point values for the different items you can put in the grange display. I’ll use the requests package to get the web page and parse the HTML using Beautiful Soup. My ultimate goal is to create a pandas DataFrame.
#import libraries
import requests
from bs4 import BeautifulSoup
from IPython.display import display, HTML
import pandas as pd
#get html from stardew wiki
r = requests.get('https://stardewvalleywiki.com/Stardew_Valley_Fair')
html_text = r.text
soup = BeautifulSoup(html_text, 'html.parser')
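A quick sanity check along these lines can confirm the request succeeded and the page parsed (the exact title text is just what I’d expect, not something the rest of the code relies on):
#optional sanity check on the request and the parsed page
print(r.status_code)     #200 indicates the request succeeded
print(soup.title.string) #the page <title>; should mention the Stardew Valley Fair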
Beautiful Soup turns the HTML page into an object that can be accessed in a much more Pythonic way. For example, here I can find all the ‘wikitable’ tables on the page. Dealing with the actual text inside each row and column requires a slightly different procedure depending on whether the cell contains a single string, multiple strings, or an image. Cells that have both an image and text are read just as text, since the image is usually a ‘gold’ or ‘star token’ icon. It would be possible to reproduce those cells in full, but it would be more work and doesn’t really help me.
#find all the tags that are part of the wikitable css class
tables = soup.find_all(class_="wikitable")
#a little helper function for when there are several strings within the same tag
def join_strings(tag):
    return ''.join([string for string in tag.stripped_strings])
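As a quick illustration of what that helper does, here’s a made-up cell (not taken from the wiki) containing both an image and text; stripped_strings ignores the image and yields only the text:
#hypothetical cell: an image tag followed by text
example_cell = BeautifulSoup('<td><img src="/x.png" alt="Token.png"/> 1,000</td>', 'html.parser').td
print(join_strings(example_cell)) #prints '1,000' - the image contributes no text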
#function to parse the table into a pandas dataframe
#optionally searches for the heading to add a 'category' column to the dataframe
def parse_table(table, get_heading = True):
    if get_heading:
        table_title = [join_strings(table.find_previous(class_='mw-collapsible').th)]
    else:
        table_title = []
    row = table.tr #the first row has several 'th' tags, this is pretty general for tables on this site
    headers = [elem.string.strip() for elem in row.find_all("th")]
    #print(headers)
    records = []
    row = row.next_sibling
    while row:
        if row.string:
            row = row.next_sibling #prevents reading in rows that are just a newline without columns
            continue
        else:
            column = row.td #find first column
            record = []
            while column:
                if column.string:
                    cell_text = column.string.strip() #get string within the td tag
                else:
                    cell_text = join_strings(column) #get string when there are multiple strings in a tag
                try:
                    image = column.img
                    #Don't want the star token and gold images
                    if (image.attrs['alt'] != 'Token.png')&(image.attrs['alt'] != 'Gold.png'):
                        if image.attrs['src'][:5] != 'https':
                            image.attrs['src'] = 'https://stardewvalleywiki.com'+image.attrs['src'] #get full location
                        image = str(image) #needs to be a string to be resolved in the DataFrame
                    else:
                        image = None
                except:
                    image = None
                if cell_text != '':
                    record.append(cell_text) #adds column to row record if it isn't empty
                elif image:
                    record.append(image) #adds column to row record if it's an image (and just an image)
                column = column.next_sibling #goes to the next column
            records.append(tuple(record+table_title)) #adds row to the list of records
            #print(records)
        row = row.next_sibling #goes to the next row
    if get_heading:
        columns = headers+['category'] #adds category column if we are finding the table heading
    else:
        columns = headers #just uses the column names found in th tags
    table_df = pd.DataFrame.from_records(data=records, columns=columns).dropna() #creates a pandas dataframe from a list of records
    return table_df
def display_table(table_df):
    return display(HTML(table_df.to_html(escape=False, index=False))) #helper function to make it display like I want
dfs = []
for table in tables[3:11]:
    dfs.append(parse_table(table)) #get all the tables related to the grange display
grange_df = pd.concat(dfs, axis=0, ignore_index=True) #concatenate the tables into one flat table
grange_df = grange_df.melt(id_vars = ["Item","category","Price"],
                           value_vars = ["Base","Silver","Gold","Iridium"],
                           var_name = "Quality",
                           value_name = "Points") #melt the table so that the points are all in one column
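Something along these lines (using the display helper from above; my guess at the call, since only the output survives) shows the first few rows of the melted table:
#show the first few rows of the melted table
display_table(grange_df.head(12))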
| Item | category | Price | Quality | Points |
| --- | --- | --- | --- | --- |
| Honey (Wild)* | Artisan Goods | 100g | Base | 3 |
| Jelly (Ancient Fruit) | Artisan Goods | 1,150g | Base | 6 |
| Jelly (Apple) | Artisan Goods | 250g | Base | 4 |
| Jelly (Apricot) | Artisan Goods | 150g | Base | 3 |
| Jelly (Banana) | Artisan Goods | 350g | Base | 5 |
| Jelly (Blackberry) | Artisan Goods | 90g | Base | 3 |
| Jelly (Blueberry) | Artisan Goods | 150g | Base | 3 |
| Jelly (Cactus Fruit) | Artisan Goods | 200g | Base | 4 |
| Jelly (Cherry) | Artisan Goods | 210g | Base | 4 |
| Jelly (Coconut) | Artisan Goods | 250g | Base | 4 |
| Jelly (Cranberries) | Artisan Goods | 200g | Base | 4 |
| Jelly (Crystal Fruit) | Artisan Goods | 350g | Base | 5 |
And here’s an example of one of the tables that has images in it. For my goal of finding an easy perfect score grange display, I won’t really need it, but it’s a nice thing to know how to do.
display_table(parse_table(tables[12]))
| Image | Name | Description | Price | category |
| --- | --- | --- | --- | --- |
| (image) | Dried Sunflowers | Can be placed inside your house. | 100 | Vegetables |
| (image) | Fedora | A city-slicker's standard. | 500 | Vegetables |
| (image) | Rarecrow | Collect them all! (1 of 8) | 800 | Vegetables |
| (image) | Stardrop | A mysterious fruit that empowers those who eat it. The flavor is like a dream... a powerful personal experience, yet difficult to describe to others. | 2,000 | Vegetables |
| (image) | Light Green Rug | Can be placed inside your house. | 500 | Vegetables |
And that’s it! In the next post, I’ll analyze my pandas dataframes.
24 Jul 2019
This is a post in a series about gradients.
Let’s start with a two layer network. The workflow is
1) A weight matrix multiplies the feature vectors (\(input_1 = XW_1 + b_1\))
2) The weighted features go through an activation function to create hidden features (\(hidden = \alpha_1(input_1)\))
3) A second weight matrix multiplies the hidden features (\(input_2 = hW_2 + b_2\))
4) The second layer goes through an activation function to create scores (\(scores = \alpha_2(input_2)\))
5) A loss is computed from the scores.
6) The gradient of the loss is calculated and used to optimize the weight matrices.
As an example, we use an activation function of \(max(0,x)\) for the hidden layer and the softmax function for the output layer. The loss is calculated as the log loss. My convention for indices is: i for training examples, k for the feature dimensions, l for the hidden units, and j for the classes.
The forward pass then looks like so:
- \(f^{(1)}_{il} = \sum\limits_k X_{ik} W^{(1)}_{kl} + b^{(1)}_{l}\) 1st layer weighted features
- \(h_{il} = max(0, f^{(1)}_{il})\) hidden features from the activation function
- \(s_{ij} = \sum\limits_l h_{il}W^{(2)}_{lj} + b^{(2)}_j\) 2nd layer scores
- \(L_i = -log(softmax(s_{iy_i}))\) log loss
- \(L = \dfrac{1}{N}\sum\limits_i L_i + \lambda\left(\sum\limits_{k,l}(W^{(1)}_{kl})^2 + \sum\limits_{l,j}(W^{(2)}_{lj})^2 + \sum\limits_l (b^{(1)}_{l})^2+ \sum\limits_j (b^{(2)}_{j})^2\right )\) loss with regularization
In numpy, it looks like this:
import numpy as np

#X, y, W1, b1, W2, b2, and reg are assumed to be defined already
num_train = X.shape[0]
i = np.arange(num_train)
#inputs to hidden layer
inputs1 = X @ W1 + b1[np.newaxis,:]
#hidden layer after activation function
h = np.maximum(0, inputs1)
#inputs to output layer
inputs2 = h @ W2 + b2[np.newaxis,:]
#output layer activation function (softmax)
scores = inputs2 - np.amax(inputs2, axis = 1, keepdims = True)
score_factors = np.exp(scores)
sum_score_factors = np.sum(score_factors, axis = 1, keepdims = True)
softmax = score_factors/sum_score_factors #softmax function for each observation and class
#Calculate loss (log loss with regularization)
data_loss = -np.sum(np.log(softmax[i,y]))
data_loss /= num_train
reg_loss = reg*(np.square(W1)).sum() + reg*(np.square(W2)).sum()
reg_loss += reg*(np.square(b1)).sum() + reg*(np.square(b2)).sum()
loss = data_loss + reg_loss
Finding the gradient of this isn’t much worse than finding the gradient of the SVM examples. However, it should be obvious that calculating the gradient gets rapidly harder as you add layers to the neural net. An important development for neural nets was therefore the concept of backpropagation: reusing values calculated during the forward pass to compute the gradient via the chain rule.
Since we need to optimize for two weight matrices and two bias vectors, our gradient will branch off in several places.
The backward pass looks like this:
Second layer gradients (the summation over i is reused for the different gradients):
- \(\dfrac{\partial L}{\partial s_{ij}} = \dfrac{1}{N}\sum\limits_i(softmax(s_{ij})-\delta_{iy_i})\) see softmax gradient
- \(\dfrac{\partial L}{\partial W^{(2)}_{lj}} = \dfrac{\partial L}{\partial s_{ij}} \dfrac{\partial s_{ij}}{\partial W^{(2)}_{lj}}=\sum\limits_ih^T_{li}\dfrac{\partial L}{\partial s_{ij}}\) gradient with respect to second weight matrix (l x j)
- \(\dfrac{\partial L}{\partial b^{(2)}_j} = \dfrac{\partial L}{\partial s_{ij}} \dfrac{\partial s_{ij}}{\partial b^{(2)}_{j}}=\sum\limits_i\dfrac{\partial L}{\partial s_{ij}}\) gradient with respect to second bias vector (row vector of length j)
First layer gradients:
- \(\dfrac{\partial L}{\partial h_{il}} = \dfrac{\partial L}{\partial s_{ij}}\dfrac{\partial s_{ij}}{\partial h_{il}} = \sum\limits_j\dfrac{\partial L}{\partial s_{ij}}W^{(2)T}_{jl}\) gradient with respect to hidden features (i x l)
- \(\dfrac{\partial L}{\partial f_{il}} = \dfrac{\partial L}{\partial h_{il}}\dfrac{\partial h_{il}}{\partial f_{il}}=\dfrac{\partial L}{\partial h_{il}}\mathbb{1}(f_{il} > 0)\) gradient with respect to weighted features (i x l)
- \(\dfrac{\partial L}{\partial W^{(1)}_{kl}} = \dfrac{\partial L}{\partial f_{il}}\dfrac{\partial f_{il}}{\partial W^{(1)}_{kl}}=\sum\limits_iX^T_{ki}\dfrac{\partial L}{\partial f_{il}}\) gradient with respect to first weight matrix (k x l)
- \(\dfrac{\partial L}{\partial b^{(1)}_l} = \dfrac{\partial L}{\partial f_{il}}\dfrac{\partial f_{il}}{\partial b^{(1)}_{l}}=\sum\limits_i\dfrac{\partial L}{\partial f_{il}}\) gradient with respect to first bias vector (row vector of length l)
#dictionary to hold the gradients
grads = {}
#gradient of loss with respect to the scores
softmax[i,y] -= 1
dLds = softmax/num_train
#gradient of loss with respect to second weights and bias
grads['W2'] = h.T @ dLds
grads['b2'] = np.sum(dLds, axis=0)
#gradient of loss with respect to hidden features
dLdh = dLds @ W2.T
#gradient of loss with respect to first-layer inputs
dLdi = dLdh*(inputs1 > 0)
#gradient of loss with respect to first weights and bias
grads['W1'] = X.T @ dLdi
grads['b1'] = np.sum(dLdi, axis=0)
#add gradients due to regularization
grads['W1'] += 2*reg*W1
grads['W2'] += 2*reg*W2
grads['b1'] += 2*reg*b1
grads['b2'] += 2*reg*b2
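For completeness, here’s a minimal sketch of how these gradients feed back into step 6 of the workflow, a plain gradient-descent update (learning_rate is an assumed hyperparameter, not defined above):
#vanilla gradient descent step using the gradients computed above
learning_rate = 1e-3 #assumed hyperparameter
W1 -= learning_rate * grads['W1']
b1 -= learning_rate * grads['b1']
W2 -= learning_rate * grads['W2']
b2 -= learning_rate * grads['b2']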
And that’s it! An important point here is that we could speed up our computation because we had already calculated a bunch of the needed values during the forward pass. There is no fundamental reason I know of that it has to be this way. Not all derivatives look like their original function; \(e^x\) is the special case that is its own derivative. Also, the chain rule does not tell you exactly how to break a derivative into smaller parts. Sometimes there’s one way that is obviously easier, but this may not be true for complex derivatives.
This is all to say that back-propagation in and of itself does not calculate gradients efficiently. You have to make good choices! Your goal should be to choose intermediate gradients that reuse as many of the forward-pass values as possible.
Another goal is to multiply matrices by vectors instead of matrices by matrices, but I need to learn more to be clear on how often that goal can be achieved.
14 Jul 2019
The Gradient for Softmax Loss
This is the second in a series on gradients, check out an introduction here and also the hinge loss gradient.
The softmax function appears in a number of disciplines and goes by many names. I first saw it in physics class in the context of thermodynamics. We called the denominator the partition function, and it’s related to many thermodynamic variables like free energy and entropy.
The loss function used for the softmax classifier is stated like this for a single training example:
\[L_i = -log\left (\dfrac{e^{s_{iy_i}}}{\sum\limits_j e^{s_{ij}}}\right )\]
where again \(s_{ij} = \sum\limits_k X_{ik}W_{kj}\). This is not conventional notation, but for this article, I will be calling \(e^{s_{ij}}\) the score factor in analogy with the Boltzmann factor in thermodynamics (which is written \(e^{-\beta E}\)). This loss is relatively straightforward to calculate compared to the hinge loss.
We will calculate the scores, and then the matrix of score factors. We will produce both a selection of the correct score factors and a sum over all the classes of the score factors. Finally, we divide the two vectors element-wise and take the negative logarithm.
A trick often suggested is to subtract the maximum score from every element of the score matrix so that the score factors don’t become too large. After the subtraction every score factor is at most one, so this is more a practical (numerical overflow) consideration than a mathematical one. It doesn’t affect the outcome because
\[\dfrac{e^{s_{iy_i}-s_{max}}}{\sum\limits_j e^{s_{ij}-s_{max}}} = \dfrac{e^{-s_{max}}e^{s_{iy_i}}}{e^{-s_{max}}\sum\limits_j e^{s_{ij}}} = \dfrac{e^{s_{iy_i}}}{\sum\limits_j e^{s_{ij}}}\]
So let’s translate this loss function into numpy:
import numpy as np

#X, y, W, and reg are assumed to be defined already
#Find number of training examples
num_train = X.shape[0]
#Useful for indexing over all rows
i = np.arange(num_train)
#Calculate Scores
scores = X @ W
#subtract off the maximum score, do this for each training example
#keepdims keeps the 1D matrix as a column vector
score_max = np.amax(scores, axis = 1, keepdims = True)
scores -= score_max
#calculate score factors
score_factors = np.exp(scores)
#Create sums to normalize the score factors
score_factor_sums = np.sum(score_factors, axis = 1, keepdims = True)
#Calculate the softmax function
softmax = score_factors/score_factor_sums
#Calculate log Loss
Loss = np.sum(-np.log(softmax[i,y]))
Loss /= num_train
Loss += reg*np.sum(W*W)
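One property worth keeping in mind, and a cheap sanity check: each row of the softmax matrix is a probability distribution over the classes, so it should sum to one.
#each row of softmax should sum to one (within floating point error)
assert np.allclose(np.sum(softmax, axis = 1), 1.0)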
Now we would like to calculate the gradient. As often happens when the exponential function and logarithm get into the mix, we will end up with a derivative that resembles the original function, allowing us to save ourselves some computation. For convenience, I’m going to abbreviate the softmax function as \(m\).
Using the chain rule, we first take the derivative of the logarithm (in this case it is a natural logarithm) with respect to the softmax function.
- Derivative of the Loss with respect to softmax: \(\dfrac{\partial L}{\partial m_{iy_i}} = - \dfrac{1}{N}\sum\limits_i \dfrac{1}{m_{iy_i}}\)
- Derivative of softmax with respect to the scores: \(\dfrac{\partial m_{iy_i}}{\partial s_{ij}} = \dfrac{e^{s_{iy_i}}}{\sum\limits_j e^{s_{ij}}}\delta_{iy_i} - \dfrac{e^{s_{iy_i}}e^{s_{ij}}}{(\sum\limits_j e^{s_{ij}})^2}\delta_{ij}\)
- In terms of the softmax function: \(\dfrac{\partial m_{iy_i}}{\partial s_{ij}} = m_{iy_i}\delta_{iy_i} - m_{iy_i}m_{ij}\delta_{ij}\)
- Derivative of the scores with respect to the weights: \(\dfrac{\partial s_{ij}}{\partial W_{kj}} = X_{ik}\delta_{kj}\)
And we put it all together to get
\[\dfrac{\partial L}{\partial W_{kj}} = -\dfrac{1}{N}\sum\limits_i \dfrac{1}{m_{iy_i}}(m_{iy_i}\delta_{iy_i} - m_{iy_i}m_{ij}\delta_{ij})X_{ik} = \dfrac{1}{N}\sum\limits_i(m_{ij}\delta_{ij} - \delta_{iy_i})X_{ik} = \dfrac{1}{N}\sum\limits_iX_{ki}^T(m_{ij}\delta_{ij} - \delta_{iy_i})\]
The regularization loss and its derivative are the same as in the previous example. Now let’s translate this into numpy:
#Calculate softmax, oh wait, you already did, just subtract one from each correct class
softmax[i,y] -= 1
#Multiply by feature vectors
dLdW = X.T @ softmax
#Normalize and add the regularization term
dLdW /= num_train
dLdW += 2*reg*W
The derivative of the sigmoid function used for neural nets has a similarly nice property which allows for very efficient computation of complex gradients. We’ll get to that later.
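For reference, the nice property is that the sigmoid’s derivative can be written entirely in terms of the sigmoid itself, so a value computed in the forward pass can be reused directly in the backward pass:
\[\sigma(x) = \dfrac{1}{1+e^{-x}}, \qquad \dfrac{\partial \sigma}{\partial x} = \sigma(x)\left(1 - \sigma(x)\right)\]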
13 Jul 2019
This is the first gradient in a series; for general notes on notation, read this post
Hinge Loss for SVM
The hinge loss is often used for support vector machines. To help keep indices straight, I’ll use the following conventions:
i will be used for training examples
j will be used for classes to be categorized
k will be used for the feature dimensions
Each training example accumulates loss according to how much weight the model gives to incorrect classes. The hinge in hinge loss comes from the fact that loss is accumulated only when a wrong class’s score comes within a predetermined margin of the correct class’s score (or exceeds it). The full expression for the loss of one training example is
\(L_i = \sum\limits_{j\neq y_i}max(0,s_{ij} - s_{iy_i} + \Delta)\) where \(s_{ij} = \sum\limits_k X_{ik}W_{kj}\)
For coding purposes, we break down the loss into a series of smaller computations.
- Scores: \(s_{ij} = \sum\limits_{k} X_{ik}W_{kj}\)
- Margins (taking \(\Delta = 1\)): \(m_{ij} = s_{ij} - s_{iy_i} + 1\)
- Hinge: \(h_{ij} = max(0,m_{ij})\)
- Loss: \(L_{hinge} = \frac{1}{N}\sum\limits_i\sum\limits_{j\neq y_i} h_{ij}\)
- Loss with Regularization: \(L = L_{hinge} + \lambda\sum\limits_k\sum\limits_jW_{kj}^2\)
This is what it looks like translated into numpy:
import numpy as np

#X, y, W, and reg are assumed to be defined already
#Find number of training examples
num_train = X.shape[0]
#Useful for indexing over all rows
i = np.arange(num_train)
#Calculate scores (i x j)
s = X @ W
#Calculate Margins (i x j), some cool numpy indexing and broadcasting here
m = s - s[i,y][:,np.newaxis] + 1
#Calculate the hinge function (i x j)
h = np.maximum(0, m)
#Zero out the elements of the hinge function that belong to the correct class (i x j)
h[i,y] = 0
#Sum up the loss for all examples and divide by the number of training examples (scalar)
L = (1/num_train)*np.sum(h)
#Add the regularization loss (scalar)
L += reg*np.sum(W*W)
The Gradient for Hinge Loss
The hinge loss can be an intimidating function at first, but it can be broken down into relatively easy to compute chunks. Similarly, the complicated gradient of the hinge loss can be broken down into easy to compute chunks via the chain rule.
It is important to note that the hinge function \(max(0,x)\) does not have a derivative at zero. If SVM were the subject of pure mathematics we would have to reckon with this fact, but it is not. The algorithm has been shown to work as expected in a wide range of applications, possibly because the margin values are almost never exactly zero.
Away from zero, the hinge function has a derivative of zero if its argument is less than zero, and one if it is greater. Here, I choose the derivative to be zero at zero. We denote this indicator function as \(\mathbb{1}(x > 0)\).
- Derivative of Loss with respect to margins: \(\dfrac{\partial L_{hinge}}{\partial m_{ij}} = \dfrac{1}{N}\sum\limits_i\sum\limits_{j'\neq y_i}\mathbb{1} (m_{ij} > 0)\)
- Derivative of margins with respect to scores: \(\dfrac{\partial m_{ij}}{\partial s_{ij}} = \delta_{ij\neq y_i} - \delta_{ij=y_i}\)
- Derivative of scores with respect to weights: \(\dfrac{\partial s_{ij}}{\partial W_{kj}} = X_{ik}\)
- Derivative of the regularization loss: \(\dfrac{\partial}{\partial W_{kj}}\lambda\sum\limits_k\sum\limits_j W_{kj}^2= 2\lambda W_{kj}\)
And finally, we put it all together to obtain the derivative of the loss with respect to the weights
\[\dfrac{\partial L}{\partial W_{kj}} = \dfrac{1}{N}\sum\limits_i\sum\limits_{j'\neq y_i}\left (\dfrac{\partial s_{ij}}{\partial W_{kj}}\right )^T\dfrac{\partial L_{hinge}}{\partial m_{ij}}\dfrac{\partial m_{ij}}{\partial s_{ij}} + \dfrac{\partial}{\partial W_{kj}}\lambda\sum\limits_k\sum\limits_j W_{kj}^2\]
\[\dfrac{\partial L}{\partial W_{kj}} = \dfrac{1}{N}\sum\limits_i\sum\limits_{j'\neq y_i} X_{ki}^T \mathbb{1}(m_{ij} > 0)(\delta_{ij\neq y_i} - \delta_{ij=y_i}) + 2\lambda W_{kj}\]
To implement this in code it is useful to first calculate the matrix where the positive elements of the margins matrix are set equal to one. The elements that are not the correct class remain one, while the elements of the correct class are set equal to \(-\sum\limits_{j\neq y_i}\mathbb{1}(m_{ij} > 0)\). This matrix is then multiplied by the transpose of the feature matrix.
The reason for the sum term (called class_sums in the code below) is that every margin for example i includes the correct-class score \(s_{iy_i}\), so all of those contributions must be summed.
#Derivative of Loss with respect to margin (i x j)
dLdm = np.zeros(m.shape)
dLdm[m > 0] = 1
#Derivative of margin with respect to scores (i x j)
dmds = np.ones(m.shape)
class_sums = -np.sum(dLdm,axis = 1) + 1
dmds[i,y] = class_sums
#Derivative of scores with respect to weights (i x k)
dsdW = X
#Derivative of Loss with respect to weights
dLdW = dsdW.T @ (dLdm*dmds)/num_train
dLdW += 2*reg*W
12 Jul 2019
Finding analytic gradients can be a daunting task for many. The good news is that most analytic gradients used routinely in machine learning have already been solved and implemented at blazing fast speeds. However, if you want to ‘get under the hood’ of machine learning algorithms, you need to become familiar with how to calculate these objects. In this post and the ones that follow, I will try to walk you through the math behind the gradients.
Notation
A few facts make it easier to translate mathematics into code. When calculating the gradient with respect to a matrix, you’re really calculating the derivative of a function with respect to an arbitrary element of the matrix. We refer to an arbitrary element of matrix \(X\) as \(X_{ij}\), which is the ith row and jth column.
Two common operations are element-wise multiplication and matrix multiplication. Both can be represented as operations on an arbitrary element of the matrix. For element-wise multiplication we write that
\(C_{ij} = A_{ij}B_{ij}\) (element-wise product)
This obviously only works if all matrices have the same dimensions. Matrix multiplication is a type of inner product, meaning that the matrices share a common dimension which is summed over and eliminated. We represent matrix multiplication by arbitrary elements like this
\(C_{ij} = \sum\limits_k A_{ik}B_{kj}\) (matrix product)
Note that \(C_{ij} \neq \sum\limits_k A_{ik}B_{jk}\). The columns of the first matrix must match the rows of the second. However, we can remedy a situation like this by using the transpose of a matrix, since \(X_{ij} = X^T_{ji}\).
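For instance, if a sum shows up in that ‘wrong’ form, a transpose puts it back into a standard matrix product:
\[\sum\limits_k A_{ik}B_{jk} = \sum\limits_k A_{ik}B^T_{kj} = (AB^T)_{ij}\]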
Finally, there are times when we must denote a single element of a matrix. A matrix that is one at row i and column j and zero elsewhere is written
\(\delta_{ij}\) (single-entry matrix)
The Chain Rule
One of the most useful theorems in calculus is the chain rule. It allows you to calculate complicated derivatives by breaking them into pieces. Especially when computers are involved, breaking things into smaller pieces is almost always the right move. In addition, the chain rule features prominently in the back-propagation used to train neural networks.
Essentially, what the chain rule says is that we can treat partial differentials like fractions, so that
\[\dfrac{\partial f}{\partial x} = \dfrac{\partial f}{\partial y}\dfrac{\partial y}{\partial x}\]
So, for example, if we wanted to take the derivative of \(f(x) = ln(x^2)\), we could take the derivative of \(f(y) = ln(y)\) and the derivative of \(y = x^2\), so that
\[\dfrac{\partial f}{\partial x} = \dfrac{\partial ln(x^2)}{\partial x^2}\dfrac{\partial x^2}{\partial x} = \dfrac{1}{x^2}2x = \dfrac{2}{x}\]
Numpy
Numpy is a huge package for dealing with arrays and vectors. For the purpose of finding gradients, I’m going to reduce numpy to four operations: indexing, summing, multiplying, and broadcasting.
Numpy indexing is fairly straightforward: you specify an element of a numpy array by its row and column. For example, the element \(X_{ij}\) is represented by
X[i,j]
An entire row or column is specified by
#column
X[:,j]
#row
X[i,:]
An important note about indexing is that numpy vectors are neither row nor column vectors. This has important implications for broadcasting.
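For example, a quick sketch of what that looks like:
import numpy as np
y = np.array([1, 2, 3])
print(y.shape) #(3,) - a 1-D array, neither a row of shape (1, 3) nor a column of shape (3, 1)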
Summations are also relatively straightforward. When summing arrays, it is important to specify the axis you are summing over. If you are summing over the rows (leaving you with a single row) use
np.sum(X, axis = 0)
In order to sum over the columns (leaving you with a single column) use
np.sum(X, axis = 1)
And in order to sum over both, simply use
np.sum(X)
Multiplication comes in two flavors. Element-wise multiplication is represented by the * operator. Matrix multiplication is achieved by the @ operator.
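A small example of the difference (arrays made up for illustration):
import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(A * B) #element-wise product: [[5, 12], [21, 32]]
print(A @ B) #matrix product: [[19, 22], [43, 50]]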
Finally, broadcasting is the powerful numpy tool that allows you to add or multiply ‘incompatible’ arrays together. Normally, an n x m matrix A could not be added to an n x 1 vector B; however, numpy assumes that the column vector is actually m copies of B. Thus
A + B
results in B being added to each column of A, and
A*B
results in B multiplying each column of A element-wise.
The same can be done with row vectors. Scalars will be added or multiplied to each element of an array.
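For instance, a quick sketch of a row vector and a scalar being broadcast across the same matrix:
import numpy as np
A = np.ones((2, 3))         #a 2 x 3 matrix of ones
row = np.array([[1, 2, 3]]) #a 1 x 3 row vector
print(A + row) #the row vector is added to each row of A
print(A * 10)  #the scalar multiplies every element of A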
Now earlier, I pointed out that numpy vectors are neither row nor column vectors, so sometimes it is necessary to force them to be row or column vectors by adding an axis of length 1. To make the numpy vector y into a row vector, input
y[np.newaxis,:]
and to create a column vector, input
y[:,np.newaxis]
Now that that’s out of the way, let’s take a look at SVM hinge loss