Chris Malec Data Scientist

The Stardew Valley Grange Display - Basic Web Scraping

Lately, I’ve been pretty obsessed with Stardew Valley. It’s a lovely little farming game, with a lot more to do besides growing crops. For the next couple of posts, I’d like to focus on an event that happens every fall: the Stardew Valley Fair. Specifically, you have the option of creating a Grange Display, and winning gets you 1000 star points. The star points aren’t the most useful thing in the world, but like many things in the game, that’s hardly the point. I’d just like to figure out how to get a perfect score with the least amount of work.

Step one, which I’ll describe in this post, is some super basic web scraping of the Stardew Valley wiki. My goal is to grab the tables on the page about the fair to find the point values for the different items you can put in the grange display. I’ll use the requests package to get the web page and parse the HTML using Beautiful Soup. My ultimate goal is to create a pandas DataFrame.

#import libraries
import requests
from bs4 import BeautifulSoup
from IPython.display import display, HTML
import pandas as pd
#get html from stardew wiki
r = requests.get('https://stardewvalleywiki.com/Stardew_Valley_Fair')
html_text = r.text
soup = BeautifulSoup(html_text, 'html.parser')

Beautiful Soup turns the HTML page into an object that can be accessed in a much more pythonic way. For example, here I can find all of the ‘wikitable’ elements on the page. Dealing with the actual text inside each row and column requires a slightly different procedure depending on whether the cell contains a single string, multiple strings, or an image. Cells that have both an image and text are read just as text, since the image is usually a ‘gold’ or ‘star point’ icon. It would be possible to reproduce these cells, but it would be more work and doesn’t really help me.

#find all the tags that are part of the wikitable css class
tables = soup.find_all(class_="wikitable")
#a little helper function for when there are several strings within the same tag
def join_strings(tag):
    return ''.join([string for string in tag.stripped_strings])
#function to parse the table into a pandas dataframe
#optionally searches for the heading to add a 'category' column to the dataframe
def parse_table(table, get_heading = True):
    if get_heading:
        table_title = [join_strings(table.find_previous(class_='mw-collapsible').th)]
    else:
        table_title = []
    row = table.tr #the first row has several 'th' tags, this is pretty general for tables on this site
    headers = [elem.string.strip() for elem in row.find_all("th")]
    #print(headers)
    records = []
    row = row.next_sibling
    while row:
        if row.string:
            row = row.next_sibling #prevents reading in rows that are just a newline without columns
            continue
        else:
            column = row.td #find first column
        record = []
        while column:
            if column.string:
                cell_text = column.string.strip() #get string within the td tag
            else:
                cell_text = join_strings(column) #get string when there are multiple strings in a tag
            try:
                image = column.img
                #Don't want the star token and gold images
                if (image.attrs['alt'] != 'Token.png') and (image.attrs['alt'] != 'Gold.png'):
                    if image.attrs['src'][:5] != 'https':
                        image.attrs['src'] = 'https://stardewvalleywiki.com'+image.attrs['src'] #get full location
                    image = str(image) #Needs to be a string to be resolved in DataFrame
                else:
                    image = None
            except AttributeError: #image is None when the cell has no img tag
                image = None
            
            if cell_text != '':
                record.append(cell_text) #adds column to row record if it isn't empty
            elif image:
                record.append(image) #adds column to row record if it's an image (and just an image)
            column = column.next_sibling #goes to the next column
        records.append(tuple(record+table_title)) #adds row to the list of records
        #print(records)
        row = row.next_sibling #goes to the next row
    if get_heading:
        columns = headers+['category'] #adds category column if we are finding the table heading
    else:
        columns = headers #just uses the column names found in th tags
    table_df = pd.DataFrame.from_records(data=records,columns = columns).dropna() #creates a pandas dataframe from a list of records
    return table_df
def display_table(table_df):
    return display(HTML(table_df.to_html(escape=False,index=False))) #helper function to make it display like I want
dfs = []
for table in tables[3:11]:
    dfs.append(parse_table(table))#Get all the tables related to the grange display
grange_df = pd.concat(dfs,axis=0,ignore_index=True)#Concatenate the tables into one flat table
grange_df = grange_df.melt(id_vars = ["Item","category","Price"],
                             value_vars = ["Base","Silver","Gold","Iridium"],
                             var_name = "Quality",
                             value_name = "Points")#Melt the table so that the points are all in one column
display_table(grange_df)
Item | category | Price | Quality | Points
Honey (Wild)* | Artisan Goods | 100g | Base | 3
Jelly (Ancient Fruit) | Artisan Goods | 1,150g | Base | 6
Jelly (Apple) | Artisan Goods | 250g | Base | 4
Jelly (Apricot) | Artisan Goods | 150g | Base | 3
Jelly (Banana) | Artisan Goods | 350g | Base | 5
Jelly (Blackberry) | Artisan Goods | 90g | Base | 3
Jelly (Blueberry) | Artisan Goods | 150g | Base | 3
Jelly (Cactus Fruit) | Artisan Goods | 200g | Base | 4
Jelly (Cherry) | Artisan Goods | 210g | Base | 4
Jelly (Coconut) | Artisan Goods | 250g | Base | 4
Jelly (Cranberries) | Artisan Goods | 200g | Base | 4
Jelly (Crystal Fruit) | Artisan Goods | 350g | Base | 5

And here’s an example of one of the tables that has images in it. For my goal of finding an easy perfect score grange display, I won’t really need it, but it’s a nice thing to know how to do.

display_table(parse_table(tables[12]))
Image | Name | Description | Price | category
Dried Sunflowers.png | Dried Sunflowers | Can be placed inside your house. | 100 | Vegetables
Fedora.png | Fedora | A city-slicker's standard. | 500 | Vegetables
Rarecrow 1.png | Rarecrow | Collect them all! (1 of 8) | 800 | Vegetables
Stardrop.png | Stardrop | A mysterious fruit that empowers those who eat it. The flavor is like a dream... a powerful personal experience, yet difficult to describe to others. | 2,000 | Vegetables
Light Green Rug.png | Light Green Rug | Can be placed inside your house. | 500 | Vegetables

And that’s it! In the next post, I’ll analyze my pandas dataframes.
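If you want to pick the analysis up in a separate notebook, one option is to save the flattened table to disk first; the filename here is just a placeholder.

#save the melted table so it can be reloaded later without re-scraping
grange_df.to_csv('grange_display.csv', index=False)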

Gradients and You - Backpropagation and Neural Nets

This is a post in a series about gradients.

Let’s start with a two layer network. The workflow is

1) A weight matrix multiplies the feature vectors (\(input_1 = XW_1 + b_1\))

2) The weighted features go through an activation function to create hidden features (\(hidden = \alpha_1(input_1)\))

3) A second weight matrix multiplies the hidden features (\(input_2 = hW_2 + b_2\))

4) The second layer goes through an activation function to create scores (\(scores = \alpha_2(input_2)\))

5) A loss is computed from the scores.

6) The gradient of the loss is calculated and used to optimize the weight matrices.

As an example, we use an activation function of \(max(0,x)\) for the hidden layer and the softmax function for the output layer. The loss is calculated as the log loss. My convention for indices is

  • \(i\): examples

  • \(j\): classes

  • \(k\): feature dimensions

  • \(l\): hidden classes

The forward pass then looks like so:

  • \(f^{(1)}_{il} = \sum\limits_k X_{ik} W^{(1)}_{kl} + b^{(1)}_{l}\) 1st layer weighted features

  • \(h_{il} = max(0, f^{(1)}_{il})\) hidden classes from the activation function

  • \(s_{ij} = \sum\limits_l h_{il}W^{(2)}_{lj} + b^{(2)}_j\) 2nd layer scores

  • \(L_i = -log(softmax(s_{iy_i}))\) log loss

  • \(L = \dfrac{1}{N}\sum\limits_i L_i + \lambda\left(\sum\limits_{k,l}(W^{(1)}_{kl})^2 + \sum\limits_{l,j}(W^{(2)}_{lj})^2 + \sum\limits_l (b^{(1)}_{l})^2+ \sum\limits_j (b^{(2)}_{j})^2\right )\) loss with regularization

In numpy, it looks like this:

num_train = X.shape[0]
i = np.arange(num_train)

#inputs to hidden layer
inputs1 = X @ W1 + b1[np.newaxis,:]
#hidden layer after activation function
h = np.maximum(0, inputs1)
#inputs to output layer
inputs2 = h @ W2 + b2[np.newaxis,:]
#output layer activation function (softmax)        
scores = inputs2 - np.amax(inputs2, axis = 1, keepdims = True)
score_factors = np.exp(scores)
sum_score_factors = np.sum(score_factors, axis = 1, keepdims = True)
softmax = score_factors/sum_score_factors #softmax function for each observation and class
        
#Calculate loss (log loss with regularization)
data_loss = -np.sum(np.log(softmax[i,y]))
data_loss /= num_train
reg_loss = reg*(np.square(W1)).sum() + reg*(np.square(W2)).sum()
reg_loss += reg*(np.square(b1)).sum() + reg*(np.square(b2)).sum()
loss = data_loss + reg_loss

Finding the gradient of this isn’t much worse than finding the gradient in the SVM examples. However, it should be obvious that calculating the gradient gets rapidly harder as you add layers to the neural net. An important development for neural nets was therefore the concept of backpropagation: reusing the values calculated during the forward pass to compute the gradient with the chain rule.

Since we need to optimize for two weight matrices and two bias vectors, our gradient will branch off in several places.

The backward pass looks like this:

Second layer gradients (the summation over i is reused for the different gradients):

  • \(\dfrac{\partial L}{\partial s_{ij}} = \dfrac{1}{N}\sum\limits_i(softmax(s_{ij})-\delta_{iy_i})\) see softmax gradient

  • \(\dfrac{\partial L}{\partial W^{(2)}_{lj}} = \dfrac{\partial L}{\partial s_{ij}} \dfrac{\partial s_{ij}}{\partial W^{(2)}_{lj}}=\sum\limits_ih^T_{li}\dfrac{\partial L}{\partial s_{ij}}\) gradient with respect to second weight matrix (l x j)

  • \(\dfrac{\partial L}{\partial b^{(2)}_j} = \dfrac{\partial L}{\partial s_{ij}} \dfrac{\partial s_{ij}}{\partial b^{(2)}_{j}}=\sum\limits_i\dfrac{\partial L}{\partial s_{ij}}\) gradient with respect to second bias vector (row vector of length j)

First layer gradients:

  • \(\dfrac{\partial L}{\partial h_{il}} = \dfrac{\partial L}{\partial s_{ij}}\dfrac{\partial s_{ij}}{\partial h_{il}} = \sum\limits_j\dfrac{\partial L}{\partial s_{ij}}W^{(2)T}_{jl}\) gradient with respect to hidden classes (i x l)

  • \(\dfrac{\partial L}{\partial f_{il}} = \dfrac{\partial L}{\partial h_{il}}\dfrac{\partial h_{il}}{\partial f_{il}}=\dfrac{\partial L}{\partial h_{il}}\mathbb{1}(f_{il} > 0)\) gradient with respect to weighted features (i x l)

  • \(\dfrac{\partial L}{\partial W^{(1)}_{kl}} = \dfrac{\partial L}{\partial f_{il}}\dfrac{\partial f_{il}}{\partial W^{(1)}_{kl}}=\sum\limits_iX^T_{ki}\dfrac{\partial L}{\partial f_{il}}\) gradient with respect to first weight matrix (k x l)

  • \(\dfrac{\partial L}{\partial b^{(1)}_l} = \dfrac{\partial L}{\partial f_{il}}\dfrac{\partial f_{il}}{\partial b^{(1)}_{l}}=\sum\limits_i\dfrac{\partial L}{\partial f_{il}}\) gradient with respect to first bias vector (row vector length l)

grads = {} #dictionary to hold the gradients
#gradient of loss with respect to the scores
softmax[i,y] -= 1
dLds = softmax/num_train
#gradient of loss with respect to second weights and bias
grads['W2'] = h.T @ dLds
grads['b2'] = np.sum(dLds,axis=0)
#gradient of loss with respect to hidden features        
dLdh = dLds @ W2.T
#gradient of loss with respect to inputs
dLdi = dLdh*(inputs1 > 0)
#gradient of loss with respect to first weights and bias
grads['W1'] = X.T @ dLdi
grads['b1'] = np.sum(dLdi,axis=0)
#add gradients due to regularization
grads['W1'] += 2*reg*W1
grads['W2'] += 2*reg*W2
grads['b1'] += 2*reg*b1
grads['b2'] += 2*reg*b2
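Step 6 of the workflow, actually using these gradients to update the weights, would then be a vanilla gradient descent step, sketched here with an assumed learning_rate hyperparameter:

#one vanilla gradient descent step using the gradients computed above
learning_rate = 1e-3 #hyperparameter, the value here is just illustrative
W1 -= learning_rate * grads['W1']
b1 -= learning_rate * grads['b1']
W2 -= learning_rate * grads['W2']
b2 -= learning_rate * grads['b2']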

And that’s it! An important point here is that we could speed up our computation because we had already calculated many of the values we needed while computing the loss. There is no fundamental reason I know of that it has to be this way. Not all derivatives look like their original function; \(e^x\) is essentially the only function that is its own derivative. Also, the chain rule does not tell you exactly how to break a derivative up into smaller parts. Sometimes one way is obviously easier, but that may not be true for more complex derivatives.

This is all to say that back-propagation in and of itself does not calculate gradients efficiently. You have to make good choices! Your goal should be to reuse as many of the forward-pass values as possible when calculating the backward pass.

Another goal is to multiply matrices by vectors instead of matrices by matrices, but I need to learn more before I can say how often that goal can be achieved.
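To make that goal concrete, here is a small sketch: for the same product \(ABv\), grouping the vector multiplication first never forms a large intermediate matrix, so it takes roughly \(2n^2\) multiplications instead of \(n^3\).

import numpy as np

n = 500
A = np.random.randn(n, n)
B = np.random.randn(n, n)
v = np.random.randn(n)

slow = (A @ B) @ v #builds an n x n intermediate: about n**3 multiplications
fast = A @ (B @ v) #only matrix-vector products: about 2*n**2 multiplications
assert np.allclose(slow, fast) #same result either way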

Gradients and You - Softmax

The Gradient for Softmax Loss

This is the second post in a series on gradients; check out the introduction as well as the hinge loss gradient. The softmax function appears in a number of disciplines and goes by many names. I first saw it in a physics class in the context of thermodynamics, where we called the denominator the partition function; it’s related to many thermodynamic quantities like free energy and entropy.

The loss function used for the softmax classifier is stated like this for a single training example:

\[L_i = -log\left (\dfrac{e^{s_{iy_i}}}{\sum\limits_j e^{s_{ij}}}\right )\]

where again \(s_{ij} = \sum\limits_k X_{ik}W_{kj}\). This is not conventional notation, but for this article I will be calling \(e^{s_{ij}}\) the score factor, in analogy with the Boltzmann factor from thermodynamics (which is written \(e^{-\beta E}\)). This loss is relatively straightforward to calculate compared to the hinge loss.

We will calculate the scores, and then the matrix of score factors. We will produce both a selection of the correct score factors and a sum over all the classes of the score factors. Finally, we divide the two vectors element-wise and take the negative logarithm.

A trick often suggested is to subtract the maximum score from all elements in the score matrix so that the score factors don’t become too large. Since the final softmax values are normalized to be at most one, this is more a practical consideration (avoiding overflow in the exponentials) than a mathematical one. It doesn’t affect the outcome because

\[\dfrac{e^{s_{iy_i}-s_{max}}}{\sum\limits_j e^{s_{ij}-s_{max}}} = \dfrac{e^{-s_{max}}e^{s_{iy_i}}}{e^{-s_{max}}\sum\limits_j e^{s_{ij}}} = \dfrac{e^{s_{iy_i}}}{\sum\limits_j e^{s_{ij}}}\]
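As a quick numerical illustration, without the subtraction np.exp overflows to inf once a score gets much above 700, while the shifted version stays finite:

import numpy as np

big_scores = np.array([1000., 2000., 3000.])
np.exp(big_scores)                      #overflows: every entry becomes inf (with a RuntimeWarning)
np.exp(big_scores - np.max(big_scores)) #finite: array([0., 0., 1.])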

So let’s translate this loss function into numpy:

#Find number of training examples
num_train = X.shape[0]

#Useful for indexing over all rows
i = np.arange(num_train)

#Calculate Scores

scores = X @ W

#subtract off the maximum score, do this for each training example
#keepdims keeps the 1D matrix as a column vector

score_max = np.amax(scores, axis = 1, keepdims = True) 
scores -= score_max

#calculate score factors

score_factors = np.exp(scores)

#Create sums to normalize the score factors

score_factor_sums = np.sum(score_factors,axis = 1, keepdims = True)

#Calculate the softmax function
softmax = score_factors/score_factor_sums

#Calculate log Loss

Loss = np.sum(-np.log(softmax[i,y]))
Loss /= num_train
Loss += reg*np.sum(W*W)

Now we would like to calculate the gradient. As often happens when the exponential function and logarithm get into the mix, we will end up with a derivative that resembles the original function, allowing us to save ourselves some computation. For convenience, I’m going to abbreviate the softmax function as \(m\).

Using the chain rule, we first take the derivative of the logarithm (in this case it is a natural logarithm) with respect to the softmax function.

  • Derivative of the Loss with respect to softmax: \(\dfrac{\partial L}{\partial m_{iy_i}} = - \dfrac{1}{N}\sum\limits_i \dfrac{1}{m_{iy_i}}\)

  • Derivative of softmax with respect to the scores: \(\dfrac{\partial m_{iy_i}}{\partial s_{ij}} = \dfrac{e^{s_{iy_i}}}{\sum\limits_j e^{s_{ij}}}\delta_{iy_i} - \dfrac{e^{s_{iy_i}}e^{s_{ij}}}{(\sum\limits_j e^{s_{ij}})^2}\delta_{ij}\)

  • In terms of the softmax function: \(\dfrac{\partial m_{iy_i}}{\partial s_{ij}} = m_{iy_i}\delta_{iy_i} - m_{iy_i}m_{ij}\delta_{ij}\)

  • Derivative of the scores with respect to the weights: \(\dfrac{\partial s_{ij}}{\partial W_{kj}} = X_{ik}\delta_{kj}\)

And we put it all together to get

\[\dfrac{\partial L}{\partial W_{kj}} = -\dfrac{1}{N}\sum\limits_i \dfrac{1}{m_{iy_i}}(m_{iy_i}\delta_{iy_i} - m_{iy_i}m_{ij}\delta_{ij})X_{ik} = \dfrac{1}{N}\sum\limits_i(m_{ij}\delta_{ij} - \delta_{iy_i})X_{ik} = \dfrac{1}{N}\sum\limits_iX_{ki}^T(m_{ij}\delta_{ij} - \delta_{iy_i})\]

The regularization loss and its derivative are the same as in the previous example. Now let’s translate this into numpy:

#Calculate softmax, oh wait, you already did, just subtract one from each correct class
softmax[i,y] -= 1

#Multiply by feature vectors
dLdW = X.T @ softmax

#Normalize and add the regularization term
dLdW /= num_train
dLdW += 2*reg*W

The derivative of the sigmoid function used for neural nets has a similarly nice property which allows for very efficient computation of complex gradients. We’ll get to that later.
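As a preview, the identity in question is

\[\sigma(x) = \dfrac{1}{1+e^{-x}}, \qquad \dfrac{\partial \sigma}{\partial x} = \dfrac{e^{-x}}{(1+e^{-x})^2} = \sigma(x)\left(1-\sigma(x)\right)\]

so the backward pass can reuse the \(\sigma(x)\) values already computed during the forward pass.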

Gradients and You - SVM Hinge Loss

This is the first gradient in the series; for general notes on notation, see the introduction post.

Hinge Loss for SVM

The hinge loss is often used for support vector machines. To help keep indices straight, I’ll use the following conventions:

i will be used for training examples

j will be used for classes to be categorized

k will be used as the feature dimensions

Each training example accumulates loss according to how much weight is given to mis-categorized classes. The ‘hinge’ in hinge loss comes from the fact that loss is accumulated only when the correct score fails to exceed a wrong score by a predetermined margin. The full expression for the loss of one training example is

\(L_i = \sum\limits_{j\neq y_i}max(0,s_{ij} - s_{iy_i} + \Delta)\) where \(s_{ij} = \sum\limits_k X_{ik}W_{kj}\)

For coding purposes, we break down the loss into a series of smaller computations.

  • Scores: \(s_{ij} = \sum\limits_{k} X_{ik}W_{kj}\)

  • Margins: \(m_{ij} = s_{ij} - s_{iy_i} + 1\) (taking \(\Delta = 1\))

  • Hinge: \(h_{ij} = max(0,m_{ij})\)

  • Loss: \(L_{hinge} = \frac{1}{N}\sum\limits_i\sum\limits_{j\neq y_i} h_{ij}\)

  • Loss with Regularization: \(L = L_{hinge} + \lambda\sum\limits_k\sum\limits_jW_{kj}^2\)

This is what it looks like translated into numpy

#Find number of training examples
num_train = X.shape[0]

#Useful for indexing over all rows
i = np.arange(num_train)

#Calculate scores (i x j)
s = X @ W

#Calculate Margins (i x j), some cool numpy indexing and Broadcasting here
m = s - s[i,y][:,np.newaxis] + 1

#Calculate the hinge function (i x j)
h = np.maximum(0, m)

#Select only elements of the hinge function that are not in the correct class (i x j)
h[i,y] = 0

#Sum up the loss for all examples and divide by the number of training examples (scalar)
L = (1/num_train)*np.sum(h)

#Add the regularization loss (scalar)
L += reg*np.sum(W*W)

The Gradient for Hinge Loss

The hinge loss can be an intimidating function at first, but it can be broken down into relatively easy to compute chunks. Similarly, the complicated gradient of the hinge loss can be broken down into easy to compute chunks via the chain rule.

It is important to note that the hinge function \(max(0,x)\) does not have a derivative at zero. If SVM were the subject of pure mathematics we would have to reckon with this fact, but it is not. The algorithm has been shown to work as expected in a wide range of applications, possibly because the margin values are almost never exactly zero.

Away from zero, the hinge function has a derivative of zero if its argument is less than zero, and one if it is greater. Here, I choose the derivative to be zero at zero. We denote this function as \(\mathbb{1}(x)\).

  • Derivative of Loss with respect to margins: \(\dfrac{\partial L_{hinge}}{\partial m_{ij}} = \dfrac{1}{N}\sum\limits_i\sum\limits_{j'\neq y_i}\mathbb{1} (m_{ij} > 0)\)

  • Derivative of margins with respect to scores: \(\dfrac{\partial m_{ij}}{\partial s_{ij}} = \delta_{ij\neq y_i} - \delta_{ij=y_i}\)

  • Derivative of scores with respect to weights: \(\dfrac{\partial s_{ij}}{\partial W_{kj}} = X_{ik}\)

  • Derivative of the regularization loss: \(\dfrac{\partial}{\partial W_{kj}}\lambda\sum\limits_k\sum\limits_j W_{kj}^2= 2\lambda W_{kj}\)

And finally, we put it all together to obtain the derivative of the loss with respect to the weights

\[\dfrac{\partial L}{\partial W_{kj}} = \dfrac{1}{N}\sum\limits_i\sum\limits_{j'\neq y_i}\left (\dfrac{\partial s_{ij}}{\partial W_{kj}}\right )^T\dfrac{\partial L_{hinge}}{\partial m_{ij}}\dfrac{\partial m_{ij}}{\partial s_{ij}} + \dfrac{\partial}{\partial W_{kj}}\lambda\sum\limits_k\sum\limits_j W_{kj}^2\] \[\dfrac{\partial L}{\partial W_{kj}} = \dfrac{1}{N}\sum\limits_i\sum\limits_{j'\neq y_i} X_{ki}^T \mathbb{1}(m_{ij} > 0)(\delta_{ij\neq y_i} - \delta_{ij=y_i}) + 2\lambda W_{kj}\]

To implement this in code it is useful to first calculate the matrix where the positive elements of the margins matrix are set equal to one. The elements that are not the correct class remain one, while the elements of the correct class are set equal to \(-\sum\limits_{j\neq y_i}\mathbb{1}(m_{ij} > 0)\). This matrix is then multiplied by the transpose of the feature matrix.

The reason for the sum term (called class_sums in the code below) is that every margin function includes the scores \(s_{iy_i}\), and so they all must be summed.

#Derivative of Loss with respect to margin (i x j)
dLdm = np.zeros(m.shape)
dLdm[m > 0] = 1

#Derivative of margin with respect to scores (i x j)
dmds = np.ones(m.shape)
class_sums = -np.sum(dLdm,axis = 1) + 1
dmds[i,y] = class_sums

#Derivative of scores with respect to weights (i x k)
dsdW = X

#Derivative of Loss with respect to weights
dLdW = dsdW.T @ (dLdm*dmds)/num_train
dLdW += 2*reg*W

Gradients and You - Introduction and Notation

Finding analytic gradients can be a daunting task for many. The good news is that most analytic gradients used routinely in machine learning have already been solved and implemented at blazing fast speeds. However, if you want to ‘get under the hood’ of machine learning algorithms, you need to become familiar with how to calculate these objects. Between this and subsequent posts, I will try to walk you through the math behind the gradients.

Notation

A few facts make it easier to translate mathematics into code. When calculating the gradient with respect to a matrix, you’re really calculating the derivative of a function with respect to an arbitrary element of the matrix. We refer to an arbitrary element of matrix \(X\) as \(X_{ij}\), which is the ith row and jth column.

Two common operations are element-wise multiplication and matrix multiplication. Both can be represented as operations on an arbitrary element of the matrix. For element-wise multiplication we write that

\(C_{ij} = A_{ij}B_{ij}\) (element-wise product)

This obviously only works if all matrices have the same dimensions. Matrix multiplication is a type of inner product, meaning that the matrices share a common dimension which is summed over and eliminated. We represent matrix multiplication by arbitrary elements like this

\(C_{ij} = \sum\limits_k A_{ik}B_{kj}\) (matrix product)

Note that \(C_{ij} \neq \sum\limits_k A_{ik}B_{jk}\). The columns of the first matrix must match the rows of the second. However, we can remedy a situation like this by using the transpose of a matrix, since \(X_{ij} = X^T_{ji}\).

Finally, there are times when we must denote a single element of a matrix. A matrix that is one at row i and column j and zero elsewhere is written

\(\delta_{ij}\) (indicator matrix, in the spirit of the Kronecker delta)
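In the gradient posts, a delta with a label subscript such as \(\delta_{iy_i}\) is one in each row \(i\) at the column of the correct class \(y_i\), which is exactly a one-hot encoding of the labels. A minimal numpy sketch (the label values and number of classes are just illustrative):

#one-hot matrix: delta[i, y_i] = 1 and zero elsewhere
import numpy as np
y = np.array([2, 0, 1]) #example class labels, one per training example
num_classes = 3
delta = np.zeros((len(y), num_classes))
delta[np.arange(len(y)), y] = 1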

The Chain Rule

One of the most useful theorems in calculus is the chain rule. It allows you to calculate complicated derivatives by breaking them into pieces. Especially when computers are involved, breaking things into smaller pieces is almost always the right move. In addition, the chain rule features prominently in the back-propagation used to train neural networks.

Essentially, what the chain rule says is that we can treat partial differentials like fractions, so that

\[\dfrac{\partial f}{\partial x} = \dfrac{\partial f}{\partial y}\dfrac{\partial y}{\partial x}\]

So, for example, if we wanted to take the derivative of \(f(x) = ln(x^2)\), we could take the derivative of \(f(y) = ln(y)\) and the derivative of \(y = x^2\), so that

\[\dfrac{\partial f}{\partial x} = \dfrac{\partial ln(x^2)}{\partial x^2}\dfrac{\partial x^2}{\partial x} = \dfrac{1}{x^2}2x = \dfrac{2}{x}\]

Numpy

Numpy is a huge package for dealing with arrays and vectors. For the purpose of finding gradients, I’m going to reduce numpy to four operations: indexing, summing, multiplying, and broadcasting.

Numpy indexing is fairly straightforward: you specify an element of a numpy array by its row and column. For example, the element \(X_{ij}\) is represented by

X[i,j]

An entire row or column is specified by

#column
X[:,j]
#row
X[i,:]

An important note about indexing is that numpy vectors (1D arrays) are neither row nor column vectors. This has important implications for broadcasting.
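A quick way to see this: slicing out a single column gives a flat array of shape (n,), while slicing with a range keeps the extra dimension.

import numpy as np
X = np.arange(12).reshape(3, 4)
X[:, 1].shape   #(3,) a flat 1D array, neither a row nor a column
X[:, 1:2].shape #(3, 1) an actual column vector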

Summations are also relatively straightforward. When summing arrays, it is important to specify the axis you are summing over. If you are summing over the rows (leaving you with a single row) use

np.sum(X,axis = 0)

In order to sum over the columns (leaving you with a single column) use

np.sum(X,axis = 1)

And in order to sum over both, simply use

np.sum(X)

Multiplication comes in two flavors. Element-wise multiplication is represented by the * operator. Matrix multiplication is achieved by the @ operator.
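For example:

import numpy as np
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[10, 20],
              [30, 40]])
A * B #element-wise: [[10, 40], [90, 160]]
A @ B #matrix product: [[70, 100], [150, 220]]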

Finally, broadcasting is the powerful numpy feature that lets you add or multiply arrays whose shapes don’t match. Normally, an n x m matrix A could not be added to an n x 1 vector B; however, numpy treats the column vector as m copies of B placed side by side, so that

A + B

results in B being added to each column of A and

A * B

results in B multiplying each column of A element-wise.

The same can be done with row vectors. Scalars will be added or multiplied to each element of an array.

Earlier, I pointed out that numpy vectors are neither row nor column vectors, so sometimes it is necessary to force them to be row or column vectors by adding an axis of length one. To make the numpy vector y into a row vector, input

y[np.newaxis,:]

and to create a column vector input

y[:, np.newaxis]
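Putting broadcasting and np.newaxis together, here is the kind of operation that shows up later in this series, subtracting a per-row maximum from a matrix of scores (the numbers are just for illustration):

import numpy as np
scores = np.array([[1., 5., 2.],
                   [7., 0., 3.]])
row_max = np.amax(scores, axis = 1) #shape (2,)
scores - row_max[:, np.newaxis]     #the (2, 1) column broadcasts across each row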

Now that that’s out of the way, let’s take a look at the SVM hinge loss.