Numpy Neural Nets Training

19 Nov 2025

Another Numpy Neural Net: Training Loop

Finally, we need to actually train this model. You usually have two choices: either write a training function, or give your model a .train() method. Either is fine, but for the purposes of this exercise it’s a little easier to deal with a training function, because you don’t have to keep track of which things your model has inside of it and which it doesn’t.

First we have some global configuration variables and the logging setup:

# Model configuration
INPUT_SIZE = 28 * 28
HIDDEN_SIZE1 = 64
HIDDEN_SIZE2 = 32
OUTPUT_SIZE = 10

# Create logger
logger = logging.getLogger(__name__)

Next we define the training function. It takes in a Model and a Dataset, as well as values for epochs, batch_size, and learning_rate. Each epoch reshuffles the dataset and runs through all the examples again. Within each epoch, each batch goes through:

Forward pass
Loss computation
Backward pass (to compute gradients and update weights/biases)
Metric computation

Each epoch I log the current log_loss and accuracy averaged over the batches.
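The accuracy helper that the loop calls isn’t shown in this post. Here is a minimal sketch of what it could look like, assuming the labels are one-hot encoded and the predictions are per-class probabilities of the same shape (the body is my assumption, not necessarily what’s in the repo):

```python
import numpy as np


def accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Fraction of rows where the predicted class matches the true class.

    Assumes y_true is one-hot with shape (batch_size, num_classes) and
    y_pred holds per-class scores or probabilities of the same shape.
    """
    true_classes = np.argmax(y_true, axis=1)
    pred_classes = np.argmax(y_pred, axis=1)
    return float(np.mean(true_classes == pred_classes))


# Tiny usage example with 3 samples and 3 classes
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
y_pred = np.array([[0.8, 0.1, 0.1], [0.2, 0.5, 0.3], [0.6, 0.3, 0.1]])
example_acc = accuracy(y_true, y_pred)  # the third row is predicted as class 0
```

With a helper like that available, here is the training loop: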

def training_loop(
    model: Model, dataset: Dataset, epochs: int, batch_size: int, learning_rate: float
):
    for epoch in range(epochs):
        i = 0
        batch_accuracies = []
        batch_losses = []
        for x_train, y_train in dataset.get_batches(batch_size):
            # Forward pass
            y_pred = model.forward(x_train)

            # Compute loss
            loss = model.compute_loss(y_train, y_pred)
            batch_losses.append(loss)

            # Backward pass (the optimizer step happens inside Model.backward,
            # so the learning rate comes from the optimizer, not from this call)
            model.backward(y_train, y_pred)

            # Compute accuracy
            batch_acc = accuracy(y_train, y_pred)
            batch_accuracies.append(batch_acc)
            i += 1
        avg_acc = np.mean(batch_accuracies)
        avg_loss = np.mean(batch_losses)
        logger.info(
            f"Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.4f}, Avg Accuracy: {avg_acc:.4f}"
        )

This looks a lot simpler than everything else, but that is in part because the model has a few important methods like .forward(), .backward(), and .compute_loss() that make everything at this stage very compact. However, you may be asking ‘when did we actually construct the model?’ Well, the training_loop.py file is not done yet: there’s also a main block that executes when we run the script.

This allows arguments to be parsed from the command line so you can type something like

python training_loop.py --epochs 200 --learning_rate 5e-6 --log_level DEBUG

which we will take advantage of in the next post. After the arguments are parsed, we construct our model, load the data in, and run the training loop. After that, we save the model and log the training/test losses and accuracies.

if __name__ == "__main__":
    argparser = argparse.ArgumentParser(
        description="Train a simple neural network on MNIST"
    )
    argparser.add_argument(
        "--epochs", type=int, default=20, help="Number of epochs to train"
    )
    argparser.add_argument(
        "--batch_size", type=int, default=32, help="Batch size for training"
    )
    argparser.add_argument(
        "--learning_rate", type=float, default=1e-3, help="Learning rate for optimizer"
    )
    argparser.add_argument("--log_level", type=str, default="INFO")
    # Parse command line arguments
    args = argparser.parse_args()
    epochs = args.epochs
    batch_size = args.batch_size
    learning_rate = args.learning_rate

    # Configure logging
    logging.basicConfig(
        # Fall back to INFO if the level name is not recognized
        level=getattr(logging, args.log_level.upper(), logging.INFO),
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    )
    logger.info(
        f"Starting training for {epochs} epochs with batch size {batch_size} and learning rate {learning_rate}"
    )

    # Define Model
    basic_model = Model(
        layers=[
            DenseLayer(
                input_size=INPUT_SIZE,
                output_size=HIDDEN_SIZE1,
                activation_function=ReLU(),
            ),
            DenseLayer(
                input_size=HIDDEN_SIZE1,
                output_size=HIDDEN_SIZE2,
                activation_function=ReLU(),
            ),
            DenseLayer(
                input_size=HIDDEN_SIZE2,
                output_size=OUTPUT_SIZE,
                activation_function=SoftMax(),
            ),
        ],
        loss=CrossEntropyLoss(),
        optimizer=Adam(learning_rate=learning_rate),
    )

    # Load Dataset
    train_dataset = MNISTDataset(split="train")
    test_dataset = MNISTDataset(split="test")
    X_train, y_train = train_dataset.X, train_dataset.y
    X_test, y_test = test_dataset.X, test_dataset.y
    print("Training data shape:", X_train.shape, y_train.shape)
    print("Testing data shape:", X_test.shape, y_test.shape)

    # Train Model
    training_loop(
        model=basic_model,
        dataset=train_dataset,
        epochs=epochs,
        batch_size=batch_size,
        learning_rate=learning_rate,
    )

    # Save Model
    basic_model.save("models/bigger_model.npz")

    # Evaluate on training set
    y_train_pred = basic_model.predict(X_train)
    train_loss = basic_model.compute_loss(y_train, y_train_pred)
    logger.info(f"Train Loss: {train_loss:.4f}")
    logger.info(f"Train Accuracy: {accuracy(y_train, y_train_pred):.4f}")

    # Evaluate on test set
    y_test_pred = basic_model.predict(X_test)
    test_loss = basic_model.compute_loss(y_test, y_test_pred)
    test_accuracy = accuracy(y_test, y_test_pred)
    logger.info(f"Test Loss: {test_loss:.4f}")
    logger.info(f"Test Accuracy: {test_accuracy:.4f}")

And there you have it. I can run this and get a relatively accurate MNIST model.

Numpy Neural Nets Models

13 Nov 2025

Another Numpy Neural Net: Model

Finally, we get to make a model. The model is essentially a bunch of layers, which are in turn arrays of weights and biases. It is also the method by which the weights and biases are optimized, and thirdly the data that gives the optimization method fuel to optimize. So to define our model, we will need:

A list of layers
An optimizer
A loss function

It’s not common practice to carry the training data around with the model, but it is terribly important to keep track of what data was used to train the model. With our MNIST set this is done by using the ‘train’ file only for training.

class Model:
    layers: List[DenseLayer]
    loss: DifferentiableFunction
    optimizer: Optimizer

    def __init__(
        self,
        layers: List[DenseLayer],
        loss: DifferentiableFunction,
        optimizer: Optimizer,
    ):
        self.layers = layers
        self.loss = loss
        self.optimizer = optimizer
        # Give unnamed layers a default name based on their position
        for idx, layer in enumerate(self.layers):
            if layer.name is None:
                layer.name = f"Layer_{idx}"

    def forward(self, x: np.ndarray) -> np.ndarray:
        for layer in self.layers:
            x = layer.forward(x)
        return x

    def backward(self, y_true: np.ndarray, y_pred: np.ndarray):
        loss_grad = self.loss.derivative(y_true, y_pred)
        grad_dict = {"inputs": loss_grad}
        for layer in reversed(self.layers):
            grad_dict = layer.backward(grad_dict["inputs"])
            self.optimizer.step(layer, grad_dict)

    def predict(self, x: np.ndarray) -> np.ndarray:
        return self.forward(x)

    def compute_loss(self, y_true: np.ndarray, y_pred: np.ndarray) -> float:
        return np.mean(self.loss.function(y_true, y_pred))

Numpy Neural Nets Datasets

29 Oct 2025

Another Numpy Neural Net: Datasets

The data is the most important part of training a machine learning model, and also the part that’s most difficult to automate. It seems no matter how many ‘data pipelines’ and ‘etl solutions’ are out there, at some point you end up having to have an if/elif/else statement that’s about 50 clauses long to handle all the different formats and entire-pipeline-wrecking abnormalities that the data might contain.

I think this is fundamentally because you don’t have total control over data the way you do with the rest of the process. Data, if it is useful, comes from the ‘real world’, the part of the world which is not contained as a bunch of charges and magnetic domains inside a computer. Therefore, as the world is infinite in its complexity, so too is the number of ways that the data can crush your dreams. Anyone who has scanned an Excel doc with a column containing missing values has seen something along the lines of

Important column which is never null	Optional column
0	N/A
1	NULL
2	null
3	blank
NULL
4

Important note: the last two entries in the right-hand column are blank and a space (‘ ‘) and require different handling.

I could go on, but these are pretty mundane problems. Most data scientists will probably get some grizzled sea captain look on their face if you ask them to recount the worst data problem they’ve seen, as if they’re lucky to be alive. All this is to say, here, like most pedagogical projects, I will be using an extremely-battle-tested dataset: MNIST. There’s a few other good options, but who doesn’t like reading handwritten digits?

Today’s Base Class

I tried a few fancier, more generalizable things for this class, but honestly, I have found you almost always have to have an extremely specific set of functions at some point to read your data. So here I’m kind of trying to outline the important bits and things you’ll have to think about for your particular application, but things like how you load, get it into some kind of dataframe/array, preprocess, are all going to be very specific to your particular dataset and your particular application.

class Dataset:
    def __init__(self, X: np.ndarray, y: np.ndarray, split: Literal['train', 'test', 'validation'] = 'train'):
        self.X = X
        self.y = y
        self.split = split

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx: int):
        return self.X[idx], self.y[idx]

    def get_batches(self, batch_size: int, shuffle: bool = True):
        indices = np.arange(len(self.X))
        if shuffle:
            np.random.shuffle(indices)
        for start_idx in range(0, len(self.X), batch_size):
            batch_indices = indices[start_idx:start_idx + batch_size]
            logger.debug(f"Yielding batch from index {start_idx} to {start_idx + batch_size}")
            yield self.X[batch_indices], self.y[batch_indices]

    @abstractmethod
    def preprocess_data(self):
        pass
    @abstractmethod
    def preprocess_labels(self):
        pass

    def preprocess(self):
        self.X = self.preprocess_data()
        self.y = self.preprocess_labels()

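    def train_val_split(self, train_ratio: float):
        # Not named split: the self.split attribute set in __init__ would shadow
        # a method of that name on every instance.
        split_idx = int(len(self.X) * train_ratio)
        X_train, y_train = self.X[:split_idx], self.y[:split_idx]
        X_val, y_val = self.X[split_idx:], self.y[split_idx:]
        # Return plain Dataset objects, since subclasses take different constructor arguments
        return Dataset(X_train, y_train, split='train'), Dataset(X_val, y_val, split='validation')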

So here we make a Dataset class that takes in an array of data and an array of labels, as well as an indication of whether it is to be considered a train, test, or validation set. Already here’s something that is affected by how you get your data. The MNIST set I downloaded from here comes with a train file and a test file. The way I’ve set up this class makes a little less sense if you just have a big chunk of data that you need to split yourself. I have included a split method for doing a train/val split. It returns two new Datasets holding different slices of the data from the one that called it. You’d need to make it a little different if you wanted k-fold cross validation.
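The k-fold variant isn’t part of the class. As a rough sketch of how the index bookkeeping could work, here is a hypothetical k_fold_indices helper (my own name, not something from the repo) that shuffles the sample indices once and then rotates which fold is held out:

```python
import numpy as np


def k_fold_indices(n_samples: int, k: int, seed: int = 0):
    """Yield (train_idx, val_idx) pairs for k-fold cross validation (k >= 2)."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, k)  # fold sizes differ by at most 1
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx


# Usage: 10 samples, 3 folds
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
fold_sizes = []
seen_in_validation = []
for train_idx, val_idx in k_fold_indices(len(X), k=3):
    X_train, y_train = X[train_idx], y[train_idx]
    X_val, y_val = X[val_idx], y[val_idx]
    fold_sizes.append((len(X_train), len(X_val)))
    seen_in_validation.extend(val_idx.tolist())
```

Every sample lands in the validation set exactly once, and each pair of index arrays could be used to build a train and a validation Dataset.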

Importantly, the preprocess_data and preprocess_labels methods are applicable to most applications. I haven’t encountered a whole lot of data you can just shove into a model raw. In this case it comes as a PIL (Python Imaging Library) image and you have to turn it into an array of pixels. The labels are the numbers themselves, and no, you can’t use those raw either, you have to one-hot-encode them. So that:

0 = [1,0,0,0,0,0,0,0,0,0]
1 = [0,1,0,0,0,0,0,0,0,0]
...
9 = [0,0,0,0,0,0,0,0,0,1]

Then use np.argmax to turn the output of softmax back into a number when you make predictions. But first we need to read a parquet file:

class ParquetDataset(Dataset):
    def __init__(self, file_path: str, feature_cols: list, label_col: str):
        df = pd.read_parquet(file_path)
        X = df[feature_cols].values
        y = df[label_col].values
        super().__init__(X, y)

Which is a simple enough embellishment, I could use this for any data that comes to me in a parquet file with a label column. It’s not everything, but it’s a lot of things. Finally, MNIST:

class MNISTDataset(ParquetDataset):
    def __init__(self, 
                 split: Literal['train','test','validation'] = 'train',
                 feature_cols: List[str] = [ 'image' ],
                 label_col: str = 'label'):
        file_path: str = f'/workspaces/NumpyNN/data/mnist/{split}-00000-of-00001.parquet'
        super().__init__(file_path, 
                         feature_cols,
                         label_col)
        self.split = split
        self.preprocess()

    def preprocess_data(self):
        # Convert byte arrays to numpy arrays
        imgs = self.X[:,0]
        img_arrays = [Image.open(io.BytesIO(img['bytes'])) for img in imgs]
        # Flatten
        flattened_imgs = [np.array(img).reshape(-1) for img in img_arrays]
        # Normalize
        normalized_imgs = np.array(flattened_imgs) / 255.0
        # Return an array of shape (num_samples, 784)
        return np.array(normalized_imgs)
        
    def preprocess_labels(self):
        return np.eye(10)[self.y]  # One-hot encode labels

I’ve got the file in my repo because it’s small. You really shouldn’t keep them in your repo in general, but DVC is the subject of a totally different post. The data and the label both have preprocessing functions and the init method calls them both with preprocess. These arrays must be normalized and flattened, which is not true for all datasets and all applications, so it’ll need to be changed for a different model architecture.
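To make the two preprocessing steps concrete without needing the parquet file, here is a small sketch on made-up data. The random uint8 arrays stand in for decoded images (an assumption for illustration only), and the last line shows the np.argmax round trip from one-hot rows back to digits:

```python
import numpy as np

# Stand-in for decoded images: 4 fake 28x28 grayscale images with uint8 pixels
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)
labels = np.array([3, 0, 9, 1])

# Flatten each image to 784 values and scale pixels into [0, 1]
X = images.reshape(len(images), -1) / 255.0

# One-hot encode: row i of the identity matrix is the encoding of digit i
y_one_hot = np.eye(10)[labels]

# Round trip: argmax recovers the original digit from a one-hot row
recovered = np.argmax(y_one_hot, axis=1)
```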

A Note on Tests

So the last few posts, copilot and autocomplete have for the most part come through. I more or less write the name of a class and I get large chunks of it taken care of for me, and it went even faster once I found that I sometimes get two options (like make a placeholder method or actually fill it in). If I feel it’s missing a method or a method is missing a parameter, I just type it in, and the rest updates seamlessly. However, I got to this dataset, and it was a bit of a struggle. The stuff that copilot output that worked was kind of a mess because it was trying to account for every possible situation, even after I tried to chat to it a bit about what the situation was. Trying to add the datafile to the context isn’t super helpful because in general it’s too big of a file, so you need to read the data and make a smaller version to help the LLM out, and you might as well just do it yourself at that point.

Which comes around to my point, I love that this thing writes tests for me, but the fact is that they are only good tests if I

read them and
run them

So here’s me reading them. This is a good test of copilot, because I always find trying to write data tests to be particularly annoying for all the reasons mentioned earlier:

@pytest.fixture
def sample_parquet(tmp_path):
    # Create a sample DataFrame
    df = pd.DataFrame({
        'image': [np.random.rand(28*28) for _ in range(5)],
        'label': [0, 1, 2, 3, 4]
    })
    file_path = tmp_path / "sample.parquet"
    df.to_parquet(file_path)
    return str(file_path), df

def test_mnist_train_dataset_init(sample_parquet):
    file_path, df = sample_parquet
    dataset = ParquetDataset(file_path=file_path, feature_cols=['image'], label_col='label')
    assert isinstance(dataset.X, np.ndarray)
    assert isinstance(dataset.y, np.ndarray)
    assert dataset.X.shape[0] == len(df)
    assert dataset.y.shape[0] == len(df)
    # Check that the labels match
    np.testing.assert_array_equal(dataset.y, df['label'].values)

First off, good for copilot, using a fixture. It must have read my earlier post. Unlike when it was trying to write the class, it’s being super specific about the one we’re actually using, down to assuming a 28x28 image. This seems appropriate, this dataset doesn’t have surprises in image sizing. It uses the tmp_path builtin pytest fixture so we can test that the __init__ method is actually loading parquet files, but won’t slowly fill up our repo with junk parquet files. The assert statements are reasonable in that it’s testing that the X and y attributes are the right type and have the right shape. However, since this is operating on the ParquetDataset, we aren’t really getting a test of the MNISTDataset class.

I had to ask copilot special to make me something, and a couple different ways, but it did it:

def test_mnist_dataset_pil_images(sample_parquet):
    file_path, df = sample_parquet
    dataset = MNISTDataset(file_path=str(file_path), split='train')
    assert isinstance(dataset.X, np.ndarray)
    assert dataset.X.shape == (5, 28 * 28)
    assert np.issubdtype(dataset.X.dtype, np.floating)
    assert dataset.X.max() <= 1.0 and dataset.X.min() >= 0.0

    assert dataset.y.shape == (5, 10)
    np.testing.assert_array_equal(np.argmax(dataset.y, axis=1), df['label'].values)

There you have it, now we can train a model. Again, I find it somewhat annoying to mix up different types of assert statements, but it works!

Numpy Neural Nets Optimizers

24 Oct 2025

Another Numpy Neural Net: Optimizers

We have layers, but currently they consist of a weight matrix with randomly chosen values and a bias vector. If we send an input through this, it will produce an essentially random output. So we need a way to improve the values of the layer weights. This process is called training, and it adjusts the weights and biases in our model through data and optimization. Usually when one begins their machine learning journey, they begin with gradient descent and its nearly identical cousin, stochastic gradient descent (using random batches to optimize). There are other optimization algorithms out there that have proven even more successful than our pal SGD, but they are all more or less variations on the same idea, in the same spirit as Newton’s Method: if you know where you are and which way is downhill, you can figure out where you should be next, at least if you take a small enough step.
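Before any classes get involved, the whole idea fits in a few lines. This is a toy sketch, assuming a made-up bowl-shaped loss whose minimum I chose to be at [3, -1]; nothing here comes from the repo:

```python
import numpy as np

TARGET = np.array([3.0, -1.0])  # minimum of the toy loss, chosen arbitrarily


def loss(w: np.ndarray) -> float:
    # Simple bowl-shaped loss with its minimum at TARGET
    return float(np.sum((w - TARGET) ** 2))


def grad(w: np.ndarray) -> np.ndarray:
    # Analytic gradient of the loss above
    return 2.0 * (w - TARGET)


w = np.zeros(2)       # where we are
learning_rate = 0.1   # how big a step we take
for _ in range(100):
    w = w - learning_rate * grad(w)  # step downhill
```

Each step shrinks the distance to the minimum by a constant factor here, so w ends up very close to [3, -1]. With a learning rate above 1.0 the same loop would overshoot further on every step, which is the ‘small enough step’ caveat in action.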

Base class

Not a whole lot to this one, just need to enforce the methods we’ll need because like layers, the optimizer will be used by the model, so we have to make sure that all optimizers have certain common functionalities, expect certain inputs, and produce certain outputs.

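from abc import ABC, abstractmethod
from typing import Dict, Any
import numpy as np

# Inheriting from ABC is what makes @abstractmethod enforceable
class Optimizer(ABC):
    def __init__(self):
        self.name = None
        self.type = 'Optimizer'

    @abstractmethod
    def step(self, layer: Any, grads: Dict[str, np.ndarray]) -> None:
        """Update the layer's weights and biases in place."""
        pass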
    
    @abstractmethod
    def to_dict(self) -> Dict[str,Any]:
        pass
    
    @classmethod
    @abstractmethod
    def from_dict(cls, config: Dict[str,Any]) -> 'Optimizer':
        pass

You see this looks remarkably like the Layer base class, but the main method we need to implement in subclasses is step, which is where the weights and biases of the layer passed to it are updated using the gradients calculated in the layer’s backward pass. Now we make the SGD class:

class SGD(Optimizer):
    def __init__(self, learning_rate: float):
        super().__init__()
        self.learning_rate = learning_rate
        self.type = 'SGD'
    
    def step(self, layer, grads: Dict[str, np.ndarray]) -> None:
        # params holds references to the layer's arrays, so `-=` updates them in place
        params = {'weights': layer.weights, 'biases': layer.biases}
        for key in ['weights', 'biases']:
            params[key] -= self.learning_rate * grads[key]
    
    def to_dict(self) -> dict:
        return {'learning_rate': self.learning_rate}
    
    @classmethod
    def from_dict(cls, data: dict):
        return cls(learning_rate=data['learning_rate'])

As we’ll see, this is dead simple compared to other optimizers, and is special in that it is stateless. No matter what step you are on in the training process or what the previous step was, the optimization algorithm proceeds the same. For each layer, we get the weights and biases and subtract the relevant gradient multiplied by the learning rate, which updates the layer’s arrays in place. The optimizer itself also only has one parameter, the learning rate.
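It’s worth convincing yourself that the update really reaches the layer, since step never assigns to layer.weights directly. Here is a small sketch with a hypothetical ToyLayer stand-in (not the real DenseLayer) and the same update written as a free function:

```python
import numpy as np


class ToyLayer:
    """Hypothetical stand-in for a layer: just holds weights and biases."""

    def __init__(self):
        self.weights = np.ones((2, 2))
        self.biases = np.zeros(2)


def sgd_step(layer, grads, learning_rate):
    # The dict holds references to the layer's arrays, not copies,
    # so the in-place `-=` below changes layer.weights and layer.biases.
    params = {"weights": layer.weights, "biases": layer.biases}
    for key in ("weights", "biases"):
        params[key] -= learning_rate * grads[key]


layer = ToyLayer()
weights_before = layer.weights  # same array object, kept to check identity
grads = {"weights": np.full((2, 2), 0.5), "biases": np.array([1.0, -1.0])}
sgd_step(layer, grads, learning_rate=0.1)
```

This only works because the arrays are floating point and `-=` on a NumPy array modifies it in place. Writing params[key] = params[key] - ... instead would silently leave the layer untouched.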

But there are others! Notably, Adam, which is many people’s go-to deep learning optimizer. Adam adds a momentum term that smooths out rapid changes in the gradient, and scales each parameter’s step by a running average of its squared gradient. In practice this leads to much faster training for many neural nets.

class Adam(Optimizer):
    def __init__(self, 
                 learning_rate: float, 
                 beta1: float = 0.9, 
                 beta2: float = 0.999, 
                 epsilon: float = 1e-8):
        super().__init__()
        self.type = 'Adam'
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = dict()
        self.v = dict()
        self.t = 0
    
    def initialize_state(self,layer: Any):
        initial_dict = dict(weights = np.zeros_like(layer.weights),
                            biases = np.zeros_like(layer.biases))
        self.m[layer.name] = initial_dict.copy()
        self.v[layer.name] = initial_dict.copy()

    def step(self,
             layer: Any,
             grads: Dict[str,np.ndarray]) -> np.ndarray:
        if self.m.get(layer.name) is None:
            self.initialize_state(layer)
        params = {'weights': layer.weights, 'biases': layer.biases}
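        # Note: step() runs once per layer, so t advances once per layer call
        # rather than once per batch, and it is capped at 1000
        if self.t < 1000:
            self.t += 1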
        for key in ['weights', 'biases']:
            self.m[layer.name][key] = self.beta1 * self.m[layer.name][key]  + (1 - self.beta1) * grads[key]
            self.v[layer.name][key]  = self.beta2 * self.v[layer.name][key]  + (1 - self.beta2) * (grads[key] ** 2)
            m_hat = self.m[layer.name][key] / (1 - self.beta1 ** self.t)
            v_hat = self.v[layer.name][key] / (1 - self.beta2 ** self.t)
            params[key] -= self.learning_rate * m_hat / (np.sqrt(v_hat) + self.epsilon)
    
    def to_dict(self) -> dict:
        # learning_rate is included because from_dict requires it
        return {'learning_rate': self.learning_rate,
                'beta1': self.beta1,
                'beta2': self.beta2,
                'epsilon': self.epsilon,
                'm': self.m,
                'v': self.v,
                't': self.t}

    @classmethod
    def from_dict(cls, data: dict):
        obj = cls(learning_rate=data['learning_rate'],
                  beta1=data.get('beta1', 0.9),
                  beta2=data.get('beta2', 0.999),
                  epsilon=data.get('epsilon', 1e-8))
        # Fall back to empty state so step() can still call self.m.get(...)
        obj.m = data.get('m') or {}
        obj.v = data.get('v') or {}
        obj.t = data.get('t', 0)
        return obj

The big difference here is that the optimizer has a state, which is important when it comes time to save the model and train it further later. I’ve been bitten many times by not properly saving the optimizer state and going to train a model for additional epochs, only to find it quickly turn into garbage and have to waste time retraining from scratch. The m (momentum) and v (variance) terms have values associated both with the weights and with the biases and are updated for each layer at each step. Both start at zero, so in the early steps (small t) they are biased toward zero; dividing m by (1-beta1**t) and v by (1-beta2**t) corrects for that. It’s a little more complicated, but once you have your optimizer, it’s really no harder to use than any other one, and it often works a lot better.
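The bias correction is easiest to see on the very first step. This sketch works through one Adam update by hand for a made-up gradient, using the same default hyperparameters as the class (the gradient values are arbitrary):

```python
import numpy as np

beta1, beta2, epsilon = 0.9, 0.999, 1e-8
learning_rate = 1e-3
grad = np.array([0.5, -2.0])  # arbitrary example gradient

# State starts at zero, as in initialize_state
m = np.zeros_like(grad)
v = np.zeros_like(grad)
t = 1  # first step

m = beta1 * m + (1 - beta1) * grad        # 0.1 * grad: biased toward zero
v = beta2 * v + (1 - beta2) * grad ** 2   # 0.001 * grad**2: biased toward zero
m_hat = m / (1 - beta1 ** t)              # dividing by 0.1 recovers grad
v_hat = v / (1 - beta2 ** t)              # dividing by 0.001 recovers grad**2
update = learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
```

Without the correction, the first m would be ten times too small and the first v a thousand times too small. With it, the first update comes out at almost exactly learning_rate in the direction of the gradient’s sign for every parameter.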

Well that’s the optimizer. I lied in the last post, there’s actually one more incredibly important component before the model. It’s the Dataset! Then we’ll make a model, I promise.

Numpy Neural Nets Layers

23 Oct 2025

Another Numpy Neural Net: Layers

Deep learning is built on layers: you input numbers on one side and the crank turns forward from layer to layer until it spits numbers out on the other side. Then the crank turns backwards: a loss goes in on the output side and a derivative comes out on the input side. This incredible modularity is both necessary (you have to introduce a non-linearity periodically or all your operations can be combined into one linear transformation) and probably one reason why the field moves so fast. So let’s start with the classic, the fully connected layer. We’ll loop back and do a couple more layers once we have the rest of the machinery in place.
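The claim about non-linearities is quick to check numerically. In this sketch the weights and inputs are small made-up arrays; two stacked layers with no activation collapse into one layer, and putting a ReLU in between breaks the collapse:

```python
import numpy as np

x = np.array([[1.0, -2.0], [3.0, 4.0]])            # batch of 2 inputs
W1 = np.array([[1.0, -1.0], [0.5, 2.0]])
b1 = np.array([0.0, 1.0])
W2 = np.array([[2.0], [1.0]])
b2 = np.array([0.5])

# Two stacked layers with no activation in between...
two_layers = (x @ W1 + b1) @ W2 + b2

# ...are the same as a single layer with combined weights and biases
W_combined = W1 @ W2
b_combined = b1 @ W2 + b2
one_layer = x @ W_combined + b_combined

# With a ReLU in between, the collapse no longer works
with_relu = np.maximum(x @ W1 + b1, 0) @ W2 + b2
```

The first sample has a negative pre-activation, so the ReLU changes its output and no single weight matrix can reproduce the result for every input.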

A Base Class

But first, a Base Class:

import numpy as np
from DifferentiableFunction import DifferentiableFunction
from typing import Dict
from abc import ABC, abstractmethod

# Inheriting from ABC is what makes @abstractmethod enforceable
class Layer(ABC):
    def __init__(self):
        self.name = None
        self.type = 'Layer'
        self.weights = None
        self.biases = None
        self.last_input = None
        self.last_z = None
        self.weights_initialized = False

    @abstractmethod
    def initialize_weights(self):
        pass

    @abstractmethod
    def forward(self, input_data: np.ndarray) -> np.ndarray:
        if self.weights_initialized is False:
            self.initialize_weights()
            self.weights_initialized = True

    @abstractmethod
    def backward(self, output_gradient: np.ndarray) -> Dict[str,np.ndarray]:
        pass

    @abstractmethod
    def to_dict(self) -> Dict:
        pass
    @classmethod
    @abstractmethod
    def from_dict(cls, data: Dict) -> 'Layer':
        pass

The Layer class defines what we would like all layers to have, and, since layers are generally used by other objects, the methods we need every layer to have. All layers get the attributes specified in the __init__ method; we can call this from subclasses and not wonder later if we forgot something. The point of the @abstractmethod decorator is to mark the methods every subclass must implement. Python only enforces this when the base class derives from abc.ABC; in that case, instantiating a subclass that is missing one of them raises a TypeError. There’s a number of them, because if a layer is missing any of these, it won’t work! It needs a forward pass, a backward pass, a way to initialize weights, and to/from_dict methods. The @classmethod decorator means that the method can be called from the class rather than an instance (Layer.from_dict(kwargs) instead of Layer(kwargs).from_dict(kwargs)). This makes a lot of sense for ‘load’ type functions, where it seems silly to make some kind of dummy instance so you can instantiate the real instance.
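Here is a stripped-down sketch of both decorators in action. The class names Base, Incomplete, and Complete are hypothetical and only exist for this illustration:

```python
from abc import ABC, abstractmethod


class Base(ABC):
    @abstractmethod
    def forward(self, x):
        ...

    @classmethod
    def from_dict(cls, data: dict) -> "Base":
        # Called on the class itself, so no dummy instance is needed
        return cls(**data)


class Incomplete(Base):
    pass  # forgot to implement forward


class Complete(Base):
    def __init__(self, scale: float = 1.0):
        self.scale = scale

    def forward(self, x):
        return self.scale * x


# Instantiating a subclass with a missing abstract method fails immediately
try:
    Incomplete()
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# from_dict builds a real instance straight from the class
layer = Complete.from_dict({"scale": 2.0})
result = layer.forward(3.0)
```

The failure happens when the object is created, not when the missing method is first called, which is exactly when you want to find out.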

The last two ‘dict’ methods are really so that we can specify these models as a yaml or json file instead of inline. This can be very useful when you want to separate the code to run/train your model from the actual model architecture. You don’t want to be pushing code to the repo just to try out a different number of hidden nodes.
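As a sketch of what that separation could look like, here is a made-up architecture stored as JSON. The keys mirror what DenseLayer.to_dict returns further down, but the layer names and the size check are my own additions for illustration:

```python
import json

# Hypothetical architecture description, using the same keys as DenseLayer.to_dict
architecture = [
    {"name": "hidden_1", "type": "Dense", "input_size": 784,
     "output_size": 64, "activation_function": "ReLU"},
    {"name": "output", "type": "Dense", "input_size": 64,
     "output_size": 10, "activation_function": "SoftMax"},
]

# Serialize to a JSON string (this could equally be written to a file)
as_json = json.dumps(architecture, indent=2)

# Later: read it back and check that consecutive layers fit together
loaded = json.loads(as_json)
for prev, nxt in zip(loaded, loaded[1:]):
    assert prev["output_size"] == nxt["input_size"], "layer sizes do not line up"
```

Each entry in loaded could then be handed to a from_dict classmethod to rebuild the layer, so trying 128 hidden nodes becomes a config change rather than a code change.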

Subclass

So now we’re ready to code up our Dense layer:

class DenseLayer(Layer):
    """
    A fully connected neural network layer.
    
     Parameters:
        input_size (int): The number of input features.
        output_size (int): The number of output features (neurons).
        activation_function (DifferentiableFunction): The activation function to apply.
        name (str, optional): Name of the layer. Defaults to None.
    """

    def __init__(self, 
                 input_size:int,
                 output_size: int, 
                 activation_function: DifferentiableFunction,
                 name: str = None):
        super().__init__()
        self.name = name
        self.type = 'Dense'
        self.input_size = input_size
        self.output_size = output_size
        self.activation_function = activation_function

    def initialize_weights(self):
        """
        initialize weights with He initialization.
        """
        self.weights = np.random.randn(self.input_size, self.output_size) * (np.sqrt(2./self.input_size))
        self.biases = np.zeros(self.output_size)
        logger.info(f"Weights and biases initialized for layer {self.name}")

    def forward(self, input_data: np.ndarray) -> np.ndarray:
        """
        Performs the forward pass through the layer, supporting batched input.
        Parameters:
            input_data (np.ndarray): Input data to the layer. Shape: (batch_size, input_size)

        Returns:
            np.ndarray: Output after applying weights, biases, and activation function. Shape: (batch_size, output_size)
        """
        super().forward(input_data)
        self.last_input = input_data
        self.last_z = np.dot(input_data, self.weights) + self.biases  # (batch_size, output_size)
        logger.debug(f"Forward pass in layer {self.name}: input shape {input_data.shape}, z shape {self.last_z.shape}")
        return self.activation_function.function(self.last_z)

    def backward(self, output_gradient: np.ndarray) -> Dict[str,np.ndarray]:
        """
        Performs the backward pass through the layer and returns the gradients.
        The weights and biases are not updated here; the optimizer does that.
        Parameters:
            output_gradient (np.ndarray): Gradient of the loss with respect to the layer's output. Shape: (batch_size, output_size)
        Returns:
            Dict[str, np.ndarray]: Gradients keyed by 'inputs' (batch_size, input_size), 'weights' (input_size, output_size), and 'biases' (output_size,)
        """
        activation_gradient = self.activation_function.derivative(self.last_z)  # (batch_size, output_size)
        delta = output_gradient * activation_gradient  # (batch_size, output_size)

        weight_gradient = np.dot(self.last_input.T, delta) / self.last_input.shape[0]  # (input_size, output_size)
        bias_gradient = np.sum(delta, axis=0) / self.last_input.shape[0] # (output_size,)

        input_gradient = np.dot(delta, self.weights.T)  # (batch_size, input_size)
        logger.debug(f"Backward pass in layer {self.name}: output_gradient shape {output_gradient.shape}, input_gradient shape {input_gradient.shape}") 
        grad_dict = {'inputs': input_gradient,
                     'weights': weight_gradient,
                     'biases': bias_gradient}
        
        # Updating weights and biases is handled by the optimizer in Model.py
        return grad_dict

    def to_dict(self) -> Dict:
        return {
            'name': self.name,
            'type': self.type,
            'input_size': self.input_size,
            'output_size': self.output_size,
            'activation_function': self.activation_function.__class__.__name__,
        }
    
    @classmethod
    def from_dict(cls, data: Dict) -> 'DenseLayer':
        activation_function = getattr(__import__('DifferentiableFunction'), data['activation_function'])()
        layer = cls(input_size=data['input_size'],
                    output_size=data['output_size'],
                    activation_function=activation_function,
                    name=data['name'])
        return layer

The call to super().__init__() initializes all our default variables we’ll need to do forward/backward passes. We update self.type to the new layer type, and implement all the abstract methods.

forward simply does a matrix multiplication of the inputs with the weights and adds the biases, then applies the activation function. The first time forward is called, it initializes the weights by calling the base class’s forward method.

backward calculates the relevant gradients and packages them up into a dictionary to be used by the optimizer. Note we don’t actually update the weights on the forward or backward pass partly to decrease the number of objects the layer needs to depend on.
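A cheap way to gain confidence in a backward pass is to compare it against finite differences. This sketch does that for the weight gradient only, assuming an identity activation and a toy squared-output loss that I picked so the analytic gradient has the same form as in DenseLayer.backward:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))   # (batch_size, input_size)
W = rng.normal(size=(3, 2))   # (input_size, output_size)
b = np.zeros(2)


def loss_fn(weights: np.ndarray) -> float:
    # Toy loss: mean over the batch of 0.5 * sum(out**2), identity activation
    out = x @ weights + b
    return float(np.mean(0.5 * np.sum(out ** 2, axis=1)))


# Analytic gradient, same form as in DenseLayer.backward
delta = x @ W + b                      # per-sample dLoss/dout for this toy loss
analytic = x.T @ delta / x.shape[0]    # (input_size, output_size)

# Numerical gradient by central differences, one weight at a time
numeric = np.zeros_like(W)
eps = 1e-6
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        numeric[i, j] = (loss_fn(W_plus) - loss_fn(W_minus)) / (2 * eps)

max_diff = float(np.max(np.abs(analytic - numeric)))
```

If max_diff is not tiny, something is off in the shapes, the transpose, or the division by the batch size. The same loop works for the biases.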

Note that I also threw some logging in here. Putting logging inside your classes is usually doing yourself a real solid later when some indeterminate layer is ruining your life come training time. It’s also way better than a print statement because you don’t have to remember to erase it (I feel like a n00b every time a colleague points out a print I missed in a PR review). To put logging in just throw this on top of any file:

import logging

logger = logging.getLogger(__name__)
logger.info("This is a log message at loglevel INFO")

Finally, the to/from_dict methods make it possible to save/load a layer from a dictionary-like form such as json or yaml.

Next up we’ll look at Optimizers and then we’ll be ready to put it all together into a model.

Chris Malec Data Scientist