Deep Learning notation
2021-10-15 | aprates.dev
 Read this post in Portuguese
A computer would deserve to be called intelligent if it could deceive a human into believing that it was human. - Alan Turing
Deep maths, ufff…
By mid-2021 I started diving into a machine learning course I thought I should take. A long time ago, when I graduated, my graduation paper was about chatbots with emotions and how humans would react to them. I wanted to better understand how the techniques had evolved since back then, in 2006, and found something a bit different from what I was expecting.
Given the current status quo, you just cannot avoid some basic knowledge of Python libraries (such as NumPy), linear algebra, and a good dose of mathematical notation when reading descriptions of machine learning methods. And it can be very frustrating at times.
One bit of notation in an equation you don't fully grasp might prevent you from implementing the concept you are trying to learn. Coming in as an experienced developer, I had that beginner-like feeling while facing modern machine learning basics.
So here I have collected some mathematical notation I came across while doing the deep learning course, along with some notes on concepts that felt mysterious to me, like cost and derivatives.
I took these notes mostly for my personal use, but posted them because I wish I had found something like this when searching the Internet. I must also say that notation varies a lot from author to author, and that I am still learning, so take my notes with a grain of salt.
Principle
The activation of a node in a neural network is something of the form:
output = activation_function(dot_product(weights, inputs) + bias)
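The line above can be sketched directly with NumPy. This is a minimal illustration with made-up weights, inputs, and bias for a single node, using sigmoid as the activation function:

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into the (0, 1) range
    return 1 / (1 + np.exp(-z))

# hypothetical values for a single node with 3 inputs
weights = np.array([0.2, -0.5, 0.1])
inputs = np.array([1.0, 2.0, 3.0])
bias = 0.4

# output = activation_function(dot_product(weights, inputs) + bias)
z = np.dot(weights, inputs) + bias
output = sigmoid(z)
print(output)  # ≈ 0.475
```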
General Notation
as per Andrew Ng of the deeplearning.ai specialization on Coursera
- x : input
- y : output
- m : number of examples (amount of data)
- X : set of training examples
- Y : set of outcome examples
- N : size of X
- ( x(i) , y(i) ) : i-th example pair in X
- ( x' , y' ) : a testing pair
- ŷ : predicted output
- L : loss function (can also refer to hidden layers, see hyperparameters)
- J : cost function
- W : set of w parameters (weights)
- B : set of b parameters (biases)
- w[1], b[1], w[2], b[2], … : parameters per layer
- Z = transpose(W) * X + B : vectorized implementation of hidden and output layers
- dw1, dw2, db : derivatives of parameters
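The vectorized line Z = transpose(W) * X + B from the list above can be sketched in NumPy. The shapes here are made up for illustration: 3 input features, 4 hidden units, and 5 training examples stacked as columns of X:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))   # one column of weights per hidden unit
B = np.zeros((4, 1))              # one bias per hidden unit
X = rng.standard_normal((3, 5))   # each column is one training example x(i)

# Z = transpose(W) * X + B, computed for all examples at once
Z = np.dot(W.T, X) + B
print(Z.shape)  # (4, 5): 4 hidden units for each of the 5 examples
```

The point of vectorization is that no explicit loop over the m examples is needed; the matrix product handles them all in one call.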
Hyperparameters
These parameters actually control how the parameters w and b work:
- α : learning rate (alpha symbol)
- number of iterations for the gradient descent
- L : number of hidden layers
- n[1], n[2], … : number of hidden units per layer
- choice of activation function, like relu, tanh, sigmoid…
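The activation choices listed above can be sketched in a few lines of NumPy, just to make their shapes concrete:

```python
import numpy as np

def relu(z):
    # zero for negative inputs, identity for positive ones
    return np.maximum(0, z)

def tanh(z):
    # squashes into (-1, 1), zero-centered
    return np.tanh(z)

def sigmoid(z):
    # squashes into (0, 1), handy for output probabilities
    return 1 / (1 + np.exp(-z))

z = np.array([-2.0, 0.0, 2.0])
print(relu(z))     # [0. 0. 2.]
print(tanh(z))
print(sigmoid(z))
```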
Cost
The loss function measures the difference between the actual output and the predicted output of the model, i.e. y vs. ŷ.
Although loss is sometimes also referred to as cost, they are not the same thing. The cost function is the average loss over the complete training dataset, Y.
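A toy illustration of the loss vs. cost distinction. I use a squared-error loss here just to keep the example short (the course itself works with the logistic loss); the values of ŷ and y are made up:

```python
import numpy as np

y_hat = np.array([0.9, 0.2, 0.8])   # predicted outputs ŷ
y = np.array([1.0, 0.0, 1.0])       # actual outputs y

loss = (y_hat - y) ** 2       # one loss value per training example
cost = np.mean(loss)          # cost J: the average loss over the whole set
print(loss, cost)             # [0.01 0.04 0.04] 0.03
```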
Derivatives (dx)
Collected from a note I found useful on the forum, posted by BurntCalcium (nick), another student:
Basically if f is a function of x, you're taking a ratio of the *change in f* to the *change in x*, given that the latter is an infinitesimally small quantity. The 'd' that is used while writing the notation represents the Greek letter Δ (Delta), which is commonly used to show change in a quantity in physics and math. So basically dx would mean the change in x, df(x) would mean the change in f(x), and df(x)/dx as a whole is called the derivative of f(x) with respect to x. And of course, in the course the instructors have adopted the notation that dx represents df(x)/dx, however outside the context of this course dx would simply mean change in x.
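The idea above can be checked numerically: take a tiny change in x, measure the resulting change in f, and divide. This sketch uses f(x) = x², whose exact derivative is 2x:

```python
def f(x):
    return x ** 2

x = 3.0
dx = 1e-6                 # an (almost) infinitesimally small change in x
df = f(x + dx) - f(x)     # the resulting change in f(x)
derivative = df / dx      # approximates df(x)/dx; exact answer is 2x = 6
print(derivative)
```

The printed value is very close to 6, and gets closer as dx shrinks (until floating-point rounding kicks in).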
Reference Deep Learning on Coursera
See also Capsule Archives
 Capsule Home
Want more?
Comment on one of my posts, talk to me, say:
 Subscribe to the Capsule's Feed
 Check out my projects on GitHub
 Check out my projects on SourceHut
Join Geminispace
Gemini is a new Internet protocol introduced in 2019, as an alternative to http(s) or gopher, for lightweight text content and better privacy.
Not sure how, but want to be part of the club? See:
 Gemini quick start guide
Already have a Gemini client?
 Navigate this capsule via Gemini
© aprates.dev, 2021 - content on this site is licensed under
 Creative Commons BY-NC-SA 4.0 License
 Proudly built with GemPress