DeepMind UCL Deep Learning Online Course 2 Summary

History

Neural networks were originally inspired by the study of the human brain, which contains a very large number of biological neurons, as shown in the figure above. In the biological neuron model, electrical signals flow through the dendrites into the cell body, which acts as a data-processing center. When the stimulus is strong enough, the cell body sends a weak electrical signal along the cable-like axon to the synaptic terminals. A neuron has multiple dendrites and a single axon, and signals pass from one neuron to the next across the synaptic gap. Modeling the biological neuron yields the artificial neuron; the four elements of each model correspond as follows.

| Biological model | Artificial neuron |
| --- | --- |
| Cell body | Cell body |
| Axon | Output channels |
| Dendrites | Input channels |
| Synapse | Weights |

McCulloch and Pitts developed the first artificial neuron model, based on biological neural networks, in their 1943 paper [A Logical Calculus of the Ideas Immanent in Nervous Activity](https://www.cs.cmu.edu/~epxing/Class/10715/reading/McCulloch.and.Pitts.pdf).

Neuronal Models :

Machine learning

Deep learning is a subset of machine learning. The following is a brief overview of machine learning algorithms.

Supervised learning

Algorithms:

k-Nearest-Neighbors
Linear Regression
Logistic Regression
Support Vector Machine
Decision Trees
Random Forests
Neural networks

Unsupervised learning

Algorithms:

Clustering
- K-Means
- DBSCAN
- HCA (Hierarchical Cluster Analysis)
Anomaly detection and novelty detection
- One Class SVM
- Isolation Forests
Visualization and dimensionality reduction
- PCA (principal component analysis)
- Kernel PCA
- Locally Linear Embedding (LLE)
- t-distributed stochastic neighbor embedding (t-SNE)
Association rule learning
- Apriori
- Eclat

Neural network architecture

A neural network can be viewed as a computational graph. The more commonly used architectures are listed briefly below:

DNN (deep neural networks)

CNN (Convolutional neural network)

RNN (Recurrent neural network)

AutoEncoders

Deep Belief Networks (DBNs)
Restricted Boltzmann Machines (RBMs)
Emergent Architectures (EAs)
- Deep Spatio-Temporal Neural Network
- Multi-Dimensional Recurrent Neural Network
- Convolutional AutoEncoders

Artificial Neural Network Algorithms

An artificial neural network layer is computed as y = wx + b, where y is the output, x is the input, b is the bias, and w is the matrix of connection weights between neurons.

Above we defined a single-layer neural network with an input vector of size 5, an output vector of size 10, a 10×5 weight matrix, and a bias vector of size 10.

In the calculation above, we feed the vector {1, 2, 3, 4, 5} into the network and obtain the output {-0.808207, -1.06912, -2.05227, 3.40917, 2.20901, -3.62218, 5.72408, -0.997228, 2.31316, 1.08206}.

Extracting the weights and biases from the network lets us verify that it computes wx + b = y: the value of linearLayerWeights . input + linearLayerBias matches the output shown above.
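The check described above can be sketched as follows. This is a minimal sketch, not the course's original code: it assumes a freshly initialized 5-to-10 linear layer, so the weights are random and the numbers will differ from the output quoted above. The names linearLayerWeights and linearLayerBias follow the text; net, input, output, and manual are illustrative.

```wolfram
(* Assumption: a randomly initialized layer; values change from run to run. *)
net = NetInitialize[LinearLayer[10, "Input" -> 5]];

input = {1, 2, 3, 4, 5};
output = net[input];  (* forward pass through the layer *)

(* NetExtract returns the parameter arrays; Normal turns them into plain lists. *)
linearLayerWeights = Normal[NetExtract[net, "Weights"]];  (* 10 x 5 matrix *)
linearLayerBias = Normal[NetExtract[net, "Biases"]];      (* length-10 vector *)

(* Recompute y = w x + b by hand. *)
manual = linearLayerWeights . input + linearLayerBias;

(* Compare with a tolerance, since the network evaluates in single precision. *)
Max[Abs[output - manual]] < 10^-4
```

If the layer really computes wx + b, the last line evaluates to True.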

Gradient descent algorithm

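About the gradient calculation:

For a function f: R -> R, the derivative is again a function R -> R.
For a function f: R^(n) -> R, the gradient is a function R^(n) -> R^(n), that is, a vector field.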

Vector Field:

Let z = f(x,y) = x^2 + y^2

Mathematica code is as follows:

f[x_, y_] = x^2 + y^2;
Plot3D[f[x, y], {x, -3, 3}, {y, -3, 3}, PlotStyle -> Gray]

Compute the f(x,y) gradient:

grad = Grad[f[x, y], {x, y}] (* output: {2 x, 2 y} *)
VectorDensityPlot[grad, {x, -3, 3}, {y, -3, 3},
 PlotLegends -> Automatic, ColorFunction -> "Rainbow"]

The gradient vector field assigns to every point (x, y) the vector of partial derivatives (∂f/∂x, ∂f/∂y), which for this function is (2x, 2y).

The vector field of the gradient of our defined function is shown below, with the colors representing the length of the vectors:

From the graph we can see how the gradient of the function f(x,y) changes at each point as x,y changes.

Find the minimum value based on the gradient:

(* Calculate the gradient *)
grad = Grad[f[x, y], {x, y}]
(* {2 x, 2 y} *)

(* Solve grad == 0: the solutions are the critical points, where the function may have a maximum, a minimum, or a saddle point *)
sol = NSolve[grad == {0, 0}, {x, y}, Reals]
(* {{x -> 0, y -> 0}} *)

(* Differentiate the gradient once more to get the Hessian matrix of second derivatives: positive definite means a minimum, negative definite means a maximum *)
dGrad = Grad[grad, {x, y}]
(* {{2, 0}, {0, 2}} *)

Show[
 Plot3D[f[x, y], {x, -3, 3}, {y, -3, 3},
  PlotStyle -> Opacity[.4]],
 Graphics3D[{PointSize[.04], Red, Point[{x, y, f[x, y]}] /. sol}]]

The function has its minimum at the point {x -> 0, y -> 0, z -> 0}.

Computing the gradient of a function and the derivative of the gradient (the Hessian matrix of second derivatives) lets us locate and classify its extreme points. Neural networks trained by gradient descent change the weights iteratively, one small step at a time against the gradient, so that the loss function approaches a minimum.
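The iterative update can be sketched on the example function f(x, y) = x^2 + y^2. This is only an illustration: the step size eta, the starting point, and the number of steps are arbitrary choices rather than values from the course, and gradF, step, and path are hypothetical names.

```wolfram
(* The example function and its gradient {2 x, 2 y}, as computed above. *)
f[x_, y_] := x^2 + y^2;
gradF[{x_, y_}] := {2 x, 2 y};

eta = 0.1;  (* assumed step size (learning rate) *)

(* One gradient descent step: move against the gradient. *)
step[p_] := p - eta*gradF[p];

(* Apply the step 50 times, starting from an arbitrary point. *)
path = NestList[step, {3., 2.}, 50];

(* The final point should lie very close to the minimum at {0, 0}. *)
Last[path]

(* The function value should decrease at every step. *)
OrderedQ[Reverse[f @@@ path]]
```

For this function each step multiplies both coordinates by 1 - 2 eta = 0.8, so the iterates shrink geometrically toward the origin.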

Using the Taylor expansion, we can show that when the variable x is updated by a small step in the direction of the negative gradient, f(x) decreases and gradually approaches a minimum.
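The argument can be written out as a short sketch. The step size \eta > 0 is a symbol introduced here for illustration, and the approximation only holds when \eta is small enough for the first-order term to dominate.

```latex
\[
\begin{aligned}
f(x + \Delta x) &\approx f(x) + \nabla f(x)^{\top} \Delta x
  && \text{(first-order Taylor expansion)} \\
\Delta x &= -\eta \, \nabla f(x)
  && \text{(step against the gradient)} \\
f\bigl(x - \eta \nabla f(x)\bigr) &\approx f(x) - \eta \,\lVert \nabla f(x) \rVert^{2} \;\le\; f(x)
\end{aligned}
\]
```

The value of f therefore does not increase, and it strictly decreases wherever the gradient is nonzero.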

Why use an activation function

To see why an activation function is needed, first look at the XOR (exclusive-or) operation:

{1,1} -> 0
{0,0} -> 0
{1,0} -> 1
{0,1} -> 1

We need to train a model that predicts the result of XOR.

First model: a single linear layer with no hidden layer:

data = {{1, 1} -> 0, {0, 0} -> 0, {1, 0} -> 1, {0, 1} -> 1}
net = NetChain[{LinearLayer[1]}]
trained = NetTrain[net, data]
trained[#] & /@ Keys[data]

(* Output: {0.5, 0.5, 0.5, 0.5} *)

We can see that the model is unable to learn the simple XOR rule.

Second model: add a hidden layer:

net = NetChain[{LinearLayer[16], 1}]

(* Output {0.500014, 0.500002, 0.500009, 0.500006} *)

This model still cannot predict XOR.

Third model: add another hidden layer:

net = NetChain[{LinearLayer[16], LinearLayer[16], 1}]

(* Output: {0.499969, 0.499997, 0.499982, 0.499984} *)

This model clearly fails as well. Every layer so far computes a linear map y = wx + b, and a stack of linear layers is still a single linear map, which cannot represent the XOR operation. The network therefore needs a nonlinearity, such as the ReLU (rectified linear unit) activation function, called Ramp in Mathematica, applied to the hidden layer:

net = NetChain[{LinearLayer[16], Ramp, 1}]

(* Output: {4.38886*10^-8, 2.32831*10^-10, 1., 1.} *)

After adding the nonlinearity, the network represents the XOR operation correctly.
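The reason the purely linear models fail, and why one nonlinearity is enough, can be made explicit. The symbols W_1, W_2, b_1, b_2, w_1, w_2 are introduced here for illustration, and the hand-built ReLU solution in the last line is just one possible choice; a trained network will generally find different weights.

```latex
\[
\begin{aligned}
&\text{Two linear layers collapse into one:} \\
&\qquad W_2 (W_1 x + b_1) + b_2 = (W_2 W_1)\, x + (W_2 b_1 + b_2) \\[4pt]
&\text{A linear model } y = w_1 x_1 + w_2 x_2 + b \text{ would have to satisfy:} \\
&\qquad b = 0, \qquad w_1 + b = 1, \qquad w_2 + b = 1, \qquad w_1 + w_2 + b = 0 \\
&\text{The first three equations give } w_1 + w_2 + b = 2 \neq 0, \text{ a contradiction.} \\[4pt]
&\text{With } \operatorname{ReLU}(z) = \max(0, z) \text{ an exact solution exists:} \\
&\qquad y = \operatorname{ReLU}(x_1 + x_2) - 2\,\operatorname{ReLU}(x_1 + x_2 - 1)
\end{aligned}
\]
```

Substituting the four inputs gives 0 for {0,0}, 1 for {1,0} and {0,1}, and 2 - 2 = 0 for {1,1}.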

Mathematical background for backpropagation

First, we introduce the chain rule from calculus:
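For reference, the scalar chain rule reads as follows. The function names f and g and the worked example are illustrative.

```latex
\[
y = f(u), \quad u = g(x)
\quad\Longrightarrow\quad
\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}
\]
\[
\text{Example: } y = (3x + 1)^2,\; u = 3x + 1
\quad\Longrightarrow\quad
\frac{dy}{dx} = 2u \cdot 3 = 6\,(3x + 1)
\]
```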

Vector notation:
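In vector form the derivatives become Jacobian matrices and the product becomes a matrix product. The dimensions n, m, k are symbols introduced here for illustration.

```latex
\[
x \in \mathbb{R}^{n}, \quad u = g(x) \in \mathbb{R}^{m}, \quad y = f(u) \in \mathbb{R}^{k}
\]
\[
\underbrace{\frac{\partial y}{\partial x}}_{k \times n}
= \underbrace{\frac{\partial y}{\partial u}}_{k \times m}\,
  \underbrace{\frac{\partial u}{\partial x}}_{m \times n},
\qquad
\frac{\partial y_i}{\partial x_j}
= \sum_{l=1}^{m} \frac{\partial y_i}{\partial u_l}\,\frac{\partial u_l}{\partial x_j}
\]
```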

Activation function

Sigmoid
ReLU
Tanh
Softmax
Cross entropy (a loss function rather than an activation, usually paired with softmax)
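The standard definitions of these functions, together with the derivatives that backpropagation needs, are collected below. The symbols z (pre-activation), p (softmax output), and t (target distribution, summing to 1) are introduced here for illustration.

```latex
\[
\begin{aligned}
\text{Sigmoid:} \quad & \sigma(z) = \frac{1}{1 + e^{-z}},
  & \sigma'(z) &= \sigma(z)\,\bigl(1 - \sigma(z)\bigr) \\
\text{ReLU:} \quad & \operatorname{ReLU}(z) = \max(0, z),
  & \operatorname{ReLU}'(z) &= \begin{cases} 1, & z > 0 \\ 0, & z < 0 \end{cases} \\
\text{Tanh:} \quad & \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}},
  & \tanh'(z) &= 1 - \tanh^{2}(z) \\
\text{Softmax:} \quad & p_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}} \\
\text{Cross entropy:} \quad & L = -\sum_{i} t_i \log p_i,
  & \frac{\partial L}{\partial z_i} &= p_i - t_i \quad \text{(with softmax)}
\end{aligned}
\]
```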

Derivation of backpropagation

During computation a neural network is really a computational graph. First consider an artificial neural network without hidden layers:

According to the theory of supervised learning, we compute the gap (the error) between the network's output and the target value and try to make it as small as possible. The only parameters we can adjust in the network are the weights and the biases (thresholds), so to minimize the error we adjust them in small steps that make the error smaller and smaller.

The backpropagation derivation process is as follows:
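A minimal sketch of the derivation for the network without hidden layers follows. It rests on assumptions the text does not state: a squared-error loss E, an elementwise activation \varphi (the identity recovers the plain y = wx + b layer), a target vector t, and a learning rate \eta. The symbol \delta is shorthand introduced here.

```latex
\[
\begin{aligned}
&\text{Forward pass:} && z = W x + b, \qquad y = \varphi(z), \qquad
  E = \tfrac{1}{2}\,\lVert y - t \rVert^{2} \\[4pt]
&\text{Error at the output:} && \frac{\partial E}{\partial y} = y - t \\[4pt]
&\text{Chain rule through } \varphi\text{:} &&
  \delta \equiv \frac{\partial E}{\partial z} = (y - t) \odot \varphi'(z) \\[4pt]
&\text{Chain rule through } z = W x + b\text{:} &&
  \frac{\partial E}{\partial W} = \delta\, x^{\top}, \qquad
  \frac{\partial E}{\partial b} = \delta \\[4pt]
&\text{Gradient descent update:} &&
  W \leftarrow W - \eta\, \delta\, x^{\top}, \qquad
  b \leftarrow b - \eta\, \delta
\end{aligned}
\]
```

With hidden layers, the same two chain-rule steps are repeated layer by layer, passing \delta backward from the output toward the input.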