DeepMind UCL Deep Learning Online Course 2 Summary
Foundations of Neural Networks
History
The idea of neural networks originated in the study of the human brain, which contains a large number of biological neurons, as shown in the figure above. In the biological neuron model, electrical signals flow through the dendrites into the cell body, which acts as a data processing center. When the stimulus is strong enough, the cell body sends a weak electrical signal along the cable-like axon to the synaptic terminals. A neuron has multiple dendrites and a single axon, and signaling between neurons is accomplished through a conductive fluid in the synaptic gap. Modeling the biological neuron yields the artificial neuron. The four elements of each model correspond as follows.
Biological neuron:
 Cell body
 Axon
 Dendrites
 Synapse
Artificial neuron:
 Cell body
 Output channels
 Input channels
 Weights
McCulloch and Pitts developed the first artificial neuron model based on biological neural networks. Paper: [A Logical Calculus of the Ideas Immanent in Nervous Activity](https://www.cs.cmu.edu/~./epxing/Class/10715/reading/McCulloch.and.Pitts.pdf)
Neuron models:
Machine learning
Deep learning is a subset of machine learning. The following is a general overview of machine learning.
Supervised learning
Algorithms:

k-Nearest Neighbors

Linear Regression

Logistic Regression

Support Vector Machine

Decision Trees

Random Forests

Neural networks
Unsupervised learning
Algorithms:

Clustering
 K-Means
 DBSCAN
 HCA (Hierarchical Cluster Analysis)

Anomaly detection and novelty detection
 One-Class SVM
 Isolation Forests

Visualization and dimensionality reduction
 PCA (principal component analysis)
 Kernel PCA
 Locally Linear Embedding (LLE)
 t-distributed Stochastic Neighbor Embedding (t-SNE)

Association rule learning
 Apriori
 Eclat
Neural network architecture
A neural network can be viewed as a computation graph. The more commonly used neural network architectures are:
 DNN (deep neural networks)
 CNN (Convolutional neural network)
 RNN (Recurrent neural network)
 AutoEncoders

Deep Belief Networks DBNs

Restricted Boltzmann Machines (RBMs)

Emergent Architectures (EAs)
 Deep Spatio-Temporal Neural Network
 Multi-Dimensional Recurrent Neural Network
 Convolutional AutoEncoders
Artificial Neural Network Algorithms
A layer of an artificial neural network computes y = wx + b, where y is the output neuron, x is the input neuron, b is the bias, and w is the connection weight between neurons.
As an example, consider a single-layer neural network with an input vector of size 5, an output vector of size 10, a 10*5 weight matrix, and a bias vector of size 10.
Feeding the input vector {1,2,3,4,5} into this network gives the output {0.808207, 1.06912, 2.05227, 3.40917, 2.20901, 3.62218, 5.72408, 0.997228, 2.31316, 1.08206}
Extracting the weights and biases from the network, we can verify that it computes wx + b = y: the dot product of linearLayerWeights with the input, plus linearLayerBias, matches the output above.
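The code for this example is not reproduced in the text. A minimal sketch of what it might look like, assuming the Wolfram Language functions `LinearLayer`, `NetInitialize` and `NetExtract` and reusing the names `linearLayerWeights` and `linearLayerBias` from the paragraph above (the weights are randomly initialized, so the numbers will differ from those quoted):

```mathematica
(* Assumed reconstruction: one linear layer mapping 5 inputs to 10 outputs *)
net = NetInitialize[LinearLayer[10, "Input" -> 5]];

input = {1, 2, 3, 4, 5};
output = net[input];   (* vector of length 10 *)

(* Extract the parameters; Normal converts a NumericArray to a plain list *)
linearLayerWeights = Normal[NetExtract[net, "Weights"]];  (* 10 x 5 matrix *)
linearLayerBias = Normal[NetExtract[net, "Biases"]];      (* vector of length 10 *)

(* Check y = w.x + b: the largest deviation from the network output *)
Max[Abs[linearLayerWeights . input + linearLayerBias - output]]
```

The last expression should be zero up to single-precision rounding, which confirms that the layer computes wx + b.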
Gradient descent algorithm
About the gradient calculation:
 R -> R
 R^(n) -> R^(n)
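For reference, the standard definition behind the second case: the gradient of a scalar function of n variables collects its partial derivatives into a vector, so the gradient itself is a map from R^(n) to R^(n). The symbols below are the usual textbook ones, not taken from the original figures.

```latex
\[
\nabla f(x_1, \dots, x_n) =
\left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right)
\]
```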
Vector Field:
Let z = f(x,y) = x^2 + y^2
Mathematica code is as follows:
f[x_, y_] := x^2 + y^2;
Plot3D[f[x, y], {x, -3, 3}, {y, -3, 3}, PlotStyle -> Gray]
Compute the gradient of f(x, y):
grad = Grad[f[x, y], {x, y}] (* output: {2 x, 2 y} *)
VectorDensityPlot[grad, {x, -3, 3}, {y, -3, 3},
 PlotLegends -> Automatic, ColorFunction -> "Rainbow"]
The vector field of gradients can be understood as assigning to every point (x, y) the vector {2x, 2y}, which points in the direction in which f increases fastest.
The vector field of the gradient of our function is shown below, with the colors representing the length of the vectors:
From the plot we can see how the gradient of f(x,y) changes from point to point as x and y change.
Find the minimum using the gradient:
(* Calculate the gradient *)
grad = Grad[f[x, y], {x, y}]
(* {2 x, 2 y} *)
(* Solve grad == 0: where the gradient vanishes, the function may have a maximum or a minimum *)
sol = NSolve[grad == {0, 0}, {x, y}, Reals]
(* {{x -> 0, y -> 0}} *)
(* Differentiate the gradient again to get the matrix of second derivatives (the Hessian). If it is positive definite the point is a minimum; if it is negative definite the point is a maximum *)
dGrad = Grad[grad, {x, y}]
(* {{2, 0}, {0, 2}} *)
(* Plot the surface and mark the critical point in red *)
Show[
 Plot3D[f[x, y], {x, -3, 3}, {y, -3, 3},
  PlotStyle -> Opacity[.4]],
 Graphics3D[{PointSize[.04], Red, Point[{x, y, f[x, y]}] /. sol}]]
This gives the minimum point of the function: {x -> 0, y -> 0, z -> 0}
Computing the gradient of a function and the derivative of the gradient (the second derivatives) lets us locate and classify its extreme points. Neural networks trained with gradient descent change the values of the weights iteratively so that the loss function moves toward a minimum.
Using the Taylor expansion, we can show that when the variable x is updated in the direction of the negative gradient, f(x) gradually approaches the minimum.
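As an illustration of this iterative update, here is a minimal sketch of gradient descent on the same f(x, y) = x^2 + y^2. The learning rate `eta`, the starting point and the helper names `gradF` and `step` are assumptions chosen for the example, not values from the course.

```mathematica
f[x_, y_] := x^2 + y^2;
gradF[{x_, y_}] := {2 x, 2 y};         (* gradient of f, as computed earlier *)

eta = 0.1;                             (* assumed learning rate *)
step[p_] := p - eta*gradF[p];          (* one gradient descent update *)

path = NestList[step, {3., 2.}, 50];   (* 50 updates from an assumed start point *)
Last[path]                             (* final point, close to {0, 0} *)
f @@ Last[path]                        (* final function value, close to 0 *)
```

With these choices each update scales the point by 1 - 2 eta = 0.8, so the iterates shrink geometrically toward the minimum at {0, 0}.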
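The argument can be sketched as follows, using a first-order Taylor expansion and a small step size η > 0 (the symbol η is an assumption here; the original formula is not reproduced in the text):

```latex
\[
f(x + \Delta x) \approx f(x) + \nabla f(x)^{\top} \Delta x
\]
\[
\Delta x = -\eta \, \nabla f(x)
\quad \Longrightarrow \quad
f\bigl(x - \eta \nabla f(x)\bigr) \approx f(x) - \eta \, \lVert \nabla f(x) \rVert^{2} \le f(x)
\]
```

So, as long as η is small enough for the approximation to hold, each step along the negative gradient does not increase f, and it strictly decreases f wherever the gradient is nonzero.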
Why use an activation function
To see why an activation function is needed, first look at the XOR (exclusive or) operation:
{1,1} -> 0
{0,0} -> 0
{1,0} -> 1
{0,1} -> 1
We want to train a model that can predict the result of XOR.
First model: a single linear layer with no hidden layers:
data = {{1, 1} -> 0, {0, 0} -> 0, {1, 0} -> 1, {0, 1} -> 1}
net = NetChain[{LinearLayer[1]}]
trained = NetTrain[net, data]
trained[#] & /@ Keys[data]
(* Output: {0.5, 0.5, 0.5, 0.5} *)
We can see that the model is unable to predict even the simple XOR rule.
The second model adds a hidden layer:
net = NetChain[{LinearLayer[16], 1}]
(* Output: {0.500014, 0.500002, 0.500009, 0.500006} *)
The changed model still cannot predict XOR.
The third model adds another hidden layer:
net = NetChain[{LinearLayer[16], LinearLayer[16], 1}]
(* Output: {0.499969, 0.499997, 0.499982, 0.499984} *)
This model also fails. Every layer so far computes y = wx + b, and stacking such layers gives w2(w1 x + b1) + b2 = (w2 w1)x + (w2 b1 + b2), which is again a linear model and cannot represent the XOR operation. The network therefore needs a nonlinearity, for example a ReLU (rectified linear unit) activation function on the hidden layer:
net = NetChain[{LinearLayer[16], Ramp, 1}]
(* Output: {4.38886*10^-8, 2.32831*10^-10, 1., 1.} *)
After adding the nonlinear property, our network can correctly express the XOR operation.
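To see concretely why a ReLU makes XOR expressible, here is a small hand-built network with two hidden units. The weights are picked by hand for illustration and are not what `NetTrain` learned. The names `relu` and `xorNet` are hypothetical.

```mathematica
relu[z_] := Ramp[z];   (* Ramp is the Wolfram Language name for ReLU *)

(* Hidden units: h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1); output: h1 - 2 h2 *)
xorNet[{x1_, x2_}] := relu[x1 + x2] - 2 relu[x1 + x2 - 1];

xorNet /@ {{1, 1}, {0, 0}, {1, 0}, {0, 1}}
(* {0, 0, 1, 1} *)
```

The second hidden unit only switches on when both inputs are 1, which is exactly the case a purely linear model cannot separate from the others.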
Mathematical derivation of backpropagation
First, we introduce the chain rule from calculus:
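The original formula is not reproduced in the text. The standard scalar statement, for a composition y = f(u) with u = g(x), is:

```latex
\[
\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}
\]
```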
Vector notation:
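A sketch of the usual vector form, assuming x is a vector of n components, u = g(x) has m components, and y = f(u) is a scalar (the symbols are the standard ones, not taken from the original figure):

```latex
\[
\frac{\partial y}{\partial x_i} = \sum_{j=1}^{m} \frac{\partial y}{\partial u_j} \, \frac{\partial u_j}{\partial x_i}
\qquad \text{or, in matrix form,} \qquad
\nabla_x y = \left( \frac{\partial u}{\partial x} \right)^{\top} \nabla_u y
\]
```

Here ∂u/∂x is the m × n Jacobian matrix of g.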
Activation function

Sigmoid

ReLU

tanh

softmax

cross entropy (a loss function rather than an activation, commonly paired with softmax)
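For reference, the standard definitions of the functions listed above, with the derivatives that backpropagation needs. The symbols z, t and p are assumptions: z is a pre-activation, t a target distribution and p the predicted distribution.

```latex
\[
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr)
\]
\[
\mathrm{ReLU}(z) = \max(0, z), \qquad
\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}, \qquad \tanh'(z) = 1 - \tanh^{2}(z)
\]
\[
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{k} e^{z_k}}, \qquad
L_{\text{cross entropy}} = -\sum_{i} t_i \log p_i
\]
```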
Derivation of backpropagation
A neural network is evaluated as a computation graph. First consider an artificial neural network without hidden layers:
In supervised learning we measure the gap between the computed value and the target value and try to make that gap as small as possible. The only parameters we can adjust in the network are the weights and the biases, so to minimize the error we adjust them in small steps that make the error smaller and smaller.
The backpropagation derivation process is as follows:
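The original derivation is not reproduced in the text. A sketch under stated assumptions: a squared-error loss E, targets t_i, an activation function f, and the notation y (output), x (input), w (weight), b (bias) used earlier.

```latex
\[
z_i = \sum_{j} w_{ij} x_j + b_i, \qquad y_i = f(z_i), \qquad
E = \frac{1}{2} \sum_{i} (y_i - t_i)^2
\]
\[
\frac{\partial E}{\partial w_{ij}}
= \frac{\partial E}{\partial y_i} \cdot \frac{\partial y_i}{\partial z_i} \cdot \frac{\partial z_i}{\partial w_{ij}}
= (y_i - t_i) \, f'(z_i) \, x_j
\]
\[
\delta_i := \frac{\partial E}{\partial z_i} = (y_i - t_i) \, f'(z_i), \qquad
\frac{\partial E}{\partial w_{ij}} = \delta_i \, x_j, \qquad
\frac{\partial E}{\partial b_i} = \delta_i
\]
```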
The final adjustment of the weights is calculated as follows:
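In the usual gradient descent form, with E the error, w_ij the weight from input j to output i, b_i the bias, and η an assumed symbol for the learning rate, the update reads:

```latex
\[
w_{ij} \leftarrow w_{ij} - \eta \, \frac{\partial E}{\partial w_{ij}}, \qquad
b_i \leftarrow b_i - \eta \, \frac{\partial E}{\partial b_i}
\]
```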
If x_j is a unit in a hidden layer of the neural network, as shown below:
then the derivation of the backpropagation process is shown below:
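A sketch of the standard argument. The symbols below are assumptions, since the original figure is missing: δ_i = ∂E/∂z_i is the error signal of output unit i with z_i = Σ_j w_ij x_j + b_i, the hidden unit is x_j = f(u_j) with u_j = Σ_k v_jk a_k + c_j, a_k are the inputs to the hidden layer, and v_jk and c_j are its weights and biases.

```latex
\[
\frac{\partial E}{\partial x_j}
= \sum_{i} \frac{\partial E}{\partial z_i} \cdot \frac{\partial z_i}{\partial x_j}
= \sum_{i} \delta_i \, w_{ij}
\]
\[
\delta_j^{\text{hidden}} := \frac{\partial E}{\partial u_j}
= \Bigl( \sum_{i} \delta_i \, w_{ij} \Bigr) f'(u_j), \qquad
\frac{\partial E}{\partial v_{jk}} = \delta_j^{\text{hidden}} \, a_k, \qquad
\frac{\partial E}{\partial c_j} = \delta_j^{\text{hidden}}
\]
```

The error signal of a hidden unit is the weighted sum of the error signals of the units it feeds, which is why the computation runs backward through the network.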
Finally, we adjust the weights of layer j as follows:
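Writing v_jk for the weight from input k into hidden unit j, c_j for its bias, and η for the learning rate (all assumed symbols), the update has the same gradient descent form as for the output layer:

```latex
\[
v_{jk} \leftarrow v_{jk} - \eta \, \frac{\partial E}{\partial v_{jk}}, \qquad
c_j \leftarrow c_j - \eta \, \frac{\partial E}{\partial c_j}
\]
```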
Backpropagation pseudocode
Single-layer neural network computation:
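The original pseudocode is not reproduced in the text. A minimal sketch of one training step for a single layer, assuming a sigmoid activation and a squared-error loss. The names `sigmoid` and `singleLayerStep` and the learning rate `eta` are hypothetical.

```mathematica
sigmoid[z_] := 1/(1 + Exp[-z]);

(* One gradient descent step for y = sigmoid(w.x + b) with E = 1/2 |y - t|^2 *)
singleLayerStep[{w_, b_}, x_, t_, eta_] := Module[{y, delta},
  y = sigmoid[w . x + b];          (* forward pass *)
  delta = (y - t)*y*(1 - y);       (* error signal dE/dz for each output unit *)
  {w - eta*Outer[Times, delta, x], (* dE/dw_ij = delta_i x_j *)
   b - eta*delta}];                (* dE/db_i = delta_i *)

(* Usage: 2 inputs, 1 output, assumed starting parameters *)
singleLayerStep[{{{0.1, -0.2}}, {0.}}, {1, 0}, {1}, 0.5]
```

The function returns the updated pair {w, b}, so it can be applied repeatedly, for example with `Nest` or `Fold` over the training samples.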
Multilayer perceptron backward computation:
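Again the original pseudocode is missing, so what follows is a sketch under assumptions: one hidden layer, sigmoid activations in both layers and a squared-error loss. The function names `forward`, `backward` and `sgdStep` are hypothetical.

```mathematica
sigmoid[z_] := 1/(1 + Exp[-z]);

(* Forward pass: returns the hidden and output activations *)
forward[{w1_, b1_, w2_, b2_}, x_] := Module[{h, y},
  h = sigmoid[w1 . x + b1];
  y = sigmoid[w2 . h + b2];
  {h, y}];

(* Backward pass: gradients of E = 1/2 |y - t|^2 for w1, b1, w2, b2 *)
backward[{w1_, b1_, w2_, b2_}, x_, t_] := Module[{h, y, d2, d1},
  {h, y} = forward[{w1, b1, w2, b2}, x];
  d2 = (y - t)*y*(1 - y);                (* output error signal *)
  d1 = (Transpose[w2] . d2)*h*(1 - h);   (* error propagated back to the hidden layer *)
  {Outer[Times, d1, x], d1, Outer[Times, d2, h], d2}];

(* One gradient descent update of all parameters *)
sgdStep[params_, x_, t_, eta_] := params - eta*backward[params, x, t];

(* Usage: 2 inputs, 3 hidden units, 1 output, assumed random start *)
SeedRandom[1];
params = {RandomReal[{-1, 1}, {3, 2}], {0., 0., 0.},
          RandomReal[{-1, 1}, {1, 3}], {0.}};
Do[params = sgdStep[params, {1, 0}, {1}, 0.5], {100}];
Last[forward[params, {1, 0}]]
```

Repeating the update on a single sample should move the network output for that sample toward its target. Training on all four XOR samples would cycle through them in the same way.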