A Deep Look @ Deep Learning — Part 02
Welcome back to the discussion on deep learning. In the previous story, we got a basic idea of what deep learning is. You can find it here. We now know that deep learning is a process in which a large neural network learns to do something specific. This time, we will focus on the building blocks of neural networks, the neurons, and on the mathematics behind the scenes to understand the concept better.
Since we want to take a deep look at deep learning, it makes sense to start the discussion from the lowest point, where it all begins. A good starting point is one of the major branches of machine learning, supervised learning. We will touch on only some key points and try to lay the foundation here.
Supervised machine learning
Machine learning can be classified into three major branches: supervised learning, unsupervised learning and reinforcement learning. All three branches have something in common. They all predict something based on some observations.
There are algorithms that can identify patterns in the observations and learn from them. We will only focus on supervised learning algorithms at the moment. For this type of algorithm, we need to supply examples, or simply data, for it to observe and learn from. This process is called predictive modeling. Ultimately, we gain a superpower: a machine learning model that has learned a way to predict something useful for unseen data.
We will take some scenarios and dive deep into the topic.
The nature of the required predictions can vary. Suppose we have collected weather data such as wind-speed, humidity and temperature for each hour of every day over several months or years.
If we wanted to predict the temperature at a given instance from the wind-speed and humidity, we would need to find a mapping between the temperature data points and the other data points (wind-speed, humidity) from the existing observations. This mapping could be a kind of equation.
Sometimes we might need to classify a combination of data points into several classes. For example, whether it will rain in the evening or not when the wind-speed, humidity and temperature of the morning have values within some ranges. If we know the number of instances with a particular condition and how many of them were followed by rain, we can use that data to predict the probability of rain for that condition. If the probability of rain is above a threshold, we say it will rain, and otherwise not.
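The counting idea described above can be sketched in a few lines of Python. This is only a rough illustration: the counts are made-up numbers, and the function names and the threshold of 0.6 are assumptions chosen for the example, not values taken from real weather data.

```python
def rain_probability(rainy_count, total_count):
    """Estimate the probability of rain as the fraction of past instances where it rained."""
    if total_count == 0:
        raise ValueError("no observations for this condition")
    return rainy_count / total_count


def will_rain(probability, threshold=0.6):
    """Turn a probability into a yes/no prediction by comparing it with a threshold."""
    return probability > threshold


# Hypothetical counts: 60 mornings matched the condition, 45 of them were followed by rain.
p = rain_probability(rainy_count=45, total_count=60)
print(p, will_rain(p))
```

The first function is a regression-like step that produces a continuous value, and the second turns it into one of two classes.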
We can observe that for the combinations of wind-speed and temperature above the green surface, it rained more than sixty percent of the time. So we can consider the surface as a boundary to decide whether it will rain or not for a given combination. For example, by observing the pattern shown in the image, we can predict whether it will rain at the instance whose combination of wind-speed and temperature is indicated by the red circle. It would not rain. The boundary surface can be represented as a function of wind-speed and temperature.
All these scenarios had some input variables and a predicted output variable. We call the input variables features and the output variable the label (y). The features can be represented as a vector X = (x1, x2, ..., xn), where each x represents a single feature such as wind-speed.
The above scenarios illustrate the two sub-branches of supervised learning: regression and classification. Regression problems predict continuous values. The temperature or the probability of rain can take continuous values. After the threshold comparison, the output takes discrete values, rained or not, which are the classes. When the problem is to classify something into two classes, it is a binary classification problem. If there are more than two classes, it is a multi-class classification problem.
In every scenario, all we did was find an approximate mapping between the features and the label. The function, or the model, that makes the prediction was supervised by the examples, and that is why the approach is called supervised learning.
The nature of the approximation function
The approximation function represents a connection between the features and the label. When the problem is a regression problem, this function is used to calculate the output from the inputs. When the problem is a classification problem, this function is used as a decision boundary. Depending on the number of features involved, the boundary can be a straight line, a curved line, a plane, a parabolic surface or even a more advanced form that cannot be represented visually.
If the function takes the form y = w1*x1 + w2*x2 + ... + wn*xn + b, where y represents the label, each x a feature, each w a weight, and b a bias, we call it a linear function. All functions that do not fit this form are non-linear functions.
If a linear function can serve as a decision boundary that separates the data points, as in the following figure, we call the data linearly separable.
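As a minimal sketch, a linear function of this kind can be written in Python as a weighted sum plus a bias. The name linear and the numbers used for the weights, bias and features are hypothetical and only serve the illustration.

```python
def linear(x, w, b):
    """Return w1*x1 + ... + wn*xn + b for a list of features x and a list of weights w."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# Two hypothetical features (for example wind-speed and humidity) with made-up weights.
y = linear(x=[10.0, 0.5], w=[0.5, 4.0], b=1.0)
print(y)  # 0.5*10.0 + 4.0*0.5 + 1.0 = 8.0
```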
Sometimes, it is not possible to use a single linear function to achieve this goal. Consider the following data sets.
The features are crossed. A single linear function cannot form a decision boundary, but two linear functions can.
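To make this concrete, here is a small sketch that assumes the crossed data looks like the classic XOR layout (four points, where the class is 1 only when exactly one feature is 1). The points and the two lines are assumptions picked for the illustration and are not taken from the figure.

```python
# Four points in an XOR-like ("crossed") layout, mapped to their class labels.
points = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}


def line_a(x1, x2):
    """First linear function: positive above the line x1 + x2 = 0.5."""
    return x1 + x2 - 0.5


def line_b(x1, x2):
    """Second linear function: negative below the line x1 + x2 = 1.5."""
    return x1 + x2 - 1.5


def classify(x1, x2):
    """Class 1 lies in the band between the two lines, class 0 lies outside it."""
    return 1 if line_a(x1, x2) > 0 and line_b(x1, x2) < 0 else 0


for (x1, x2), label in points.items():
    print((x1, x2), "predicted:", classify(x1, x2), "actual:", label)
```

Neither line separates the two classes on its own, but the band between them does.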
When the data becomes more complicated to classify, more linear functions might be required.
As a solution, we could transform the linear space into a non-linear space where the data points become linearly separable. Non-linear functions come into play here, since they are required to achieve this transformation. We will have a look at some non-linear functions in the coming sections. The following figure helps to get an idea of transforming the linear space into a non-linear space.
With this basic understanding and intuition, we will now try to understand how this is achieved with neural networks by exploring the functionality of a single neuron.
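One way to picture such a transformation is to add a non-linear feature to each point. The sketch below assumes XOR-like data and uses the product x1*x2 as the extra feature. Both the data and the chosen transformation are assumptions for illustration, and other transformations would work as well.

```python
def transform(x1, x2):
    """Map a 2-D point to 3-D by appending the non-linear feature x1 * x2."""
    return (x1, x2, x1 * x2)


def linear_boundary(z):
    """A single linear function in the transformed space: z1 + z2 - 2*z3 - 0.5."""
    z1, z2, z3 = z
    return z1 + z2 - 2 * z3 - 0.5


def classify_transformed(x1, x2):
    """Classify with one linear boundary after the transformation."""
    return 1 if linear_boundary(transform(x1, x2)) > 0 else 0


# XOR-like points that no single straight line can separate in the original 2-D space.
xor_points = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

for (x1, x2), label in xor_points.items():
    print((x1, x2), "predicted:", classify_transformed(x1, x2), "actual:", label)
```

In the transformed space a single linear function is enough, even though the original two features alone were not linearly separable.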
The functionality of a single neuron
From the previous discussion, we know that a neuron is an implementation of the all-or-none law. We will use the following figure to refresh our memory.
As we already know, there are multiple input terminals and a single output terminal. All the input values are weighted and summed together. We also add a bias to the sum. The bias gives the function more freedom to vary, which we will explore later.
Now we will convert this story into mathematics by writing equations that simulate the functionality of a single neuron. We will use the lowercase letters x to denote an input value, w to denote a weight, b for the bias and y for the output. There are n weighted inputs, which are summed together, and the bias is added. If the sum is more than a threshold value, an output is generated, and otherwise not. This is also called the activation of the neuron.
We use the capital letters X and W to represent the vectors of inputs and weights, since the vector multiplication gives exactly the same result and makes the calculation easier and faster in later work. Hence the weighted sum can be written compactly as W · X + b, and the neuron produces an output when this value exceeds the threshold.
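The description above can be simulated with a short Python sketch. The function names, the default threshold of 0 and the example numbers are assumptions made for illustration.

```python
def weighted_sum(X, W, b):
    """Compute W . X + b, the sum of the weighted inputs plus the bias."""
    return sum(w * x for w, x in zip(W, X)) + b


def step_neuron(X, W, b, threshold=0.0):
    """Fire (return 1) if the weighted sum exceeds the threshold, otherwise return 0."""
    return 1 if weighted_sum(X, W, b) > threshold else 0


# Hypothetical inputs, weights and bias.
X = [1.0, 0.0, 1.0]
W = [0.5, -1.0, 0.25]
b = -0.5

print(weighted_sum(X, W, b))  # 0.5 + 0.0 + 0.25 - 0.5 = 0.25
print(step_neuron(X, W, b))   # 0.25 > 0, so the neuron fires: 1
```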
As we learned, the inner part before the activation is in the form of a linear function. If the activation function is also linear, the whole neuron produces a linear function. A network of such neurons is then only a composition of linear functions, which is again a linear function.
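We can check this collapse with a tiny sketch: two neurons with a linear (identity) activation chained one after the other behave exactly like one linear neuron. The single-input setup and the weights are hypothetical values chosen to keep the arithmetic easy to follow.

```python
def layer(x, w, b):
    """A single-input neuron with a linear (identity) activation: the output is w*x + b."""
    return w * x + b


# Made-up weights and biases for two neurons in sequence.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5


def two_layers(x):
    """Feed the output of the first neuron into the second one."""
    return layer(layer(x, w1, b1), w2, b2)


# w2*(w1*x + b1) + b2 = (w2*w1)*x + (w2*b1 + b2), which is one linear neuron.
w_eq = w2 * w1
b_eq = w2 * b1 + b2

for x in [-1.0, 0.0, 2.5]:
    print(x, two_layers(x), layer(x, w_eq, b_eq))
```

Both columns print the same values, so stacking linear neurons adds no expressive power.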
Although it would be enough for a linearly separable problem, it would not be suitable for a complicated linearly inseparable problem. We need to introduce some non-linearity to the neurons in that case. The easiest way to do so is by making the activation non-linear.
To make the activation non-linear, we use a non-linear function as the activation function. For that, we will investigate the characteristics of some popular non-linear functions used for this purpose.
Popular non-linear activation functions
All functions that are not in the linear form are non-linear functions. However, an activation function should have certain characteristics that are important for efficient computation in neural networks. Here, we will focus only on some of the popular non-linear functions used in the deep-learning era and their characteristics.
The equations in the following sections use a mathematical constant e, called Euler's number, which is approximately equal to 2.71828. We also need a basic idea of the differentiation of a function. Simply put, it is finding the slope, or the gradient, of the function at a point, and we will discuss it in the coming stories.
The sigmoid function has the form f(x) = 1 / (1 + e^(-x)), and it is a somewhat computationally expensive equation because of the exponential.
The shape of the curve produced is an S-shaped curve or a sigmoid curve as the name implies.
The value of the function always lies between zero (0) and one (1) for all real numbers. The output flattens towards the edges of the curve and moves close to one or zero as x goes above two (x>2) or below minus two (x<-2). For example, the output is about 0.88 at x=2 and about 0.993 at x=5. This characteristic helps to predict y clearly.
The function is differentiable. Its gradient is smooth, but it nearly vanishes for high and low values of x, which can cause a neural network to get stuck during training. This is known as the vanishing gradient problem.
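A short sketch of the sigmoid function and its gradient shows both properties. It uses the standard identity that the gradient of the sigmoid equals f(x) * (1 - f(x)). The function names and the sample inputs are our own choices, and this simple version is not protected against overflow for very large negative inputs.

```python
import math


def sigmoid(x):
    """Sigmoid function: 1 / (1 + e^(-x)), always between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_gradient(x):
    """Gradient of the sigmoid, computed as sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in [-10, -2, 0, 2, 10]:
    print(x, round(sigmoid(x), 4), round(sigmoid_gradient(x), 6))
```

The gradient is largest at x=0, where it equals 0.25, and shrinks towards zero at both ends of the curve.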
TanH function ( Hyperbolic tangent )
This is also similar to the Sigmoid function, but its value lies between minus one (-1) and plus one (+1). It is a zero-centered and differentiable function.
The TanH function, or hyperbolic tangent, is derived from the hyperbolic sine and hyperbolic cosine functions: tanh(x) = sinh(x) / cosh(x) = (e^x - e^(-x)) / (e^x + e^(-x)).
There are advantages to using the TanH function over the Sigmoid function since its gradient is stronger: the maximum gradient of TanH is 1 at x=0, while that of Sigmoid is 0.25.
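The following sketch builds TanH from the hyperbolic sine and cosine and compares its gradient with the gradient of the Sigmoid function. It relies on the standard identities that the gradient of tanh(x) is 1 - tanh(x)^2 and the gradient of the sigmoid is f(x) * (1 - f(x)). The function names and the sample inputs are assumptions for the example.

```python
import math


def tanh_from_definition(x):
    """Hyperbolic tangent computed as sinh(x) / cosh(x)."""
    return math.sinh(x) / math.cosh(x)


def tanh_gradient(x):
    """Gradient of tanh, computed as 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t


def sigmoid_gradient(x):
    """Gradient of the sigmoid, included here only for comparison."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


for x in [-2.0, 0.0, 2.0]:
    print(x, round(tanh_from_definition(x), 4),
          round(tanh_gradient(x), 4), round(sigmoid_gradient(x), 4))
```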
ReLU (Rectified Linear Unit) and Leaky ReLU
The ReLU function is the most used activation function these days. It has the form f(x) = max(0, x). We will have a look at its characteristics next.
The value of the function lies between zero (0) and infinity (∞). Neural networks using the ReLU activation function can be trained faster since the function is cheap to compute and tends to converge more quickly.
The ReLU function is differentiable everywhere except at x=0, where a gradient of zero or one is chosen by convention. However, for values of x below zero the output is zero and the gradient is also zero, so the neural network is unable to learn from those negative values. The problem caused by this characteristic is called the Dying ReLU problem.
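A minimal sketch of ReLU and its gradient makes the behaviour for negative inputs visible. Using a gradient of 0 at exactly x=0 is a convention we assume here, and the function names are our own.

```python
def relu(x):
    """ReLU: returns x for positive inputs and 0 otherwise."""
    return max(0.0, x)


def relu_gradient(x):
    """Gradient of ReLU: 1 for x > 0, 0 for x < 0, and 0 at x == 0 by convention."""
    return 1.0 if x > 0 else 0.0


for x in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    print(x, relu(x), relu_gradient(x))
```

For every negative input both the output and the gradient are zero, which is exactly the situation behind the Dying ReLU problem.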
Leaky ReLU addresses this problem.
Leaky ReLU introduces a leaking factor of 0.01, so the output is x for positive x and 0.01x otherwise. This enables learning from all the data points since the gradient for negative values becomes non-zero too. However, the predictions may become inconsistent when using this as the activation function.
The leak value can also be varied, for example learned during training, and the function is then called Parametric ReLU.
These are the most popular non-linear activation functions in use. There is another function, called Softmax, used at the neurons of the output layers of neural networks. We will discuss it in the coming stories since its purpose is a bit different from that of an ordinary activation function.
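Here is a small sketch of Leaky ReLU with the leak exposed as a parameter. The parameter name alpha is our own choice. Passing a different alpha only imitates the idea behind Parametric ReLU, since in a real network that value would be learned during training rather than set by hand.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: returns x for positive inputs and alpha * x otherwise."""
    return x if x > 0 else alpha * x


def leaky_relu_gradient(x, alpha=0.01):
    """Gradient of Leaky ReLU: 1 for x > 0 and alpha otherwise, so it is never zero."""
    return 1.0 if x > 0 else alpha


for x in [-2.0, 3.0]:
    print(x, leaky_relu(x), leaky_relu_gradient(x))

# A different, hand-picked leak value (hypothetical, standing in for a learned one).
print(leaky_relu(-2.0, alpha=0.2))
```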
Introducing non-linearity to neurons
Now we are ready to use one of these non-linear functions to activate the neurons, making the neural network capable of solving complex problems.
With an activation function f applied to the weighted sum, the output of the neuron can be written as y = f(W · X + b).
We learned the functionality of a neuron and simulated it with equations. The next step is to build a neural network with layers of neurons and understand how the input propagates to the output. There are two propagation strategies, forward and backward. We will discuss more on this topic in the next story. We also need some hands-on experience with a programming language to build neurons. We will try some basics of the commonly used frameworks, which will open paths to further experience.
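Putting the pieces together, a neuron with a non-linear activation can be sketched as a weighted sum passed through a chosen function. The function neuron, the way the activation is passed in as an argument, and the example numbers are all assumptions made for this illustration.

```python
import math


def sigmoid(z):
    """Sigmoid activation: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """ReLU activation: max(0, z)."""
    return max(0.0, z)


def neuron(X, W, b, activation):
    """Compute y = activation(W . X + b) for input list X and weight list W."""
    z = sum(w * x for w, x in zip(W, X)) + b
    return activation(z)


# Hypothetical inputs, weights and bias; the weighted sum here is 0.5 - 0.5 + 0.0 = 0.0.
X = [1.0, 2.0]
W = [0.5, -0.25]
b = 0.0

print(neuron(X, W, b, sigmoid))    # sigmoid(0.0) = 0.5
print(neuron(X, W, b, relu))       # relu(0.0) = 0.0
print(neuron(X, W, b, math.tanh))  # tanh(0.0) = 0.0
```

Swapping the activation function changes the behaviour of the neuron without touching the weighted sum.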
Please feel free to use the comment section for any kind of feedback. And also don’t forget to give a clap if you enjoyed the article. Happy reading …!