\section{Introduction}

Artificial intelligence (AI) refers to the ability of a machine to perform complex tasks, in contrast to natural intelligence displayed by human beings and animals \cite{poole1998computational}.

Artificial intelligence branches into many categories -- machine learning, natural language processing, speech, robotics, computer vision etc. -- covering, in principle, anything a human being is capable of performing. There are three types of AI: Artificial Narrow Intelligence (ANI), Artificial General Intelligence (AGI) and Artificial Super Intelligence (ASI). ANI, or `weak' AI, can perform specific tasks such as playing chess and converting speech to text. All existing technologies are, at best, ANI. AGI is AI that can perform any task which can be done by a human being, such as learning, planning, communicating, and applying a combination of those skills to achieve certain goals. ASI refers to machines which surpass the most intelligent human minds, and is arguably the long-term goal of AI.

The capabilities of ASI, by definition, cannot be conceived by the human mind (an analogy would be an insect being unable to conceive the idea of special relativity). We are in an exciting era in which the advent of AGI and ASI could revolutionise the way we live. The subject matter is broad, and this article will focus on the `learning' aspect of AI.

\subsection{Machine Learning}

There are three broad classes of machine learning algorithms: supervised learning, unsupervised learning and reinforcement learning.

Supervised learning uses training inputs paired with specified outputs to train an algorithm to match inputs to outputs. With sufficient training data, it is hoped that the algorithm will be able to generate accurate outputs from new inputs. Unsupervised learning addresses problems for which no `correct' output is specified to train the algorithm, which means one can use it with little or no idea of what the results should look like. Using an algorithm to detect galaxy clusters without any feedback on prediction results is a form of unsupervised learning, where the algorithm finds structure in the input data. This has already been implemented in the classification of galaxies from the Sloan Digital Sky Survey data \cite{unsupervised_galaxy}. Reinforcement learning involves an algorithm training itself by trial and error to maximise the artificial reward it is given.

The basic reinforcement learning problem is modelled as a Markov decision process \cite{howard1960dynamic}. One effective learning algorithm in machine learning is the Artificial Neural Network (ANN), which is loosely modelled after the human brain. Much has been written on how ANNs work, but far less is known about why they are so effective at solving problems. This article will discuss the various hypotheses on why ANNs work so well.

\section{How do ANNs work?}

Before one can start hypothesising why ANNs work as well as they do, we need to have an understanding of how they work. ANNs, as their name indicates, are computational networks which attempt to simulate the neurons of the brain. The use of multi-layer artificial neural networks is characteristic of deep learning, a subset of machine learning.

\FloatBarrier
\begingroup
\centering
\includegraphics[width=\linewidth]{ANN2.pdf}
\captionof{figure}{(a) A simple three layer neural network with 3 inputs, $x_1$, $x_2$ and $x_3$ with one hidden layer containing one neuron. (b) Sigmoid function $\sigma(x)=1/(1+e^{-x})$.}
\label{fig:ANN}
\endgroup
\FloatBarrier

ANNs are composed of individual nodes known as neurons, each of which applies an activation function that maps values between the layers of an ANN. A commonly used neuron is the sigmoid neuron, so named because it applies the sigmoid function (Figure \ref{fig:ANN}b).

The sigmoid neuron takes in inputs, multiplies each input by a weight $\Theta$, sums the results, applies the sigmoid function and outputs a value (Figure \ref{fig:ANN}a). We introduce two notations. The `activation' of node $i$ in layer $j$ is represented by $a_i^{(j)}$ and the matrix of weights connecting layer $j$ to layer $j+1$ is $\Theta^{\,(j)}$.

The output is thus given by
\begin{equation}
\begin{split}
a_1^{(3)} &= \sigma(\Theta_{10}^{\,(1)}x_0 + \Theta_{11}^{\,(1)}x_1 + \Theta_{12}^{\,(1)}x_2 + \Theta_{13}^{\,(1)}x_3)\\
&= \sigma(\mathbf\Theta_1^{\,(1)}\cdot \mathbf x),
\label{eq:output}
\end{split}
\end{equation}
where $a_1^{(3)}$ is the output (the activation) at node 1 of layer three, $\sigma(x)=1/(1+e^{-x})$, and $\mathbf\Theta_i^{\,(j)}$ represents the vectorised form of the weights, with the number of weights equal to the number of inputs. $\mathbf x$ represents the inputs in $n$ dimensions. $x_0$ is always 1 and $\Theta_{10}^{\,(1)}x_0$ is known as the bias, which makes the transformation affine rather than purely linear when approximating the function mapping the inputs to the outputs. This extra `input' is always added to each layer for computational and notational convenience. The sigmoid function maps any real value into the interval between 0 and 1, so the output always lies between 0 and 1.
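
As a purely illustrative calculation (the numerical values are assumptions chosen for this example and are not taken from any trained network), take inputs $\mathbf x = (x_0, x_1, x_2, x_3) = (1, 0.5, -1, 2)$ and weights $(0.1, 0.4, -0.2, 0.3)$. The output expression above then gives
\begin{equation*}
\begin{split}
\mathbf\Theta_1^{\,(1)}\cdot\mathbf x &= 0.1(1) + 0.4(0.5) + (-0.2)(-1) + 0.3(2) = 1.1,\\
\sigma(1.1) &= \frac{1}{1+e^{-1.1}} \approx 0.75,
\end{split}
\end{equation*}
which lies between 0 and 1, as expected of a sigmoid neuron.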

ANNs are made up of multiple layers of neurons.

\subsection{Recognising handwritten digits}

To add a layer of depth to our understanding of ANNs, we shall illustrate the use of ANNs in solving a multiclass classification problem. Say we want to train a neural network to recognise digits.

We require training data: input images (for example a handwritten image of `2') and matching outputs (the value 2). Each digit is a $28\times28$ pixel image, and each pixel has a grayscale value from 0 to 1 representing how dark it is. This makes 784 inputs for each digit, and hence 784 input neurons. Each of these neurons has weights attached to it, one for each connection to the next layer.

First, we randomly initialise a set of weights and calculate the outputs ($a_i^{(3)}$) by applying Equation \ref{eq:output} to each neuron in the hidden layer and passing the resulting activations on to the output layer. To generalise, each of the neurons in the output layer can be computed with the following equation
\begin{equation}
\begin{split}
a_i^{(3)} &= \sigma(\Theta_{i0}^{\,(2)}a_0^{(2)} + \Theta_{i1}^{\,(2)}a_1^{(2)} + \Theta_{i2}^{\,(2)}a_2^{(2)} + \Theta_{i3}^{\,(2)}a_3^{(2)} + \dots )\\
&= \sigma\big(\sum_{j=0} \Theta_{ij}^{\,(2)}a_j^{(2)}\big),
\label{eq:output2}
\end{split}
\end{equation}
where $\Theta_{ij}^{\,(2)}$ represents the weights which map the second layer to the third layer, i.e. they map the outputs computed from the individual nodes in the second layer to those of the output layer.
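
The same computation can be written as a single matrix--vector product, which is one way the whole output layer may be evaluated at once. This is a sketch under stated assumptions: $m$, the number of hidden neurons, is a symbol introduced here because the text does not fix the size of the hidden layer, and $\sigma$ is understood to act on each component separately:
\begin{equation*}
\mathbf a^{(3)} = \sigma\big(\Theta^{\,(2)}\,\mathbf a^{(2)}\big),\qquad
\mathbf a^{(2)} = \big(1,\, a_1^{(2)},\, \dots,\, a_m^{(2)}\big)^{T},
\end{equation*}
where $\Theta^{\,(2)}$ is a $10\times(m+1)$ matrix for the ten digit classes. Row $i$ of $\Theta^{\,(2)}$ holds the weights feeding output neuron $i$, so the $i$-th component of the product reproduces the expression for $a_i^{(3)}$ above.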

To put things into context, an accurately trained ANN will output $a_3^{(3)} \approx 1$ and $a_{i\neq3}^{(3)} \approx 0$ when an image of `2' is input (Figure \ref{fig:ANN_example}).

\FloatBarrier
\begingroup
\centering
\includegraphics[width=\linewidth]{ANN_example2.pdf}
\captionof{figure}{Three layer Artificial Neural Network for recognising handwritten digits. The output layer should contain 10 neurons representing 0 to 9, but stops at 2 for simplicity of the diagram. Here, an image of `2' is input into the trained three layer neural network, which correctly outputs `2' in the third neuron $a_3^{(3)}$.}
\label{fig:ANN_example}
\endgroup
\FloatBarrier

The next step would be to find the error between the actual value and the value that was output from the randomly initialised set of weights. This error can be represented using the quadratic cost function (or mean squared error), given by
\begin{equation}
C(\Theta)\equiv\frac{1}{2n}\sum_x ||y(x) - a(x, \Theta)||^2,
\end{equation}
where $\Theta$ represents the collection of all weights in the network (including the biases), $n$ is the total number of training inputs, $y(x)$ is the set of training outputs with values that match the training images, and $a$ is a function of $x$ and $\Theta$ that represents the vector of outputs from the network when $x$ is input \cite{nielsen2015neural}. Training the ANN entails minimising the cost function.
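
To make the cost function concrete, consider a hypothetical case with a single training image of `2' ($n=1$). The desired output $y(x)$ has a 1 in the third component and 0 elsewhere; the network output used below is an assumed, illustrative value rather than a measured result:
\begin{equation*}
\begin{split}
y(x) &= (0, 0, 1, 0, 0, 0, 0, 0, 0, 0),\\
a(x,\Theta) &= (0.1, 0, 0.8, 0, 0, 0, 0, 0, 0, 0),\\
C(\Theta) &= \frac{1}{2}\left[(0-0.1)^2 + (1-0.8)^2\right] = \frac{0.01+0.04}{2} = 0.025.
\end{split}
\end{equation*}
A perfectly classified image would give $C=0$, so smaller values of $C$ indicate outputs closer to the training labels.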

One method to achieve this is gradient descent, which uses the gradient of the cost function with respect to the weights to find the direction of steepest descent towards a minimum of the error. By updating the weights by a small amount in a direction which progresses towards the minimum error value, one can iterate and thus optimise the neural network. The algorithm that computes how the error depends on each weight, so that the weights can be updated iteratively, is known as backpropagation \cite{backpropagation}. It is `backwards' because the errors are computed starting from the final layer of the ANN and propagated towards the input layer, a consequence of the fact that the error is a function of the outputs. A successfully trained neural network will have a matrix of weights connecting each pair of adjacent layers, and will give accurate predictions when new inputs are fed into the system. Gradient descent is one of many optimisation algorithms used to reduce cost functions.
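
The weight update of gradient descent is commonly written as follows; the symbol $\eta$ (a small positive step size, often called the learning rate) is notation introduced here for illustration and is not used elsewhere in this article:
\begin{equation*}
\Theta \;\rightarrow\; \Theta - \eta\,\nabla_{\Theta} C(\Theta).
\end{equation*}
As a toy illustration, assume a single weight $\theta$ with cost $C(\theta)=\theta^2$, so that the gradient is $2\theta$. Starting from $\theta=1$ with $\eta=0.1$, successive updates give
\begin{equation*}
\theta:\quad 1 \;\rightarrow\; 1 - 0.1(2) = 0.8 \;\rightarrow\; 0.8 - 0.1(1.6) = 0.64 \;\rightarrow\; \dots
\end{equation*}
Each step shrinks $\theta$ by a factor of $0.8$, moving it towards the minimum at $\theta=0$.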

\section{Applications of ANNs}

\subsection{ANNs solve mathematical problems}

Deep learning using ANNs is one of the most effective learning algorithms. It has been successfully applied to many problems such as image recognition, achieving an accuracy of 99.6\% for recognising handwritten digits, one of the highest accuracies \cite{simard2003best} achieved when trained with the MNIST data set. ANNs were also used to build Google's AlphaGo, which bested a world champion at Go, a game which is harder to solve computationally than chess \cite{silver2016mastering}.

\subsection{ANNs in Physics}

The application of AI, and specifically ANNs, in Physics is not uncommon. ANNs have been used to augment Physics experiments since the 1980s \cite{highenergyphysics}.

For example, a feed-forward ANN was used to discriminate the decay of the Z boson into c, b or s quarks, and the results were further used to determine the decay probability of Z into the corresponding states \cite{delphi1997search}. Very recently, ANNs were used to approximate the wavefunction of a many-body quantum system, where a simple neural network was used successfully in finding the ground state and describing the unitary time evolution of complex interacting quantum systems \cite{carleo2017solving}. This is exciting because a number of regimes remain unexplored owing to the sign problem in quantum Monte Carlo (QMC) methods \cite{foulkes2001quantum} and the inefficiency of current compression approaches in higher dimensions, and a properly trained ANN has the potential to overcome these problems. However, not much is known about why ANNs work so well in solving problems. An ANN is sometimes treated as a `black box', which can be unsettling.

\section{Why do ANNs work?}

\subsection{Effectiveness of backpropagation}

When tuning an ANN to solve a problem, one can vary the number of hidden layers (any layer which is not the input or output layer), and the number of neurons in each hidden layer. One of the reasons why ANNs work so well is the effectiveness of the backpropagation algorithm.

Although ostensibly trivial, the algorithm provides a way to track how a change in each individual weight propagates through the network until it reaches the output layer and affects the cost function. Every iteration also changes the weights by a small amount, progressing towards a minimum of the cost. Using matrices for backpropagation also makes this method computationally viable, as keeping track of every single weight and updating them one at a time would incur significant computational cost.

\subsection{Universality theorem}

It is helpful to recall that the problems ANNs are applied to aim to relate a set of inputs to the outputs. Hence, what an ANN does is approximate the function which most accurately maps the inputs to the outputs.

It turns out that one of the main reasons why ANNs work so well is that the multi-layer feedforward architecture gives neural networks the potential to be universal approximators \cite{cybenko1989approximation, HORNIK1989359}. It was proven that a shallow ANN with a single hidden layer is able to approximate any continuous function on a compact domain to arbitrary accuracy, provided the hidden layer contains sufficiently many neurons. It was further shown that it is not the specific sigmoidal activation function that gives rise to the universality, but rather the architecture of the ANN.

\subsection{Efficiency/low computational cost}

There are many learning algorithms, such as linear regression and logistic regression.

One might ponder why ANNs have to be used in solving problems as well. To understand this, imagine a computer vision problem where we want to train an algorithm to recognise kittens on the internet. Such a problem calls for a non-linear hypothesis: if logistic regression is used with pairwise products of the $n$ input features, the number of features to compute grows as roughly $n^2/2$. For the $28\times28$ pixel images considered earlier, $n=784$ already gives about $3\times10^5$ such features.

\subsection{Simplicity of the problems}

Another recent hypothesis suggests that we look not at why ANNs can solve so many problems, but at the nature of the problems that can be solved \cite{lin2017does}.