Biological and Artificial Neurons
This week I thought I’d give you a very brief introduction to one of the most powerful and perhaps most mysterious learning algorithms we have: the Artificial Neural Network, or ANN. This will take us on a journey through what a biological neuron is (and briefly how it works and learns), how an artificial neuron might model a biological neuron, what the computational complexity of a single artificial neuron looks like, and how these artificial neurons can be assembled to learn statistical models of their input data sets. This stuff lies at the intersection of computer science, robotics, systems biology, neuroscience and what we all think of as networking. What could be cooler than that? Let’s take a look…
Much of machine learning claims to be “biologically inspired”. While this is reassuring (some of our brains learn pretty well), it is not required, and many learning algorithms are considered “biologically infeasible”. So let’s briefly look at the structure of a single biological neuron and consider it as a feasible model for an artificial neuron. Figure 1 shows a typical biological neuron. BTW, in case you are wondering, the human brain is thought to have roughly O(10^11) neurons; you can compare that to a cockroach, which has O(10^6) neurons, or a chimpanzee, which has O(10^10).
For purposes of our discussion, I want to call your attention to three features of biological neurons. First are the dendrites; think of these as the terminals at which the neuron receives its inputs (real dendrites can do computations and have feedback control; this is not typically modeled in an artificial neuron; a notable exception is Jeff Hawkins’ Hierarchical Temporal Memory model). Next is the cell body; think of this as where any processing occurs. Finally, the axon carries the output from the cell body to neighboring neurons and their dendrites. Basically the neuron is a “simple” computational device: it receives input at its dendrites, does a computation at the cell body, and carries the output on its axon (I’ll just note here again that a real neuron is much more complicated than this picture suggests).
Biological neurons communicate across a synapse, which is a chemical “connection”, typically between an axon and dendrite. Signals flow from the axon terminal to receptors on the dendrite, mediated by the chemical state of both the axon and dendrite. So what does learning mean in this setting? According to Hebbian theory (named after Canadian psychologist Donald Hebb), learning is related to the increase in synaptic efficacy that arises from the presynaptic cell’s repeated and persistent stimulation of the postsynaptic cell. It is this increase in “communication” efficacy that we call learning. One way to think about this is that the connection between the axon terminal and the dendrite is weighted, and the larger the weight (concentration of neurotransmitters in the axon terminal along with other chemical components including receptors in the dendrites) the more likely the neuron is to “fire”. So in this setting learning consists of optimizing neuron firings in certain patterns; to put it a different way, learning consists of optimizing the connection weights between axons and dendrites in a way that leads to some observed behavior. We will return to this idea of optimizing weights as learning when we talk about training Artificial Neural Networks.
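As a rough illustration of the “weights get stronger with repeated stimulation” idea, here is a minimal sketch of the simplest textbook form of Hebb’s rule, in which a weight grows in proportion to the product of presynaptic and postsynaptic activity. The update formula, the function name hebbian_update, the learning rate and the example values are all assumptions made for illustration; they do not come from the discussion above, and real synapses are far more complicated.

```python
def hebbian_update(w, x, y, learning_rate=0.1):
    """Return new weights after one simplified Hebbian step.

    Each weight w_i grows by learning_rate * x_i * y, i.e. only when its
    input x_i (presynaptic) and the output y (postsynaptic) are active together.
    """
    return [w_i + learning_rate * x_i * y for w_i, x_i in zip(w, x)]

# Hypothetical example: the first input keeps firing together with the output,
# the second input stays silent.
w = [0.2, 0.2]
x = [1.0, 0.0]
y = 1.0
for _ in range(3):
    w = hebbian_update(w, x, y)

print(w)  # the first weight has grown, the second is unchanged
```

Only the connection that is repeatedly active together with the output gets stronger, which is the “increase in synaptic efficacy” in a very stripped-down form.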
Artificial Neurons, or ANs, are biologically inspired algorithms (sorry, move along, no magic here). Figure 2 compares biological and artificial neurons (BTW, I have no idea how to write mathematical notation in WordPress, so hey, if you know how to do that drop me a note).
What you can see is that the artificial neuron has inputs, typically labelled $x_1, \dots, x_n$, and an output $y$. The inputs roughly correspond to the signals received by the dendrites of a biological neuron, and the output $y$ models what the neuron soma (cell body) sends down the axon to its neighbors (i.e., the neuron’s output).
Now, in order to calculate the “pre-activation” of the AN, it multiplies each input $x_i$ by a weight $w_i$ and calculates the sum ($b$ is a bias term) $a = b + \sum_{i=1}^{n} w_i x_i$. Finally, to calculate the output (the equivalent of what the biological neuron would send on its axon), the AN applies an activation function, usually called g(.), to the sum to get the AN’s activation. This winds up looking like this: $y = g(a) = g\left(b + \sum_{i=1}^{n} w_i x_i\right)$.
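The computation just described (a weighted sum plus a bias, passed through an activation function) can be sketched in a few lines of Python. This is a minimal sketch, not a definitive implementation: the function names logistic and neuron_output, the choice of the logistic function for g, and the example values are assumptions made for illustration.

```python
import math

def logistic(a):
    """Logistic (sigmoid) activation: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def neuron_output(x, w, b, g=logistic):
    """Compute y = g(b + sum_i w_i * x_i) for a single artificial neuron."""
    pre_activation = b + sum(w_i * x_i for w_i, x_i in zip(w, x))
    return g(pre_activation)

# Hypothetical inputs, weights and bias, chosen only for illustration.
x = [1.0, 0.0, 1.0]
w = [0.5, -0.3, 0.5]
b = -1.0

y = neuron_output(x, w, b)  # pre-activation is -1.0 + 0.5 + 0.0 + 0.5 = 0.0
print(y)                    # logistic(0.0) is 0.5
```

Passing a different function as g swaps the activation without touching the weighted sum, which is why the two steps are usually described separately.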
So here is the really amazing thing: learning in an ANN is in large part about setting weights and biases. This is also a pretty good (although admittedly simplistic) analogy to how biological networks learn; this kind of falls out of Hebbian theory. In fact, the weights $w_i$ are frequently called “synaptic weights” as they play a role similar to that of the chemistry of the synapse in learning processes. Of course, computation in a biological neuron is much more complicated, so the artificial neuron is in some sense an “abstraction”.
The next question is what the activation function g(.) might look like. Among the first ANs was the perceptron, invented by Frank Rosenblatt and his colleagues in 1957. In modern terms the perceptron was an algorithm for learning a binary classifier. However, the perceptron had a problem that limited its utility: it computed a step function. The fact that a step function is by definition not continuous made it mathematically unwieldy; for example, a step function has no derivative at the “step” (i.e., the slope of the curve is infinite/undefined there). The activation function cartoon for the perceptron is shown in Figure 3.
Modern ANs use a variety of activation functions which are smoother than the perceptron’s step function and as such have nicer mathematical properties. The most popular activation functions are shown in Figure 4. Note that the logistic function is sometimes called “sigmoid” because of its shape.
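To make the contrast between the step function and a smooth activation concrete, here is a small sketch comparing them near zero. The logistic function is the one named above; including tanh is an assumption, since the text does not say which functions Figure 4 shows. The function names and sample points are illustrative only.

```python
import math

def step(a):
    """Perceptron-style step activation: jumps from 0 to 1 at a = 0."""
    return 1.0 if a >= 0 else 0.0

def logistic(a):
    """Logistic (sigmoid) activation: a smooth S-shaped curve in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def logistic_derivative(a):
    """Derivative of the logistic function, defined for every real a."""
    s = logistic(a)
    return s * (1.0 - s)

# Compare the activations on a few sample pre-activations around zero.
for a in (-2.0, -0.1, 0.0, 0.1, 2.0):
    print(a, step(a), round(logistic(a), 3), round(math.tanh(a), 3))
```

Moving from -0.1 to 0.1 flips the step function from 0 to 1, while the logistic output changes only slightly and has a well-defined slope everywhere.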
What Can A Single AN Compute?
We’ve seen that a single AN consists of connection weights (the Ws), a bias term, and an activation function. Given this, what can we say about what a single AN can compute? Figure 5 shows the situation for a sigmoid activation function with two inputs, $x_1$ and $x_2$. Note here that I’m using sigmoid and logistic interchangeably when referring to activation functions; BTW, logistic regression doesn’t have much to do with linear regression, so don’t get confused by that either. In any event, what you can see in Figure 5 is that the “decision boundary” for an AN with a sigmoid activation function is linear (“decision boundary” is an important concept in machine learning that we’ll spend more time on later). What this means at a high level is that the two classes that the AN is trying to predict or classify can be separated by a straight line. Similarly, in the general case of more than two inputs, a decision boundary is linear if there exists a hyperplane that separates the classes.
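One way to see why the boundary is a straight line: the logistic function outputs exactly 0.5 when its argument is 0, so thresholding the neuron’s output at 0.5 is the same as checking which side of the line w1*x1 + w2*x2 + b = 0 a point falls on. The sketch below checks this on a few points; the 0.5 threshold, the function names, the weights and the sample points are assumptions chosen for illustration.

```python
import math

def logistic(a):
    """Logistic (sigmoid) activation."""
    return 1.0 / (1.0 + math.exp(-a))

def predict(x1, x2, w1, w2, b):
    """Class 1 if the neuron's output is at least 0.5, otherwise class 0."""
    return 1 if logistic(w1 * x1 + w2 * x2 + b) >= 0.5 else 0

def side_of_line(x1, x2, w1, w2, b):
    """Class 1 if the point lies on or above the line w1*x1 + w2*x2 + b = 0."""
    return 1 if w1 * x1 + w2 * x2 + b >= 0 else 0

# Hypothetical neuron whose decision boundary is the line x1 + x2 = 1.
w1, w2, b = 1.0, 1.0, -1.0
points = [(0.0, 0.0), (0.2, 0.3), (1.0, 1.0), (0.9, 0.8)]

labels = [predict(x1, x2, w1, w2, b) for x1, x2 in points]
sides = [side_of_line(x1, x2, w1, w2, b) for x1, x2 in points]
print(labels, sides)  # the two lists agree
```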
The intuition here is that a single AN can compute functions with a linear decision boundary. These are frequently called linear features (we will look at how to handle non-linear features in a later blog). As an example, consider the cases shown in Figure 6, and consider the OR function on the left. The triangles indicate where the value of OR($x_1$, $x_2$) is one, and the circles indicate where it is zero. You can draw a straight line separating the triangles and circles, so OR is linearly separable. The same holds for AND. The result is that since the OR and AND functions are linearly separable, they can be computed by a single AN.
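To make this concrete, here is a sketch of a single sigmoid AN computing OR and AND on binary inputs. The weights and biases are hand-picked for illustration (they are not learned, and many other choices would work equally well), and the output is rounded to 0 or 1 to read off the answer.

```python
import math

def logistic(a):
    """Logistic (sigmoid) activation."""
    return 1.0 / (1.0 + math.exp(-a))

def neuron(x1, x2, w1, w2, b):
    """Output of a single two-input artificial neuron."""
    return logistic(w1 * x1 + w2 * x2 + b)

# Hand-picked (w1, w2, b) values, assumed for illustration only.
OR_PARAMS = (10.0, 10.0, -5.0)    # fires if at least one input is 1
AND_PARAMS = (10.0, 10.0, -15.0)  # fires only if both inputs are 1

def gate(params, x1, x2):
    """Round the neuron's output to the nearest of 0 and 1."""
    return round(neuron(x1, x2, *params))

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "OR:", gate(OR_PARAMS, x1, x2), "AND:", gate(AND_PARAMS, x1, x2))
```

The only difference between the two gates is the bias, which slides the same straight-line boundary to a different position.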
On the other hand, the XOR function is not linearly separable; this situation is shown on the left in Figure 7. On the right we transformed the input (with functions that are computable by a single AN) into something that is linearly separable, and hence the XOR of the transformed input is computable by a single AN. This gives us the hint that we might be able to compute more complex functions with a multi-layer ANN. As we shall see, that intuition is correct.
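Here is a sketch of that idea: two ANs transform the input, and a third AN classifies the transformed input. The particular transformation used here (an OR-like neuron and an AND-like neuron) is an assumption, since the text does not say which transformation Figure 7 uses, and all weights are hand-picked for illustration.

```python
import math

def logistic(a):
    """Logistic (sigmoid) activation."""
    return 1.0 / (1.0 + math.exp(-a))

def neuron(inputs, weights, bias):
    """Output of a single artificial neuron with a logistic activation."""
    return logistic(bias + sum(w * x for w, x in zip(weights, inputs)))

def xor_net(x1, x2):
    """Two-layer network for XOR; returns (h1, h2, y)."""
    h1 = neuron((x1, x2), (10.0, 10.0), -5.0)   # approximately OR(x1, x2)
    h2 = neuron((x1, x2), (10.0, 10.0), -15.0)  # approximately AND(x1, x2)
    y = neuron((h1, h2), (10.0, -10.0), -5.0)   # approximately h1 AND NOT h2
    return h1, h2, y

for x1 in (0, 1):
    for x2 in (0, 1):
        h1, h2, y = xor_net(x1, x2)
        print((x1, x2), "->", (round(h1), round(h2)), "->", round(y))
```

In the transformed coordinates the inputs land near (0, 0), (1, 0) and (1, 1), and a single straight line separates (1, 0) from the other two, which is exactly what the output neuron exploits.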
Next time we’ll take a look at multi-layer ANNs and how to train them (i.e., more on what learning means and how it works). Once we have that machinery, we’ll be able to look at much of the cool contemporary work in Machine Learning, including amazing ideas like stacked denoising auto-encoders and Restricted Boltzmann Machines, and some of the beautiful techniques such as dropout which are designed to make your algorithms learn more effectively. Armed with all of that, we’ll write some Python code that implements a multi-layer ANN and generate some data to test it.