Don’t Fear the AI?

What is this all about?


Everyone seems to be talking about their fear of AI these days, including Stephen Hawking, Elon Musk and many others. Let's stipulate this: building a malevolent super intelligence is probably not a good thing. Ok, so booting up Skynet is a bad idea. Check.

On the other hand, Ray Kurzweil is telling a different story: Real AI (let's call it Artificial General Intelligence, or AGI) will turn up in less than 15 years, in 2029 to be precise. Better still, rather than Skynet, AGIs will be the best technology ever (for some value of "ever"). As an aside, AGI is worth looking at if you are interested in a look into one possible future.

 AGI by 2029?

Kurzweil is basically predicting that we will see AIs by 2029. While I'm also optimistic (we're clever folks), I'm less convinced that we'll see AGI by 2029. Let's be precise here: people are really worried about AGIs, not "narrow" AI. For example, if you look at Baidu's recent state of the art speech recognition system (see Deep Speech: Scaling up end-to-end speech recognition) you will notice that in contrast to AGI, the system is very much engineered and optimized for the "narrow" task at hand (in this case speech recognition). BTW, there are many hard problems still to be solved in machine learning, many of which have to do with the sheer computational complexity that you run into when trying to build systems that generalize well while at the same time contending with high-dimensional input spaces (consider, for example, solving an optimization problem that has a billion parameters on 1000 machines / 16,000 cores). Quite simply, the progress made by these deep learning systems and their performance is impressive, and the same can be said about the machine learning community in general. So while many hard problems remain, both applied and theoretical progress in machine learning (and deep learning in particular) has been spectacular. BTW, there is also plenty of code around if you want to try any of this yourself; see for example Caffe, where you can find state of the art pre-trained neural networks (there are many others).

Of course, the fact that these systems are optimized for the task at hand doesn't mean that we haven't learned a great deal from that work or that the work/progress isn't impressive; on the contrary. The capabilities of state of the art machine learning are nothing short of spectacular. However, if you check out the machine learning literature, you will quickly realize just how much task-specific engineering goes into a deep learning system like the Baidu Deep Speech system (or robots that learn from watching youtube). They are far from being general purpose (or perhaps more importantly, self-aware). So is this progress in "narrow" AI a necessary precursor to AGI? Not surprisingly, you can find an opinion on every side of this question.

On the other hand, AGI itself, while making great strides, is still in its infancy (notwithstanding decades of work across a wide variety of disciplines; it is an ambitious undertaking after all). An excellent example is the OpenCogPrime Architecture from Ben Goertzel and team. While all of this stuff is incredibly cool and progress is coming quickly, there would seem to be quite a way to go before we see real AIs.

Now, it probably goes without saying that some set of technological breakthroughs, or hey, maybe even synthetic neurobiology breakthroughs, could lead to the "boot up" of an AGI much sooner than anticipated (BTW, take a look at what Ed Boyden is doing in the synthetic neurobiology space; pretty amazing stuff). In any event, if such an AI were coupled with some ability to recursively self-improve, for example the advent of an AI that can rewrite its own code, we could stumble into the Skynet nightmare. There is also no reason to believe that such a malevolent super intelligence would look anything like human intelligence. Notwithstanding the significant challenges that lie between here and AGIs, people like Kurzweil, Ben Goertzel, Randal Koene and many others certainly believe the development of AGIs is positive, inevitable, and likely in the "one or two decades" time frame.

So how do we place odds on both the potential for the development of AGIs and the danger they pose?

Those are, of course, the questions at hand. As I mentioned, I’m optimistic about the progress we’ve seen in both narrow AI and in AGI, so 2029 seems, well, possible.

Tech Is Always A Double-Edged Sword

Regarding the danger question, here I agree with Kurzweil: now is the time to understand the issues and put in place safeguards; admittedly not too reassuring if what you are concerned about is malevolent super intelligences. However, like every other technology, there will be both good and bad aspects. Our job is to understand the threat, if any, while maximizing the benefit and minimizing the damage (assuming that can be done). There is no shortage of literature in this area either (start here or check out Nick Bostrom et al.'s The Ethics of Artificial Intelligence if you are interested). No small task we've embarked on here folks.

Demystifying Artificial Neural Networks, Part 2

In this blog I thought I’d give you a bit of an idea about how Feed Forward Artificial Neural Networks work. As you might imagine, this is a huge topic and this blog became tl;dr before I really knew it, so try to hang in there and give me feedback (if you like) on what you read here.

And BTW, if you know a good way to write math in wordpress please let me know.

Thanks, and safe and happy holidays to you and your families.

–dmm

So What is Machine Learning Anyway?

While there are several formal models of what Machine Learning (ML) is or does, whenever I’m asked what machine learning is all about, this quote from Andrew Ng  is always the first thing that comes to mind:

The complexity in traditional computer programming is in the code (programs that people write). In machine learning, algorithms (programs) are in principle simple and the complexity (structure) is in the data. Is there a way that we can automatically learn that structure?  That is what is at the heart of machine learning.

That is, ML is about the construction and study of systems that can learn from data. What exactly does this mean? Basically ML is about building statistical models from training data that can be used to predict the future, usually by either classification or by computing a function. This is a very different paradigm than we find in traditional programming. The difference is depicted in cartoon form in Figure 1.


Figure 1: Traditional Programming vs. Machine Learning
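
To make Figure 1 concrete, here is a tiny sketch of the "data plus desired outputs gives you a program" idea using scikit-learn; the toy data and the choice of logistic regression are mine, purely for illustration.

```python
# A minimal sketch of the "data + desired outputs -> program" idea from Figure 1,
# using scikit-learn's logistic regression (any classifier would do).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: two features per example, with a yes/no label.
X_train = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y_train = np.array([0, 0, 1, 1])          # the "desired outputs"

model = LogisticRegression()              # a simple, generic algorithm...
model.fit(X_train, y_train)               # ...whose "program" is learned from the data

print(model.predict([[3.5, 3.5]]))        # use the learned program on a new input
print(model.predict_proba([[3.5, 3.5]]))  # or ask for the probability of yes/no
```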

So basically we want to build models that, when given some prior information (which may not only be training data but also knowledge of priors), can generalize to predict

  • The probability of a yes/no event
    • e.g. customer likes product, image contains a face, …
  • The probability of belonging to a category
    • e.g. emotion in face image: anger, fear, surprise, happiness, or  sadness, …
  • The expected value of a  continuous variable
    • e.g. expected game score, time to next game, …
  • The probability density of a  continuous variable
    • e.g. probability of any interval of values, …
  • And many others…

Basic Assumption: Smoothness

So what assumptions do we need to be able to predict the future?  That is, what are the key assumptions that allow us to build generalizable statistical models from input data? To get a qualitative idea of what is required,  consider what it takes to learn the function f(x) depicted in Figure 2.


Figure 2: Easy Learning (courtesy Yoshua Bengio)

In this case learning f(x) is relatively easy because the training data closely models the true (but unknown) function; that is, the training data hits most of the ups and downs of the function to be learned. The basic property of these easy cases (e.g., Figure 2) that allows the learned model to generalize is that for a training example x and a test input x', if x is geometrically close to x', then we assume that f(x) ≈ f(x'). Clearly this is true for the function and data depicted in Figure 2.

The assumption that f(x) ≈ f(x’) when x is close to x’ is called the Smoothness Assumption, and it is core to our ability to build generalizable models. The Smoothness Assumption is depicted for a simple quadratic  function  in Figure 3.


Figure 3: Smoothness Assumption — x geometrically close to x’  → f(x) ≈ f(x’)
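
To see the Smoothness Assumption doing real work, here is a small sketch; the function and the nearest-neighbour regressor are my own choices for illustration. A model trained on samples of a smooth function predicts well at nearby test points precisely because f(x) ≈ f(x') when x is close to x'.

```python
# Illustrating the Smoothness Assumption: if x is close to x', we expect f(x) ≈ f(x').
# A nearest-neighbour regressor relies on exactly this property.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

f = lambda x: np.sin(x)                          # the true (but "unknown") function
x_train = np.linspace(0, 2 * np.pi, 50)          # training data hits the ups and downs
y_train = f(x_train)

knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(x_train.reshape(-1, 1), y_train)

x_test = np.array([[1.03], [2.51], [4.9]])       # test points near the training points
print(knn.predict(x_test))                       # ≈ f(x_test) because f is smooth
print(f(x_test.ravel()))                         # compare with the true values
```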

Well, that is all cool except that the functions we want to learn in ML are nothing like the functions depicted in Figures 2 or 3. Instead, real data from functions we want to learn are typically concentrated near highly curved (potentially high-dimensional) sub-manifolds, as shown in Figure 4. This concentration is sometimes called "probability mass", since the probability of finding representatives of the class we're trying to learn will in general be higher near the surface where the data are concentrated.


Figure 4: Sub-manifold for the handwritten digit 4

Notice that the manifold in Figure 4 is something of an "invariance" manifold. That is, images on the manifold (in this case the handwritten digit 4) are in theory invariant to translation and rotation. We need this invariance property because, after all, I want to be able to recognize a face (or whatever) even in the presence of, say, out of plane rotation, so we need a representation that accommodates this kind of transformation. So we assume that things that are alike concentrate on a manifold as shown in Figure 4 (note that this is just a higher-dimensional generalization of what is shown in Figure 3).

Incidentally, rotational/translational invariance of the kind shown in Figure 4 is precisely what breaks down in the presence of the custom crafted adversarial images discussed in Intriguing properties of neural networks?. These studies show that under certain conditions an adversarial image can "jump off" the manifold and as a result won't be recognized by a learner trained to build such an invariance manifold. Other studies have shown the converse: adversarial images can be constructed (in the case cited, with evolutionary algorithms) that fool the trained network into confidently classifying what is essentially white noise.

Before moving on, I'll just note here that deep learning has emerged as a promising technique to learn useful hidden representations (such as representations that reside on manifolds as described above) which previously required hand-crafted preprocessing; we'll talk about what hidden means below. This input preprocessing essentially came down to using a human to tell the learning algorithm what the important features were; the SIFT algorithm is a canonical example from the field of computer vision. What SIFT does is preprocess the input data so that the machine learning step(s) will be more efficient. I'll note here that prior to the success of unsupervised deep learning systems in discovering salient features, most of the effort in building, say, a machine learning system to recognize faces was in the crafting of features. Modern machine learning techniques seek to remove humans from the feature crafting/discovery loop. We'll take a closer look at deep learning and its application to representation learning in upcoming blogs.

When might we use Machine Learning?

Machine learning is most effective in the following situations:

  • When patterns exists in our data
    • Even if we don’t know what they are
    • Or perhaps especially when we don’t know what they are
  • We can not pin down the functional relationships mathematically
    • Else we would just code up the algorithm
    • This is typically the case for image or sensor data, among others
  • When we have lots of (unlabelled) data
    • Labelled training sets are harder to come by
      • Labelled data has a tag that tells you what it is, e.g., this is a cat
      • Most data isn’t labelled
  • Data is of high-dimension
    • Dimensionality of input data is a problem that all machine learning approaches need to deal with. For example, if we're looking at 1K×1K grey scale images, the input has 1,048,576 dimensions in pixel space. This high dimensionality leads to what is known as the curse of dimensionality; basically, as the dimensionality of the input data increases, the volume of the input space increases exponentially, so the available data become sparse, and sparse input poses interesting problems for statistical (and other) methods. This also means we are unlikely to see examples from most regions of the input space, so in order to be effective a model or set of models must generalize well (noting that model averaging is a powerful technique to minimize prediction error). The curse of dimensionality problem is depicted in Figure 5.
      • BTW, how do we learn to recognize images in nature? Well, consider the space of images that we humans can see. These images have something like 10^6×10^6 or 10^12 dimensions in pixel space (yes, your eyes are perceiving input in more than a trillion dimensional space); this number comes from the fact that there are on the order of 10^6 fibers in the optic nerve running from each of your retinas to your visual cortex. Given that the distribution of cones and rods in the average human retina has a maximum of something like (1.5 x 10^5)/mm^2, we can make a conservative guess and say we have 256 bits to represent color at each point (we're clearly under-estimating the number of colors/shadings we can perceive; color vision in mammals is of course more complex than this, with fascinating theories going back to Newton and Maxwell). But even without looking at color (or gray scale, or …), this simplifying assumption gives us a conservative estimate of the size of the trillion dimensional image space of something like 2^(10^12), an enormous and computationally intractable number. Given this huge space and the computational challenges it implies, how is it that we learn to recognize any image at all? The answer is that we can learn the images we encounter in nature because that set is much, much smaller than the total image space. That is, the set of naturally occurring images is sparse.
    • Sensor, audio, video, and network data, to name a few, are all high dimension
  • Want to “discover” lower-dimension (more abstract) representations
    •  Several dimension-reduction techniques are used in machine learning (e.g., Principal Component Analysis, or PCA) and we will discuss these in later blogs, but as a hint, the basic idea behind techniques like PCA is to find the important aspects of the training data and project them onto a lower-dimensional space (a quick sketch using scikit-learn's PCA follows Figure 5 below).
  • Aside: Machine Learning is heavily focused on implementability
    • Frequently using well-known numerical optimization techniques
    • Heavy use of GPU technology
    • Lots of open source code available
      • See e.g., libsvm (Support Vector Machines)
      • Most of my code these days is in python (lots of choices here)

Figure 5: The Curse of Dimensionality — at high dimension data becomes sparse
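
As promised above, here is a quick sketch of dimensionality reduction with PCA in scikit-learn; the synthetic data (50 dimensions with most of the structure in a handful of directions) is made up purely for illustration.

```python
# A minimal sketch of dimensionality reduction with PCA (scikit-learn);
# the data here is synthetic, just to show the projection onto a lower-dimensional space.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# 200 samples in 50 dimensions, but the variance lives in only a few directions
X = rng.randn(200, 5) @ rng.randn(5, 50)

pca = PCA(n_components=5)            # project onto the 5 most important directions
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)            # (200, 50) -> (200, 5)
print(pca.explained_variance_ratio_.round(3))    # how much structure each component keeps
```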

As we can see from the discussion above, machine learning is primarily about building models that will allow us to efficiently predict the future. What does this mean? Basically it's a statement about how well a particular model generalizes.

But what does generalizing really mean? We can characterize a model that generalizes as one that captures the dependencies between random variables while spreading out the probability mass from the empirical distribution (homework: why is this part of generalization?). At the end of the day, however, what we really want to do is to discover the underlying abstractions and explanatory factors which make a learner understand what is important (and frequently hidden) about the input data set. No small task. Two of the key problems encountered when trying to build generalizable models are overfitting and underfitting (the evil twins of bias and variance), which I'm not going to talk about here other than to say that there are also basic statistical barriers that make model generalization tricky; I'll leave that for a future blog.

Examples of Machine Learning Problems

Machine learning is a very general method for solving a wide variety of problems, including

  • Pattern Recognition
    • Facial identities or facial expressions
    • Handwritten or spoken words (e.g., Siri)
    • Medical images
    • Sensor Data/IoT
  • Optimization
    •  Many parameters have “hidden” relationships that can be the basis of optimization
  • Pattern Generation
    • Generating images or motion sequences
  • Anomaly Detection
    •  Unusual patterns in the telemetry from physical and/or virtual plants
      • For example, data centers
    • Unusual sequences of credit card transactions
    • Unusual patterns of sensor data from a nuclear power plant
    • or unusual sounds in your car engine or …
  • Prediction
    •  Future stock prices or currency exchange rates

Types of Machine Learning

Supervised Learning

The main types of machine learning include Supervised Learning, in which the training data includes desired outputs (i.e., supervised learning requires "labelled" data sets). All kinds of "standard" (benchmark) training data sets are available, including http://archive.ics.uci.edu/ml/ (UCI Machine Learning Repository), http://yann.lecun.com/exdb/mnist/ (a subset of the MNIST database of handwritten digits), and http://deeplearning.net/datasets/, to name a few.
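
If you want to play with one of these labelled data sets yourself, here is one way (among many) to pull MNIST down with scikit-learn; this assumes a reasonably recent scikit-learn with fetch_openml available.

```python
# One way (among many) to pull a labelled benchmark data set: MNIST via scikit-learn.
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1, as_frame=False)  # downloads on first use
X, y = mnist.data, mnist.target

print(X.shape)   # (70000, 784): 70,000 handwritten digits, 28x28 pixels each
print(y[:10])    # the labels ("desired outputs") that make this a supervised data set
```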

Unsupervised learning

In Unsupervised Learning, the training data does not include desired outputs. This is called "unlabelled data" (for obvious reasons). Most data is of this type, and new techniques such as auto-encoders have been developed to take advantage of the vast amount of unlabelled data that is available (consider the number of video frames available on youtube, for example).

Other Learning Types

There are many other learning types that have been explored. Two of the most popular are Semi-supervised Learning, in which the training data includes a few desired outputs (some of the training data is labelled, but not all), and Reinforcement Learning, in which the learner is given rewards for certain sequences of actions.

Artificial Neural Networks (ANNs)

Before diving into ANNs, I'll just point out that you can think of an ANN as a learning algorithm that learns a program from its training data set(s). How does it learn? See below. That said, there are various kinds of learning that broadly fall into supervised, semi-supervised, and unsupervised classes. There are a daunting number of learning algorithms created every year, but the ANN remains one of the most interesting as well as one of the most successful on various tasks. And as we will see below, you can think of the weights and biases that an ANN learns as a program that runs on the ANN. In this blog we will consider only Feed Forward ANNs. Feed Forward ANNs form directed acyclic graphs (DAGs), whereas in Recurrent Neural Networks (RNNs) the connections between units are allowed to form directed cycles. This allows RNNs to model memory and time in ways that a Feed Forward ANN can not. See, for example, Long Short Term Memory (LSTM) networks for an interesting class of RNNs.

As described in an earlier blog, an ANN is comprised of Artificial Neurons which are arranged in input, hidden, and output layers. Figure 6 reviews the structure and computational aspects of an Artificial Neuron and compares it to its biological counterpart.


Figure 6: Biological and Artificial Neurons

As described in Demystifying Artificial Neural Networks, Part 1, a single artificial neuron (AN) can compute functions with a linear decision boundary. The intuition here is that if you can draw a straight line separating the values of the function that the AN is trying to compute, then that function is said to be linearly separable and is computable by a single AN. This is depicted in Figure 7.


Figure 7: OR and AND are linearly separable

On the other hand, functions such as XOR are not linearly separable, since a straight line can not be drawn that separates the values of the XOR function (this result generalizes to a hyper-plane in higher dimensional spaces).  This is shown in Figure 8.  Note however that we may be able to transform the input into something that is linearly separable with another layer of ANs as shown in the right side of Figure 8.


Figure 8: XOR is not linearly separable

A single hidden layer ANN is depicted in Figure 9. The hidden layer is called hidden because it is neither input nor output, and its representation of the salient features of the input is not "seen". From this example we might guess that if we have more layers in our ANN we can compute progressively more complex functions. So maybe the "deeper" the network (more layers), the more we can compute? It turns out that this intuition is for the most part correct.


Figure 9: Single Hidden Layer ANN

So we saw that a single AN can compute functions that are linearly separable. What about ANNs with a single hidden layer? Well, it turns out that there is a Universal Approximation Theorem which states that "a single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units". Well, that's cool. The problem is that the single layer ANN may need exponentially many hidden units, and there is no guarantee that a learning algorithm can find the necessary parameters.

Feed Forward Neural Networks

The computation of the activation of a single AN is shown in Figure 10. The activation (or output) of the neuron, h_θ(x), is computed from the linear sum of the weights w_i times the input values x_i plus a bias term b. Frequently a "pre-activation", usually denoted a(x), is computed as an intermediate result, where a(x) = b + Σ_i w_i x_i.

A cartoon of the basic forward propagation in a Feed Forward Neural Network is shown in Figure 11. The basic idea is that you calculate the output of each neuron at each layer, starting with layer 1 (the inputs to layer 2 are the outputs from layer 1, etc.), and propagate those results forward through the network to the output layer, which is why it's called a feed-forward network.

More formally, we see that the output of a single AN is h_θ(x) = g(b + Σ_i w_i x_i), or just g(a(x)). h_θ(x) is called the hypothesis (for historical reasons), and is parametrized by θ, the set of weights and biases that have been learned so far. This is called the feed forward activation of the AN (you might wonder where the w_i's and the b's come from…more on that in a minute). Popular choices for the activation function g are shown in Figure 12.


Figure 10: Artificial Neuron Computation (graphic courtesy Hugo Larochelle)


Figure 11: Feed Forward Processing

I'll just note here that popular ANNs use non-linear activation functions such as the logistic or hyperbolic tangent activation functions. A notable exception is the Google deep auto-encoder, which uses a linear activation function. In addition, the Rectified Linear Unit (ReLU) has recently become a popular activation function; this is interesting because the ReLU, f(x) = max(0, x), doesn't have a smooth derivative (or a derivative at all at x = 0), which will make the optimization problem described below a bit harder. A smooth approximation of the ReLU is the softplus function, f(x) = ln(1 + e^x). Kind of amazingly, the derivative of the softplus is f′(x) = 1/(1 + e^(−x)), which is just the logistic function shown in Figure 12. Interesting…


Figure 12: Popular Activation Functions

So how do we compute the feed forward output of a more complex ANN? The feed forward computation for a single hidden layer single output function ANN is shown in Figure 13.


Figure 13: Single Hidden Layer ANN Feed Forward Computation

As you can see from Figure 13, we calculate the activations for the (single) hidden layer in the same way we did for the single AN; namely, for a hidden unit i in layer 1 (the single hidden layer), the activation is simply g(a_i(x)). We can generalize this to L hidden layers, as shown in Figure 14. In general, the higher hidden layers of an ANN compute successively more abstract representations of the input data.


Figure 14: L-Layer Feed Forward Neural Network

The final piece of the feed forward computation is the output function. In Figure 14, the output layer is layer 4, or the top layer of the ANN, and the computation works in the same way. Suppose we have L hidden layers. Then by convention the input layer is called layer 0 and hence the output layer is layer L+1. The input to each output unit i is the vector of activations from layer L, and the activation of unit i in the output layer is o(a_i(x)). The output function o varies depending on what we're trying to accomplish. For example, if we're building an n-way classifier, we might use the softmax function as the output function o.
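
Putting the pieces of Figures 13 and 14 together, here is a minimal numpy sketch of the feed forward pass: sigmoid hidden layers, a softmax output function for an n-way classifier, and layer sizes chosen arbitrarily for illustration (not any particular network from the figures).

```python
# A minimal numpy sketch of the feed-forward pass described in Figures 13 and 14:
# sigmoid hidden layers and a softmax output layer (layer sizes chosen arbitrarily).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())          # subtract the max for numerical stability
    return e / e.sum()

def feed_forward(x, weights, biases):
    """weights[k], biases[k] map layer k's activations to layer k+1's pre-activations."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = W @ h + b                # pre-activation a(x) = b + W h
        h = sigmoid(a)               # hidden activation h = g(a(x))
    a_out = weights[-1] @ h + biases[-1]
    return softmax(a_out)            # output function o(.) for an n-way classifier

rng = np.random.RandomState(0)
sizes = [4, 5, 5, 3]                                     # input, two hidden layers, 3-way output
weights = [rng.randn(m, n) for n, m in zip(sizes[:-1], sizes[1:])]
biases  = [rng.randn(m) for m in sizes[1:]]

x = rng.randn(4)
print(feed_forward(x, weights, biases))                  # sums to 1 across the 3 classes
```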

 Ok, But Where Do the Weights and Biases Come From?

Learning in ANNs comes down to learning the weights and biases (this is a form of Hebbian Learning, which is named after Canadian psychologist Donald Hebb). So for our purposes, the Hebbian principle at work is that learning consists of adjusting weights and biases. So how are the weights and biases learned?

Overview of the Learning Process

I suspect that this blog is already a tl;dr, so I'm just going to give an overview of how training works and I'll leave the details (there are many) for the next blog. So we know, given an ANN, how to calculate the feed forward output h_θ(x) of the ANN for a given set θ of weights w_i and biases b_i. Figure 15 reviews the feed-forward stage for an ANN.

 


Figure 15: Feed-Forward Calculation

How might you code up the feed forward step in python? Figure 16 shows how to do it with pybrain. These days I tend to use pylearn2, but there are lots of choices. Others include scikit-learn, theano, weka (if you like java), and many more. There are also quite a few languages that are primarily built for numerical computation (and as such have key functions built in), such as octave.


Figure 16: Build and activate a Feed Forward ANN in pybrain
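
Since Figure 16 is an image, here is roughly the same thing in text form; this is my reconstruction using pybrain's buildNetwork shortcut, not the exact code in the figure.

```python
# Roughly what Figure 16 shows: build and activate a Feed Forward ANN with pybrain
# (my reconstruction using pybrain's buildNetwork shortcut, not the exact figure).
from pybrain.tools.shortcuts import buildNetwork

net = buildNetwork(2, 3, 1)      # 2 inputs, one hidden layer of 3 units, 1 output
print(net.activate([2, 1]))      # feed-forward pass for the input (2, 1)
```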

For this discussion, assume we are using supervised learning (we'll talk about the unsupervised case in the next blog). In practice this means that for each input x_i we also know the label y_i. The y_i's will typically be something like "this is a cat", "this is a motorcycle", etc.

Estimating The Loss (or Cost, Error) of Our Prediction

Now that we know how to calculate h_θ(x_i) and we are given y_i, we can get some intuition about what kind of error or loss the current value of θ produces for each x_i. As a first approximation, a reasonable guess for the current error might be something like h_θ(x_i) − y_i, or maybe (h_θ(x_i) − y_i)^2 (the standard squared error). So a candidate loss function might look something like l(θ) = Σ_i (h_θ(x_i) − y_i)^2; this is a typical loss function for linear regression (note that the convention is to use a bold font to indicate that you are talking about a vector and a subscript when you're talking about an element of the vector). BTW, construction of loss functions is a big part of building a machine learning system. I'll just note here that loss functions are also called cost functions and are sometimes denoted J(θ).

Given our loss function l, a simple learning algorithm might look something like:

  1. Initialize the parameters θ to random values
  2. For each x_i in our training set T
    1. Calculate h_θ(x_i)                                                      # forward propagation step
    2. Calculate the loss l(h_θ(x_i), y_i)
    3. Adjust θ so that our error l(h_θ(x_i), y_i) is minimized
  3. End
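
To make that pseudo-code concrete, here is a toy version for a single sigmoid neuron trained with gradient descent on the squared-error loss; the data, learning rate, and number of passes are arbitrary choices for illustration (and we haven't talked about gradients yet, so treat the update step as a preview of what's coming).

```python
# A concrete (if toy) version of the pseudo-code above: learn the weights and bias of a
# single sigmoid neuron by gradient descent on the squared-error loss.
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 2)                      # training inputs x_i
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # labels y_i (a linearly separable toy problem)

w = rng.randn(2)                           # 1. initialize theta = (w, b) to random values
b = 0.0
lr = 0.5

for epoch in range(200):
    for x_i, y_i in zip(X, y):             # 2. for each x_i in the training set
        a = w @ x_i + b                    #    pre-activation a(x)
        h = 1.0 / (1.0 + np.exp(-a))       #    forward propagation: h_theta(x_i) = g(a(x))
        loss = (h - y_i) ** 2              #    loss l(h_theta(x_i), y_i)
        grad = 2 * (h - y_i) * h * (1 - h) #    d loss / d a, via the chain rule
        w -= lr * grad * x_i               #    adjust theta to reduce the loss
        b -= lr * grad

preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(float)
print("training accuracy:", (preds == y).mean())
```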

We might consider this to be “learning”.  Kinda makes sense. We want the error of our prediction to be minimized. So here is the important leap:

Learning is, among other things,  an optimization problem

More generally, learning algorithms that try to minimize the error (loss) such as described in the pseudo-code above are instances of Empirical Risk Minimization (ERM). What ERM tries to do is to find the values of θ that minimize the value of the loss function on the training data set. Basically what we want to do is find the weights and biases that minimize our prediction error. In the algorithm described above, the loss can be minimized using the back propagation algorithm. Backpropagation is an abbreviation for the "backward propagation of errors", and is a common method of training artificial neural networks that is used in conjunction with an optimization method such as gradient descent. The method calculates the gradient of the loss function with respect to all the weights in the network (hint: a gradient is just a fancy word for the generalization of the usual concept of the derivative of a function in one dimension to a function in several dimensions). The gradient is fed to the optimization method, which in turn uses it to update the weights θ in an attempt to minimize the loss function. As we'll see below, there are ways for this to go wrong, such as getting caught in a local minimum when the function being optimized is non-convex.

Solving this optimization problem is one of the major challenges for machine learning (along with choosing a good representation and finding an effective objective/loss function), and there are many techniques that yield good results in a variety of learning settings. And of course, these problems are typically NP-hard, so exact solutions aren't going to be feasible in most cases. As a result, many of the problems that we want to solve with machine learning do not have known analytic solutions for the optimization problem (in fact, only the "easy" problems such as linear regression have closed forms). Consequently most machine learning optimization problems are solved with a numerical method such as Stochastic Gradient Descent, L-BFGS, or Newton's method.

Just one more quick comment on all of this: how many parameters (θ) are being learned in a state of the art machine learning system? Another way to ask this is to ask what is the size of the optimization problem that these systems are solving?  Literally billions, which gives you some idea of what kind of computational resources are required and why GPUs are so popular with folks who implement machine learning systems.

Brief Aside On Optimization

Solving large, sometimes non-convex, optimization problems is a key challenge for machine learning. Why is non-convex optimization hard? There are many reasons, but let's look at one obvious feature of convex vs. non-convex optimization. First, a technique like gradient descent essentially follows the gradient "down", looking for the minimum value; makes sense, if the "slope" is pointing down, maybe that is the direction of the minimum. So in the case of a convex function such as the one depicted in Figure 17, this works perfectly. As you can see, for a convex function any local minimum is also a global minimum, a very nice feature indeed. The bottom line here is that if you are trying to minimize a convex objective function, you can start anywhere and get to the global minimum.

 


Figure 17: Convex Loss Function

Unfortunately, almost any interesting loss function we might want to use for machine learning will be non-convex. A typical non-convex “landscape” is shown in Figure 18.


Figure 18: Non-Convex Loss Function

As you can see from Figure 18, if you start in the "wrong" place, you might follow the gradient down into a local minimum which is not the global minimum. In other words, the optimization procedure can get stuck in a local minimum, since in such a minimum everywhere you "look" the gradient points upward; as a result you think you're at the bottom (but you aren't) and you're stuck. So in the case of non-convex loss functions, the optimization procedure will be sensitive to the initial settings of θ, suggesting that random initialization might not be a sensible way to initialize θ.
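
Here is a tiny illustration of that sensitivity to the starting point: gradient descent on a made-up non-convex 1-D function ends up in different minima depending on where it starts (the function is just for illustration, not a real loss).

```python
# Gradient descent on a simple non-convex 1-D function: where you end up depends on
# where you start (the function here is just an illustration, not a real loss).
f      = lambda x: x**4 - 3 * x**2 + x         # non-convex: two minima of different depth
f_grad = lambda x: 4 * x**3 - 6 * x + 1        # its derivative

def descend(x0, lr=0.01, steps=1000):
    x = x0
    for _ in range(steps):
        x -= lr * f_grad(x)                    # follow the gradient "down"
    return x

for start in (-2.0, 2.0):
    x_min = descend(start)
    print(f"start {start:+.1f} -> x = {x_min:+.3f}, f(x) = {f(x_min):+.3f}")
```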

Given that this is tl;dr already, we’ll talk about methods for solving the optimization problem in the next blog (so dust off your vector calculus and brush up on gradients and Hessian matrices).

Keeping Your Options Open: The Anthropic Problem, Entropy, and AI

The Anthropic Problem

One of the biggest meta-problems in cosmology over the past decade or so has been the so-called Anthropic Problem, which asks: Why does the universe have the properties that it does? The standard answer to this question has traditionally been (approximately): Because if it didn't have the properties it does, we wouldn't be able to exist to ask the question. Another answer can be found in the concept of naturalness, which proposes a weaker form of the same answer.

The Entropic Principle

There is, however, another way to look at the Anthropic Problem. Raphael Bousso and his colleagues have articulated the Entropic Principle, which holds that universes that create the maximum amount of entropy over their lifetimes (clipping out certain inconvenient portions of the universe like black holes, which are causally disconnected from the rest of the universe) tend to reproduce certain critical values that we observe in our own universe, such as the Cosmological Constant. Huh? That is a pretty amazing result, and it makes an immediate connection between maximizing causal entropy (i.e., keeping options open) and intelligence. A further discussion of the Entropic Principle and its implications for intelligent observers can be found in Wissner-Gross et al.'s perhaps controversial discussion of Causal Entropic Forces. Wissner-Gross' approach has gained traction over the past few years in the Artificial General Intelligence (AGI) community as a candidate for the physical basis of AGI.

Game Playing and Keeping Your Options Open

Another example of keeping options open surrounds AI game playing. Note that it took years of effort before a chess playing program could defeat a human champion, which finally happened when Deep Blue defeated Garry Kasparov back in 1997. After Kasparov was defeated, the next "Grand Challenge" for AI seemed to be the game of Go, since Go has a vastly larger search space (Chess has on the order of 10^120 possible games, while Go has on the order of 10^761 possible games). Somewhat surprisingly, machine Go players such as MoGo are already competitive with humans. How did MoGo and its successors accomplish this feat? Perhaps paradoxically, these Go players use a very simple algorithm known as Monte Carlo Tree Search (MCTS).

Basically, MCTS is a form of sparse tree sampling that simply looks at the set of all possible legal next moves, plays the game out from each one against a random player making random moves, and then chooses the moves that win most often. To a first order of approximation MCTS is nothing more complex than this, and almost all MCTS optimizations are about pruning the search space. In summary, what appeared to be something of a Grand Challenge for AI was conquered by an incredibly simple algorithm. But why?
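
For the curious, here is the core idea in its simplest "flat" form (random playouts from each legal move, keep the move with the best win rate). The GameState interface (legal_moves(), play(), is_over(), winner()) is hypothetical, not a real library, and a real MCTS adds a search tree and smarter playout policies on top of this.

```python
# The core of Monte Carlo search in its simplest ("flat") form: for each legal next
# move, play many games out with random moves and keep the move that wins most often.
# GameState (with legal_moves(), play(), is_over(), winner()) is a hypothetical interface,
# not a real library; a real MCTS adds a tree and smarter playout policies on top of this.
import random

def random_playout(state, player):
    while not state.is_over():
        state = state.play(random.choice(state.legal_moves()))
    return 1 if state.winner() == player else 0

def choose_move(state, player, playouts_per_move=200):
    best_move, best_rate = None, -1.0
    for move in state.legal_moves():
        wins = sum(random_playout(state.play(move), player)
                   for _ in range(playouts_per_move))
        rate = wins / playouts_per_move
        if rate > best_rate:
            best_move, best_rate = move, rate
    return best_move
```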

Hints

So what are we to make of this apparent contradiction? That is, what is the hint here? Well, MCTS optimizes the number of future choices (paths) that are available to the player (i.e., it is maximizing causal entropy). According to Wissner-Gross, the idea then is that this is evidence that there is some deep and profound connection between causal entropy (keeping the future open) and intelligence.

Deep Connections

So we are getting hints from AGI and Cosmology (among other disciplines) that seem to imply that there is some sort of deep connection between causal entropy production and intelligent observer concentrations in a universe.  The same thing said in Wissner-Gross’ vernacular:

Keeping Options Open implies Capturing Possible Futures implies Constrained  Maximization of Causal Entropy implies Causal Entropic Force

Ok, so what is the takeaway?  Quite simply:  keep your options open.

Why I stood down as Chair of the OpenDaylight Technical Steering Committee


Over the past several days quite a few people have asked me why I stood down as the chair of the OpenDaylight Project Technical Steering Committee (TSC). Basically, I had three reasons:

  1. First, I have things I want to accomplish which require my focused attention. And note that the TSC chair job is a substantial time commitment, definitely not a hobby.
  2. Next, the project has matured to the point where the TSC should for the most part be populated by active developers (a well accepted principle of operation in open source projects);  I am currently not an active developer on OpenDaylight.
  3. Finally, I wanted to give some other (hopefully younger) people a chance to have leadership roles in the project; that is not only good for the community but it also creates organizational survivability for the project (among other things).

Colin Dixon is off to a great start as the new chair of the TSC.  Good things are in the offing for OpenDaylight.

Intriguing properties of neural networks?

Intriguing?

Last year an interesting paper entitled Intriguing properties of neural networks pointed out what could be considered systemic “blind spots” in deep neural networks.  Much of this work has focused on what are called Convolutional Neural Networks or CNNs. CNNs are a form of Multilayer Artificial Neural Network that have had great success in a variety of classification tasks such as image recognition. Figure 1 shows a typical CNN.

 


Figure 1: Typical CNN

Convolution and Pooling/Subsampling

Two of the important features of CNNs shown in Figure 1 are the alternation of  convolutional and pooling or subsampling layers. The basic idea behind convolution is to produce a measure of the overlap between two functions. In this case we convolve the part of the image in the Receptive Field of a neuron (the part of the image that a set of features can “see”; Figure 1 shows a receptive field in the bottom right corner of the input, the “A” ) with a filter that looks for some element in the image such as a vertical line, edge, or corner. This is important in image processing as this overlap describes what are in some sense the translation independent features of the image. The second operation, pooling/subsampling not only reduces the dimensionality of the input but does a kind of  model averaging (in much the same way as dropout averages models; BTW, dropout also has the amazing property that it prevents co-adaptation of feature detectors which improves a network’s ability to generalize). Two popular pooling techniques are max-pooling, where you take the maximum value of the pooled region and average-pooling, where (not surprisingly) you use the average value of the pooled region.

In both cases a primary goal is to extract invariant features (i.e., those that are independent of some set of translations, scaling, etc) from the input stream. This is of critical importance; for example, you want to be able to recognize a face in an image independent of out of plane rotations and the like. I highly recommend the introduction by Hugo Larochelle on this topic if you have further interest. See also  Visualizing and Understanding Convolutional Networks.

Convolution and Pooling are depicted in cartoon form in Figure 2.


Figure 2: Convolution and Pooling/Subsampling
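
Here is a minimal numpy sketch of the two operations in Figure 2: a "valid" 2-D convolution of an image with a small filter, followed by 2×2 max-pooling. The toy image and the vertical-edge filter are made up for illustration.

```python
# A minimal numpy sketch of the two operations in Figure 2: a 2-D ("valid") convolution
# of an image with a small filter, followed by 2x2 max-pooling.
import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # overlap between the filter and the receptive field at (i, j)
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    h, w = feature_map.shape[0] // size, feature_map.shape[1] // size
    return feature_map[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.zeros((8, 8))
image[:, 4] = 1.0                                 # a vertical line in the image
kernel = np.array([[1.0, -1.0]])                  # a crude vertical-edge detector

features = conv2d(image, kernel)                  # responds where the line is
print(max_pool(features).shape)                   # pooled (subsampled) feature map
```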

So why is the paper cited above "intriguing"? Well, here's why: there were two main results. The first result is that they show that there are unlikely to be higher-level Grandmother cells (neurons that are tuned to detect some specific input); rather, it appears that the semantic information in the high layers of a deep neural network is a property of the network space itself (this result didn't jump out at me as shocking). This is at least somewhat at odds with much of the contemporary thinking in the Machine Learning community, where higher-layer neurons are thought to be discrete feature detectors.

A perhaps more interesting result is that they show that deep neural networks learn input-output mappings that can be discontinuous in the behavior of the output manifold (basically the invariance properties that were created, in part, by the convolution and pooling/subsampling layers). In particular, they construct "adversarial" examples by applying a certain imperceptible perturbation (found by maximizing the network's prediction error) which causes the network to misclassify the image; basically what is happening here is that the crafted perturbation causes the invariance to break down and the example jumps off the output manifold. This appears to be a universal property: they show the effect for several different kinds of networks. This effect is shown in Figure 3, where an "imperceptible" change causes a bus to be misclassified as an ostrich. The figure on the left in Figure 3 is the correctly classified image, the image on the right is the changed/misclassified image, and the figure in the middle is the "difference". It is thought that the "ghosting" effect (middle image) in the adversarial image is somehow key, but the paper doesn't explore this further.


Figure 3: Adversarial Examples

See also  Encoded Invariance in Convolutional Neural Networks and Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images for similar results (using different algorithms) and further discussion.

My Takeaway

My experience has been that when you see results such as these, it usually indicates that something very fundamental is being missed. There is no reason to believe this case to be any different. In fact, in discussing the great successes that CNNs have enjoyed, especially in image classification tasks, Zeiler and Fergus lament that

Despite this encouraging progress, there is still little insight into the internal operation and behavior of these complex models, or how they achieve such good performance. From a scientific standpoint, this is deeply unsatisfactory. Without clear understanding of how and why they work, the development of better models is reduced to trial-and-error.

I have to agree with this sentiment and further,  the paper we’ve been discussing is more or less evidence of just this situation. There is great depth and breadth of ideas and literature in the Machine Learning community (dauntingly so) but even with our wealth of research and implementation and deployment experience, the fundamental principles of machine intelligence remain largely hidden.

Interesting Ideas From Artificial General Intelligence (AGI)

BTW, an example of a very different (and apparently not well accepted) approach to machine intelligence is Jeff Hawkins' Hierarchical Temporal Memory idea. He posits that the operation of the brain (and therefore intelligence) is more about memory than it is about computation. Hence, one of his core principles of machine intelligence is something he calls Variable Order Sequence Memory, or VOSM. VOSM is heavily biologically-inspired, and an example is shown in Figure 4. Note that while VOSM itself is not a mainstream idea, the idea of sparseness and its connection to intelligence is well established (see, for example, Olshausen and Field for the foundational work in this area).

The central idea behind VOSMs is that if you represent things (objects, ideas, music, …) with what are called Sparse Distributed Representations, or SDRs, you can solve many of the problems that have vexed the machine learning and AI communities for decades; essentially you can solve what is known as the Knowledge Representation problem. You can also represent time in a VOSM, a dimension notably missing in most Artificial Neural Networks and an obviously necessary component of intelligent reasoning and prediction. For example, the different contexts in a VOSM can be thought of as representing different temporal sequences. A comparison of Dense and Sparse representations is shown in Figure 5. BTW, it's pretty clear that sparseness is not just a nice feature of intelligence, it is a basic requirement (we should spend some time thinking about the difference between machine learning and machine intelligence, i.e., "narrow AI" vs. Artificial General Intelligence, but I'll leave that for future blogs).


Figure 4: Variable Order Sequence Memory (VOSM)


Figure 5: Dense vs. Sparse Representations

The representations shown in Figure 4 are sparse because most bits are zero. This is also illustrated in Figure 5. Sparseness has many advantages, including that each bit can have semantic meaning (contrast dense encodings such as ASCII) and that subsampling works well because the probability of any bit being on at random is very small (1/10^40 in the example in Figure 4). BTW, the number of different contexts for one input that a VOSM in the neocortex can represent is thought to be something like 6^2000, a pretty large number. The 6 comes from the number of layers in the neocortex (see Figure 6; columns of height 6) and the 2000 because it appears there are on average about 2000 polarized dendrites in an active region of a neuron (you can think of this as a 2000 bit sparse "word"). In theory the number of possible contexts for one input is actually quite a bit larger since memory in the neocortex is arranged hierarchically.


Figure 6: Layers of the Neocortex
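
To get a feel for why SDRs are attractive, here is a toy numpy illustration: two unrelated SDRs barely overlap, while a subsampled copy of an SDR still matches its original well. The vector length and sparsity are toy values, much smaller than the 2000-bit words discussed above.

```python
# A toy illustration of Sparse Distributed Representations: compare two SDRs by the
# overlap of their "on" bits, and note how unlikely a random collision is.
import numpy as np

rng = np.random.RandomState(0)
n_bits, n_on = 1024, 20                           # roughly 2% of bits on

def random_sdr():
    sdr = np.zeros(n_bits, dtype=int)
    sdr[rng.choice(n_bits, n_on, replace=False)] = 1
    return sdr

a, b = random_sdr(), random_sdr()
noisy_a = a.copy()
flip = rng.choice(np.flatnonzero(a), 3, replace=False)
noisy_a[flip] = 0                                 # drop a few bits (subsampling)

print("overlap(a, b)       =", int(a @ b))        # unrelated SDRs barely overlap
print("overlap(a, noisy_a) =", int(a @ noisy_a))  # a subsampled SDR still matches well
```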

Finally…

Note that most of this thinking is not surprisingly quite controversial, and much of it comes from the study of Artificial General Intelligence, or AGI (contrast AGI with most of what we see in AI today which could be considered “narrow AI”). For more on AGI, see Ben Goertzel‘s work on opencog (and beyond) or Randal Koene on AGI, among many others. In any event, there is a lot going on in both narrow AI and AGI that warrants our attention, both from technological and ethical points of view. That much is for sure. More on this next week.


Demystifying Artificial Neural Networks, Part 1

Biological and Artificial Neurons

This week I thought I'd give you a very brief introduction to one of the most powerful and perhaps most mysterious learning algorithms we have: the Artificial Neural Network, or ANN. This will take us on a journey through what a biological neuron is (and briefly how they work/learn), how an artificial neuron might model a biological neuron, what the computational capabilities of a single artificial neuron look like, and how these artificial neurons can be assembled to learn statistical models of their input data sets. This stuff lies at the intersection of computer science, robotics, systems biology, neuroscience and what we all think of as networking. What could be cooler than that? Let's take a look…

Biological Neurons

Much of machine learning claims to be "biologically inspired". While this is reassuring (some of our brains learn pretty well), it is not required, and many learning algorithms are considered "biologically infeasible". So let's briefly look at the structure of a single biological neuron and consider it as a feasible model for an artificial neuron. Figure 1 shows a typical biological neuron. BTW, in case you are wondering, the human brain is thought to have roughly O(10^11) neurons; you can compare that to a cockroach, which has O(10^6) neurons, or a chimpanzee, which has O(10^10).

For purposes of our discussion, I want to call your attention to three features of biological neurons: First are the dendrites; think of these as the terminals at which the neuron receives its inputs (real dendrites can do computations and have feedback control; this is not typically modeled in an artificial neuron; a notable exception is Jeff Hawkins' Hierarchical Temporal Memory model). Next is the cell body; think of this as where any processing occurs. Finally, the axon carries the output from the cell body to neighboring neurons and their dendrites. Basically the neuron is a "simple" computational device: it receives input at its dendrites, does a computation at the cell body, and carries the output on its axon (I'll just note here that a real neuron is much more complicated; for example, the dendrites themselves can carry out some kinds of computation).


Figure 1: Biological Neuron

Biological neurons communicate across a synapse, which is a chemical “connection”, typically between an axon and dendrite. Signals flow from the axon terminal to receptors on the dendrite, mediated by the chemical state of both the axon and dendrite. So what does learning mean in this setting? According to Hebbian theory (named after Canadian psychologist Donald Hebb), learning  is related to the increase in synaptic efficacy that arises from the presynaptic cell’s repeated and persistent stimulation of the postsynaptic cell.  It is this increase in “communication” efficacy that we call learning.  One way to think about this is that the connection between the axon terminal and the dendrite is weighted, and the larger the weight (concentration of neurotransmitters in the axon terminal along with other chemical components including receptors in the dendrites) the more likely the neuron is to “fire”.  So in this setting learning consists of optimizing neuron firings in certain patterns; to put it a different way, learning consists of optimizing the connection weights between axons and dendrites in a way that leads to some observed behavior. We will return to this idea of optimizing weights as learning when we talk about training Artificial Neural Networks.

Artificial Neurons

Artificial Neurons, or ANs, are biologically inspired algorithms (sorry, move along, no magic here). Figure 2 compares biological and artificial neurons (BTW, I have no idea of how to write mathematical notation in wordpress, so hey, if you know how to do that drop me a note).


Figure 2: Biological and Artificial Neurons

What you can see is that the artificial neuron has inputs, typically labelled x_1 through x_n, and an output y. The inputs x_i roughly correspond to the signals received by the dendrites of a biological neuron, and the output models what the neuron soma (cell body) sends down the axon to its neighbors (i.e., the neuron's output).

Now, in order to calculate the "pre-activation" of the AN, it calculates the sum a(x) = b + Σ_i w_i x_i (b is a bias term). Finally, to calculate the output (the equivalent of what the biological neuron would send on its axon), the AN applies an activation function, usually called g(·), to the sum to get the AN's activation. This winds up looking like this: h(x) = g(a(x)) = g(b + Σ_i w_i x_i).

So here is the really amazing thing: learning in an ANN is in large part about setting weights and biases. This is also a pretty good (although admittedly simplistic) analogy to how biological networks learn; this kind of falls out of Hebbian theory. In fact, the w_i's are frequently called "synaptic weights" as they play a role similar to that of the chemistry of the synapse in learning processes. Of course, computation in a biological neuron is much more complicated, so the artificial neuron is in some sense an "abstraction".

The next question is what the activation function g(·) might look like. Among the first ANs was the perceptron, invented by Frank Rosenblatt and his colleagues in 1957. In modern terms the perceptron was an algorithm for learning a binary classifier. However, the perceptron had a problem that limited its utility. In particular, the perceptron computed a step function. The fact that a step function is by definition not continuous made it mathematically unwieldy; for example, step functions aren't guaranteed to have a derivative at the "step" (i.e., the slope of the curve is infinite/undefined at the step). The activation function cartoon for the perceptron is shown in Figure 3.


Figure 3: Perceptron Activation

Modern ANs use a variety of activation functions which are smoother than the perceptron's step function. The most popular activation functions are shown in Figure 4. Note that the logistic function is sometimes called "sigmoid" because of its shape. These functions are smoother than the step function and as such have nicer mathematical properties.


Figure 4: Activation Functions
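
Putting the pieces so far into a few lines of numpy, here is a single artificial neuron with a logistic activation; the inputs, weights, and bias are random/arbitrary values for illustration.

```python
# The single artificial neuron described above, in a few lines of numpy: pre-activation
# a(x) = b + sum_i w_i * x_i, then the activation g(a(x)) with a logistic g.
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.RandomState(0)
x = rng.randn(3)             # inputs (what the "dendrites" receive)
w = rng.randn(3)             # synaptic weights
b = 0.1                      # bias term

a = w @ x + b                # pre-activation
y = logistic(a)              # activation: what the "axon" carries to the next neuron
print(a, y)
```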

What Can A Single AN Compute?

We've seen that a single AN is comprised of connection weights (the w_i's), a bias term, and an activation function. Given this, what can we say about what a single AN can compute? Figure 5 shows the situation for a sigmoid activation function with two inputs, x_1 and x_2. Note here that I'm using sigmoid and logistic interchangeably when referring to activation functions; BTW, logistic regression doesn't have much to do with linear regression, so don't get confused by that either. In any event, what you can see in Figure 5 is that the "decision boundary" for an AN with a sigmoid activation function is linear ("decision boundary" is an important concept in machine learning that we'll spend more time on later). What this means at a high level is that the values of the two classes that the AN is trying to predict or classify can be separated by a straight line. Similarly, in the general case of multiple classes, a decision boundary is linear if there exists a hyper-plane that separates the classes.

Figure 5: Decision Boundary for a Single Neuron (graphic courtesy Hugo Larochelle)

The intuition here is that a single AN can compute functions with a linear decision boundary. These are frequently called linear features (we will look at how to handle non-linear features in a later blog). As an example, consider the cases shown in Figure 6, starting with the OR function on the left. The triangles indicate where the value of OR is one and the circles indicate where OR is zero. You can draw a straight line separating the triangles and circles, so OR is linearly separable. The result is that since the OR and AND functions are linearly separable, they can be computed by a single AN.


Figure 6: Linearly Separable Problems

On the other hand, the XOR function is not linearly separable; this situation is shown on the left in Figure 7. On the right, we transformed the input (with functions that are computable by a single AN) into something that is linearly separable, and hence the XOR of the transformed input is computable by a single AN (a small sketch of this trick follows Figure 7 below). This gives us the hint that we might be able to compute more complex functions with a multi-layer ANN. As we shall see, that intuition is correct.


Figure 7: XOR is not Linearly Separable (graphic courtesy Hugo Larochelle)
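
As a small sketch of the trick just described, here are hand-picked weights for which a single threshold neuron computes OR, AND, and NAND, and XOR built from an extra "layer" of those neurons; the specific weights are my own choices for illustration.

```python
# Hand-picked weights for which a single (threshold) neuron computes OR and AND, and the
# classic trick of transforming the inputs so that XOR becomes linearly separable:
# XOR(x1, x2) = AND(OR(x1, x2), NAND(x1, x2)).
import numpy as np

def neuron(x, w, b):
    return int(np.dot(w, x) + b > 0)           # a simple threshold activation

def OR(x):   return neuron(x, [1, 1], -0.5)
def AND(x):  return neuron(x, [1, 1], -1.5)
def NAND(x): return neuron(x, [-1, -1], 1.5)

def XOR(x):                                    # needs the extra "layer" of OR and NAND
    return AND([OR(x), NAND(x)])

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, OR(x), AND(x), XOR(x))
```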

Next time we'll take a look at multi-layer ANNs and how to train them (i.e., more on what learning means and how it works). Once we have that machinery, we'll be able to look at much of the cool contemporary work in Machine Learning, including amazing ideas like stacked denoising auto-encoders and Restricted Boltzmann Machines, and some of the beautiful techniques such as dropout which are designed to make your algorithms learn more effectively. Armed with all of that, we'll write some python code that implements a multi-layer ANN and generate some data to test it.

And Now For Something Completely Different

Last week I gave a talk to the Networking Field Day crew about something I'm calling Software Defined Intelligence, or SDI. SDI is a new interdisciplinary approach that integrates Compute, Storage, Networking, Security, Energy, and IoT (and I'm sure many other things) with Machine Learning. Like many "big" ideas, some of the components of SDI may not themselves be new; rather, it is the combination of machine learning with networking which creates a new discipline and which is novel here. From what we have already seen, we know that machine learning is having (and will continue to have) a dramatic effect on the way we build, operate and monetize networks and data centers, as well as diverse technologies including mobile handsets and just about everything else. Indeed, as renowned venture capitalist Vinod Khosla has opined, "In the next 20 years, machine learning will have more impact than mobile has." While Mr. Khosla's statement is perhaps a bit hyperbolic, few who are familiar with machine learning doubt its potential impact (complete with the hypothetical apocalyptic downsides [think Skynet] described by noted physicist Stephen Hawking and his colleagues).

Machine Learning?

Machine Learning is the foundational and enabling technology of SDI, and as such data science and "big data" are an inherent part of SDI. Andrew Ng, former Stanford professor and Coursera co-founder (now at Baidu), puts a finer point on it:

A trained learning algorithm (e.g., neural network, boosting, decision tree, SVM, …) is very complex. But the learning algorithm itself is usually very simple. The complexity of the trained algorithm comes from the data, not the algorithm.

This is a radically different paradigm than the programming approaches that we might have learned in school (or otherwise be used to), in which we write programs that process data to produce output. In the case of machine learning, the output plus data gives us a program that can predict outputs (sometimes called regression) given a new input, classify new inputs (is this an image of a cat or a dog or a motorcycle?), and many other tasks. This difference is depicted in cartoon form in Figure 1.


Figure 1: Traditional Programming vs. Machine Learning

Of course, tasks like regression, classification and correlation analysis are key functions of any network analytics engine. Another key attribute of any analytics engine is whether its actions are proactive or reactive. To understand the difference, consider the example of DDOS mitigation. Today's DDOS mitigation solutions are largely reactive: we collect sFlow (and/or other kinds of) data and analyze it at some analytics backend; if we find that we're seeing a DDOS attack, we change the network's configuration to deal with the offending flows in whatever way the network administrator sees fit. On the other hand, the SDI approach is proactive: we use the configurations, telemetry and flow data to predict the probability that a DDOS is about to occur. This is an instance of a class of techniques that are sometimes collectively called "Predictive Security".
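
To make the proactive idea a bit more concrete, here is a sketch of training a classifier to emit the probability that a DDOS is starting from flow/telemetry features. The features, data, and choice of model are entirely made up for illustration; a real system would plug in its own telemetry pipeline.

```python
# A sketch of the "proactive" idea: train a classifier on flow/telemetry features to emit
# the probability that a DDOS is starting. The features, data, and model choice here are
# entirely made up for illustration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
# Hypothetical features per time window: [flows/sec, SYN ratio, unique src IPs, avg pkt size]
X = rng.rand(1000, 4)
y = (X[:, 0] + X[:, 1] > 1.2).astype(int)      # stand-in labels: 1 = window preceded an attack

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[:800], y[:800])                      # train on historical windows

new_window = X[800:801]
print("P(DDOS starting) =", clf.predict_proba(new_window)[0, 1])
```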

What Was The Promise Of SDN Anyway?

The promise of SDN has always been, among other things, to enable us to build much more intelligent networks. SDI is a framework that is designed to do just this: it takes advantage of the programmability and software orientation of technologies like SDN and combines it with Machine Learning to enable a powerful new class of intelligent networks. These networks are based not only on SDN but also on Software Defined Compute, Storage, Security, Energy, IoT and beyond.

By the way, if you want to write some machine learning code to solve a problem you have or just to explore what is possible, there are many great open source libraries such as scikit-learn and pylearn2 (again, among many others), written in almost whatever language you like. Personally I like python; see  these code snippets for a quickstart on how you can build a support vector machine with various kernels or an artificial neural network that is trained with backpropagation. Finally, the slides and video from last week’s talk can be found on the respective links.
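
Along the lines of the SVM code snippets mentioned above, here is a minimal scikit-learn example that tries a few kernels on a built-in toy data set (the data set and kernels are my choices for illustration).

```python
# A support vector machine with a few different kernels, using scikit-learn
# on a built-in toy data set.
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    clf = svm.SVC(kernel=kernel, gamma="scale")
    clf.fit(X_train, y_train)
    print(kernel, clf.score(X_test, y_test))
```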

There’s a lot to say on this topic but I wanted to get people thinking about the idea of using machine learning in conjunction with SDN (orchestration, …) to build a new and powerful kind of network intelligence. I’ll give you more detail on machine learning theory (and how this all works) as well as more use cases in my next blog on this topic.