Last year an interesting paper entitled Intriguing properties of neural networks pointed out what could be considered systemic “blind spots” in deep neural networks. Much of this work has focused on what are called Convolutional Neural Networks or CNNs. CNNs are a form of Multilayer Artificial Neural Network that have had great success in a variety of classification tasks such as image recognition. Figure 1 shows a typical CNN.
Convolution and Pooling/Subsampling
An important feature of the CNN shown in Figure 1 is the alternation of convolutional and pooling (or subsampling) layers. The basic idea behind convolution is to produce a measure of the overlap between two functions. In this case we convolve the part of the image in the Receptive Field of a neuron (the part of the image that a set of features can “see”; Figure 1 shows a receptive field in the bottom right corner of the input, the “A”) with a filter that looks for some element in the image such as a vertical line, edge, or corner. This is important in image processing because this overlap describes what are, in some sense, the translation-independent features of the image. The second operation, pooling/subsampling, not only reduces the dimensionality of the input but also does a kind of model averaging (in much the same way as dropout averages models; BTW, dropout also has the amazing property that it prevents co-adaptation of feature detectors, which improves a network’s ability to generalize). Two popular pooling techniques are max-pooling, where you take the maximum value of the pooled region, and average-pooling, where (not surprisingly) you use the average value of the pooled region.
In both cases a primary goal is to extract invariant features (i.e., those that are independent of some set of translations, scalings, etc.) from the input stream. This is of critical importance; for example, you want to be able to recognize a face in an image independent of out-of-plane rotations and the like. I highly recommend the introduction by Hugo Larochelle on this topic if you have further interest. See also Visualizing and Understanding Convolutional Networks.
Convolution and Pooling are depicted in cartoon form in Figure 2.
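To make the two operations concrete, here is a minimal NumPy sketch of a single convolution followed by max- and average-pooling. The 6x6 image, the 3x3 vertical-edge filter, and the function names are all made up for illustration; real CNNs learn their filters and use heavily optimized library routines rather than Python loops.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding).

    As in most CNN libraries, the kernel is not flipped, so strictly
    speaking this is a cross-correlation.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # This patch is the receptive field of output unit (i, j)
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling over size x size regions."""
    h = feature_map.shape[0] // size * size
    w = feature_map.shape[1] // size * size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

# Toy 6x6 image with a vertical edge: dark on the left, bright on the right
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hypothetical hand-written filter that responds to vertical edges
vertical_edge = np.array([[-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0]])

feature_map = convolve2d_valid(image, vertical_edge)      # shape (4, 4)
max_pooled = pool2d(feature_map, size=2, mode="max")      # shape (2, 2)
avg_pooled = pool2d(feature_map, size=2, mode="average")  # shape (2, 2)
```

The feature map is large only where the filter overlaps the edge, and pooling halves each spatial dimension while keeping a summary (the maximum or the mean) of each 2x2 region.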
So why is the paper cited above “intriguing”? Well, here’s why: There were two main results. The first result is that they show that there are unlikely to be higher-level Grandmother cells (neurons that are tuned to detect some specific input); rather it appears that the semantic information in the high layers of a deep neural network is a property of the network space itself (this result didn’t jump out at me as shocking). This is at least somewhat at odds with much of the contemporary thinking in the Machine Learning community where higher-layer neurons are thought to be discrete feature detectors.
A perhaps more interesting result is that they show that deep neural networks learn input-output mappings that can be discontinuous in the behavior of the output manifold (basically the invariance properties that were created, in part, by the convolution and pooling/subsampling layers). In particular, they construct “adversarial” examples by applying an imperceptible perturbation to an image. The perturbation is found by maximizing the network’s prediction error, and it causes the network to misclassify the image; basically what is happening here is that the crafted perturbation causes the invariance to break down and the example jumps off the output manifold. This appears to be a universal property: they show the effect for several different kinds of networks. This effect is shown in Figure 3, where an “imperceptible” change causes a bus to be misclassified as an ostrich. The image on the left in Figure 3 is the correctly classified image, the image on the right is the changed/misclassified image, and the image in the middle is the “difference”. It is thought that the “ghosting” effect (middle image) in the adversarial image is somehow key, but the paper doesn’t explore this further.
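The flavor of an adversarial perturbation can be sketched on a toy model. Note that this is not the paper's procedure: the authors used a box-constrained L-BFGS optimization against deep networks, whereas the sketch below takes a single gradient-sign step against a hand-built logistic classifier. The weights, the input, and the step size `epsilon` are all assumptions chosen to make the effect easy to see.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear classifier over d "pixels" (weights made up, not learned)
d = 1000
w = np.where(np.arange(d) % 2 == 0, 1.0, -1.0)
b = 0.0

def predict(v):
    """Probability that input v belongs to class 1."""
    return sigmoid(w @ v + b)

# A mid-gray input nudged slightly toward class 1; true label y = 1
x = 0.5 + 0.002 * w
y = 1.0
p = predict(x)                      # sigmoid(2.0), confidently class 1

# Gradient of the cross-entropy loss with respect to the INPUT (not the weights)
grad_x = (p - y) * w

# Move every pixel a tiny amount in the direction that increases the loss
epsilon = 0.005
x_adv = x + epsilon * np.sign(grad_x)
p_adv = predict(x_adv)              # sigmoid(-3.0), now class 0

print(f"original:    p(class 1) = {p:.3f}")
print(f"adversarial: p(class 1) = {p_adv:.3f}")
print(f"largest change to any pixel = {np.max(np.abs(x_adv - x)):.3f}")
```

No pixel moves by more than 0.005 on a 0-to-1 scale, yet the 1000 small changes all push the score the same way and the predicted class flips.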
See also Encoded Invariance in Convolutional Neural Networks and Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images for similar results (using different algorithms) and further discussion.
My experience has been that when you see results such as these, it usually indicates that something very fundamental is being missed. There is no reason to believe that this case is any different. In fact, in discussing the great successes that CNNs have enjoyed, especially in image classification tasks, Zeiler and Fergus lament that
Despite this encouraging progress, there is still little insight into the internal operation and behavior of these complex models, or how they achieve such good performance. From a scientific standpoint, this is deeply unsatisfactory. Without clear understanding of how and why they work, the development of better models is reduced to trial-and-error.
I have to agree with this sentiment and further, the paper we’ve been discussing is more or less evidence of just this situation. There is great depth and breadth of ideas and literature in the Machine Learning community (dauntingly so) but even with our wealth of research and implementation and deployment experience, the fundamental principles of machine intelligence remain largely hidden.
Interesting Ideas From Artificial General Intelligence (AGI)
BTW, an example of a very different (and apparently not well accepted) approach to machine intelligence is Jeff Hawkins’ Hierarchical Temporal Memory idea. He posits that the operation of the brain (and therefore intelligence) is more about memory than it is about computation. Hence, one of his core principles of machine intelligence is something he calls Variable Order Sequence Memory, or VOSM. VOSM is heavily biologically inspired, and an example is shown in Figure 4. Note that while VOSM itself is not a mainstream idea, the idea of sparseness and its connection to intelligence is well established (see, for example, Olshausen and Field for the foundational work in this area).
The central idea behind VOSMs is that if you represent things (objects, ideas, music, …) with what are called Sparse Distributed Representations or SDRs, you can solve many of the problems that have vexed the machine learning and AI communities for decades; essentially you can solve what is known as the Knowledge Representation problem. You can also represent time in a VOSM, a dimension notably missing in most Artificial Neural Networks and an obviously necessary component of intelligent reasoning and prediction. For example, the different contexts in a VOSM can be thought of as representing different temporal sequences. A comparison of Dense and Sparse representations is shown in Figure 5. BTW, it’s pretty clear that sparseness is not just a nice feature of intelligence, it is a basic requirement. (We should spend some time thinking about the difference between machine learning and machine intelligence, i.e., “narrow AI” vs. Artificial General Intelligence, but I’ll leave that for future blogs.)
The representations shown in Figure 4 are sparse because most bits are zero. This is also illustrated in Figure 5. Sparseness has many advantages, including that each bit can have semantic meaning (contrast dense encodings such as ASCII) and that subsampling works well because the probability of any bit being on at random is very small ( in the example in Figure 4). BTW, the number of different contexts for one input that a VOSM in the neocortex can represent is thought to be something like $6^{2000}$, a pretty large number. The 6 comes from the number of layers in the neocortex (see Figure 6; columns of height 6) and the 2000 from the observation that there appear to be, on average, about 2000 polarized dendrites in an active region of a neuron (you can think of this as a 2000-bit sparse “word”). In theory the number of possible contexts for one input is actually quite a bit larger since memory in the neocortex is arranged hierarchically.
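The claim that subsampling works well for sparse representations can be checked with a little counting. The sketch below builds random SDRs and computes the chance that an unrelated SDR happens to contain every bit of a small subsample. The sizes (2048 bits with 40 active) and the function names are assumptions for illustration; they are not taken from Figure 4.

```python
import numpy as np
from math import comb

N_BITS = 2048    # assumed length of the representation
N_ACTIVE = 40    # assumed number of "on" bits (roughly 2% sparsity)

rng = np.random.default_rng(0)

def random_sdr(rng, n_bits=N_BITS, n_active=N_ACTIVE):
    """Return the sorted indices of the active bits of a random SDR."""
    return np.sort(rng.choice(n_bits, size=n_active, replace=False))

def overlap(a, b):
    """Number of active bits that two SDRs share."""
    return len(np.intersect1d(a, b))

def false_match_probability(n_bits, n_active, k):
    """Chance that a random SDR contains all k bits of a fixed subsample."""
    return comb(n_bits - k, n_active - k) / comb(n_bits, n_active)

a = random_sdr(rng)
b = random_sdr(rng)

# Keep only 10 of a's 40 active bits; they all still match a
subsample = rng.choice(a, size=10, replace=False)

print("overlap(a, a)         =", overlap(a, a))
print("overlap(a, b)         =", overlap(a, b))
print("overlap(subsample, a) =", overlap(subsample, a))
print("P(random SDR matches the 10-bit subsample) =",
      false_match_probability(N_BITS, N_ACTIVE, 10))
```

Under these assumed sizes, two unrelated SDRs share very few bits, while the chance of a random SDR matching even a 10-bit subsample is vanishingly small, which is why a subsample can stand in for the whole pattern.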
Note that most of this thinking is, not surprisingly, quite controversial, and much of it comes from the study of Artificial General Intelligence, or AGI (contrast AGI with most of what we see in AI today, which could be considered “narrow AI”). For more on AGI, see Ben Goertzel’s work on OpenCog (and beyond) or Randal Koene on AGI, among many others. In any event, there is a lot going on in both narrow AI and AGI that warrants our attention, both from technological and ethical points of view. That much is for sure. More on this next week.