Here I continue my effort to write an accessible introduction to Parallel Distributed Processing. In Part 1, I introduced units, connections, and connection strengths. In Part 2, I introduced the concept of the activation function and the difference between linear and saturating (e.g., sigmoidal) activation functions. Here, in Part 3, I briefly describe the difference between shallow and deep networks, at a level hopefully suitable for a beginner. With vampires (and werewolves, etc.)!
5. Complication: Deep Networks

At this point, the network is configured to tell Sookie how much danger, broadly categorized, she faces in any particular romantic configuration. Say, however, that Sookie wants a more detailed breakdown of the peril likely to befall her than simply whether she will be in danger or not. Perhaps she wants to know, for a given situation, what the likelihoods are that she will 1) have her blood drained, 2) be shot by a werewolf, 3) be shot by a human hunting the shapeshifter, or 4) just be penetratingly chastised by the best friend of one of the humans. We might imagine that the likelihood of each of these consequences is related to how much danger Sookie is in overall. Thus, we might extend our network to look more like this:
Figure 7 is a very simple example of what we would call a deep network. It is deep because it has more than two layers. "Deep" networks can be arbitrarily "deep": while Sookie's network has three layers, there's no reason it couldn't have 4, or 5, or 100. In deep networks, we still refer to the inputs as inputs and the outputs as outputs, but all non-input/output layers are called hidden. In Sookie's network, the inputs are still the suitor units, the outputs are now the danger-constituent units, and the DANGER! unit is a hidden unit. In what we read, the input and output units will usually (but not always) be more important to understand than the hidden units. That is because, unlike in Sookie's network, the hidden units won't always refer to easily labeled quantities: they will simply be a way of mathematically transforming the input into the output.
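To make the input → hidden → output flow concrete, here is a minimal sketch of Sookie's three-layer network in Python with NumPy. The connection strengths below are made up purely for illustration (the original figure doesn't give numbers), and I'm using a sigmoid activation function from Part 2:

```python
import numpy as np

def sigmoid(x):
    """Saturating activation function from Part 2."""
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative (made-up) connection strengths:
# three suitor inputs -> one hidden DANGER! unit
w_in_hidden = np.array([0.9, 0.6, 0.3])   # vampire, werewolf, shapeshifter

# hidden DANGER! unit -> four danger-constituent outputs
w_hidden_out = np.array([0.8, 0.7, 0.5, 0.2])  # drained, shot by werewolf,
                                               # shot by hunter, chastised

# One romantic configuration: dating a vampire and a shapeshifter
suitors = np.array([1.0, 0.0, 1.0])

# Forward pass: inputs feed the hidden unit, which feeds the outputs
danger = sigmoid(w_in_hidden @ suitors)   # hidden-unit activation
perils = sigmoid(w_hidden_out * danger)   # activation of each constituent peril
```

The key point of the sketch is the two-step computation: the suitor inputs are combined into the hidden DANGER! activation, and only that hidden value is passed on to compute the four peril outputs.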
The advantage of deep networks is that the deeper they get, the more fine-grained problems they can solve; in this example we have recoded simple "DANGER" into some of its constituent parts. However, deep networks come with a significant disadvantage: the more layers that are added, the longer and harder they are to train. This is a consequence of back-propagation, the training algorithm that is typically used to train this kind of network (though there are other, faster algorithms; training routines for deep networks are an active area of research in machine learning).