Wednesday, October 31, 2012

Center Embedding

Hello LAUGers!

You all know that I simply can't discuss computational linguistics without some reference to Lt. Cmdr Data.  Here is a really amazing video of Brent Spiner working himself up into a quadruple center embedding:


Hopefully my students will benefit from this amazing example and forever connect it with Chomskyan linguistics and the Competence / Performance debate.

Wednesday, August 15, 2012

Better than a punch in the arm

Scientists convince their friends to get punched in the arm by robots... for science.

I'm assigning this article to my intro modeling class to try and get them thinking about AI representations.  A robot doesn't know what "pain" is, so you can't just tell it "don't cause pain."  Instead, you have to operationalize "pain":  that's what this study is trying to do by saying "look, it is bad if you move at X speed when you are grasping Y implement."  Then, when the robot detects a human nearby (a whole nother gigantic, difficult AI problem), it can know not to move faster than X.
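To make the idea of operationalizing "pain" concrete, here is a toy sketch in Python. The tool names and speed limits are invented for illustration and are not from the study; the point is just that "don't cause pain" becomes "don't exceed speed X with implement Y when a human is nearby":

```python
# Invented speed caps (m/s) per grasped implement -- illustrative only
MAX_SAFE_SPEED = {"scalpel": 0.05, "screwdriver": 0.25, "sponge": 1.0}

def safe_speed(tool, requested_speed, human_nearby):
    """Cap the arm's speed when a human is detected nearby."""
    if human_nearby:
        # Unknown tools get a conservative default cap
        return min(requested_speed, MAX_SAFE_SPEED.get(tool, 0.1))
    return requested_speed

print(safe_speed("scalpel", 0.5, human_nearby=True))   # 0.05
print(safe_speed("scalpel", 0.5, human_nearby=False))  # 0.5
```

Detecting the human in the first place is, as noted, a whole other gigantic AI problem; this sketch just assumes a boolean flag for it.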


Friday, August 10, 2012

More vampires in psychology: Signal Detection Theory and The Vampire in the Crowd

The summer is trailing off, my grants are submitted, and my thoughts have turned to my fall courses.

I am teaching the Lab-Requirement-Filling (and therefore desperately sought out) Perception Lab.  This involves a review of Signal Detection Theory.

As an undergrad (and a grad student, and a post doc, and an assistant professor), I always found SDT SOOOOOOO Boring, no doubt thanks to material like this (from the textbook I use for the class):

"Suppose that, for an individual subject, the proportion of hits was 0.70 and the proportion of false alarms was 0.10.  Looking up 0.70 (under "p") in the table, we find a corresponding value of z = 0.52"

--St. James, D. J., Schneider, W., & Eschman, A.  (2005).  PsychMate Student Guide Version 2.0.  Psychology Software Tools, Pittsburgh.
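For what it's worth, that table lookup is a one-liner in modern Python, no z-table required. Here is a quick sketch using only the standard library (the `d_prime` helper is my own name for the standard sensitivity formula, not something from the textbook):

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """Sensitivity d': z(hit rate) minus z(false-alarm rate)."""
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)

# The textbook's example: hits = 0.70, false alarms = 0.10
print(round(NormalDist().inv_cdf(0.70), 2))  # 0.52, the tabled z value
print(round(d_prime(0.70, 0.10), 2))
```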

Even though I tend to think the discussion of SDT in this text is fairly clear and straightforward, it's pretty impossible (especially as a non-expert) to read pages of that and not become completely disenchanted.  This is too bad because, when I stop to think about it, SDT is actually AWESOME, and useful / applicable ALL THE TIME, such as in situations like these:

[ETA:  For full resolution slides, just click.  Your fancy browser will probably open a full res slide show.]




...And in all kinds of less ridiculous but hugely important ventures such as:

  • Testing for diseases
  • Finding tumors in radiology screens
  • Monitoring food in a gross "unwrapped" style food-sembly line for "abnormalities" (read: a mouse in a Snickers)
  • Screening for weapons in baggage x-rays
  • Identifying counterfeit currency

Thus, I decided to present SDT to my students this year in terms of the Vampire in a Crowd parable.

Namely:

Say you go to a crowded nighttime showing of Bela Lugosi's Dracula.  Despite the cheesy makeup and special effects, you become so freaked out by the movie that you are CONVINCED that the other audience members may be vampires.  You get out your stake for the spooky walk to your car after the movie.  When you hear footsteps behind you, you are faced with a problem:  should you stake the person approaching you through the heart or should you ignore them?  Assume that at some point in your journey to your morbidly parked car, each member of the audience approaches you from the shadows of the parking lot.  







It goes on to motivate the ROC in terms of the "consequences" of hits, misses, false alarms, and correct rejections in this scenario (e.g., having your blood sucked, accidentally staking a member of the Brooklyn Tabernacle Choir).
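If you want to make the parable concrete, here is a tiny Python sketch. The encounters are invented for illustration; the point is just how the four SDT outcomes, and the hit and false alarm rates, fall out of the vampire/stake contingencies:

```python
from collections import Counter

def outcome(is_vampire, staked):
    # The four SDT outcomes, in parking-lot terms
    if is_vampire:
        return "hit" if staked else "miss"  # miss: your blood gets sucked
    # false alarm: you stake a member of the Brooklyn Tabernacle Choir
    return "false alarm" if staked else "correct rejection"

# Invented encounters: (was it really a vampire?, did you stake them?)
walk_to_car = [(True, True), (True, False), (False, True),
               (False, False), (False, False), (False, False)]
tally = Counter(outcome(v, s) for v, s in walk_to_car)

hit_rate = tally["hit"] / (tally["hit"] + tally["miss"])
fa_rate = tally["false alarm"] / (tally["false alarm"] + tally["correct rejection"])
print(tally, hit_rate, fa_rate)
```

Sweep your staking threshold from "stake everyone" to "stake no one" and the (fa_rate, hit_rate) pairs trace out the ROC.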

In sum, I am very sad that I don't actually get to give this lecture:  my TAs do all the actual teaching for this course, because it is so big.  Maybe I can do a guest spot!







  

Tuesday, July 31, 2012

Sookie’s Complicated Love Life Or: An Absolute Beginner’s Guide to Connectionist Modeling, Part III: Deep Networks

Here I continue my effort to write an accessible introduction to Parallel Distributed Processing.  In Part I, I introduced units, connections, and connection strengths.  In Part II, I introduced the concept of the activation function, and the difference between linear and saturating (e.g., sigmoidal) activation functions.  Here, in Part III, I briefly describe the difference between shallow and deep networks at a level hopefully suitable for a beginner.  With vampires (and werewolves, etc.)!

5.  Complication:  Deep Networks  At this point, the network is configured to tell Sookie how much danger—broadly categorized—she is in, in any particular romantic configuration.  Say, however, that Sookie wants a more detailed breakdown of the peril that is likely to befall her than simply whether she will be in danger or not.  Perhaps she wants to know, for a given situation, what the likelihoods are that she will 1) have her blood drained 2) be shot by a werewolf 3) be shot by a human hunting the shapeshifter or 4) just be penetratingly chastised by the best friend of one of the humans.  We might imagine that the likelihood of each of these consequences is related to how much danger Sookie is in, in toto.  Thus, we might extend our network to look more like this:



Figure 7 is a very simple example of what we would call a deep network.  It is deep because it has more than 2 layers.  “Deep” networks can be arbitrarily “deep”, that is, while Sookie’s network has 3 layers, there’s no reason it couldn’t have 4, or 5, or 100.  In deep networks, we will still refer to the inputs as inputs and the outputs as outputs, but all non-input/output layers as hidden.  In Sookie’s network, the inputs are still the suitor units, the outputs are now the danger constituent units, and the DANGER! unit is a hidden unit.  In what we read, the input and output units will usually (but not always) be more important to understand than the hidden units.  That is because, unlike in Sookie’s network, the hidden units won’t always refer to easily labeled quantities:  they will simply be a way of mathematically transforming the input into the output.

The advantage of deep networks is that the deeper they get, the more fine-grained problems they can solve—in this example we have recoded simple “DANGER” into some of its constituent parts.  However, deep networks come with a significant disadvantage:  the more layers that are added, the longer they take to train (exponentially so).  This is a consequence of back-propagation, the training algorithm that is typically used to train this kind of network (though there are other algorithms that are faster-- training routines for deep networks are an active area of research in machine learning).

All of the networks we will read about in this class will be deep, to a greater or lesser extent.  However, this shouldn’t trip you up:  deep networks work exactly the same as our original two-layer network.  That is, activation starts in the input layer, and then flows to the second layer.  In Sookie’s case, activation begins with her paranormal suitors and flows to the danger unit.  Activation then flows in exactly the same way to subsequent layers.  Again in Sookie’s case, activation flows from the DANGER! unit into each of the constituent danger units in exactly the same manner that it got to the DANGER! unit in the first place.  The networks we read about may seem a lot more complicated than this due to the way they are described, but in reality activation flowing from one level to another is all that ever happens.
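To make “activation flowing from one level to another” concrete, here is a minimal Python sketch of Sookie’s three-layer network. The suitor weights are the 1-4 dangerousness ranking from Part II; the weights from the hidden DANGER! unit out to the constituent-danger units are invented for illustration:

```python
# Layer 1 -> 2 weights: the dangerousness ranking from Part II
suitor_weights = {"human": 1, "young vampire": 2, "shapeshifter": 3,
                  "werewolf": 3, "old vampire": 4}
# Layer 2 -> 3 weights: DANGER! to each constituent danger (invented values)
constituent_weights = {"blood drained": 0.5, "shot by werewolf": 0.2,
                       "shot by hunter": 0.2, "chastised": 0.1}

def forward(suitor_activations):
    # Layer 1 -> 2: weighted sum flows into the hidden DANGER! unit
    danger = sum(suitor_weights[s] * a for s, a in suitor_activations.items())
    # Layer 2 -> 3: exactly the same operation, repeated for the next layer
    return {c: w * danger for c, w in constituent_weights.items()}

# Pour "1" of activation into the old vampire unit and watch it flow
print(forward({"old vampire": 1}))
```

A 100-layer network is just this same weighted-sum step repeated 99 times.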

Tuesday, June 26, 2012

Sookie’s Complicated Love Life Or: An Absolute Beginner’s Guide to Connectionist Modeling, Part II: The Activation Function

Here I continue my effort to write an accessible introduction to Parallel Distributed Processing.  In Part I, I introduced units, connections, and connection strengths.  Here, in Part II, I introduce the concept of the activation function, and the difference between linear and saturating (e.g., sigmoidal) activation functions.  But with vampires!



3.  The Activation Function  Remember, the goal of this network is to answer questions like “If I date young vampire and old vampire simultaneously, how much danger will I be in?” or “Is it more dangerous to date a human and an old vampire or a young vampire and a shapeshifter?”  In order to answer these questions with mathematical rigor, we need to find a way to put numbers on our units and connections, in order to put a number on DANGER!.

We already used our intuition to come up with some relative connection strengths:  these were reflected in the thicknesses of the connections in Figure 3.  As it turns out, using intuition to come up with connection weights is, historically, the first method that was used to build this type of network.  However, it was quickly discovered that when you have networks that have, say, 100,000 connections, this is not a very practical plan.  What works better for most of the networks we will consider is using a mathematical rule called a training algorithm to set the weights.  You won’t ever have to know much about training rules, other than that they exist and they set weights automatically, without the modeler having to figure out 100,000 (or more!) weights.

We definitely don’t need to think about training rules right now.  Instead, let’s use the intuitive pipe-thicknesses we came up with earlier to assign some numerical weights to the network:

Figure 4

In Figure 4, numbers are assigned to weights based on our intuitions about how dangerous each type of suitor is.  That is, we thought that we would rank these potential suitors as follows in order of dangerousness, from least to most:

1: Human; 2: Young Vampire; 3 (tie): Shapeshifter, Werewolf; 4: Old Vampire

The weights in Figure 4 are simply the numbers from that ranking. 

Now, how do these numbers help us to find numerical solutions to Sookie's DANGER problem?  One way to think of them is as how much the DANGER unit would be activated for each type of suitor, if Sookie poured “1” drop of activation into that suitor’s unit.  So, if in her simulations Sookie activates the “Young Vampire” unit to “1”, the DANGER unit would be activated to “2”. That number is, of course, meaningless by itself, and it only starts to make sense when we consider what would happen if Sookie activated the “Old Vampire” unit instead.  In that situation, based on these weights, the DANGER unit would be activated to “4”.  This reflects the idea that the old vampire is twice as dangerous as the young vampire.

In this example, the activation function for our network is the following:
DANGER activation = Σ over all suitors of (suitor activation × connection weight)


Work out the examples from above for yourself, to convince yourself this is true.  Remember, Sookie is starting out with simple simulations, so she’s only pouring “1” of activation into a single unit to start.

Our activation function permits quantification of some more of our intuitions.  For example, if Sookie enters into only a half-hearted relationship—pours in only “1/2” activation to a suitor—that is probably only half as dangerous as a more serious relationship.  This intuition is reflected by the fact that a weight has to be multiplied by its suitor’s activation before being “sent” to the DANGER unit.  Similarly, the more suitors Sookie engages in a relationship with, the more items will go into the sum, resulting in more activation on the DANGER unit.  In fact, with this formula, we now have the mathematical machinery to answer questions of arbitrary complexity.  For example “How dangerous is it to date 2 humans very seriously (human unit activation “6”) while simultaneously dating a shapeshifter and werewolf half-heartedly (shapeshifter activation “.5” and werewolf activation “.5”) after having revoked young vampire’s invitation to my house (young vampire activation “-1”)?”  Check for yourself!
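If you'd rather not check by hand, here is a minimal Python sketch of the linear activation function, using the weights from Figure 4. It reproduces the simple simulations above and then the "arbitrary complexity" question:

```python
# The dangerousness weights from Figure 4
weights = {"human": 1, "young vampire": 2, "shapeshifter": 3,
           "werewolf": 3, "old vampire": 4}

def danger(activations):
    # The linear activation function: sum of (suitor activation x weight)
    return sum(weights[s] * a for s, a in activations.items())

print(danger({"young vampire": 1}))  # 2
print(danger({"old vampire": 1}))    # 4
# Two serious humans, two half-hearted shifters, one revoked invitation:
print(danger({"human": 6, "shapeshifter": 0.5,
              "werewolf": 0.5, "young vampire": -1}))  # 7.0
```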

Once again, exploring our intuitions about this model also reveals one of its flaws.  Consider the following:  In this model, every time Sookie adds a new suitor (either by increasing the activation in a pre-existing unit or adding some activation in a new unit—say she wants to start dating a necromancer, for some reason), the total amount of predicted DANGER will increase.  However, after a certain threshold of DANGER, Sookie will no longer be in DANGER at all, because she will simply be DEAD.  That is, DANGER is a quantity that saturates.  There’s no realistic situation where Sookie can be in infinite danger.  Mathematically, the problem here is that we have used a linear activation function, instead of a saturating activation function.  That is, the relationship between input (suitor) activation and output (DANGER) activation in our network looks like this:

Figure 5
When in reality it should look like this:

Figure 6
For essentially this very reason (though not usually spelled out so luridly), most of the networks we will study will use the sigmoidal activation shown in Figure 6. 
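Here is a quick sketch contrasting the two kinds of activation function. The sigmoid below is the standard logistic function, which is one common choice of saturating activation (the figures may use a different parameterization), while the linear function is what our network has been using so far:

```python
import math

def linear(x):
    # Unbounded: twice the input means twice the DANGER, forever
    return x

def sigmoid(x):
    # Saturating: output is squashed into (0, 1) no matter the input
    return 1 / (1 + math.exp(-x))

# Linear DANGER grows without bound; sigmoidal DANGER levels off
for total_input in [1, 4, 10, 100]:
    print(total_input, linear(total_input), round(sigmoid(total_input), 3))
```

However many necromancers Sookie adds, the sigmoidal output never exceeds 1: there is no realistic situation of infinite danger.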

4.  Interim Summary At this point, in a very serious sense, we have learned everything there ever will be to know about connectionist networks.  Namely, there will be units.  They will be connected.  The connections will differ in their weights, and most of the time weights will be determined by a learning algorithm.  Information will flow between units at different levels as numerically specified by the activation function, which will usually saturate.  Your first step in understanding any model that we read about will be to identify what kind of units there are, and how they are connected.  If you can do that, you will be in good shape. 

However, the models we read about will be substantially more complicated than this one.  Next, I will talk about two of the most common complications we will encounter:  deep networks and networks with distributed representation. 






Tuesday, June 19, 2012

The summer of grant writing...

... is good for $$ (well, at least in theory), bad for blog posts.  Here's what I've been up to:

Consider the following scenario:  two children are identified in school as having a specific reading impairment.  This is a common scenario in the United States, where between 17.5% and 36% of elementary-aged children meet the broad diagnostic criteria for dyslexia defined in the DSM-IV (APA, 1994; Shaywitz, Shaywitz, Fletcher, & Escobar, 1990; Perie, Grigg, & Donahue, 2005), which specifies, essentially, that children can be diagnosed with dyslexia so long as they have difficulty reading in the absence of other apparent learning disabilities.  One child may have difficulty reading as a result of a perceptual impairment that prevents effective processing of orthography, while the other child may have a delayed or otherwise incomplete phonological awareness, preventing the critical link between orthography and semantics that is often posited to be necessary in learning to read (see Harm & Seidenberg, 2004, for review).  Effective treatments for these two children would therefore be different.

In order to discover the separate loci of impairment in these two children, under a common mode of current practice, each would need to complete a battery of behavioral assessments, designed to assess ability in different sub-domains of reading—such as the perceptual and phonological impairments in fact present.  Even supposing that a satisfactorily nuanced cognitive assessment is provided, production demands may obscure the diagnostic relevance of any particular behavioral test:  it is well documented in the behavioral literature that production demands can substantially alter the estimated reading ability of a child, unrelated to that child’s actual expertise in a particular domain (see Stanovich, Cunningham, & Cramer, 1994, for a particularly striking example of this phenomenon in the case of phonological awareness).

What if instead there were a measure that could inform perceptual, phonological, and semantic processing as they unfold in real time, in a single record from a single test?  A measure that could be collected in the absence of overt responses, in order to measure process-pure underlying expertise, free from contamination by response demands?  Such a measure exists!  The Event-Related Potential (ERP) technique provides millisecond-level temporal resolution, is decomposable into well-studied, functionally specific components, and can be collected in the absence of an overt task.  Recent data in the ERP literature underscore ERPs as a tool that can probe language ability in the absence of overt responses; Parise & Csibra (2012) have demonstrated semantic priming effects in 9-month-old infants through analysis of the N400 ERP component—illustrating receptive vocabulary ability in children who certainly cannot respond verbally to behavioral assessments.  This study reiterates in infants what is already known even in the adult bilingualism literature, where productive competency in overt tasks has been shown to be dissociated from receptive competency as revealed by implicit tasks in combination with ERPs (Tokowicz & MacWhinney, 2005).

Not only can ERPs reveal process-pure receptive abilities, they can do so for multiple sub-components of reading in a single record—even in children with specific reading impairment.  Any ERP recorded in response to an orthographic stimulus will display components that represent automatic processing at the perceptual, phonological, and semantic levels of representation, components known to change with development and in specific reading impairment.  Specifically, the visual N1, which peaks between ~100-150 ms post stimulus onset, is known to represent processing of the onset of a visual stimulus (e.g., Luck, Heinze, Mangun, & Hillyard, 1990), and has been shown to be modulated differentially in children with and without a diagnosis of dyslexia (Araujo, Bramao, Faisca, Petersson, & Reis, 2012)—consistent with subtypes of dyslexia that result from perceptual impairment.  Moving forward in time, the N250 component has been hypothesized to reflect the grouping of orthographic or phonological features into more complex representations (see Grainger & Holcomb, 2009, for review), and has a delayed peak in children with reading disorders—consistent with subtypes of dyslexia that result from phonological impairment.  Finally, the N400 component is a known marker of attempted lexical-semantic access (see Federmeier & Laszlo, 2009, for review), and has been shown to still be developing towards its adult functionality in our target demographic (e.g., Coch et al., 2002).

The ERP technique provides for measurement of process-pure ability in perceptual analysis, phonological recoding, and semantic access in a single record.  Of course, ERPs are more expensive and invasive than traditional behavioral measurements, but the advantages that they provide warrant exploration of their use in clinical assessment.  This research program will achieve such an exploration.   

Saturday, April 28, 2012

Limitless.. suffocation

It's true, he's dreamy... even if his character
in this movie is a huge douchebag who
obtains his mental prowess via scientifically
questionable means.
I watched Limitless tonight.  That's that Bradley Cooper movie where Bradley Cooper starts taking a pill that gives him phenomenal cosmic brain powers.  The conceit of the movie is that the pill works by giving access to the 60% of his brain that normally he wouldn't be able to "access."  This is an old trope-- that humans can only "access" some small percentage of our brains.  In some sense this is true-- we can't consciously think many parts of our brains into performing their functions more or less, better or worse.  But, let me assure you (lest you are sad that you have to carry all that useless brain around with you), our lack of control over our lower brain is fairly critical to keeping us alive.  Once Bradley popped that pill and we were treated to some dramatic renderings of neurons and structural MRIs, I thought "why would anyone WANT to access the other 60% of their brain?" (not that "60" is a scientifically derived number; that's just what they use in the film).

Look, if all of a sudden you started using your midbrain to learn Cantonese and game the stock market (as Bradley Cooper does in the movie), you wouldn't have much time to enjoy your ill-gotten gains, because it would be a race between suffocation, heart failure, and falling over and cracking open your skull on the nearest inconveniently placed hard object to determine what would kill you first (plus, probably, other harder-to-foresee, miserable, more or less instantaneous ways to die).  You need your lower, "inaccessible" brain parts to keep you breathing, keep your heart pumping, keep your muscles working!  It's not like "60%" of your brain is just sitting idle while your prefrontal cortex mans the ship.

Respect the midbrain, people!