1 17-NeuralNetworks


https://binds.cs.umass.edu/BrainPower/purplebrain.html

1.1 Opening thoughts

“We shall envision the mind (or brain) as composed of many partially autonomous “agents”,
as a “Society” of smaller minds.
Each sub-society of mind must have its own internal epistemology and phenomenology,
with most details private,
not only from the central processes,
but from one another.”
(Minsky, K-Lines; 1980)

Lesson in neuronal politics:
Strong local/individual policies have many strengths:
sustainable, realistic, flexible, robust, and fault-tolerant.

At the end of this section you should be able to:

2 Brains

2.1 Neurons

Real neurons
17-NeuralNetworks/complete_neuron.png

2.2 Connections

Pre- and Post- synaptic
17-NeuralNetworks/Neuroplasticity-Intact-Brain-Figure81.jpg

2.3 Signals

Action potentials
17-NeuralNetworks/spikes.png
A 4-second recording of neural activity from 30 neurons (rows)
in the visual cortex of a monkey.
Each vertical bar indicates a spike.
The human brain can recognize a face within 150 ms,
which corresponds to less than 3 mm in this diagram;
too few spikes occur in that span to read out changes in firing frequency,
so neurons have to rely on information carried by solitary spikes.
How many neurons, or “hops”, does it take until recognition occurs?

Neurons spike to “think” (mostly)
17-NeuralNetworks/image33.png
Neurons are the primary basis of human/animal thinking, learning, consciousness, etc.

Synapses: inter-neuron signaling / learning
17-NeuralNetworks/synapse_schematic.png
The rate-limiting step is transmission between neurons.
Learning is mostly rooted in the synapses.
Neurons change their reactivity and “weights” to learn.

2.4 Diversity

Diversity of neuron types
17-NeuralNetworks/neurons0.jpg
“What magical trick makes us intelligent?
The trick is that there is no trick.
The power of intelligence stems from our vast diversity (and size),
not from any single, perfect principle.”
(Marvin Minsky, Society of Mind; 1987)

Diversity of neuron types cont…
17-NeuralNetworks/variability.png
Network structure varies on a macro scale.

2.5 Levels

Level of abstraction
17-NeuralNetworks/levels.png
Which level of abstraction to model?

Discuss: Cortical columns as an expandable, general-purpose module.

2.6 Scale

Neurons are slow and fairly small:
17-NeuralNetworks/scale.png
Compared to computers at least…

2.7 vs. Computers

Brains vs. Computers
17-NeuralNetworks/comparison.png

         | processing elements | element size | speed   | computation           | robust | learns   | intelligent, conscious
Brain    | 10^14 synapses      | 10^-6 m      | 100 Hz  | parallel, distributed | yes    | yes      | usually…
Computer | 10^8 transistors    | 10^-6 m      | 10^9 Hz | serial, centralized   | no     | a little | debatably yes

2.7.1 Brains vs. Computers: Robustness

* Performance degrades gracefully under partial damage.
  In contrast, most programs and engineered systems are brittle:
  remove some arbitrary parts and very likely the whole will cease to function.
* The brain reorganizes itself from experience.
* It performs massively parallel computations extremely efficiently.
  For example, complex visual perception occurs within less than 30 ms,
  that is, potentially only ~10 processing steps!
* The brain is flexible, and can adjust to new environments.
* It can tolerate (well) information that is fuzzy,
  probabilistic, noisy, or inconsistent.
* The brain is very energy efficient.

2.7.2 Brains vs. Computers: function

Traditional computing excels in many areas, but not in others.
A funny definition:
AI is the development of algorithms or paradigms that require machines to perform cognitive tasks at which humans are currently better.
Symbolic rules don’t reflect the processes actually used by humans.

2.8 Computation

Types of computation
17-NeuralNetworks/computation.png
Neural networks can be universal general-purpose computers,
and in some application-specific hardware instances have been argued to exceed Turing-machine computation.

2.8.1 Types of computation

3 Neural networks

3.0.1 (Artificial) Neural networks

3.0.2 Domains studying NNs

3.1 Applications

3.2 Benefits of neural networks

3.3 Models of neurons

3.3.1 One basic neuron model

17-NeuralNetworks/model.png
Neuron operations:
1. Sum (inputs x weights)
2. Apply activation function
3. Transmit signal
17-NeuralNetworks/model2.png
* Often a bias \(\theta\) can be applied/learned
17-NeuralNetworks/haykin2.png
17-NeuralNetworks/haykin00.png
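
As a concrete sketch of the three operations above, a minimal Python neuron (the names, step activation, and example weights are illustrative assumptions, not taken from the figures):

import numpy as np

def neuron(x, w, theta=0.0):
    """One basic neuron model."""
    v = np.dot(w, x) - theta      # 1. sum (inputs x weights), minus a bias theta
    y = 1.0 if v >= 0 else 0.0    # 2. apply activation function (a step, here)
    return y                      # 3. transmit the signal

# With equal weights and theta = 0.5, this neuron computes logical OR:
print(neuron(np.array([0, 1]), np.array([1.0, 1.0]), theta=0.5))   # -> 1.0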

3.4 Activation functions

Many types:
17-NeuralNetworks/haykin3.png

Top of the figure above
17-NeuralNetworks/haykin0.png

Middle of the figure above
17-NeuralNetworks/haykin4.png

Bottom of the figure above
17-NeuralNetworks/haykin5.png
Note: \(\exp(x)\) is \(e^x\)
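
Since the curves themselves live in the figures, here is a Python sketch of common choices (the classic threshold, piecewise-linear, and sigmoid trio, plus tanh); the exact parameterizations in the figures may differ:

import numpy as np

def threshold(v):                 # hard limiter: 1 if v >= 0, else 0
    return np.where(v >= 0, 1.0, 0.0)

def piecewise_linear(v):          # linear in [-1/2, +1/2], saturating outside
    return np.clip(v + 0.5, 0.0, 1.0)

def sigmoid(v, a=1.0):            # logistic: 1 / (1 + exp(-a v)); a sets the slope
    return 1.0 / (1.0 + np.exp(-a * v))

def tanh_act(v):                  # hyperbolic tangent: outputs in (-1, +1)
    return np.tanh(v)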

3.5 Stochasticity

3.5.1 Alternative: Probability-based firing

17-NeuralNetworks/haykin7.png
17-NeuralNetworks/haykin8.png
\(T\) is a pseudo-temperature used to control the noise level (uncertainty)
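
A sketch of this in Python, using the standard form (as in Haykin) where the unit fires (+1) with probability \(P(v) = 1/(1+\exp(-v/T))\) and otherwise stays at -1:

import numpy as np

rng = np.random.default_rng(0)

def stochastic_fire(v, T=1.0):
    """Fire (+1) with probability P(v) = 1/(1 + exp(-v/T)), else output -1.
    As T -> 0 this approaches the deterministic threshold unit;
    larger T means more noise (uncertainty)."""
    p = 1.0 / (1.0 + np.exp(-v / T))
    return 1.0 if rng.random() < p else -1.0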

3.6 Signal flow

Signal flow diagram
17-NeuralNetworks/haykin9.png

Graph structure

3.6.1 Architectural graphs and recurrence

17-NeuralNetworks/haykin10.png

Single layer network
17-NeuralNetworks/haykin11.png

3.6.2 Multi-layer feed forward fully connected

17-NeuralNetworks/haykin12.png
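
A minimal sketch of one forward pass through such a network in Python (layer sizes, random weights, and sigmoid activations are assumptions for illustration):

import numpy as np

def forward(x, weights, biases):
    """Forward pass through a fully connected feed-forward network.
    weights[l] has shape (n_out, n_in) for layer l; sigmoid throughout."""
    a = x
    for W, b in zip(weights, biases):
        a = 1.0 / (1.0 + np.exp(-(W @ a + b)))   # next layer's activations
    return a

# Example: 2 inputs -> 3 hidden -> 1 output
rng = np.random.default_rng(1)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
biases = [np.zeros(3), np.zeros(1)]
print(forward(np.array([0.5, -0.2]), weights, biases))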

Recurrent network with no self feedback
17-NeuralNetworks/haykin13.png

3.6.3 Recurrent network with hidden neurons

17-NeuralNetworks/haykin14.png

Knowledge representation?
newsgroup example
17-NeuralNetworks/neural-net-user-action-aispace.png

3.6.3.0.1 Knowledge? distributed / learned

Knowledge refers to stored information used to interpret,
predict, or respond to the outside world.
In a neural network:
* Similar inputs should elicit similar activations/representations in the network
* The inverse: dissimilar items should be represented very differently
* Important features should end up dominating the network
* Prior information can be built into the network, though it is not required, e.g., receptive fields

3.6.4 Receptive fields: What is different here?

17-NeuralNetworks/haykin15.png

3.6.5 Learning

3.6.5.1 Learning in NN:

Learning is a process by which the free parameters
(synaptic weights) of the network are adapted,
through a process of stimulation/activation,
by the environment in which the network is embedded.

The type of learning is determined by the way the parameters are changed:
Supervised (with sub-types),
Unsupervised (with sub-types), and
Reinforcement learning.

A well-defined set of rules for updating the weights is called a learning algorithm.

The mapping from environment to network to task is often called the learning paradigm.

3.6.5.2 Unsupervised

Unsupervised learning
17-NeuralNetworks/haykin20.png
E.g., clustering, auto-associative, Hebbian, etc.

3.6.5.2.1 Hebbian learning

https://en.wikipedia.org/wiki/Hebbian_theory

Neurons that “Fire together, wire together”.

Recall our basic neuron:
17-NeuralNetworks/haykin2.png
With pre-synaptic inputs \(x_i\),
and post-synaptic output \(y\).

\(\Delta w_i = \eta x_i y\)
the change in the \(i\)th synaptic weight \(w_i\)
is equal to:
a learning rate \(\eta\),
times the \(i\)th input \(x_i\),
times the postsynaptic response \(y\).

Weights updated after every training example.

Variants of this are very successful at clustering problems,
and can provably perform ICA, PCA, etc.
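
A minimal sketch of this rule in Python, assuming a linear neuron for the postsynaptic response:

import numpy as np

def hebbian_update(w, x, eta=0.1):
    """Plain Hebbian rule: Delta w_i = eta * x_i * y, applied after each example."""
    y = np.dot(w, x)             # postsynaptic response (linear neuron assumed)
    return w + eta * x * y       # inputs that fire with the output get stronger

w = np.array([0.1, 0.1])
for _ in range(5):               # repeatedly present a co-active input pair
    w = hebbian_update(w, np.array([1.0, 1.0]))
print(w)                         # weights grow: "fire together, wire together"

Note that the plain rule is unstable (weights grow without bound); normalized variants such as Oja’s rule tame this, and it is such variants that provably perform PCA.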

3.6.5.3 Associative learning

17-NeuralNetworks/assoc.png
Hebbian-like rule:
17-NeuralNetworks/assoc2.png
17-NeuralNetworks/assoc3.png
First comes the learning phase, during which co-occurring activations become more strongly linked.
After learning, a partial activation will produce activity in the neurons most closely linked to it.
Thus, during this phase a partial, noisy, degraded image can produce activations like those of the full image.

This is probably the best way to see parts of cortical function!
Much of learning is driven by coincidence detection with massive cross-referencing.
During recognition, behavior is thus like the associative example above, in time and space.
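
A Hopfield-style sketch of this in Python (an assumed minimal formulation, not necessarily the exact rule in the figures): store a pattern via Hebbian outer products, then recall it from a degraded version:

import numpy as np

def store(patterns):
    """Hebbian outer-product storage; patterns are +/-1 vectors."""
    n = patterns[0].size
    W = sum(np.outer(p, p) for p in patterns).astype(float)
    np.fill_diagonal(W, 0.0)            # no self-feedback
    return W / n

def recall(W, x, steps=10):
    """Iteratively settle: a partial/noisy input moves toward a stored pattern."""
    for _ in range(steps):
        x = np.sign(W @ x)
        x[x == 0] = 1
    return x

p = np.array([1, -1, 1, -1, 1, -1])
W = store([p])
noisy = p.copy()
noisy[0] = -1                            # degrade one element
print(np.array_equal(recall(W, noisy), p))   # True: full pattern recovered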

3.6.5.4 Clustering

See upcoming spiking networks lectures.

3.7 Credit

3.7.1 Credit assignment problem

Structural: Which weights need changing due to good/bad outcome?

Temporal: Which preceding internal decisions resulted in the delayed reward?

3.8 Supervised

One way of learning with a teacher (an involved, micromanaging teacher).
17-NeuralNetworks/haykin18.png
Supervised learning:
attempts to minimize the error between the actual outputs,
i.e., the activation at the output layer,
and the desired or “target” activation,
by changing the values of the weights.

3.8.1 Competitive

Competitive learning
17-NeuralNetworks/haykin17.png
Winner-take-all weight updates
(inhibition of lateral neighbors).
Similar to functions in the retina.
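
A minimal winner-take-all update in Python (a standard formulation, assumed here for illustration):

import numpy as np

def competitive_update(W, x, eta=0.1):
    """Only the neuron whose weights best match x learns; the argmax plays
    the role of lateral inhibition, silencing all other neurons."""
    winner = np.argmax(W @ x)            # most strongly activated neuron wins
    W[winner] += eta * (x - W[winner])   # move its weights toward the input
    return W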

3.8.2 Error Corr.

How much should we change each weight?
In proportion to its influence on the error.
The bigger the influence of weight \(w_i\),
the greater the reduction of error that can be induced by changing it.

3.8.2.1 Basic error correction learning

For a given learning instance, input \(n\):
17-NeuralNetworks/haykin16.png

Error:
17-NeuralNetworks/ec1.png
Minimize this error:
17-NeuralNetworks/ec2.png
Update via:
17-NeuralNetworks/ec3.png
17-NeuralNetworks/ec4.png
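
For reference (the equations themselves are in the images above), the standard error-correction (delta rule) form, as in Haykin, for output neuron \(k\) with input \(x_j(n)\), desired response \(d_k(n)\), and actual output \(y_k(n)\):

\(e_k(n) = d_k(n) - y_k(n)\) (error)

\(\mathcal{E}(n) = \frac{1}{2} e_k^2(n)\) (cost to minimize)

\(\Delta w_{kj}(n) = \eta\, e_k(n)\, x_j(n)\) (update)

\(w_{kj}(n+1) = w_{kj}(n) + \Delta w_{kj}(n)\)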

3.8.2.2 Example

After a hypothetical learning example,
below are some possible resulting weights and biases (\(\theta\)).
17-NeuralNetworks/xor3.png
Easy for a linear single-layer network,
with 2 neurons and a bias,
with step activation.

XOR
17-NeuralNetworks/xor.png
Problem:
Requires a hidden layer (for non-linearity)

3.8.3 Multi-layer

Solution: N-layer network
17-NeuralNetworks/layers.png
Such a network can approximate any non-linear function.

XOR
17-NeuralNetworks/xor2.png
Separation into 3D via the hidden layer allows solving XOR.
Problem:
How to solve for errors in the hidden layer?

Neural network for traveling example
17-NeuralNetworks/neural-net-holiday-aispace.png
Given an input example \(e\), what is the output prediction?

3.8.4 Error

Error gradients
17-NeuralNetworks/errorVis.png
Top left: original samples;
Top right: network approximation;
Bottom left: true function which generated samples;
Bottom right: raw error

Error gradients: simple
17-NeuralNetworks/errorVis0.png
* Error (vertical) as function of 2 weights (\(x_1\) and \(x_2\))

Error
* How much should we change each weight?
* In proportion to its influence on the error.
* The bigger the influence of weight \(w_m\), the greater the reduction of error that can be induced by changing it.
* This influence wouldn’t be the same everywhere: changing any particular weight will generally make all the others more or less influential on the error, including the weight we have changed.

3.8.5 Backprop

Solution: Error backpropagation

3.8.5.1 Step 1: Propagation:

Each propagation involves the following:

Forward propagation of a training pattern’s input through the neural network,
in order to generate the propagation’s output activations.

Backward propagation of the output activations through the neural network,
using the training pattern’s target,
in order to generate the deltas of all output and hidden neurons
(a delta measures how far a neuron’s activation is from what it should have been).

3.8.5.2 Step 2: Weight update:

For each weight-synapse do the following:
* Multiply its output delta, and input activation, to get the gradient of the weight.
* Subtract a ratio (percentage) of the gradient from the weight.

Learning rate
The ratio (percentage) influences the speed and quality of learning;
it is called the learning rate.
The greater the ratio, the faster the neuron trains;
the lower the ratio,
the more accurate the training is.
The sign of a weight’s gradient indicates the direction in which the error increases;
this is why the weight must be updated in the opposite direction.

Finally:
Repeat steps 1 and 2 until the performance of the network is satisfactory.

3.9 Learning rate

17-NeuralNetworks/mularge.png
Learning rate is too large

17-NeuralNetworks/musmall.png
Learning rate is too small

Solution:
Error backpropagation overview and basic idea:

initialize network weights (often small random values)
do
    for each training example ex
        prediction = neural-net-output(network, ex)   // forward pass
        actual = teacher-output(ex)
        compute the error \(\Delta = (prediction - actual)\) at the output units
        starting with the output layer, repeat until the input layer:
            propagate the \(\Delta\) values back to the previous layer
            update the network weights between the two layers
until all examples are classified correctly or another stopping criterion is satisfied
return the network
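
A runnable Python sketch of the pseudocode above, assuming one hidden layer, sigmoid activations, and squared error, trained on XOR (all names, sizes, and constants are illustrative):

import numpy as np

rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)    # XOR targets

W1 = rng.normal(scale=0.5, size=(2, 4)); b1 = np.zeros(4)   # input -> hidden
W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros(1)   # hidden -> output
eta = 0.5                                                   # learning rate

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

for epoch in range(10000):
    H = sigmoid(X @ W1 + b1)              # forward pass: hidden activations
    Y = sigmoid(H @ W2 + b2)              # forward pass: output activations
    dY = (Y - T) * Y * (1 - Y)            # output deltas (sigmoid derivative)
    dH = (dY @ W2.T) * H * (1 - H)        # deltas propagated back to hidden
    W2 -= eta * H.T @ dY; b2 -= eta * dY.sum(axis=0)   # weight update =
    W1 -= eta * X.T @ dH; b1 -= eta * dH.sum(axis=0)   # delta x input activation

print(Y.round(2))   # typically close to [[0], [1], [1], [0]]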

Backprop (from ArtInt)
17-NeuralNetworks/algo.png
This approach assumes:
\(n\) input features,
\(k\) output features, and
\(nh\) hidden units.

Both \(hw\) and \(ow\) are two-dimensional arrays of weights.

Note that
\(0:nk\) means the index ranges from \(0\) to \(nk\) (inclusive), and
\(1:nk\) means the index ranges from \(1\) to \(nk\) (inclusive).

This algorithm assumes that \(val(e,X_0)=1\) for all \(e\).

Backprop (from AIMA)
Neural network for traveling example
17-NeuralNetworks/neural-net-holiday-aispace.png
One hidden layer containing two units,
trained on the travel data, can fit it perfectly.
One run of back-propagation with learning rate \(\eta = 0.05\), taking 10,000 steps,
gave weights that accurately predicted the training data.

Comparison: digit recognition

                               | 3-NN | 300 Hidden NN | LeNet | Boosted LeNet | SVM  | Virtual SVM | Shape match
Error rate (%)                 | 2.4  | 1.6           | 0.9   | 0.7           | 1.1  | 0.56        | 0.63
Run time (ms/digit)            | 1000 | 10            | 30    | 50            | 2000 | 200         |
Memory requirement (MB)        | 12   | 0.49          | 0.012 | 0.21          | 11   |             |
Training time (days)           | 0    | 7             | 14    | 30            | 10   |             |
% rejected to reach 0.5% error | 8.1  | 3.2           | 1.8   | 0.5           | 1.8  |             |

Prediction!
17-NeuralNetworks/train.png
Neural networks can predict complex time series, e.g., prices, economies, etc.

Prediction!
17-NeuralNetworks/trainin.png
Input can be given by experts via intervention indicators

Prediction!
17-NeuralNetworks/trainset.png
Training via a shifting window
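
A minimal sketch of the shifting-window idea in Python (function and parameter names are assumptions): each window of past values becomes an input, and the next value becomes its target:

import numpy as np

def windows(series, width):
    """Turn a 1-D time series into (past window, next value) training pairs."""
    X = np.array([series[i:i + width] for i in range(len(series) - width)])
    y = np.array(series[width:])
    return X, y

X, y = windows([1, 2, 3, 4, 5, 6], width=3)
# X = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]   y = [4, 5, 6]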

Prediction!
17-NeuralNetworks/rozdel.png
As with other methods, separate training, validation, and testing sets help.

3.10 Reinforcement

Another way of learning with a teacher (a negligent, rarely-there teacher).
17-NeuralNetworks/haykin19.png
Temporal credit assignment problem.
More to come with spiking networks.

3.11 Overfitting

17-NeuralNetworks/overfit.png
* Over-fitting impedes generalization

Regularization
17-NeuralNetworks/regularization00.png
A straight line might be an under-fit to these data points.

17-NeuralNetworks/regularization000.png
Left: a 10th-order fit might be an over-fit.
Right: the true function from which the data were sampled.

Regularization
17-NeuralNetworks/regularization0.png
\(\lambda\) is a constant that penalizes higher-order terms (larger weights) in the error calculation.
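
The figure’s exact formula isn’t reproduced here; a standard weight-decay form, consistent with the Bayesian notation below, is:

\(\mathcal{E}(\textbf{w}) = \frac{1}{2}\sum_u \big(y^u - f(x^u; \textbf{w})\big)^2 + \frac{\lambda}{2}\|\textbf{w}\|^2\)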

Regularization:
too little or too much
17-NeuralNetworks/regularization1.png
* dotted = train, solid = test
* y = error, x = \(\lambda\): either too little or too much regularization is worse, with a happy medium in between.

Regularization: Bayesian
17-NeuralNetworks/regularization2.png
* Pre-specify your hypothesis about \(\lambda\)
* Left: \(\lambda = 1000\)
* Right: \(\lambda = 1\)

Regularization: Bayesian
17-NeuralNetworks/regularization3.png
* \(p(\textbf{w}|\lambda, H) \propto \exp[-\frac{\lambda}{2}\|\textbf{w}\|^2]\)
* \(p(\textbf{w}|D, \lambda, H) = \frac{ p(D|\textbf{w}, H)\, p(\textbf{w}|\lambda, H)}{p(D|\lambda, H)}\) such that \(D\) are the data
* \(p(D|\textbf{w}, H) \propto \prod_u \exp[-\frac{1}{2}(y^u-f(x^u; \textbf{w}))^2]\)
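
Taking negative logarithms ties the two views together: maximizing this posterior is exactly minimizing the \(\lambda\)-penalized error above:

\(-\log p(\textbf{w}|D, \lambda, H) = \frac{1}{2}\sum_u \big(y^u - f(x^u; \textbf{w})\big)^2 + \frac{\lambda}{2}\|\textbf{w}\|^2 + \text{const}\)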