https://binds.cs.umass.edu/BrainPower/purplebrain.html
“We shall envision the mind (or brain) as composed of many partially
autonomous “agents”,
as a “Society” of smaller minds.
Each sub-society of mind must have its own internal epistemology and
phenomenology,
with most details private,
not only from the central processes,
but from one another.”
(Minsky, K-Lines; 1980)
Lesson in neuronal politics:
Strong local/individual policies have many strengths:
sustainable, realistic, flexible, robust, and fault-tolerant.
At the end of this section you should be familiar with:
Real neurons
Pre- and Post- synaptic
Action potentials
A 4 second recording of neural activity,
recording from 30 neurons (rows) of the visual cortex of a monkey.
Each vertical bar indicates a spike.
The human brain can recognize a face within 150 ms,
which corresponds to less than 3 mm in this diagram;
no dramatic changes in firing frequency can occur in this time span,
so neurons have to rely on information carried by solitary spikes.
How many neurons, or “hops”, does it take until recognition occurs?
Neurons spike to “think” (mostly)
Neurons are the primary basis of human/animal thinking, learning,
consciousness, etc.
Synapses: inter-neuron signaling / learning
The rate-limiting step is transmission between neurons.
Learning is mostly rooted in the synapses.
Neurons change their reactivity and “weights” to learn.
Diversity of neuron types
“What magical trick makes us intelligent?
The trick is that there is no trick.
The power of intelligence stems from our vast diversity (and
size),
not from any single, perfect principle.”
(Marvin Minsky, Society of Mind; 1987)
Diversity of neuron types cont…
Network structure varies on a macro scale.
Level of abstraction
Which level of abstraction to model?
Discuss: Cortical columns as an expandable, general-purpose module.
Neurons are slow and fairly small:
Compared to computers at least…
Brains vs. Computers
 | processing elements | element size | speed | computation | robust | learns | intelligent, conscious |
---|---|---|---|---|---|---|---|
Brain | 10^14 synapses | 10^-6 m | 100 Hz | parallel, distributed | yes | yes | usually… |
Computer | 10^8 transistors | 10^-6 m | 10^9 Hz | serial, centralized | no | a little | debatably yes |
Performance degrades gracefully under partial damage.
In contrast, most programs and engineered systems are brittle:
remove some arbitrary parts,
and the whole will very likely cease to function.
Brain reorganizes itself from experience.
It performs massively parallel computations extremely efficiently.
For example, complex visual perception occurs within less than 30
ms,
that is, potentially 10 processing steps!
Brain is flexible, and can adjust to new environments.
Can tolerate (well) information that is fuzzy,
probabilistic, noisy, or inconsistent.
Brain is very energy efficient.
Traditional computing excels in many areas, but not in others.
A funny definition:
AI is the development of algorithms or paradigms that require
machines to perform cognitive tasks at which humans are currently
better.
Symbolic rules don’t reflect processes actually used by humans.
Types of computation
Neural networks can be universal general-purpose computers,
and in some application-specific hardware instances can do better than
Turing machines.
Models of neurons
Neuron operations:
1. Sum (inputs x weights)
2. Apply activation function
3. Transmit signal
* Often a bias \(\theta\) can be
applied/learned
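A minimal sketch of these three operations in Python (the function name, example weights, and the step/sigmoid choices are illustrative assumptions, not from the slides):

```python
import math

def neuron(inputs, weights, theta=0.0, activation=None):
    """One artificial neuron: weighted sum, bias, activation."""
    v = sum(x * w for x, w in zip(inputs, weights)) + theta      # 1. sum(inputs x weights), plus bias
    phi = activation or (lambda v: 1.0 / (1.0 + math.exp(-v)))   # 2. apply activation (default: sigmoid)
    return phi(v)                                                # 3. transmit the signal

# Example: two inputs with a hard step activation
print(neuron([1.0, 0.5], [0.8, -0.4], theta=-0.1,
             activation=lambda v: 1 if v >= 0 else 0))
```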
Many types (the figure shows three common activation functions, top to bottom).
Note: \(\exp(x)\) is \(e^x\).
\(T\) is a pseudo-temperature used to control the noise level (uncertainty).
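Assuming the standard sigmoid family is the one plotted, the pseudo-temperature enters as

\(\varphi(v) = \frac{1}{1 + \exp(-v/T)}\)

so larger \(T\) flattens the curve (more uncertainty), while \(T \to 0\) approaches a hard step.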
Signal flow diagram
Graph structure
Single layer network
Recurrent network with no self feedback
Knowledge representation?
newsgroup example
Knowledge refers to stored information used to interpret,
predict, or respond to the outside world.
In a neural network:
* Similar inputs should elicit similar activations/representations in
the network
* The inverse: dissimilar items should be represented very
differently
* Important features should end up dominating the network
* Prior information can be built into the network, though it is not
required, e.g., receptive fields
Learning is a process by which the free
parameters,
synaptic weights, of the network are adapted,
through a process of stimulation/activation,
by the environment in which the network is embedded.
The type of learning is determined by the ways the parameters are
changed:
Supervised (with sub-types),
Unsupervised (with sub-types), and
Reinforcement learning.
A set of well-defined rules for updating weights is defined as a learning algorithm.
The mapping from environment to network to task is often called the learning paradigm.
Unsupervised learning
E.g., clustering, auto-associative, Hebbian, etc
https://en.wikipedia.org/wiki/Hebbian_theory
Neurons that “Fire together, wire together”.
Recall our basic neuron:
With pre-synaptic inputs \(x_i\),
and post-synaptic output \(y\).
\(\Delta w_i = \eta x_i y\)
the change in the \(i\)th synaptic weight \(w_i\),
is equal to:
a learning rate \(\eta\),
times the \(i\)th input \(x_i\),
times the post-synaptic response \(y\).
Weights updated after every training example.
Variants of this are very successful at clustering problems,
and can provably perform ICA, PCA, etc.
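A minimal sketch of this plain Hebbian update in Python (array shapes and the learning rate are illustrative; practical variants such as Oja's rule add a decay term to keep weights bounded):

```python
import numpy as np

def hebbian_update(w, x, eta=0.01):
    """Plain Hebb rule: delta w_i = eta * x_i * y, applied after each example."""
    y = float(np.dot(w, x))   # post-synaptic response to this input
    w += eta * x * y          # inputs that co-activate with y get stronger weights
    return w

# Applied after every training example, per the rule above
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=3)   # small random initial weights
for x in [np.array([1.0, 0.0, 1.0]), np.array([1.0, 1.0, 0.0])]:
    w = hebbian_update(w, x)
```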
Hebbian-like rule:
First comes the learning phase, when like activations become more
strongly linked.
After learning, a partial activation will produce activity in the neurons
most closely linked to it.
Thus, during this phase a partial, noisy, or degraded image can produce
activations like those of the full image.
This is probably the best way to view parts of cortical
function!
Much of learning is driven by coincidence detection with massive
cross-referencing.
During recognition, behavior is thus like the associative example above,
in time and space.
See upcoming spiking networks lectures.
The credit assignment problem has two facets:
Structural: which weights need changing due to a good/bad outcome?
Temporal: which preceding internal decisions resulted in the delayed reward?
One way of learning with a teacher (an involved, micromanaging
teacher).
Supervised learning:
attempts to minimize the error between the actual outputs,
i.e., the activation at the output layer,
and the desired or “target” activation,
by changing the values of the weights.
Competitive learning
Winner-takes all based weight updates
(inhibition of lateral neighbors).
Similar to functions in the retina.
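A winner-take-all sketch in Python (assuming the common update \(\Delta \textbf{w}_{winner} = \eta(\textbf{x} - \textbf{w}_{winner})\); the data and sizes are illustrative):

```python
import numpy as np

def competitive_update(W, x, eta=0.1):
    """Winner-take-all: only the most active unit updates its weights."""
    winner = int(np.argmax(W @ x))       # strongest-responding unit wins
    W[winner] += eta * (x - W[winner])   # winner moves toward the input; inhibited neighbors stay put
    return W

# Three competing units clustering 2-D inputs
W = np.random.default_rng(1).normal(0.0, 0.1, size=(3, 2))
for x in [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.9, 0.1])]:
    W = competitive_update(W, x)
```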
How much should we change each weight?
In proportion to its influence on the error.
The bigger the influence of weight \(w_i\),
the greater the reduction of error that can be induced by changing it.
For a given learning instance, input \(n\):
Error: \(E_n = \frac{1}{2}(d_n - y_n)^2\), with target \(d_n\) and actual output \(y_n\).
Minimize this error by gradient descent.
Update via: \(\Delta w_i = \eta\,(d_n - y_n)\,x_i\) (the delta rule).
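A sketch of this update for a single step-activation neuron in Python (names and data are illustrative assumptions):

```python
import numpy as np

def delta_rule_update(w, x, target, eta=0.1):
    """One delta-rule step: w_i += eta * (d - y) * x_i."""
    y = 1.0 if np.dot(w, x) >= 0 else 0.0   # actual output of the step neuron
    w += eta * (target - y) * x             # each weight changes in proportion to its input
    return w
```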
After a hypothetical learning example,
below are some possible resulting weights and biases (\(\theta\)).
This is easy for a linear single-layer network,
with 2 neurons and a bias,
with step activation.
XOR
Problem:
Requires a hidden layer (for non-linearity)
Solution: N-layer network
Solution: Can solve any non-linear function
XOR
Separation into 3D via hidden layer allows solving XOR
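As a sketch, one hand-picked 2-2-1 network that separates XOR via its hidden layer (these particular weights are illustrative; many settings work):

```python
import numpy as np

step = lambda v: (np.asarray(v) >= 0).astype(float)

def xor_net(x1, x2):
    x = np.array([x1, x2], dtype=float)
    h = step(np.array([[1.0, 1.0], [1.0, 1.0]]) @ x - np.array([0.5, 1.5]))  # hidden units: OR and AND
    return step(np.array([1.0, -2.0]) @ h - 0.5)                             # output: OR and not-AND

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, int(xor_net(a, b)))   # prints 0, 1, 1, 0
```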
Problem:
How to solve for errors in the hidden layer?
Neural network for traveling example
Given input example \(e\), what is the
output prediction?
Error gradients
Top left: original samples;
Top right: network approximation;
Bottom left: true function which generated
samples;
Bottom right: raw error
Error gradients: simple
* Error (vertical) as function of 2 weights (\(x_1\) and \(x_2\))
Error
* How much should we change each weight?
* In proportion to its influence on the error.
* The bigger the influence of weight \(w_m\), the greater the reduction of error
that can be induced by changing it.
* This influence wouldn’t be the same everywhere: changing any
particular weight will generally make all the others more or less
influential on the error, including the weight we have changed.
Solution: Error backpropagation
Each propagation involves the following:
Forward propagation of a training pattern’s input through the neural
network,
in order to generate the propagation’s output activations.
Backward propagation of output activations through the neural
network,
using the training pattern target,
in order to generate the deltas of all output and hidden neurons,
(the delta of an output neuron is the difference between its target and actual output values)
For each weight-synapse do the following:
* Multiply its output delta, and input activation, to get the gradient
of the weight.
* Subtract a ratio (percentage) of the gradient from the weight, as sketched below.
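In code, the two steps per weight reduce to something like (a sketch; variable names assumed):

```python
grad = output_delta * input_activation   # gradient of the error w.r.t. this weight
weight -= learning_rate * grad           # subtract a ratio (the learning rate) of the gradient
```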
Learning rate
The ratio (percentage) influences the speed and quality of
learning;
it is called the learning rate.
The greater the ratio, the faster the neuron trains;
the lower the ratio,
the more accurate the training is.
The sign of the gradient of a weight indicates where the error is
increasing;
this is why the weight must be updated in the opposite direction.
Finally:
Repeat step 1 and 2 until the performance of the network is
satisfactory.
Learning rate is too large
Learning rate is too small
Solution:
Error backpropagation overview and basic idea:
1 initialize network weights (often small random values)
2 do
3   for each training example ex
4     prediction = neural-net-output(network, ex) // forward pass
5     actual = teacher-output(ex)
6     compute error (prediction − actual) at the output units, as \(\Delta\)
7     starting with the output layer, repeat until reaching the input layer:
8       propagate \(\Delta\) values back to the previous layer
9       update network weights between the two layers
10 until all examples classified correctly or another stopping criterion is satisfied
11 return the network
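A minimal runnable NumPy version of the loop above, assuming one hidden sigmoid layer, squared error, and XOR as the training set (layer sizes, learning rate, and stopping rule are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # training inputs
T = np.array([[0], [1], [1], [0]], dtype=float)              # teacher outputs (XOR)

hw = rng.normal(0, 0.5, (3, 4))   # input(2)+bias -> 4 hidden   (step 1: small random weights)
ow = rng.normal(0, 0.5, (5, 1))   # hidden(4)+bias -> 1 output
eta = 0.5

for epoch in range(10_000):                           # steps 2-10: loop to a stopping criterion
    for x, t in zip(X, T):                            # step 3: each training example
        xb = np.append(x, 1.0)
        h = sigmoid(xb @ hw)                          # step 4: forward pass
        hb = np.append(h, 1.0)
        y = sigmoid(hb @ ow)                          # step 5: prediction vs. teacher
        delta_o = (y - t) * y * (1 - y)               # step 6: output deltas
        delta_h = (ow[:-1] @ delta_o) * h * (1 - h)   # steps 7-8: propagate deltas back
        ow -= eta * np.outer(hb, delta_o)             # step 9: update weights between layers
        hw -= eta * np.outer(xb, delta_h)

for x in X:                                           # step 11: use the trained network
    hb = np.append(sigmoid(np.append(x, 1.0) @ hw), 1.0)
    print(x, sigmoid(hb @ ow).round(2))
```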
Backprop(from ArtInt)
This approach assumes:
\(n\) input features,
\(k\) output features, and
\(nh\) hidden units.
Both \(hw\) and \(ow\) are two-dimensional arrays of weights.
Note that
\(0:nk\) means the index ranges from
\(0\) to \(nk\) (inclusive), and
\(1:nk\) means the index ranges from
\(1\) to \(nk\) (inclusive).
This algorithm assumes that \(val(e,X_0)=1\) for all \(e\)
Backprop (from AIMA)
Neural network for traveling example
One hidden layer containing two units,
trained on the travel data, can fit it perfectly.
One run of back-propagation with the learning rate η=0.05, and taking
10,000 steps,
gave weights that accurately predicted the training data:
Comparison: digit recognition
 | 3-NN | 300-hidden NN | LeNet | Boosted LeNet | SVM | Virtual SVM | Shape match |
---|---|---|---|---|---|---|---|
Error rate (%) | 2.4 | 1.6 | 0.9 | 0.7 | 1.1 | 0.56 | 0.63 |
Run time (ms/digit) | 1000 | 10 | 30 | 50 | 2000 | 200 | • |
Memory req. (MB) | 12 | 0.49 | 0.012 | 0.21 | 11 | • | • |
Training time (days) | 0 | 7 | 14 | 30 | 10 | • | • |
% rejected to reach 0.5% error | 8.1 | 3.2 | 1.8 | 0.5 | 1.8 | • | • |
Prediction!
Neural networks can predict complex time-series, e.g., prices,
economies, etc
Input can be given by experts via intervention indicators
Training via a shifting window
Like other methods, training, validation, and testing sets help
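A sketch of building shifting-window training data from a series in Python (window length, names, and data are illustrative assumptions):

```python
import numpy as np

def make_windows(series, window=4):
    """Each input is `window` consecutive values; the target is the next value."""
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = np.array(series[window:])
    return X, y

prices = [101, 103, 102, 105, 107, 106, 109, 111]
X, y = make_windows(prices)   # X[0] = [101, 103, 102, 105], y[0] = 107
```

The window then shifts one step at a time over the series; the earlier windows can form the training set and the later ones the validation and test sets.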
Another way of learning with a teacher (a negligent, rarely-there
teacher).
Temporal credit assignment problem.
More to come with spiking networks.
* Over-fitting impedes generalization
Regularization
A straight line might under-fit these data points.
Left: a 10th-order fit might over-fit.
Right: the true function from which the data were sampled.
Regularization
\(\lambda\) is defined as a constant that
penalizes higher-order (larger) weights during the error calculation (for neurons).
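A standard way \(\lambda\) enters the error is as a weight-decay penalty (assuming this common form is the one intended):

\(E = E_{\text{data}} + \frac{\lambda}{2} \sum_i w_i^2\)

so larger \(\lambda\) penalizes large (higher-order) weights more strongly.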
Regularization:
too little or too much
* dotted = train, solid = test
* y = error, x = \(\lambda\), such that
either too low or too high an order is worse, with a happy medium in the
middle.
Regularization: Bayesian
* Pre-specify your hypothesis about \(\lambda\)
* Left: \(\lambda = 1000\)
* Right: \(\lambda = 1\)
Regularization: Bayesian
* Prior: \(p(\textbf{w}|\lambda, H) \propto
\exp[-\frac{\lambda}{2}\textbf{w}^2]\)
* Posterior: \(p(\textbf{w}|D, \lambda, H) = \frac{
p(D|\textbf{w}, H)\, p(\textbf{w}|\lambda, H)}{p(D|\lambda,
H)}\), where \(D\) are the
data
* Likelihood: \(p(D|\textbf{w}, H) \propto \prod_u
\exp[-\frac{1}{2}(y^u - f(x^u; \textbf{w}))^2]\)