Emmanuel Bengio

folinoid.com

A pdf of the slides

- A language
- A compiler
- A Python library

```
import theano
import theano.tensor as T
```

What you really do:

- Build **symbolic** graphs of computation (w/ input nodes)
- Automatically compute gradients through it

```
gradient = T.grad(cost, parameter)
```

- Feed some data
- Get results!

x = T.scalar('x')
y = T.scalar('y')
z = x + y

Ops are the building blocks of the computation graph

They (usually) define:

- A computation (given inputs)
- A partial gradient (given inputs and output gradients)
- C/CUDA code that does the computation

x = T.scalar()
y = T.scalar()
z = x + y
f = theano.function([x,y],z)
f(2,8) # 10

x = T.vector('x')
W = T.matrix('weights')
b = T.vector('bias')
z = T.nnet.softmax(T.dot(x, W) + b)
f = theano.function([x, W, b], z)
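For intuition, here is a minimal NumPy sketch of the numbers a function like `f` would return. NumPy stands in for Theano here, and the sizes and values (3 inputs, 2 outputs, constant weights) are assumptions made up for illustration.

```python
import numpy as np

def softmax(v):
    # Subtracting the max improves numerical stability without changing the result.
    e = np.exp(v - v.max())
    return e / e.sum()

x = np.array([1.0, 2.0, 3.0])   # input vector with 3 features (assumed size)
W = np.full((3, 2), 0.1)        # weights mapping 3 inputs to 2 outputs (assumed)
b = np.zeros(2)                 # bias
z = softmax(x.dot(W) + b)       # same expression as the symbolic graph
```

Both pre-activations are equal here, so the two output probabilities are equal and sum to 1.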

a = T.vector()
b = f(a)
c = g(b)
d = h(c)
full_fun = theano.function([a],d) # h(g(f(a)))
part_fun = theano.function([c],d) # h(c)

\(\newcommand{\dd}[2]{\frac{\partial #1}{\partial #2}}\)

$$ \dd{f}{z} = \dd{f}{a} \dd{a}{z} $$

$$ \dd{f}{z} = \dd{f}{a} \dd{a}{b} \dd{b}{c} \cdots \dd{x}{y} \dd{y}{z} $$

x = T.scalar()
y = x ** 2
g = T.grad(y, x) # 2*x

$$ \dd{f}{z} = \dd{f}{a} \dd{a}{b} \dd{b}{c} ... \dd{x}{y} \dd{y}{z} $$
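A quick way to convince yourself of the chain rule is to compare it with a finite-difference estimate. The sketch below uses plain Python rather than Theano; the functions `f` and `a_of` are arbitrary choices made for illustration.

```python
import math

def a_of(z):
    return z ** 2            # inner function a(z)

def f(a):
    return math.sin(a)       # outer function f(a)

def numeric_grad(fn, x, eps=1e-6):
    # Central finite difference: an approximation, not an exact derivative.
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)

z = 1.5
# Chain rule: df/dz = df/da * da/dz
chain = math.cos(a_of(z)) * (2 * z)
numeric = numeric_grad(lambda t: f(a_of(t)), z)
```

The two values agree up to the finite-difference error, which is the kind of check you can also run against a gradient graph.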

You don't really need to think about the gradient anymore.

- all you need is a **scalar** cost
- some parameters
- and a call to `T.grad`

(or, wow, sending things to the GPU is long)

Data reuse is made through 'shared' variables.

initial_W = uniform(-k, k, (n_in, n_out))
W = theano.shared(value=initial_W, name="W")

That way it sits in the 'right' memory spots

(e.g. on the GPU if that's where your computation happens)

Shared variables act like any other node:

prediction = T.dot(x, W) + b
cost = T.sum((prediction - target)**2)
gradient = T.grad(cost, W)

You can compute stuff, take gradients.

Most importantly, you can:

gradient = T.grad(cost, W)
update_list = [(W, W - lr * gradient)]
f = theano.function(
    [x, y, lr], [cost],
    updates=update_list)

Remember,

# this updates W
f(minibatch_x, minibatch_y, learning_rate)
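To make the update rule concrete, here is a NumPy sketch of one gradient step on the squared-error cost. NumPy stands in for Theano, the gradient is derived by hand instead of with `T.grad`, and the sizes, seed and learning rate are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(8, 3)          # minibatch of 8 examples, 3 features (assumed sizes)
target = rng.randn(8, 2)
W = np.zeros((3, 2))
b = np.zeros(2)
lr = 0.001                   # small learning rate (assumed)

def cost_fn(W):
    prediction = x.dot(W) + b
    return np.sum((prediction - target) ** 2)

cost_before = cost_fn(W)
# Hand-derived gradient of the summed squared error with respect to W.
gradient = 2 * x.T.dot(x.dot(W) + b - target)
W = W - lr * gradient        # same rule as the (W, W - lr * gradient) pair
cost_after = cost_fn(W)
```

With a small enough learning rate the cost goes down after the step, which is what each call to the compiled function does to the shared `W`.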

If dataset is small enough, use a shared variable

index = T.iscalar()
X = theano.shared(data['X'])
Y = theano.shared(data['Y'])
f = theano.function(
    [index, lr], [cost],
    updates=update_list,
    givens={x: X[index], y: Y[index]})

You can also take slices:

`X[idx:idx+n]`

There are 3 major ways of printing values:

- When building the graph
- During execution
- After execution

And you should do a lot of 1 and 3

Use a test value

# activate the testing
theano.config.compute_test_value = 'raise'
x = T.matrix()
x.tag.test_value = numpy.ones((mbs, n_in))
y = T.vector()
y.tag.test_value = numpy.ones((mbs,))

You should do this when designing your model to:

- test shapes
- test types
- ...

Now every node has a `.tag.test_value`.

Use the `Print` Op:

from theano.printing import Print
a = T.nnet.sigmoid(h)
# this prints "a:", a.__str__ and a.shape
a = Print("a", ["__str__", "shape"])(a)
b = something(a)

- `Print` acts like the identity
- gets activated whenever `b` "requests" `a`
- anything in `dir(numpy.ndarray)` goes

Add the node to the outputs

theano.function([...],
[..., some_node ])

Any node can be an output

You should do this:

- To acquire statistics
- To monitor gradients, activations...
- With moderation*

You can reshape arrays:

`b = a.reshape((n,m,p))`

As long as their flat size (total number of elements) is the same.

You can change the dimension order:

# b[i,k,j] == a[i,j,k]
b = a.dimshuffle(0,2,1)
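When the pattern is a pure permutation, `dimshuffle` behaves like a NumPy transpose. The sketch below uses NumPy as a stand-in to check the index relation from the comment above; the array shape is an arbitrary choice.

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)   # shape (2, 3, 4), arbitrary example
b = a.transpose(0, 2, 1)             # NumPy analogue of a.dimshuffle(0, 2, 1)

# b[i, k, j] == a[i, j, k] for every valid index
i, j, k = 1, 2, 3
same = b[i, k, j] == a[i, j, k]
```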

You can also add broadcast dimensions:

# a.shape == (n,m)
b = a.dimshuffle(0,'x' ,1)
# or
b = a.reshape([n,1,m])

This allows you to do elemwise* operations with arrays of shape \((n, p, m)\), where \(p\) can be arbitrary.

* e.g. addition, multiplication

If an array lacks dimensions to match the other operand, its broadcast pattern is automatically expanded to the left to match the number of dimensions.

(But you should always do it yourself)
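NumPy follows the same broadcasting convention, so it gives a quick way to try these rules without compiling anything. The sketch below is a NumPy stand-in for the Theano expressions above; the sizes `n`, `m`, `p` are arbitrary.

```python
import numpy as np

n, m, p = 2, 3, 5
a = np.ones((n, m))
b = a[:, np.newaxis, :]      # shape (n, 1, m), like a.dimshuffle(0, 'x', 1)
c = np.zeros((n, p, m))
summed = b + c               # the length-1 axis is broadcast to length p

v = np.arange(m)             # shape (m,): implicitly expanded on the left
shifted = c + v              # v is treated as if it had shape (1, 1, m)
```

Writing the extra axes yourself, as in `b`, makes the intended alignment explicit instead of relying on the implicit left expansion used for `v`.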

When compiling a function, ask theano to profile it:

`f = theano.function(..., profile=True)`

When exiting Python, it will print the profile.

Class
---
<% time> < sum %>< apply time>< time per call>< type><#call> <#apply> < Class name>
30.4% 30.4% 10.202s 5.03e-05s C 202712 4 theano.sandbox.cuda.basic_ops.GpuFromHost
23.8% 54.2% 7.975s 1.31e-05s C 608136 12 theano.sandbox.cuda.basic_ops.GpuElemwise
18.3% 72.5% 6.121s 3.02e-05s C 202712 4 theano.sandbox.cuda.blas.GpuGemv
6.0% 78.5% 2.021s 1.99e-05s C 101356 2 theano.sandbox.cuda.blas.GpuGer
4.1% 82.6% 1.368s 2.70e-05s Py 50678 1 theano.tensor.raw_random.RandomFunction
3.5% 86.1% 1.172s 1.16e-05s C 101356 2 theano.sandbox.cuda.basic_ops.HostFromGpu
3.1% 89.1% 1.027s 2.03e-05s C 50678 1 theano.sandbox.cuda.dnn.GpuDnnSoftmaxGrad
3.0% 92.2% 1.019s 2.01e-05s C 50678 1 theano.sandbox.cuda.nnet.GpuSoftmaxWithBias
2.8% 94.9% 0.938s 1.85e-05s C 50678 1 theano.sandbox.cuda.basic_ops.GpuCAReduce
2.4% 97.4% 0.810s 7.99e-06s C 101356 2 theano.sandbox.cuda.basic_ops.GpuAllocEmpty
0.8% 98.1% 0.256s 4.21e-07s C 608136 12 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.5% 98.6% 0.161s 3.18e-06s Py 50678 1 theano.sandbox.cuda.basic_ops.GpuFlatten
0.5% 99.1% 0.156s 1.03e-06s C 152034 3 theano.sandbox.cuda.basic_ops.GpuReshape
0.2% 99.3% 0.075s 4.94e-07s C 152034 3 theano.tensor.elemwise.Elemwise
0.2% 99.5% 0.073s 4.83e-07s C 152034 3 theano.compile.ops.Shape_i
0.2% 99.7% 0.070s 6.87e-07s C 101356 2 theano.tensor.opt.MakeVector
0.1% 99.9% 0.048s 4.72e-07s C 101356 2 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.1% 100.0% 0.029s 5.80e-07s C 50678 1 theano.tensor.basic.Reshape
0.0% 100.0% 0.015s 1.47e-07s C 101356 2 theano.sandbox.cuda.basic_ops.GpuContiguous
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)

Finding the culprits:

` 24.1% 24.1% 4.537s 1.59e-04s 28611 2 GpuFromHost(x)`

- **Gemm/Gemv**, matrix\(\times\)matrix / matrix\(\times\)vector
- **Ger**, matrix update
- **GpuFromHost**, data CPU \(\to\) GPU
- **HostFromGpu**, the opposite
- **[Advanced]Subtensor**, indexing
- **Elemwise**, element-per-element Ops (+, -, exp, log, ...)
- **Composite**, many elemwise Ops merged together.

x = T.vector('x')
n = T.scalar('n')
def inside_loop(x_t, acc, n):
    return acc + x_t * n
values, _ = theano.scan(
    fn=inside_loop,
    sequences=[x],
    outputs_info=[T.zeros(1)],
    non_sequences=[n],
    n_steps=x.shape[0])
sum_of_n_times_x = values[-1]

def inside_loop(x_t, acc, n):
    return acc + x_t * n

- This function is called at each iteration
- It takes the arguments in this order:
  - Sequences (default: `seq[t]`)
  - Outputs (default: `out[t-1]`)
  - Others (no indexing)
- It returns `out[t]` for each output
- There can be many sequences, many outputs and many others:

`f(seq_0[t], seq_1[t], .., out_0[t-1], out_1[t-1], .., other_0, other_1, ..): `
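The calling convention is easier to see in an ordinary loop. The sketch below is a plain-Python imitation of the single-sequence, single-output case only; `scan_sketch` is a hypothetical helper written for illustration, not part of Theano, and it ignores everything symbolic.

```python
def scan_sketch(fn, sequence, initial_output, non_sequences):
    """Plain-Python imitation of a scan with one sequence and one output."""
    outputs = []
    previous = initial_output                    # plays the role of out[0]
    for seq_t in sequence:                       # seq[t]
        # Argument order: sequence item, previous output, then the others.
        previous = fn(seq_t, previous, *non_sequences)
        outputs.append(previous)                 # out[t]
    return outputs

def inside_loop(x_t, acc, n):
    return acc + x_t * n

values = scan_sketch(inside_loop, [1.0, 2.0, 3.0], 0.0, [10.0])
sum_of_n_times_x = values[-1]                    # 60.0
```

Every intermediate `out[t]` is kept, which is why the final result is read from the last entry.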

values, _ = theano.scan(
# ...
sum_of_n_times_x = values[-1 ]

values = [ [out_0[1], out_0[2], ...],
[out_1[1], out_1[2], ...],
...]

If there's only one output then `values` is just that output's sequence: `values = [out_0[1], out_0[2], ...]`

fn = inside_loop ,

The loop function we saw earlier

sequences=[x],

Sequences are indexed over their first dimension.

If you want `out[t-1]` as an input to the loop function, then you need to give `out[0]`:

outputs_info=[T.zeros(1)],

If you don't want `out[t-1]` as an input to the loop, pass `None`:

outputs_info=[None, out_1[0], out_2[0], ...],

You can also do more advanced "tapping", i.e. get `out[t-k]`.

non_sequences=[n],

Variables that are used inside the loop (but not indexed).

n_steps=x.shape[0])

The number of steps that the loop should do.

Note that it is possible to do a "while" loop

$$ h_t = \tanh(x_t W_x + h_{t-1} W_h + b_h) $$

$$ \hat{y} = \mbox{softmax}(h_{T} W_y + b_y) $$

def loop(x_t, h_tm1, W_x, W_h, b_h):
    return T.tanh(T.dot(x_t, W_x) +
                  T.dot(h_tm1, W_h) +
                  b_h)
values, _ = theano.scan(loop,
    [x], [T.zeros(n_hidden)], parameters)
# apply the output layer to the last hidden state h_T
y_hat = T.nnet.softmax(T.dot(values[-1], W_y) + b_y)

Usually you want to use minibatches (\(x_{it}\in \mathbb{R}^k\)):

# shape: (batch size, sequence length, k)
x = T.tensor3('x')
# define loop ...
v, u = theano.scan(loop,
    [x.dimshuffle(1, 0, 2)],
    ...)

This way scan iterates over the "sequence" axis.

Otherwise it would iterate over the minibatch examples.

$$x:(.,1,100,100)\;\;\; W:(3,1,9,9)$$

input \( x: (m_b,n_c^{(i)},{\color{blue}h},{\color{green}w})\)

filters \( W: (n_c^{(i+1)},n_c^{(i)},{\color{red}f_s},{\color{red}f_s})\)

# x.shape: (batch size, n channels, height, width)
# W.shape: (n output channels, n input channels,
# filter height, filter width)
output = T.nnet.conv.conv2d(x, W)

This convolves \(W\) with \(x\), the output is

\(o: (m_b, n_c^{(i+1)}, {\color{blue}h}-{\color{red}f_s}+1, {\color{green}w}-{\color{red}f_s}+1)\)
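The output shape formula can be checked with a few lines of arithmetic. The helper below is a hypothetical plain-Python function written for illustration (it is not a Theano call) and assumes the unpadded, stride-1 convolution that the formula above describes.

```python
def conv2d_valid_shape(x_shape, w_shape):
    """Output shape of an unpadded, stride-1 2D convolution."""
    batch, n_in, height, width = x_shape
    n_out, n_in_w, f_h, f_w = w_shape
    if n_in != n_in_w:
        raise ValueError("input channels of x and W must match")
    return (batch, n_out, height - f_h + 1, width - f_w + 1)

# A batch of 64 (assumed) 32x32 RGB images through two 5x5 layers.
out_1 = conv2d_valid_shape((64, 3, 32, 32), (16, 3, 5, 5))    # (64, 16, 28, 28)
out_2 = conv2d_valid_shape(out_1, (32, 16, 5, 5))             # (64, 32, 24, 24)
```

Each 5\(\times\)5 filter removes 4 pixels from both spatial dimensions, so 32 becomes 28 and then 24.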

Example input, 32\(\times\)32 RGB images:

# x.shape: (batch size, n channels, height, width)
x = x.reshape((mbsize, 32, 32, 3))
x = x.dimshuffle(0,3,1,2)
# W.shape: (n output channels, n input channels,
# filter height, filter width)
W = theano.shared(randoms((16, 3, 5, 5)),
                  name='W-conv')
output_1 = T.nnet.conv.conv2d(x, W)

The flat array for an image is typically stored as a sequence of RGBRGBRGBRGBRGBRGBRGBRGBRGB...

So you want to flip (`dimshuffle`) the dimensions so that the channel axis comes before height and width.
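To see why the reshape has to put the channels last before reordering, here is a NumPy sketch on a tiny 2\(\times\)2 image. NumPy's `transpose` stands in for `dimshuffle`, and the pixel values are just a counter chosen so the layout is easy to read.

```python
import numpy as np

height, width = 2, 2
# One flat image stored as R,G,B,R,G,B,...: pixel k holds values 3k, 3k+1, 3k+2.
flat = np.arange(height * width * 3)

x = flat.reshape((1, height, width, 3))   # (batch, height, width, channels)
x = x.transpose(0, 3, 1, 2)               # (batch, channels, height, width)
red = x[0, 0]                             # every third value, starting at 0

# Reshaping straight to (1, 3, height, width) would mix the channels instead.
wrong_red = flat.reshape((1, 3, height, width))[0, 0]
```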

Another layer:

W = theano.shared(randoms((32, 16, 5, 5)),
                  name='W-conv-2')
output_2 = T.nnet.conv.conv2d(output_1, W)
# output_2.shape: (batch size, 32, 24, 24)

You can also do pooling:

from theano.tensor.downsample import max_pool_2d
# output_2.shape: (batch size, 32, 24, 24)
pooled = max_pool_2d(output_2, (2,2))
# pooled.shape: (batch size, 32, 12, 12)
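What 2\(\times\)2 max pooling computes can be sketched in NumPy. The function below is a hypothetical stand-in for illustration, not the Theano implementation, and it assumes non-overlapping windows and even height and width.

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling over the last two axes."""
    batch, channels, height, width = x.shape
    # Split each spatial axis into (blocks, 2), then take the max inside each block.
    blocks = x.reshape(batch, channels, height // 2, 2, width // 2, 2)
    return blocks.max(axis=(3, 5))

x = np.arange(16, dtype=float).reshape(1, 1, 4, 4)
pooled = max_pool_2x2(x)                              # [[5, 7], [13, 15]]
big = max_pool_2x2(np.zeros((2, 32, 24, 24)))         # shape (2, 32, 12, 12)
```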

Finally, after (many) convolutions and poolings:

flattened = conv_output_n.flatten(ndim=2)
# then feed `flattened` to a normal hidden layer

We want to keep the minibatch dimension, but flatten all the other ones for our hidden layer, thus the `ndim=2`.

Make reusable classes for layers, or parts of your model:

class HiddenLayer:
    def __init__(self, x, n_in, n_hidden):
        self.W = shared(...)
        self.b = shared(...)
        self.output = activation(T.dot(x, self.W) + self.b)

It's really easy with theano/python to save and reload data:

class HiddenLayer:
    def __init__(self, x, n_in, n_hidden):
        # ...
        self.params = [self.W, self.b]
    def save_params(self):
        return [i.get_value() for i in self.params]
    def load_params(self, values):
        for p, value in zip(self.params, values):
            p.set_value(value)

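import cPickle as pickle  # on Python 3: import pickle
# save (pickle files should be opened in binary mode)
with open('model_params.pkl', 'wb') as f:
    pickle.dump(model.save_params(), f)
# load
with open('model_params.pkl', 'rb') as f:
    model.load_params(pickle.load(f))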
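The save/load pattern can be tried end to end without Theano. In the sketch below, `FakeShared` and `TinyModel` are hypothetical stand-ins that only mimic `get_value`/`set_value`, and an in-memory buffer replaces the file so nothing is written to disk.

```python
import io
import pickle

class FakeShared:
    """Hypothetical stand-in for a shared variable (get_value/set_value only)."""
    def __init__(self, value):
        self._value = value
    def get_value(self):
        return self._value
    def set_value(self, value):
        self._value = value

class TinyModel:
    def __init__(self):
        self.params = [FakeShared([0.1, 0.2]), FakeShared([0.0])]
    def save_params(self):
        return [p.get_value() for p in self.params]
    def load_params(self, values):
        for p, value in zip(self.params, values):
            p.set_value(value)

model = TinyModel()
buffer = io.BytesIO()                         # in-memory replacement for a file
pickle.dump(model.save_params(), buffer)      # save
model.params[0].set_value([9.9, 9.9])         # simulate further training
buffer.seek(0)
model.load_params(pickle.load(buffer))        # load restores the saved values
restored = model.save_params()
```

Only plain values are pickled here, so the saved parameters stay readable even if the model class changes later.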

You can even save whole models and functions with `pickle`.

ValueError: GpuElemwise. Input dimension mis-match. Input 1 (indices
start at 0) has shape[1] == 256, but the output's size on that axis is 128.
Apply node that caused the error: GpuElemwise{add,no_inplace}
(<CudaNdarrayType(float32, matrix)>,
<CudaNdarrayType(float32, matrix)>)
Inputs types: [CudaNdarrayType(float32, matrix),
CudaNdarrayType(float32, matrix)]

It tells us we're trying to add \(A+B\) but \(A:(n, 128), B:(n, 256)\)

Theano has a default float precision: `theano.config.floatX`

For now GPUs can only use float32:

`TensorType(float32, matrix) cannot store a value of dtype float64 without risking loss of precision. If you do not mind this loss, you can: 1) explicitly cast your data to float32, or 2) set "allow_input_downcast=True" when calling "function".`
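The first option from the message, an explicit cast, looks like this on the NumPy side. This is a minimal sketch; the array `data` is a made-up example of input data.

```python
import numpy as np

data = np.ones((4, 3))                # NumPy creates float64 arrays by default
data32 = data.astype(np.float32)      # option 1: explicitly cast to float32

original_dtype = data.dtype           # float64, which would trigger the error
cast_dtype = data32.dtype             # float32, which matches the GPU type
```

Casting once when the data is loaded avoids repeating the conversion on every function call.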

http://deeplearning.net/software/theano/library/tensor/basic.html

http://deeplearning.net/data/mnist/mnist.pkl.gz

*Opens console*

- Random numbers (`T.shared_randomstreams`)
- Printing/Drawing graphs (`theano.printing`)
- Jacobians, Rop, Lop and Hessian-free
- Dealing with NaN/inf
- Extending theano (implementing Ops and types)
- Saving whole models to files (`pickle`)