Back to Home

### On weight initialization of deep nets

#### Preliminaries

#### Fully connected nets

#### Rectified linear units

I will illustrate here the computations made in Glorot et al.(2010) and He et al.(2015), hopefully in a clear and simple way. Any suggestions and additonal references are welcome!

I will try to use simple and detailed math. Again suggestions and clarifications are welcome. All integrals are done in Wolfram Alpha, as I am skeptical of my own integration skills, and would not want to claim falsehoods.

- We need to assume $X\sim D$ and each dimension $X_i$ as i.i.d., with some known mean and variance $\mu_X$ and $\sigma^2_X$,
- $W_{l,ij} \sim \mathcal{W}_l$ is the sampled weight at layer $l$,
- $\mathbb{E}[\mathcal{W}] = 0$ will simplify computations,
- $U_{l,i}$ is the pre-activation value of unit $i$ in layer $l$,
- $H_{l,i}$ is the post-activation value of the same unit,
- for simplicity of notation we assume there are $n$ inputs, $n$ hidden units per layer.

While the i.i.d. assumption on $X$ is possibly strong, an identical mean and variance across some normalized $X_i$s is sufficient under most cases.

We will analyse fully-connected neural nets, but this results holds for convolutional networks and to an extent recurrent networks, without many modifications to the analysis. Consider the value of $H_{l,i}$: $$ H_{l,i} = f\left( U_{l,i} = \sum_j H_{l-1,j}W_{l,ji}\right) $$ In general: $$\begin{align} H_i &= f(U_i)\\ U_i &= \sum_j X_j W_{ji} \\ \mathbb{E}[X_j W_{ji}] &= \mathbb{E}[X_{j}]\mathbb{E}[W_{ji}] = 0 \tag{1} \\ Var(X_j W_{ji}) &= \mathbb{E}[X_j]^2 Var(W_{ji}) + \mathbb{E}[W_{ji}]^2 Var(X_j)+Var(W_{ji})Var(X_j)\\ &= \mu_X^2 \sigma_W^2+0+\sigma_W^2\sigma_X^2 \\ &= \sigma_W^2 (\mu_X^2+\sigma_X^2) \tag{2} \end{align}$$

(1) is possible because we assume $\mu_\mathcal{W}=0$, similarly for (2) we can eliminate the second right hand side term. To simplify computations, we will assume that $X$ is normalized, i.e. $\mu_X=0$ and $\sigma_X=1$.

Notice that $U_i$ is the sum of many i.i.d. variables, as such from the central limit theorem, we know its distribution: $$ \begin{align} U_i &\sim \mathcal{N} \left[ n\mathbb{E}[X_jW_{ji}], nVar(X_jW_{ji})\right] \\ &\sim \mathcal{N} \left[ 0, n\sigma^2_W\right] \tag{3} \end{align}$$

Note that pre-activation Batch Normalization (Ioffe 2015) forces this distribution to be $\mathcal{N}(0,1)$, which as we will see lessens the need for "smart" initialization (at the cost of computation).

Now, the distribution of $H_i$ will depend on the activation function $f$. We will start with ReLUs. To find the expected value and variance of $H_i$, we need to know the following values using scary integrals:

$$\mathbb{E}(ReLU(X);X\sim\mathcal{N}(0,\sigma^{2}))=\int_{0}^{\infty}x\frac{e^{-\frac{x^{2}}{2\sigma^{2}}}}{\sqrt{2\pi\sigma^{2}}}dx=\sqrt{\frac{\sigma^{2}}{2\pi}} \tag{4}$$

$$\mathbb{E}(ReLU(X)^{2};X\sim\mathcal{N}(0,\sigma^{2}))=\int_{0}^{\infty}x^{2}\frac{e^{-\frac{x^{2}}{2\sigma^{2}}}}{\sqrt{2\pi\sigma^{2}}}dx=\frac{1}{2}\sigma^{2} \tag{5}$$

As such: $$\mathbb{E}[H_i] = \sqrt{\frac{n\sigma^2_W}{2\pi}}$$ and $$\begin{align} Var(H_i) &= \mathbb{E}[H_i^2]-\mathbb{E}[H_i]^2 &\\ &= \frac{1}{2}\sigma^2_U -\mathbb{E}[H_i]^2 &\mbox{ from (5)}\\ &= \frac{1}{2}\sigma^2_U -\frac{\sigma^2_U}{2\pi} &\mbox{ from (4)} \\ &= \frac{n}{2}\sigma^2_W - \frac{n\sigma^2_W}{2\pi} &\mbox{ from (3)}\\ & = \frac{n}{2}\sigma^2_W(1-\frac{1}{\pi}) \end{align}$$

Note that had we not assumed $\sigma_X=1$ and $\mu_X=0$ this would be $$\frac{n}{2}\sigma^2_W(\mu^2_X+\sigma^2_X)(1-\frac{1}{\pi})$$

Now the magic happens. Instead of $X$, since we have a deep network, the input to a layer is simply the output of the previous one. So we can write this variance term recursively in $l$ (I will write $H_l$ to reprent all units in the $l$th layer, $W_l$ for the weights): $$\begin{align} Var(H_l) &= \frac{n}{2}\sigma^2_{W_l}(\mu^2_{H_{l-1}}+\sigma^2_{H_{l-1}})(1-\frac{1}{\pi}) \\ &= (1-\frac{1}{\pi})(\frac{n}{2}n\sigma^2_{W_l} \frac{\sigma^2_{W_{l-1}}}{2\pi} + \frac{n}{2}\sigma^2_{W_l}\frac{n}{2}\sigma^2_{W_{l-1}}(1-\frac{1}{\pi})) \\ & = (1-\frac{1}{\pi})\frac{n^2}{4}\sigma^2_{W_l}\sigma^2_{W_{l-1}}(\frac{1}{\pi} + (1-\frac{1}{\pi})) \\ & = \frac{n^2}{4}\sigma^2_{W_l}\sigma^2_{W_{l-1}}(1-\frac{1}{\pi}) \end{align}$$

Note that we can repeat this process $L$ times to get: $$ Var(H_L) = \frac{n^L}{2^L} (\mu^2_X+\sigma^2_X)(1-\frac{1}{\pi}) \prod_{l=1}^L \sigma^2_{W_l}$$

We would like the variance not to explode in the number of layers, as such we can choose $\sigma^2_{W_l}$ to counteract each $n/2$ factor by being $\sigma^2_{W_l} = 2/n$. Note that the $n$ we really want here is found via the sum in (1), which is simply the number of inputs to each neuron, $n_{in}$.

For a uniform $U[a,b]$ initialization distribution, which has variance $1/12(b-a)^2$, we want to set a zero mean and a variance of $2/n$, as such we want to sample from $U[-k,k]$ with $$ k = \sqrt{\frac{6}{n_{in}}}$$

For a gaussian $N(0,\sigma^2)$, we already have zero mean and by definition a variance of $\sigma^2$, as such we want a standard-deviation (which most packages use as a parameter instead of variance) of: $$\sigma = \sqrt{\frac{2}{n_{in}}}$$

The integrals performed in (4) and (5) are even scarier with a tanh activation function. In fact (4) with tanh is quite easy, it is 0, but as far as I know there is no clean way of computing: $$ \mathbb{E}_{X\sim\mathcal{N}}(\tanh^2(X))=\int_{-\infty}^{\infty}\tanh^{2}(x)\frac{e^{-\frac{x^{2}}{2\sigma^{2}}}}{\sqrt{2\pi\sigma^{2}}}dx $$ As such, we need to assume that we are in the linear region of tanh, which makes the integral quite cleaner, with the help of tools like mathematica (or wolframalpha): $$ \begin{align} \mathbb{E}_{X\sim\mathcal{N}}(X^2)&=\int_{-\infty}^{\infty}x^{2}\frac{e^{-\frac{x^{2}}{2\sigma^{2}}}}{\sqrt{2\pi\sigma^{2}}}dx \\ &= \sqrt{\frac{1}{\sigma^2}}\sigma^3 = \sigma^2 \end{align}$$

We can repeat our analysis, and find that: $$ \mathbb{E}[H_i] = 0$$ $$ Var(H_i) = n\sigma^2_W (\mu_X^2+\sigma_X^2)$$

For $L$ layers: $$Var(H_L) = n^L \prod_l \sigma^2_{W_l}$$ suggesting to initialize with $\sigma^2_{W_l}=1/n$. For a uniform that is $U[-\sqrt{12/n_{in}},\sqrt{12/n_{in}}]$, for a gaussian that is using a stdev of $\sqrt{1/n_{in}}$.

A very interesting point is brought by Glorot et al. (2010), what happens in backpropagation? Especially in the case where not all layers have the same $n_{hidden}$?

Understand that the gradient of the loss w.r.t some linear hidden unit $H_i$ is roughly the sum of incoming gradients:

$$\frac{\partial \mathcal{L}}{\partial H_{l,i}} = \sum_j \frac{\partial \mathcal{L}}{\partial H_{l+1,j}} W_{l+1,ij}$$

We can do in 'reverse' the exact same analysis that we just did for the forward propagation, but with incoming gradients instead of $X$, and we will find that similarly: $$Var \frac{\partial \mathcal{L}}{\partial H_{1,i}} = n_{out}^L \prod_{l=1}^L \sigma^2_{W_l}$$ Notice how that is the variance of the input layer, also notice that for some matrix $W$ of shape $(n_{in}, n_{out})$ there are $n_{out}$ incoming gradients, as such if we want the gradient not to explose, we also need this variance to be 1, suggesting to initialize to a variance of $1/n_{out}$ instead of $1/n_{in}$.

Glorot et al. propose a compromise, to initialize at halfway, $2/(n_{in}+n_{out})$.

For a uniform initialization that implies the result $U[-k,k]$, $k=\sqrt{6/(n_{in}+n_{out})}$, which Glorot et al. find experimentally to perform better.

The same analysis holds for ReLUs (He et al., 2015), but He et al. do not suggest to use the same 'halfway' trick, claiming that the ratio between back and forward variance is small enough in their architecture that there is no practical difference.

Although apparently seldom used anymore, Dropout still remains an effective regularization scheme. I just wanted to emphasize in this small section that it is important to correct the activations when using dropout with rate $p$.

Say a unit only has half of the incoming units, $p_{in}=1/2$, its expected variance is divided by two (although its expected mean remains 0, pre-activation). To fix this one can:

- multiply activations by $1/p$ at training time, not multiply at test time, or
- initialize with a variance multiplied by $1/p$ for training time, multiply by $p$ at test time.