I had a dumb idea while I was looking at how to model SGD with an SDE. If you randomly initialize the weight matrices according to a Gaussian, \(\theta \sim \mathcal N (0, I)\), where \(I\) is the identity matrix, then the first gradient updates should be normally distributed too, at least for \(L_2\) loss.
So what if that kept happening? Like, you sample from one normal distribution to form another normal distribution, and so on. This is plausibly something that can occur, like difference equations driven by i.i.d. Gaussian noise. Though the exact form I'm thinking of is kind of dumb.
Suppose you have a sequence of random variables. We define
\begin{align*}
\forall n \in \mathbb N\ (n &= 0 \rightarrow \sigma_n \sim \mathcal N (0,1))\tag{1}\\
\forall n \in \mathbb N\ (n+1 &\neq 0 \rightarrow \sigma_{n+1} \sim \mathcal N(0, \sigma_n^2))\tag{2}
\end{align*}
Though really, each \(\sigma_n\) will be sampled from a half-Gaussian, so that the \(\sigma\)'s (and the ratios below) stay nonnegative. But I don't want to write that out, so whatever.
How does such a sequence of jumps evolve? Well, intuitively there is a ~2/3 chance of \(\sigma_{n+1}\) being smaller than \(\sigma_n\), so the "event" of the variance shrinking will occur infinitely many times almost surely. So if you do the obvious thing and, not being a mathematician, ask "what's the stationary distribution?", you might suspect we'll get a Dirac delta.
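To pin down the "~2/3": conditional on \(\sigma_n\), the next draw is a half-Gaussian with scale \(\sigma_n\), so
\begin{equation*}
P(\sigma_{n+1} < \sigma_n | \sigma_n) = P(|Z| < 1) = \operatorname{erf}(1/\sqrt{2}) \approx 0.683, \qquad Z \sim \mathcal N(0,1).
\end{equation*}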
Yet the "event" of the variance growing larger must also occur infinitely many times in any sequence. So, sadly, we need more than infinitely many shrinks to solve our problem. Well, let's just do a quick numerical check that eq. 1 leads to an infinitely shrinking sequence.
Running a bit of code, we get a sequence like
+ 0.06619689276821521,
+ -0.004180368466071496,
+ -3.632035781799425e-06,
+ -1.780514134591378e-11,
+ -3.809091727146634e-22,
+ -3.858858262617346e-44,
+ 1.3234421571769518e-87,
+ 7.107311345616059e-175,
+ 0.0,
+ 0.0
So it looks like we're on the right track.
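For reference, here is a minimal sketch of such a simulation (assuming numpy, and using \(|\sigma_n|\) as the scale so the draws match eqs. 1–2; this is a sketch, not necessarily the exact script that produced the numbers above):

```python
import numpy as np

rng = np.random.default_rng(0)

sigma = rng.normal(0.0, 1.0)  # sigma_0 ~ N(0, 1), eq. 1
for n in range(1, 2001):
    # sigma_{n+1} ~ N(0, sigma_n^2): draw with scale |sigma_n|, eq. 2
    sigma = rng.normal(0.0, abs(sigma))
    if n % 250 == 0:
        print(n, sigma)
# |sigma_n| drifts toward 0 and eventually underflows to exactly 0.0.
```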
What now? Well, since we're interested in how the sequence grows or shrinks, we might reasonably ask what \(E[\frac{\sigma_{n+1}}{\sigma_n}| \sigma_n]\) is. A quick calculation shows
\begin{align*}
E[\sigma_{n+1}/\sigma_n| \sigma_n] &= \int_0^\infty \frac{x}{\sigma_n} \sqrt{\frac{2}{\pi \sigma_n^2}}\ \exp\Big(-\frac{x^2}{2 \sigma_n^2}\Big)\, dx\tag{3}\\
E[\sigma_{n+1}/\sigma_n| \sigma_n] &= \sqrt{\frac{2}{\pi}}\tag{4}
\end{align*}
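As a quick sanity check on eq. 4 (a minimal sketch, assuming numpy): conditional on \(\sigma_n\), the ratio \(\sigma_{n+1}/\sigma_n\) is just \(|Z|\) with \(Z \sim \mathcal N(0,1)\), so

```python
import numpy as np

rng = np.random.default_rng(0)
ratios = np.abs(rng.normal(size=1_000_000))  # sigma_{n+1} / sigma_n given sigma_n
print(ratios.mean(), np.sqrt(2 / np.pi))     # both ~0.798
```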
So in expectation, \(\sigma_{n+1}\) is smaller than \(\sigma_n\). What do we do for \(E[\sigma_{n+2}/\sigma_n| \sigma_n]\)? Well, we have
\begin{equation*}\tag{5}
E[\sigma_{n+2}/\sigma_n| \sigma_n] = E\big[(\sigma_{n+2}/\sigma_{n+1}) \cdot (\sigma_{n+1}/\sigma_n)| \sigma_n\big]
\end{equation*}
If we could somehow break that product of ratios of successive \(\sigma_i\) into a sum, then we'd be closer to re-using eq. 4. The obvious candidates are the arithmetic-geometric mean inequality,
\begin{equation*}
\prod_i a_i \leq \Big(\frac{\sum_i a_i}{n}\Big)^n, \qquad a_i \geq 0,\tag{6}
\end{equation*}
and Jensen's inequality,
\begin{equation*}
f: \mathbb R \rightarrow \mathbb R \text{ is a convex function} \rightarrow f(E[X]) \leq E[f(X)],\tag{7}
\end{equation*}
which look like they should let us do that. Applying (6) with two terms does break the product up:
\begin{align*}
& (5)\land (6) \vdash E[\sigma_{n+2}/\sigma_n| \sigma_n] \leq E\Big[ \frac{1}{4} \Big( \big(\frac{\sigma_{n+2}}{\sigma_{n+1}}\big)^2 + \big(\frac{\sigma_{n+1}}{\sigma_{n}}\big)^2 + 2 \frac{\sigma_{n+2}}{\sigma_n} \Big) \Big|\sigma_n\Big]\tag{8}\\
& (8) \vdash E[\sigma_{n+2}/\sigma_n| \sigma_n] \leq \frac{1}{2} E\Big[ \big(\frac{\sigma_{n+2}}{\sigma_{n+1}}\big)^2 + \big(\frac{\sigma_{n+1}}{\sigma_{n}}\big)^2 \Big|\sigma_n\Big]\tag{9}
\end{align*}
Unfortunately each squared ratio has conditional expectation exactly 1 (a half-Gaussian with scale \(\sigma\) has second moment \(\sigma^2\)), so (9) only yields a bound of 1, and Jensen's inequality (7) points the wrong way for an upper bound, since \((E[X])^2 \leq E[X^2]\). The tower property of conditional expectation does better: eq. 4 also holds one index later, with \(\sigma_{n+1}\) in place of \(\sigma_n\), so
\begin{align*}
& (5) \vdash E[\sigma_{n+2}/\sigma_n| \sigma_n] = E\Big[ \frac{\sigma_{n+1}}{\sigma_n}\, E\big[\frac{\sigma_{n+2}}{\sigma_{n+1}} \big| \sigma_{n+1}\big] \Big|\sigma_n\Big]\tag{10}\\
& (4)\land (10) \vdash E[\sigma_{n+2}/\sigma_n| \sigma_n] = \sqrt{\frac{2}{\pi}}\, E[\sigma_{n+1}/\sigma_n| \sigma_n] = \frac{2}{\pi}\tag{11}
\end{align*}
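And a quick Monte Carlo check of eq. 11 (again a sketch, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(0)
s0 = 1.0                                          # condition on sigma_n = 1
s1 = np.abs(rng.normal(0.0, s0, size=1_000_000))  # sigma_{n+1}
s2 = np.abs(rng.normal(0.0, s1))                  # sigma_{n+2}, scale |sigma_{n+1}|
print((s2 / s0).mean(), 2 / np.pi)                # both ~0.64
```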
~~It seems pretty likely~~ Generalizing the above argument to show
\begin{equation*}
E\Big[\frac{\sigma_n}{\sigma_{n-m}}\Big|\sigma_{n-m}\Big] = \Big(\frac{2}{\pi}\Big)^{m/2}\tag{12}
\end{equation*}
is left as an exercise to the reader.
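For the impatient: iterating the tower property one index at a time gives
\begin{equation*}
E\Big[\frac{\sigma_n}{\sigma_{n-m}}\Big|\sigma_{n-m}\Big]
= \sqrt{\frac{2}{\pi}}\, E\Big[\frac{\sigma_{n-1}}{\sigma_{n-m}}\Big|\sigma_{n-m}\Big]
= \cdots = \Big(\frac{2}{\pi}\Big)^{m/2}.
\end{equation*}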
Eq. 12 implies \(E[\sigma_n | \sigma_0]\) shrinks geometrically, and a Markov-plus-Borel–Cantelli argument then gives \(\sigma_n \rightarrow 0\) almost surely. Moreover, if there is some kind of limiting distribution for \(\sigma_n\) as \(n\rightarrow \infty\), it would have to have mean 0. But since we've restricted \(\sigma\) to be in \([0, \infty)\), it would have to assign zero mass to the positive reals.
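Spelled out: by Markov's inequality and eq. 12,
\begin{equation*}
P(\sigma_n > \epsilon | \sigma_0) \leq \frac{E[\sigma_n | \sigma_0]}{\epsilon} = \Big(\frac{2}{\pi}\Big)^{n/2} \frac{\sigma_0}{\epsilon},
\end{equation*}
which is summable in \(n\), so Borel–Cantelli says that for every \(\epsilon > 0\) only finitely many of the events \(\{\sigma_n > \epsilon\}\) happen almost surely.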
That doesn't imply, though, that the operator K (eq. 13), which maps the density of \(\sigma_n\) to the density of \(\sigma_{n+1}\), has no fixed point
\begin{equation*}
K[f](x) = \int_0^\infty \sqrt{\frac{2}{\pi \sigma^2}} \exp\Big(-\frac{x^2}{2 \sigma^2}\Big) f(\sigma)\, d\sigma.\tag{13}
\end{equation*}
Or that it has one. Only that, if such an equilibrium is the limit of our jump process, it must be the Dirac delta at zero.
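Finally, to watch the collapse at the level of densities rather than sample paths, here is a rough numerical sketch (assuming numpy) that discretizes K on a grid and iterates it. The grid can't represent a point mass at 0, so probability mass piles up near the left edge and then leaks past it; that leakage is the Dirac-delta behaviour showing up in the discretization.

```python
import numpy as np

# Discretize the positive half-line; the left cutoff is arbitrary and the
# true limit (a point mass at 0) cannot live on this grid.
x = np.linspace(1e-3, 6.0, 2000)
dx = x[1] - x[0]

# Kernel of eq. 13: density of the next value at x[i], given previous value x[j].
K = np.sqrt(2.0 / (np.pi * x[None, :] ** 2)) * \
    np.exp(-x[:, None] ** 2 / (2.0 * x[None, :] ** 2))

# Start from the density of sigma_0, a half-Gaussian with scale 1.
f = np.sqrt(2.0 / np.pi) * np.exp(-x ** 2 / 2.0)

for k in range(8):
    total = f.sum() * dx                 # mass still resolved by the grid
    near_zero = f[x < 0.05].sum() * dx   # mass already crowded under 0.05
    print(f"iter {k}: mass on grid ~ {total:.3f}, of which below 0.05 ~ {near_zero:.3f}")
    f = (K @ f) * dx                     # one application of K, eq. 13
```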