<h1 id="sloppy-models-1">Sloppy Models: Background and Motivation (2021-01-04)</h1>
<p>One of the most influential lectures I ever caught in grad school was
<a href="http://gutengroup.mcb.arizona.edu/">Ryan Gutenkunst’s</a> talk on
<em>sloppy models</em>. <a href="http://www.lassp.cornell.edu/sethna/Sloppy/WhatAreSloppyModels.html">Model
sloppiness</a>,
first developed by Jim Sethna and his students and collaborators, is a
ubiquitous phenomenon in mathematical models for the natural sciences.
Often, especially in biology, a model of a system of interest will
have many more parameters to be fit than observations to fit them
with: a scenario like eighty parameters and six data points is not
uncommon. Standard theory tells you that you are doomed here. But if
you ignore standard theory and plow ahead, you typically find that the
(under)fit model will predict future experiments perfectly well. How
could that possibly work?</p>
<p>This act of “<a href="http://www.lassp.cornell.edu/sethna/Sloppy/FittingExponentials.html">prediction without
parameters</a>”
works only because complex models seem to generically exhibit
hierarchies of parameter importance: a few linear combinations of
parameters end up running almost the whole show, and the other
dimensions of the parameter space become effectively irrelevant. (The
technical details get filled in by information geometry, a beautiful
field of mathematics wedding statistics to differential geometry, that
we’ll explore later in this series.) Model sloppiness seemed to offer a unifying
perspective on a fundamental question: how, in general, do simple
effective theories bubble up from the chaotic microdynamics of
complicated systems? Why don’t we have to keep track of every little
detail when we model a cloud of gas, a cell, a brain or a national
economy? How is science possible at all?</p>
<p>I went back to the lab and got back to work on my project, which had
nothing obvious to do with model sloppiness. But I kept thinking
about it. It seemed to come up everywhere. Recently it even came up
in my kitchen.</p>
<h2 id="sloppy-pecans">Sloppy Pecans</h2>
<p>A while ago I needed some toasted nuts for a recipe, so I put some
pecans in a pan and stuck them in the oven. When I pulled them out
again they were too hot to handle, so I put them on the stove to cool.
How long would I have to wait, I wondered? I placed a meat
thermometer in the pan and started taking measurements. After 20
minutes I had a dataset that looked like this:</p>
<p><img src="/assets/sloppy-models-1/init_plot.png" alt="Plot of cooling data" /></p>
<p>What would happen in another ten minutes? Another hour? Basic
theory recommends fitting a model based on Newton’s Law of Cooling,
which states:</p>
\[\dot{T} = -k (T - T_{env}),\]
<p>which is to say that the change in temperature $\dot{T}$ is
proportional to the difference between the current temperature $T$ and
the ambient temperature of the environment $T_{env}$. The
proportionality constant $k$ governs the rate of equilibration,
absorbing details about the conductivity of the interface between
object and environment, the surface area of said interface, and so on.</p>
<p>This equation has solutions of the form:</p>
\[T(t) = T_{env} + (T_0 - T_{env})e^{-kt}.\]
<p>To fix the initial temperature $T_0$ we can simply take the first
observation, which we define to have occurred at $t=0$. To find $k$
and $T_{env}$, however, we will need to do some model fitting.</p>
<p>Using mean squared error as our fitting criterion, our loss function is:</p>
\[L(k, T_{env}) = \frac{1}{N} \sum_{i=1}^N (T_i - \hat T(t_i; T_{env}, k))^2.\]
<p>While this non-linear least squares problem has no analytic solution
in general, we can use a standard optimizer (in this case, the
out-of-the-box Nelder-Mead solver as implemented in SciPy) to find the
maximum likelihood parameter estimates: $\hat k \approx 0.11\
\mathrm{min}^{-1}$ and $\hat T_{env} \approx 302.3\ \mathrm{K}$. These
values check out: 302K is in the
eighties on the Fahrenheit scale, entirely believable for the stovetop
of a recently used oven. Similarly, a cooling rate constant of
$0.11\ \mathrm{min}^{-1}$ implies that an object heated to 350°F would become
touchable after fifteen minutes and completely cool within an hour.
This might be slightly fast for a cake, but checks out for roasted
nuts on a thin metal pan. We can therefore have some confidence that
the fitting procedure has not gone completely awry.</p>
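<p>The raw measurements aren’t reproduced here, so the data below are hypothetical stand-ins chosen to resemble the plot above; the model and the Nelder-Mead fit, however, follow the text directly:</p>

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical stand-ins for the kitchen data: minutes since the pan
# came off the oven, and thermometer readings in kelvin.
t_obs = np.array([0.0, 2.5, 5.0, 7.5, 10.0, 12.5, 15.0, 20.0])
T_obs = np.array([410.0, 384.0, 364.3, 349.3, 337.9, 329.3, 322.7, 314.0])

T0 = T_obs[0]  # fix the initial temperature to the first observation

def T_hat(t, T_env, k):
    """Newton's-law cooling curve: T(t) = T_env + (T0 - T_env) e^{-kt}."""
    return T_env + (T0 - T_env) * np.exp(-k * t)

def mse_loss(params):
    T_env, k = params
    return np.mean((T_obs - T_hat(t_obs, T_env, k)) ** 2)

result = minimize(mse_loss, x0=[300.0, 0.1], method="Nelder-Mead")
T_env_mle, k_mle = result.x
```

<p>With these stand-in data the fit lands near the values quoted above; the real measurements would of course give slightly different numbers.</p>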
<p>We can also check the goodness of fit visually and confirm that our
model is reasonable:</p>
<p><img src="/assets/sloppy-models-1/residual_plot.png" alt="MLE fit" /></p>
<p>While there is some non-normality in the residuals (presumably
driven by the outlying penultimate data point), this is hardly the
worst model for eight observations casually collected in the kitchen
with a meat thermometer and a wristwatch.</p>
<p>(As an aside, this should also help you to believe that this data is
indeed real, because if I had faked it, it would have looked nicer!)</p>
<p>Having found reasonable point estimates for our parameters, though, we
might now ask: how much uncertainty is attached to these estimates?</p>
<p>In order to proceed we must probabilize the model, i.e. render it in a
form where it can assign probabilities to observations, given some
choice of parameters. A natural choice, given that we are already
fitting according to MSE loss, is to assume that our observed data is
independently and identically normally distributed about the true
temperature. This assumption is not merely physically plausible (if
the errors are the result of many small additive sources) but is
actually suggested by the loss function itself. The loss</p>
\[L(\theta) = \frac{1}{N} \sum_i (T_i - \hat T(\theta, t_i))^2\]
<p>can be reconstrued as:</p>
\[\ell(\theta; \sigma) = \log Z(\sigma) - \sum_i \frac{(T_i - \hat T(\theta, t_i))^2}{2\sigma^2},\]
<p>i.e. the negative MSE is (up to a data-independent constant) just the
log-likelihood of independent normally-distributed data, where the
$i$th observation has mean $\hat T(\theta, t_i)$, and all
observations share a variance $\sigma^2$. What value of $\sigma^2$
should we choose? One natural choice is to simply let the data speak
for itself. We can find the value of $\sigma$ that maximizes the
likelihood by numerically solving:</p>
\[\sigma_{MLE} = \textrm{argmax}_\sigma\, \ell(\hat\theta; \sigma).\]
<p><img src="/assets/sloppy-models-1/sigma_plot.png" alt="Log-likelihood as a function of $\sigma^2$" /></p>
<p>We find that the optimal noise scale is given by $\sigma \approx
1.3$K, roughly 2.3 $^\circ$F. The fact that the maximum likelihood
estimate puts the typical error at about 1K (and not, say, 0.01K or
100K), is another helpful sanity check: it agrees with our intuition
about the precision we can expect from a meat thermometer. In
any case, 1.3 is perhaps not so different from unity as to be worth
worrying about.</p>
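<p>In fact, for i.i.d. Gaussian errors this one-dimensional optimization has a closed-form answer: the maximum likelihood $\sigma$ is just the root mean squared residual at the fitted parameters. A quick numerical check, using hypothetical residuals as stand-ins for the real ones:</p>

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical residuals (kelvin) standing in for the fitted model's errors.
r = np.array([0.4, -1.2, 1.8, -0.6, 0.9, -2.1, 1.5, -0.7])
N = len(r)

def neg_loglik(sigma):
    """Negative Gaussian log-likelihood as a function of the noise scale."""
    return N * np.log(sigma * np.sqrt(2 * np.pi)) + np.sum(r**2) / (2 * sigma**2)

# Numerical optimum should coincide with sigma_MLE = sqrt(MSE).
numeric = minimize_scalar(neg_loglik, bounds=(1e-3, 10.0), method="bounded").x
closed_form = np.sqrt(np.mean(r**2))
```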
<p>With the noise scale estimated, we can sample from the posterior joint
distribution of $T_{env}$ and $k$ using MCMC. The resulting marginals
look like this:</p>
<p><img src="/assets/sloppy-models-1/marginal_plot.png" alt="Marginals" /></p>
<p>There’s a fair bit of residual uncertainty here: the standard
deviation of $T_{env}$ is about 5K, and $k$ has wiggle room amounting to a
factor of two in either direction.</p>
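<p>The post doesn’t specify which sampler produced these marginals; a minimal random-walk Metropolis sketch, assuming a flat prior on $(T_{env}, k)$ and using hypothetical stand-in data, might look like this:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the kitchen data (minutes, kelvin).
t_obs = np.array([0.0, 2.5, 5.0, 7.5, 10.0, 12.5, 15.0, 20.0])
T_obs = np.array([410.0, 384.0, 364.3, 349.3, 337.9, 329.3, 322.7, 314.0])
sigma = 1.3  # noise scale estimated above

def log_lik(T_env, k):
    pred = T_env + (T_obs[0] - T_env) * np.exp(-k * t_obs)
    return -np.sum((T_obs - pred) ** 2) / (2 * sigma**2)

# Random-walk Metropolis, started at the MLE.
theta = np.array([302.3, 0.11])
ll = log_lik(*theta)
samples = []
for _ in range(20000):
    proposal = theta + rng.normal(0.0, [2.0, 0.02])
    if proposal[1] > 0:  # keep the cooling rate positive
        ll_prop = log_lik(*proposal)
        if np.log(rng.uniform()) < ll_prop - ll:  # accept/reject step
            theta, ll = proposal, ll_prop
    samples.append(theta.copy())
posterior = np.array(samples[5000:])  # discard burn-in
```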
<p>When we examine the likelihood surface, however, a different picture
emerges. In the figure below, log-likelihood is plotted as a function
of $T_{env}$ and $k$. The yellow ribbon corresponds to the region of
highest likelihood; the indigo basin, the lowest. MLE parameters are
indicated with a red dot, and the right-most plot simply shows a
zoomed-in view of the left (note that the aspect ratios are not
necessarily identical).</p>
<p><img src="/assets/sloppy-models-1/error_surface.png" alt="Error Surface" /></p>
<p>This plot exhibits several features that we will come to recognize as
generic characteristics of physical models. For now, we can simply
pick them out and list them:</p>
<ul>
<li>The region of high likelihood is, to a first approximation, a 1-D
manifold embedded within a 2-D parameter space.</li>
<li>To a second approximation, the manifold forms a kind of <em>ribbon</em> with one
length scale many times longer than the other.</li>
<li>The likelihood changes sharply in response to movement along the
thin (or <em>stiff</em>) dimension of the manifold. Conversely, the model is
practically indifferent to movement in the long (or <em>sloppy</em>)
dimension.</li>
<li>The manifold is not well-described in terms of the bare parameters of
the model. Instead, taking the MLE estimate as the origin of a new
coordinate frame, the stiff dimension is given by some linear
combination of parameters, and the sloppy dimension by another. In
words, the model is largely indifferent between cooling slowly to a
lower ambient temperature or cooling quickly to a higher one. But
it <em>can</em> tell the difference between cooling quickly to a lower
temperature and slowly to a higher one.</li>
<li>Because the manifold is curved (as viewed from the bare parameter
space), the natural coordinate frame (i.e. directions of maximal
stiffness and sloppiness) changes as a function of position along
the manifold.</li>
</ul>
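<p>One way to make the stiff and sloppy directions concrete is to eigendecompose the Hessian of the loss at the MLE: the eigenvector with the large eigenvalue is the stiff combination of parameters, the one with the small eigenvalue the sloppy combination. A finite-difference sketch, again with hypothetical stand-in data:</p>

```python
import numpy as np

# Hypothetical stand-ins for the kitchen data (minutes, kelvin),
# placed exactly on the fitted curve for a clean illustration.
t_obs = np.array([0.0, 2.5, 5.0, 7.5, 10.0, 12.5, 15.0, 20.0])
T_obs = 302.0 + 108.0 * np.exp(-0.11 * t_obs)

def loss(theta):
    T_env, k = theta
    pred = T_env + (T_obs[0] - T_env) * np.exp(-k * t_obs)
    return np.mean((T_obs - pred) ** 2)

def hessian(f, x, h=1e-4):
    """Central finite-difference Hessian of f at x."""
    n = len(x)
    H = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            xpp, xpm, xmp, xmm = (x.copy() for _ in range(4))
            xpp[i] += h; xpp[j] += h
            xpm[i] += h; xpm[j] -= h
            xmp[i] -= h; xmp[j] += h
            xmm[i] -= h; xmm[j] -= h
            H[i, j] = (f(xpp) - f(xpm) - f(xmp) + f(xmm)) / (4 * h * h)
    return H

H = hessian(loss, np.array([302.0, 0.11]))
eigvals, eigvecs = np.linalg.eigh(H)       # eigenvalues in ascending order
sloppy_dir, stiff_dir = eigvecs[:, 0], eigvecs[:, 1]
anisotropy = eigvals[1] / eigvals[0]       # stiff-to-sloppy stiffness ratio
```

<p>One caveat: with bare parameters this ratio partly reflects the mismatched units of $T_{env}$ (kelvin) and $k$ (inverse minutes); analyses in the sloppy-models literature typically work in log-parameters to strip out that effect.</p>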
<h2 id="conclusions-for-now">Conclusions (for now)</h2>
<p>It was a pleasant surprise to find that, even in a simple
two-parameter model fit with science-fair-grade data, some of the
essential phenomena of model sloppiness already appear. In future
posts I’d like to explore the math behind model sloppiness in more
general terms and perhaps revisit our pecan cooling model through a
series of increasingly powerful lenses.</p>
<h2 id="references">References</h2>
<p>Most of the literature on model sloppiness, as well as some
introductory tutorials, can be found through <a href="http://www.lassp.cornell.edu/sethna/Sloppy/WhatAreSloppyModels.html">Jim Sethna’s
website</a>.</p>

<h1 id="vc-incentive-alignment">VC Incentive Alignment (2020-11-17)</h1>

<p>Shared ownership is supposed to align incentives. If two founders
each own 50% of a given company, then they will tend to want the same
things for that company. But consider a founder who owns 99% of one
company and an investor who owns 1% of 99 companies: do these actors
always have the same incentives? Not necessarily. To see why,
consider the following model.</p>
<p>Suppose that the ultimate value of a company can be modeled by a log
normal distribution $X \sim \mathcal{LN}(\mu, \sigma^2)$. Assuming
log utility, a single founder who owns the entire company receives the
expected utility,</p>
\[U_F = E[\log(X)] = \mu.\]
<p>That is to say, the utility scales only with the mean and does not
depend on the variance, as the fluctuations in returns are precisely
canceled by the log utility function: the founder in this scenario is
perfectly indifferent to the variance.</p>
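<p>This indifference to variance is easy to verify by simulation (illustrative parameter values):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 1.0  # illustrative location parameter

# E[log X] = mu for X ~ LN(mu, sigma^2), whatever the variance:
utilities = [
    np.log(rng.lognormal(mean=mu, sigma=s, size=1_000_000)).mean()
    for s in (0.4, 0.8, 1.6)  # each sample mean lands near mu = 1.0
]
```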
<p>Now consider the case of an investor who owns a $1\%$ stake in $100$
companies, each of whose valuations is independently and identically
distributed as $\mathcal{LN}(\mu, \sigma^2)$. The investor’s portfolio
$Y = \frac{1}{N}\sum_{i=1}^{N} X_i$ (with $N = 100$) is the arithmetic mean of $N$ log-normal
random variates. The portfolio has the first two moments:</p>
\[\begin{align*}
E[Y] &= \exp{\left(\mu + \frac{\sigma^2}{2}\right)},\\
V[Y] &= \frac{(\exp{(\sigma^2)} - 1)\exp{\left(2\mu + \sigma^2\right)}}{N},
\end{align*}\]
<p>and gives the investor expected utility:</p>
\[U_I = E[\log(Y)].\]
<p>To estimate the quantity $U_I$, we can appeal to a useful
approximation for functions of random variables. For sufficiently
well-behaved functions $f$ and random variables $X$ we may
Taylor-expand inside the expectation and write:</p>
\[\begin{align*}
E[f(X)] &= E\left[f(E[X]) + f'(E[X]) (X - E[X]) + \frac{f''(E[X])}{2} (X - E[X])^2 + \mathcal{O}\left((X - E[X])^3\right)\right]\\
&\approx f(E[X]) + \frac{f''(E[X])}{2}V[X].
\end{align*}\]
<p>Applying this idea to $\log(Y)$, we obtain:</p>
<p>\(E[\log(Y)] \approx \log(E[Y]) - \frac{V[Y]}{2 E[Y]^2}\).</p>
<p>Plugging in the relevant quantities and simplifying, we arrive at:</p>
\[\begin{align*}
E[\log(Y)] &\approx \mu + \frac{\sigma^2}{2} - \frac{\exp{(\sigma^2)} - 1}{2 N}\\
&\approx \mu + \frac{\sigma^2}{2},
\end{align*}\]
<p>with the last line holding asymptotically as $N$ grows large. Notice that, whereas the
sole founder’s utility was simply $\mu$, the investor’s utility
contains an additional term $\frac{\sigma^2}{2}$ that scales with the
variance.</p>
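<p>A simulation bears out the approximation (illustrative parameters; the simulated portfolio utility should land close to $\mu + \frac{\sigma^2}{2}$ minus the small $\frac{1}{N}$ correction):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, N = 1.0, 0.8, 100  # illustrative parameters

# Simulate many portfolios, each the mean of N iid lognormal company values.
X = rng.lognormal(mean=mu, sigma=sigma, size=(50_000, N))
Y = X.mean(axis=1)
investor_utility = np.log(Y).mean()

# Second-order approximation derived above.
approx = mu + sigma**2 / 2 - np.expm1(sigma**2) / (2 * N)
```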
<p>What does this portend for incentive alignment between founders and
investors? Generally speaking, both parties benefit from making $\mu$
go up. However, investors can benefit from increasing the quantity
$\mu + \frac{\sigma^2}{2}$ as well. This is true even if $\mu$ itself
goes down, so long as the decrease in $\mu$ is more than offset by an
increase in $\sigma^2$, i.e. $\Delta \sigma^2 > -2\Delta \mu$.</p>
<p>This argument is summarized graphically in the diagram below:</p>
<p><img src="/assets/vc-incentive-alignment/axis4.png" alt="Diagram" /></p>
<p>Considering the company as a point in a parameter space, the founder
benefits most from pushing the state of the business in the direction
of the blue arrow. An infinitely diluted investor, however, benefits
most from pushing things in the direction of the red arrow. The
shaded hemispheres show the directions that would be acceptable to
either party. The purple sector denotes the area of mutual benefit.
The opposing red and blue sectors indicate the
axis of antagonism: directions that would be preferred by one party
but not the other.</p>
<p>Are founders and investors explicitly thinking of companies as
log-normal random variables? Not necessarily. This is, of course, a
highly stylized model that caricatures one particular dynamic. But
every day, founders get up and tweak parameters that make successes of
various sizes more or less likely. It’s not such a leap to think of
startups as random variables whose parameters you can modify with
effort.</p>
<p>A friend of mine with a lot of experience in the startup world noticed
that investors would come in and make suggestions that didn’t seem to
reflect the best interests of the company. These suggestions were
mostly of the swing-for-the-fences variety, encouraging the founders
to take the company in directions that had a small chance of
spectacular success, even if the modal outcome was failure. To the
founders, who presumably didn’t relish the thought of betting years of
hard work on a lottery ticket, this advice seemed very bad. Did the
investors simply not know what they were talking about?</p>
<p>More likely, the investors did know what they were talking about:
they were just giving the advice that suited their own interests,
which might not necessarily be yours.</p>

<h1 id="mit-talk">Guest Lecture at MIT (2020-11-01)</h1>

<p>Next week I’ll be giving a guest lecture at MIT for 15.S08, Advanced
Data Analytics and Machine Learning in Finance, sketching an overview
of modern speech-to-text systems and offering some lessons learned
from building Scribe.</p>

<h1 id="hello-world">Hello, world! (2018-02-26)</h1>

<p>Blog coming soon.</p>