Jekyll feed, last updated 2021-01-05T18:34:32-08:00 (/feed.xml). Patrick O’Neill, my website, patrick.kaileigh.oneill@gmail.com

Sloppy Models: Background and Motivation (2021-01-04, /sloppy-models-1)

<p>One of the most influential lectures I ever caught in grad school was <a href="http://gutengroup.mcb.arizona.edu/">Ryan Gutenkunst’s</a> talk on <em>sloppy models</em>. <a href="http://www.lassp.cornell.edu/sethna/Sloppy/WhatAreSloppyModels.html">Model sloppiness</a>, first developed by Jim Sethna and his students and collaborators, is a ubiquitous phenomenon in mathematical models for the natural sciences. Often, especially in biology, a model of a system of interest will have many more parameters to be fit than observations to fit them with: a scenario like eighty parameters and six data points is not uncommon. Standard theory tells you that you are doomed here. But if you ignore standard theory and plow ahead, you typically find that the (under)fit model will predict future experiments perfectly well. How could that possibly work?</p> <p>This act of “<a href="http://www.lassp.cornell.edu/sethna/Sloppy/FittingExponentials.html">prediction without parameters</a>” works only because complex models seem to generically exhibit hierarchies of parameter importance: a few linear combinations of parameters end up running almost the whole show, and the other dimensions of the parameter space become effectively irrelevant. (The technical details get filled in by information geometry, a beautiful field of mathematics wedding statistics to differential geometry, that we’ll explore later in this series.) Model sloppiness seemed to offer a unifying perspective on a fundamental question: how, in general, do simple effective theories bubble up from the chaotic microdynamics of complicated systems? Why don’t we have to keep track of every little detail when we model a cloud of gas, a cell, a brain or a national economy?
How is science possible at all?</p> <p>I went back to the lab and got back to work on my project, which had nothing obvious to do with model sloppiness. But I kept thinking about it. It seemed to come up everywhere. Recently it even came up in my kitchen.</p> <h2 id="sloppy-pecans">Sloppy Pecans</h2> <p>A while ago I needed some toasted nuts for a recipe, so I put some pecans in a pan and stuck them in the oven. When I pulled them out again they were too hot to handle, so I put them on the stove to cool. How long would I have to wait, I wondered? I placed a meat thermometer in the pan and started taking measurements. After 20 minutes I had a dataset that looked like this:</p> <p><img src="/assets/sloppy-models-1/init_plot.png" alt="Plot of cooling data" /></p> <p>What would happen in another ten minutes? Another hour? Basic theory recommends fitting a model based on Newton’s Law of Cooling, which states:</p> $\dot{T} = -k (T - T_{env}),$ <p>which is to say that the change in temperature $\dot{T}$ is proportional to the difference between the current temperature $T$ and the ambient temperature of the environment $T_{env}$, with the minus sign ensuring that a hot object cools rather than heats. The proportionality constant $k$ governs the rate of equilibration, absorbing details about the conductivity of the interface between object and environment, the surface area of said interface, and so on.</p> <p>This equation has solutions of the form:</p> $T(t) = T_{env} + (T_0 - T_{env})e^{-kt}.$ <p>To fix the initial temperature $T_0$ we can simply take the first observation, which we define to have occurred at $t=0$.
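As a quick numerical sanity check (not from the original post; the parameter values below are arbitrary stand-ins, not the fitted ones), we can confirm that this closed form really does solve the cooling equation by comparing it against a direct numerical integration:

```python
import numpy as np
from scipy.integrate import solve_ivp

def cooling_model(t, T0, T_env, k):
    """Closed-form solution: exponential decay from T0 toward T_env."""
    return T_env + (T0 - T_env) * np.exp(-k * t)

# Arbitrary illustrative values (not the values fitted later in the post).
T0, T_env, k = 450.0, 302.0, 0.11  # K, K, 1/min

# Integrate dT/dt = -k (T - T_env) directly and compare to the closed form.
t_eval = np.linspace(0.0, 60.0, 121)
sol = solve_ivp(lambda t, T: -k * (T - T_env), (0.0, 60.0), [T0],
                t_eval=t_eval, rtol=1e-10, atol=1e-10)

max_err = np.max(np.abs(sol.y[0] - cooling_model(t_eval, T0, T_env, k)))
```

The two curves agree to within the integrator's tolerance.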
To find $k$ and $T_{env}$, however, we will need to do some model fitting.</p> <p>Using mean squared error as our fitting criterion, our loss function is:</p> $L(k, T_{env}) = \frac{1}{N} \sum_{i=1}^N (T_i - \hat T(t_i; T_{env}, k))^2.$ <p>While this non-linear least squares problem has no analytic solution in general, we can use a standard optimizer (in this case, the out-of-the-box Nelder-Mead solver as implemented in SciPy) to find the maximum likelihood parameter estimates. Our MLE parameter estimates are $\hat k \approx 0.11\ \mathrm{min}^{-1}$ and $\hat T_{env} \approx 302.3\ \mathrm{K}$. These values check out: 302 K is in the eighties on the Fahrenheit scale, entirely believable for the stovetop of a recently used oven. Similarly, a cooling rate constant of $0.11\ \mathrm{min}^{-1}$ implies that an object heated to 350 °F would become touchable after fifteen minutes and completely cool within an hour. This might be slightly fast for a cake, but checks out for roasted nuts on a thin metal pan. We can therefore have some confidence that the fitting procedure has not gone completely awry.</p> <p>We can also check the goodness of fit visually and confirm that our model is reasonable:</p> <p><img src="/assets/sloppy-models-1/residual_plot.png" alt="MLE fit" /></p> <p>While there is some non-normality in the residuals (presumably driven by the outlying penultimate data point), this is hardly the worst model for eight observations casually collected in the kitchen with a meat thermometer and a wristwatch.</p> <p>(As an aside, this should also help you to believe that this data is indeed real, because if I had faked it, it would have looked nicer!)</p> <p>Having found reasonable point estimates for our parameters, though, we might now ask: how much uncertainty is attached to these estimates?</p> <p>In order to proceed we must probabilize the model, i.e. render it in a form where it can assign probabilities to observations, given some choice of parameters.
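Before moving on, the fitting step just described can be sketched in a few lines. The post's raw measurements aren't reproduced here, so the script generates synthetic data from plausible values ($T_{env} = 302$ K, $k = 0.11\ \mathrm{min}^{-1}$, noise scale 1.3 K) and checks that Nelder-Mead recovers them:

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic stand-in for the pecan measurements (the real data isn't
# reproduced here). T0 would come from the first observation; for
# simplicity we fix it at the value used to generate the data.
t_obs = np.array([0.0, 2.5, 5.0, 7.5, 10.0, 12.5, 15.0, 17.5])  # minutes
T0 = 400.0                                                      # K
true_Tenv, true_k = 302.0, 0.11
rng = np.random.default_rng(0)
T_obs = true_Tenv + (T0 - true_Tenv) * np.exp(-true_k * t_obs)
T_obs += rng.normal(0.0, 1.3, size=t_obs.shape)

def mse(params):
    """Mean squared error between the cooling model and observations."""
    k, Tenv = params
    pred = Tenv + (T0 - Tenv) * np.exp(-k * t_obs)
    return np.mean((T_obs - pred) ** 2)

fit = minimize(mse, x0=[0.05, 290.0], method="Nelder-Mead")
k_hat, Tenv_hat = fit.x
print(k_hat, Tenv_hat)
```

With the real measurements in place of the synthetic ones, the same few lines reproduce the fit described above.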
A natural choice, given that we are already fitting according to MSE loss, is to assume that our observed data is independently and identically normally distributed about the true temperature. This assumption is not merely physically plausible (if the errors are the result of many small additive sources) but is actually suggested by the loss function itself. The loss</p> $L(\theta) = \frac{1}{N} \sum_i (T_i - \hat T(\theta, t_i))^2$ <p>can be reconstrued as:</p> $\ell(\theta; \sigma) = -\log Z(\sigma) - \sum_i \frac{(T_i - \hat T(\theta, t_i))^2}{2\sigma^2},$ <p>i.e. the negative MSE is (up to a positive scale factor and the data-independent normalizing term $\log Z(\sigma) = N \log(\sigma\sqrt{2\pi})$) just the log-likelihood of independent normally-distributed data, where the $i$th observation has mean $\hat T(\theta, t_i)$, and all observations share a variance $\sigma^2$. What value of $\sigma^2$ should we choose? One natural choice is to simply let the data speak for itself: holding $\theta$ at its MLE value $\hat\theta$, we can find the value of $\sigma$ that maximizes the likelihood of the data by numerically solving:</p> $\sigma_{MLE} = \textrm{argmax}_\sigma\, \ell(\hat\theta; \sigma).$ <p><img src="/assets/sloppy-models-1/sigma_plot.png" alt="Log-likelihood as a function of $\sigma^2$" /></p> <p>We find that the optimal noise scale is given by $\sigma \approx 1.3$ K, roughly 2.3 $^\circ$F. (In fact this optimum has a closed form: for fixed $\hat\theta$, the maximizing $\sigma^2$ is just the mean squared residual $L(\hat\theta)$.) The fact that the maximum likelihood estimate puts the typical error at about 1 K (and not, say, 0.01 K or 100 K) is another helpful sanity check: it agrees with our intuition about the precision we can expect from a meat thermometer. In any case, 1.3 is perhaps not so different from unity as to be worth worrying about.</p> <p>With the noise scale estimated, we can sample from the posterior joint distribution of $T_{env}$ and $k$ using MCMC.
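The post doesn't show its sampler code, but a minimal random-walk Metropolis sketch conveys the idea. Everything below is an illustration under stated assumptions: a flat prior on $(T_{env}, k)$, the noise scale fixed at its estimated value, and synthetic data standing in for the real measurements:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the pecan data: plausible (made-up) values.
t_obs = np.arange(0.0, 20.0, 2.5)                  # minutes, 8 points
T0, true_Tenv, true_k, sigma = 400.0, 302.0, 0.11, 1.3
T_obs = true_Tenv + (T0 - true_Tenv) * np.exp(-true_k * t_obs)
T_obs += rng.normal(0.0, sigma, size=t_obs.shape)

def log_likelihood(Tenv, k):
    """Gaussian log-likelihood with the noise scale sigma held fixed."""
    pred = Tenv + (T0 - Tenv) * np.exp(-k * t_obs)
    return -0.5 * np.sum((T_obs - pred) ** 2) / sigma**2

# Random-walk Metropolis over (T_env, k) with a flat (improper) prior.
samples = []
theta = np.array([310.0, 0.2])                     # crude initial guess
ll = log_likelihood(*theta)
for _ in range(20000):
    prop = theta + rng.normal(0.0, [2.0, 0.02])    # per-parameter step sizes
    ll_prop = log_likelihood(*prop)
    if np.log(rng.uniform()) < ll_prop - ll:       # Metropolis accept/reject
        theta, ll = prop, ll_prop
    samples.append(theta.copy())
samples = np.array(samples[5000:])                 # discard burn-in

print(samples.mean(axis=0))                        # posterior means of (T_env, k)
```

The real analysis presumably sampled against the actual data (and perhaps over $\sigma$ as well); this sketch only fixes the shape of the computation.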
The resulting marginals look like this:</p> <p><img src="/assets/sloppy-models-1/marginal_plot.png" alt="Marginals" /></p> <p>There’s a fair bit of residual uncertainty here: the standard deviation of $T_{env}$ is about 5 K, and $k$ has wiggle room amounting to a factor of two in either direction.</p> <p>When we examine the likelihood surface, however, a different picture emerges. In the figure below, log-likelihood is plotted as a function of $T_{env}$ and $k$. The yellow ribbon corresponds to the region of highest likelihood; the indigo basin, the lowest. The MLE parameters are indicated with a red dot, and the right-most plot simply shows a zoomed-in view of the left (note that the aspect ratios are not necessarily identical).</p> <p><img src="/assets/sloppy-models-1/error_surface.png" alt="Error Surface" /></p> <p>This plot exhibits several features that we will come to recognize as generic characteristics of physical models. For now, we can simply pick them out and list them:</p> <ul> <li>The region of high likelihood is, to a first approximation, a 1-D manifold embedded within a 2-D parameter space.</li> <li>To a second approximation, the manifold forms a kind of <em>ribbon</em>, with one length scale many times longer than the other.</li> <li>The likelihood changes sharply in response to movement along the thin (or <em>stiff</em>) dimension of the manifold. Conversely, the model is practically indifferent to movement in the long (or <em>sloppy</em>) dimension.</li> <li>The manifold is not well-described in terms of the bare parameters of the model. Instead, taking the MLE estimate as the origin of a new coordinate frame, the stiff dimension is given by one linear combination of parameters, and the sloppy dimension by another. In words, the model is largely indifferent between cooling slowly to a lower ambient temperature and cooling quickly to a higher one.
But it <em>can</em> tell the difference between cooling quickly to a lower temperature and slowly to a higher one.</li> <li>Because the manifold is curved (as viewed from the bare parameter space), the natural coordinate frame (i.e. the directions of maximal stiffness and sloppiness) changes as a function of position along the manifold.</li> </ul> <h2 id="conclusions-for-now">Conclusions (for now)</h2> <p>It was a pleasant surprise to find that, even in a simple two-parameter model fit with science-fair-grade data, some of the essential phenomena of model sloppiness already appear. In future posts I’d like to explore the math behind model sloppiness in more general terms and perhaps revisit our pecan cooling model through a series of increasingly powerful lenses.</p> <h2 id="references">References</h2> <p>Most of the literature on model sloppiness, as well as some introductory tutorials, can be found through <a href="http://www.lassp.cornell.edu/sethna/Sloppy/WhatAreSloppyModels.html">Jim Sethna’s website</a>.</p>

VC Incentive Alignment (2020-11-17, /vc-incentive-alignment)

<p>Shared ownership is supposed to align incentives.
If two founders each own 50% of a given company, then they will tend to want the same things for that company. But consider a founder who owns 99% of one company and an investor who owns 1% of 99 companies: do these actors always have the same incentives? Not necessarily. To see why, consider the following model.</p> <p>Suppose that the ultimate value of a company can be modeled by a log-normal distribution $X \sim \mathcal{LN}(\mu, \sigma^2)$. Assuming log utility, a single founder who owns the entire company receives the expected utility,</p> $U_F = E[\log(X)] = \mu.$ <p>That is to say, the utility depends only on the location parameter $\mu$ and not on $\sigma^2$, as the fluctuations in returns are precisely canceled by the log utility function: the founder in this scenario is perfectly indifferent to the variance.</p> <p>Now consider the case of an investor who owns a $1\%$ stake in $N = 100$ companies, each of whose valuations is independently and identically distributed as $\mathcal{LN}(\mu, \sigma^2)$. The investor’s portfolio $Y = \frac{1}{N}\sum_{i=1}^{N} X_i$ is the arithmetic mean of $N$ log-normal random variates. The portfolio has the first two moments:</p> \begin{align*} E[Y] =&amp; \exp{\left(\mu + \frac{\sigma^2}{2}\right)}\\\\ V[Y] =&amp; \frac{(\exp{(\sigma^2)} - 1)\exp{\left(2\mu + \sigma^2\right)}}{N} \end{align*} <p>and gives the investor expected utility:</p> $U_I = E[\log(Y)].$ <p>To estimate the quantity $U_I$, we can appeal to a useful approximation for functions of random variables. For sufficiently well-behaved functions $f$ and random variables $X$ we may Taylor-expand inside the expectation and write:</p> \begin{align*} E[f(X)] =&amp; E\left[f(E[X]) + f'(E[X]) (X - E[X]) + f''(E[X]) \frac{(X - E[X])^2}{2} + \mathcal{O}\left((X - E[X])^3\right)\right]\\ \approx&amp; f(E[X]) + \frac{f''(E[X])}{2}V[X]. \end{align*}
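Before applying the expansion symbolically, we can sanity-check it numerically. The simulation below is not part of the original post; the values $\mu = 0$, $\sigma = 1$, $N = 100$ are illustrative choices. It compares a Monte Carlo estimate of $E[\log Y]$ against the second-order approximation built from the exact moments above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (made-up) parameters: each company is LN(mu, sigma^2).
mu, sigma, N = 0.0, 1.0, 100

# Simulate many portfolios, each the mean of N iid log-normal companies.
trials = 50_000
X = rng.lognormal(mean=mu, sigma=sigma, size=(trials, N))
Y = X.mean(axis=1)

mc = np.log(Y).mean()  # Monte Carlo estimate of E[log Y]

# Exact moments of Y, as derived above.
EY = np.exp(mu + sigma**2 / 2)
VY = (np.exp(sigma**2) - 1) * np.exp(2 * mu + sigma**2) / N

# Second-order Taylor (delta-method) approximation of E[log Y].
approx = np.log(EY) - VY / (2 * EY**2)
print(mc, approx)  # the two values should agree to within Monte Carlo error
```

The residual gap comes from Monte Carlo error plus the dropped third-order term, both of which shrink as the trial count and $N$ grow.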
<p>Applying this idea to $\log(Y)$, we obtain:</p> <p>$$E[\log(Y)] \approx \log(E[Y]) - \frac{V[Y]}{2 E[Y]^2}.$$</p> <p>Plugging in the relevant quantities and simplifying, we arrive at:</p> \begin{align*} E[\log(Y)] \approx&amp; \mu + \frac{\sigma^2}{2} - \frac{\exp{(\sigma^2)} - 1}{2 N}\\ \approx&amp; \mu + \frac{\sigma^2}{2}, \end{align*} <p>the last line holding asymptotically when $N$ is large. Notice that, whereas the sole founder’s utility was simply $\mu$, the investor’s utility contains an additional term $\frac{\sigma^2}{2}$ that scales with the variance parameter.</p> <p>What does this portend for incentive alignment between founders and investors? Generally speaking, both parties benefit from making $\mu$ go up. However, investors can benefit from increasing the quantity $\mu + \frac{\sigma^2}{2}$ as well. This is true even if $\mu$ itself goes down, so long as the decrease in $\mu$ is more than offset by an increase in $\sigma^2$, i.e. $\Delta \sigma^2 &gt; -2\Delta \mu$.</p> <p>This argument is summarized graphically in the diagram below:</p> <p><img src="/assets/vc-incentive-alignment/axis4.png" alt="Diagram" /></p> <p>Considering the company as a point in a parameter space, the founder benefits most from pushing the state of the business in the direction of the blue arrow. An infinitely diversified investor, however, benefits most from pushing things in the direction of the red arrow. The shaded hemispheres show the directions that would be acceptable to either party. The purple sector denotes the area of mutual benefit. The opposing red and blue sectors indicate the axis of antagonism: directions that would be preferred by one party but not the other.</p> <p>Are founders and investors explicitly thinking of companies as log-normal random variables? Not necessarily. This is, of course, a highly stylized model that caricatures one particular dynamic. But every day, founders get up and tweak parameters that make successes of various sizes more or less likely.
It’s not such a leap to think of startups as random variables whose parameters you can modify with effort.</p> <p>A friend of mine with a lot of experience in the startup world noticed that investors would come in and make suggestions that didn’t seem to reflect the best interests of the company. These suggestions were mostly of the swing-for-the-fences variety, encouraging the founders to take the company in directions that had a small chance of spectacular success, even if the modal outcome was failure. To the founders, who presumably didn’t relish the thought of betting years of hard work on a lottery ticket, this advice seemed very bad. Did the investors simply not know what they were talking about?</p> <p>More likely, the investors did know what they were talking about; they were just giving the advice that suited their own interests, which might not necessarily be yours.</p>

Guest Lecture at MIT (2020-11-01, /mit-talk)

<p>Next week I’ll be giving a guest lecture at MIT for 15.S08, Advanced Data Analytics and Machine Learning in Finance, sketching an overview of modern speech-to-text systems and offering some lessons learned from building Scribe.</p>

Hello, world! (2018-02-26, /hello-world)

<p>Blog coming soon.</p>