Likelihood Models

Jed Rembold

April 10, 2025

Announcements

  • HW5 due on Monday!
    • You have everything you need after today!
  • Quiz 2 later today
  • A poll will be going out about partner preferences for the final group project
    • Please respond by end of Monday, as I want to get groups made by Tuesday
  • Check-in form this weekend for HW5

Recap

  • A Markov Chain Monte Carlo (MCMC) sampler performs a semi-random walk wherein the next step is determined by the current location and a bit of randomness
    • Semi-random in the sense that it has a preference to move “uphill”, toward higher parts of the function it is sampling
  • By letting the sampling run for a while, and keeping track of where the walker has been, you can regenerate the walked function by looking at a histogram of where the walker spent the most time
    • Really just the shape; the scaling will be different
  • We can sample the probability of parameters given particular data (\(P(\theta | D)\)) using Bayes’ theorem
    • Requires being able to write out our priors and our likelihood
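  • For reference, Bayes’ theorem gives \[ P(\theta | D) = \frac{P(D | \theta)\, P(\theta)}{P(D)} \] where \(P(D|\theta)\) is the likelihood, \(P(\theta)\) the prior, and \(P(D)\) the evidence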

Today

  • How does this all apply to fitting models?
    • Writing out our priors and likelihood
    • Walking the probability
    • Interpreting the results
  • MCMC libraries?

Baye Life

[Diagram: the components of Bayes’ theorem: Prior, Likelihood, Evidence, Posterior]

Defining the Component Functions

The Prior

  • The prior dictates the probability of a parameter having a particular value, regardless of the data

  • If using an unbounded, flat prior, then it should just return 1 always (0 in ln-space)

  • If bounded, check the parameters and return 1 (0 in ln-space) if within the bounds or \(-\infty\) otherwise (-np.inf in Python, -Inf in R)

  • Pseudo-example:

    function ln_prior(params)
      if illegal condition
        return -infinity
      return 0

    # or, equivalently, phrased in terms of the legal region:
    function ln_prior(params)
      if legal condition
        return 0
      return -infinity
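
  • As a concrete sketch in Python (assuming the two-parameter line model used later; the bounds here are purely illustrative):

    import numpy as np

    def ln_prior(params):
        m, b = params
        # Flat prior inside illustrative bounds, zero probability outside
        if -10 < m < 10 and -100 < b < 100:
            return 0
        return -np.inf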

The Likelihood

  • The likelihood essentially compares the data our model would output to our actual data
  • The goal is to minimize the differences between the two
    • Additionally, things are normally scaled by known errors, so that values with more error have less weight
  • If our individual data points were arranged around the model with some uncertainty: \[ P(y_i | \theta,\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - f(x_i, \theta))^2}{2\sigma^2}\right)\]
  • The full likelihood is the product over all these points, so the log-likelihood is just a sum: \[ \log \mathcal{L} = \sum^n_{i=1}\log P(y_i|\theta,\sigma) = \sum^n_{i=1}\left[ -\frac{(y_i - f(x_i,\theta))^2}{2\sigma^2}-\frac{1}{2}\log(2\pi\sigma^2)\right]\]

The Likelihood (In Code)

  • The ln-likelihood then could look like:

    import numpy as np

    def ln_likelihood(params, data):
        m, b = params
        x, y, errY = data                 # extract data
        y_model = m * x + b               # compute model result
        residual = y - y_model            # compute the difference
        term1 = -0.5 * np.log(2 * np.pi * errY ** 2)
        term2 = -0.5 * residual ** 2 / errY ** 2
        return np.sum(term1 + term2)

All together now…

  • Bring both pieces together (since we don’t care about \(P(D)\)):

    def ln_pdf(params, data):
        p = ln_prior(params)
        if p == -np.inf:                  # no sense continuing
            return -np.inf
        return p + ln_likelihood(params, data)
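
  • With these pieces, a minimal Metropolis-style walker could look like the sketch below (the Gaussian proposal, step_size, and n_steps are illustrative choices):

    def walk(start, data, n_steps=10000, step_size=0.1):
        chain = np.empty((n_steps, len(start)))
        current = np.asarray(start, dtype=float)
        current_lp = ln_pdf(current, data)
        for i in range(n_steps):
            # Propose a nearby point using a bit of randomness
            proposal = current + step_size * np.random.randn(len(current))
            proposal_lp = ln_pdf(proposal, data)
            # Always accept uphill moves; accept downhill moves occasionally
            if np.log(np.random.rand()) < proposal_lp - current_lp:
                current, current_lp = proposal, proposal_lp
            chain[i] = current            # record where the walker is
        return chain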

Example Time

Extended Example

  • Suppose we want to evaluate the uncertainties in our parameters for a fit to the data shown at right.

  • Our model will look something like: \[y = ax^2 + bx\]

  • MCMC does not tell you about the quality of a fit. It tells you about the variability in the fit parameters.
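
  • As a sketch, only the model line changes from the straight-line likelihood earlier:

    def ln_likelihood(params, data):
        a, b = params
        x, y, errY = data
        y_model = a * x ** 2 + b * x      # quadratic model instead of a line
        residual = y - y_model
        term1 = -0.5 * np.log(2 * np.pi * errY ** 2)
        term2 = -0.5 * residual ** 2 / errY ** 2
        return np.sum(term1 + term2)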

Interpreting the Results

Burnin’

  • If you have not started your parameter chain near the maximum, it takes a bit of time for the chain to find its way to the most probable area and start bouncing around
  • We don’t care about this initial time period!
  • This period is commonly called burn-in, and we throw away a certain number of iterations before we start keeping track.
    • This generally requires you to look at your parameters vs. iterations to decide when things have reasonably leveled off.
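
  • With a hand-rolled chain like the earlier walker sketch, the inspect-and-discard step might look like (n_burn is illustrative, chosen by eye):

    import matplotlib.pyplot as plt

    plt.plot(chain[:, 0])       # parameter 0 vs. iteration
    plt.xlabel("iteration")
    plt.show()

    n_burn = 500                # chosen by eye from the plot above
    kept = chain[n_burn:]       # keep only the leveled-off portion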

Visualizing the Best Fit with Errors

  • There are a few ways you can visualize the best fit model with uncertainties in the parameters reflected
  • My go-to, sketched in code after this list, looks like:
    • Randomly select some number of indices from your leveled chain
    • For each index, grab the corresponding parameters and compute your model output with those parameters, appending the result to a list
    • Compute the median and standard deviation of the results in your list, being sure to specify axis=0
    • Plot the median values for your best approximation, and shade the region between the median - std and the median + std
  • Alternatively, you could just plot the model output for each of the randomly selected parameter combinations, though I don’t think it looks as nice
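
  • A sketch of the shaded-band approach (assuming kept is the post-burn-in chain and model(x, params) is a hypothetical helper that evaluates the model):

    import numpy as np
    import matplotlib.pyplot as plt

    idx = np.random.randint(len(kept), size=200)    # random chain indices
    outputs = np.array([model(x, kept[i]) for i in idx])

    med = np.median(outputs, axis=0)    # axis=0: across draws, per x value
    std = np.std(outputs, axis=0)

    plt.errorbar(x, y, yerr=errY, fmt=".", label="data")
    plt.plot(x, med, label="median model")
    plt.fill_between(x, med - std, med + std, alpha=0.3)
    plt.legend()
    plt.show()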

MCMC Libraries (Python)

Emcee

  • While it isn’t difficult to create our own MCMC sampler, there can be some benefits to using existing libraries for more complicated use cases
    • Tend to give more intuitive interfaces through which to accomplish the basic tasks that we would want
  • In the Python landscape, the main option used is the emcee package, which you will probably need to install through pip
  • Emcee operates as an abstract data type, wherein you create a sampling object and then can interact with it and run samples using defined methods
  • In the R landscape, there is the mcmc package, which also seems reasonably strong, though it doesn’t seem to have all of the flexibility of emcee

Emcee Creation

  • When you create the object initially, you need to provide:
    • The number of walkers you’d like to generate
    • The number of dimensions you’ll be walking (number of parameters)
    • The function you’ll be walking (your log posterior)
    • Any extra arguments that your function will need beyond the parameters
    • A pool should you be using multiprocessing
    sampler = emcee.EnsembleSampler(
                  num_walkers, num_dims,
                  log_function, args=[extra arguments]
                  )
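
  • A concrete sketch for the two-parameter line fit from earlier (32 walkers is an arbitrary illustrative choice):

    import emcee

    num_walkers, num_dims = 32, 2
    sampler = emcee.EnsembleSampler(
        num_walkers, num_dims,
        ln_pdf, args=[data],    # data is forwarded to ln_pdf(params, data)
    )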

Emcee Running

  • You can then start a sampling run by telling the sampler where all the walkers should begin and how many steps they should take

  • Generating starting points is usually done with some variation of a random Gaussian near a chosen central point:

    starts = np.random.multivariate_normal(
                mean=[0, 1, 10],
                cov=[[1, 0, 0], [0, 0.5, 0], [0, 0, 5]],
                size=num_walkers,  # one starting point per walker
                )
  • Then you can just run the sampler:

    sampler.run_mcmc(starts, num_iterations)

Emcee Retrieving Chains

  • You can get the iteration chains back from the sampler after a run using .get_chain()

    • This will return a 3D array, indexed over iteration, walker, and parameter

    • Can visualize a particular parameter over all walkers using

      plt.plot(sampler.get_chain()[:,:,0], 'k', alpha=0.3)
  • After examining, you will commonly want to discard the burn-in and flatten all the individual walkers:

    flat_samples = sampler.get_chain(discard=num_dis, 
                                     flat=True)
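
  • Parameter estimates and uncertainties can then be read off the flattened samples as percentiles, e.g. for the first parameter:

    lo, med, hi = np.percentile(flat_samples[:, 0], [16, 50, 84])
    print(f"param 0 = {med:.3f} +{hi - med:.3f} / -{med - lo:.3f}")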

Quiz Time!
