Beta distribution failing for missing value imputation?

In Statistical Rethinking, 2nd ed, there is problem 15M2:

"Reconsider the primate milk missing data example from the chapter. This time, assign B a distribution that is properly bounded between zero and 1. A beta distribution, for example, is a good choice."

I have been able to replicate the code from the original problem (where B used a normal distribution). However, it fails when I use a beta distribution in pymc, even though a colleague (@MarcoGorelli) was able to do this in numpyro.

df_primates["neocortex_prop"] = df_primates["neocortex.perc"] / 100
df_primates["k_std"] = standardize(df_primates["kcal.per.g"])
df_primates["logmass"] = np.log(df_primates["mass"])
df_primates["logmass_std"] = standardize(df_primates["logmass"])
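(For reference, standardize here is the usual z-score helper from the Rethinking code; something like the following sketch, assuming plain mean/std scaling:)

```python
import numpy as np

def standardize(x):
    # z-score: center on the mean, scale by the standard deviation
    return (x - np.mean(x)) / np.std(x)
```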


with pm.Model() as m15_5b:

    # priors
    a = pm.Normal("a", 0, 0.5)
    bB = pm.Normal("bB", 0, 1)
    bM = pm.Normal("bM", 0, 0.5)
    sigma = pm.Exponential("sigma", 1)

    # obs/imputed B with beta distribution (avoid standardizing)
    Bi = pm.Beta("Bi", alpha=2, beta=2, observed=df_primates["neocortex_prop"])

    # linear model
    mu = a + bB * (Bi - 0.67) + bM * df_primates["logmass_std"]

    # likelihood
    K = pm.Normal("K", mu, sigma, observed=df_primates["k_std"])

    # sample
    trace_m15_5b = pm.sample(draws=1000, random_seed=19, return_inferencedata=True)

Here is the error message I get:

---------------------------------------------------------------------------
SamplingError                             Traceback (most recent call last)
<ipython-input-27-6bcd320dd5ad> in <module>
     23 
     24     # sample
---> 25     trace_m15_5b = pm.sample(draws=1000, random_seed=19, return_inferencedata=True)

~/opt/anaconda3/envs/stats_rethinking/lib/python3.8/site-packages/pymc3/sampling.py in sample(draws, step, init, n_init, start, trace, chain_idx, chains, cores, tune, progressbar, model, random_seed, discard_tuned_samples, compute_convergence_checks, callback, jitter_max_retries, return_inferencedata, idata_kwargs, mp_ctx, pickle_backend, **kwargs)
    504             if start is None:
    505                 start = start_
--> 506                 check_start_vals(start, model)
    507         except (AttributeError, NotImplementedError, tg.NullTypeGradError):
    508             # gradient computation failed

~/opt/anaconda3/envs/stats_rethinking/lib/python3.8/site-packages/pymc3/util.py in check_start_vals(start, model)
    233 
    234         if not np.all(np.isfinite(initial_eval)):
--> 235             raise SamplingError(
    236                 "Initial evaluation of model at starting point failed!\n"
    237                 "Starting values:\n{}\n\n"

SamplingError: Initial evaluation of model at starting point failed!
Starting values:
{'a': array(-0.80057862), 'bB': array(0.31003719), 'bM': array(0.06270603), 'sigma_log__': array(-1.31484483), 'Bi_missing': array([ 1.42181326,  1.29176305,  0.89386098,  1.39275087,  0.14458384,
       -0.39778831,  0.79556126, -0.09505109,  0.84795039,  0.27948074,
        0.02310535,  0.13706759])}

Initial evaluation results:
a               -1.51
bB              -0.97
bM              -0.23
sigma_log__     -1.58
Bi_missing       0.00
Bi               -inf
K             -342.50
Name: Log-probability of test_point, dtype: float64

Any ideas on what I might be doing wrong?

Just a hunch: does neocortex_prop contain values of exactly zero or 1? Beta(2, 2) assigns zero density to those values, so that would be an issue; Beta(1, 1) (uniform) would work. As I said though, it's just a hunch, let me know!
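(To illustrate the hunch, a pure-NumPy sketch of the Beta(2, 2) log-density, which diverges to -inf at exactly 0 or 1:)

```python
import numpy as np

def beta22_logpdf(x):
    # Beta(2, 2) has density 6 * x * (1 - x) on [0, 1]; the log-density
    # is -inf at the endpoints, so a data point of exactly 0 or 1 would
    # make the initial model evaluation fail
    with np.errstate(divide="ignore"):
        return np.log(6.0 * x * (1.0 - x))

print(beta22_logpdf(np.array([0.0, 0.5, 1.0])))  # [-inf, log(1.5), -inf]
```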

Not a direct answer to your question, but have you checked out the Rethinking (2nd ed) notebooks? If not, they may be a useful resource for this and other examples from the book.

Thank you for the suggestion @Martin_Ingram. Values of 0 or 1 are not present and a Beta(1,1) gives the same error.


Thank you @cluhmann. I was able to replicate the original problem in Chapter 15 with the repo’s help. But the end of chapter problems in the repo only go up to Chapter 9 (although your suggestion reminds me that I can contribute what I have for Chapters 10-15).


Ah, makes sense. And yes, I am sure the contribution would be appreciated!


Bi_missing has two negative initial values, as you can see in the error message. The problem is that in the current PyMC version imputed variables are not transformed the way regular free variables are, which makes imputation unstable for constrained variables.

This is fixed in the upcoming version of PyMC. In the meantime, I would suggest splitting the observed and missing values into two separate Beta variables.


Awesome, thank you!

I was able to get this working, but it’s a bit hacky. I concatenated the missing and observed values from the two beta distributions, but that changed the row order. To compensate, I reordered the predictor and outcome variables to match the Bi order. Is there a cleaner way to do this? Specifically, is there a way to concatenate Bi based on the indexes of the missing (idx_miss) and observed (idx_obs) values?
@roesta07, @MarcoGorelli

with pm.Model() as m15_5b:

    # priors
    a = pm.Normal("a", 0, 0.5)
    bB = pm.Normal("bB", 0, 1)
    bM = pm.Normal("bM", 0, 0.5)
    sigma = pm.Exponential("sigma", 1)

    # parameterization for beta, split into observed/missing
    idx_miss = df_primates.loc[df_primates["neocortex_prop"].isna(), "neocortex_prop"].index
    idx_obs = df_primates.loc[df_primates["neocortex_prop"].notna(), "neocortex_prop"].index
    Bi_miss = pm.Beta("Bi_miss", alpha=2, beta=2, shape=len(idx_miss))
    Bi_obs = pm.Beta("Bi_obs", alpha=1, beta=1, observed=df_primates.loc[idx_obs, "neocortex_prop"])
    # Is there a way to concatenate based on index?
    Bi = pm.math.concatenate([Bi_miss, Bi_obs], axis=0)

    # linear model, changed the order of logmass_std to match Bi index
    mu = a + bB * (Bi - 0.67) + bM * df_primates.loc[idx_miss.tolist() + idx_obs.tolist(), "logmass_std"]
    
    # likelihood, changed the order of k_std to match Bi index
    K = pm.Normal("K", mu, sigma, observed=df_primates.loc[idx_miss.tolist() + idx_obs.tolist(), "k_std"])

    # sample
    trace_m15_5b = pm.sample(draws=1000, random_seed=19, return_inferencedata=True)
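One way to avoid reordering the predictor and outcome columns is to compute, once, the permutation that maps the concatenated [missing, observed] vector back to the original row order, and then index Bi with it inside the model. A NumPy sketch of the idea (toy indices for illustration, not the actual primate data):

```python
import numpy as np

# toy example: rows 2 and 5 are missing, the rest are observed
idx_miss = np.array([2, 5])
idx_obs = np.array([0, 1, 3, 4])

# argsort of the concatenated indices gives the permutation that restores
# the original DataFrame row order, so inside the model you could use
# Bi[order] and leave logmass_std / k_std in their original order
order = np.argsort(np.concatenate([idx_miss, idx_obs]))

Bi = np.concatenate([np.array([-1.0, -2.0]),               # stand-in for Bi_miss
                     np.array([10.0, 11.0, 13.0, 14.0])])  # stand-in for Bi_obs
print(Bi[order])  # [10. 11. -1. 13. 14. -2.]
```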