Design for bound, truncated and censored distributions #1864

Closed
@aseyboldt

There have been a couple of issues about these, and I thought it might help to collect them and try to find a good design for truncated and censored distributions.

#1843 #1847 #1833 #1861 #1860

This got a bit longer and less concrete than I wanted, but here it is:

Just to get everyone on the same page about the status quo:

  • "Bound" distributions: These are implemented as pm.Bound, which maps a distribution to a different distribution that can't take values outside some interval, but has the same logp inside this interval. Strictly speaking they aren't probability distributions anymore (the density integrates to less than 1).
  • Truncated distributions: Same as "Bound", but with a correction to the logp so that the density integrates to 1 again. We do not have a built-in way of defining those (it's possible with a potential or a custom distribution).
  • Censored distributions: Values in a subset of the support are reported, values outside are only counted. We might want to support right/left/interval censoring. There is no current direct support for it.

cdf / ccdf

We need lcdf and/or lccdf functions (the log of the cdf and the log of the complementary cdf) for our distributions, for both truncation and censoring. #1861 is a start for that. I would propose adding stubs to distribution.Distribution that raise a NotImplementedError, and then overriding those in the actual distributions. We might also want to add something like linterval and maybe lcinterval, because sometimes there are more stable ways to compute the log of the difference of two cdfs.
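To make the stub idea a bit more concrete, here is a rough sketch in plain Python. It is not PyMC3 code: the `Distribution` and `Exponential` classes are toy stand-ins I made up for illustration, and `lcdf`, `lccdf` and `linterval` are just the names proposed above. The base class raises `NotImplementedError` and offers a generic `linterval` fallback; the subclass overrides it with a closed form that avoids subtracting two cdf values that are both close to 1.

```python
import math


class Distribution:
    """Toy base class (hypothetical, not the PyMC3 one)."""

    def lcdf(self, x):
        raise NotImplementedError(f"{type(self).__name__} has no lcdf")

    def lccdf(self, x):
        raise NotImplementedError(f"{type(self).__name__} has no lccdf")

    def linterval(self, lower, upper):
        # Generic fallback: log(cdf(upper) - cdf(lower)) from the two lcdfs.
        a, b = self.lcdf(lower), self.lcdf(upper)
        return b + math.log1p(-math.exp(a - b))


class Exponential(Distribution):
    """Exponential distribution with rate lam (toy example)."""

    def __init__(self, lam):
        self.lam = lam

    def lcdf(self, x):
        # log(1 - exp(-lam * x)), using expm1 for accuracy near x = 0
        return math.log(-math.expm1(-self.lam * x))

    def lccdf(self, x):
        # log(exp(-lam * x))
        return -self.lam * x

    def linterval(self, lower, upper):
        # exp(-lam*lower) - exp(-lam*upper)
        #   = exp(-lam*lower) * (1 - exp(-lam*(upper - lower)))
        width = upper - lower
        return -self.lam * lower + math.log(-math.expm1(-self.lam * width))


d = Exponential(lam=1.0)
print(d.linterval(50.0, 51.0))  # stays finite far out in the tail
```

For an interval like (50, 51) both cdf values round to 1.0 in double precision, so the generic fallback has nothing left to work with, while the specialised version is still accurate. That is the kind of case where a dedicated linterval would pay off.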

Bound distributions

I'm actually not sure that we should keep those. It seems to me that they could be useful for defining new distributions, but in those cases an interval transform seems more straightforward. There are cases where we can use a bound distribution instead of a truncated one and get the same result (if the difference of the lpdfs does not depend on any of the parameters), and we might save a bit of time in those cases. But I'm not sure this is worth the additional confusion (see #1833 for an example where it went wrong; I did something like that in the past myself). To prevent the loss of performance, we could add a parameter propto to Distribution.logp and, if it is set to True, drop terms that do not depend on the parameters to get the same effect. (Stan does this, and it might save a bit of time in a lot of other circumstances, too.)
Are there other use cases that I'm not aware of? If not, I think deprecating them in favour of truncated distributions and removing them in a later release would be fine. Alternatively, we could deprecate only the logp function of Bound; I think that would prevent most misuse. A third option would be to change their behaviour to that of the truncated distribution.
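A minimal sketch of what a propto flag could look like, again in plain Python rather than against the real Distribution.logp. The function name `normal_logp` and its signature are assumptions for illustration only. With `propto=True` it drops the additive constant that depends on neither mu nor sigma; which terms can really be dropped in a model depends on which inputs are free variables, and this sketch ignores that.

```python
import math


def normal_logp(x, mu, sigma, propto=False):
    """Log density of Normal(mu, sigma) at x (hypothetical helper).

    With propto=True the constant -0.5 * log(2 * pi) is dropped, so the
    result is only correct up to an additive constant.
    """
    z = (x - mu) / sigma
    logp = -0.5 * z * z - math.log(sigma)
    if not propto:
        logp -= 0.5 * math.log(2.0 * math.pi)
    return logp


# The two versions differ by the same constant for any parameter values,
# so a sampler that only needs logp up to a constant sees no difference.
print(normal_logp(0.3, 0.0, 1.0) - normal_logp(0.3, 0.0, 1.0, propto=True))
print(normal_logp(0.3, 2.0, 0.5) - normal_logp(0.3, 2.0, 0.5, propto=True))
```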

Truncated distributions

Implementing those seems rather straightforward once we have the lcdfs: do the same as in Bound, but correct the logp function with the difference of the cdfs at the boundaries, and fix the random function. We could follow the same design as Bound, but maybe we could use the opportunity to make it a bit more ergonomic. Bound seems a bit confusing to new users right now. Maybe it would be better to predefine distributions like TruncatedNormal, or put them in a submodule as truncated.Normal. They could then have additional parameters lower_trunc and upper_trunc. This would also make it much easier to test. I doubt all conceivably useful combinations involving Bound are working at the moment (e.g. #1847).
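The logp correction can be sketched for a normal distribution as follows. This is stdlib-only illustration code, not a proposed implementation: the helper names are made up, only the parameter names lower_trunc and upper_trunc come from the text above, and it works with the plain cdf instead of lcdfs, so it is not suitable for truncation far out in the tails.

```python
import math


def normal_logp(x, mu, sigma):
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def normal_cdf(x, mu, sigma):
    return 0.5 * math.erfc(-(x - mu) / (sigma * math.sqrt(2.0)))


def truncated_normal_logp(x, mu, sigma, lower_trunc, upper_trunc):
    """Normal logp renormalised to [lower_trunc, upper_trunc]."""
    if not lower_trunc <= x <= upper_trunc:
        return -math.inf
    # Probability mass the untruncated normal puts on the interval.
    mass = normal_cdf(upper_trunc, mu, sigma) - normal_cdf(lower_trunc, mu, sigma)
    return normal_logp(x, mu, sigma) - math.log(mass)


def integrate(logp, lower, upper, n=20000):
    """Midpoint rule for the integral of exp(logp) over [lower, upper]."""
    h = (upper - lower) / n
    return h * sum(math.exp(logp(lower + (i + 0.5) * h)) for i in range(n))


# "Bound" behaviour: same logp inside the interval, mass below 1.
print(integrate(lambda x: normal_logp(x, 0.0, 1.0), -1.0, 2.0))
# Truncated behaviour: mass is 1 up to integration error.
print(integrate(lambda x: truncated_normal_logp(x, 0.0, 1.0, -1.0, 2.0), -1.0, 2.0))
```

The correction term depends on mu and sigma, which is why Bound and a truncated distribution give different posteriors as soon as those parameters are themselves random.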

Censored distributions

They are tricky. Supporting a couple of simple use cases seems easy enough: say we analyse a survival experiment involving mice, where the experiment ends after 6 months. We could model that with something like this:

Survival = pm.Censored(pm.Weibull, upper=6)
Survival('months', ..., observed=data)

and code censored events by setting them to 6 exactly (which is a null set in the original distribution), or add an additional parameter is_censored.
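For the simple fixed-cutoff case, the likelihood such a Censored wrapper would have to produce can be written out by hand. The sketch below is only an illustration under assumptions: the function names are hypothetical, the Weibull is parametrised by shape alpha and scale beta, and a censored observation is coded as a value equal to the cutoff, as suggested above. Observed deaths contribute the log density; censored mice contribute the log of the probability of surviving past the cutoff (the lccdf).

```python
import math


def weibull_logp(t, alpha, beta):
    """Log density of Weibull(shape=alpha, scale=beta) at t > 0."""
    return (math.log(alpha / beta)
            + (alpha - 1.0) * math.log(t / beta)
            - (t / beta) ** alpha)


def weibull_lccdf(t, alpha, beta):
    """Log survival function: log P(T > t)."""
    return -((t / beta) ** alpha)


def right_censored_loglike(times, alpha, beta, upper):
    """Log-likelihood where values >= upper are treated as censored."""
    total = 0.0
    for t in times:
        if t >= upper:
            total += weibull_lccdf(upper, alpha, beta)  # only know T > upper
        else:
            total += weibull_logp(t, alpha, beta)       # exact event time
    return total


# Made-up data: two mice were still alive when the experiment ended.
months = [1.5, 3.2, 6.0, 4.8, 6.0]
print(right_censored_loglike(months, alpha=1.5, beta=5.0, upper=6.0))
```

An explicit is_censored flag would replace the `t >= upper` test and would also cover per-observation cutoffs.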

Complications:

  • We need to be careful to not break posterior sampling with something like that.
  • Censoring can be much more complicated if we allow the censoring events themselves to be variables. Say we want to model a cancer treatment, where no patients drop out, but from time to time a patient is hit by a stray meteorite – not related to the treatment at all (and sorry for the example, that's what happens if you spend time with biologists and astronomers). Then the time to a meteorite hit – a censoring event – follows an exponential distribution. But since some patients die before we observe that, we end up with a censored distribution for the meteor hits and the normal deaths. And in real experiments the censoring events might not even be independent from the actual event of interest.
  • There are different forms of censoring: right / left / interval and combinations thereof.
  • Censored distributions are not absolutely continuous. Maybe it helps to look at them as a mixture of a truncated distribution and a discrete distribution.
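To spell out the mixture view from the last point for the simplest case: assume right-censoring at a fixed cutoff $c$, an event time $T$ with density $f$ and cdf $F$, and the recorded value $Y = \min(T, c)$ (this notation is mine, not from any existing code). Then the law of $Y$ is a truncated part plus a point mass at $c$:

```latex
\begin{aligned}
P_Y &= F(c)\, P_{\text{trunc}} + \bigl(1 - F(c)\bigr)\, \delta_c,
\qquad
p_{\text{trunc}}(y) = \frac{f(y)}{F(c)} \quad \text{for } y < c, \\[4pt]
\log L(y) &=
\begin{cases}
\log f(y) & \text{if } y < c, \\
\log\bigl(1 - F(c)\bigr) & \text{if } y = c.
\end{cases}
\end{aligned}
```

The factor $F(c)$ from the mixture weight cancels the normalisation of the truncated density, so the continuous part contributes just $\log f(y)$, and the discrete part contributes the lccdf at the cutoff.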

tldr

  • Deprecate Bound
  • Add submodule distributions.truncated containing truncated (and tested) versions of most distributions, each with two additional parameters lower_trunc and upper_trunc.
  • Implement lcdf, lccdf, linterval, lcinterval for most distributions
  • Add propto argument to logp of distribution and drop summands that do not depend on the parameters.
  • Think about a good design for censored
