Monad of No Return: Issues with `(>>) = (*>)`

A handful of months ago I brought the Monad of No Return proposal to Discourse, and now, many months later, I bring what I hope will be my last question, regarding one issue with one part of the proposal.

TL;DR: I’m looking for examples of memory leaks, or of fixes for memory leaks, caused by the default definition of (*>).

To summarise the intent of the proposal: there are methods shared between the Applicative and Monad type classes that should always behave identically, provided the type they are defined on is lawful. These methods can therefore be defined “canonically”: return = pure, and (>>) = (*>). We wish to get rid of the redundancy here and make the canonical definitions the top-level ones.

There is an issue with (>>) = (*>) specifically, however. Currently, (>>) is defined as (>>) a b = a >>= \_ -> b (which I’ll call thenM), while (*>) a b = (id <$ a) <*> b (which I’ll call thenA). In making (>>) equal to its canonical definition (so, (>>) = (*>)), we’d be changing its default definition for virtually every Monad in existence, and there have been issues with the definition of (*>) in the past; one example can be found in issue 33 of the transformers package. The common problem seems to be that thenA leaks some amount of memory, but we’re unclear as to why.
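
Spelled out as standalone functions, the two defaults above are:

thenM :: Monad m => m a -> m b -> m b
thenM a b = a >>= \_ -> b      -- today's default (>>)

thenA :: Applicative f => f a -> f b -> f b
thenA a b = (id <$ a) <*> b    -- today's default (*>)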

My personal theory is this: only in the case of monad transformers (or their equivalents) will we see these performance regressions. If that’s true, then the potential industry-wide regressions are very small compared to the alternative worst case.

I’m therefore looking for more examples of badly behaved default (*>) definitions. If they’re transformer-like, it’s good to have more examples; if they’re not transformer-like, I’m especially interested.

If you want to ask more about the proposal(s), I’m also happy to answer those questions here.

16 Likes

I don’t think there’s any lack of clarity about why: x *> y and x >> y are for “tail calls”, i.e. after x, execution can be handed over entirely to y, without keeping anything around. The default definition of *> (thenA) is troublesome for this: (id <$ x) <*> y will, by default, needlessly keep the result of id <$ x (i.e. just id) around whilst it executes y. The default definition of >> (thenM) won’t.

What is unclear is the extent to which optimizations mitigate or resolve this issue, and how much trouble the issue would actually cause in the wild.

6 Likes

To me it’s unclear why the optimisations are unable to eliminate keeping the result around in most of the cases seen, which is why I want more examples. I have been doing some tinkering with transformers, as I mentioned in the proposal, and am still trying to understand what I’ve got out of it. When I can access my computer again, I’ll bring what I’ve got here for discussion before going back to the proposal.

1 Like

Maybe your ‘by default’ is hedging against this, but it’s worth noting that for some Applicatives, there’s nothing to keep around even ignoring optimizations. Consider Applicative ((->) r):

instance Applicative ((->) r) where
    pure = const
    (<*>) f g x = f x (g x)

With zero optimizations, here’s how thenA reduces:

thenA x y z
((id <$ x) <*> y) z
(id <$ x) z (y z)
(fmap (const id) x) z (y z)
(const id . x) z (y z)
const id (x z) (y z)
id (y z)
y z

So y ends up in tail position after all!

Re examples, I’ve just noticed that the RWS.CPS and Writer.CPS transformers, which were added after #33 was merged, need their (*>)s patched. If someone who is already handy with hub.darcs.net could submit that patch, I’d appreciate it; otherwise I’ll probably get around to setting up an account and installing darcs and such at… some point.

2 Likes

Yes, and I probably should have said “can, in general” rather than “will, by default”.

Indeed, and Identity is an even more stark example. The Applicatives that experience the issue are those that “hold a resource” “during evaluation”. (Strict) StateT and IO are examples of those. They “hold stack” during the evaluation of the case scrutinee ((a, s) in the former case, (# State# RealWorld, a #) in the latter).
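
As a minimal sketch of what “holding stack” means here (a hand-rolled bind, not transformers’ actual code):

newtype StateT' s m a = StateT' { runStateT' :: s -> m (a, s) }

bindS :: Monad m => StateT' s m a -> (a -> StateT' s m b) -> StateT' s m b
bindS m k = StateT' $ \s ->
  runStateT' m s >>= \(a, s') ->  -- the pair (a, s') is the case scrutinee;
  runStateT' (k a) s'             -- a frame stays live until it reaches WHNF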

2 Likes

As promised, here’s some of the stuff I was looking at.

I set up some instances of transformers, and have been looking at ExceptT’s Core specifically. I’ve cleaned it up a touch for readability.

-- default impl -O0
$fApplicativeExceptT_$c*>
  = \ @f_a5U3 @e_a5U4 $dMonad_a5U5 @a_a5Vz @b_a5VA before after ->
      let { applicativeDict = $p1Monad $dMonad_a5U5 } in
      let { functorDict = $p1Applicative applicativeDict } in
      (>>=
         $dMonad_a5U5
         (fmap functorDict (const id <$>) (before `cast` <Co:4> :: ...))
         (\ a ->
            case a of {
              Left l -> pure applicativeDict (Left l);
              Right rf ->
                fmap
                  functorDict
                  (\ b ->
                     case b of _ {
                       Left _ -> b;
                       Right r -> Right (rf r)
                     })
                  (after `cast` <Co:4> :: ...)
            }))
      `cast` <Co:5> :: ...

In the above we use fmap twice, and have to pass id around inside an Either.

-- default impl -O2
$fApplicativeExceptT_$c*>
  = \ @f_a5U3 @e_a5U4 $dMonad_a5U5 @a_a5Vz @b_a5VA before after ->
      case $p1Monad $dMonad_a5U5 of
      { C:Applicative functorDict pure ww2_s87U ww3_s87V ww4_s87W
                      ww5_s87X ->
      case functorDict of { C:Functor fmap ww7_s87R ->
      (>>=
         $dMonad_a5U5
         (fmap (const id <$>) (before `cast` <Co:4> :: ...))
         (\ a ->
            case a of {
              Left l -> pure (Left l);
              Right rf ->
                fmap
                  (\ b ->
                     case b of _ {
                       Left _ -> b;
                       Right r -> Right (rf r)
                     })
                  (after `cast` <Co:4> :: ...)
            }))
      `cast` <Co:5> :: ...
      }
      }

Virtually identical; this just allows the dictionaries to be unpacked.

-- explicit -O0
-- just uses the monad implementation
$c*>_r68H
  = \ @f_a5Sr @e_a5Ss $dMonad_a5St @a_a5TX @b_a5TY m_a52L k_a52M ->
      $c>>=_r68E $dMonad_a5St m_a52L (\ _ -> k_a52M)

$c>>=_r68E
  = \ @f_a5Rt @e_a5Ru $dMonad_a5Rv @a_a5RE @b_a5RF before afterF ->
      let { $dApplicative_a5S0 = $p1Monad $dMonad_a5Rv } in
      $ ((\ ds_d67c -> ds_d67c) `cast` <Co:11> :: ...)
        (>>=
           $dMonad_a5Rv
           (runExceptT before)
           (\ a ->
              case a of {
                Left l -> return $dApplicative_a5S0 (Left l);
                Right r -> runExceptT (afterF r)
              }))

This simply uses the Monad implementation and doesn’t inline anything, which is to be expected.

-- explicit -O2
$fApplicativeExceptT_$c*>
  = \ @f_a5U5 @e_a5U6 $dMonad_a5U7 @a_a5VB @b_a5VC before after ->
      case $p1Monad $dMonad_a5U7 of
      { C:Applicative ww_s87G pure ww2_s87I ww3_s87J ww4_s87K
                      ww5_s87L ->
      (>>=
         $dMonad_a5U7
         (before `cast` <Co:4> :: ...)
         (\ b ->
            case b of {
              Left l -> pure (Left l);
              Right _ -> after `cast` <Co:4> :: ...
            }))
      `cast` <Co:5> :: ...
      }

Finally, this implementation is about as good as we can get: casting in and out of ExceptT and simply binding into a case statement.

It’s interesting that there are so many unnecessary operations even in the more optimised default implementations. I was hoping to see an obvious reason why the first few result in space leaks and the latter don’t, but I can’t currently see such a pattern.

This is an interesting piece of information. Does this suggest that the benchmarks (which use IO by default) wouldn’t trigger a memory issue if they had Identity as the base Monad?

1 Like

For some, this is the case.

  • ExceptT, MaybeT, CPS.RWST, Strict.RWST, Strict.StateT, CPS.WriterT, and Strict.WriterT have the same memory usage in IO and Identity.
  • IdentityT, ReaderT, SelectT, and Lazy.StateT use bounded memory with thenA in Identity but unbounded in IO. (All of these have had their (*>) patched to use bounded memory.)
  • AccumT, Lazy.RWST, and Lazy.WriterT use unbounded memory with both (*>) and (>>) in IO, but bounded with both in Identity.

Here’s the script I’ve been using to check up on various monads. It covers all of transformers (except for ContT) and a handful of things from base. Add your own monads, add the packages they come from to the stack script block, and run it with stack. If a row turns red, you’ve found a monad that needs its (*>) patched. If the first column (thenA) disagrees with the second column ((*>)), that’s a monad that either needed patching and got it, or got optimized by GHC at the package level (this is what I think is happening with Either).

{- stack script
   --compile
   --ghc-options -O0
   --ghc-options -Wno-x-partial
   --ghc-options -with-rtsopts=-K100k
   --snapshot nightly-2025-06-11
   --package transformers
-}
{-# LANGUAGE GHC2021 #-}
{-# LANGUAGE LambdaCase #-}

import Control.Exception
import Control.Monad
import Data.Either
import Data.Maybe
import Type.Reflection

import Control.Monad.Trans.Accum
import Control.Monad.Trans.Except
import Control.Monad.Trans.Identity
import Control.Monad.Trans.Maybe
import Control.Monad.Trans.RWS.CPS qualified as CPS
import Control.Monad.Trans.RWS.Lazy qualified as Lazy
import Control.Monad.Trans.RWS.Strict qualified as Strict
import Control.Monad.Trans.Reader
import Control.Monad.Trans.Select
import Control.Monad.Trans.State.Lazy qualified as Lazy
import Control.Monad.Trans.State.Strict qualified as Strict
import Control.Monad.Trans.Writer.CPS qualified as CPS
import Control.Monad.Trans.Writer.Lazy qualified as Lazy
import Control.Monad.Trans.Writer.Strict qualified as Strict
import Data.Functor.Identity
import Data.List.NonEmpty (NonEmpty)
import Data.List.NonEmpty qualified as NE

thenA :: Applicative f => f a -> f b -> f b
thenA x y = (id <$ x) <*> y

times :: (f a -> f a -> f a) -> Integer -> f a -> f a
times _ 1 m = m
times op n m = m `op` times op (n - 1) m

doTest :: Applicative f => (f () -> IO ()) -> (f () -> f () -> f ()) -> IO Bool
doTest runMonad op = try (evaluate <=< runMonad $ times op 10000 $ pure ()) >>= \case
  Right ()           -> pure True
  Left StackOverflow -> pure False
  Left e             -> throwIO e

printWithModule :: TypeRep a -> IO ()
printWithModule tr = putStr (tyConModule (typeRepTyCon tr)) *> putStr "." *> print tr

class Testable a where
  finish :: a -> IO ()

instance Testable () where
  finish = pure

instance Testable (IO ()) where
  finish = id

experiment :: forall m a. Typeable m => Monad m => Testable a => (m () -> a) -> IO ()
experiment runMonad = do
  r1 <- doTest (finish . runMonad) thenA
  r2 <- doTest (finish . runMonad) (*>)
  r3 <- doTest (finish . runMonad) (>>)
  let problem = r2 /= r3
  when problem $ putStr "\027[1;31m"
  putBool r1
  putBool r2
  putBool r3
  putStr "-- "
  printWithModule (TypeRep @m)
  when problem $ putStr "\027[0m"
  where
  putBool = \case True -> putStr "True  "; False -> putStr "False "

main :: IO ()
main = do
  putStrLn "thenA (*>)  (>>)  -- True if space use is bounded"
  putStrLn "-------------------------------------------------"

  experiment @Identity runIdentity
  experiment @Maybe fromJust
  experiment @(Either ()) $ fromRight undefined
  experiment @((->) ()) ($ ())
  experiment @((,) ()) snd
  experiment @[] head
  experiment @NonEmpty NE.head

  experiment @(Accum ()) $ \x -> evalAccum x ()
  experiment @(AccumT () IO) $ \x -> evalAccumT x ()
  experiment @(Except ()) $ fromRight undefined . runExcept
  experiment @(ExceptT () IO) $ fmap (fromRight undefined) . runExceptT
  experiment @(IdentityT Identity) $ runIdentity . runIdentityT
  experiment @(IdentityT IO) $ runIdentityT
  experiment @(MaybeT Identity) $ fromJust . runIdentity . runMaybeT
  experiment @(MaybeT IO) $ fmap fromJust . runMaybeT
  experiment @(CPS.RWS () () ()) $ \x -> fst $ CPS.evalRWS x () ()
  experiment @(CPS.RWST () () () IO) $ \x -> fst <$> CPS.evalRWST x () ()
  experiment @(Lazy.RWS () () ()) $ \x -> fst $ Lazy.evalRWS x () ()
  experiment @(Lazy.RWST () () () IO) $ \x -> fst <$> Lazy.evalRWST x () ()
  experiment @(Strict.RWS () () ()) $ \x -> fst $ Strict.evalRWS x () ()
  experiment @(Strict.RWST () () () IO) $ \x -> fst <$> Strict.evalRWST x () ()
  experiment @(Reader ()) $ \x -> runReader x ()
  experiment @(ReaderT () IO) $ \x -> runReaderT x ()
  experiment @(Select ()) $ \x -> runSelect x id
  experiment @(SelectT () IO) $ \x -> runSelectT x pure
  experiment @(Lazy.State ()) $ \x -> Lazy.evalState x ()
  experiment @(Lazy.StateT () IO) $ \x -> Lazy.evalStateT x ()
  experiment @(Strict.State ()) $ \x -> Strict.evalState x ()
  experiment @(Strict.StateT () IO) $ \x -> Strict.evalStateT x ()
  experiment @(CPS.Writer ()) $ fst . CPS.runWriter
  experiment @(CPS.WriterT () IO) $ fmap fst . CPS.runWriterT
  experiment @(Lazy.Writer ()) $ fst . Lazy.runWriter
  experiment @(Lazy.WriterT () IO) $ fmap fst . Lazy.runWriterT
  experiment @(Strict.Writer ()) $ fst . Strict.runWriter
  experiment @(Strict.WriterT () IO) $ fmap fst . Strict.runWriterT
4 Likes

One reason here is that optimizations are not allowed to change a bottom into non-bottom or vice versa. For instance, thenA for (strict) StateT s IO can be simplified like

thenA :: StateT s IO a -> StateT s IO b -> StateT s IO b
-- thenA m1 m2 = (id <$ m1) <*> m2
thenA m1 m2 = StateT $ \s0 ->
  runStateT m1 s0 >>= \(_x1, s1) ->
  runStateT m2 s1 >>= \(x2, s2) -> pure (x2, s2)

But it cannot be simplified further into

thenM m1 m2 = StateT $ \s0 ->
  runStateT m1 s0 >>= \(_x1, s1) ->
  runStateT m2 s1

because, if the tuple (x2, s2) is bottom, that rewrite would change the result of thenA from a bottom IO action into a non-bottom IO action (with a bottom inside).
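
To make the difference concrete, here’s a runnable sketch using hand-rolled definitions mirroring the ones above (StT, thenA', thenM', and bad are names made up for this example):

import Control.Exception (SomeException, try)

newtype StT s a = StT { runStT :: s -> IO (a, s) }

thenA', thenM' :: StT s a -> StT s b -> StT s b
thenA' m1 m2 = StT $ \s0 ->
  runStT m1 s0 >>= \(_x1, s1) ->
  runStT m2 s1 >>= \(x2, s2) -> pure (x2, s2)
thenM' m1 m2 = StT $ \s0 ->
  runStT m1 s0 >>= \(_x1, s1) ->
  runStT m2 s1

unit, bad :: StT () ()
unit = StT $ \s -> pure ((), s)
bad  = StT $ \_ -> pure undefined  -- a non-bottom action returning a bottom pair

main :: IO ()
main = do
  a <- try (runStT (thenA' unit bad) ()) :: IO (Either SomeException ((), ()))
  putStrLn ("thenA': " ++ either (const "threw") (const "returned") a)
  b <- try (runStT (thenM' unit bad) ()) :: IO (Either SomeException ((), ()))
  putStrLn ("thenM': " ++ either (const "threw") (const "returned") b)

The thenA' version forces the (x2, s2) pair and throws; the thenM' version hands back the bottom pair untouched.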

2 Likes

This is great, but why not improve (<.) and (<*) at the same time? Those could benefit from tail recursion too, for instance with Backwards IO.

(<*) isn’t relevant for the MNR proposal. This is already a big enough undertaking, on largely speculative grounds, and I don’t see anyone else here making PRs.

6 Likes

Fair enough 🙂

1 Like

Ah, so the pattern matches on the Eithers have to stay in place, because the value the latter one scrutinises might have been bottom. That makes sense: in the Applicative case both sides are evaluated immediately, so a bottom in either makes the whole result bottom, whereas in the Monad case only the first needs evaluating to produce the result.

Your personal theory is wrong. Yes, typically GHC optimizes simple cases well, but introduce a single obstacle and GHC won’t save you.

Here’s a simple program:

data NonEmpty a = One a | Cons a (NonEmpty a)
  deriving (Show)

instance Semigroup (NonEmpty a) where
  xs0 <> ys = go xs0 where
    go (One x)     = Cons x ys
    go (Cons x xs) = Cons x (go xs)
  {-# INLINE (<>) #-}

instance Functor NonEmpty where
  fmap f = go where
    go (One x)     = One (f x)
    go (Cons x xs) = Cons (f x) (go xs)
  {-# INLINE fmap #-}

instance Applicative NonEmpty where
  pure = One
  {-# INLINE pure #-}

  fs0 <*> xs = go fs0 where
    go (One f)     = fmap f xs
    go (Cons f fs) = fmap f xs <> go fs
  {-# INLINE (<*>) #-}

instance Monad NonEmpty where
  xs0 >>= f = go xs0 where
    go (One x)     = f x
    go (Cons x xs) = f x <> go xs
  {-# INLINE (>>=) #-}

loop :: Int -> NonEmpty String
loop 0 = pure "b"
loop n = pure "a" >> loop (n - 1)

main = print $ loop (3 * 10 ^ (7 :: Int))

Running it with -O2 and -sstderr gives me

          51,128 bytes allocated in the heap
           3,272 bytes copied during GC
          44,328 bytes maximum residency (1 sample(s))
          25,304 bytes maximum slop
               6 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.001s   0.001s     0.0006s    0.0006s

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time    0.021s  (  0.021s elapsed)
  GC      time    0.001s  (  0.001s elapsed)
  EXIT    time    0.000s  (  0.008s elapsed)
  Total   time    0.023s  (  0.031s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    2,403,063 bytes per MUT second

  Productivity  93.0% of total user, 69.5% of total elapsed

Now if I replace >> with *>, I get

   1,689,186,232 bytes allocated in the heap
   3,342,376,848 bytes copied during GC
     828,730,056 bytes maximum residency (9 sample(s))
     260,908,344 bytes maximum slop
            2151 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       395 colls,     0 par    1.107s   1.108s     0.0028s    0.0057s
  Gen  1         9 colls,     0 par    2.070s   2.070s     0.2300s    0.7351s

  INIT    time    0.001s  (  0.000s elapsed)
  MUT     time    0.901s  (  0.890s elapsed)
  GC      time    3.178s  (  3.178s elapsed)
  EXIT    time    0.001s  (  0.002s elapsed)
  Total   time    4.080s  (  4.070s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    1,874,759,220 bytes per MUT second

  Productivity  22.1% of total user, 21.9% of total elapsed

This is an example of *> being leaky when >> isn’t; however, that’s not the only issue.

Even if both of them have to allocate, defining <*> in terms of >>= in the default way is pretty common and appears even in base. And guess what: *> defined in terms of <*>, itself defined in terms of >>=, is very likely going to be slower than >> defined in terms of >>= directly.
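
Spelling that chain out (apM mirrors base’s ap; thenViaAp and thenMDirect are names made up for this sketch):

-- (<*>) defined via (>>=), as base's ap does:
apM :: Monad m => m (a -> b) -> m a -> m b
apM mf mx = mf >>= \f -> mx >>= \x -> pure (f x)

-- the default (*>) then costs an fmap, two binds, and a pure:
thenViaAp :: Monad m => m a -> m b -> m b
thenViaAp x y = fmap (const id) x `apM` y
-- == fmap (const id) x >>= \f -> y >>= \b -> pure (f b)

-- whereas thenM is a single bind with y in tail position:
thenMDirect :: Monad m => m a -> m b -> m b
thenMDirect x y = x >>= \_ -> y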

We can take the same NonEmpty example, but use the definition from base:

import           Data.List.NonEmpty

loop :: Int -> NonEmpty String
loop 0 = "c" :| []
loop n = ("a" :| ["b"]) >> loop (n - 1)

main = print . Prelude.length $ loop 23

Running it with -O2 and -sstderr gives me

     939,613,912 bytes allocated in the heap
     449,564,896 bytes copied during GC
     107,699,184 bytes maximum residency (7 sample(s))
       1,729,552 bytes maximum slop
             254 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       218 colls,     0 par    0.141s   0.141s     0.0006s    0.0035s
  Gen  1         7 colls,     0 par    0.259s   0.259s     0.0370s    0.1055s

  INIT    time    0.001s  (  0.000s elapsed)
  MUT     time    0.149s  (  0.147s elapsed)
  GC      time    0.400s  (  0.400s elapsed)
  EXIT    time    0.000s  (  0.003s elapsed)
  Total   time    0.550s  (  0.551s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    6,296,104,197 bytes per MUT second

  Productivity  27.1% of total user, 26.8% of total elapsed

Now if I replace >> with *>, I get

   4,697,714,080 bytes allocated in the heap
   1,498,159,192 bytes copied during GC
     320,043,184 bytes maximum residency (11 sample(s))
       4,331,152 bytes maximum slop
             727 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1116 colls,     0 par    0.542s   0.543s     0.0005s    0.0018s
  Gen  1        11 colls,     0 par    0.815s   0.815s     0.0741s    0.2862s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.448s  (  0.442s elapsed)
  GC      time    1.357s  (  1.358s elapsed)
  EXIT    time    0.000s  (  0.010s elapsed)
  Total   time    1.806s  (  1.810s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    10,479,510,041 bytes per MUT second

  Productivity  24.8% of total user, 24.4% of total elapsed

I.e. using *> instead of >> makes the program 3.3x slower and makes it consume 2.9x more memory.

Please give up this futile idea of ditching >> while still defining *> in terms of <*> by default. It’s a nice idea in theory, but in practice it would be punishing industrial users for choosing Haskell – and nothing more than that.

10 Likes

That is a very nice example, and it shows that even *> for NonEmpty in base needs to be updated. However, I think the best outcome would be for GHC to perform these kinds of optimizations by itself, and we can afford to spend a little more effort to see whether that is possible. It’s not yet obvious that the idea is futile.

5 Likes

The first case looks like a good example of GHC optimizations not being as good as we would hope.

For the second, unfortunately Data.List.NonEmpty is known to have rather questionable definitions (CLC #107), and I wouldn’t rely on them to demonstrate efficiency. With some reasonable definitions, I see that the default (*>) is less drastically bad.

import Data.List.NonEmpty

nonEmptyAp :: NonEmpty (a -> b) -> NonEmpty a -> NonEmpty b
nonEmptyAp (f :| fs) (x :| xs) = f x :| (fmap f xs ++ (fs <*> (x:xs)))

nonEmptyThen :: NonEmpty a -> NonEmpty b -> NonEmpty b
nonEmptyThen xs ys = (id <$ xs) `nonEmptyAp` ys

nonEmptyBind :: NonEmpty a -> (a -> NonEmpty b) -> NonEmpty b
nonEmptyBind (x :| xs) f =
  case f x of y :| ys -> y :| (ys ++ (xs >>= toL . f))
  where toL (y :| ys) = y : ys

nonEmptyThen2 :: NonEmpty a -> NonEmpty b -> NonEmpty b
nonEmptyThen2 xs ys = xs `nonEmptyBind` \_ -> ys

loop :: Int -> NonEmpty String
loop 0 = "c" :| []
loop n = ("a" :| ["b"]) `nonEmptyThen` loop (n - 1)

main = print . Prelude.length $ loop 23

With nonEmptyThen:

8388608
     872,498,760 bytes allocated in the heap
[...]
  Total   time    0.351s  (  0.350s elapsed)

With nonEmptyThen2:

8388608
     469,845,440 bytes allocated in the heap
[...]
  Total   time    0.186s  (  0.186s elapsed)
2 Likes

Thanks for your comment, I appreciate the effort you’ve put into your research here. This is almost exactly what I was looking for by making this post.

With your hand-rolled NonEmpty I can definitely see the issues you’ve presented. I tried to mitigate them by occasionally adding a bind into the (*>) version, but it only reduced the memory by a small amount.

I concur with the others that bringing this to light is an excellent opportunity to remove this performance disparity. While I’m not sure of the direct route there, being able to ensure that switching between what should be identical operators is a completely safe operation is something we should be aiming towards.
My “laziest” route is to emit a warning on a Monad instance whose Applicative instance does not define its own (*>), recommending that users define (*>) = thenM (a sketch of this pattern follows below). That will no doubt be annoying, but it will also ensure that those who follow best practice avoid performance issues when using their monads and applicatives.
Alternatives include intrinsic superclasses, instance templates, or something like them, which would make the default implementation of (*>) be thenM whenever there is a Monad instance.
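
For illustration, here’s a minimal sketch of that recommendation (Wrap is a made-up trivial monad; the point is only the explicit (*>)):

import Control.Monad (ap)

newtype Wrap a = Wrap a

instance Functor Wrap where
  fmap f (Wrap a) = Wrap (f a)

instance Applicative Wrap where
  pure = Wrap
  (<*>) = ap
  a *> b = a >>= \_ -> b  -- thenM, replacing the (<*>)-based default

instance Monad Wrap where
  Wrap a >>= f = f a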

Improving the compilation of the default (*>) implementation is obviously the best way to go, however. The most obvious place we’d find performance regressions like this is traverse_ (which uses (*>)) versus mapM_ (which uses (>>)); even if this proposal doesn’t get through, we need to fix that.
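
For reference, paraphrasing the Data.Foldable definitions (primed names here to avoid clashing with base):

traverse_' :: (Foldable t, Applicative f) => (a -> f b) -> t a -> f ()
traverse_' f = foldr (\x k -> f x *> k) (pure ())

mapM_' :: (Foldable t, Monad m) => (a -> m b) -> t a -> m ()
mapM_' f = foldr (\x k -> f x >> k) (return ())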

5 Likes