Another attempt at docstrings for names and parameters--using "||"

blhsing · October 15, 2024, 9:14am

Motivation:

For starters I’d like to refer to the very well-written Motivation for PEP-727 for why there should be a change to the status quo, where third-party tools are used to render per-name/parameter documentation embedded in a large docstring written in specific microsyntaxes.

While there have been numerous attempts to standardize docstrings for names and parameters over long discussions such as Revisiting attribute docstrings and PEP 727: Documentation Metadata in Typing, all of the proposals so far seem to have fallen short in some ways as they try to repurpose an existing Python grammar for a docstring.

These include some variations of:

docstring as a string:

class A:
    b = 42
    """Some documentation."""

Downside: As a string it’s easy to run into ambiguous usage such as:

def foo(
    param: str = "some default value"
    """Some documentation""",
):
    ...
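The ambiguity can be observed in current Python: adjacent string literals are concatenated, so the intended docstring silently becomes part of the default value. A minimal check using only the standard library:

```python
import inspect


def foo(
    param: str = "some default value"
    """Some documentation""",
):
    return param


# Adjacent string literals are joined at compile time, so the
# would-be docstring ends up inside the default value.
default = inspect.signature(foo).parameters["param"].default
print(repr(default))  # 'some default valueSome documentation'
```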

docstring as a comment:

V = 'hello' #: this is the docstring for V

Downside: As a comment its value is difficult to access at runtime.

docstring as an annotation:

def foo(
    param: Annotated[str, Doc("Some documentation.")]
) -> None: ...

Downside: Too much boilerplate and can’t exist independently from a type hint.
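For comparison, this is roughly how an annotation-based docstring could be read back at runtime. The `Doc` class below is a hypothetical stand-in for the PEP 727 marker, defined locally so that the sketch is self-contained, and `param_docs` is an illustrative helper name, not an existing API:

```python
from dataclasses import dataclass
from typing import Annotated, get_type_hints


@dataclass(frozen=True)
class Doc:
    """Stand-in for the PEP 727 marker (assumption, for illustration only)."""

    documentation: str


def foo(param: Annotated[str, Doc("Some documentation.")]) -> None: ...


def param_docs(func) -> dict[str, str]:
    # include_extras=True keeps the Annotated metadata instead of stripping it.
    hints = get_type_hints(func, include_extras=True)
    docs = {}
    for name, hint in hints.items():
        for meta in getattr(hint, "__metadata__", ()):
            if isinstance(meta, Doc):
                docs[name] = meta.documentation
    return docs


print(param_docs(foo))  # {'param': 'Some documentation.'}
```

The documentation is reachable only through the type hint, which illustrates the downside above: a name without an annotation has nowhere to carry it.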

docstring as an annotation transformed from existing operators:

def some_function(
    some_parameter: SomeType -- "Some documentation goes here",
    **kwargs: Any -- "Additional keyword arguments"
) -> SomeReturn -- "Some details about this return type":
    """ Documenting the function itself here """

Downside: Can’t exist independently from a type hint and would be ambiguous as a docstring for a name:

foo = 'hello' -- world # is it a docstring or a negation and a subtraction?
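The `ast` module shows how the line above is read today: it is already valid syntax, parsed as a subtraction whose right operand is a negation.

```python
import ast

tree = ast.parse("foo = 'hello' -- world")
value = tree.body[0].value

# The expression is a binary subtraction ...
print(type(value).__name__, type(value.op).__name__)  # BinOp Sub
# ... whose right-hand side is a unary negation of `world`.
print(type(value.right).__name__, type(value.right.op).__name__)  # UnaryOp USub

# With numbers the same spelling simply evaluates: 5 - (-3)
print(5 -- 3)  # 8
```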

Rationale:

Since it is apparent from the past proposals that repurposing an existing syntax for a per-name docstring always ends up requiring too many compromises, it would be justified to introduce a new per-name docstring syntax which:

is invalid with the current syntax, so it’s unambiguous, easily identifiable by a human and collapsible by an IDE.
is short and concise, so as to leave room for the actual content of the docstring.
is not nestable, so as to avoid ambiguity.
is optional, so it is used only where it makes sense to individually document a name or a parameter.
is viable for both variable names and parameters.
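The first criterion can be checked against the current grammar. The sketch below (the helper name `is_valid_today` is made up for illustration) compiles a few candidate spellings and reports which of them are already legal:

```python
def is_valid_today(source: str) -> bool:
    """Return True if *source* compiles under the running Python's grammar."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


print(is_valid_today("RED = 1  || a lovely color"))  # False: a new syntax would be unambiguous
print(is_valid_today("RED = 1  # a lovely color"))   # True: an ordinary comment
print(is_valid_today("RED = 1 -- 2"))                # True: already means 1 - (-2)
```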

Specification:

The main proposal is to introduce a new token, ||, to mark the start of a docstring or a continuation of a docstring from the previous line, and what follows || is the content of the docstring.

In most ways it works just like a comment, except that its content is accessible as a string in the __name_docs__ dict attribute of a module or a class, or __param_docs__ dict attribute of a function. And by default its content is automatically dedented, just like docstrings currently are.

When || appears on the same line as a name, the docstring binds to the name (do imagine there’s syntax highlighting for the docstring):

class Color(Enum):
    RED = 1  || a lovely color
    BLUE = 2 || a moody color

This results in Color.__name_docs__ with a value of:

{'RED': 'a lovely color', 'BLUE': 'a moody color'}

Note that I’m not sure if we should allow a docstring to be on a different line than the end of the assignment when there’s a multi-line default value:

class Color(Enum):
    RED = RGB(    || a lovely color
        255, 0, 0
    )

Should this be allowed?

class Color(Enum):
    RED = RGB(
        255, 0, 0
    ) || a lovely color

I’ll leave the decision to the discussion.

Then there’s a second form, where || appears at the beginning of a line after dedentation, in which case the docstring binds to the name on the left in the next line:

class Color(Enum):
    || a lovely color
    RED = RGB(
        255, 0, 0
    )

In both cases above Color.__name_docs__ becomes {'RED': 'a lovely color'}.

A multi-line docstring can be written in multiple lines with each line starting with a || (keeping in mind that an IDE can be made to collapse the docstrings as needed):

class Color(Enum):
    || a lovely color that symbolizes:
    ||     - life
    ||     - health
    ||     - courage
    ||     - love
    RED = RGB(
        255, 0, 0
    )

and Color.__name_docs__ becomes {'RED': 'a lovely color that symbolizes:\n    - life\n    - health\n    - courage\n    - love'}. Note the lack of a trailing newline, for consistency with single-line docstrings.
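Since `||` is not valid syntax today, the binding rules can only be emulated on source text. Here is a rough, line-based sketch of the two forms and the dedenting rule. The function name `extract_name_docs` and the regular expression are assumptions made for illustration, and the sketch ignores cases such as `||` inside string literals or keyword arguments on continuation lines:

```python
import re
import textwrap

# Matches "NAME =" or "NAME: annotation =" at the start of a line.
_NAME = re.compile(r"\s*([A-Za-z_]\w*)\s*(?::[^=]*)?=")

SOURCE = """\
class Color(Enum):
    || a lovely color that symbolizes:
    ||     - life
    ||     - health
    RED = RGB(
        255, 0, 0
    )
    BLUE = 2 || a moody color
"""


def extract_name_docs(source: str) -> dict[str, str]:
    docs: dict[str, str] = {}
    pending: list[str] = []  # leading "||" lines waiting for a name
    for line in source.splitlines():
        code, sep, doc = line.partition("||")
        if sep and not code.strip():
            # Second form: the line holds only a docstring fragment.
            pending.append(doc)
            continue
        match = _NAME.match(code)
        if match and sep:
            # First form: the docstring trails the name on the same line.
            docs[match.group(1)] = doc.strip()
        elif match and pending:
            # Join the fragments, then dedent them like a regular docstring.
            docs[match.group(1)] = textwrap.dedent("\n".join(pending)).strip()
        pending = []
    return docs


docs = extract_name_docs(SOURCE)
print(docs["BLUE"])  # a moody color
print(docs["RED"])
```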

The same rules apply to docstrings for parameters of a function definition, which I won’t repeat here for brevity.

Mnemonic:

The choice of || as a docstring marker comes from its visual resemblance to a column separator for notes, and from its widely understood meaning of “or”, which, in English, is a conjunction that can introduce an explanation of a preceding word or phrase.

JamesParrott · October 15, 2024, 10:02am

What’s wrong with the existing docstring conventions and mature supporting tools?

I’m a big fan of documentation - it should be a concern of developers. But documentation should not make the code less readable.

Nodd · October 15, 2024, 10:18am

Points 1 and 2 above represent the existing conventions (that I know of). @blhsing did a good job summarizing the limitations of these solutions.

Personally, I find the rationale well explained, but I don’t like the proposed token: || looks way too much like an or operator. I can’t come up with a better suggestion, though, since the number of available tokens is limited.

jcampbell05 · October 15, 2024, 10:42am

There is an annotation type in the typing library that basically does this already:

name = Annotated[str, "first letter is capital"]

pawamoy · October 15, 2024, 1:22pm

Thank you @blhsing for taking a stab at this

It must be allowed, otherwise you can’t write a multiline docstring.

At this point, it just looks like comments with a different token, so why not just include comments in ast’s output and also make them available at runtime per the rules you suggest, instead of introducing new backward-incompatible syntax?

Lots of things:

The two most-used docstring styles, Google and Numpydoc, do not have specs. That leaves lots of ambiguities when implementing parsers. I’m not aware of specs for the other styles (Sphinx, epydoc) either.
They diverge from one another. Some allow documenting X, others Y, making it hard to create common data structures to hold the information.
Some rely on, or are highly inspired by, actual markup such as rST. Mixing markup into the style makes it incompatible with other markups (Markdown, AsciiDoc, Djot, etc.). The only style that I know of that is markup-agnostic is Google’s.
They cannot be detected with 100% accuracy. You’ll always have false-positives or false-negatives when trying to identify the style used by a docstring, especially when they rely on specific markup. Fixing this would require actually providing metadata for each docstring (or once for a whole module/package), somehow. This is an unsolved problem.

This idea is rejected by both PEP 727 and this very proposal.

blhsing · October 15, 2024, 2:45pm

Yeah, same here. // would’ve been the most obvious choice if it weren’t already used for floor division, so I came up with the admittedly somewhat weak mnemonic above to hopefully warm people up to seeing the typical or operator in a new light.

That would be an awesome idea if there weren’t already so many existing comments that aren’t meant to be docstrings but just happen to satisfy my rules for per-name docstrings. And even for new code we still need to be able to make a developer-oriented comment on a name without exposing it as a docstring, which is meant for users. That’s why a dedicated syntax for a per-name docstring is still needed.

pawamoy · October 15, 2024, 3:01pm

We can have both comments and a dedicated prefix

# This is a regular comment.
attr1 = 1

### This is a docstring comment.
attr2 = 2

I’d strongly advocate for triple ### because that brings you to four characters of indentation, so that further indents within the docstring stay aligned with tab stops:

## Starts at column 4.
##  Hit tab once :(

### Starts at column 5.
###     Hit tab once :)

And I’d advocate against using anything other than ###, like #::, #|>, ##? or whatever, as ### is much easier to type and to update across multiple lines at once.

Nodd · October 15, 2024, 3:09pm

The docstring marker for comments is #:, which is not typically used for developer-oriented comments as it goes against PEP 8 (there is no space after #). Also, as noted in the original post, it’s already used for docstrings in Sphinx autodoc.

As for ### I’ve seen it used to mark blocks of code, for example like this:

####################
### MY CODE BLOC ###
####################

It may already exist in the wild.

pawamoy · October 15, 2024, 3:16pm

Ah, right, that’s tough

#: is Sphinx-specific, yes.

jcampbell05 · October 15, 2024, 3:41pm

Could we maybe piggyback off the whole b" means byte-string, f" means formatted string convention?

Perhaps an underscore followed by " could mean a comment? def sayHello(name = "Hello" _"The greeting to use when saying hello"):

bschubert · October 15, 2024, 3:47pm

There’s already a competing PEP for that syntax. Better write your PEP fast if you want to beat it

jcampbell05 · October 15, 2024, 3:51pm

I had a look, and as far as I can see an underscore prefix isn’t reserved by that PEP unless a user defines it themselves, so I think having _ be a reserved prefix would be okay.

jamestwebber · October 15, 2024, 10:16pm

I’d been thinking about proposing something like docments for a while, until I learned that they already exist. But I never wrote up a proposal because it really doesn’t need a proposal, it needs an implementation and widespread adoption, and I didn’t find the time to work on the former.

I think using a comment syntax for this has a major strength that the others can’t match: it doesn’t require any change to Python syntax. It simply requires a linter option to allow a specific type of comment prefix. Sphinx’s #: is one option, but I think #| would be nice because it makes a nice vertical rule for multi-line comments.

On the other hand, this is a very small downside. Comments are easily accessible with the tokenizer. If a standard doc-comment format became popular, a future PEP could add those comments to the AST to act like docstrings. This isn’t any more complicated than adding the novel syntax of ||, but has the benefit that doc-comments are trivially backwards compatible.
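As a rough illustration of that point, the standard `tokenize` module already yields comments as tokens. The helper name `doc_comments` and the default `#:` prefix are choices made for this sketch only:

```python
import io
import tokenize

SOURCE = '''\
V = 'hello'  #: this is the docstring for V
W = 'world'  # a regular developer comment
'''


def doc_comments(source: str, prefix: str = "#:") -> dict[int, str]:
    """Map line numbers to the text of comments that start with *prefix*."""
    found = {}
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT and tok.string.startswith(prefix):
            found[tok.start[0]] = tok.string[len(prefix):].strip()
    return found


print(doc_comments(SOURCE))  # {1: 'this is the docstring for V'}
```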

Because doc-comments are just comments, they don’t need a PEP or any syntax changes for the initial implementation, which means they can prove their merit before any proposal is needed. If I had infinite free time I’d be working on an IDE plugin for that, I think it could be pretty slick.

effigies · October 15, 2024, 10:52pm

Unfortunately, the predominant auto-formatters normalize the spacing before trailing comments to two spaces, making vertical alignment painful.

Formatter: Keep right-hanging comments aligned · Issue #7684 · astral-sh/ruff · GitHub
cosmetic alignment of inline comments (like yapf SPACES_BEFORE_COMMENT) · Issue #682 · psf/black · GitHub
Comments that are vertically aligned across 3 or more lines can remain vertically aligned. · Issue #1696 · psf/black · GitHub

jamestwebber · October 15, 2024, 11:12pm

Those examples seem to be different[1], in that they are multiple comments for different things. By “multi-line” I was referring to something more like:

def foo(
    a: int,  #| these are all single-line (doc) comments
    bar: str,  #| so I wouldn't expect them to be aligned
    c: str,  #| it might even be confusing to imply they're connected
):
    #| This is what I meant by multi-line comment, because
    #| it's a single block of documentation for something 
    #| (in this case it's the equivalent of a docstring for `foo`)
    ...

But again this is a lint/formatter change, not a Python syntax change, so it’s much easier to be flexible. It requires a formatter that respects this convention and/or allows users to disable the rule that messes it up. If someone wanted to do that it could be available next week, rather than next year.

[1] Except for the third one, which seems to have a multi-line comment that overlaps other arguments. I think that’s just confusing to read.

blhsing · October 16, 2024, 2:21am

You’re right that extracting comments from the source code and binding them to names is a long-solved problem with the help of tokenize and ast.
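A minimal sketch of that approach, assuming the `#:` prefix and trailing comments only: one pass with `tokenize` collects the doc-comments by line, and a second pass with `ast` joins them to assigned names and parameters. The function name `bind_doc_comments` is made up for this example, and the result is a single flat dict that ignores scoping:

```python
import ast
import io
import tokenize

SOURCE = '''\
class Color:
    RED = 1   #: a lovely color
    BLUE = 2  #: a moody color
    GREEN = 3  # not a docstring

def foo(
    a: int,  #: first operand
    b: int = 0,  #: second operand
):
    ...
'''


def bind_doc_comments(source: str, prefix: str = "#:") -> dict[str, str]:
    # Pass 1: tokenize to find doc-comments, keyed by line number.
    comments = {}
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT and tok.string.startswith(prefix):
            comments[tok.start[0]] = tok.string[len(prefix):].strip()

    # Pass 2: parse to find names and parameters, then join on the last line.
    docs = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            names = [t.id for t in node.targets if isinstance(t, ast.Name)]
        elif isinstance(node, ast.AnnAssign) and isinstance(node.target, ast.Name):
            names = [node.target.id]
        elif isinstance(node, ast.arg):
            names = [node.arg]
        else:
            continue
        doc = comments.get(node.end_lineno)
        if doc is not None:
            for name in names:
                docs[name] = doc
    return docs


print(bind_doc_comments(SOURCE))
```

Here the source is read twice, once per pass; a real tool would also need to keep class attributes and function parameters in separate mappings.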

With that, it makes total sense to use comments for docstrings if we can just standardize on a comment prefix for a user-bound docstring, which Sphinx autodoc already has with this very reasonable looking #:.

A separate prefix for vertical multi-line docstrings isn’t needed since multiple successive #:-led lines can already form a multi-line docstring.

While Sphinx autodoc doesn’t support per-parameter docstrings, the docments module demonstrates how it can be done (but without a dedicated docstring prefix). I think the more viable path forward is for Sphinx autodoc to incorporate docments capabilities into its syntax and for IDE extensions to support it (particularly in offering an option to hide or collapse per-parameter docstrings).

Thanks everyone for the feedback!

jamestwebber · October 16, 2024, 2:34am

I wasn’t proposing a separate prefix just for that; it was just an alternative style that I thought looked nice and hadn’t been claimed by anything[1].

The real question is, is anyone volunteering to do this work?

[1] That I knew of, although the fast.ai folks might actually be using it for something.

petercordia · October 16, 2024, 2:34pm

It would feel like a waste to me if an update to docstrings didn’t improve type hinting for dataclass and attrs.

So to attempt to expand on this proposal:

@attrs.define
class MyDataClass():
    a = attrs.field(converter=int, default='33')
    || set_type = str|float|int
    || get_type = str
    || "A variable that is correctly type hinted in the class constructor"

I think that might actually work better than I’d initially expected.

pawamoy · October 16, 2024, 2:37pm

So, if I need multiple lines to document parameters and return values with docments, I have to do the following, right? (using #| just to play along)

def foo(
    #| The bar parameter (summary).
    #|
    #| The bar parameter (body).
    #| More info.
    bar: str | None = None,
    #| The return value (summary).
    #|
    #| The return value (body).
    #| More info.
) -> int:
    """Function summary.

    Function body.
    """
    ...

What if my function returns multiple values? With PEP 727 I could annotate and document each:

def foo() -> tuple[
    Annotated[int, Doc("Some integer.")],
    Annotated[float, Doc("Some float.")],
]:
    ...
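To make the comparison concrete, here is how per-element return documentation could be recovered at runtime from such a signature. As before, `Doc` is a locally defined, hypothetical stand-in for the PEP 727 marker rather than the real class:

```python
from dataclasses import dataclass
from typing import Annotated, get_args, get_type_hints


@dataclass(frozen=True)
class Doc:
    """Stand-in for the PEP 727 marker (assumption, for illustration only)."""

    documentation: str


def foo() -> tuple[
    Annotated[int, Doc("Some integer.")],
    Annotated[float, Doc("Some float.")],
]:
    return 1, 2.0


# include_extras=True preserves the nested Annotated metadata.
return_hint = get_type_hints(foo, include_extras=True)["return"]
element_docs = [
    meta.documentation
    for element in get_args(return_hint)
    for meta in getattr(element, "__metadata__", ())
    if isinstance(meta, Doc)
]
print(element_docs)  # ['Some integer.', 'Some float.']
```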

I don’t think it’s possible with docments? Current docstring styles also allow multiple named return values to be documented.

def foo() -> tuple[int, float]:
    """Summary.

    Returns:
        quantity: Whatever.
        threshold: Whatever.
    """

Allowing this with docments would require another microsyntax I think?

We also have to take into account the fact that current docstring styles allow documenting more than parameters and return values: emitted warnings, raised exceptions, generator yielded and received values, etc. How would docments handle these?

About tokenization, sure, it lets you pick up comments. But that essentially means that when doing static analysis with ast, you have to parse each module twice: once to get the AST, and a second time to get the comments. This is not very efficient.

But yes, I hear the argument of “once it’s widely adopted, we can standardize it and change ast so that it returns these comments too, and also change the interpreter to include these docstrings at runtime”. The issue with this I think, is that in the meantime you’ll just create one more standard (ref xkcd).

In any case, authors/maintainers of docments: please, please write a spec! Happy to provide feedback along the way.

jamestwebber · October 16, 2024, 2:38pm

That seems like a totally different change related to how type annotations work; combining it with this would make for a much more complicated proposal.