Description
If a Markdown heading contains HTML, the corresponding entry in the .toc_tokens
property ends up with HTML placeholders when returned to the user. The following example should illustrate the problem:
>>> import markdown
>>> md = markdown.Markdown(extensions=['toc'])
>>> md.convert('# <code>Heading</code>\n')
'<h1 id="heading"><code>Heading</code></h1>'
>>> md.toc_tokens
[{'level': 1, 'id': 'heading', 'name': '\x02wzxhzdk:0\x03Heading\x02wzxhzdk:1\x03', 'children': []}]
While this isn't too hard to fix (we could just un-stash the HTML immediately before returning it to the user), it does raise a bigger question: what should the data format of the name field in toc_tokens be? Is it...
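To make the "un-stash" idea concrete, here is a minimal sketch. It does not touch the library's real stash object: `stashed_html` is a hypothetical plain list standing in for it, and `unstash` is an illustrative name. It assumes only the placeholder format visible in the output above (`\x02wzxhzdk:N\x03`) and that `N` indexes the stashed fragments.

```python
import re

# Placeholder format as seen in the toc_tokens output above.
PLACEHOLDER_RE = re.compile(r"\x02wzxhzdk:(\d+)\x03")


def unstash(name, stashed_html):
    """Replace each placeholder in `name` with the fragment it points to.

    `stashed_html` is a hypothetical stand-in for the stash: a list where
    index N holds the raw HTML that placeholder N replaced.
    """
    def restore(match):
        index = int(match.group(1))
        if index < len(stashed_html):
            return stashed_html[index]
        # Unknown index: leave the placeholder untouched.
        return match.group(0)

    return PLACEHOLDER_RE.sub(restore, name)


name = "\x02wzxhzdk:0\x03Heading\x02wzxhzdk:1\x03"
print(unstash(name, ["<code>", "</code>"]))
```

This only restores the raw HTML; it leaves open which of the formats below the restored value should be.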
- Markdown (so the value would be exactly as in the source file: <code>Heading</code>)
- Plain text (so the value would strip HTML: Heading)
- HTML (similar to the Markdown format, but with special characters replaced by HTML entities, so <code>a>b</code> becomes <code>a&gt;b</code>)
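For comparison, the three candidate formats can be computed side by side for a single heading. This is only a sketch using the standard library's `html.parser`; `candidate_names` and `_Normalizer` are hypothetical names, and it assumes the heading source contains nothing but raw inline HTML and text (no other Markdown syntax to render).

```python
from html import escape
from html.parser import HTMLParser


class _Normalizer(HTMLParser):
    """Re-serialize a heading: keep tags, escape text, collect plain text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.html_parts = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Keep the tag exactly as written in the source.
        self.html_parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, unchanged.
        self.html_parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.html_parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is escaped for the HTML form and kept raw for the plain form.
        self.html_parts.append(escape(data, quote=False))
        self.text_parts.append(data)


def candidate_names(source):
    """Return the three possible `name` values for one heading source."""
    parser = _Normalizer()
    parser.feed(source)
    parser.close()
    return {
        "markdown": source,
        "text": "".join(parser.text_parts),
        "html": "".join(parser.html_parts),
    }


print(candidate_names("<code>a>b</code>"))
```

For `<code>a>b</code>` the Markdown and HTML forms differ only in the escaped `>`, while the plain-text form drops the tags entirely.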
In particular, this is relevant for mkdocs/mkdocs#1970. Prior to that PR, MkDocs would build an internal representation of the TOC by parsing the HTML from .toc. With the change, it (tries to) use .toc_tokens, but fails due to this issue.
FWIW, I think MkDocs wants this to be plain text in the end, but HTML makes the most sense to me in general: after all, the purpose of this lib is to convert Markdown to HTML. (MkDocs would then just need to parse the HTML fragment and strip out the tags.)
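If the name were HTML, the extra step on the consumer side could be quite small. The following is a sketch of that step only, not what MkDocs actually does; `strip_tags` and `_TextExtractor` are hypothetical names, and it relies solely on the standard library's `html.parser`.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text content of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &gt; in text.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(fragment):
    """Return the plain text of an HTML fragment, with entities decoded."""
    extractor = _TextExtractor()
    extractor.feed(fragment)
    extractor.close()
    return "".join(extractor.parts)


print(strip_tags("<code>a&gt;b</code>"))
```

Going from HTML to plain text this way loses nothing the consumer needs, whereas going from plain text back to HTML is not possible, which is one more argument for HTML as the stored format.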