Skip to content

Don't emit stashed HTML tag placeholders in .toc_tokens #901

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jan 31, 2020

Conversation

jimporter
Copy link
Contributor

@jimporter jimporter commented Jan 31, 2020

This should resolve #899. The fix is a bit subtle, since we want to emit HTML entities for .toc_tokens, but not HTML tags. We also want to be careful to clean up data-toc-label as needed. Note: this change means that raw HTML in a heading is no longer passed through to the HTML for .toc.

I'm not entirely happy with this. In practice, I think the ideal would be for the toc extension to include the (HTML-ized) Markdown from a heading in its TOC entry, and to have .toc_tokens include both html and plaintext fields. This patch gives us a sort of middle ground, where we just use the HTML with tags stripped. The "ideal" is a larger behavioral change though, and I'm not sure it makes sense to do that for a minor release...

If you think it makes sense to implement the "ideal" solution now, I can try and do that. However, this patch is sufficient for what MkDocs needs.

…ython-Markdown#899

Note: this slightly changes existing behavior in that raw HTML tags are no
longer included in the HTML `.toc`.
@jimporter
Copy link
Contributor Author

Hmm, one problem with the "ideal" solution mentioned above is that we'd probably want to strip <a> tags from the heading: those are obviously going to cause problems with the links generated for navigating from the TOC itself. Given that, maybe what I have here is the best solution for now...

@jimporter
Copy link
Contributor Author

One other option here would be that we strip HTML tags from the heading, but if a user really wants HTML tags in their heading, they could explicitly use { data-toc-label="Foo <code>Bar</code>" }. This is more flexible, though attr lists do seem to have some unusual behavior when trying to embed Markdown in them, e.g. { data-toc-label="Foo *Bar*" }. I haven't spent enough time digging through what's happening to understand how to fix that, but I assume it's because attr lists are just looking at the raw text, not the HTML elements generated by the initial Markdown parse.

@waylan
Copy link
Member

waylan commented Jan 31, 2020

For data-toc-label to support raw HTML, then the content would need to be stashed in the html stash. Otherwise, it would get escaped when the elementtree is serialized.

For data-toc-label to support Markdown syntax, it gets complicated. For example, we do not want a elements. And what about images or block level elements? I suppose we could allow a whitelist of known specific inline elements only. I would need to explore this more as well. But that is separate from the issue we are trying to address here.

While the fix here is more restrictive than previously, it fixes the obviously wrong output and actually does what we intended to do previously. If we want to add support for formatting TOC entries, then that would be a separate new feature.

@waylan waylan merged commit ccf56ed into Python-Markdown:master Jan 31, 2020
@jimporter jimporter deleted the toc-strip-html branch April 8, 2020 22:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Heading names in toc_tokens contain stashed HTML placeholders
2 participants