Markdown-to-Markdown renderer #4

Closed
lhayhurst opened this issue Oct 4, 2017 · 34 comments · Fixed by #162

@lhayhurst

Hi, great project! I chose it over the alternatives because I want to render Markdown back into Markdown. Is there a simple pass-through renderer that will render a document back to its original input form? (My larger use case is that I want to edit nodes in the AST to make programmatic improvements to user-entered Markdown.) Cheers!

@miyuchina
Owner

Thanks for the interest! Unfortunately rendering back to Markdown does require implementing a complete renderer, as the original syntax information is lost in the parsed AST.

Such a renderer is certainly planned for mistletoe, though it does require a bit of work. If you're interested at all in implementing this feature yourself, feel free to open a pull request and we'll see how it goes. Otherwise, it would be a planned feature for the next release.

@lhayhurst
Author

lhayhurst commented Oct 6, 2017

Thanks for the reply! Cool, that is what I thought. My friend (@dgroo) and I are going to take a shot at writing the Markdown renderer (starting from the HTML one), but we're both a little busy right now, so if this is something you are hoping to get done quickly, please let me know :-)

@miyuchina miyuchina changed the title Rendering in Markdown Markdown-to-Markdown renderer Oct 6, 2017
@miyuchina
Owner

miyuchina commented Jan 13, 2018

I'm going to add a "help-wanted" tag to this issue, since I don't think I'll be getting around to this anytime soon. If you're interested in this feature, add your thumbs-up to @lhayhurst's topmost comment. Comment below if you're in a pinch!

For potential contributors, take a look at mistletoe.html_renderer module. It would serve as a good example for writing your own renderer classes, and you will find most token attributes there.

Also a reminder to branch off your changes from the dev branch, not the master branch!

@nickovs

nickovs commented Jun 20, 2018

Has any progress been made on this? I too need a Markdown renderer for mistletoe. If there's work in progress then I would be happy to take a look at using that as a starting point and see if I can build something.

@miyuchina
Owner

miyuchina commented Jun 20, 2018

Thank you @nickovs for taking this task on yourself! I think the main difficulty is working through all the edge cases that a Markdown document can contain, and this is partly why I've been putting this issue off. For example:

**_foo_**

... should be parsed as:

<strong><em>foo</em></strong>

But using a naive implementation, e.g.,

def render_strong(self, token):
    return '**{}**'.format(self.render_inner(token))

def render_emphasis(self, token):
    return '*{}*'.format(self.render_inner(token))

... we would have the output:

***foo***

... which gets parsed as:

<em><strong>foo</strong></em>

And things get trickier when we have escape characters, which influence the parsing process, but in some cases are not reflected in the abstract syntax tree.
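
The nesting problem above can be reproduced without mistletoe at all. The following is a minimal sketch using a toy AST; the classes `RawText`, `Emphasis` and `Strong` and both render functions are hypothetical stand-ins written for illustration, not mistletoe's own token classes or renderer API. It shows the naive output and one possible workaround, which is to pick a delimiter character that differs from the enclosing one.

```python
from dataclasses import dataclass


# Toy token classes for illustration only (not mistletoe's classes).
@dataclass
class RawText:
    content: str


@dataclass
class Emphasis:
    children: list


@dataclass
class Strong:
    children: list


def render_naive(token):
    """Always use '*' as the delimiter, as in the naive renderer above."""
    if isinstance(token, RawText):
        return token.content
    inner = "".join(render_naive(child) for child in token.children)
    if isinstance(token, Strong):
        return "**" + inner + "**"
    return "*" + inner + "*"


def render_alternating(token, parent_char=""):
    """Use '_' whenever the enclosing span used '*', and vice versa."""
    if isinstance(token, RawText):
        return token.content
    char = "_" if parent_char == "*" else "*"
    inner = "".join(render_alternating(child, char) for child in token.children)
    delimiter = char * (2 if isinstance(token, Strong) else 1)
    return delimiter + inner + delimiter


# The AST for <strong><em>foo</em></strong>
tree = Strong([Emphasis([RawText("foo")])])

print(render_naive(tree))        # ***foo***
print(render_alternating(tree))  # **_foo_**
```

Alternating delimiters is only a partial answer: `_` has stricter rules than `*` inside words, and the approach does nothing for the escape-character cases mentioned above.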

I have some thoughts on how to get around this, but it would require some additional work apart from implementing a renderer. What are your thoughts, and what do you think would be your use case for such a renderer?

Edit and thank you @huettenhain!

@nickovs

nickovs commented Jun 20, 2018

I've been taking a look at this just now since I have an active need for it at work. The use case that I have is that we manage a bunch of processes internally using Markdown wiki pages; some of these pages are generated by humans and some by machine. I need to be able to have code that can add, modify and/or delete content in the sections in the middle of the pages, and ideally I'd like to be able to do this in a structured way. I can extract the content but at the moment I can't regenerate the content after editing it.

As for thoughts about how to do this, I think that the key piece that is missing is for the renderer for a given token to be able to look back up the stack at the tokens above. This would be fairly easy to do just by having BaseRenderer.render() push the token being rendered onto a stack before it makes the call through the render_map and pop it back off afterwards. Doing this would be useful to improve the rendering of nested strong and emphasis and also might make some cases like tables a little easier to keep looking nice.
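
A rough sketch of the stack idea, under stated assumptions: `StackRenderer` below is a hypothetical toy class that dispatches on dict-based tokens, not mistletoe's `BaseRenderer`, and the `parent_type` helper is invented for illustration. The point is only to show `render()` pushing the current token before dispatching through a render map and popping it afterwards, so that a render method can inspect its ancestors.

```python
class StackRenderer:
    """Toy renderer that tracks the chain of tokens being rendered."""

    def __init__(self):
        self.stack = []
        self.render_map = {
            "Strong": self.render_strong,
            "Emphasis": self.render_emphasis,
            "RawText": self.render_raw_text,
        }

    def render(self, token):
        # Push before dispatching, pop afterwards (even on errors).
        self.stack.append(token)
        try:
            return self.render_map[token["type"]](token)
        finally:
            self.stack.pop()

    def render_inner(self, token):
        return "".join(self.render(child) for child in token.get("children", []))

    def parent_type(self):
        # stack[-1] is the token currently being rendered.
        return self.stack[-2]["type"] if len(self.stack) > 1 else None

    def render_strong(self, token):
        return "**" + self.render_inner(token) + "**"

    def render_emphasis(self, token):
        # Avoid emitting '***' when nested directly inside Strong.
        delimiter = "_" if self.parent_type() == "Strong" else "*"
        return delimiter + self.render_inner(token) + delimiter

    def render_raw_text(self, token):
        return token["content"]


tree = {
    "type": "Strong",
    "children": [
        {"type": "Emphasis", "children": [{"type": "RawText", "content": "foo"}]}
    ],
}
renderer = StackRenderer()
print(renderer.render(tree))  # **_foo_**
```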

@lhayhurst
Author

Hi, thank you for picking this up! I've been knee-deep in job-work recently and unable to complete the task :-(

@nickovs

nickovs commented Jun 20, 2018

@miyuchina Since you mentioned that this was already planned as a feature for mistletoe, when I send you a pull request would you like me to put this into the mistletoe directory or the contrib directory? It seems to me that it should be core functionality for the library, which would suggest the former.

@miyuchina
Owner

@nickovs Yes, go ahead and put it in the mistletoe directory! I like the idea, but for now, if you do end up implementing this, is it okay if you only override the render function in your new renderer? Don't worry too much about writing tests, they can come later.

I'm thinking about adding location information to each token, e.g., a Paragraph knows it has lines 3-6 of the original document, and an Emphasis knows it's characters 12-20. This would potentially help with features like incremental compilation. For implementing MarkdownRenderer, there's a simpler (and faster?) way that allows us to avoid handling edge cases one by one:

  • if we see an unmodified token, copy the relevant text region from the original document;
  • if we see a modified token, render according to the new render method.

But adding location information to tokens needs quite a bit of work, so if you want to go through with your method, feel free!
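
To make the two rules above concrete, here is a small sketch that assumes a flat list of tokens, each carrying a character range into the original text. The `Span` class, its `modified` flag and the helper functions are hypothetical names chosen for this example; mistletoe's tokens did not carry this information at the time of the comment.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Toy token with location information (hypothetical, for illustration)."""
    kind: str              # "text" or "emphasis"
    start: int             # start offset in the source, inclusive
    end: int               # end offset in the source, exclusive
    content: str
    modified: bool = False


def render_fresh(token):
    """Fallback render method, used only for modified tokens."""
    if token.kind == "emphasis":
        return "*" + token.content + "*"
    return token.content


def render_with_source(source, tokens):
    parts = []
    for token in tokens:
        if token.modified:
            parts.append(render_fresh(token))                # rule 2: re-render
        else:
            parts.append(source[token.start:token.end])      # rule 1: copy verbatim
    return "".join(parts)


source = "Some _old_ text"
tokens = [
    Span("text", 0, 5, "Some "),
    Span("emphasis", 5, 10, "old"),
    Span("text", 10, 15, " text"),
]

unchanged = render_with_source(source, tokens)
print(unchanged)  # Some _old_ text

# Edit the emphasis token; only that region is re-rendered.
tokens[1].content = "new"
tokens[1].modified = True
edited = render_with_source(source, tokens)
print(edited)  # Some *new* text
```

Untouched regions keep the author's original `_` delimiters, while the edited token falls back to whatever the render method emits.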

@nickovs

nickovs commented Jun 20, 2018

OK. I have a naive version working for the documents that I care about. I will get it to a state where parsing the samples in the tests and parsing my rendered versions of them produce the same result, and then I'll send it to you.

@miyuchina
Owner

@nickovs no rush of course, but I'd love to include your Markdown renderer in version 0.7.1, which I plan to release this coming weekend. Do you think it can be finished before then, or do you think we should give it more time?

@nickovs

nickovs commented Jun 25, 2018

It looks like I missed the 0.7.1 release window! What I have is somewhat untested but works for my purposes. I’ll send you a PR of what I’ve got when I get back to my computer and you can give me your comments.

@gruns

gruns commented Jul 4, 2018

I'm thinking about adding location information to each token, e.g.,
a Paragraph knows it has lines 3-6 of the original document, and an
Emphasis knows it's characters 12-20.

This information is required, in some capacity, to preserve tokens
with ambiguous Markdown representations, like headers, emphasis, list
item prefixes, etc. Without it, there's no way to preserve the
input's character choice. E.g. mistletoe can't know whether to render
the input **Strong** as **Strong** (correct) or __Strong__
(incorrect).

@nickovs Any progress on your PR? And how does your implementation
handle the above situation?

@miyuchina
Owner

Sorry for the late reply, I've been busy with other commitments for the past half month. Hopefully in the next week or so I can squeeze in some time to work on this feature.

I already have two commits on a local branch implementing location information. There are tricky cases, and I still need to think about how they fit together in the Markdown renderer. This is just to say that I'm working on it, and will keep posting updates to this thread.

@Jyhess

Jyhess commented Mar 28, 2019

Hi, any news on this feature?
Like @nickovs, we are documenting our project with Markdown, and we need a parser to extract or add some information. Mistletoe is great for parsing, with an easily manipulable data tree (thanks for this work). We just need a way to write the modified structure back out.
I don't have time to write it myself yet, but I can test it and provide feedback.

@matthubb

2 years later bump?

This is the most promising thread I could find for a Markdown -> AST -> Markdown solution, but nothing published so far?

@chrisjsewell
Contributor

Heya, just to note https://github.com/executablebooks/markdown-it-py provides a Markdown -> Markdown renderer via https://github.com/executablebooks/mdformat

@pbodnar
Collaborator

pbodnar commented Sep 18, 2021

@chrisjsewell, that looks promising, thanks for the tip. 👍 I think it would help you if you mentioned this, or how to use different renderers (which ones?) generally, somewhere at the top of your docs for markdown-it-py. I've searched through them quickly and I couldn't find much info on that topic.

@chrisjsewell
Contributor

Yeh no worries it's on the todo list 😅 executablebooks/markdown-it-py#10 (comment)

@pbodnar
Collaborator

pbodnar commented Jun 24, 2022

A brief summary and feedback after some time:

I'm thinking about adding location information to each token, e.g.,
a Paragraph knows it has lines 3-6 of the original document, and an
Emphasis knows it's characters 12-20.

This information is required, in some capacity, to preserve tokens with ambiguous Markdown representations, like headers, emphasis, list item prefixes, etc. Without it, there's no way to preserve the input's character choice. E.g. mistletoe can't know whether to render the input **Strong** as **Strong** (correct) or __Strong__ (incorrect).

So far the use cases presented here, like this one, seem NOT to need any location information? Instead, it should be sufficient (or even required) to know what enclosing characters were used in the input for a given token (which should be relatively easy to do). OTOH location information (BTW a feature freshly requested in #144) would be useful if we wanted to keep the original text 100% untouched (which might be quite a challenge)? Please let me know if I have overlooked anything here.

@nickovs Any progress on your PR? And how does your implementation handle the above situation?

Unfortunately, it looks like there are no branches or PRs available yet. So we would either have to start from scratch, or take inspiration from other projects. ;)

@anderskaplan
Contributor

I'd like to see this too! In particular, to get as close as possible to a bit-perfect roundtrip. The use case would be to use it for translation.

I'd be happy to contribute this. Can't make any promises as to when it will be finished, but I've done some research and I think it should be possible.

The approach would be to add the necessary information (e.g., if '_' or '*' was used for emphasis) to the tokens, and then create a new renderer class.
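
As a rough illustration of that approach, the sketch below stores the delimiter character on each token at parse time and reuses it when rendering. The `delimiter` attribute and the toy classes are assumptions made for this example and are not taken from the PR; the real implementation may name and structure things differently.

```python
from dataclasses import dataclass


# Toy token classes for illustration only (not mistletoe's classes).
@dataclass
class RawText:
    content: str


@dataclass
class Emphasis:
    children: list
    delimiter: str = "*"   # hypothetical: character seen in the input


@dataclass
class Strong:
    children: list
    delimiter: str = "*"   # hypothetical: character seen in the input


def render(token):
    """Render a token using the delimiter character recorded on it."""
    if isinstance(token, RawText):
        return token.content
    inner = "".join(render(child) for child in token.children)
    width = 2 if isinstance(token, Strong) else 1
    marker = token.delimiter * width
    return marker + inner + marker


# A parser would fill in `delimiter`; here it is set by hand.
underscored = Strong([RawText("Strong")], delimiter="_")
nested = Strong([Emphasis([RawText("foo")], delimiter="_")], delimiter="*")

print(render(underscored))  # __Strong__
print(render(nested))       # **_foo_**
```

Because the renderer echoes whatever the input used, the `**_foo_**` case from earlier in the thread round-trips without any special handling of nesting.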

@huguesdevimeux

Hello,

Sorry, I'm late to the party. I'm working on this feature (no promises at all) for a personal project, and this thread is the closest one I could find on AST → MD in Python.

For reference, such a renderer has already been coded in JS here by @DamonOehlman. Most of the logic can be found here.
That being said, the issue @miyuchina mentioned is seemingly not fixed by this renderer.

I will give implementing this a try.

@pbodnar
Collaborator

pbodnar commented Sep 11, 2022

@huguesdevimeux, thanks for your contribution to this topic.

Just be aware that @anderskaplan is currently probably working on this as well, while also greatly helping us fix many other things "on the way", so I'm not sure how far he actually got with this one (no published branch for this yet?)

For reference, such a renderer has already been coded in JS here by @DamonOehlman. Most of the logic can be found here.
That being said, the issue @miyuchina mentioned is seemingly not fixed by this renderer.

Just checked, I can confirm the linked JS renderer does seem like the basic "naive" implementation, i.e. not considering types of headings or strong texts from the original markdown text. As suggested by me and confirmed by @anderskaplan just above, these cases shouldn't be that difficult to cover by extending the AST, not sure about the rest - but I still think we don't need to keep all the original formatting...

@anderskaplan
Contributor

@huguesdevimeux just so you know, I will soon put up a PR for this. I've got it working for everything except tables. As I wrote above, I'm aiming for a near-perfect roundtrip. Some whitespace will be lost, that's inevitable, but apart from that the rendered document should look just like the input. As it happens, this approach solves the problem that @miyuchina mentioned above!

But the PR builds on top of some other PRs, so those will have to go in first.

I can publish a draft PR if you'd like to see it, and maybe try it out. Probably sometime later this week.

@huguesdevimeux

@huguesdevimeux just so you know, I will soon put up a PR for this. I've got it working for everything except tables. As I wrote above, I'm aiming for a near-perfect roundtrip. Some whitespace will be lost, that's inevitable, but apart from that the rendered document should look just like the input. As it happens, this approach solves the problem that @miyuchina mentioned above!

But the PR builds on top of some other PRs, so those will have to go in first.

I can publish a draft PR if you'd like to see it, and maybe try it out. Probably sometime later this week.

Ok, then, perfect. I'm curious to see what you did, though :).

@anderskaplan
Contributor

@huguesdevimeux hi, I've just created a draft PR for this. Please check it out and let me know how it works for you!

@mikez

mikez commented Nov 18, 2022

+1 on rendering back to Markdown. :)

For my use case, it would be useful if the location of references and footnotes were preserved in the ast.
Why: Sometimes, there may be two different lists of footnotes: a notes section [^a], [^b], [^c], ... and a references section [^1], [^2], [^3], akin to how Wikipedia has it.

@anderskaplan
Contributor

Removed the draft status on the PR now.

@pbodnar pbodnar linked a pull request Jun 10, 2023 that will close this issue
@pbodnar pbodnar added this to the 1.1.0 milestone Jun 10, 2023
@pbodnar
Collaborator

pbodnar commented Jun 10, 2023

@ALL, the PR has been merged into the master branch and it will be available in the coming release. 🎉 Testing and feedback are welcome. :)

@pbodnar pbodnar closed this as completed Jun 10, 2023
@lhayhurst
Author

(OP here). Amazing! Incredible fortitude seeing this 6.5 year old ticket through to completion. 🥳

@mikez

mikez commented Jun 10, 2023

@anderskaplan @pbodnar
🎉 Tested and works as expected. :)

Minor remark

Consider this markdown text:

lorem[^a] ipsum[^b].

## Notes

[^a]: dolor
[^b]: sit amet

When trying to traverse the ast, I was confused why [^a] turns into a LinkReferenceDefinition, but [^b] is turned into a RawText and merged with "ipsum" to ipsum[^b].

@pbodnar
Collaborator

pbodnar commented Jun 13, 2023

@mikez, thanks for your feedback. :)

Regarding your remark, maybe you could file an issue describing the problem in more detail? Note that mistletoe still doesn't support "classical footnotes" as given in your example - see #47.

@mikez

mikez commented Jun 13, 2023

@pbodnar Thank you for the clarification. Markdown Extra and MultiMarkdown have footnotes, but CommonMark and GitHub Flavored Markdown (GFM) do not at this time. You follow CommonMark, so now I understand why my example can have unpredictable behavior.

@pbodnar
Collaborator

pbodnar commented Dec 17, 2023

Regarding competing, ready-made markdown renderers like this MarkdownRenderer, I've just found out that the markdown-it-py project actually also has one: they have it in a separate Python package called mdformat which can be used on its own, or together with the MarkdownIt API as described here. It would be interesting to compare the 2 renderers...

UPDATE: I've just realized the existence of mdformat was already mentioned above. :)
