python-markdownify

Author	SHA1	Message	Date
Chris Papademetrious	48724e7002	support backticks in <code> spans (#226) (#230) Signed-off-by: chrispy <chrispy@synopsys.com>	2025-06-29 14:56:21 -04:00
Chris Papademetrious	9b1412aa5b	implement a strip_pre configuration option (#218) (#222) Signed-off-by: chrispy <chrispy@synopsys.com>	2025-06-14 16:37:47 -04:00
Colin	0e1a849346	Add conversion support for <q> tags (#217)	2025-04-28 06:37:33 -04:00
Chris Papademetrious	e29de4e753	make convert_hn() public instead of internal (#213) Signed-off-by: chrispy <chrispy@synopsys.com>	2025-04-20 06:20:01 -04:00
Chris Papademetrious	618747c18c	in inline contexts, resolve <br/> to a space instead of an empty string (#202) Signed-off-by: chrispy <chrispy@synopsys.com>	2025-03-04 07:37:22 -05:00
Chris Papademetrious	5122c973c1	add missing newlines for definition lists (#200) Signed-off-by: chrispy <chrispy@synopsys.com>	2025-03-02 06:42:56 -05:00
itmammoth	ac5736f0a3	Support `video` tag with `poster` attribute (#189)	2025-02-28 10:51:42 +01:00
Joseph Myers	3311f4d896	Avoid stripping nonbreaking spaces (#188)	2025-02-19 07:40:53 -05:00
Chris Papademetrious	3026602686	make conversion non-destructive to soup; improve div/article/section handling (#184) Signed-off-by: chrispy <chrispy@synopsys.com>	2025-02-04 18:09:24 -05:00
Chris Papademetrious	ae0597d80c	remove superfluous leading/trailing whitespace (#181)	2025-01-27 11:55:32 -05:00
Chris Papademetrious	dbb5988802	add blank line before/after preformatted block (#179) Signed-off-by: chrispy <chrispy@synopsys.com>	2025-01-21 11:01:11 -05:00
Chris Papademetrious	f24ec9e83c	add blank line before ATX-style headings to avoid ambiguity (#178) Signed-off-by: chrispy <chrispy@synopsys.com>	2025-01-21 11:00:51 -05:00
chrispy	0fb855676d	support HTML definition lists (<dl>, <dt>, and <dd>) Signed-off-by: chrispy <chrispy@synopsys.com>	2025-01-18 19:43:28 -05:00
chrispy	17c3678d0e	optimize empty-line handling for li and blockquote content Signed-off-by: chrispy <chrispy@synopsys.com>	2025-01-18 19:25:03 -05:00
Chris Papademetrious	600f77d244	allow a wrap_width value of None for unlimited line lengths (#169) allow a wrap_width value of None to reflow text to unlimited line length	2025-01-18 19:20:22 -05:00
chrispy	71e1471e18	do not construct Markdown links in code spans and code blocks Signed-off-by: chrispy <chrispy@synopsys.com>	2024-12-29 12:33:46 -05:00
AlexVonB	3466061ca9	prevent `<hn>` from calling convert_hn and crashing; fixes #142	2024-11-24 21:20:57 +01:00
AlexVonB	9595618796	prevent very large headline prefixes; for example, `<h9999999>` could crash the conversion. fixes #143	2024-11-24 21:11:42 +01:00
Joseph Myers	340aecbe98	More thorough cleanup of input whitespace This improves the markdownify logic for cleaning up input whitespace that has no semantic significance in HTML. This PR uses a branch based on that for #150 (which in turn is based on that for #120) to avoid conflicts with those fixes. The suggested order of merging is just first to merge #120, then the rest of #150, then the rest of this PR. Whitespace in HTML input isn't generally significant before or after block-level elements, or at the start or end of such an element other than `<pre>`. There is some limited logic in markdownify for removing it, (a) for whitespace-only nodes in conjunction with a limited list of elements (and with questionable logic that only removes whitespace adjacent to such an element when also inside such an element) and (b) only for trailing whitespace, in certain places in relation to lists. Replace both those places with more thorough logic using a common list of block-level elements (which could be expanded more). In general, this reduces the number of unnecessary blank lines in output from markdownify (sometimes lines with just a newline, sometimes lines containing a space as well as that newline). There are open issues about cases where propagating such input whitespace to the output actually results in badly formed Markdown output (wrongly indented output), but #120 (which this builds on) fixes those issues, sometimes leaving unnecessary lines with just a space on them in the output, which are dealt with fully by the present PR. There are a few testcases that are affected because they were relying on such whitespace for good output from bad HTML input that used `<p>` or `<blockquote>` inside header tags. 
To keep reasonable output in those cases of bad input now that input whitespace adjacent to those two tags is ignored, make the `<p>` and `<blockquote>` output explicitly include leading and trailing spaces if `convert_as_inline`; such explicit spaces seem the best that can be done for such bad input. Given those fixes, all the remaining changes needed to the expectations of existing tests seem like improvements (removing useless spaces or newlines from the output).	2024-10-03 20:16:23 +00:00
Joseph Myers	c2ffe46e85	Fix whitespace issues around wrapping This fixes various issues relating to how input whitespace is handled and how wrapping handles whitespace resulting from hard line breaks. This PR uses a branch based on that for #120 to avoid conflicts with the fixes and associated test changes there. My suggestion is thus first to merge #120 (which fixes two open issues), then to merge the remaining changes from this PR. Wrapping paragraphs has the effect of losing all newlines including those from `<br>` tags, contrary to HTML semantics (wrapping should be a matter of pretty-printing the output; input whitespace from the HTML input should be normalized, but `<br>` should remain as a hard line break). To fix this, we need to wrap the portions of a paragraph between hard line breaks separately. For this to work, ensure that when wrapping, all input whitespace is normalized at an early stage, including turning newlines into spaces. (Only ASCII whitespace is handled this way; `\s` is not used as it's not clear Unicode whitespace should get such normalization.) When not wrapping, there is still too much input whitespace preservation. If the input contains a blank line, that ends up as a paragraph break in the output, or breaks the header formatting when appearing in a header tag, though in terms of HTML semantics such a blank line is no different from a space. In the case of an ATX header, even a single newline appearing in the output breaks the Markdown. Thus, when not wrapping, arrange for input whitespace containing at least one `\r` or `\n` to be normalized to a single newline, and in the ATX header case, normalize to a space. Fixes #130 (probably, not sure exactly what the HTML input there is) Fixes #88 (a related case, anyway; the actual input in #88 has already been fixed)	2024-10-03 00:30:50 +00:00
Joseph Myers	4399ee75db	Merge branch 'develop' into para-newlines-92-98	2024-09-30 18:05:32 +00:00
Joseph Myers	7861b330cd	Special-case use of HTML tags for converting `<sub>` / `<sup>` (#119) Allow different strings before / after `<sub>` / `<sup>` content In particular, this allows setting `sub_symbol='<sub>'`, `sup_symbol='<sup>'`, to use raw HTML in the output when converting subscripts and superscripts.	2024-06-23 13:28:05 +02:00
Joseph Myers	60d86663d7	More carefully separate inline text from block content There are various cases in which inline text fails to be separated by (sufficiently many) newlines from adjacent block content. A paragraph needs a blank line (two newlines) separating it from prior text, as does an underlined header; an ATX header needs a single newline separating it from prior text. A list needs at least one newline separating it from prior text, but in general two newlines (for an ordered list starting other than at 1, which will only be recognized given a blank line before). To avoid accumulation of more newlines than necessary, take care when concatenating the results of converting consecutive tags to remove redundant newlines (keeping the greater of the number ending the prior text and the number starting the subsequent text). This is thus an alternative to #108 that tries to avoid the excess newline accumulation that was a concern there, as well as fixing more cases than just paragraphs, and updating tests. Fixes #92 Fixes #98	2024-04-09 16:54:33 +00:00
Joseph Myers	2bd0772685	Avoid inline styles inside `<code>` / `<pre>` conversion (#117) The check used for this is analogous to that used to avoid escaping potential markup characters inside such tags. Fixes #103 Co-authored-by: AlexVonB <AlexVonB@users.noreply.github.com>	2024-04-04 20:55:54 +02:00
Chris Papademetrious	d5fb0fbb85	make sure there are blank lines around table/figure captions (#114) Signed-off-by: chrispy <chrispy@synopsys.com> Co-authored-by: AlexVonB <AlexVonB@users.noreply.github.com>	2024-03-26 21:41:56 +01:00
André van Delft	2f9a42d3b8	Strip text before adding blockquote markers (#76)	2024-03-26 21:07:28 +01:00
Veronika Butkevich	f33ccd7c1a	Fix newline start in header tags (#89)	2024-03-26 20:46:30 +01:00
Thomas L. Kjeldsen	60967c1c95	ignore script and style content (such as css and javascript) (#112)	2024-03-11 21:07:24 +01:00
chrispy	2b22d239ad	avoid text normalization/escaping in any preformatted/code context Signed-off-by: chrispy <chrispy@synopsys.com>	2024-01-15 10:53:14 -05:00
Adam Bambuch	17d8586843	don't escape text in pre tag (Fenced Code Blocks) (#67)	2022-08-28 20:58:54 +02:00
AlexVonB	5f1b98e25d	added wrap option closes #66	2022-04-24 11:00:04 +02:00
AlexVonB	35479d2d3b	Merge branch 'code_language_callback' of https://github.com/tdgroot/python-markdownify into tdgroot-code_language_callback	2022-04-13 20:25:37 +02:00
AlexVonB	423b7e948c	add option to allow inline images in selected tags fixes #61	2022-04-13 19:55:34 +02:00
Timon de Groot	0ea95de4d0	Add code language callback	2022-04-09 13:22:28 +02:00
AlexVonB	cb2646cd93	differentiated between text and code language	2021-11-17 17:03:31 +01:00
AlexVonB	9692b5e714	satisfy linter	2021-11-17 16:55:00 +01:00
Umberto Grando	ac68c53a7d	added language for multiline code	2021-11-01 21:19:35 +01:00
AlexVonB	16d8a0e1f7	Revert "add figure/figcaption" This reverts commit `828e116530`.	2021-07-11 13:12:16 +02:00
AlexVonB	4aa6cf2a24	rewrote text processing to not escape _ in code fixes #47	2021-07-11 13:10:59 +02:00
AlexVonB	828e116530	add figure/figcaption for #46	2021-06-30 13:02:42 +02:00
AlexVonB	62e9f0de02	add examples for custom converters closes #46	2021-06-27 15:53:23 +02:00
AlexVonB	a6a31624ad	add options for sub and sup tags fixes #44	2021-05-30 19:07:43 +02:00
AlexVonB	6f3732307d	restructured test files	2021-05-30 19:06:52 +02:00
AlexVonB	8f6d7e500d	add option 'default_title' to links fixes #39	2021-05-30 18:40:40 +02:00
AlexVonB	70ef9b6e48	added pre tag closes #15	2021-05-21 14:15:41 +02:00
AlexVonB	079f32f6cd	added del and s tags	2021-05-21 12:27:49 +02:00
AlexVonB	77797ebb79	Merge branch 'andrewcrichards/add_code_samp_kbd_tags' of https://github.com/AndrewCRichards/python-markdownify into AndrewCRichards-andrewcrichards/add_code_samp_kbd_tags	2021-05-21 12:11:59 +02:00
AlexVonB	ea81407b87	implemented table parsing correctly: instead of manually walking down the DOM tree in a table, we now rely on the main descent loop and just implement conversion for rows and cells correctly. This enables the use of HTML inside a table cell.	2021-05-17 14:00:00 +02:00
AlexVonB	e6da15c173	allow tables with headers in first (or any) column	2021-05-17 12:36:48 +02:00
AlexVonB	7dac92e85e	Allow for tables without header row fixes #42	2021-05-16 19:02:04 +02:00
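
Commit 60d86663d7 above describes joining the converted output of consecutive tags while keeping "the greater of the number ending the prior text and the number starting the subsequent text" of newlines. The following is a minimal standalone sketch of that rule, not the library's actual code; the function names `count_leading_newlines`, `count_trailing_newlines`, and `join_chunks` are hypothetical and chosen only for illustration.

```python
def count_trailing_newlines(text: str) -> int:
    """Number of newline characters at the very end of text."""
    return len(text) - len(text.rstrip("\n"))


def count_leading_newlines(text: str) -> int:
    """Number of newline characters at the very start of text."""
    return len(text) - len(text.lstrip("\n"))


def join_chunks(chunks):
    """Concatenate converted chunks without piling up newlines.

    At each boundary, keep max(trailing newlines of the text so far,
    leading newlines of the next chunk) rather than their sum.
    Hypothetical helper; assumes chunks are already-converted strings.
    """
    result = ""
    for chunk in chunks:
        if not chunk:
            continue
        keep = max(count_trailing_newlines(result),
                   count_leading_newlines(chunk))
        result = result.rstrip("\n") + "\n" * keep + chunk.lstrip("\n")
    return result


# Inline text, then a paragraph, then an ATX heading: each block asks
# for blank lines around itself, but only one blank line survives
# between neighbours.
joined = join_chunks(["Intro text", "\n\nA paragraph.\n\n", "\n\n# Heading\n\n"])
print(repr(joined))
```

Taking the maximum rather than the sum means a paragraph that requests a blank line before itself still gets one after inline text, while two adjacent blocks that each request a blank line do not produce two.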

96 Commits