133 Commits

Author SHA1 Message Date
Gareth Jones
85ef82e083 Add basic type stubs (#221) (#215)
* feat: add basic type stubs

* feat: add types for constants

* feat: add type for `MarkdownConverter` class

* ci: add basic job for checking types

* feat: add new constant

* ci: install types as required

* ci: install types package manually

* test: add strict coverage for types

* fix: allow `strip_document` to be `None`

* feat: expand types for MarkdownConverter

* fix: do not use `Unpack` as it requires Python 3.12

* feat: define `MarkdownConverter#convert_soup`

* feat: improve type for `code_language_callback`

* chore: add end-of-file newline

* refactor: use `Union` for now
2025-08-03 06:35:46 -04:00
Chris Papademetrious
48724e7002 support backticks in <code> spans (#226) (#230)
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-06-29 14:56:21 -04:00
Chris Papademetrious
9b1412aa5b implement a strip_pre configuration option (#218) (#222)
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-06-14 16:37:47 -04:00
Chris Papademetrious
75ab3064dd allow BeautifulSoup configuration kwargs to be specified (#224)
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-06-14 09:06:22 -04:00
Chris Papademetrious
016251e915 ensure that explicitly provided heading conversion functions are used (#212) (#214)
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-05-03 10:57:09 -04:00
Colin
0e1a849346 Add conversion support for <q> tags (#217) 2025-04-28 06:37:33 -04:00
Chris Papademetrious
e29de4e753 make convert_hn() public instead of internal (#213)
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-04-20 06:20:01 -04:00
Stephen V. Brown
7908f1492a Generalize handling of colspan in case where colspan is in first row but header row is missing (#203) 2025-03-04 20:01:16 -05:00
Chris Papademetrious
618747c18c in inline contexts, resolve <br/> to a space instead of an empty string (#202)
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-03-04 07:37:22 -05:00
Chris Papademetrious
5122c973c1 add missing newlines for definition lists (#200)
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-03-02 06:42:56 -05:00
itmammoth
ac5736f0a3 Support video tag with poster attribute (#189) 2025-02-28 10:51:42 +01:00
Joseph Myers
c7329ac1ef Escape right square brackets (#187) 2025-02-19 10:04:29 -05:00
Joseph Myers
3311f4d896 Avoid stripping nonbreaking spaces (#188) 2025-02-19 07:40:53 -05:00
Chris Papademetrious
5655f27208 propagate parent tag context downward to improve runtime (#191) 2025-02-18 16:35:36 -05:00
Chris Papademetrious
3026602686 make conversion non-destructive to soup; improve div/article/section handling (#184)
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-02-04 18:09:24 -05:00
Chris Papademetrious
c52a50e66a when computing <ol><li> numbering, ignore non-<li> previous siblings (#183)
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-02-04 15:39:32 -05:00
Chris Papademetrious
ae0597d80c remove superfluous leading/trailing whitespace (#181) 2025-01-27 11:55:32 -05:00
Chris Papademetrious
dbb5988802 add blank line before/after preformatted block (#179)
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-01-21 11:01:11 -05:00
Chris Papademetrious
f24ec9e83c add blank line before ATX-style headings to avoid ambiguity (#178)
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-01-21 11:00:51 -05:00
Fess-AKA-DeadMonk
1b3333073a for convert_* functions, allow for tags with special characters in their name (like "subtag-name") (#136)
support custom conversion functions for tags with `:` and `-` characters in their names by mapping them to underscores in the function name
2025-01-19 09:48:08 -05:00
SomeBottle
3bf0b527a4 Add a new configuration option to control tabler header row inference (#161)
Add option to infer first table row as table header (defaults to false)
2025-01-19 08:13:24 -05:00
chrispy
0fb855676d support HTML definition lists (<dl>, <dt>, and <dd>)
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-01-18 19:43:28 -05:00
chrispy
17c3678d0e optimize empty-line handling for li and blockquote content
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-01-18 19:25:03 -05:00
Chris Papademetrious
600f77d244 allow a wrap_width value of None for unlimited line lengths (#169)
allow a wrap_width value of None to reflow text to unlimited line length
2025-01-18 19:20:22 -05:00
Chris Papademetrious
9339571ae9 Merge pull request #167 from chrispy-snps/chrispy/table-caption-blank-line
insert a blank line between table caption, table content
2025-01-18 19:09:24 -05:00
chrispy
1009087d41 insert a blank line between table caption, table content
Signed-off-by: chrispy <chrispy@synopsys.com>
2024-12-29 13:52:32 -05:00
chrispy
71e1471e18 do not construct Markdown links in code spans and code blocks
Signed-off-by: chrispy <chrispy@synopsys.com>
2024-12-29 12:33:46 -05:00
AlexVonB
3466061ca9 prevent <hn> to call convert_hn and crash
fixes #142
2024-11-24 21:20:57 +01:00
AlexVonB
9595618796 prevent very large headline prefixes
for example: `<h9999999>` could crash the conversion.

fixes #143
2024-11-24 21:11:42 +01:00
AlexVonB
19780834af Merge branch 'alfonsrv-fix-pr-118' into jsm28-list-indentation 2024-11-24 12:07:59 +01:00
AlexVonB
9202027e26 ignore bs4 warnings in tests 2024-11-24 12:00:27 +01:00
AlexVonB
9bf4ff14b9 Merge branch 'jsm28-selective-escaping' into jsm28-list-indentation 2024-11-20 14:16:06 +01:00
alfonsrv
7ff4d835ae Set escape_misc to False by default to improve backwards compatibility 2024-10-09 18:55:50 +02:00
Joseph Myers
c13bdd5c14 Fix logic for indentation inside list items
This fixes problems with the markdownify logic for indentation inside
list items.

This PR uses a branch building on that for #120, #150 and #151, so
those three PRs should be merged first before merging this one.

There is limited logic in markdownify for handling indentation in the
case of nested lists.  There are two major problems with this logic:

* As it's in `convert_list`, causing a list to be indented when inside
  another list, it does not add indentation for any other elements
  such as paragraphs that may be found inside list items (or `<pre>`,
  `<blockquote>`, etc.), so such elements are wrongly not indented and
  terminate the list in the output.

* It uses fixed indentation of one tab.  Following CommonMark, a tab
  in Markdown is considered equivalent to four spaces, which is not
  sufficient indentation in ordered list items with a number of three
  or more digits.

Fix both of these issues by making `convert_li` handle indentation for
the contents of `<li>`, based on the length of the list item marker,
rather than doing it in `convert_list` at all.
2024-10-03 21:04:40 +00:00
Joseph Myers
340aecbe98 More thorough cleanup of input whitespace
This improves the markdownify logic for cleaning up input whitespace
that has no semantic significance in HTML.

This PR uses a branch based on that for #150 (which in turn is based
on that for #120) to avoid conflicts with those fixes.  The suggested
order of merging is just first to merge #120, then the rest of #150,
then the rest of this PR.

Whitespace in HTML input isn't generally significant before or after
block-level elements, or at the start of end of such an element other
than `<pre>`.  There is some limited logic in markdownify for removing
it, (a) for whitespace-only nodes in conjunction with a limited list
of elements (and with questionable logic that ony removes whitespace
adjacent to such an element when also inside such an element) and (b)
only for trailing whitespace, in certain places in relation to lists.

Replace both those places with more thorough logic using a common list
of block-level elements (which could be expanded more).

In general, this reduces the number of unnecessary blank lines in
output from markdownify (sometimes lines with just a newline,
sometimes lines containing a space as well as that newline).  There
are open issues about cases where propagating such input whitespace to
the output actually results in badly formed Markdown output (wrongly
indented output), but #120 (which this builds on) fixes those issues,
sometimes leaving unnecessary lines with just a space on them in the
output, which are dealt with fully by the present PR.

There are a few testcases that are affected because they were relying
on such whitespace for good output from bad HTML input that used `<p>`
or `<blockquote>` inside header tags.  To keep reasonable output in
those cases of bad input now input whitespace adjacent to those two
tags is ignored, make the `<p>` and `<blockquote>` output explicitly
include leading and trailing spaces if `convert_as_inline`; such
explicit spaces seem the best that can be done for such bad input.
Given those fixes, all the remaining changes needed to the
expectations of existing tests seem like improvements (removing
useless spaces or newlines from the output).
2024-10-03 20:16:23 +00:00
Joseph Myers
c2ffe46e85 Fix whitespace issues around wrapping
This fixes various issues relating to how input whitespace is handled
and how wrapping handles whitespace resulting from hard line breaks.

This PR uses a branch based on that for #120 to avoid conflicts with
the fixes and associated test changes there.  My suggestion is thus
first to merge #120 (which fixes two open issues), then to merge the
remaining changes from this PR.

Wrapping paragraphs has the effect of losing all newlines including
those from `<br>` tags, contrary to HTML semantics (wrapping should be
a matter of pretty-printing the output; input whitespace from the HTML
input should be normalized, but `<br>` should remain as a hard line
break).  To fix this, we need to wrap the portions of a paragraph
between hard line breaks separately.  For this to work, ensure that
when wrapping, all input whitespace is normalized at an early stage,
including turning newlines into spaces.  (Only ASCII whitespace is
handled this way; `\s` is not used as it's not clear Unicode
whitespace should get such normalization.)

When not wrapping, there is still too much input whitespace
preservation.  If the input contains a blank line, that ends up as a
paragraph break in the output, or breaks the header formatting when
appearing in a header tag, though in terms of HTML semantics such a
blank line is no different from a space.  In the case of an ATX
header, even a single newline appearing in the output breaks the
Markdown.  Thus, when not wrapping, arrange for input whitespace
containing at least one `\r` or `\n` to be normalized to a single
newline, and in the ATX header case, normalize to a space.

Fixes #130
(probably, not sure exactly what the HTML input there is)

Fixes #88
(a related case, anyway; the actual input in #88 has already been fixed)
2024-10-03 00:30:50 +00:00
Joseph Myers
a369e07211 More selective escaping of -#.) (alternative approach)
This is a partial alternative to #122 (open since April) for more
selective escaping of some special characters.

Here, we fix the test function naming (as noted in that PR) so the
tests are actually run (and fix some incorrect test assertions so they
pass).  We also make escaping of `-#.)` (the most common cases of
unnecessary escaping in my use case) more selective, while still being
conservatively safe in escaping all cases of those characters that
might have Markdown significance (including in the presence of
wrapping, unlike in #122).  (Being conservatively safe doesn't include
the cases where `.` or `)` start a fragment, where the existing code
already was not conservatively safe.)

There are certainly more cases where the code could also be made more
selective while remaining conservatively safe (including in the
presence of wrapping), so this is not a complete replacement for #122,
but by fixing some of the most common cases in a safe way, and getting
the tests actually running, I hope this allows progress to be made
where the previous attempt appears to have stalled, while still
allowing further incremental progress with appropriately safe logic
for other characters where useful.
2024-10-02 21:59:39 +00:00
Joseph Myers
4399ee75db Merge branch 'develop' into para-newlines-92-98 2024-09-30 18:05:32 +00:00
AlexVonB
0a5c89aa49 added test for ol start check 2024-06-23 14:30:07 +02:00
Joseph Myers
7861b330cd Special-case use of HTML tags for converting <sub> / <sup> (#119)
Allow different strings before / after `<sub>` / `<sup>` content

In particular, this allows setting `sub_symbol='<sub>'`,
`sup_symbol='<sup>'`, to use raw HTML in the output when
converting subscripts and superscripts.
2024-06-23 13:28:05 +02:00
AlexVonB
2ec33384de handle un-parsable colspan values
fixes #126
2024-06-23 13:17:20 +02:00
Joseph Myers
60d86663d7 More carefully separate inline text from block content
There are various cases in which inline text fails to be separated by
(sufficiently many) newlines from adjacent block content.  A paragraph
needs a blank line (two newlines) separating it from prior text, as
does an underlined header; an ATX header needs a single newline
separating it from prior text.  A list needs at least one newline
separating it from prior text, but in general two newlines (for an
ordered list starting other than at 1, which will only be recognized
given a blank line before).

To avoid accumulation of more newlines than necessary, take care when
concatenating the results of converting consecutive tags to remove
redundant newlines (keeping the greater of the number ending the prior
text and the number starting the subsequent text).

This is thus an alternative to #108 that tries to avoid the excess
newline accumulation that was a concern there, as well as fixing more
cases than just paragraphs, and updating tests.

Fixes #92

Fixes #98
2024-04-09 16:54:33 +00:00
Joseph Myers
46af45bb3c Escape all characters with Markdown significance (#118)
* Escape all characters with Markdown significance

There are many punctuation characters that sometimes have significance
in Markdown; more systematically escape them all (based on a new
escape_misc configuration option).

A limited attempt is made to limit the escaping of '.' and ')' to the
context where they might have Markdown significance (after a number,
where they can indicate an ordered list item); no such attempt is made
for the other characters (and even that limiting of '.' and ')' may
not be entirely safe in all cases, as it's possible the HTML could
have the number outside the block being escaped in one go,
e.g. `<span>1</span>.`.

---------

Co-authored-by: AlexVonB <AlexVonB@users.noreply.github.com>
2024-04-04 21:42:58 +02:00
Joseph Myers
2bd0772685 Avoid inline styles inside <code> / <pre> conversion (#117)
* Avoid inline styles inside `<code>` / `<pre>` conversion

The check used for this is analogous to that used to avoid escaping
potential markup characters inside such tags.

Fixes #103

---------

Co-authored-by: AlexVonB <AlexVonB@users.noreply.github.com>
2024-04-04 20:55:54 +02:00
Eric Xu
3b4a014f25 Table merge cell horizontally (#110)
* Fix #109 Table merge cell horizontally

* Add test case for colspan

---------

Co-authored-by: AlexVonB <AlexVonB@users.noreply.github.com>
2024-03-26 21:50:54 +01:00
AlexVonB
57d4f37923 fixed tests for table caption 2024-03-26 21:43:25 +01:00
Chris Papademetrious
d5fb0fbb85 make sure there are blank lines around table/figure captions (#114)
Signed-off-by: chrispy <chrispy@synopsys.com>
Co-authored-by: AlexVonB <AlexVonB@users.noreply.github.com>
2024-03-26 21:41:56 +01:00
huuya
e4df41225d Support conversion of header rows in tables without th tag (#83)
* Fixed support for header row conversion for tables without th tag
2024-03-26 21:32:36 +01:00
André van Delft
2f9a42d3b8 Strip text before adding blockquote markers (#76) 2024-03-26 21:07:28 +01:00
AlexVonB
96a25cfbf3 added tests for linebreaks in table cells 2024-03-26 21:05:31 +01:00