This improves the markdownify logic for cleaning up input whitespace
that has no semantic significance in HTML.
This PR's branch is based on the branch for #150 (which in turn is
based on the branch for #120) to avoid conflicts with those fixes. The
suggested merge order is #120 first, then the rest of #150, then the
rest of this PR.
Whitespace in HTML input isn't generally significant before or after
block-level elements, or at the start or end of such an element other
than `<pre>`. There is some limited logic in markdownify for removing
it, (a) for whitespace-only nodes in conjunction with a limited list
of elements (and with questionable logic that only removes whitespace
adjacent to such an element when also inside such an element) and (b)
only for trailing whitespace, in certain places in relation to lists.
Replace both those places with more thorough logic using a common list
of block-level elements (which could be expanded more).
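As a rough illustration of the rule described above (not the actual
markdownify code), the sketch below drops or trims whitespace next to
block-level elements and at the start or end of a block-level parent,
leaving `<pre>` content alone. The `BLOCK_ELEMENTS` set, the
`(kind, value)` node model, and both function names are assumptions
made for this example.

```python
# Hypothetical, deliberately short list; a real list could be expanded.
BLOCK_ELEMENTS = {
    "p", "div", "blockquote", "ul", "ol", "li", "table", "pre",
    "h1", "h2", "h3", "h4", "h5", "h6",
}

ASCII_WS = " \t\r\n"


def is_block(node):
    """A node is a (kind, value) pair; kind is "text" or "tag"."""
    kind, value = node
    return kind == "tag" and value in BLOCK_ELEMENTS


def strip_insignificant_whitespace(children, parent):
    """Trim whitespace that has no meaning next to block-level elements."""
    if parent == "pre":
        # Whitespace inside <pre> is significant, so keep it untouched.
        return list(children)
    parent_is_block = parent in BLOCK_ELEMENTS
    last = len(children) - 1
    result = []
    for i, (kind, value) in enumerate(children):
        if kind == "text":
            if i == 0:
                after_block = parent_is_block  # start of a block element
            else:
                after_block = is_block(children[i - 1])
            if i == last:
                before_block = parent_is_block  # end of a block element
            else:
                before_block = is_block(children[i + 1])
            if after_block:
                value = value.lstrip(ASCII_WS)
            if before_block:
                value = value.rstrip(ASCII_WS)
            if not value:
                continue  # whitespace-only node with nothing left
        result.append((kind, value))
    return result
```

Whitespace between two inline nodes is left alone here, since it may
still separate words.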
In general, this reduces the number of unnecessary blank lines in
output from markdownify (sometimes lines with just a newline,
sometimes lines containing a space as well as that newline). There
are open issues about cases where propagating such input whitespace to
the output actually results in badly formed Markdown output (wrongly
indented output), but #120 (which this builds on) fixes those issues,
sometimes leaving unnecessary lines with just a space on them in the
output, which are dealt with fully by the present PR.
There are a few testcases that are affected because they were relying
on such whitespace for good output from bad HTML input that used `<p>`
or `<blockquote>` inside header tags. To keep reasonable output for
such bad input now that input whitespace adjacent to those two tags is
ignored, make the `<p>` and `<blockquote>` output explicitly include
leading and trailing spaces if `convert_as_inline` is true; such
explicit spaces seem the best that can be done in these cases.
Given those fixes, all the remaining changes needed to the
expectations of existing tests seem like improvements (removing
useless spaces or newlines from the output).
This fixes various issues relating to how input whitespace is handled
and how wrapping handles whitespace resulting from hard line breaks.
This PR uses a branch based on that for #120 to avoid conflicts with
the fixes and associated test changes there. My suggestion is thus
first to merge #120 (which fixes two open issues), then to merge the
remaining changes from this PR.
Wrapping paragraphs has the effect of losing all newlines including
those from `<br>` tags, contrary to HTML semantics (wrapping should be
a matter of pretty-printing the output; whitespace in the HTML
input should be normalized, but `<br>` should remain as a hard line
break). To fix this, we need to wrap the portions of a paragraph
between hard line breaks separately. For this to work, ensure that
when wrapping, all input whitespace is normalized at an early stage,
including turning newlines into spaces. (Only ASCII whitespace is
handled this way; `\s` is not used as it's not clear Unicode
whitespace should get such normalization.)
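A minimal sketch of the approach described above, assuming the text of
a paragraph has already been split into pieces at each `<br>`. The
function names and the `hard_break` default (two spaces plus a newline,
one of the Markdown hard line break forms) are assumptions for this
example, not the markdownify implementation.

```python
import re
import textwrap

# Only ASCII whitespace is normalized; `\s` would also match Unicode
# whitespace such as U+00A0, which is deliberately left alone.
ASCII_WHITESPACE = re.compile(r"[\t\n\r ]+")


def normalize_ascii_whitespace(text):
    """Collapse each run of ASCII whitespace (newlines included) to a space."""
    return ASCII_WHITESPACE.sub(" ", text)


def wrap_with_hard_breaks(pieces, width=80, hard_break="  \n"):
    """Wrap each piece between hard line breaks separately, then rejoin.

    `pieces` is the paragraph text split at each <br> (hypothetical input
    shape chosen for this sketch).
    """
    wrapped = []
    for piece in pieces:
        piece = normalize_ascii_whitespace(piece).strip(" ")
        wrapped.append(
            textwrap.fill(
                piece,
                width=width,
                break_long_words=False,
                break_on_hyphens=False,
            )
        )
    return hard_break.join(wrapped)
```

Because each piece is wrapped on its own, the only newlines that
wrapping can remove are those inside a piece, and the hard breaks
between pieces survive.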
When not wrapping, there is still too much input whitespace
preservation. If the input contains a blank line, that ends up as a
paragraph break in the output, or breaks the header formatting when
appearing in a header tag, though in terms of HTML semantics such a
blank line is no different from a space. In the case of an ATX
header, even a single newline appearing in the output breaks the
Markdown. Thus, when not wrapping, arrange for input whitespace
containing at least one `\r` or `\n` to be normalized to a single
newline, and in the ATX header case, normalize to a space.
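The non-wrapping rule can be sketched as follows. The function name and
the `in_atx_header` flag are hypothetical, and collapsing runs without
any newline to a single space is an assumption of this sketch rather
than something stated above.

```python
import re

# Runs of ASCII whitespace only; Unicode whitespace is not touched.
WHITESPACE_RUN = re.compile(r"[\t\n\r ]+")


def normalize_unwrapped(text, in_atx_header=False):
    """Normalize input whitespace when the output is not being wrapped."""

    def replace(match):
        run = match.group(0)
        if not in_atx_header and ("\r" in run or "\n" in run):
            # Any run containing a line break becomes exactly one newline,
            # so a blank line in the input cannot become a paragraph break.
            return "\n"
        # In an ATX header even one newline would break the Markdown, so
        # use a space; runs without a newline also become a space
        # (assumption of this sketch).
        return " "

    return WHITESPACE_RUN.sub(replace, text)
```

For example, `normalize_unwrapped("a \n\n  b   c")` gives `"a\nb c"`,
while the same input with `in_atx_header=True` gives `"a b c"`.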
Fixes #130 (probably; not sure exactly what the HTML input there is).
Fixes #88 (a related case, anyway; the actual input in #88 has already been fixed).
Allow different strings before / after `<sub>` / `<sup>` content
In particular, this allows setting `sub_symbol='<sub>'`,
`sup_symbol='<sup>'`, to use raw HTML in the output when
converting subscripts and superscripts.
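One plausible way to derive a different closing string is sketched
below: if the configured symbol looks like an HTML opening tag, close
it with the matching end tag, otherwise reuse the symbol on both sides.
The helper name `wrap_inline` and this derivation rule are assumptions
for illustration, not a description of the markdownify internals.

```python
def wrap_inline(text, symbol):
    """Wrap converted inline text in an opening and closing string."""
    if not symbol or not text:
        # Nothing configured, or nothing to wrap: leave the text as is.
        return text
    if symbol.startswith("<") and symbol.endswith(">"):
        # Assumed rule: "<sub>" closes with "</sub>".
        closing = "</" + symbol[1:]
    else:
        # Plain markers such as "~" or "^" are the same on both sides.
        closing = symbol
    return symbol + text + closing
```

With this rule, `wrap_inline("2", "<sub>")` yields `<sub>2</sub>` and
`wrap_inline("2", "~")` yields `~2~`.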
There are various cases in which inline text fails to be separated by
(sufficiently many) newlines from adjacent block content. A paragraph
needs a blank line (two newlines) separating it from prior text, as
does an underlined header; an ATX header needs a single newline
separating it from prior text. A list needs at least one newline
separating it from prior text, but in general two newlines (for an
ordered list starting other than at 1, which will only be recognized
given a blank line before).
To avoid accumulation of more newlines than necessary, take care when
concatenating the results of converting consecutive tags to remove
redundant newlines (keeping the greater of the number ending the prior
text and the number starting the subsequent text).
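The concatenation rule can be sketched like this: at each join, count
the newlines ending the text so far and the newlines starting the next
part, and keep only the larger count. `join_converted` is a
hypothetical name, and the sketch ignores everything except the newline
handling.

```python
def join_converted(parts):
    """Concatenate converted fragments without accumulating newlines."""
    result = ""
    for part in parts:
        if not part:
            continue
        stripped_result = result.rstrip("\n")
        stripped_part = part.lstrip("\n")
        trailing = len(result) - len(stripped_result)
        leading = len(part) - len(stripped_part)
        # Keep the greater of the two counts rather than their sum.
        keep = max(trailing, leading)
        result = stripped_result + "\n" * keep + stripped_part
    return result
```

Joining `"Intro"`, `"\n\nParagraph\n\n"` and `"\n# Header\n"` this way
gives `"Intro\n\nParagraph\n\n# Header\n"`: the paragraph gets its
blank line before it, and the header does not add a third newline.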
This is thus an alternative to #108 that tries to avoid the excess
newline accumulation that was a concern there, as well as fixing more
cases than just paragraphs, and updating tests.
Fixes #92
Fixes #98
* Avoid inline styles inside `<code>` / `<pre>` conversion
The check used for this is analogous to that used to avoid escaping
potential markup characters inside such tags.
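A minimal sketch of such a check, assuming the converter knows the
names of the ancestor tags of the element being converted. The
`ancestors` list, the `convert_strong` name, and the `**` marker are
illustrative assumptions.

```python
# Tags inside which inline markup should be emitted verbatim.
PREFORMATTED_TAGS = {"code", "pre"}


def inside_preformatted(ancestors):
    """True if any ancestor tag name is <code> or <pre>."""
    return any(name in PREFORMATTED_TAGS for name in ancestors)


def convert_strong(text, ancestors, marker="**"):
    """Apply a bold marker unless the element sits inside <code>/<pre>."""
    if inside_preformatted(ancestors):
        # Markdown markers would show up literally in a code span or
        # block, so return the text unchanged.
        return text
    return marker + text + marker
```

The same predicate could guard escaping and inline styling alike, which
is what makes the two checks analogous.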
Fixes #103
---------
Co-authored-by: AlexVonB <AlexVonB@users.noreply.github.com>
Instead of manually walking down the DOM tree
in a table, we now rely on the main descent loop
and just implement conversion for rows and cells
correctly. This enables the use of HTML inside a
table cell.