This fixes various issues relating to how input whitespace is handled and how wrapping handles whitespace resulting from hard line breaks. This PR uses a branch based on that for #120 to avoid conflicts with the fixes and associated test changes there. My suggestion is thus first to merge #120 (which fixes two open issues), then to merge the remaining changes from this PR. Wrapping paragraphs has the effect of losing all newlines including those from `<br>` tags, contrary to HTML semantics (wrapping should be a matter of pretty-printing the output; input whitespace from the HTML input should be normalized, but `<br>` should remain as a hard line break). To fix this, we need to wrap the portions of a paragraph between hard line breaks separately. For this to work, ensure that when wrapping, all input whitespace is normalized at an early stage, including turning newlines into spaces. (Only ASCII whitespace is handled this way; `\s` is not used as it's not clear Unicode whitespace should get such normalization.) When not wrapping, there is still too much input whitespace preservation. If the input contains a blank line, that ends up as a paragraph break in the output, or breaks the header formatting when appearing in a header tag, though in terms of HTML semantics such a blank line is no different from a space. In the case of an ATX header, even a single newline appearing in the output breaks the Markdown. Thus, when not wrapping, arrange for input whitespace containing at least one `\r` or `\n` to be normalized to a single newline, and in the ATX header case, normalize to a space. Fixes #130 (probably, not sure exactly what the HTML input there is) Fixes #88 (a related case, anyway; the actual input in #88 has already been fixed)
15 lines
305 B
Python
15 lines
305 B
Python
from markdownify import markdownify as md
|
|
|
|
|
|
def test_single_tag():
|
|
assert md('<span>Hello</span>') == 'Hello'
|
|
|
|
|
|
def test_soup():
|
|
assert md('<div><span>Hello</div></span>') == 'Hello'
|
|
|
|
|
|
def test_whitespace():
|
|
assert md(' a b \t\t c ') == ' a b c '
|
|
assert md(' a b \n\n c ') == ' a b\nc '
|