This fixes various issues relating to how input whitespace is handled
and how wrapping handles whitespace resulting from hard line breaks.
This PR uses a branch based on that for #120 to avoid conflicts with
the fixes and associated test changes there. My suggestion is thus
first to merge #120 (which fixes two open issues), then to merge the
remaining changes from this PR.
Wrapping paragraphs has the effect of losing all newlines including
those from `<br>` tags, contrary to HTML semantics (wrapping should be
a matter of pretty-printing the output; input whitespace from the HTML
input should be normalized, but `<br>` should remain as a hard line
break). To fix this, we need to wrap the portions of a paragraph
between hard line breaks separately. For this to work, ensure that
when wrapping, all input whitespace is normalized at an early stage,
including turning newlines into spaces. (Only ASCII whitespace is
handled this way; `\s` is not used as it's not clear Unicode
whitespace should get such normalization.)
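
A minimal sketch of the idea, not the code in this PR: it assumes the paragraph has already been split into the text segments found between `<br>` tags, and the names `wrap_segments` and `_ASCII_WS` are hypothetical. Each segment has its ASCII whitespace normalized and is wrapped on its own, and the segments are then rejoined with a Markdown hard line break (two trailing spaces).

```python
import re
import textwrap

# ASCII whitespace only; \s is avoided because it also matches Unicode whitespace.
_ASCII_WS = re.compile(r'[\t\n\r ]+')


def wrap_segments(segments, width=80):
    """Wrap each segment between hard line breaks separately (illustrative helper)."""
    wrapped = []
    for segment in segments:
        # Input newlines carry no meaning in HTML, so they become plain spaces.
        normalized = _ASCII_WS.sub(' ', segment).strip(' ')
        wrapped.append(textwrap.fill(normalized, width=width,
                                     break_long_words=False,
                                     break_on_hyphens=False))
    # Two trailing spaces before the newline mark a hard line break in Markdown.
    return '  \n'.join(wrapped)
```

With this shape, newlines introduced by wrapping and newlines that come from `<br>` stay distinguishable in the output, which is the property the paragraph above asks for.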
When not wrapping, there is still too much input whitespace
preservation. If the input contains a blank line, that ends up as a
paragraph break in the output, or breaks the header formatting when
appearing in a header tag, though in terms of HTML semantics such a
blank line is no different from a space. In the case of an ATX
header, even a single newline appearing in the output breaks the
Markdown. Thus, when not wrapping, arrange for input whitespace
containing at least one `\r` or `\n` to be normalized to a single
newline, and in the ATX header case, normalize to a space.
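
The non-wrapping rule can be sketched as follows. This is an illustration under assumptions rather than the actual change: `normalize_unwrapped` and its `in_atx_header` flag are hypothetical names, and leaving whitespace runs without a line break untouched is a choice made for this sketch that the description above does not specify.

```python
import re

# Runs of ASCII whitespace; Unicode whitespace is deliberately not matched.
_WS_RUN = re.compile(r'[\t\n\r ]+')


def normalize_unwrapped(text, in_atx_header=False):
    """Collapse whitespace runs that contain a line break (illustrative helper)."""
    def collapse(match):
        run = match.group(0)
        if '\n' in run or '\r' in run:
            # A newline inside an ATX header would break the Markdown,
            # so the run becomes a space there and a single newline elsewhere.
            return ' ' if in_atx_header else '\n'
        # Assumption for this sketch: runs without a line break are kept as-is.
        return run
    return _WS_RUN.sub(collapse, text)
```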
Fixes #130 (probably; it is not clear exactly what the HTML input there is)
Fixes #88 (a related case, anyway; the actual input in #88 has already been fixed)
* Move the metadata from `setup.py` into `setup.cfg`.
Added `pyproject.toml`.
Removed `setup.py` - it is no longer needed.
Got rid of tests erroneously finding their way into the wheel.
* Started populating version automatically from git tags using `setuptools_scm`.
* Migrated the metadata into `PEP 621`-compliant `pyproject.toml`, got rid of `setup.cfg`.
* test build in develop and pull requests
* use static version instead of dynamic git tag info
---------
Co-authored-by: KOLANICH <kolan_n@mail.ru>
Allow different strings before / after `<sub>` / `<sup>` content
In particular, this allows setting `sub_symbol='<sub>'`,
`sup_symbol='<sup>'`, to use raw HTML in the output when
converting subscripts and superscripts.
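
One way to picture "different strings before / after" is the sketch below. It is a standalone illustration, not the library's code: the helper names `inline_delimiters` and `convert_sub` are hypothetical, and deriving the closing string from a tag-shaped symbol is an assumption about how such an option could behave.

```python
def inline_delimiters(symbol):
    """Return the (before, after) strings for a configured symbol (illustrative)."""
    if symbol.startswith('<') and symbol.endswith('>'):
        # A tag-like symbol such as '<sub>' gets a matching closing tag.
        return symbol, '</' + symbol[1:]
    # Any other symbol is used unchanged on both sides.
    return symbol, symbol


def convert_sub(text, sub_symbol=''):
    """Surround subscript content with the configured strings (illustrative)."""
    before, after = inline_delimiters(sub_symbol)
    return before + text + after
```

Under these assumptions, `convert_sub('2', '<sub>')` yields raw HTML, while a plain symbol such as `'~'` is repeated on both sides.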
There are various cases in which inline text fails to be separated by
(sufficiently many) newlines from adjacent block content. A paragraph
needs a blank line (two newlines) separating it from prior text, as
does an underlined header; an ATX header needs a single newline
separating it from prior text. A list needs at least one newline
separating it from prior text, but in general two: an ordered list
starting at a number other than 1 is only recognized when a blank
line precedes it.
To avoid accumulation of more newlines than necessary, take care when
concatenating the results of converting consecutive tags to remove
redundant newlines (keeping the greater of the number ending the prior
text and the number starting the subsequent text).
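
The "keep the greater of the two" rule can be sketched like this. It is a simplified illustration that assumes the converted pieces are available as plain strings, and `join_chunks` is a hypothetical name rather than a function from this change.

```python
def join_chunks(chunks):
    """Concatenate converted pieces without accumulating newlines (illustrative)."""
    result = ''
    for chunk in chunks:
        if not chunk:
            continue
        # Count the newlines ending the prior text and starting the next piece.
        trailing = len(result) - len(result.rstrip('\n'))
        leading = len(chunk) - len(chunk.lstrip('\n'))
        # Keep only the larger of the two counts at the seam.
        keep = max(trailing, leading)
        result = result.rstrip('\n') + '\n' * keep + chunk.lstrip('\n')
    return result
```

A paragraph that asks for a blank line on each side therefore still ends up with exactly one blank line between it and its neighbours, rather than the sum of both requests.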
This is thus an alternative to #108 that tries to avoid the excess
newline accumulation that was a concern there, as well as fixing more
cases than just paragraphs, and updating tests.
Fixes #92
Fixes #98
* Escape all characters with Markdown significance
There are many punctuation characters that sometimes have significance
in Markdown; more systematically escape them all (based on a new
escape_misc configuration option).
An attempt is made to restrict the escaping of '.' and ')' to the
context where they might have Markdown significance (after a number,
where they can indicate an ordered list item); no such attempt is made
for the other characters. Even that restriction may not be safe in all
cases, since the HTML could place the number outside the block being
escaped in one go, e.g. `<span>1</span>.`.
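
The context-limited escaping of '.' and ')' might look roughly like the sketch below. The regular expression and the name `escape_list_markers` are assumptions made for illustration and are not claimed to match the code in this change.

```python
import re

# A short run of digits at the start of the text or after whitespace,
# followed by '.' or ')' and then whitespace or the end of the text.
_LIST_MARKER = re.compile(r'((?:\s|^)[0-9]{1,9})([.)](?:\s|$))')


def escape_list_markers(text):
    """Backslash-escape '.' or ')' where it could start an ordered list item."""
    return _LIST_MARKER.sub(r'\1\\\2', text)
```

Because the function only sees one block of text at a time, a lone `.` whose number sits in a separate element is left unescaped, which is the limitation noted above.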
---------
Co-authored-by: AlexVonB <AlexVonB@users.noreply.github.com>
* Avoid inline styles inside `<code>` / `<pre>` conversion
The check used for this is analogous to that used to avoid escaping
potential markup characters inside such tags.
Fixes #103
---------
Co-authored-by: AlexVonB <AlexVonB@users.noreply.github.com>