340aecbe98e5450cca872f03b1c0ecf70ac83de0
This improves the markdownify logic for cleaning up input whitespace that has no semantic significance in HTML. This PR uses a branch based on that for #150 (which in turn is based on that for #120) to avoid conflicts with those fixes. The suggested order of merging is just first to merge #120, then the rest of #150, then the rest of this PR. Whitespace in HTML input isn't generally significant before or after block-level elements, or at the start of end of such an element other than `<pre>`. There is some limited logic in markdownify for removing it, (a) for whitespace-only nodes in conjunction with a limited list of elements (and with questionable logic that ony removes whitespace adjacent to such an element when also inside such an element) and (b) only for trailing whitespace, in certain places in relation to lists. Replace both those places with more thorough logic using a common list of block-level elements (which could be expanded more). In general, this reduces the number of unnecessary blank lines in output from markdownify (sometimes lines with just a newline, sometimes lines containing a space as well as that newline). There are open issues about cases where propagating such input whitespace to the output actually results in badly formed Markdown output (wrongly indented output), but #120 (which this builds on) fixes those issues, sometimes leaving unnecessary lines with just a space on them in the output, which are dealt with fully by the present PR. There are a few testcases that are affected because they were relying on such whitespace for good output from bad HTML input that used `<p>` or `<blockquote>` inside header tags. To keep reasonable output in those cases of bad input now input whitespace adjacent to those two tags is ignored, make the `<p>` and `<blockquote>` output explicitly include leading and trailing spaces if `convert_as_inline`; such explicit spaces seem the best that can be done for such bad input. Given those fixes, all the remaining changes needed to the expectations of existing tests seem like improvements (removing useless spaces or newlines from the output).
|build| |version| |license| |downloads|
.. |build| image:: https://img.shields.io/github/actions/workflow/status/matthewwithanm/python-markdownify/python-app.yml?branch=develop
:alt: GitHub Workflow Status
:target: https://github.com/matthewwithanm/python-markdownify/actions/workflows/python-app.yml?query=workflow%3A%22Python+application%22
.. |version| image:: https://img.shields.io/pypi/v/markdownify
:alt: Pypi version
:target: https://pypi.org/project/markdownify/
.. |license| image:: https://img.shields.io/pypi/l/markdownify
:alt: License
:target: https://github.com/matthewwithanm/python-markdownify/blob/develop/LICENSE
.. |downloads| image:: https://pepy.tech/badge/markdownify
:alt: Pypi Downloads
:target: https://pepy.tech/project/markdownify
Installation
============
``pip install markdownify``
Usage
=====
Convert some HTML to Markdown:
.. code:: python
from markdownify import markdownify as md
md('<b>Yay</b> <a href="http://github.com">GitHub</a>') # > '**Yay** [GitHub](http://github.com)'
Specify tags to exclude:
.. code:: python
from markdownify import markdownify as md
md('<b>Yay</b> <a href="http://github.com">GitHub</a>', strip=['a']) # > '**Yay** GitHub'
\...or specify the tags you want to include:
.. code:: python
from markdownify import markdownify as md
md('<b>Yay</b> <a href="http://github.com">GitHub</a>', convert=['b']) # > '**Yay** GitHub'
Options
=======
Markdownify supports the following options:
strip
A list of tags to strip. This option can't be used with the
``convert`` option.
convert
A list of tags to convert. This option can't be used with the
``strip`` option.
autolinks
A boolean indicating whether the "automatic link" style should be used when
a ``a`` tag's contents match its href. Defaults to ``True``.
default_title
A boolean to enable setting the title of a link to its href, if no title is
given. Defaults to ``False``.
heading_style
Defines how headings should be converted. Accepted values are ``ATX``,
``ATX_CLOSED``, ``SETEXT``, and ``UNDERLINED`` (which is an alias for
``SETEXT``). Defaults to ``UNDERLINED``.
bullets
An iterable (string, list, or tuple) of bullet styles to be used. If the
iterable only contains one item, it will be used regardless of how deeply
lists are nested. Otherwise, the bullet will alternate based on nesting
level. Defaults to ``'*+-'``.
strong_em_symbol
In markdown, both ``*`` and ``_`` are used to encode **strong** or
*emphasized* texts. Either of these symbols can be chosen by the options
``ASTERISK`` (default) or ``UNDERSCORE`` respectively.
sub_symbol, sup_symbol
Define the chars that surround ``<sub>`` and ``<sup>`` text. Defaults to an
empty string, because this is non-standard behavior. Could be something like
``~`` and ``^`` to result in ``~sub~`` and ``^sup^``. If the value starts
with ``<`` and ends with ``>``, it is treated as an HTML tag and a ``/`` is
inserted after the ``<`` in the string used after the text; this allows
specifying ``<sub>`` to use raw HTML in the output for subscripts, for
example.
newline_style
Defines the style of marking linebreaks (``<br>``) in markdown. The default
value ``SPACES`` of this option will adopt the usual two spaces and a newline,
while ``BACKSLASH`` will convert a linebreak to ``\\n`` (a backslash and a
newline). While the latter convention is non-standard, it is commonly
preferred and supported by a lot of interpreters.
code_language
Defines the language that should be assumed for all ``<pre>`` sections.
Useful, if all code on a page is in the same programming language and
should be annotated with `````python`` or similar.
Defaults to ``''`` (empty string) and can be any string.
code_language_callback
When the HTML code contains ``pre`` tags that in some way provide the code
language, for example as class, this callback can be used to extract the
language from the tag and prefix it to the converted ``pre`` tag.
The callback gets one single argument, an BeautifylSoup object, and returns
a string containing the code language, or ``None``.
An example to use the class name as code language could be::
def callback(el):
return el['class'][0] if el.has_attr('class') else None
Defaults to ``None``.
escape_asterisks
If set to ``False``, do not escape ``*`` to ``\*`` in text.
Defaults to ``True``.
escape_underscores
If set to ``False``, do not escape ``_`` to ``\_`` in text.
Defaults to ``True``.
escape_misc
If set to ``False``, do not escape miscellaneous punctuation characters
that sometimes have Markdown significance in text.
Defaults to ``True``.
keep_inline_images_in
Images are converted to their alt-text when the images are located inside
headlines or table cells. If some inline images should be converted to
markdown images instead, this option can be set to a list of parent tags
that should be allowed to contain inline images, for example ``['td']``.
Defaults to an empty list.
wrap, wrap_width
If ``wrap`` is set to ``True``, all text paragraphs are wrapped at
``wrap_width`` characters. Defaults to ``False`` and ``80``.
Use with ``newline_style=BACKSLASH`` to keep line breaks in paragraphs.
Options may be specified as kwargs to the ``markdownify`` function, or as a
nested ``Options`` class in ``MarkdownConverter`` subclasses.
Converting BeautifulSoup objects
================================
.. code:: python
from markdownify import MarkdownConverter
# Create shorthand method for conversion
def md(soup, **options):
return MarkdownConverter(**options).convert_soup(soup)
Creating Custom Converters
==========================
If you have a special usecase that calls for a special conversion, you can
always inherit from ``MarkdownConverter`` and override the method you want to
change.
The function that handles a HTML tag named ``abc`` is called
``convert_abc(self, el, text, convert_as_inline)`` and returns a string
containing the converted HTML tag.
The ``MarkdownConverter`` object will handle the conversion based on the
function names:
.. code:: python
from markdownify import MarkdownConverter
class ImageBlockConverter(MarkdownConverter):
"""
Create a custom MarkdownConverter that adds two newlines after an image
"""
def convert_img(self, el, text, convert_as_inline):
return super().convert_img(el, text, convert_as_inline) + '\n\n'
# Create shorthand method for conversion
def md(html, **options):
return ImageBlockConverter(**options).convert(html)
.. code:: python
from markdownify import MarkdownConverter
class IgnoreParagraphsConverter(MarkdownConverter):
"""
Create a custom MarkdownConverter that ignores paragraphs
"""
def convert_p(self, el, text, convert_as_inline):
return ''
# Create shorthand method for conversion
def md(html, **options):
return IgnoreParagraphsConverter(**options).convert(html)
Command Line Interface
======================
Use ``markdownify example.html > example.md`` or pipe input from stdin
(``cat example.html | markdownify > example.md``).
Call ``markdownify -h`` to see all available options.
They are the same as listed above and take the same arguments.
Development
===========
To run tests and the linter run ``pip install tox`` once, then ``tox``.
Languages
Python
99.7%
Nix
0.3%