Compare commits

..

25 Commits

Author SHA1 Message Date
AlexVonB
9231704988 Merge branch 'develop' 2021-12-11 14:44:58 +01:00
AlexVonB
1613c302bc Merge branch 'develop' 2021-11-17 17:11:01 +01:00
AlexVonB
55c9e84f38 Merge branch 'develop' 2021-09-04 21:50:34 +02:00
AlexVonB
99875683ac Merge branch 'develop' 2021-08-25 08:53:38 +02:00
AlexVonB
eaeb0603eb Merge branch 'develop' 2021-07-11 13:21:20 +02:00
AlexVonB
cb73590623 Merge branch 'develop' 2021-07-11 13:14:29 +02:00
AlexVonB
59417ab115 Merge branch 'develop' 2021-05-30 19:10:49 +02:00
AlexVonB
917b01e548 Merge branch 'develop' 2021-05-30 11:20:32 +02:00
AlexVonB
652714859d Merge branch 'develop' 2021-05-21 14:18:14 +02:00
AlexVonB
ea5b22824b Merge branch 'develop' 2021-05-18 10:42:27 +02:00
AlexVonB
ec5858e42f Merge branch 'develop' 2021-05-16 18:41:24 +02:00
AlexVonB
02bb914ef3 Merge branch 'develop' 2021-05-02 13:49:30 +02:00
AlexVonB
21c0d034d0 Merge branch 'develop' 2021-05-02 10:51:00 +02:00
AlexVonB
e3ddc789a2 Merge branch 'develop' 2021-04-22 12:43:27 +02:00
AlexVonB
2d0cd97323 Merge branch 'develop' 2021-04-22 12:13:03 +02:00
AlexVonB
ec185e2e9c Merge branch 'develop' 2021-02-21 23:09:55 +01:00
AlexVonB
079d1721aa Merge branch 'develop' 2021-02-21 20:58:34 +01:00
AlexVonB
bf24df3e2e bump to v0.6.3 2021-01-12 22:43:18 +01:00
AlexVonB
15329588b1 Merge branch 'develop' 2021-01-12 22:42:58 +01:00
AlexVonB
34ad8485fa bump to v0.6.2 2021-01-12 22:40:03 +01:00
AlexVonB
f0ce934bf8 Merge branch 'develop' 2021-01-12 22:39:47 +01:00
AlexVonB
99cd237f27 Merge branch 'develop' 2021-01-04 10:22:02 +01:00
AlexVonB
2bde8d3e8e Merge branch 'develop' 2021-01-02 16:49:28 +01:00
AlexVonB
8c9b029756 Merge branch 'develop' 2020-09-01 18:10:07 +02:00
AlexVonB
ae50065872 Merge branch 'develop' 2020-08-18 18:53:10 +02:00
24 changed files with 333 additions and 1540 deletions

View File

@@ -14,27 +14,6 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Python 3.8
uses: actions/setup-python@v2
with:
python-version: 3.8
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install --upgrade setuptools setuptools_scm wheel build tox
- name: Lint and test
run: |
tox
- name: Build
run: |
python -m build -nwsx .
types:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python 3.8
@@ -44,8 +23,11 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install --upgrade setuptools setuptools_scm wheel build tox mypy types-beautifulsoup4
- name: Check types
pip install flake8==3.8.4 pytest
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
- name: Lint with flake8
run: |
mypy .
mypy --strict tests/types.py
python setup.py lint
- name: Test with pytest
run: |
python setup.py test

View File

@@ -13,7 +13,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v2
- name: Set up Python
uses: actions/setup-python@v2
with:
@@ -21,11 +21,11 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install --upgrade setuptools setuptools_scm wheel build twine
pip install setuptools wheel twine
- name: Build and publish
env:
TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
run: |
python -m build -nwsx .
python setup.py sdist bdist_wheel
twine upload dist/*

1
.gitignore vendored
View File

@@ -9,4 +9,3 @@
/venv
build/
.vscode/settings.json
.tox/

View File

@@ -1,2 +1 @@
include README.rst
prune tests

View File

@@ -1,8 +1,8 @@
|build| |version| |license| |downloads|
.. |build| image:: https://img.shields.io/github/actions/workflow/status/matthewwithanm/python-markdownify/python-app.yml?branch=develop
.. |build| image:: https://img.shields.io/github/workflow/status/matthewwithanm/python-markdownify/Python%20application/develop
:alt: GitHub Workflow Status
:target: https://github.com/matthewwithanm/python-markdownify/actions/workflows/python-app.yml?query=workflow%3A%22Python+application%22
:target: https://github.com/matthewwithanm/python-markdownify/actions?query=workflow%3A%22Python+application%22
.. |version| image:: https://img.shields.io/pypi/v/markdownify
:alt: Pypi version
@@ -32,14 +32,14 @@ Convert some HTML to Markdown:
from markdownify import markdownify as md
md('<b>Yay</b> <a href="http://github.com">GitHub</a>') # > '**Yay** [GitHub](http://github.com)'
Specify tags to exclude:
Specify tags to exclude (blacklist):
.. code:: python
from markdownify import markdownify as md
md('<b>Yay</b> <a href="http://github.com">GitHub</a>', strip=['a']) # > '**Yay** GitHub'
\...or specify the tags you want to include:
\...or specify the tags you want to include (whitelist):
.. code:: python
@@ -53,11 +53,11 @@ Options
Markdownify supports the following options:
strip
A list of tags to strip. This option can't be used with the
A list of tags to strip (blacklist). This option can't be used with the
``convert`` option.
convert
A list of tags to convert. This option can't be used with the
A list of tags to convert (whitelist). This option can't be used with the
``strip`` option.
autolinks
@@ -87,16 +87,12 @@ strong_em_symbol
sub_symbol, sup_symbol
Define the chars that surround ``<sub>`` and ``<sup>`` text. Defaults to an
empty string, because this is non-standard behavior. Could be something like
``~`` and ``^`` to result in ``~sub~`` and ``^sup^``. If the value starts
with ``<`` and ends with ``>``, it is treated as an HTML tag and a ``/`` is
inserted after the ``<`` in the string used after the text; this allows
specifying ``<sub>`` to use raw HTML in the output for subscripts, for
example.
``~`` and ``^`` to result in ``~sub~`` and ``^sup^``.
newline_style
Defines the style of marking linebreaks (``<br>``) in markdown. The default
value ``SPACES`` of this option will adopt the usual two spaces and a newline,
while ``BACKSLASH`` will convert a linebreak to ``\\n`` (a backslash and a
while ``BACKSLASH`` will convert a linebreak to ``\\n`` (a backslash an a
newline). While the latter convention is non-standard, it is commonly
preferred and supported by a lot of interpreters.
@@ -106,101 +102,16 @@ code_language
should be annotated with `````python`` or similar.
Defaults to ``''`` (empty string) and can be any string.
code_language_callback
When the HTML code contains ``pre`` tags that in some way provide the code
language, for example as class, this callback can be used to extract the
language from the tag and prefix it to the converted ``pre`` tag.
The callback gets one single argument, a BeautifulSoup object, and returns
a string containing the code language, or ``None``.
An example to use the class name as code language could be::
def callback(el):
return el['class'][0] if el.has_attr('class') else None
Defaults to ``None``.
escape_asterisks
If set to ``False``, do not escape ``*`` to ``\*`` in text.
Defaults to ``True``.
escape_underscores
If set to ``False``, do not escape ``_`` to ``\_`` in text.
Defaults to ``True``.
escape_misc
If set to ``True``, escape miscellaneous punctuation characters
that sometimes have Markdown significance in text.
Defaults to ``False``.
keep_inline_images_in
Images are converted to their alt-text when the images are located inside
headlines or table cells. If some inline images should be converted to
markdown images instead, this option can be set to a list of parent tags
that should be allowed to contain inline images, for example ``['td']``.
Defaults to an empty list.
table_infer_header
Controls handling of tables with no header row (as indicated by ``<thead>``
or ``<th>``). When set to ``True``, the first body row is used as the header row.
Defaults to ``False``, which leaves the header row empty.
wrap, wrap_width
If ``wrap`` is set to ``True``, all text paragraphs are wrapped at
``wrap_width`` characters. Defaults to ``False`` and ``80``.
Use with ``newline_style=BACKSLASH`` to keep line breaks in paragraphs.
A `wrap_width` value of `None` reflows lines to unlimited line length.
strip_document
Controls whether leading and/or trailing separation newlines are removed from
the final converted document. Supported values are ``LSTRIP`` (leading),
``RSTRIP`` (trailing), ``STRIP`` (both), and ``None`` (neither). Newlines
within the document are unaffected.
Defaults to ``STRIP``.
strip_pre
Controls whether leading/trailing blank lines are removed from ``<pre>``
tags. Supported values are ``STRIP`` (all leading/trailing blank lines),
``STRIP_ONE`` (one leading/trailing blank line), and ``None`` (neither).
Defaults to ``STRIP``.
bs4_options
Specify additional configuration options for the ``BeautifulSoup`` object
used to interpret the HTML markup. String and list values (such as ``lxml``
or ``html5lib``) are treated as ``features`` arguments to control parser
selection. Dictionary values (such as ``{"from_encoding": "iso-8859-8"}``)
are treated as full kwargs to be used for the BeautifulSoup constructor,
allowing specification of any parameter. For parameter details, see the
Beautiful Soup documentation at:
.. _BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Options may be specified as kwargs to the ``markdownify`` function, or as a
nested ``Options`` class in ``MarkdownConverter`` subclasses.
Converting BeautifulSoup objects
================================
.. code:: python
from markdownify import MarkdownConverter
# Create shorthand method for conversion
def md(soup, **options):
return MarkdownConverter(**options).convert_soup(soup)
Creating Custom Converters
==========================
If you have a special usecase that calls for a special conversion, you can
always inherit from ``MarkdownConverter`` and override the method you want to
change.
The function that handles a HTML tag named ``abc`` is called
``convert_abc(self, el, text, parent_tags)`` and returns a string
containing the converted HTML tag.
The ``MarkdownConverter`` object will handle the conversion based on the
function names:
change:
.. code:: python
@@ -210,39 +121,21 @@ function names:
"""
Create a custom MarkdownConverter that adds two newlines after an image
"""
def convert_img(self, el, text, parent_tags):
return super().convert_img(el, text, parent_tags) + '\n\n'
def convert_img(self, el, text, convert_as_inline):
return super().convert_img(el, text, convert_as_inline) + '\n\n'
# Create shorthand method for conversion
def md(html, **options):
return ImageBlockConverter(**options).convert(html)
.. code:: python
from markdownify import MarkdownConverter
class IgnoreParagraphsConverter(MarkdownConverter):
"""
Create a custom MarkdownConverter that ignores paragraphs
"""
def convert_p(self, el, text, parent_tags):
return ''
# Create shorthand method for conversion
def md(html, **options):
return IgnoreParagraphsConverter(**options).convert(html)
Command Line Interface
======================
Use ``markdownify example.html > example.md`` or pipe input from stdin
(``cat example.html | markdownify > example.md``).
Call ``markdownify -h`` to see all available options.
They are the same as listed above and take the same arguments.
Development
===========
To run tests and the linter run ``pip install tox`` once, then ``tox``.
To run tests:
``python setup.py test``
To lint:
``python setup.py lint``

View File

@@ -1,48 +1,14 @@
from bs4 import BeautifulSoup, Comment, Doctype, NavigableString, Tag
from textwrap import fill
from bs4 import BeautifulSoup, NavigableString, Comment, Doctype
import re
import six
# General-purpose regex patterns
re_convert_heading = re.compile(r'convert_h(\d+)')
re_line_with_content = re.compile(r'^(.*)', flags=re.MULTILINE)
re_whitespace = re.compile(r'[\t ]+')
re_all_whitespace = re.compile(r'[\t \r\n]+')
re_newline_whitespace = re.compile(r'[\t \r\n]*[\r\n][\t \r\n]*')
re_html_heading = re.compile(r'h(\d+)')
re_pre_lstrip1 = re.compile(r'^ *\n')
re_pre_rstrip1 = re.compile(r'\n *$')
re_pre_lstrip = re.compile(r'^[ \n]*\n')
re_pre_rstrip = re.compile(r'[ \n]*$')
convert_heading_re = re.compile(r'convert_h(\d+)')
line_beginning_re = re.compile(r'^', re.MULTILINE)
whitespace_re = re.compile(r'[\t ]+')
all_whitespace_re = re.compile(r'[\s]+')
html_heading_re = re.compile(r'h[1-6]')
# Pattern for creating convert_<tag> function names from tag names
re_make_convert_fn_name = re.compile(r'[\[\]:-]')
# Extract (leading_nl, content, trailing_nl) from a string
# (functionally equivalent to r'^(\n*)(.*?)(\n*)$', but greedy is faster than reluctant here)
re_extract_newlines = re.compile(r'^(\n*)((?:.*[^\n])?)(\n*)$', flags=re.DOTALL)
# Escape miscellaneous special Markdown characters
re_escape_misc_chars = re.compile(r'([]\\&<`[>~=+|])')
# Escape sequence of one or more consecutive '-', preceded
# and followed by whitespace or start/end of fragment, as it
# might be confused with an underline of a header, or with a
# list marker
re_escape_misc_dash_sequences = re.compile(r'(\s|^)(-+(?:\s|$))')
# Escape sequence of up to six consecutive '#', preceded
# and followed by whitespace or start/end of fragment, as
# it might be confused with an ATX heading
re_escape_misc_hashes = re.compile(r'(\s|^)(#{1,6}(?:\s|$))')
# Escape '.' or ')' preceded by up to nine digits, as it might be
# confused with a list item
re_escape_misc_list_items = re.compile(r'((?:\s|^)[0-9]{1,9})([.)](?:\s|$))')
# Find consecutive backtick sequences in a string
re_backtick_runs = re.compile(r'`+')
# Heading styles
ATX = 'atx'
@@ -58,25 +24,11 @@ BACKSLASH = 'backslash'
ASTERISK = '*'
UNDERSCORE = '_'
# Document/pre strip styles
LSTRIP = 'lstrip'
RSTRIP = 'rstrip'
STRIP = 'strip'
STRIP_ONE = 'strip_one'
def strip1_pre(text):
"""Strip one leading and trailing newline from a <pre> string."""
text = re_pre_lstrip1.sub('', text)
text = re_pre_rstrip1.sub('', text)
return text
def strip_pre(text):
"""Strip all leading and trailing newlines from a <pre> string."""
text = re_pre_lstrip.sub('', text)
text = re_pre_rstrip.sub('', text)
return text
def escape(text):
if not text:
return ''
return text.replace('_', r'\_')
def chomp(text):
@@ -96,22 +48,15 @@ def abstract_inline_conversion(markup_fn):
"""
This abstracts all simple inline tags like b, em, del, ...
Returns a function that wraps the chomped text in a pair of the string
that is returned by markup_fn, with '/' inserted in the string used after
the text if it looks like an HTML tag. markup_fn is necessary to allow for
that is returned by markup_fn. markup_fn is necessary to allow for
references to self.strong_em_symbol etc.
"""
def implementation(self, el, text, parent_tags):
markup_prefix = markup_fn(self)
if markup_prefix.startswith('<') and markup_prefix.endswith('>'):
markup_suffix = '</' + markup_prefix[1:]
else:
markup_suffix = markup_prefix
if '_noformat' in parent_tags:
return text
def implementation(self, el, text, convert_as_inline):
markup = markup_fn(self)
prefix, suffix, text = chomp(text)
if not text:
return ''
return '%s%s%s%s%s' % (prefix, markup_prefix, text, markup_suffix, suffix)
return '%s%s%s%s%s' % (prefix, markup, text, markup, suffix)
return implementation
@@ -119,84 +64,19 @@ def _todict(obj):
return dict((k, getattr(obj, k)) for k in dir(obj) if not k.startswith('_'))
def should_remove_whitespace_inside(el):
"""Return to remove whitespace immediately inside a block-level element."""
if not el or not el.name:
return False
if re_html_heading.match(el.name) is not None:
return True
return el.name in ('p', 'blockquote',
'article', 'div', 'section',
'ol', 'ul', 'li',
'dl', 'dt', 'dd',
'table', 'thead', 'tbody', 'tfoot',
'tr', 'td', 'th')
def should_remove_whitespace_outside(el):
"""Return to remove whitespace immediately outside a block-level element."""
return should_remove_whitespace_inside(el) or (el and el.name == 'pre')
def _is_block_content_element(el):
"""
In a block context, returns:
- True for content elements (tags and non-whitespace text)
- False for non-content elements (whitespace text, comments, doctypes)
"""
if isinstance(el, Tag):
return True
elif isinstance(el, (Comment, Doctype)):
return False # (subclasses of NavigableString, must test first)
elif isinstance(el, NavigableString):
return el.strip() != ''
else:
return False
def _prev_block_content_sibling(el):
"""Returns the first previous sibling that is a content element, else None."""
while el is not None:
el = el.previous_sibling
if _is_block_content_element(el):
return el
return None
def _next_block_content_sibling(el):
"""Returns the first next sibling that is a content element, else None."""
while el is not None:
el = el.next_sibling
if _is_block_content_element(el):
return el
return None
class MarkdownConverter(object):
class DefaultOptions:
autolinks = True
bs4_options = 'html.parser'
bullets = '*+-' # An iterable of bullet types.
code_language = ''
code_language_callback = None
convert = None
default_title = False
escape_asterisks = True
escape_underscores = True
escape_misc = False
heading_style = UNDERLINED
keep_inline_images_in = []
newline_style = SPACES
strip = None
strip_document = STRIP
strip_pre = STRIP
strong_em_symbol = ASTERISK
sub_symbol = ''
sup_symbol = ''
table_infer_header = False
wrap = False
wrap_width = 80
code_language = ''
class Options(DefaultOptions):
pass
@@ -211,202 +91,99 @@ class MarkdownConverter(object):
raise ValueError('You may specify either tags to strip or tags to'
' convert, but not both.')
# If a string or list is passed to bs4_options, assume it is a 'features' specification
if not isinstance(self.options['bs4_options'], dict):
self.options['bs4_options'] = {'features': self.options['bs4_options']}
# Initialize the conversion function cache
self.convert_fn_cache = {}
def convert(self, html):
soup = BeautifulSoup(html, **self.options['bs4_options'])
return self.convert_soup(soup)
soup = BeautifulSoup(html, 'html.parser')
return self.process_tag(soup, convert_as_inline=False, children_only=True)
def convert_soup(self, soup):
return self.process_tag(soup, parent_tags=set())
def process_tag(self, node, convert_as_inline, children_only=False):
text = ''
def process_element(self, node, parent_tags=None):
if isinstance(node, NavigableString):
return self.process_text(node, parent_tags=parent_tags)
else:
return self.process_tag(node, parent_tags=parent_tags)
# markdown headings or cells can't include
# block elements (elements w/newlines)
isHeading = html_heading_re.match(node.name) is not None
isCell = node.name in ['td', 'th']
convert_children_as_inline = convert_as_inline
def process_tag(self, node, parent_tags=None):
# For the top-level element, initialize the parent context with an empty set.
if parent_tags is None:
parent_tags = set()
if not children_only and (isHeading or isCell):
convert_children_as_inline = True
# Collect child elements to process, ignoring whitespace-only text elements
# adjacent to the inner/outer boundaries of block elements.
should_remove_inside = should_remove_whitespace_inside(node)
# Remove whitespace-only textnodes in purely nested nodes
def is_nested_node(el):
return el and el.name in ['ol', 'ul', 'li',
'table', 'thead', 'tbody', 'tfoot',
'tr', 'td', 'th']
def _can_ignore(el):
if isinstance(el, Tag):
# Tags are always processed.
return False
elif isinstance(el, (Comment, Doctype)):
# Comment and Doctype elements are always ignored.
# (subclasses of NavigableString, must test first)
return True
if is_nested_node(node):
for el in node.children:
# Only extract (remove) whitespace-only text node if any of the
# conditions is true:
# - el is the first element in its parent
# - el is the last element in its parent
# - el is adjacent to an nested node
can_extract = (not el.previous_sibling
or not el.next_sibling
or is_nested_node(el.previous_sibling)
or is_nested_node(el.next_sibling))
if (isinstance(el, NavigableString)
and six.text_type(el).strip() == ''
and can_extract):
el.extract()
# Convert the children first
for el in node.children:
if isinstance(el, Comment) or isinstance(el, Doctype):
continue
elif isinstance(el, NavigableString):
if six.text_type(el).strip() != '':
# Non-whitespace text nodes are always processed.
return False
elif should_remove_inside and (not el.previous_sibling or not el.next_sibling):
# Inside block elements (excluding <pre>), ignore adjacent whitespace elements.
return True
elif should_remove_whitespace_outside(el.previous_sibling) or should_remove_whitespace_outside(el.next_sibling):
# Outside block elements (including <pre>), ignore adjacent whitespace elements.
return True
else:
return False
elif el is None:
return True
text += self.process_text(el)
else:
raise ValueError('Unexpected element type: %s' % type(el))
text += self.process_tag(el, convert_children_as_inline)
children_to_convert = [el for el in node.children if not _can_ignore(el)]
# Create a copy of this tag's parent context, then update it to include this tag
# to propagate down into the children.
parent_tags_for_children = set(parent_tags)
parent_tags_for_children.add(node.name)
# if this tag is a heading or table cell, add an '_inline' parent pseudo-tag
if (
re_html_heading.match(node.name) is not None # headings
or node.name in {'td', 'th'} # table cells
):
parent_tags_for_children.add('_inline')
# if this tag is a preformatted element, add a '_noformat' parent pseudo-tag
if node.name in {'pre', 'code', 'kbd', 'samp'}:
parent_tags_for_children.add('_noformat')
# Convert the children elements into a list of result strings.
child_strings = [
self.process_element(el, parent_tags=parent_tags_for_children)
for el in children_to_convert
]
# Remove empty string values.
child_strings = [s for s in child_strings if s]
# Collapse newlines at child element boundaries, if needed.
if node.name == 'pre' or node.find_parent('pre'):
# Inside <pre> blocks, do not collapse newlines.
pass
else:
# Collapse newlines at child element boundaries.
updated_child_strings = [''] # so the first lookback works
for child_string in child_strings:
# Separate the leading/trailing newlines from the content.
leading_nl, content, trailing_nl = re_extract_newlines.match(child_string).groups()
# If the last child had trailing newlines and this child has leading newlines,
# use the larger newline count, limited to 2.
if updated_child_strings[-1] and leading_nl:
prev_trailing_nl = updated_child_strings.pop() # will be replaced by the collapsed value
num_newlines = min(2, max(len(prev_trailing_nl), len(leading_nl)))
leading_nl = '\n' * num_newlines
# Add the results to the updated child string list.
updated_child_strings.extend([leading_nl, content, trailing_nl])
child_strings = updated_child_strings
# Join all child text strings into a single string.
text = ''.join(child_strings)
# apply this tag's final conversion function
convert_fn = self.get_conv_fn_cached(node.name)
if convert_fn is not None:
text = convert_fn(node, text, parent_tags=parent_tags)
if not children_only:
convert_fn = getattr(self, 'convert_%s' % node.name, None)
if convert_fn and self.should_convert_tag(node.name):
text = convert_fn(node, text, convert_as_inline)
return text
def convert__document_(self, el, text, parent_tags):
"""Final document-level formatting for BeautifulSoup object (node.name == "[document]")"""
if self.options['strip_document'] == LSTRIP:
text = text.lstrip('\n') # remove leading separation newlines
elif self.options['strip_document'] == RSTRIP:
text = text.rstrip('\n') # remove trailing separation newlines
elif self.options['strip_document'] == STRIP:
text = text.strip('\n') # remove leading and trailing separation newlines
elif self.options['strip_document'] is None:
pass # leave leading and trailing separation newlines as-is
else:
raise ValueError('Invalid value for strip_document: %s' % self.options['strip_document'])
return text
def process_text(self, el, parent_tags=None):
# For the top-level element, initialize the parent context with an empty set.
if parent_tags is None:
parent_tags = set()
def process_text(self, el):
text = six.text_type(el) or ''
# normalize whitespace if we're not inside a preformatted element
if 'pre' not in parent_tags:
if self.options['wrap']:
text = re_all_whitespace.sub(' ', text)
else:
text = re_newline_whitespace.sub('\n', text)
text = re_whitespace.sub(' ', text)
# dont remove any whitespace when handling pre or code in pre
if not (el.parent.name == 'pre'
or (el.parent.name == 'code'
and el.parent.parent.name == 'pre')):
text = whitespace_re.sub(' ', text)
# escape special characters if we're not inside a preformatted or code element
if '_noformat' not in parent_tags:
text = self.escape(text, parent_tags)
if el.parent.name != 'code':
text = escape(text)
# remove leading whitespace at the start or just after a
# block-level element; remove traliing whitespace at the end
# or just before a block-level element.
if (should_remove_whitespace_outside(el.previous_sibling)
or (should_remove_whitespace_inside(el.parent)
and not el.previous_sibling)):
text = text.lstrip(' \t\r\n')
if (should_remove_whitespace_outside(el.next_sibling)
or (should_remove_whitespace_inside(el.parent)
and not el.next_sibling)):
# remove trailing whitespaces if any of the following condition is true:
# - current text node is the last node in li
# - current text node is followed by an embedded list
if (el.parent.name == 'li'
and (not el.next_sibling
or el.next_sibling.name in ['ul', 'ol'])):
text = text.rstrip()
return text
def get_conv_fn_cached(self, tag_name):
"""Given a tag name, return the conversion function using the cache."""
# If conversion function is not in cache, add it
if tag_name not in self.convert_fn_cache:
self.convert_fn_cache[tag_name] = self.get_conv_fn(tag_name)
def __getattr__(self, attr):
# Handle headings
m = convert_heading_re.match(attr)
if m:
n = int(m.group(1))
# Return the cached entry
return self.convert_fn_cache[tag_name]
def convert_tag(el, text, convert_as_inline):
return self.convert_hn(n, el, text, convert_as_inline)
def get_conv_fn(self, tag_name):
"""Given a tag name, find and return the conversion function."""
tag_name = tag_name.lower()
convert_tag.__name__ = 'convert_h%s' % n
setattr(self, convert_tag.__name__, convert_tag)
return convert_tag
# Handle strip/convert exclusion options
if not self.should_convert_tag(tag_name):
return None
# Look for an explicitly defined conversion function by tag name first
convert_fn_name = "convert_%s" % re_make_convert_fn_name.sub("_", tag_name)
convert_fn = getattr(self, convert_fn_name, None)
if convert_fn:
return convert_fn
# If tag is any heading, handle with convert_hN() function
match = re_html_heading.match(tag_name)
if match:
n = int(match.group(1)) # get value of N from <hN>
return lambda el, text, parent_tags: self.convert_hN(n, el, text, parent_tags)
# No conversion function was found
return None
raise AttributeError(attr)
def should_convert_tag(self, tag):
"""Given a tag name, return whether to convert based on strip/convert options."""
tag = tag.lower()
strip = self.options['strip']
convert = self.options['convert']
if strip is not None:
@@ -416,28 +193,14 @@ class MarkdownConverter(object):
else:
return True
def escape(self, text, parent_tags):
if not text:
return ''
if self.options['escape_misc']:
text = re_escape_misc_chars.sub(r'\\\1', text)
text = re_escape_misc_dash_sequences.sub(r'\1\\\2', text)
text = re_escape_misc_hashes.sub(r'\1\\\2', text)
text = re_escape_misc_list_items.sub(r'\1\\\2', text)
if self.options['escape_asterisks']:
text = text.replace('*', r'\*')
if self.options['escape_underscores']:
text = text.replace('_', r'\_')
return text
def indent(self, text, level):
return line_beginning_re.sub('\t' * level, text) if text else ''
def underline(self, text, pad_char):
text = (text or '').rstrip()
return '\n\n%s\n%s\n\n' % (text, pad_char * len(text)) if text else ''
return '%s\n%s\n\n' % (text, pad_char * len(text)) if text else ''
def convert_a(self, el, text, parent_tags):
if '_noformat' in parent_tags:
return text
def convert_a(self, el, text, convert_as_inline):
prefix, suffix, text = chomp(text)
if not text:
return ''
@@ -457,188 +220,93 @@ class MarkdownConverter(object):
convert_b = abstract_inline_conversion(lambda self: 2 * self.options['strong_em_symbol'])
def convert_blockquote(self, el, text, parent_tags):
# handle some early-exit scenarios
text = (text or '').strip(' \t\r\n')
if '_inline' in parent_tags:
return ' ' + text + ' '
if not text:
return "\n"
def convert_blockquote(self, el, text, convert_as_inline):
# indent lines with blockquote marker
def _indent_for_blockquote(match):
line_content = match.group(1)
return '> ' + line_content if line_content else '>'
text = re_line_with_content.sub(_indent_for_blockquote, text)
if convert_as_inline:
return text
return '\n' + text + '\n\n'
return '\n' + (line_beginning_re.sub('> ', text) + '\n\n') if text else ''
def convert_br(self, el, text, parent_tags):
if '_inline' in parent_tags:
return ' '
def convert_br(self, el, text, convert_as_inline):
if convert_as_inline:
return ""
if self.options['newline_style'].lower() == BACKSLASH:
return '\\\n'
else:
return ' \n'
def convert_code(self, el, text, parent_tags):
if '_noformat' in parent_tags:
def convert_code(self, el, text, convert_as_inline):
if el.parent.name == 'pre':
return text
prefix, suffix, text = chomp(text)
if not text:
return ''
# Find the maximum number of consecutive backticks in the text, then
# delimit the code span with one more backtick than that
max_backticks = max((len(match) for match in re.findall(re_backtick_runs, text)), default=0)
markup_delimiter = '`' * (max_backticks + 1)
# If the maximum number of backticks is greater than zero, add a space
# to avoid interpretation of inside backticks as literals
if max_backticks > 0:
text = " " + text + " "
return '%s%s%s%s%s' % (prefix, markup_delimiter, text, markup_delimiter, suffix)
converter = abstract_inline_conversion(lambda self: '`')
return converter(self, el, text, convert_as_inline)
convert_del = abstract_inline_conversion(lambda self: '~~')
def convert_div(self, el, text, parent_tags):
if '_inline' in parent_tags:
return ' ' + text.strip() + ' '
text = text.strip()
return '\n\n%s\n\n' % text if text else ''
convert_article = convert_div
convert_section = convert_div
convert_em = abstract_inline_conversion(lambda self: self.options['strong_em_symbol'])
convert_kbd = convert_code
def convert_dd(self, el, text, parent_tags):
text = (text or '').strip()
if '_inline' in parent_tags:
return ' ' + text + ' '
if not text:
return '\n'
# indent definition content lines by four spaces
def _indent_for_dd(match):
line_content = match.group(1)
return ' ' + line_content if line_content else ''
text = re_line_with_content.sub(_indent_for_dd, text)
# insert definition marker into first-line indent whitespace
text = ':' + text[1:]
return '%s\n' % text
# definition lists are formatted as follows:
# https://pandoc.org/MANUAL.html#definition-lists
# https://michelf.ca/projects/php-markdown/extra/#def-list
convert_dl = convert_div
def convert_dt(self, el, text, parent_tags):
# remove newlines from term text
text = (text or '').strip()
text = re_all_whitespace.sub(' ', text)
if '_inline' in parent_tags:
return ' ' + text + ' '
if not text:
return '\n'
# TODO - format consecutive <dt> elements as directly adjacent lines):
# https://michelf.ca/projects/php-markdown/extra/#def-list
return '\n\n%s\n' % text
def convert_hN(self, n, el, text, parent_tags):
# convert_hN() converts <hN> tags, where N is any integer
if '_inline' in parent_tags:
def convert_hn(self, n, el, text, convert_as_inline):
if convert_as_inline:
return text
# Markdown does not support heading depths of n > 6
n = max(1, min(6, n))
style = self.options['heading_style'].lower()
text = text.strip()
text = text.rstrip()
if style == UNDERLINED and n <= 2:
line = '=' if n == 1 else '-'
return self.underline(text, line)
text = re_all_whitespace.sub(' ', text)
hashes = '#' * n
if style == ATX_CLOSED:
return '\n\n%s %s %s\n\n' % (hashes, text, hashes)
return '\n\n%s %s\n\n' % (hashes, text)
return '%s %s %s\n\n' % (hashes, text, hashes)
return '%s %s\n\n' % (hashes, text)
def convert_hr(self, el, text, parent_tags):
def convert_hr(self, el, text, convert_as_inline):
return '\n\n---\n\n'
convert_i = convert_em
def convert_img(self, el, text, parent_tags):
def convert_img(self, el, text, convert_as_inline):
alt = el.attrs.get('alt', None) or ''
src = el.attrs.get('src', None) or ''
title = el.attrs.get('title', None) or ''
title_part = ' "%s"' % title.replace('"', r'\"') if title else ''
if ('_inline' in parent_tags
and el.parent.name not in self.options['keep_inline_images_in']):
if convert_as_inline:
return alt
return '![%s](%s%s)' % (alt, src, title_part)
def convert_video(self, el, text, parent_tags):
if ('_inline' in parent_tags
and el.parent.name not in self.options['keep_inline_images_in']):
return text
src = el.attrs.get('src', None) or ''
if not src:
sources = el.find_all('source', attrs={'src': True})
if sources:
src = sources[0].attrs.get('src', None) or ''
poster = el.attrs.get('poster', None) or ''
if src and poster:
return '[![%s](%s)](%s)' % (text, poster, src)
if src:
return '[%s](%s)' % (text, src)
if poster:
return '![%s](%s)' % (text, poster)
return text
def convert_list(self, el, text, parent_tags):
def convert_list(self, el, text, convert_as_inline):
# Converting a list to inline is undefined.
# Ignoring inline conversion parents for list.
# Ignoring convert_to_inline for list.
nested = False
before_paragraph = False
next_sibling = _next_block_content_sibling(el)
if next_sibling and next_sibling.name not in ['ul', 'ol']:
if el.next_sibling and el.next_sibling.name not in ['ul', 'ol']:
before_paragraph = True
if 'li' in parent_tags:
# remove trailing newline if we're in a nested list
return '\n' + text.rstrip()
return '\n\n' + text + ('\n' if before_paragraph else '')
while el:
if el.name == 'li':
nested = True
break
el = el.parent
if nested:
# remove trailing newline if nested
return '\n' + self.indent(text, 1).rstrip()
return text + ('\n' if before_paragraph else '')
convert_ul = convert_list
convert_ol = convert_list
def convert_li(self, el, text, parent_tags):
# handle some early-exit scenarios
text = (text or '').strip()
if not text:
return "\n"
# determine list item bullet character to use
def convert_li(self, el, text, convert_as_inline):
parent = el.parent
if parent is not None and parent.name == 'ol':
if parent.get("start") and str(parent.get("start")).isnumeric():
if parent.get("start"):
start = int(parent.get("start"))
else:
start = 1
bullet = '%s.' % (start + len(el.find_previous_siblings('li')))
bullet = '%s.' % (start + parent.index(el))
else:
depth = -1
while el:
@@ -647,71 +315,17 @@ class MarkdownConverter(object):
el = el.parent
bullets = self.options['bullets']
bullet = bullets[depth % len(bullets)]
bullet = bullet + ' '
bullet_width = len(bullet)
bullet_indent = ' ' * bullet_width
return '%s %s\n' % (bullet, (text or '').strip())
# indent content lines by bullet width
def _indent_for_li(match):
line_content = match.group(1)
return bullet_indent + line_content if line_content else ''
text = re_line_with_content.sub(_indent_for_li, text)
def convert_p(self, el, text, convert_as_inline):
if convert_as_inline:
return text
return '%s\n\n' % text if text else ''
# insert bullet into first-line indent whitespace
text = bullet + text[bullet_width:]
return '%s\n' % text
def convert_p(self, el, text, parent_tags):
if '_inline' in parent_tags:
return ' ' + text.strip(' \t\r\n') + ' '
text = text.strip(' \t\r\n')
if self.options['wrap']:
# Preserve newlines (and preceding whitespace) resulting
# from <br> tags. Newlines in the input have already been
# replaced by spaces.
if self.options['wrap_width'] is not None:
lines = text.split('\n')
new_lines = []
for line in lines:
line = line.lstrip(' \t\r\n')
line_no_trailing = line.rstrip()
trailing = line[len(line_no_trailing):]
line = fill(line,
width=self.options['wrap_width'],
break_long_words=False,
break_on_hyphens=False)
new_lines.append(line + trailing)
text = '\n'.join(new_lines)
return '\n\n%s\n\n' % text if text else ''
def convert_pre(self, el, text, parent_tags):
def convert_pre(self, el, text, convert_as_inline):
if not text:
return ''
code_language = self.options['code_language']
if self.options['code_language_callback']:
code_language = self.options['code_language_callback'](el) or code_language
if self.options['strip_pre'] == STRIP:
text = strip_pre(text) # remove all leading/trailing newlines
elif self.options['strip_pre'] == STRIP_ONE:
text = strip1_pre(text) # remove one leading/trailing newline
elif self.options['strip_pre'] is None:
pass # leave leading and trailing newlines as-is
else:
raise ValueError('Invalid value for strip_pre: %s' % self.options['strip_pre'])
return '\n\n```%s\n%s\n```\n\n' % (code_language, text)
def convert_q(self, el, text, parent_tags):
return '"' + text + '"'
def convert_script(self, el, text, parent_tags):
return ''
def convert_style(self, el, text, parent_tags):
return ''
return '\n```%s\n%s\n```\n' % (self.options['code_language'], text)
convert_s = convert_del
@@ -723,70 +337,28 @@ class MarkdownConverter(object):
convert_sup = abstract_inline_conversion(lambda self: self.options['sup_symbol'])
def convert_table(self, el, text, parent_tags):
return '\n\n' + text.strip() + '\n\n'
def convert_table(self, el, text, convert_as_inline):
return '\n\n' + text + '\n'
def convert_caption(self, el, text, parent_tags):
return text.strip() + '\n\n'
def convert_td(self, el, text, convert_as_inline):
return ' ' + text + ' |'
def convert_figcaption(self, el, text, parent_tags):
return '\n\n' + text.strip() + '\n\n'
def convert_th(self, el, text, convert_as_inline):
return ' ' + text + ' |'
def convert_td(self, el, text, parent_tags):
colspan = 1
if 'colspan' in el.attrs and el['colspan'].isdigit():
colspan = max(1, min(1000, int(el['colspan'])))
return ' ' + text.strip().replace("\n", " ") + ' |' * colspan
def convert_th(self, el, text, parent_tags):
colspan = 1
if 'colspan' in el.attrs and el['colspan'].isdigit():
colspan = max(1, min(1000, int(el['colspan'])))
return ' ' + text.strip().replace("\n", " ") + ' |' * colspan
def convert_tr(self, el, text, parent_tags):
def convert_tr(self, el, text, convert_as_inline):
cells = el.find_all(['td', 'th'])
is_first_row = el.find_previous_sibling() is None
is_headrow = (
all([cell.name == 'th' for cell in cells])
or (el.parent.name == 'thead'
# avoid multiple tr in thead
and len(el.parent.find_all('tr')) == 1)
)
is_head_row_missing = (
(is_first_row and not el.parent.name == 'tbody')
or (is_first_row and el.parent.name == 'tbody' and len(el.parent.parent.find_all(['thead'])) < 1)
)
is_headrow = all([cell.name == 'th' for cell in cells])
overline = ''
underline = ''
full_colspan = 0
for cell in cells:
if 'colspan' in cell.attrs and cell['colspan'].isdigit():
full_colspan += max(1, min(1000, int(cell['colspan'])))
else:
full_colspan += 1
if ((is_headrow
or (is_head_row_missing
and self.options['table_infer_header']))
and is_first_row):
# first row and:
# - is headline or
# - headline is missing and header inference is enabled
# print headline underline
underline += '| ' + ' | '.join(['---'] * full_colspan) + ' |' + '\n'
elif ((is_head_row_missing
and not self.options['table_infer_header'])
or (is_first_row
and (el.parent.name == 'table'
or (el.parent.name == 'tbody'
and not el.parent.find_previous_sibling())))):
# headline is missing and header inference is disabled or:
# first row, not headline, and:
# - the parent is table or
# - the parent is tbody at the beginning of a table.
if is_headrow and not el.previous_sibling:
# first row and is headline: print headline underline
underline += '| ' + ' | '.join(['---'] * len(cells)) + ' |' + '\n'
elif not el.previous_sibling and not el.parent.name != 'table':
# first row, not headline, and the parent is sth. like tbody:
# print empty headline above this row
overline += '| ' + ' | '.join([''] * full_colspan) + ' |' + '\n'
overline += '| ' + ' | '.join(['---'] * full_colspan) + ' |' + '\n'
overline += '| ' + ' | '.join([''] * len(cells)) + ' |' + '\n'
overline += '| ' + ' | '.join(['---'] * len(cells)) + ' |' + '\n'
return overline + '|' + text + '\n' + underline

View File

@@ -1,77 +0,0 @@
from _typeshed import Incomplete
from typing import Callable, Union
ATX: str
ATX_CLOSED: str
UNDERLINED: str
SETEXT = UNDERLINED
SPACES: str
BACKSLASH: str
ASTERISK: str
UNDERSCORE: str
LSTRIP: str
RSTRIP: str
STRIP: str
STRIP_ONE: str
def markdownify(
html: str,
autolinks: bool = ...,
bs4_options: str = ...,
bullets: str = ...,
code_language: str = ...,
code_language_callback: Union[Callable[[Incomplete], Union[str, None]], None] = ...,
convert: Union[list[str], None] = ...,
default_title: bool = ...,
escape_asterisks: bool = ...,
escape_underscores: bool = ...,
escape_misc: bool = ...,
heading_style: str = ...,
keep_inline_images_in: list[str] = ...,
newline_style: str = ...,
strip: Union[list[str], None] = ...,
strip_document: Union[str, None] = ...,
strip_pre: str = ...,
strong_em_symbol: str = ...,
sub_symbol: str = ...,
sup_symbol: str = ...,
table_infer_header: bool = ...,
wrap: bool = ...,
wrap_width: int = ...,
) -> str: ...
class MarkdownConverter:
def __init__(
self,
autolinks: bool = ...,
bs4_options: str = ...,
bullets: str = ...,
code_language: str = ...,
code_language_callback: Union[Callable[[Incomplete], Union[str, None]], None] = ...,
convert: Union[list[str], None] = ...,
default_title: bool = ...,
escape_asterisks: bool = ...,
escape_underscores: bool = ...,
escape_misc: bool = ...,
heading_style: str = ...,
keep_inline_images_in: list[str] = ...,
newline_style: str = ...,
strip: Union[list[str], None] = ...,
strip_document: Union[str, None] = ...,
strip_pre: str = ...,
strong_em_symbol: str = ...,
sub_symbol: str = ...,
sup_symbol: str = ...,
table_infer_header: bool = ...,
wrap: bool = ...,
wrap_width: int = ...,
) -> None:
...
def convert(self, html: str) -> str:
...
def convert_soup(self, soup: Incomplete) -> str:
...

View File

@@ -1,84 +0,0 @@
#!/usr/bin/env python
import argparse
import sys
from markdownify import markdownify, ATX, ATX_CLOSED, UNDERLINED, \
SPACES, BACKSLASH, ASTERISK, UNDERSCORE
def main(argv=sys.argv[1:]):
parser = argparse.ArgumentParser(
prog='markdownify',
description='Converts html to markdown.',
)
parser.add_argument('html', nargs='?', type=argparse.FileType('r'),
default=sys.stdin,
help="The html file to convert. Defaults to STDIN if not "
"provided.")
parser.add_argument('-s', '--strip', nargs='*',
help="A list of tags to strip. This option can't be used with "
"the --convert option.")
parser.add_argument('-c', '--convert', nargs='*',
help="A list of tags to convert. This option can't be used with "
"the --strip option.")
parser.add_argument('-a', '--autolinks', action='store_true',
help="A boolean indicating whether the 'automatic link' style "
"should be used when a 'a' tag's contents match its href.")
parser.add_argument('--default-title', action='store_false',
help="A boolean to enable setting the title of a link to its "
"href, if no title is given.")
parser.add_argument('--heading-style', default=UNDERLINED,
choices=(ATX, ATX_CLOSED, UNDERLINED),
help="Defines how headings should be converted.")
parser.add_argument('-b', '--bullets', default='*+-',
help="A string of bullet styles to use; the bullet will "
"alternate based on nesting level.")
parser.add_argument('--strong-em-symbol', default=ASTERISK,
choices=(ASTERISK, UNDERSCORE),
help="Use * or _ to convert strong and italics text"),
parser.add_argument('--sub-symbol', default='',
help="Define the chars that surround '<sub>'.")
parser.add_argument('--sup-symbol', default='',
help="Define the chars that surround '<sup>'.")
parser.add_argument('--newline-style', default=SPACES,
choices=(SPACES, BACKSLASH),
help="Defines the style of <br> conversions: two spaces "
"or backslash at the and of the line thet should break.")
parser.add_argument('--code-language', default='',
help="Defines the language that should be assumed for all "
"'<pre>' sections.")
parser.add_argument('--no-escape-asterisks', dest='escape_asterisks',
action='store_false',
help="Do not escape '*' to '\\*' in text.")
parser.add_argument('--no-escape-underscores', dest='escape_underscores',
action='store_false',
help="Do not escape '_' to '\\_' in text.")
parser.add_argument('-i', '--keep-inline-images-in',
default=[],
nargs='*',
help="Images are converted to their alt-text when the images are "
"located inside headlines or table cells. If some inline images "
"should be converted to markdown images instead, this option can "
"be set to a list of parent tags that should be allowed to "
"contain inline images.")
parser.add_argument('--table-infer-header', dest='table_infer_header',
action='store_true',
help="When a table has no header row (as indicated by '<thead>' "
"or '<th>'), use the first body row as the header row.")
parser.add_argument('-w', '--wrap', action='store_true',
help="Wrap all text paragraphs at --wrap-width characters.")
parser.add_argument('--wrap-width', type=int, default=80)
parser.add_argument('--bs4-options',
default='html.parser',
help="Specifies the parser that BeautifulSoup should use to parse "
"the HTML markup. Examples include 'html5.parser', 'lxml', and "
"'html5lib'.")
args = parser.parse_args(argv)
print(markdownify(**vars(args)))
if __name__ == '__main__':
main()

View File

@@ -1 +0,0 @@

View File

@@ -1,45 +0,0 @@
[build-system]
requires = ["setuptools>=61.2", "setuptools_scm[toml]>=3.4.3"]
build-backend = "setuptools.build_meta"
[project]
name = "markdownify"
version = "1.2.2"
authors = [{name = "Matthew Tretter", email = "m@tthewwithanm.com"}]
description = "Convert HTML to markdown."
readme = "README.rst"
classifiers = [
"Environment :: Web Environment",
"Framework :: Django",
"Intended Audience :: Developers",
"License :: OSI Approved :: MIT License",
"Operating System :: OS Independent",
"Programming Language :: Python :: 2.5",
"Programming Language :: Python :: 2.6",
"Programming Language :: Python :: 2.7",
"Programming Language :: Python :: 3.6",
"Programming Language :: Python :: 3.7",
"Programming Language :: Python :: 3.8",
"Topic :: Utilities",
]
dependencies = [
"beautifulsoup4>=4.9,<5",
"six>=1.15,<2"
]
[project.urls]
Homepage = "http://github.com/matthewwithanm/python-markdownify"
Download = "http://github.com/matthewwithanm/python-markdownify/tarball/master"
[project.scripts]
markdownify = "markdownify.main:main"
[tool.setuptools]
zip-safe = false
include-package-data = true
[tool.setuptools.packages.find]
include = ["markdownify", "markdownify.*"]
namespaces = false
[tool.setuptools_scm]

2
setup.cfg Normal file
View File

@@ -0,0 +1,2 @@
[flake8]
ignore = E501 W503

99
setup.py Normal file
View File

@@ -0,0 +1,99 @@
#/usr/bin/env python
import codecs
import os
from setuptools import setup, find_packages
from setuptools.command.test import test as TestCommand, Command
read = lambda filepath: codecs.open(filepath, 'r', 'utf-8').read()
pkgmeta = {
'__title__': 'markdownify',
'__author__': 'Matthew Tretter',
'__version__': '0.10.1',
}
class PyTest(TestCommand):
def finalize_options(self):
TestCommand.finalize_options(self)
self.test_args = ['tests', '-s']
self.test_suite = True
def run_tests(self):
import pytest
errno = pytest.main(self.test_args)
raise SystemExit(errno)
class LintCommand(Command):
"""
A copy of flake8's Flake8Command
"""
description = "Run flake8 on modules registered in setuptools"
user_options = []
def initialize_options(self):
pass
def finalize_options(self):
pass
def distribution_files(self):
if self.distribution.packages:
for package in self.distribution.packages:
yield package.replace(".", os.path.sep)
if self.distribution.py_modules:
for filename in self.distribution.py_modules:
yield "%s.py" % filename
def run(self):
from flake8.api.legacy import get_style_guide
flake8_style = get_style_guide(config_file='setup.cfg')
paths = self.distribution_files()
report = flake8_style.check_files(paths)
raise SystemExit(report.total_errors > 0)
setup(
name='markdownify',
description='Convert HTML to markdown.',
long_description=read(os.path.join(os.path.dirname(__file__), 'README.rst')),
version=pkgmeta['__version__'],
author=pkgmeta['__author__'],
author_email='m@tthewwithanm.com',
url='http://github.com/matthewwithanm/python-markdownify',
download_url='http://github.com/matthewwithanm/python-markdownify/tarball/master',
packages=find_packages(),
zip_safe=False,
include_package_data=True,
setup_requires=[
'flake8>=3.8,<5',
],
tests_require=[
'pytest>=6.2,<7',
],
install_requires=[
'beautifulsoup4>=4.9,<5', 'six>=1.15,<2'
],
classifiers=[
'Environment :: Web Environment',
'Framework :: Django',
'Intended Audience :: Developers',
'License :: OSI Approved :: MIT License',
'Operating System :: OS Independent',
'Programming Language :: Python :: 2.5',
'Programming Language :: Python :: 2.6',
'Programming Language :: Python :: 2.7',
'Programming Language :: Python :: 3.6',
'Programming Language :: Python :: 3.7',
'Programming Language :: Python :: 3.8',
'Topic :: Utilities'
],
cmdclass={
'test': PyTest,
'lint': LintCommand,
},
)

View File

@@ -1,10 +0,0 @@
{ pkgs ? import <nixpkgs> {} }:
pkgs.mkShell {
name = "python-shell";
buildInputs = with pkgs; [
python38
python38Packages.tox
python38Packages.setuptools
python38Packages.virtualenv
];
}

View File

@@ -1,4 +1,4 @@
from .utils import md
from markdownify import markdownify as md
def test_chomp():
@@ -14,7 +14,7 @@ def test_chomp():
def test_nested():
text = md('<p>This is an <a href="http://example.com/">example link</a>.</p>')
assert text == '\n\nThis is an [example link](http://example.com/).\n\n'
assert text == 'This is an [example link](http://example.com/).\n\n'
def test_ignore_comments():

View File

@@ -2,8 +2,7 @@
Test whitelisting/blacklisting of specific tags.
"""
from markdownify import markdownify, LSTRIP, RSTRIP, STRIP, STRIP_ONE
from .utils import md
from markdownify import markdownify as md
def test_strip():
@@ -24,24 +23,3 @@ def test_convert():
def test_do_not_convert():
text = md('<a href="https://github.com/matthewwithanm">Some Text</a>', convert=[])
assert text == 'Some Text'
def test_strip_document():
assert markdownify("<p>Hello</p>") == "Hello" # test default of STRIP
assert markdownify("<p>Hello</p>", strip_document=LSTRIP) == "Hello\n\n"
assert markdownify("<p>Hello</p>", strip_document=RSTRIP) == "\n\nHello"
assert markdownify("<p>Hello</p>", strip_document=STRIP) == "Hello"
assert markdownify("<p>Hello</p>", strip_document=None) == "\n\nHello\n\n"
def test_strip_pre():
assert markdownify("<pre> \n \n Hello \n \n </pre>") == "```\n Hello\n```"
assert markdownify("<pre> \n \n Hello \n \n </pre>", strip_pre=STRIP) == "```\n Hello\n```"
assert markdownify("<pre> \n \n Hello \n \n </pre>", strip_pre=STRIP_ONE) == "```\n \n Hello \n \n```"
assert markdownify("<pre> \n \n Hello \n \n </pre>", strip_pre=None) == "```\n \n \n Hello \n \n \n```"
def bs4_options():
assert markdownify("<p>Hello</p>", bs4_options="html.parser") == "Hello"
assert markdownify("<p>Hello</p>", bs4_options=["html.parser"]) == "Hello"
assert markdownify("<p>Hello</p>", bs4_options={"features": "html.parser"}) == "Hello"

View File

@@ -1,4 +1,4 @@
from .utils import md
from markdownify import markdownify as md
def test_single_tag():
@@ -6,9 +6,8 @@ def test_single_tag():
def test_soup():
assert md('<div><span>Hello</div></span>') == '\n\nHello\n\n'
assert md('<div><span>Hello</div></span>') == 'Hello'
def test_whitespace():
assert md(' a b \t\t c ') == ' a b c '
assert md(' a b \n\n c ') == ' a b\nc '

View File

@@ -1,5 +1,4 @@
from markdownify import ATX, ATX_CLOSED, BACKSLASH, SPACES, UNDERSCORE
from .utils import md
from markdownify import markdownify as md, ATX, ATX_CLOSED, BACKSLASH, UNDERSCORE
def inline_tests(tag, markup):
@@ -40,11 +39,6 @@ def test_a_no_autolinks():
assert md('<a href="https://google.com">https://google.com</a>', autolinks=False) == '[https://google.com](https://google.com)'
def test_a_in_code():
assert md('<code><a href="https://google.com">Google</a></code>') == '`Google`'
assert md('<pre><a href="https://google.com">Google</a></pre>') == '\n\n```\nGoogle\n```\n\n'
def test_b():
assert md('<b>Hello</b>') == '**Hello**'
@@ -58,13 +52,6 @@ def test_b_spaces():
def test_blockquote():
assert md('<blockquote>Hello</blockquote>') == '\n> Hello\n\n'
assert md('<blockquote>\nHello\n</blockquote>') == '\n> Hello\n\n'
assert md('<blockquote>&nbsp;Hello</blockquote>') == '\n> \u00a0Hello\n\n'
def test_blockquote_with_nested_paragraph():
assert md('<blockquote><p>Hello</p></blockquote>') == '\n> Hello\n\n'
assert md('<blockquote><p>Hello</p><p>Hello again</p></blockquote>') == '\n> Hello\n>\n> Hello again\n\n'
def test_blockquote_with_paragraph():
@@ -73,114 +60,54 @@ def test_blockquote_with_paragraph():
def test_blockquote_nested():
text = md('<blockquote>And she was like <blockquote>Hello</blockquote></blockquote>')
assert text == '\n> And she was like\n> > Hello\n\n'
assert text == '\n> And she was like \n> > Hello\n> \n> \n\n'
def test_br():
assert md('a<br />b<br />c') == 'a \nb \nc'
assert md('a<br />b<br />c', newline_style=BACKSLASH) == 'a\\\nb\\\nc'
assert md('<h1>foo<br />bar</h1>', heading_style=ATX) == '\n\n# foo bar\n\n'
assert md('<td>foo<br />bar</td>', heading_style=ATX) == ' foo bar |'
def test_code():
inline_tests('code', '`')
assert md('<code>*this_should_not_escape*</code>') == '`*this_should_not_escape*`'
assert md('<kbd>*this_should_not_escape*</kbd>') == '`*this_should_not_escape*`'
assert md('<samp>*this_should_not_escape*</samp>') == '`*this_should_not_escape*`'
assert md('<code><span>*this_should_not_escape*</span></code>') == '`*this_should_not_escape*`'
assert md('<code>this should\t\tnormalize</code>') == '`this should normalize`'
assert md('<code><span>this should\t\tnormalize</span></code>') == '`this should normalize`'
assert md('<code>foo<b>bar</b>baz</code>') == '`foobarbaz`'
assert md('<kbd>foo<i>bar</i>baz</kbd>') == '`foobarbaz`'
assert md('<samp>foo<del> bar </del>baz</samp>') == '`foo bar baz`'
assert md('<samp>foo <del>bar</del> baz</samp>') == '`foo bar baz`'
assert md('<code>foo<em> bar </em>baz</code>') == '`foo bar baz`'
assert md('<code>foo<code> bar </code>baz</code>') == '`foo bar baz`'
assert md('<code>foo<strong> bar </strong>baz</code>') == '`foo bar baz`'
assert md('<code>foo<s> bar </s>baz</code>') == '`foo bar baz`'
assert md('<code>foo<sup>bar</sup>baz</code>', sup_symbol='^') == '`foobarbaz`'
assert md('<code>foo<sub>bar</sub>baz</code>', sub_symbol='^') == '`foobarbaz`'
assert md('foo<code>`bar`</code>baz') == 'foo`` `bar` ``baz'
assert md('foo<code>``bar``</code>baz') == 'foo``` ``bar`` ```baz'
assert md('foo<code> `bar` </code>baz') == 'foo `` `bar` `` baz'
def test_dl():
assert md('<dl><dt>term</dt><dd>definition</dd></dl>') == '\n\nterm\n: definition\n\n'
assert md('<dl><dt><p>te</p><p>rm</p></dt><dd>definition</dd></dl>') == '\n\nte rm\n: definition\n\n'
assert md('<dl><dt>term</dt><dd><p>definition-p1</p><p>definition-p2</p></dd></dl>') == '\n\nterm\n: definition-p1\n\n definition-p2\n\n'
assert md('<dl><dt>term</dt><dd><p>definition 1</p></dd><dd><p>definition 2</p></dd></dl>') == '\n\nterm\n: definition 1\n: definition 2\n\n'
assert md('<dl><dt>term 1</dt><dd>definition 1</dd><dt>term 2</dt><dd>definition 2</dd></dl>') == '\n\nterm 1\n: definition 1\n\nterm 2\n: definition 2\n\n'
assert md('<dl><dt>term</dt><dd><blockquote><p>line 1</p><p>line 2</p></blockquote></dd></dl>') == '\n\nterm\n: > line 1\n >\n > line 2\n\n'
assert md('<dl><dt>term</dt><dd><ol><li><p>1</p><ul><li>2a</li><li>2b</li></ul></li><li><p>3</p></li></ol></dd></dl>') == '\n\nterm\n: 1. 1\n\n * 2a\n * 2b\n 2. 3\n\n'
assert md('<code>this_should_not_escape</code>') == '`this_should_not_escape`'
def test_del():
inline_tests('del', '~~')
def test_div_section_article():
for tag in ['div', 'section', 'article']:
assert md(f'<{tag}>456</{tag}>') == '\n\n456\n\n'
assert md(f'123<{tag}>456</{tag}>789') == '123\n\n456\n\n789'
assert md(f'123<{tag}>\n 456 \n</{tag}>789') == '123\n\n456\n\n789'
assert md(f'123<{tag}><p>456</p></{tag}>789') == '123\n\n456\n\n789'
assert md(f'123<{tag}>\n<p>456</p>\n</{tag}>789') == '123\n\n456\n\n789'
assert md(f'123<{tag}><pre>4 5 6</pre></{tag}>789') == '123\n\n```\n4 5 6\n```\n\n789'
assert md(f'123<{tag}>\n<pre>4 5 6</pre>\n</{tag}>789') == '123\n\n```\n4 5 6\n```\n\n789'
assert md(f'123<{tag}>4\n5\n6</{tag}>789') == '123\n\n4\n5\n6\n\n789'
assert md(f'123<{tag}>\n4\n5\n6\n</{tag}>789') == '123\n\n4\n5\n6\n\n789'
assert md(f'123<{tag}>\n<p>\n4\n5\n6\n</p>\n</{tag}>789') == '123\n\n4\n5\n6\n\n789'
assert md(f'<{tag}><h1>title</h1>body</{{tag}}>', heading_style=ATX) == '\n\n# title\n\nbody\n\n'
def test_div():
assert md('Hello</div> World') == 'Hello World'
def test_em():
inline_tests('em', '*')
def test_figcaption():
assert (md("TEXT<figure><figcaption>\nCaption\n</figcaption><span>SPAN</span></figure>") == "TEXT\n\nCaption\n\nSPAN")
assert (md("<figure><span>SPAN</span><figcaption>\nCaption\n</figcaption></figure>TEXT") == "SPAN\n\nCaption\n\nTEXT")
def test_header_with_space():
assert md('<h3>\n\nHello</h3>') == '\n\n### Hello\n\n'
assert md('<h3>Hello\n\n\nWorld</h3>') == '\n\n### Hello World\n\n'
assert md('<h4>\n\nHello</h4>') == '\n\n#### Hello\n\n'
assert md('<h5>\n\nHello</h5>') == '\n\n##### Hello\n\n'
assert md('<h5>\n\nHello\n\n</h5>') == '\n\n##### Hello\n\n'
assert md('<h5>\n\nHello \n\n</h5>') == '\n\n##### Hello\n\n'
def test_h1():
assert md('<h1>Hello</h1>') == '\n\nHello\n=====\n\n'
assert md('<h1>Hello</h1>') == 'Hello\n=====\n\n'
def test_h2():
assert md('<h2>Hello</h2>') == '\n\nHello\n-----\n\n'
assert md('<h2>Hello</h2>') == 'Hello\n-----\n\n'
def test_hn():
assert md('<h3>Hello</h3>') == '\n\n### Hello\n\n'
assert md('<h4>Hello</h4>') == '\n\n#### Hello\n\n'
assert md('<h5>Hello</h5>') == '\n\n##### Hello\n\n'
assert md('<h6>Hello</h6>') == '\n\n###### Hello\n\n'
assert md('<h10>Hello</h10>') == md('<h6>Hello</h6>')
assert md('<h0>Hello</h0>') == md('<h1>Hello</h1>')
assert md('<hx>Hello</hx>') == md('Hello')
assert md('<h3>Hello</h3>') == '### Hello\n\n'
assert md('<h4>Hello</h4>') == '#### Hello\n\n'
assert md('<h5>Hello</h5>') == '##### Hello\n\n'
assert md('<h6>Hello</h6>') == '###### Hello\n\n'
def test_hn_chained():
assert md('<h1>First</h1>\n<h2>Second</h2>\n<h3>Third</h3>', heading_style=ATX) == '\n\n# First\n\n## Second\n\n### Third\n\n'
assert md('X<h1>First</h1>', heading_style=ATX) == 'X\n\n# First\n\n'
assert md('X<h1>First</h1>', heading_style=ATX_CLOSED) == 'X\n\n# First #\n\n'
assert md('X<h1>First</h1>') == 'X\n\nFirst\n=====\n\n'
assert md('<h1>First</h1>\n<h2>Second</h2>\n<h3>Third</h3>', heading_style=ATX) == '# First\n\n\n## Second\n\n\n### Third\n\n'
assert md('X<h1>First</h1>', heading_style=ATX) == 'X# First\n\n'
def test_hn_nested_tag_heading_style():
assert md('<h1>A <p>P</p> C </h1>', heading_style=ATX_CLOSED) == '\n\n# A P C #\n\n'
assert md('<h1>A <p>P</p> C </h1>', heading_style=ATX) == '\n\n# A P C\n\n'
assert md('<h1>A <p>P</p> C </h1>', heading_style=ATX_CLOSED) == '# A P C #\n\n'
assert md('<h1>A <p>P</p> C </h1>', heading_style=ATX) == '# A P C\n\n'
def test_hn_nested_simple_tag():
@@ -196,38 +123,32 @@ def test_hn_nested_simple_tag():
]
for tag, markdown in tag_to_markdown:
assert md('<h3>A <' + tag + '>' + tag + '</' + tag + '> B</h3>') == '\n\n### A ' + markdown + ' B\n\n'
assert md('<h3>A <' + tag + '>' + tag + '</' + tag + '> B</h3>') == '### A ' + markdown + ' B\n\n'
assert md('<h3>A <br>B</h3>', heading_style=ATX) == '\n\n### A B\n\n'
assert md('<h3>A <br>B</h3>', heading_style=ATX) == '### A B\n\n'
# Nested lists not supported
# assert md('<h3>A <ul><li>li1</i><li>l2</li></ul></h3>', heading_style=ATX) == '\n### A li1 li2 B\n\n'
# assert md('<h3>A <ul><li>li1</i><li>l2</li></ul></h3>', heading_style=ATX) == '### A li1 li2 B\n\n'
def test_hn_nested_img():
image_attributes_to_markdown = [
("", "", ""),
("alt='Alt Text'", "Alt Text", ""),
("alt='Alt Text' title='Optional title'", "Alt Text", " \"Optional title\""),
("", ""),
("alt='Alt Text'", "Alt Text"),
("alt='Alt Text' title='Optional title'", "Alt Text"),
]
for image_attributes, markdown, title in image_attributes_to_markdown:
assert md('<h3>A <img src="/path/to/img.jpg" ' + image_attributes + '/> B</h3>') == '\n\n### A' + (' ' + markdown + ' ' if markdown else ' ') + 'B\n\n'
assert md('<h3>A <img src="/path/to/img.jpg" ' + image_attributes + '/> B</h3>', keep_inline_images_in=['h3']) == '\n\n### A ![' + markdown + '](/path/to/img.jpg' + title + ') B\n\n'
for image_attributes, markdown in image_attributes_to_markdown:
assert md('<h3>A <img src="/path/to/img.jpg " ' + image_attributes + '/> B</h3>') == '### A ' + markdown + ' B\n\n'
def test_hn_atx_headings():
assert md('<h1>Hello</h1>', heading_style=ATX) == '\n\n# Hello\n\n'
assert md('<h2>Hello</h2>', heading_style=ATX) == '\n\n## Hello\n\n'
assert md('<h1>Hello</h1>', heading_style=ATX) == '# Hello\n\n'
assert md('<h2>Hello</h2>', heading_style=ATX) == '## Hello\n\n'
def test_hn_atx_closed_headings():
assert md('<h1>Hello</h1>', heading_style=ATX_CLOSED) == '\n\n# Hello #\n\n'
assert md('<h2>Hello</h2>', heading_style=ATX_CLOSED) == '\n\n## Hello ##\n\n'
def test_hn_newlines():
assert md("<h1>H1-1</h1>TEXT<h2>H2-2</h2>TEXT<h1>H1-2</h1>TEXT", heading_style=ATX) == '\n\n# H1-1\n\nTEXT\n\n## H2-2\n\nTEXT\n\n# H1-2\n\nTEXT'
assert md('<h1>H1-1</h1>\n<p>TEXT</p>\n<h2>H2-2</h2>\n<p>TEXT</p>\n<h1>H1-2</h1>\n<p>TEXT</p>', heading_style=ATX) == '\n\n# H1-1\n\nTEXT\n\n## H2-2\n\nTEXT\n\n# H1-2\n\nTEXT\n\n'
assert md('<h1>Hello</h1>', heading_style=ATX_CLOSED) == '# Hello #\n\n'
assert md('<h2>Hello</h2>', heading_style=ATX_CLOSED) == '## Hello ##\n\n'
def test_head():
@@ -237,7 +158,7 @@ def test_head():
def test_hr():
assert md('Hello<hr>World') == 'Hello\n\n---\n\nWorld'
assert md('Hello<hr />World') == 'Hello\n\n---\n\nWorld'
assert md('<p>Hello</p>\n<hr>\n<p>World</p>') == '\n\nHello\n\n---\n\nWorld\n\n'
assert md('<p>Hello</p>\n<hr>\n<p>World</p>') == 'Hello\n\n\n\n\n---\n\n\nWorld\n\n'
def test_i():
@@ -249,76 +170,17 @@ def test_img():
assert md('<img src="/path/to/img.jpg" alt="Alt text" />') == '![Alt text](/path/to/img.jpg)'
def test_video():
assert md('<video src="/path/to/video.mp4" poster="/path/to/img.jpg">text</video>') == '[![text](/path/to/img.jpg)](/path/to/video.mp4)'
assert md('<video src="/path/to/video.mp4">text</video>') == '[text](/path/to/video.mp4)'
assert md('<video><source src="/path/to/video.mp4"/>text</video>') == '[text](/path/to/video.mp4)'
assert md('<video poster="/path/to/img.jpg">text</video>') == '![text](/path/to/img.jpg)'
assert md('<video>text</video>') == 'text'
def test_kbd():
inline_tests('kbd', '`')
def test_p():
assert md('<p>hello</p>') == '\n\nhello\n\n'
assert md("<p><p>hello</p></p>") == "\n\nhello\n\n"
assert md('<p>123456789 123456789</p>') == '\n\n123456789 123456789\n\n'
assert md('<p>123456789\n\n\n123456789</p>') == '\n\n123456789\n123456789\n\n'
assert md('<p>123456789\n\n\n123456789</p>', wrap=True, wrap_width=80) == '\n\n123456789 123456789\n\n'
assert md('<p>123456789\n\n\n123456789</p>', wrap=True, wrap_width=None) == '\n\n123456789 123456789\n\n'
assert md('<p>123456789 123456789</p>', wrap=True, wrap_width=10) == '\n\n123456789\n123456789\n\n'
assert md('<p><a href="https://example.com">Some long link</a></p>', wrap=True, wrap_width=10) == '\n\n[Some long\nlink](https://example.com)\n\n'
assert md('<p>12345<br />67890</p>', wrap=True, wrap_width=10, newline_style=BACKSLASH) == '\n\n12345\\\n67890\n\n'
assert md('<p>12345<br />67890</p>', wrap=True, wrap_width=50, newline_style=BACKSLASH) == '\n\n12345\\\n67890\n\n'
assert md('<p>12345<br />67890</p>', wrap=True, wrap_width=10, newline_style=SPACES) == '\n\n12345 \n67890\n\n'
assert md('<p>12345<br />67890</p>', wrap=True, wrap_width=50, newline_style=SPACES) == '\n\n12345 \n67890\n\n'
assert md('<p>12345678901<br />12345</p>', wrap=True, wrap_width=10, newline_style=BACKSLASH) == '\n\n12345678901\\\n12345\n\n'
assert md('<p>12345678901<br />12345</p>', wrap=True, wrap_width=50, newline_style=BACKSLASH) == '\n\n12345678901\\\n12345\n\n'
assert md('<p>12345678901<br />12345</p>', wrap=True, wrap_width=10, newline_style=SPACES) == '\n\n12345678901 \n12345\n\n'
assert md('<p>12345678901<br />12345</p>', wrap=True, wrap_width=50, newline_style=SPACES) == '\n\n12345678901 \n12345\n\n'
assert md('<p>1234 5678 9012<br />67890</p>', wrap=True, wrap_width=10, newline_style=BACKSLASH) == '\n\n1234 5678\n9012\\\n67890\n\n'
assert md('<p>1234 5678 9012<br />67890</p>', wrap=True, wrap_width=10, newline_style=SPACES) == '\n\n1234 5678\n9012 \n67890\n\n'
assert md('First<p>Second</p><p>Third</p>Fourth') == 'First\n\nSecond\n\nThird\n\nFourth'
assert md('<p>&nbsp;x y</p>', wrap=True, wrap_width=80) == '\n\n\u00a0x y\n\n'
assert md('<p>hello</p>') == 'hello\n\n'
def test_pre():
assert md('<pre>test\n foo\nbar</pre>') == '\n\n```\ntest\n foo\nbar\n```\n\n'
assert md('<pre><code>test\n foo\nbar</code></pre>') == '\n\n```\ntest\n foo\nbar\n```\n\n'
assert md('<pre>*this_should_not_escape*</pre>') == '\n\n```\n*this_should_not_escape*\n```\n\n'
assert md('<pre><span>*this_should_not_escape*</span></pre>') == '\n\n```\n*this_should_not_escape*\n```\n\n'
assert md('<pre>\t\tthis should\t\tnot normalize</pre>') == '\n\n```\n\t\tthis should\t\tnot normalize\n```\n\n'
assert md('<pre><span>\t\tthis should\t\tnot normalize</span></pre>') == '\n\n```\n\t\tthis should\t\tnot normalize\n```\n\n'
assert md('<pre>foo<b>\nbar\n</b>baz</pre>') == '\n\n```\nfoo\nbar\nbaz\n```\n\n'
assert md('<pre>foo<i>\nbar\n</i>baz</pre>') == '\n\n```\nfoo\nbar\nbaz\n```\n\n'
assert md('<pre>foo\n<i>bar</i>\nbaz</pre>') == '\n\n```\nfoo\nbar\nbaz\n```\n\n'
assert md('<pre>foo<i>\n</i>baz</pre>') == '\n\n```\nfoo\nbaz\n```\n\n'
assert md('<pre>foo<del>\nbar\n</del>baz</pre>') == '\n\n```\nfoo\nbar\nbaz\n```\n\n'
assert md('<pre>foo<em>\nbar\n</em>baz</pre>') == '\n\n```\nfoo\nbar\nbaz\n```\n\n'
assert md('<pre>foo<code>\nbar\n</code>baz</pre>') == '\n\n```\nfoo\nbar\nbaz\n```\n\n'
assert md('<pre>foo<strong>\nbar\n</strong>baz</pre>') == '\n\n```\nfoo\nbar\nbaz\n```\n\n'
assert md('<pre>foo<s>\nbar\n</s>baz</pre>') == '\n\n```\nfoo\nbar\nbaz\n```\n\n'
assert md('<pre>foo<sup>\nbar\n</sup>baz</pre>', sup_symbol='^') == '\n\n```\nfoo\nbar\nbaz\n```\n\n'
assert md('<pre>foo<sub>\nbar\n</sub>baz</pre>', sub_symbol='^') == '\n\n```\nfoo\nbar\nbaz\n```\n\n'
assert md('<pre>foo<sub>\nbar\n</sub>baz</pre>', sub_symbol='^') == '\n\n```\nfoo\nbar\nbaz\n```\n\n'
assert md('foo<pre>bar</pre>baz', sub_symbol='^') == 'foo\n\n```\nbar\n```\n\nbaz'
assert md("<p>foo</p>\n<pre>bar</pre>\n</p>baz</p>", sub_symbol="^") == "\n\nfoo\n\n```\nbar\n```\n\nbaz"
def test_q():
assert md('foo <q>quote</q> bar') == 'foo "quote" bar'
assert md('foo <q cite="https://example.com">quote</q> bar') == 'foo "quote" bar'
def test_script():
assert md('foo <script>var foo=42;</script> bar') == 'foo bar'
def test_style():
assert md('foo <style>h1 { font-size: larger }</style> bar') == 'foo bar'
assert md('<pre>test\n foo\nbar</pre>') == '\n```\ntest\n foo\nbar\n```\n'
assert md('<pre><code>test\n foo\nbar</code></pre>') == '\n```\ntest\n foo\nbar\n```\n'
def test_s():
@@ -343,34 +205,13 @@ def test_strong_em_symbol():
def test_sub():
assert md('<sub>foo</sub>') == 'foo'
assert md('<sub>foo</sub>', sub_symbol='~') == '~foo~'
assert md('<sub>foo</sub>', sub_symbol='<sub>') == '<sub>foo</sub>'
def test_sup():
assert md('<sup>foo</sup>') == 'foo'
assert md('<sup>foo</sup>', sup_symbol='^') == '^foo^'
assert md('<sup>foo</sup>', sup_symbol='<sup>') == '<sup>foo</sup>'
def test_lang():
assert md('<pre>test\n foo\nbar</pre>', code_language='python') == '\n\n```python\ntest\n foo\nbar\n```\n\n'
assert md('<pre><code>test\n foo\nbar</code></pre>', code_language='javascript') == '\n\n```javascript\ntest\n foo\nbar\n```\n\n'
def test_lang_callback():
def callback(el):
return el['class'][0] if el.has_attr('class') else None
assert md('<pre class="python">test\n foo\nbar</pre>', code_language_callback=callback) == '\n\n```python\ntest\n foo\nbar\n```\n\n'
assert md('<pre class="javascript"><code>test\n foo\nbar</code></pre>', code_language_callback=callback) == '\n\n```javascript\ntest\n foo\nbar\n```\n\n'
assert md('<pre class="javascript"><code class="javascript">test\n foo\nbar</code></pre>', code_language_callback=callback) == '\n\n```javascript\ntest\n foo\nbar\n```\n\n'
def test_spaces():
assert md('<p> a b </p> <p> c d </p>') == '\n\na b\n\nc d\n\n'
assert md('<p> <i>a</i> </p>') == '\n\n*a*\n\n'
assert md('test <p> again </p>') == 'test\n\nagain\n\n'
assert md('test <blockquote> text </blockquote> after') == 'test\n> text\n\nafter'
assert md(' <ol> <li> x </li> <li> y </li> </ol> ') == '\n\n1. x\n2. y\n'
assert md(' <ul> <li> x </li> <li> y </li> </ol> ') == '\n\n* x\n* y\n'
assert md('test <pre> foo </pre> bar') == 'test\n\n```\n foo\n```\n\nbar'
assert md('<pre>test\n foo\nbar</pre>', code_language='python') == '\n```python\ntest\n foo\nbar\n```\n'
assert md('<pre><code>test\n foo\nbar</code></pre>', code_language='javascript') == '\n```javascript\ntest\n foo\nbar\n```\n'

View File

@@ -1,44 +1,18 @@
from markdownify import MarkdownConverter
from bs4 import BeautifulSoup
class UnitTestConverter(MarkdownConverter):
class ImageBlockConverter(MarkdownConverter):
"""
Create a custom MarkdownConverter for unit tests
Create a custom MarkdownConverter that adds two newlines after an image
"""
def convert_img(self, el, text, parent_tags):
"""Add two newlines after an image"""
return super().convert_img(el, text, parent_tags) + '\n\n'
def convert_custom_tag(self, el, text, parent_tags):
"""Ensure conversion function is found for tags with special characters in name"""
return "convert_custom_tag(): %s" % text
def convert_h1(self, el, text, parent_tags):
"""Ensure explicit heading conversion function is used"""
return "convert_h1: %s" % (text)
def convert_hN(self, n, el, text, parent_tags):
"""Ensure general heading conversion function is used"""
return "convert_hN(%d): %s" % (n, text)
def convert_img(self, el, text, convert_as_inline):
return super().convert_img(el, text, convert_as_inline) + '\n\n'
def test_custom_conversion_functions():
def test_img():
# Create shorthand method for conversion
def md(html, **options):
return UnitTestConverter(**options).convert(html)
return ImageBlockConverter(**options).convert(html)
assert md('<img src="/path/to/img.jpg" alt="Alt text" title="Optional title" />text') == '![Alt text](/path/to/img.jpg "Optional title")\n\ntext'
assert md('<img src="/path/to/img.jpg" alt="Alt text" />text') == '![Alt text](/path/to/img.jpg)\n\ntext'
assert md("<custom-tag>text</custom-tag>") == "convert_custom_tag(): text"
assert md("<h1>text</h1>") == "convert_h1: text"
assert md("<h3>text</h3>") == "convert_hN(3): text"
def test_soup():
html = '<b>test</b>'
soup = BeautifulSoup(html, 'html.parser')
assert MarkdownConverter().convert_soup(soup) == '**test**'
assert md('<img src="/path/to/img.jpg" alt="Alt text" title="Optional title" />') == '![Alt text](/path/to/img.jpg "Optional title")\n\n'
assert md('<img src="/path/to/img.jpg" alt="Alt text" />') == '![Alt text](/path/to/img.jpg)\n\n'

View File

@@ -1,20 +1,12 @@
import warnings
from bs4 import MarkupResemblesLocatorWarning
from .utils import md
def test_asterisks():
assert md('*hey*dude*') == r'\*hey\*dude\*'
assert md('*hey*dude*', escape_asterisks=False) == r'*hey*dude*'
from markdownify import markdownify as md
def test_underscore():
assert md('_hey_dude_') == r'\_hey\_dude\_'
assert md('_hey_dude_', escape_underscores=False) == r'_hey_dude_'
def test_xml_entities():
assert md('&amp;', escape_misc=True) == r'\&'
assert md('&amp;') == '&'
def test_named_entities():
@@ -27,51 +19,4 @@ def test_hexadecimal_entities():
def test_single_escaping_entities():
assert md('&amp;amp;', escape_misc=True) == r'\&amp;'
def test_misc():
# ignore the bs4 warning that "1.2" or "*" looks like a filename
warnings.filterwarnings("ignore", category=MarkupResemblesLocatorWarning)
assert md('\\*', escape_misc=True) == r'\\\*'
assert md('&lt;foo>', escape_misc=True) == r'\<foo\>'
assert md('# foo', escape_misc=True) == r'\# foo'
assert md('#5', escape_misc=True) == r'#5'
assert md('5#', escape_misc=True) == '5#'
assert md('####### foo', escape_misc=True) == r'####### foo'
assert md('> foo', escape_misc=True) == r'\> foo'
assert md('~~foo~~', escape_misc=True) == r'\~\~foo\~\~'
assert md('foo\n===\n', escape_misc=True) == 'foo\n\\=\\=\\=\n'
assert md('---\n', escape_misc=True) == '\\---\n'
assert md('- test', escape_misc=True) == r'\- test'
assert md('x - y', escape_misc=True) == r'x \- y'
assert md('test-case', escape_misc=True) == 'test-case'
assert md('x-', escape_misc=True) == 'x-'
assert md('-y', escape_misc=True) == '-y'
assert md('+ x\n+ y\n', escape_misc=True) == '\\+ x\n\\+ y\n'
assert md('`x`', escape_misc=True) == r'\`x\`'
assert md('[text](notalink)', escape_misc=True) == r'\[text\](notalink)'
assert md('<a href="link">text]</a>', escape_misc=True) == r'[text\]](link)'
assert md('<a href="link">[text]</a>', escape_misc=True) == r'[\[text\]](link)'
assert md('1. x', escape_misc=True) == r'1\. x'
# assert md('1<span>.</span> x', escape_misc=True) == r'1\. x'
assert md('<span>1.</span> x', escape_misc=True) == r'1\. x'
assert md(' 1. x', escape_misc=True) == r' 1\. x'
assert md('123456789. x', escape_misc=True) == r'123456789\. x'
assert md('1234567890. x', escape_misc=True) == r'1234567890. x'
assert md('A1. x', escape_misc=True) == r'A1. x'
assert md('1.2', escape_misc=True) == r'1.2'
assert md('not a number. x', escape_misc=True) == r'not a number. x'
assert md('1) x', escape_misc=True) == r'1\) x'
# assert md('1<span>)</span> x', escape_misc=True) == r'1\) x'
assert md('<span>1)</span> x', escape_misc=True) == r'1\) x'
assert md(' 1) x', escape_misc=True) == r' 1\) x'
assert md('123456789) x', escape_misc=True) == r'123456789\) x'
assert md('1234567890) x', escape_misc=True) == r'1234567890) x'
assert md('(1) x', escape_misc=True) == r'(1) x'
assert md('A1) x', escape_misc=True) == r'A1) x'
assert md('1)x', escape_misc=True) == r'1)x'
assert md('not a number) x', escape_misc=True) == r'not a number) x'
assert md('|not table|', escape_misc=True) == r'\|not table\|'
assert md(r'\ &lt;foo> &amp;amp; | ` `', escape_misc=False) == r'\ <foo> &amp; | ` `'
assert md('&amp;amp;') == '&amp;'

View File

@@ -1,4 +1,4 @@
from .utils import md
from markdownify import markdownify as md
nested_uls = """
@@ -41,22 +41,16 @@ nested_ols = """
def test_ol():
assert md('<ol><li>a</li><li>b</li></ol>') == '\n\n1. a\n2. b\n'
assert md('<ol><!--comment--><li>a</li><span/><li>b</li></ol>') == '\n\n1. a\n2. b\n'
assert md('<ol start="3"><li>a</li><li>b</li></ol>') == '\n\n3. a\n4. b\n'
assert md('foo<ol start="3"><li>a</li><li>b</li></ol>bar') == 'foo\n\n3. a\n4. b\n\nbar'
assert md('<ol start="-1"><li>a</li><li>b</li></ol>') == '\n\n1. a\n2. b\n'
assert md('<ol start="foo"><li>a</li><li>b</li></ol>') == '\n\n1. a\n2. b\n'
assert md('<ol start="1.5"><li>a</li><li>b</li></ol>') == '\n\n1. a\n2. b\n'
assert md('<ol start="1234"><li><p>first para</p><p>second para</p></li><li><p>third para</p><p>fourth para</p></li></ol>') == '\n\n1234. first para\n\n second para\n1235. third para\n\n fourth para\n'
assert md('<ol><li>a</li><li>b</li></ol>') == '1. a\n2. b\n'
assert md('<ol start="3"><li>a</li><li>b</li></ol>') == '3. a\n4. b\n'
def test_nested_ols():
assert md(nested_ols) == '\n\n1. 1\n 1. a\n 1. I\n 2. II\n 3. III\n 2. b\n 3. c\n2. 2\n3. 3\n'
assert md(nested_ols) == '\n1. 1\n\t1. a\n\t\t1. I\n\t\t2. II\n\t\t3. III\n\t2. b\n\t3. c\n2. 2\n3. 3\n'
def test_ul():
assert md('<ul><li>a</li><li>b</li></ul>') == '\n\n* a\n* b\n'
assert md('<ul><li>a</li><li>b</li></ul>') == '* a\n* b\n'
assert md("""<ul>
<li>
a
@@ -64,13 +58,11 @@ def test_ul():
<li> b </li>
<li> c
</li>
</ul>""") == '\n\n* a\n* b\n* c\n'
assert md('<ul><li><p>first para</p><p>second para</p></li><li><p>third para</p><p>fourth para</p></li></ul>') == '\n\n* first para\n\n second para\n* third para\n\n fourth para\n'
</ul>""") == '* a\n* b\n* c\n'
def test_inline_ul():
assert md('<p>foo</p><ul><li>a</li><li>b</li></ul><p>bar</p>') == '\n\nfoo\n\n* a\n* b\n\nbar\n\n'
assert md('foo<ul><li>bar</li></ul>baz') == 'foo\n\n* bar\n\nbaz'
assert md('<p>foo</p><ul><li>a</li><li>b</li></ul><p>bar</p>') == 'foo\n\n* a\n* b\n\nbar\n\n'
def test_nested_uls():
@@ -78,12 +70,12 @@ def test_nested_uls():
Nested ULs should alternate bullet characters.
"""
assert md(nested_uls) == '\n\n* 1\n + a\n - I\n - II\n - III\n + b\n + c\n* 2\n* 3\n'
assert md(nested_uls) == '\n* 1\n\t+ a\n\t\t- I\n\t\t- II\n\t\t- III\n\t+ b\n\t+ c\n* 2\n* 3\n'
def test_bullets():
assert md(nested_uls, bullets='-') == '\n\n- 1\n - a\n - I\n - II\n - III\n - b\n - c\n- 2\n- 3\n'
assert md(nested_uls, bullets='-') == '\n- 1\n\t- a\n\t\t- I\n\t\t- II\n\t\t- III\n\t- b\n\t- c\n- 2\n- 3\n'
def test_li_text():
assert md('<ul><li>foo <a href="#">bar</a></li><li>foo bar </li><li>foo <b>bar</b> <i>space</i>.</ul>') == '\n\n* foo [bar](#)\n* foo bar\n* foo **bar** *space*.\n'
assert md('<ul><li>foo <a href="#">bar</a></li><li>foo bar </li><li>foo <b>bar</b> <i>space</i>.</ul>') == '* foo [bar](#)\n* foo bar\n* foo **bar** *space*.\n'

View File

@@ -1,4 +1,4 @@
from .utils import md
from markdownify import markdownify as md
table = """<table>
@@ -57,26 +57,6 @@ table_with_paragraphs = """<table>
</tr>
</table>"""
table_with_linebreaks = """<table>
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Age</th>
</tr>
<tr>
<td>Jill</td>
<td>Smith
Jackson</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson
Smith</td>
<td>94</td>
</tr>
</table>"""
table_with_header_column = """<table>
<tr>
@@ -119,55 +99,6 @@ table_head_body = """<table>
</tbody>
</table>"""
table_head_body_missing_head = """<table>
<thead>
<tr>
<td>Firstname</td>
<td>Lastname</td>
<td>Age</td>
</tr>
</thead>
<tbody>
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</tbody>
</table>"""
table_head_body_multiple_head = """<table>
<thead>
<tr>
<td>Creator</td>
<td>Editor</td>
<td>Server</td>
</tr>
<tr>
<td>Operator</td>
<td>Manager</td>
<td>Engineer</td>
</tr>
</thead>
<tbody>
<tr>
<td>Bob</td>
<td>Oliver</td>
<td>Tom</td>
</tr>
<tr>
<td>Thomas</td>
<td>Lucas</td>
<td>Ethan</td>
</tr>
</tbody>
</table>"""
table_missing_text = """<table>
<thead>
<tr>
@@ -208,114 +139,12 @@ table_missing_head = """<table>
</tr>
</table>"""
table_body = """<table>
<tbody>
<tr>
<td>Firstname</td>
<td>Lastname</td>
<td>Age</td>
</tr>
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</tbody>
</table>"""
table_with_caption = """TEXT<table>
<caption>
Caption
</caption>
<tbody><tr><td>Firstname</td>
<td>Lastname</td>
<td>Age</td>
</tr>
</tbody>
</table>"""
table_with_colspan = """<table>
<tr>
<th colspan="2">Name</th>
<th>Age</th>
</tr>
<tr>
<td colspan="1">Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>"""
table_with_undefined_colspan = """<table>
<tr>
<th colspan="undefined">Name</th>
<th>Age</th>
</tr>
<tr>
<td colspan="-1">Jill</td>
<td>Smith</td>
</tr>
</table>"""
table_with_colspan_missing_head = """<table>
<tr>
<td colspan="2">Name</td>
<td>Age</td>
</tr>
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>"""
def test_table():
assert md(table) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_with_html_content) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| **Jill** | *Smith* | [50](#) |\n| Eve | Jackson | 94 |\n\n'
assert md(table_with_paragraphs) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_with_linebreaks) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith Jackson | 50 |\n| Eve | Jackson Smith | 94 |\n\n'
assert md(table_with_header_column) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_head_body) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_head_body_multiple_head) == '\n\n| | | |\n| --- | --- | --- |\n| Creator | Editor | Server |\n| Operator | Manager | Engineer |\n| Bob | Oliver | Tom |\n| Thomas | Lucas | Ethan |\n\n'
assert md(table_head_body_missing_head) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_missing_text) == '\n\n| | Lastname | Age |\n| --- | --- | --- |\n| Jill | | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_missing_head) == '\n\n| | | |\n| --- | --- | --- |\n| Firstname | Lastname | Age |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_body) == '\n\n| | | |\n| --- | --- | --- |\n| Firstname | Lastname | Age |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_with_caption) == 'TEXT\n\nCaption\n\n| | | |\n| --- | --- | --- |\n| Firstname | Lastname | Age |\n\n'
assert md(table_with_colspan) == '\n\n| Name | | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_with_undefined_colspan) == '\n\n| Name | Age |\n| --- | --- |\n| Jill | Smith |\n\n'
assert md(table_with_colspan_missing_head) == '\n\n| | | |\n| --- | --- | --- |\n| Name | | Age |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
def test_table_infer_header():
assert md(table, table_infer_header=True) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_with_html_content, table_infer_header=True) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| **Jill** | *Smith* | [50](#) |\n| Eve | Jackson | 94 |\n\n'
assert md(table_with_paragraphs, table_infer_header=True) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_with_linebreaks, table_infer_header=True) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith Jackson | 50 |\n| Eve | Jackson Smith | 94 |\n\n'
assert md(table_with_header_column, table_infer_header=True) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_head_body, table_infer_header=True) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_head_body_multiple_head, table_infer_header=True) == '\n\n| Creator | Editor | Server |\n| --- | --- | --- |\n| Operator | Manager | Engineer |\n| Bob | Oliver | Tom |\n| Thomas | Lucas | Ethan |\n\n'
assert md(table_head_body_missing_head, table_infer_header=True) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_missing_text, table_infer_header=True) == '\n\n| | Lastname | Age |\n| --- | --- | --- |\n| Jill | | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_missing_head, table_infer_header=True) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_body, table_infer_header=True) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_with_caption, table_infer_header=True) == 'TEXT\n\nCaption\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n\n'
assert md(table_with_colspan, table_infer_header=True) == '\n\n| Name | | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_with_undefined_colspan, table_infer_header=True) == '\n\n| Name | Age |\n| --- | --- |\n| Jill | Smith |\n\n'
assert md(table_with_colspan_missing_head, table_infer_header=True) == '\n\n| Name | | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'

View File

@@ -1,70 +0,0 @@
from markdownify import markdownify, ASTERISK, BACKSLASH, LSTRIP, RSTRIP, SPACES, STRIP, UNDERLINED, UNDERSCORE, MarkdownConverter
from bs4 import BeautifulSoup
from typing import Union
markdownify("<p>Hello</p>") == "Hello" # test default of STRIP
markdownify("<p>Hello</p>", strip_document=LSTRIP) == "Hello\n\n"
markdownify("<p>Hello</p>", strip_document=RSTRIP) == "\n\nHello"
markdownify("<p>Hello</p>", strip_document=STRIP) == "Hello"
markdownify("<p>Hello</p>", strip_document=None) == "\n\nHello\n\n"
# default options
MarkdownConverter(
autolinks=True,
bs4_options='html.parser',
bullets='*+-',
code_language='',
code_language_callback=None,
convert=None,
default_title=False,
escape_asterisks=True,
escape_underscores=True,
escape_misc=False,
heading_style=UNDERLINED,
keep_inline_images_in=[],
newline_style=SPACES,
strip=None,
strip_document=STRIP,
strip_pre=STRIP,
strong_em_symbol=ASTERISK,
sub_symbol='',
sup_symbol='',
table_infer_header=False,
wrap=False,
wrap_width=80,
).convert("")
# custom options
MarkdownConverter(
strip_document=None,
bullets="-",
escape_asterisks=True,
escape_underscores=True,
escape_misc=True,
autolinks=True,
default_title=True,
newline_style=BACKSLASH,
sup_symbol='^',
sub_symbol='^',
keep_inline_images_in=['h3'],
wrap=True,
wrap_width=80,
strong_em_symbol=UNDERSCORE,
code_language='python',
code_language_callback=None
).convert("")
html = '<b>test</b>'
soup = BeautifulSoup(html, 'html.parser')
MarkdownConverter().convert_soup(soup) == '**test**'
def callback(el: BeautifulSoup) -> Union[str, None]:
return el['class'][0] if el.has_attr('class') else None
MarkdownConverter(code_language_callback=callback).convert("")
MarkdownConverter(code_language_callback=lambda el: None).convert("")
markdownify('<pre class="python">test\n foo\nbar</pre>', code_language_callback=callback)
markdownify('<pre class="python">test\n foo\nbar</pre>', code_language_callback=lambda el: None)

View File

@@ -1,9 +0,0 @@
from markdownify import MarkdownConverter
# for unit testing, disable document-level stripping by default so that
# separation newlines are included in testing
def md(html, **options):
options = {"strip_document": None, **options}
return MarkdownConverter(**options).convert(html)

15
tox.ini
View File

@@ -1,15 +0,0 @@
[tox]
envlist = py38
[testenv]
passenv = PYTHONPATH
deps =
pytest==8
flake8
restructuredtext_lint
Pygments
commands =
pytest
flake8 --ignore=E501,W503 markdownify tests
restructuredtext-lint README.rst