Compare commits

46 Commits

Author SHA1 Message Date
AlexVonB
c47709c21c Merge branch 'develop' 2025-08-09 19:41:10 +02:00
AlexVonB
fbc1353593 bump to version v1.2.0 2025-08-09 19:40:43 +02:00
Gareth Jones
85ef82e083 Add basic type stubs (#221) (#215)
* feat: add basic type stubs

* feat: add types for constants

* feat: add type for `MarkdownConverter` class

* ci: add basic job for checking types

* feat: add new constant

* ci: install types as required

* ci: install types package manually

* test: add strict coverage for types

* fix: allow `strip_document` to be `None`

* feat: expand types for MarkdownConverter

* fix: do not use `Unpack` as it requires Python 3.12

* feat: define `MarkdownConverter#convert_soup`

* feat: improve type for `code_language_callback`

* chore: add end-of-file newline

* refactor: use `Union` for now
2025-08-03 06:35:46 -04:00
Gareth Jones
f7053e46ab docs: fix typo (#234) 2025-08-03 06:24:28 -04:00
Gareth Jones
7edbc5a22b ci: update actions/checkout to v4 (#233)
* ci: update `actions/checkout` to v4
2025-07-14 21:52:04 +02:00
alheiveea
76e5edb357 limit colspan values to range [1, 1000] (#232) 2025-07-09 22:08:47 +02:00
Chris Papademetrious
48724e7002 support backticks in <code> spans (#226) (#230)
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-06-29 14:56:21 -04:00
Chris Papademetrious
9b1412aa5b implement a strip_pre configuration option (#218) (#222)
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-06-14 16:37:47 -04:00
Chris Papademetrious
75ab3064dd allow BeautifulSoup configuration kwargs to be specified (#224)
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-06-14 09:06:22 -04:00
Chris Papademetrious
016251e915 ensure that explicitly provided heading conversion functions are used (#212) (#214)
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-05-03 10:57:09 -04:00
Colin
0e1a849346 Add conversion support for <q> tags (#217) 2025-04-28 06:37:33 -04:00
Chris Papademetrious
e29de4e753 make convert_hn() public instead of internal (#213)
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-04-20 06:20:01 -04:00
Vincent Kelleher
2d654a6b7e Add beautiful_soup_parser option (#206)
* add beautiful_soup_parser option
* add Beautiful Soup parser argument to command line

---------

Co-authored-by: Vincent Kelleher <vincent.kelleher-ext@francetravail.fr>
Co-authored-by: AlexVonB <AlexVonB@users.noreply.github.com>
2025-03-29 11:29:29 +01:00
chrispy
26566891a7 Merge branch 'develop' 2025-03-05 06:48:47 -05:00
chrispy
13183f9925 bump to version v1.1.0
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-03-05 06:47:28 -05:00
Stephen V. Brown
7908f1492a Generalize handling of colspan in case where colspan is in first row but header row is missing (#203) 2025-03-04 20:01:16 -05:00
Chris Papademetrious
618747c18c in inline contexts, resolve <br/> to a space instead of an empty string (#202)
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-03-04 07:37:22 -05:00
Chris Papademetrious
5122c973c1 add missing newlines for definition lists (#200)
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-03-02 06:42:56 -05:00
itmammoth
ac5736f0a3 Support video tag with poster attribute (#189) 2025-02-28 10:51:42 +01:00
chrispy
47856cd429 Merge branch 'develop' 2025-02-24 16:20:32 -05:00
chrispy
daa9e28287 bump to version v1.0.0
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-02-24 16:18:23 -05:00
Chris Papademetrious
ba5e222b45 use compiled regex for escaping patterns (#194)
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-02-24 12:29:09 -05:00
Chris Papademetrious
6984dca7ab use a conversion function cache to improve runtime (#196)
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-02-24 11:48:40 -05:00
Chris Papademetrious
24977fd192 rename regex pattern variables (#195)
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-02-19 20:01:12 -05:00
Joseph Myers
c7329ac1ef Escape right square brackets (#187) 2025-02-19 10:04:29 -05:00
Joseph Myers
3311f4d896 Avoid stripping nonbreaking spaces (#188) 2025-02-19 07:40:53 -05:00
Chris Papademetrious
5655f27208 propagate parent tag context downward to improve runtime (#191) 2025-02-18 16:35:36 -05:00
Chris Papademetrious
c52ba47166 use list-based processing (inspired by AlextheYounga) (#186) 2025-02-17 05:47:19 -08:00
Chris Papademetrious
3026602686 make conversion non-destructive to soup; improve div/article/section handling (#184)
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-02-04 18:09:24 -05:00
Chris Papademetrious
c52a50e66a when computing <ol><li> numbering, ignore non-<li> previous siblings (#183)
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-02-04 15:39:32 -05:00
Chris Papademetrious
d0c4b85fd5 simplify computation of convert_children_as_inline variable (#182)
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-02-04 15:20:42 -05:00
Chris Papademetrious
ae0597d80c remove superfluous leading/trailing whitespace (#181) 2025-01-27 11:55:32 -05:00
Chris Papademetrious
dbb5988802 add blank line before/after preformatted block (#179)
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-01-21 11:01:11 -05:00
Chris Papademetrious
f24ec9e83c add blank line before ATX-style headings to avoid ambiguity (#178)
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-01-21 11:00:51 -05:00
Chris Papademetrious
7fec8a2080 code simplification to remove need for children_only parameter (#174)
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-01-19 10:23:58 -05:00
Fess-AKA-DeadMonk
1b3333073a for convert_* functions, allow for tags with special characters in their name (like "subtag-name") (#136)
support custom conversion functions for tags with `:` and `-` characters in their names by mapping them to underscores in the function name
2025-01-19 09:48:08 -05:00
SomeBottle
3bf0b527a4 Add a new configuration option to control tabler header row inference (#161)
Add option to infer first table row as table header (defaults to false)
2025-01-19 08:13:24 -05:00
Chris Papademetrious
1783995cb2 Merge pull request #173 from chrispy-snps/chrispy/support-definition-lists
support HTML definition lists (`<dl>`, `<dt>`, and `<dd>`)
2025-01-18 19:45:03 -05:00
chrispy
0fb855676d support HTML definition lists (<dl>, <dt>, and <dd>)
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-01-18 19:43:28 -05:00
Chris Papademetrious
f73a435315 Merge pull request #171 from chrispy-snps/chrispy/optimize-li-blockquote-empty-lines
optimize empty-line handling for li and blockquote content
2025-01-18 19:30:06 -05:00
chrispy
17c3678d0e optimize empty-line handling for li and blockquote content
Signed-off-by: chrispy <chrispy@synopsys.com>
2025-01-18 19:25:03 -05:00
Chris Papademetrious
600f77d244 allow a wrap_width value of None for unlimited line lengths (#169)
allow a wrap_width value of None to reflow text to unlimited line length
2025-01-18 19:20:22 -05:00
Chris Papademetrious
9339571ae9 Merge pull request #167 from chrispy-snps/chrispy/table-caption-blank-line
insert a blank line between table caption, table content
2025-01-18 19:09:24 -05:00
Chris Papademetrious
5bc3059abf Merge pull request #165 from chrispy-snps/chrispy/fix-a-in-code
do not construct Markdown links in code spans and code blocks
2025-01-18 19:06:51 -05:00
chrispy
1009087d41 insert a blank line between table caption, table content
Signed-off-by: chrispy <chrispy@synopsys.com>
2024-12-29 13:52:32 -05:00
chrispy
71e1471e18 do not construct Markdown links in code spans and code blocks
Signed-off-by: chrispy <chrispy@synopsys.com>
2024-12-29 12:33:46 -05:00
17 changed files with 952 additions and 266 deletions

View File

@@ -15,7 +15,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v4
- name: Set up Python 3.8
uses: actions/setup-python@v2
with:
@@ -30,3 +30,22 @@ jobs:
- name: Build
run: |
python -m build -nwsx .
types:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python 3.8
uses: actions/setup-python@v2
with:
python-version: 3.8
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install --upgrade setuptools setuptools_scm wheel build tox mypy types-beautifulsoup4
- name: Check types
run: |
mypy .
mypy --strict tests/types.py

View File

@@ -13,7 +13,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v2
with:

View File

@@ -110,7 +110,7 @@ code_language_callback
When the HTML code contains ``pre`` tags that in some way provide the code
language, for example as class, this callback can be used to extract the
language from the tag and prefix it to the converted ``pre`` tag.
The callback gets one single argument, an BeautifylSoup object, and returns
The callback gets one single argument, a BeautifulSoup object, and returns
a string containing the code language, or ``None``.
An example to use the class name as code language could be::
@@ -139,10 +139,40 @@ keep_inline_images_in
that should be allowed to contain inline images, for example ``['td']``.
Defaults to an empty list.
table_infer_header
Controls handling of tables with no header row (as indicated by ``<thead>``
or ``<th>``). When set to ``True``, the first body row is used as the header row.
Defaults to ``False``, which leaves the header row empty.
wrap, wrap_width
If ``wrap`` is set to ``True``, all text paragraphs are wrapped at
``wrap_width`` characters. Defaults to ``False`` and ``80``.
Use with ``newline_style=BACKSLASH`` to keep line breaks in paragraphs.
A `wrap_width` value of `None` reflows lines to unlimited line length.
strip_document
Controls whether leading and/or trailing separation newlines are removed from
the final converted document. Supported values are ``LSTRIP`` (leading),
``RSTRIP`` (trailing), ``STRIP`` (both), and ``None`` (neither). Newlines
within the document are unaffected.
Defaults to ``STRIP``.
strip_pre
Controls whether leading/trailing blank lines are removed from ``<pre>``
tags. Supported values are ``STRIP`` (all leading/trailing blank lines),
``STRIP_ONE`` (one leading/trailing blank line), and ``None`` (neither).
Defaults to ``STRIP``.
bs4_options
Specify additional configuration options for the ``BeautifulSoup`` object
used to interpret the HTML markup. String and list values (such as ``lxml``
or ``html5lib``) are treated as ``features`` arguments to control parser
selection. Dictionary values (such as ``{"from_encoding": "iso-8859-8"}``)
are treated as full kwargs to be used for the BeautifulSoup constructor,
allowing specification of any parameter. For parameter details, see the
Beautiful Soup documentation at:
.. _BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Options may be specified as kwargs to the ``markdownify`` function, or as a
nested ``Options`` class in ``MarkdownConverter`` subclasses.
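A minimal sketch of the `bs4_options` rule described in the README text above, mirroring the normalization step that appears later in this diff (a non-dict value becomes a `features` entry). The helper name `normalize_bs4_options` is hypothetical and is not part of the library; it only illustrates the rule.

```python
def normalize_bs4_options(value):
    """Hypothetical helper: turn a bs4_options value into constructor kwargs."""
    # A dict is taken as the full set of keyword arguments.
    if isinstance(value, dict):
        return dict(value)
    # Anything else (a string or a list of strings) is treated as 'features'.
    return {'features': value}


print(normalize_bs4_options('lxml'))
print(normalize_bs4_options(['lxml', 'html5lib']))
print(normalize_bs4_options({'from_encoding': 'iso-8859-8'}))
```

The resulting dict is what the diff later unpacks into the `BeautifulSoup(html, **...)` call.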
@@ -167,7 +197,7 @@ If you have a special usecase that calls for a special conversion, you can
always inherit from ``MarkdownConverter`` and override the method you want to
change.
The function that handles a HTML tag named ``abc`` is called
``convert_abc(self, el, text, convert_as_inline)`` and returns a string
``convert_abc(self, el, text, parent_tags)`` and returns a string
containing the converted HTML tag.
The ``MarkdownConverter`` object will handle the conversion based on the
function names:
@@ -180,8 +210,8 @@ function names:
"""
Create a custom MarkdownConverter that adds two newlines after an image
"""
def convert_img(self, el, text, convert_as_inline):
return super().convert_img(el, text, convert_as_inline) + '\n\n'
def convert_img(self, el, text, parent_tags):
return super().convert_img(el, text, parent_tags) + '\n\n'
# Create shorthand method for conversion
def md(html, **options):
@@ -195,7 +225,7 @@ function names:
"""
Create a custom MarkdownConverter that ignores paragraphs
"""
def convert_p(self, el, text, convert_as_inline):
def convert_p(self, el, text, parent_tags):
return ''
# Create shorthand method for conversion

View File

@@ -1,16 +1,48 @@
from bs4 import BeautifulSoup, NavigableString, Comment, Doctype
from bs4 import BeautifulSoup, Comment, Doctype, NavigableString, Tag
from textwrap import fill
import re
import six
convert_heading_re = re.compile(r'convert_h(\d+)')
line_beginning_re = re.compile(r'^', re.MULTILINE)
whitespace_re = re.compile(r'[\t ]+')
all_whitespace_re = re.compile(r'[\t \r\n]+')
newline_whitespace_re = re.compile(r'[\t \r\n]*[\r\n][\t \r\n]*')
html_heading_re = re.compile(r'h[1-6]')
# General-purpose regex patterns
re_convert_heading = re.compile(r'convert_h(\d+)')
re_line_with_content = re.compile(r'^(.*)', flags=re.MULTILINE)
re_whitespace = re.compile(r'[\t ]+')
re_all_whitespace = re.compile(r'[\t \r\n]+')
re_newline_whitespace = re.compile(r'[\t \r\n]*[\r\n][\t \r\n]*')
re_html_heading = re.compile(r'h(\d+)')
re_pre_lstrip1 = re.compile(r'^ *\n')
re_pre_rstrip1 = re.compile(r'\n *$')
re_pre_lstrip = re.compile(r'^[ \n]*\n')
re_pre_rstrip = re.compile(r'[ \n]*$')
# Pattern for creating convert_<tag> function names from tag names
re_make_convert_fn_name = re.compile(r'[\[\]:-]')
# Extract (leading_nl, content, trailing_nl) from a string
# (functionally equivalent to r'^(\n*)(.*?)(\n*)$', but greedy is faster than reluctant here)
re_extract_newlines = re.compile(r'^(\n*)((?:.*[^\n])?)(\n*)$', flags=re.DOTALL)
# Escape miscellaneous special Markdown characters
re_escape_misc_chars = re.compile(r'([]\\&<`[>~=+|])')
# Escape sequence of one or more consecutive '-', preceded
# and followed by whitespace or start/end of fragment, as it
# might be confused with an underline of a header, or with a
# list marker
re_escape_misc_dash_sequences = re.compile(r'(\s|^)(-+(?:\s|$))')
# Escape sequence of up to six consecutive '#', preceded
# and followed by whitespace or start/end of fragment, as
# it might be confused with an ATX heading
re_escape_misc_hashes = re.compile(r'(\s|^)(#{1,6}(?:\s|$))')
# Escape '.' or ')' preceded by up to nine digits, as it might be
# confused with a list item
re_escape_misc_list_items = re.compile(r'((?:\s|^)[0-9]{1,9})([.)](?:\s|$))')
# Find consecutive backtick sequences in a string
re_backtick_runs = re.compile(r'`+')
# Heading styles
ATX = 'atx'
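To see what the precompiled escaping patterns in this hunk do, the sketch below copies the four `re_escape_misc_*` patterns and applies them in the same order as the `escape()` method further down in the diff. The function name `escape_misc` is hypothetical; it stands in for only the `escape_misc` branch and skips the asterisk and underscore options.

```python
import re

# Patterns copied from the hunk above.
re_escape_misc_chars = re.compile(r'([]\\&<`[>~=+|])')
re_escape_misc_dash_sequences = re.compile(r'(\s|^)(-+(?:\s|$))')
re_escape_misc_hashes = re.compile(r'(\s|^)(#{1,6}(?:\s|$))')
re_escape_misc_list_items = re.compile(r'((?:\s|^)[0-9]{1,9})([.)](?:\s|$))')


def escape_misc(text):
    """Hypothetical standalone version of the escape_misc branch of escape()."""
    # Backslash-escape individual special characters first.
    text = re_escape_misc_chars.sub(r'\\\1', text)
    # Then escape sequences that could be read as Markdown structure.
    text = re_escape_misc_dash_sequences.sub(r'\1\\\2', text)
    text = re_escape_misc_hashes.sub(r'\1\\\2', text)
    text = re_escape_misc_list_items.sub(r'\1\\\2', text)
    return text


for sample in ['# not a heading', '1. not a list', '- not a bullet', 'a [b]']:
    print(escape_misc(sample))
```

Compiling the patterns once at module level avoids passing pattern strings to `re.sub` on every call, which is the stated purpose of the "use compiled regex for escaping patterns" commit.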
@@ -26,6 +58,26 @@ BACKSLASH = 'backslash'
ASTERISK = '*'
UNDERSCORE = '_'
# Document/pre strip styles
LSTRIP = 'lstrip'
RSTRIP = 'rstrip'
STRIP = 'strip'
STRIP_ONE = 'strip_one'
def strip1_pre(text):
"""Strip one leading and trailing newline from a <pre> string."""
text = re_pre_lstrip1.sub('', text)
text = re_pre_rstrip1.sub('', text)
return text
def strip_pre(text):
"""Strip all leading and trailing newlines from a <pre> string."""
text = re_pre_lstrip.sub('', text)
text = re_pre_rstrip.sub('', text)
return text
def chomp(text):
"""
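The difference between the `STRIP_ONE` and `STRIP` behaviours can be checked in isolation. The sketch below copies the two helper functions and their patterns from this diff and runs them on a small sample; the sample strings are illustrative assumptions, not taken from the project's tests.

```python
import re

# Patterns and helpers copied from the diff above.
re_pre_lstrip1 = re.compile(r'^ *\n')
re_pre_rstrip1 = re.compile(r'\n *$')
re_pre_lstrip = re.compile(r'^[ \n]*\n')
re_pre_rstrip = re.compile(r'[ \n]*$')


def strip1_pre(text):
    """Strip one leading and trailing newline from a <pre> string."""
    text = re_pre_lstrip1.sub('', text)
    text = re_pre_rstrip1.sub('', text)
    return text


def strip_pre(text):
    """Strip all leading and trailing newlines from a <pre> string."""
    text = re_pre_lstrip.sub('', text)
    text = re_pre_rstrip.sub('', text)
    return text


sample = '\n\ncode line\n'
print(repr(strip1_pre(sample)))  # one leading blank line is kept
print(repr(strip_pre(sample)))   # all leading blank lines are removed
print(repr(strip_pre('\n    indented\n')))  # first-line indentation survives
```

Note that the leading pattern only consumes whole blank lines, so indentation on the first content line is preserved.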
@@ -48,13 +100,13 @@ def abstract_inline_conversion(markup_fn):
the text if it looks like an HTML tag. markup_fn is necessary to allow for
references to self.strong_em_symbol etc.
"""
def implementation(self, el, text, convert_as_inline):
def implementation(self, el, text, parent_tags):
markup_prefix = markup_fn(self)
if markup_prefix.startswith('<') and markup_prefix.endswith('>'):
markup_suffix = '</' + markup_prefix[1:]
else:
markup_suffix = markup_prefix
if el.find_parent(['pre', 'code', 'kbd', 'samp']):
if '_noformat' in parent_tags:
return text
prefix, suffix, text = chomp(text)
if not text:
@@ -71,10 +123,12 @@ def should_remove_whitespace_inside(el):
"""Return to remove whitespace immediately inside a block-level element."""
if not el or not el.name:
return False
if html_heading_re.match(el.name) is not None:
if re_html_heading.match(el.name) is not None:
return True
return el.name in ('p', 'blockquote',
'article', 'div', 'section',
'ol', 'ul', 'li',
'dl', 'dt', 'dd',
'table', 'thead', 'tbody', 'tfoot',
'tr', 'td', 'th')
@@ -84,9 +138,45 @@ def should_remove_whitespace_outside(el):
return should_remove_whitespace_inside(el) or (el and el.name == 'pre')
def _is_block_content_element(el):
"""
In a block context, returns:
- True for content elements (tags and non-whitespace text)
- False for non-content elements (whitespace text, comments, doctypes)
"""
if isinstance(el, Tag):
return True
elif isinstance(el, (Comment, Doctype)):
return False # (subclasses of NavigableString, must test first)
elif isinstance(el, NavigableString):
return el.strip() != ''
else:
return False
def _prev_block_content_sibling(el):
"""Returns the first previous sibling that is a content element, else None."""
while el is not None:
el = el.previous_sibling
if _is_block_content_element(el):
return el
return None
def _next_block_content_sibling(el):
"""Returns the first next sibling that is a content element, else None."""
while el is not None:
el = el.next_sibling
if _is_block_content_element(el):
return el
return None
class MarkdownConverter(object):
class DefaultOptions:
autolinks = True
bs4_options = 'html.parser'
bullets = '*+-' # An iterable of bullet types.
code_language = ''
code_language_callback = None
@@ -99,9 +189,12 @@ class MarkdownConverter(object):
keep_inline_images_in = []
newline_style = SPACES
strip = None
strip_document = STRIP
strip_pre = STRIP
strong_em_symbol = ASTERISK
sub_symbol = ''
sup_symbol = ''
table_infer_header = False
wrap = False
wrap_width = 80
@@ -118,79 +211,153 @@ class MarkdownConverter(object):
raise ValueError('You may specify either tags to strip or tags to'
' convert, but not both.')
# If a string or list is passed to bs4_options, assume it is a 'features' specification
if not isinstance(self.options['bs4_options'], dict):
self.options['bs4_options'] = {'features': self.options['bs4_options']}
# Initialize the conversion function cache
self.convert_fn_cache = {}
def convert(self, html):
soup = BeautifulSoup(html, 'html.parser')
soup = BeautifulSoup(html, **self.options['bs4_options'])
return self.convert_soup(soup)
def convert_soup(self, soup):
return self.process_tag(soup, convert_as_inline=False, children_only=True)
return self.process_tag(soup, parent_tags=set())
def process_tag(self, node, convert_as_inline, children_only=False):
text = ''
def process_element(self, node, parent_tags=None):
if isinstance(node, NavigableString):
return self.process_text(node, parent_tags=parent_tags)
else:
return self.process_tag(node, parent_tags=parent_tags)
# markdown headings or cells can't include
# block elements (elements w/newlines)
isHeading = html_heading_re.match(node.name) is not None
isCell = node.name in ['td', 'th']
convert_children_as_inline = convert_as_inline
def process_tag(self, node, parent_tags=None):
# For the top-level element, initialize the parent context with an empty set.
if parent_tags is None:
parent_tags = set()
if not children_only and (isHeading or isCell):
convert_children_as_inline = True
# Remove whitespace-only textnodes just before, after or
# inside block-level elements.
# Collect child elements to process, ignoring whitespace-only text elements
# adjacent to the inner/outer boundaries of block elements.
should_remove_inside = should_remove_whitespace_inside(node)
for el in node.children:
# Only extract (remove) whitespace-only text node if any of the
# conditions is true:
# - el is the first element in its parent (block-level)
# - el is the last element in its parent (block-level)
# - el is adjacent to a block-level node
can_extract = (should_remove_inside and (not el.previous_sibling
or not el.next_sibling)
or should_remove_whitespace_outside(el.previous_sibling)
or should_remove_whitespace_outside(el.next_sibling))
if (isinstance(el, NavigableString)
and six.text_type(el).strip() == ''
and can_extract):
el.extract()
# Convert the children first
for el in node.children:
if isinstance(el, Comment) or isinstance(el, Doctype):
continue
def _can_ignore(el):
if isinstance(el, Tag):
# Tags are always processed.
return False
elif isinstance(el, (Comment, Doctype)):
# Comment and Doctype elements are always ignored.
# (subclasses of NavigableString, must test first)
return True
elif isinstance(el, NavigableString):
text += self.process_text(el)
if six.text_type(el).strip() != '':
# Non-whitespace text nodes are always processed.
return False
elif should_remove_inside and (not el.previous_sibling or not el.next_sibling):
# Inside block elements (excluding <pre>), ignore adjacent whitespace elements.
return True
elif should_remove_whitespace_outside(el.previous_sibling) or should_remove_whitespace_outside(el.next_sibling):
# Outside block elements (including <pre>), ignore adjacent whitespace elements.
return True
else:
return False
elif el is None:
return True
else:
text_strip = text.rstrip('\n')
newlines_left = len(text) - len(text_strip)
next_text = self.process_tag(el, convert_children_as_inline)
next_text_strip = next_text.lstrip('\n')
newlines_right = len(next_text) - len(next_text_strip)
newlines = '\n' * max(newlines_left, newlines_right)
text = text_strip + newlines + next_text_strip
raise ValueError('Unexpected element type: %s' % type(el))
if not children_only:
convert_fn = getattr(self, 'convert_%s' % node.name, None)
if convert_fn and self.should_convert_tag(node.name):
text = convert_fn(node, text, convert_as_inline)
children_to_convert = [el for el in node.children if not _can_ignore(el)]
# Create a copy of this tag's parent context, then update it to include this tag
# to propagate down into the children.
parent_tags_for_children = set(parent_tags)
parent_tags_for_children.add(node.name)
# if this tag is a heading or table cell, add an '_inline' parent pseudo-tag
if (
re_html_heading.match(node.name) is not None # headings
or node.name in {'td', 'th'} # table cells
):
parent_tags_for_children.add('_inline')
# if this tag is a preformatted element, add a '_noformat' parent pseudo-tag
if node.name in {'pre', 'code', 'kbd', 'samp'}:
parent_tags_for_children.add('_noformat')
# Convert the children elements into a list of result strings.
child_strings = [
self.process_element(el, parent_tags=parent_tags_for_children)
for el in children_to_convert
]
# Remove empty string values.
child_strings = [s for s in child_strings if s]
# Collapse newlines at child element boundaries, if needed.
if node.name == 'pre' or node.find_parent('pre'):
# Inside <pre> blocks, do not collapse newlines.
pass
else:
# Collapse newlines at child element boundaries.
updated_child_strings = [''] # so the first lookback works
for child_string in child_strings:
# Separate the leading/trailing newlines from the content.
leading_nl, content, trailing_nl = re_extract_newlines.match(child_string).groups()
# If the last child had trailing newlines and this child has leading newlines,
# use the larger newline count, limited to 2.
if updated_child_strings[-1] and leading_nl:
prev_trailing_nl = updated_child_strings.pop() # will be replaced by the collapsed value
num_newlines = min(2, max(len(prev_trailing_nl), len(leading_nl)))
leading_nl = '\n' * num_newlines
# Add the results to the updated child string list.
updated_child_strings.extend([leading_nl, content, trailing_nl])
child_strings = updated_child_strings
# Join all child text strings into a single string.
text = ''.join(child_strings)
# apply this tag's final conversion function
convert_fn = self.get_conv_fn_cached(node.name)
if convert_fn is not None:
text = convert_fn(node, text, parent_tags=parent_tags)
return text
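The newline-collapsing step in the new `process_tag()` can be exercised on its own. The sketch below lifts that loop into a free function; the name `collapse_child_newlines` is hypothetical, and the `<pre>` special case from the diff is left out.

```python
import re

# Pattern copied from the diff above.
re_extract_newlines = re.compile(r'^(\n*)((?:.*[^\n])?)(\n*)$', flags=re.DOTALL)


def collapse_child_newlines(child_strings):
    """Hypothetical standalone version of the collapse loop in process_tag()."""
    updated = ['']  # so the first lookback works
    for child_string in child_strings:
        leading_nl, content, trailing_nl = re_extract_newlines.match(child_string).groups()
        # Where trailing newlines of the previous child meet leading newlines
        # of this child, keep the larger count, limited to 2.
        if updated[-1] and leading_nl:
            prev_trailing_nl = updated.pop()
            leading_nl = '\n' * min(2, max(len(prev_trailing_nl), len(leading_nl)))
        updated.extend([leading_nl, content, trailing_nl])
    return ''.join(updated)


children = ['para one\n\n', '\n\n# Heading\n\n', '\npara two']
print(repr(collapse_child_newlines(children)))
```

A plain `''.join(children)` would leave four newlines between the first paragraph and the heading; the collapse keeps at most one blank line at each boundary.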
def process_text(self, el):
def convert__document_(self, el, text, parent_tags):
"""Final document-level formatting for BeautifulSoup object (node.name == "[document]")"""
if self.options['strip_document'] == LSTRIP:
text = text.lstrip('\n') # remove leading separation newlines
elif self.options['strip_document'] == RSTRIP:
text = text.rstrip('\n') # remove trailing separation newlines
elif self.options['strip_document'] == STRIP:
text = text.strip('\n') # remove leading and trailing separation newlines
elif self.options['strip_document'] is None:
pass # leave leading and trailing separation newlines as-is
else:
raise ValueError('Invalid value for strip_document: %s' % self.options['strip_document'])
return text
def process_text(self, el, parent_tags=None):
# For the top-level element, initialize the parent context with an empty set.
if parent_tags is None:
parent_tags = set()
text = six.text_type(el) or ''
# normalize whitespace if we're not inside a preformatted element
if not el.find_parent('pre'):
if 'pre' not in parent_tags:
if self.options['wrap']:
text = all_whitespace_re.sub(' ', text)
text = re_all_whitespace.sub(' ', text)
else:
text = newline_whitespace_re.sub('\n', text)
text = whitespace_re.sub(' ', text)
text = re_newline_whitespace.sub('\n', text)
text = re_whitespace.sub(' ', text)
# escape special characters if we're not inside a preformatted or code element
if not el.find_parent(['pre', 'code', 'kbd', 'samp']):
text = self.escape(text)
if '_noformat' not in parent_tags:
text = self.escape(text, parent_tags)
# remove leading whitespace at the start or just after a
# block-level element; remove traliing whitespace at the end
@@ -198,7 +365,7 @@ class MarkdownConverter(object):
if (should_remove_whitespace_outside(el.previous_sibling)
or (should_remove_whitespace_inside(el.parent)
and not el.previous_sibling)):
text = text.lstrip()
text = text.lstrip(' \t\r\n')
if (should_remove_whitespace_outside(el.next_sibling)
or (should_remove_whitespace_inside(el.parent)
and not el.next_sibling)):
@@ -206,23 +373,40 @@ class MarkdownConverter(object):
return text
def __getattr__(self, attr):
# Handle headings
m = convert_heading_re.match(attr)
if m:
n = int(m.group(1))
def get_conv_fn_cached(self, tag_name):
"""Given a tag name, return the conversion function using the cache."""
# If conversion function is not in cache, add it
if tag_name not in self.convert_fn_cache:
self.convert_fn_cache[tag_name] = self.get_conv_fn(tag_name)
def convert_tag(el, text, convert_as_inline):
return self._convert_hn(n, el, text, convert_as_inline)
# Return the cached entry
return self.convert_fn_cache[tag_name]
convert_tag.__name__ = 'convert_h%s' % n
setattr(self, convert_tag.__name__, convert_tag)
return convert_tag
def get_conv_fn(self, tag_name):
"""Given a tag name, find and return the conversion function."""
tag_name = tag_name.lower()
raise AttributeError(attr)
# Handle strip/convert exclusion options
if not self.should_convert_tag(tag_name):
return None
# Look for an explicitly defined conversion function by tag name first
convert_fn_name = "convert_%s" % re_make_convert_fn_name.sub("_", tag_name)
convert_fn = getattr(self, convert_fn_name, None)
if convert_fn:
return convert_fn
# If tag is any heading, handle with convert_hN() function
match = re_html_heading.match(tag_name)
if match:
n = int(match.group(1)) # get value of N from <hN>
return lambda el, text, parent_tags: self.convert_hN(n, el, text, parent_tags)
# No conversion function was found
return None
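The tag-name to function-name mapping used by `get_conv_fn()` can be tried out directly. The sketch below copies `re_make_convert_fn_name` from this diff; the wrapper name `convert_fn_name` is hypothetical.

```python
import re

# Pattern copied from the diff above.
re_make_convert_fn_name = re.compile(r'[\[\]:-]')


def convert_fn_name(tag_name):
    """Hypothetical helper: build the convert_<tag> method name for a tag."""
    # '[', ']', ':' and '-' are not valid in Python identifiers,
    # so each one is mapped to an underscore.
    return 'convert_%s' % re_make_convert_fn_name.sub('_', tag_name.lower())


for name in ['subtag-name', 'ns:tag', '[document]', 'H1']:
    print(name, '->', convert_fn_name(name))
```

This is why the document-level handler in the diff is named `convert__document_`: BeautifulSoup names its root object `[document]`.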
def should_convert_tag(self, tag):
tag = tag.lower()
"""Given a tag name, return whether to convert based on strip/convert options."""
strip = self.options['strip']
convert = self.options['convert']
if strip is not None:
@@ -232,38 +416,28 @@ class MarkdownConverter(object):
else:
return True
def escape(self, text):
def escape(self, text, parent_tags):
if not text:
return ''
if self.options['escape_misc']:
text = re.sub(r'([\\&<`[>~=+|])', r'\\\1', text)
# A sequence of one or more consecutive '-', preceded and
# followed by whitespace or start/end of fragment, might
# be confused with an underline of a header, or with a
# list marker.
text = re.sub(r'(\s|^)(-+(?:\s|$))', r'\1\\\2', text)
# A sequence of up to six consecutive '#', preceded and
# followed by whitespace or start/end of fragment, might
# be confused with an ATX heading.
text = re.sub(r'(\s|^)(#{1,6}(?:\s|$))', r'\1\\\2', text)
# '.' or ')' preceded by up to nine digits might be
# confused with a list item.
text = re.sub(r'((?:\s|^)[0-9]{1,9})([.)](?:\s|$))', r'\1\\\2',
text)
text = re_escape_misc_chars.sub(r'\\\1', text)
text = re_escape_misc_dash_sequences.sub(r'\1\\\2', text)
text = re_escape_misc_hashes.sub(r'\1\\\2', text)
text = re_escape_misc_list_items.sub(r'\1\\\2', text)
if self.options['escape_asterisks']:
text = text.replace('*', r'\*')
if self.options['escape_underscores']:
text = text.replace('_', r'\_')
return text
def indent(self, text, columns):
return line_beginning_re.sub(' ' * columns, text) if text else ''
def underline(self, text, pad_char):
text = (text or '').rstrip()
return '\n\n%s\n%s\n\n' % (text, pad_char * len(text)) if text else ''
def convert_a(self, el, text, convert_as_inline):
def convert_a(self, el, text, parent_tags):
if '_noformat' in parent_tags:
return text
prefix, suffix, text = chomp(text)
if not text:
return ''
@@ -283,40 +457,110 @@ class MarkdownConverter(object):
convert_b = abstract_inline_conversion(lambda self: 2 * self.options['strong_em_symbol'])
def convert_blockquote(self, el, text, convert_as_inline):
def convert_blockquote(self, el, text, parent_tags):
# handle some early-exit scenarios
text = (text or '').strip(' \t\r\n')
if '_inline' in parent_tags:
return ' ' + text + ' '
if not text:
return "\n"
if convert_as_inline:
return ' ' + text.strip() + ' '
# indent lines with blockquote marker
def _indent_for_blockquote(match):
line_content = match.group(1)
return '> ' + line_content if line_content else '>'
text = re_line_with_content.sub(_indent_for_blockquote, text)
return '\n' + (line_beginning_re.sub('> ', text.strip()) + '\n\n') if text else ''
return '\n' + text + '\n\n'
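The per-line blockquote indentation can be reproduced outside the class. The sketch below copies `re_line_with_content` and the replacement callback from this diff; the free function `blockquote` is a hypothetical stand-in that omits the `_inline` early exit.

```python
import re

# Pattern copied from the diff above.
re_line_with_content = re.compile(r'^(.*)', flags=re.MULTILINE)


def blockquote(text):
    """Hypothetical standalone version of the block-context path of convert_blockquote()."""
    text = (text or '').strip(' \t\r\n')
    if not text:
        return '\n'

    def _indent_for_blockquote(match):
        line_content = match.group(1)
        # Empty lines get a bare marker, so no trailing space is emitted.
        return '> ' + line_content if line_content else '>'

    text = re_line_with_content.sub(_indent_for_blockquote, text)
    return '\n' + text + '\n\n'


print(repr(blockquote('first\n\nsecond')))
```

Compared with the removed `line_beginning_re.sub('> ', ...)` call, the callback avoids writing `> ` with a trailing space on blank lines inside the quote.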
def convert_br(self, el, text, convert_as_inline):
if convert_as_inline:
return ""
def convert_br(self, el, text, parent_tags):
if '_inline' in parent_tags:
return ' '
if self.options['newline_style'].lower() == BACKSLASH:
return '\\\n'
else:
return ' \n'
def convert_code(self, el, text, convert_as_inline):
if el.parent.name == 'pre':
def convert_code(self, el, text, parent_tags):
if '_noformat' in parent_tags:
return text
converter = abstract_inline_conversion(lambda self: '`')
return converter(self, el, text, convert_as_inline)
prefix, suffix, text = chomp(text)
if not text:
return ''
# Find the maximum number of consecutive backticks in the text, then
# delimit the code span with one more backtick than that
max_backticks = max((len(match) for match in re.findall(re_backtick_runs, text)), default=0)
markup_delimiter = '`' * (max_backticks + 1)
# If the maximum number of backticks is greater than zero, add a space
# to avoid interpretation of inside backticks as literals
if max_backticks > 0:
text = " " + text + " "
return '%s%s%s%s%s' % (prefix, markup_delimiter, text, markup_delimiter, suffix)
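The delimiter selection in the new `convert_code()` can be tested in isolation. The sketch below copies `re_backtick_runs` from this diff; the function name `code_span` is hypothetical, and the `chomp()` handling of surrounding whitespace is left out.

```python
import re

# Pattern copied from the diff above.
re_backtick_runs = re.compile(r'`+')


def code_span(text):
    """Hypothetical standalone version of the delimiter logic in convert_code()."""
    # Longest run of consecutive backticks in the content (0 if there is none).
    max_backticks = max((len(run) for run in re_backtick_runs.findall(text)), default=0)
    # Delimit with one more backtick than that.
    delimiter = '`' * (max_backticks + 1)
    # Pad with spaces so inner backticks do not merge with the delimiter.
    if max_backticks > 0:
        text = ' ' + text + ' '
    return delimiter + text + delimiter


for sample in ['plain', 'a`b', 'x ``` y']:
    print(code_span(sample))
```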
convert_del = abstract_inline_conversion(lambda self: '~~')
def convert_div(self, el, text, parent_tags):
if '_inline' in parent_tags:
return ' ' + text.strip() + ' '
text = text.strip()
return '\n\n%s\n\n' % text if text else ''
convert_article = convert_div
convert_section = convert_div
convert_em = abstract_inline_conversion(lambda self: self.options['strong_em_symbol'])
convert_kbd = convert_code
def _convert_hn(self, n, el, text, convert_as_inline):
""" Method name prefixed with _ to prevent <hn> to call this """
if convert_as_inline:
def convert_dd(self, el, text, parent_tags):
text = (text or '').strip()
if '_inline' in parent_tags:
return ' ' + text + ' '
if not text:
return '\n'
# indent definition content lines by four spaces
def _indent_for_dd(match):
line_content = match.group(1)
return ' ' + line_content if line_content else ''
text = re_line_with_content.sub(_indent_for_dd, text)
# insert definition marker into first-line indent whitespace
text = ':' + text[1:]
return '%s\n' % text
# definition lists are formatted as follows:
# https://pandoc.org/MANUAL.html#definition-lists
# https://michelf.ca/projects/php-markdown/extra/#def-list
convert_dl = convert_div
def convert_dt(self, el, text, parent_tags):
# remove newlines from term text
text = (text or '').strip()
text = re_all_whitespace.sub(' ', text)
if '_inline' in parent_tags:
return ' ' + text + ' '
if not text:
return '\n'
# TODO - format consecutive <dt> elements as directly adjacent lines):
# https://michelf.ca/projects/php-markdown/extra/#def-list
return '\n\n%s\n' % text
def convert_hN(self, n, el, text, parent_tags):
# convert_hN() converts <hN> tags, where N is any integer
if '_inline' in parent_tags:
return text
# prevent MemoryErrors in case of very large n
# Markdown does not support heading depths of n > 6
n = max(1, min(6, n))
style = self.options['heading_style'].lower()
@@ -324,58 +568,77 @@ class MarkdownConverter(object):
if style == UNDERLINED and n <= 2:
line = '=' if n == 1 else '-'
return self.underline(text, line)
text = all_whitespace_re.sub(' ', text)
text = re_all_whitespace.sub(' ', text)
hashes = '#' * n
if style == ATX_CLOSED:
return '\n%s %s %s\n\n' % (hashes, text, hashes)
return '\n%s %s\n\n' % (hashes, text)
return '\n\n%s %s %s\n\n' % (hashes, text, hashes)
return '\n\n%s %s\n\n' % (hashes, text)
def convert_hr(self, el, text, convert_as_inline):
def convert_hr(self, el, text, parent_tags):
return '\n\n---\n\n'
convert_i = convert_em
def convert_img(self, el, text, convert_as_inline):
def convert_img(self, el, text, parent_tags):
alt = el.attrs.get('alt', None) or ''
src = el.attrs.get('src', None) or ''
title = el.attrs.get('title', None) or ''
title_part = ' "%s"' % title.replace('"', r'\"') if title else ''
if (convert_as_inline
if ('_inline' in parent_tags
and el.parent.name not in self.options['keep_inline_images_in']):
return alt
return '![%s](%s%s)' % (alt, src, title_part)
def convert_list(self, el, text, convert_as_inline):
def convert_video(self, el, text, parent_tags):
if ('_inline' in parent_tags
and el.parent.name not in self.options['keep_inline_images_in']):
return text
src = el.attrs.get('src', None) or ''
if not src:
sources = el.find_all('source', attrs={'src': True})
if sources:
src = sources[0].attrs.get('src', None) or ''
poster = el.attrs.get('poster', None) or ''
if src and poster:
return '[![%s](%s)](%s)' % (text, poster, src)
if src:
return '[%s](%s)' % (text, src)
if poster:
return '![%s](%s)' % (text, poster)
return text
def convert_list(self, el, text, parent_tags):
# Converting a list to inline is undefined.
# Ignoring convert_to_inline for list.
# Ignoring inline conversion parents for list.
nested = False
before_paragraph = False
if el.next_sibling and el.next_sibling.name not in ['ul', 'ol']:
next_sibling = _next_block_content_sibling(el)
if next_sibling and next_sibling.name not in ['ul', 'ol']:
before_paragraph = True
while el:
if el.name == 'li':
nested = True
break
el = el.parent
if nested:
# remove trailing newline if nested
if 'li' in parent_tags:
# remove trailing newline if we're in a nested list
return '\n' + text.rstrip()
return '\n\n' + text + ('\n' if before_paragraph else '')
convert_ul = convert_list
convert_ol = convert_list
def convert_li(self, el, text, convert_as_inline):
def convert_li(self, el, text, parent_tags):
# handle some early-exit scenarios
text = (text or '').strip()
if not text:
return "\n"
# determine list item bullet character to use
parent = el.parent
if parent is not None and parent.name == 'ol':
if parent.get("start") and str(parent.get("start")).isnumeric():
start = int(parent.get("start"))
else:
start = 1
bullet = '%s.' % (start + parent.index(el))
bullet = '%s.' % (start + len(el.find_previous_siblings('li')))
else:
depth = -1
while el:
@@ -385,34 +648,44 @@ class MarkdownConverter(object):
bullets = self.options['bullets']
bullet = bullets[depth % len(bullets)]
bullet = bullet + ' '
text = (text or '').strip()
text = self.indent(text, len(bullet))
if text:
text = bullet + text[len(bullet):]
bullet_width = len(bullet)
bullet_indent = ' ' * bullet_width
# indent content lines by bullet width
def _indent_for_li(match):
line_content = match.group(1)
return bullet_indent + line_content if line_content else ''
text = re_line_with_content.sub(_indent_for_li, text)
# insert bullet into first-line indent whitespace
text = bullet + text[bullet_width:]
return '%s\n' % text
def convert_p(self, el, text, convert_as_inline):
if convert_as_inline:
return ' ' + text.strip() + ' '
def convert_p(self, el, text, parent_tags):
if '_inline' in parent_tags:
return ' ' + text.strip(' \t\r\n') + ' '
text = text.strip(' \t\r\n')
if self.options['wrap']:
# Preserve newlines (and preceding whitespace) resulting
# from <br> tags. Newlines in the input have already been
# replaced by spaces.
lines = text.split('\n')
new_lines = []
for line in lines:
line = line.lstrip()
line_no_trailing = line.rstrip()
trailing = line[len(line_no_trailing):]
line = fill(line,
width=self.options['wrap_width'],
break_long_words=False,
break_on_hyphens=False)
new_lines.append(line + trailing)
text = '\n'.join(new_lines)
if self.options['wrap_width'] is not None:
lines = text.split('\n')
new_lines = []
for line in lines:
line = line.lstrip(' \t\r\n')
line_no_trailing = line.rstrip()
trailing = line[len(line_no_trailing):]
line = fill(line,
width=self.options['wrap_width'],
break_long_words=False,
break_on_hyphens=False)
new_lines.append(line + trailing)
text = '\n'.join(new_lines)
return '\n\n%s\n\n' % text if text else ''
def convert_pre(self, el, text, convert_as_inline):
def convert_pre(self, el, text, parent_tags):
if not text:
return ''
code_language = self.options['code_language']
@@ -420,12 +693,24 @@ class MarkdownConverter(object):
if self.options['code_language_callback']:
code_language = self.options['code_language_callback'](el) or code_language
return '\n```%s\n%s\n```\n' % (code_language, text)
if self.options['strip_pre'] == STRIP:
text = strip_pre(text) # remove all leading/trailing newlines
elif self.options['strip_pre'] == STRIP_ONE:
text = strip1_pre(text) # remove one leading/trailing newline
elif self.options['strip_pre'] is None:
pass # leave leading and trailing newlines as-is
else:
raise ValueError('Invalid value for strip_pre: %s' % self.options['strip_pre'])
def convert_script(self, el, text, convert_as_inline):
return '\n\n```%s\n%s\n```\n\n' % (code_language, text)
def convert_q(self, el, text, parent_tags):
return '"' + text + '"'
def convert_script(self, el, text, parent_tags):
return ''
def convert_style(self, el, text, convert_as_inline):
def convert_style(self, el, text, parent_tags):
return ''
convert_s = convert_del
@@ -438,55 +723,70 @@ class MarkdownConverter(object):
convert_sup = abstract_inline_conversion(lambda self: self.options['sup_symbol'])
def convert_table(self, el, text, convert_as_inline):
return '\n\n' + text + '\n'
def convert_table(self, el, text, parent_tags):
return '\n\n' + text.strip() + '\n\n'
def convert_caption(self, el, text, convert_as_inline):
return text + '\n'
def convert_caption(self, el, text, parent_tags):
return text.strip() + '\n\n'
def convert_figcaption(self, el, text, convert_as_inline):
return '\n\n' + text + '\n\n'
def convert_figcaption(self, el, text, parent_tags):
return '\n\n' + text.strip() + '\n\n'
def convert_td(self, el, text, convert_as_inline):
def convert_td(self, el, text, parent_tags):
colspan = 1
if 'colspan' in el.attrs and el['colspan'].isdigit():
colspan = int(el['colspan'])
colspan = max(1, min(1000, int(el['colspan'])))
return ' ' + text.strip().replace("\n", " ") + ' |' * colspan
def convert_th(self, el, text, convert_as_inline):
def convert_th(self, el, text, parent_tags):
colspan = 1
if 'colspan' in el.attrs and el['colspan'].isdigit():
colspan = int(el['colspan'])
colspan = max(1, min(1000, int(el['colspan'])))
return ' ' + text.strip().replace("\n", " ") + ' |' * colspan
def convert_tr(self, el, text, convert_as_inline):
def convert_tr(self, el, text, parent_tags):
cells = el.find_all(['td', 'th'])
is_first_row = el.find_previous_sibling() is None
is_headrow = (
all([cell.name == 'th' for cell in cells])
or (not el.previous_sibling and not el.parent.name == 'tbody')
or (not el.previous_sibling and el.parent.name == 'tbody' and len(el.parent.parent.find_all(['thead'])) < 1)
or (el.parent.name == 'thead'
# avoid multiple tr in thead
and len(el.parent.find_all('tr')) == 1)
)
is_head_row_missing = (
(is_first_row and not el.parent.name == 'tbody')
or (is_first_row and el.parent.name == 'tbody' and len(el.parent.parent.find_all(['thead'])) < 1)
)
overline = ''
underline = ''
if is_headrow and not el.previous_sibling:
# first row and is headline: print headline underline
full_colspan = 0
for cell in cells:
if 'colspan' in cell.attrs and cell['colspan'].isdigit():
full_colspan += int(cell["colspan"])
else:
full_colspan += 1
full_colspan = 0
for cell in cells:
if 'colspan' in cell.attrs and cell['colspan'].isdigit():
full_colspan += max(1, min(1000, int(cell['colspan'])))
else:
full_colspan += 1
if ((is_headrow
or (is_head_row_missing
and self.options['table_infer_header']))
and is_first_row):
# first row and:
# - is headline or
# - headline is missing and header inference is enabled
# print headline underline
underline += '| ' + ' | '.join(['---'] * full_colspan) + ' |' + '\n'
elif (not el.previous_sibling
and (el.parent.name == 'table'
or (el.parent.name == 'tbody'
and not el.parent.previous_sibling))):
elif ((is_head_row_missing
and not self.options['table_infer_header'])
or (is_first_row
and (el.parent.name == 'table'
or (el.parent.name == 'tbody'
and not el.parent.find_previous_sibling())))):
# headline is missing and header inference is disabled or:
# first row, not headline, and:
# - the parent is table or
# - the parent is tbody at the beginning of a table.
# - the parent is table or
# - the parent is tbody at the beginning of a table.
# print empty headline above this row
overline += '| ' + ' | '.join([''] * len(cells)) + ' |' + '\n'
overline += '| ' + ' | '.join(['---'] * len(cells)) + ' |' + '\n'
overline += '| ' + ' | '.join([''] * full_colspan) + ' |' + '\n'
overline += '| ' + ' | '.join(['---'] * full_colspan) + ' |' + '\n'
return overline + '|' + text + '\n' + underline
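
The colspan handling in `convert_td`, `convert_th`, and `convert_tr` above can be exercised in isolation. What follows is a minimal sketch, not code from the library: `clamp_colspan` and `header_separator` are hypothetical helper names, and the raw attribute value is passed in as a plain string (or `None`) instead of being read from a BeautifulSoup element.

```python
def clamp_colspan(raw):
    """Return a column span limited to [1, 1000]; hypothetical helper."""
    # Mirrors the diff: only all-digit strings are honored. Anything else
    # (missing attribute, 'abc', '-3') falls back to a span of 1.
    if raw is not None and raw.isdigit():
        return max(1, min(1000, int(raw)))
    return 1


def header_separator(raw_colspans):
    """Build the '| --- | --- |' row for cells with the given colspans."""
    full_colspan = sum(clamp_colspan(raw) for raw in raw_colspans)
    return '| ' + ' | '.join(['---'] * full_colspan) + ' |'


if __name__ == '__main__':
    print(clamp_colspan('0'))
    print(clamp_colspan('5000'))
    print(header_separator(['2', None]))
```

Because the separator row is built from the summed, clamped spans, a cell with an oversized `colspan` can contribute at most 1000 columns, which is presumably the point of the `[1, 1000]` limit.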

markdownify/__init__.pyi Normal file

@@ -0,0 +1,77 @@
from _typeshed import Incomplete
from typing import Callable, Union
ATX: str
ATX_CLOSED: str
UNDERLINED: str
SETEXT = UNDERLINED
SPACES: str
BACKSLASH: str
ASTERISK: str
UNDERSCORE: str
LSTRIP: str
RSTRIP: str
STRIP: str
STRIP_ONE: str
def markdownify(
html: str,
autolinks: bool = ...,
bs4_options: str = ...,
bullets: str = ...,
code_language: str = ...,
code_language_callback: Union[Callable[[Incomplete], Union[str, None]], None] = ...,
convert: Union[list[str], None] = ...,
default_title: bool = ...,
escape_asterisks: bool = ...,
escape_underscores: bool = ...,
escape_misc: bool = ...,
heading_style: str = ...,
keep_inline_images_in: list[str] = ...,
newline_style: str = ...,
strip: Union[list[str], None] = ...,
strip_document: Union[str, None] = ...,
strip_pre: str = ...,
strong_em_symbol: str = ...,
sub_symbol: str = ...,
sup_symbol: str = ...,
table_infer_header: bool = ...,
wrap: bool = ...,
wrap_width: int = ...,
) -> str: ...
class MarkdownConverter:
def __init__(
self,
autolinks: bool = ...,
bs4_options: str = ...,
bullets: str = ...,
code_language: str = ...,
code_language_callback: Union[Callable[[Incomplete], Union[str, None]], None] = ...,
convert: Union[list[str], None] = ...,
default_title: bool = ...,
escape_asterisks: bool = ...,
escape_underscores: bool = ...,
escape_misc: bool = ...,
heading_style: str = ...,
keep_inline_images_in: list[str] = ...,
newline_style: str = ...,
strip: Union[list[str], None] = ...,
strip_document: Union[str, None] = ...,
strip_pre: str = ...,
strong_em_symbol: str = ...,
sub_symbol: str = ...,
sup_symbol: str = ...,
table_infer_header: bool = ...,
wrap: bool = ...,
wrap_width: int = ...,
) -> None:
...
def convert(self, html: str) -> str:
...
def convert_soup(self, soup: Incomplete) -> str:
...
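
The `code_language_callback` parameter in the stub accepts one element and returns either a language string or `None`. The sketch below shows a callback of that shape, using the same body as the one in `test_lang_callback`. `FakeTag` is a hypothetical stand-in for a parsed element so that the example runs without BeautifulSoup, and `CodeLanguageCallback` is an alias name introduced here for illustration only.

```python
from typing import Callable, Union

# Same shape as the stub's code_language_callback parameter, with `object`
# standing in for the element type that the stub leaves as Incomplete.
CodeLanguageCallback = Callable[[object], Union[str, None]]


class FakeTag:
    """Hypothetical stand-in for a parsed <pre> element (not a bs4 class)."""

    def __init__(self, attrs):
        self.attrs = attrs

    def has_attr(self, name):
        return name in self.attrs

    def __getitem__(self, name):
        return self.attrs[name]


def language_from_class(el) -> Union[str, None]:
    """Use the first CSS class as the fence language, if there is one."""
    return el['class'][0] if el.has_attr('class') else None


# The annotation documents that the function matches the callback type.
callback: CodeLanguageCallback = language_from_class
```

Returning `None` lets the converter fall back to the `code_language` option, as in `self.options['code_language_callback'](el) or code_language` in `convert_pre`.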

markdownify/main.py Normal file → Executable file

@@ -55,15 +55,26 @@ def main(argv=sys.argv[1:]):
parser.add_argument('--no-escape-underscores', dest='escape_underscores',
action='store_false',
help="Do not escape '_' to '\\_' in text.")
parser.add_argument('-i', '--keep-inline-images-in', nargs='*',
parser.add_argument('-i', '--keep-inline-images-in',
default=[],
nargs='*',
help="Images are converted to their alt-text when the images are "
"located inside headlines or table cells. If some inline images "
"should be converted to markdown images instead, this option can "
"be set to a list of parent tags that should be allowed to "
"contain inline images.")
parser.add_argument('--table-infer-header', dest='table_infer_header',
action='store_true',
help="When a table has no header row (as indicated by '<thead>' "
"or '<th>'), use the first body row as the header row.")
parser.add_argument('-w', '--wrap', action='store_true',
help="Wrap all text paragraphs at --wrap-width characters.")
parser.add_argument('--wrap-width', type=int, default=80)
parser.add_argument('--bs4-options',
default='html.parser',
help="Specifies the parser that BeautifulSoup should use to parse "
"the HTML markup. Examples include 'html.parser', 'lxml', and "
"'html5lib'.")
args = parser.parse_args(argv)
print(markdownify(**vars(args)))
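
The effect of adding `default=[]` next to `nargs='*'` can be checked with `argparse` alone. The sketch below is a trimmed-down, hypothetical parser (`build_parser` is not a function in `main.py`) that carries only three of the options from this hunk. Without `default=[]`, `argparse` leaves the attribute as `None` when `-i` is absent, so the explicit default presumably keeps `keep_inline_images_in` a list in every case.

```python
import argparse


def build_parser():
    """Hypothetical, trimmed-down parser with just three of the options."""
    parser = argparse.ArgumentParser()
    parser.add_argument('-i', '--keep-inline-images-in',
                        default=[],
                        nargs='*')
    parser.add_argument('--table-infer-header', dest='table_infer_header',
                        action='store_true')
    parser.add_argument('--bs4-options', default='html.parser')
    return parser


if __name__ == '__main__':
    parser = build_parser()
    # No arguments: every option takes its default.
    print(vars(parser.parse_args([])))
    # Two parent tags plus the header-inference flag.
    print(vars(parser.parse_args(['-i', 'h3', 'td', '--table-infer-header'])))
```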


@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
[project]
name = "markdownify"
version = "0.14.1"
version = "1.2.0"
authors = [{name = "Matthew Tretter", email = "m@tthewwithanm.com"}]
description = "Convert HTML to markdown."
readme = "README.rst"


@@ -1,4 +1,4 @@
from markdownify import markdownify as md
from .utils import md
def test_chomp():


@@ -2,7 +2,8 @@
Test whitelisting/blacklisting of specific tags.
"""
from markdownify import markdownify as md
from markdownify import markdownify, LSTRIP, RSTRIP, STRIP, STRIP_ONE
from .utils import md
def test_strip():
@@ -23,3 +24,24 @@ def test_convert():
def test_do_not_convert():
text = md('<a href="https://github.com/matthewwithanm">Some Text</a>', convert=[])
assert text == 'Some Text'
def test_strip_document():
assert markdownify("<p>Hello</p>") == "Hello" # test default of STRIP
assert markdownify("<p>Hello</p>", strip_document=LSTRIP) == "Hello\n\n"
assert markdownify("<p>Hello</p>", strip_document=RSTRIP) == "\n\nHello"
assert markdownify("<p>Hello</p>", strip_document=STRIP) == "Hello"
assert markdownify("<p>Hello</p>", strip_document=None) == "\n\nHello\n\n"
def test_strip_pre():
assert markdownify("<pre> \n \n Hello \n \n </pre>") == "```\n Hello\n```"
assert markdownify("<pre> \n \n Hello \n \n </pre>", strip_pre=STRIP) == "```\n Hello\n```"
assert markdownify("<pre> \n \n Hello \n \n </pre>", strip_pre=STRIP_ONE) == "```\n \n Hello \n \n```"
assert markdownify("<pre> \n \n Hello \n \n </pre>", strip_pre=None) == "```\n \n \n Hello \n \n \n```"
def test_bs4_options():
assert markdownify("<p>Hello</p>", bs4_options="html.parser") == "Hello"
assert markdownify("<p>Hello</p>", bs4_options=["html.parser"]) == "Hello"
assert markdownify("<p>Hello</p>", bs4_options={"features": "html.parser"}) == "Hello"
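
The expectations in `test_strip_document` amount to trimming newlines from one side, both sides, or neither side of the converted text. The sketch below restates that as a standalone function. `apply_strip_document` is a hypothetical name, the string values assigned to the constants are assumptions (only the constant names appear in this diff), and the `ValueError` branch simply copies the pattern used for `strip_pre` in `convert_pre`.

```python
LSTRIP = 'lstrip'   # assumed values; only the constant names appear above
RSTRIP = 'rstrip'
STRIP = 'strip'


def apply_strip_document(text, mode):
    """Hypothetical helper: trim document-level newlines according to mode."""
    if mode == LSTRIP:
        return text.lstrip('\n')   # remove leading newlines only
    if mode == RSTRIP:
        return text.rstrip('\n')   # remove trailing newlines only
    if mode == STRIP:
        return text.strip('\n')    # remove both
    if mode is None:
        return text                # leave the document as-is
    raise ValueError('Invalid value for strip_document: %s' % mode)
```

Applied to `'\n\nHello\n\n'`, the block-level output for `<p>Hello</p>`, the four modes give the four results asserted in the test.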


@@ -1,4 +1,4 @@
from markdownify import markdownify as md
from .utils import md
def test_single_tag():
@@ -6,7 +6,7 @@ def test_single_tag():
def test_soup():
assert md('<div><span>Hello</div></span>') == 'Hello'
assert md('<div><span>Hello</div></span>') == '\n\nHello\n\n'
def test_whitespace():


@@ -1,4 +1,5 @@
from markdownify import markdownify as md, ATX, ATX_CLOSED, BACKSLASH, SPACES, UNDERSCORE
from markdownify import ATX, ATX_CLOSED, BACKSLASH, SPACES, UNDERSCORE
from .utils import md
def inline_tests(tag, markup):
@@ -39,6 +40,11 @@ def test_a_no_autolinks():
assert md('<a href="https://google.com">https://google.com</a>', autolinks=False) == '[https://google.com](https://google.com)'
def test_a_in_code():
assert md('<code><a href="https://google.com">Google</a></code>') == '`Google`'
assert md('<pre><a href="https://google.com">Google</a></pre>') == '\n\n```\nGoogle\n```\n\n'
def test_b():
assert md('<b>Hello</b>') == '**Hello**'
@@ -53,11 +59,12 @@ def test_b_spaces():
def test_blockquote():
assert md('<blockquote>Hello</blockquote>') == '\n> Hello\n\n'
assert md('<blockquote>\nHello\n</blockquote>') == '\n> Hello\n\n'
assert md('<blockquote>&nbsp;Hello</blockquote>') == '\n> \u00a0Hello\n\n'
def test_blockquote_with_nested_paragraph():
assert md('<blockquote><p>Hello</p></blockquote>') == '\n> Hello\n\n'
assert md('<blockquote><p>Hello</p><p>Hello again</p></blockquote>') == '\n> Hello\n> \n> Hello again\n\n'
assert md('<blockquote><p>Hello</p><p>Hello again</p></blockquote>') == '\n> Hello\n>\n> Hello again\n\n'
def test_blockquote_with_paragraph():
@@ -72,11 +79,8 @@ def test_blockquote_nested():
def test_br():
assert md('a<br />b<br />c') == 'a \nb \nc'
assert md('a<br />b<br />c', newline_style=BACKSLASH) == 'a\\\nb\\\nc'
def test_caption():
assert md('TEXT<figure><figcaption>Caption</figcaption><span>SPAN</span></figure>') == 'TEXT\n\nCaption\n\nSPAN'
assert md('<figure><span>SPAN</span><figcaption>Caption</figcaption></figure>TEXT') == 'SPAN\n\nCaption\n\nTEXT'
assert md('<h1>foo<br />bar</h1>', heading_style=ATX) == '\n\n# foo bar\n\n'
assert md('<td>foo<br />bar</td>', heading_style=ATX) == ' foo bar |'
def test_code():
@@ -97,27 +101,56 @@ def test_code():
assert md('<code>foo<s> bar </s>baz</code>') == '`foo bar baz`'
assert md('<code>foo<sup>bar</sup>baz</code>', sup_symbol='^') == '`foobarbaz`'
assert md('<code>foo<sub>bar</sub>baz</code>', sub_symbol='^') == '`foobarbaz`'
assert md('foo<code>`bar`</code>baz') == 'foo`` `bar` ``baz'
assert md('foo<code>``bar``</code>baz') == 'foo``` ``bar`` ```baz'
assert md('foo<code> `bar` </code>baz') == 'foo `` `bar` `` baz'
def test_dl():
assert md('<dl><dt>term</dt><dd>definition</dd></dl>') == '\n\nterm\n: definition\n\n'
assert md('<dl><dt><p>te</p><p>rm</p></dt><dd>definition</dd></dl>') == '\n\nte rm\n: definition\n\n'
assert md('<dl><dt>term</dt><dd><p>definition-p1</p><p>definition-p2</p></dd></dl>') == '\n\nterm\n: definition-p1\n\n definition-p2\n\n'
assert md('<dl><dt>term</dt><dd><p>definition 1</p></dd><dd><p>definition 2</p></dd></dl>') == '\n\nterm\n: definition 1\n: definition 2\n\n'
assert md('<dl><dt>term 1</dt><dd>definition 1</dd><dt>term 2</dt><dd>definition 2</dd></dl>') == '\n\nterm 1\n: definition 1\n\nterm 2\n: definition 2\n\n'
assert md('<dl><dt>term</dt><dd><blockquote><p>line 1</p><p>line 2</p></blockquote></dd></dl>') == '\n\nterm\n: > line 1\n >\n > line 2\n\n'
assert md('<dl><dt>term</dt><dd><ol><li><p>1</p><ul><li>2a</li><li>2b</li></ul></li><li><p>3</p></li></ol></dd></dl>') == '\n\nterm\n: 1. 1\n\n * 2a\n * 2b\n 2. 3\n\n'
def test_del():
inline_tests('del', '~~')
def test_div():
assert md('Hello</div> World') == 'Hello World'
def test_div_section_article():
for tag in ['div', 'section', 'article']:
assert md(f'<{tag}>456</{tag}>') == '\n\n456\n\n'
assert md(f'123<{tag}>456</{tag}>789') == '123\n\n456\n\n789'
assert md(f'123<{tag}>\n 456 \n</{tag}>789') == '123\n\n456\n\n789'
assert md(f'123<{tag}><p>456</p></{tag}>789') == '123\n\n456\n\n789'
assert md(f'123<{tag}>\n<p>456</p>\n</{tag}>789') == '123\n\n456\n\n789'
assert md(f'123<{tag}><pre>4 5 6</pre></{tag}>789') == '123\n\n```\n4 5 6\n```\n\n789'
assert md(f'123<{tag}>\n<pre>4 5 6</pre>\n</{tag}>789') == '123\n\n```\n4 5 6\n```\n\n789'
assert md(f'123<{tag}>4\n5\n6</{tag}>789') == '123\n\n4\n5\n6\n\n789'
assert md(f'123<{tag}>\n4\n5\n6\n</{tag}>789') == '123\n\n4\n5\n6\n\n789'
assert md(f'123<{tag}>\n<p>\n4\n5\n6\n</p>\n</{tag}>789') == '123\n\n4\n5\n6\n\n789'
assert md(f'<{tag}><h1>title</h1>body</{tag}>', heading_style=ATX) == '\n\n# title\n\nbody\n\n'
def test_em():
inline_tests('em', '*')
def test_figcaption():
assert (md("TEXT<figure><figcaption>\nCaption\n</figcaption><span>SPAN</span></figure>") == "TEXT\n\nCaption\n\nSPAN")
assert (md("<figure><span>SPAN</span><figcaption>\nCaption\n</figcaption></figure>TEXT") == "SPAN\n\nCaption\n\nTEXT")
def test_header_with_space():
assert md('<h3>\n\nHello</h3>') == '\n### Hello\n\n'
assert md('<h3>Hello\n\n\nWorld</h3>') == '\n### Hello World\n\n'
assert md('<h4>\n\nHello</h4>') == '\n#### Hello\n\n'
assert md('<h5>\n\nHello</h5>') == '\n##### Hello\n\n'
assert md('<h5>\n\nHello\n\n</h5>') == '\n##### Hello\n\n'
assert md('<h5>\n\nHello \n\n</h5>') == '\n##### Hello\n\n'
assert md('<h3>\n\nHello</h3>') == '\n\n### Hello\n\n'
assert md('<h3>Hello\n\n\nWorld</h3>') == '\n\n### Hello World\n\n'
assert md('<h4>\n\nHello</h4>') == '\n\n#### Hello\n\n'
assert md('<h5>\n\nHello</h5>') == '\n\n##### Hello\n\n'
assert md('<h5>\n\nHello\n\n</h5>') == '\n\n##### Hello\n\n'
assert md('<h5>\n\nHello \n\n</h5>') == '\n\n##### Hello\n\n'
def test_h1():
@@ -129,24 +162,25 @@ def test_h2():
def test_hn():
assert md('<h3>Hello</h3>') == '\n### Hello\n\n'
assert md('<h4>Hello</h4>') == '\n#### Hello\n\n'
assert md('<h5>Hello</h5>') == '\n##### Hello\n\n'
assert md('<h6>Hello</h6>') == '\n###### Hello\n\n'
assert md('<h3>Hello</h3>') == '\n\n### Hello\n\n'
assert md('<h4>Hello</h4>') == '\n\n#### Hello\n\n'
assert md('<h5>Hello</h5>') == '\n\n##### Hello\n\n'
assert md('<h6>Hello</h6>') == '\n\n###### Hello\n\n'
assert md('<h10>Hello</h10>') == md('<h6>Hello</h6>')
assert md('<hn>Hello</hn>') == md('Hello')
assert md('<h0>Hello</h0>') == md('<h1>Hello</h1>')
assert md('<hx>Hello</hx>') == md('Hello')
def test_hn_chained():
assert md('<h1>First</h1>\n<h2>Second</h2>\n<h3>Third</h3>', heading_style=ATX) == '\n# First\n\n## Second\n\n### Third\n\n'
assert md('X<h1>First</h1>', heading_style=ATX) == 'X\n# First\n\n'
assert md('X<h1>First</h1>', heading_style=ATX_CLOSED) == 'X\n# First #\n\n'
assert md('<h1>First</h1>\n<h2>Second</h2>\n<h3>Third</h3>', heading_style=ATX) == '\n\n# First\n\n## Second\n\n### Third\n\n'
assert md('X<h1>First</h1>', heading_style=ATX) == 'X\n\n# First\n\n'
assert md('X<h1>First</h1>', heading_style=ATX_CLOSED) == 'X\n\n# First #\n\n'
assert md('X<h1>First</h1>') == 'X\n\nFirst\n=====\n\n'
def test_hn_nested_tag_heading_style():
assert md('<h1>A <p>P</p> C </h1>', heading_style=ATX_CLOSED) == '\n# A P C #\n\n'
assert md('<h1>A <p>P</p> C </h1>', heading_style=ATX) == '\n# A P C\n\n'
assert md('<h1>A <p>P</p> C </h1>', heading_style=ATX_CLOSED) == '\n\n# A P C #\n\n'
assert md('<h1>A <p>P</p> C </h1>', heading_style=ATX) == '\n\n# A P C\n\n'
def test_hn_nested_simple_tag():
@@ -162,9 +196,9 @@ def test_hn_nested_simple_tag():
]
for tag, markdown in tag_to_markdown:
assert md('<h3>A <' + tag + '>' + tag + '</' + tag + '> B</h3>') == '\n### A ' + markdown + ' B\n\n'
assert md('<h3>A <' + tag + '>' + tag + '</' + tag + '> B</h3>') == '\n\n### A ' + markdown + ' B\n\n'
assert md('<h3>A <br>B</h3>', heading_style=ATX) == '\n### A B\n\n'
assert md('<h3>A <br>B</h3>', heading_style=ATX) == '\n\n### A B\n\n'
# Nested lists not supported
# assert md('<h3>A <ul><li>li1</i><li>l2</li></ul></h3>', heading_style=ATX) == '\n### A li1 li2 B\n\n'
@@ -177,18 +211,23 @@ def test_hn_nested_img():
("alt='Alt Text' title='Optional title'", "Alt Text", " \"Optional title\""),
]
for image_attributes, markdown, title in image_attributes_to_markdown:
assert md('<h3>A <img src="/path/to/img.jpg" ' + image_attributes + '/> B</h3>') == '\n### A' + (' ' + markdown + ' ' if markdown else ' ') + 'B\n\n'
assert md('<h3>A <img src="/path/to/img.jpg" ' + image_attributes + '/> B</h3>', keep_inline_images_in=['h3']) == '\n### A ![' + markdown + '](/path/to/img.jpg' + title + ') B\n\n'
assert md('<h3>A <img src="/path/to/img.jpg" ' + image_attributes + '/> B</h3>') == '\n\n### A' + (' ' + markdown + ' ' if markdown else ' ') + 'B\n\n'
assert md('<h3>A <img src="/path/to/img.jpg" ' + image_attributes + '/> B</h3>', keep_inline_images_in=['h3']) == '\n\n### A ![' + markdown + '](/path/to/img.jpg' + title + ') B\n\n'
def test_hn_atx_headings():
assert md('<h1>Hello</h1>', heading_style=ATX) == '\n# Hello\n\n'
assert md('<h2>Hello</h2>', heading_style=ATX) == '\n## Hello\n\n'
assert md('<h1>Hello</h1>', heading_style=ATX) == '\n\n# Hello\n\n'
assert md('<h2>Hello</h2>', heading_style=ATX) == '\n\n## Hello\n\n'
def test_hn_atx_closed_headings():
assert md('<h1>Hello</h1>', heading_style=ATX_CLOSED) == '\n# Hello #\n\n'
assert md('<h2>Hello</h2>', heading_style=ATX_CLOSED) == '\n## Hello ##\n\n'
assert md('<h1>Hello</h1>', heading_style=ATX_CLOSED) == '\n\n# Hello #\n\n'
assert md('<h2>Hello</h2>', heading_style=ATX_CLOSED) == '\n\n## Hello ##\n\n'
def test_hn_newlines():
assert md("<h1>H1-1</h1>TEXT<h2>H2-2</h2>TEXT<h1>H1-2</h1>TEXT", heading_style=ATX) == '\n\n# H1-1\n\nTEXT\n\n## H2-2\n\nTEXT\n\n# H1-2\n\nTEXT'
assert md('<h1>H1-1</h1>\n<p>TEXT</p>\n<h2>H2-2</h2>\n<p>TEXT</p>\n<h1>H1-2</h1>\n<p>TEXT</p>', heading_style=ATX) == '\n\n# H1-1\n\nTEXT\n\n## H2-2\n\nTEXT\n\n# H1-2\n\nTEXT\n\n'
def test_head():
@@ -210,15 +249,25 @@ def test_img():
assert md('<img src="/path/to/img.jpg" alt="Alt text" />') == '![Alt text](/path/to/img.jpg)'
def test_video():
assert md('<video src="/path/to/video.mp4" poster="/path/to/img.jpg">text</video>') == '[![text](/path/to/img.jpg)](/path/to/video.mp4)'
assert md('<video src="/path/to/video.mp4">text</video>') == '[text](/path/to/video.mp4)'
assert md('<video><source src="/path/to/video.mp4"/>text</video>') == '[text](/path/to/video.mp4)'
assert md('<video poster="/path/to/img.jpg">text</video>') == '![text](/path/to/img.jpg)'
assert md('<video>text</video>') == 'text'
def test_kbd():
inline_tests('kbd', '`')
def test_p():
assert md('<p>hello</p>') == '\n\nhello\n\n'
assert md("<p><p>hello</p></p>") == "\n\nhello\n\n"
assert md('<p>123456789 123456789</p>') == '\n\n123456789 123456789\n\n'
assert md('<p>123456789\n\n\n123456789</p>') == '\n\n123456789\n123456789\n\n'
assert md('<p>123456789\n\n\n123456789</p>', wrap=True, wrap_width=80) == '\n\n123456789 123456789\n\n'
assert md('<p>123456789\n\n\n123456789</p>', wrap=True, wrap_width=None) == '\n\n123456789 123456789\n\n'
assert md('<p>123456789 123456789</p>', wrap=True, wrap_width=10) == '\n\n123456789\n123456789\n\n'
assert md('<p><a href="https://example.com">Some long link</a></p>', wrap=True, wrap_width=10) == '\n\n[Some long\nlink](https://example.com)\n\n'
assert md('<p>12345<br />67890</p>', wrap=True, wrap_width=10, newline_style=BACKSLASH) == '\n\n12345\\\n67890\n\n'
@@ -232,26 +281,36 @@ def test_p():
assert md('<p>1234 5678 9012<br />67890</p>', wrap=True, wrap_width=10, newline_style=BACKSLASH) == '\n\n1234 5678\n9012\\\n67890\n\n'
assert md('<p>1234 5678 9012<br />67890</p>', wrap=True, wrap_width=10, newline_style=SPACES) == '\n\n1234 5678\n9012 \n67890\n\n'
assert md('First<p>Second</p><p>Third</p>Fourth') == 'First\n\nSecond\n\nThird\n\nFourth'
assert md('<p>&nbsp;x y</p>', wrap=True, wrap_width=80) == '\n\n\u00a0x y\n\n'
def test_pre():
assert md('<pre>test\n foo\nbar</pre>') == '\n```\ntest\n foo\nbar\n```\n'
assert md('<pre><code>test\n foo\nbar</code></pre>') == '\n```\ntest\n foo\nbar\n```\n'
assert md('<pre>*this_should_not_escape*</pre>') == '\n```\n*this_should_not_escape*\n```\n'
assert md('<pre><span>*this_should_not_escape*</span></pre>') == '\n```\n*this_should_not_escape*\n```\n'
assert md('<pre>\t\tthis should\t\tnot normalize</pre>') == '\n```\n\t\tthis should\t\tnot normalize\n```\n'
assert md('<pre><span>\t\tthis should\t\tnot normalize</span></pre>') == '\n```\n\t\tthis should\t\tnot normalize\n```\n'
assert md('<pre>foo<b>\nbar\n</b>baz</pre>') == '\n```\nfoo\nbar\nbaz\n```\n'
assert md('<pre>foo<i>\nbar\n</i>baz</pre>') == '\n```\nfoo\nbar\nbaz\n```\n'
assert md('<pre>foo\n<i>bar</i>\nbaz</pre>') == '\n```\nfoo\nbar\nbaz\n```\n'
assert md('<pre>foo<i>\n</i>baz</pre>') == '\n```\nfoo\nbaz\n```\n'
assert md('<pre>foo<del>\nbar\n</del>baz</pre>') == '\n```\nfoo\nbar\nbaz\n```\n'
assert md('<pre>foo<em>\nbar\n</em>baz</pre>') == '\n```\nfoo\nbar\nbaz\n```\n'
assert md('<pre>foo<code>\nbar\n</code>baz</pre>') == '\n```\nfoo\nbar\nbaz\n```\n'
assert md('<pre>foo<strong>\nbar\n</strong>baz</pre>') == '\n```\nfoo\nbar\nbaz\n```\n'
assert md('<pre>foo<s>\nbar\n</s>baz</pre>') == '\n```\nfoo\nbar\nbaz\n```\n'
assert md('<pre>foo<sup>\nbar\n</sup>baz</pre>', sup_symbol='^') == '\n```\nfoo\nbar\nbaz\n```\n'
assert md('<pre>foo<sub>\nbar\n</sub>baz</pre>', sub_symbol='^') == '\n```\nfoo\nbar\nbaz\n```\n'
assert md('<pre>test\n foo\nbar</pre>') == '\n\n```\ntest\n foo\nbar\n```\n\n'
assert md('<pre><code>test\n foo\nbar</code></pre>') == '\n\n```\ntest\n foo\nbar\n```\n\n'
assert md('<pre>*this_should_not_escape*</pre>') == '\n\n```\n*this_should_not_escape*\n```\n\n'
assert md('<pre><span>*this_should_not_escape*</span></pre>') == '\n\n```\n*this_should_not_escape*\n```\n\n'
assert md('<pre>\t\tthis should\t\tnot normalize</pre>') == '\n\n```\n\t\tthis should\t\tnot normalize\n```\n\n'
assert md('<pre><span>\t\tthis should\t\tnot normalize</span></pre>') == '\n\n```\n\t\tthis should\t\tnot normalize\n```\n\n'
assert md('<pre>foo<b>\nbar\n</b>baz</pre>') == '\n\n```\nfoo\nbar\nbaz\n```\n\n'
assert md('<pre>foo<i>\nbar\n</i>baz</pre>') == '\n\n```\nfoo\nbar\nbaz\n```\n\n'
assert md('<pre>foo\n<i>bar</i>\nbaz</pre>') == '\n\n```\nfoo\nbar\nbaz\n```\n\n'
assert md('<pre>foo<i>\n</i>baz</pre>') == '\n\n```\nfoo\nbaz\n```\n\n'
assert md('<pre>foo<del>\nbar\n</del>baz</pre>') == '\n\n```\nfoo\nbar\nbaz\n```\n\n'
assert md('<pre>foo<em>\nbar\n</em>baz</pre>') == '\n\n```\nfoo\nbar\nbaz\n```\n\n'
assert md('<pre>foo<code>\nbar\n</code>baz</pre>') == '\n\n```\nfoo\nbar\nbaz\n```\n\n'
assert md('<pre>foo<strong>\nbar\n</strong>baz</pre>') == '\n\n```\nfoo\nbar\nbaz\n```\n\n'
assert md('<pre>foo<s>\nbar\n</s>baz</pre>') == '\n\n```\nfoo\nbar\nbaz\n```\n\n'
assert md('<pre>foo<sup>\nbar\n</sup>baz</pre>', sup_symbol='^') == '\n\n```\nfoo\nbar\nbaz\n```\n\n'
assert md('<pre>foo<sub>\nbar\n</sub>baz</pre>', sub_symbol='^') == '\n\n```\nfoo\nbar\nbaz\n```\n\n'
assert md('foo<pre>bar</pre>baz', sub_symbol='^') == 'foo\n\n```\nbar\n```\n\nbaz'
assert md("<p>foo</p>\n<pre>bar</pre>\n</p>baz</p>", sub_symbol="^") == "\n\nfoo\n\n```\nbar\n```\n\nbaz"
def test_q():
assert md('foo <q>quote</q> bar') == 'foo "quote" bar'
assert md('foo <q cite="https://example.com">quote</q> bar') == 'foo "quote" bar'
def test_script():
@@ -294,17 +353,17 @@ def test_sup():
def test_lang():
assert md('<pre>test\n foo\nbar</pre>', code_language='python') == '\n```python\ntest\n foo\nbar\n```\n'
assert md('<pre><code>test\n foo\nbar</code></pre>', code_language='javascript') == '\n```javascript\ntest\n foo\nbar\n```\n'
assert md('<pre>test\n foo\nbar</pre>', code_language='python') == '\n\n```python\ntest\n foo\nbar\n```\n\n'
assert md('<pre><code>test\n foo\nbar</code></pre>', code_language='javascript') == '\n\n```javascript\ntest\n foo\nbar\n```\n\n'
def test_lang_callback():
def callback(el):
return el['class'][0] if el.has_attr('class') else None
assert md('<pre class="python">test\n foo\nbar</pre>', code_language_callback=callback) == '\n```python\ntest\n foo\nbar\n```\n'
assert md('<pre class="javascript"><code>test\n foo\nbar</code></pre>', code_language_callback=callback) == '\n```javascript\ntest\n foo\nbar\n```\n'
assert md('<pre class="javascript"><code class="javascript">test\n foo\nbar</code></pre>', code_language_callback=callback) == '\n```javascript\ntest\n foo\nbar\n```\n'
assert md('<pre class="python">test\n foo\nbar</pre>', code_language_callback=callback) == '\n\n```python\ntest\n foo\nbar\n```\n\n'
assert md('<pre class="javascript"><code>test\n foo\nbar</code></pre>', code_language_callback=callback) == '\n\n```javascript\ntest\n foo\nbar\n```\n\n'
assert md('<pre class="javascript"><code class="javascript">test\n foo\nbar</code></pre>', code_language_callback=callback) == '\n\n```javascript\ntest\n foo\nbar\n```\n\n'
def test_spaces():
@@ -314,4 +373,4 @@ def test_spaces():
assert md('test <blockquote> text </blockquote> after') == 'test\n> text\n\nafter'
assert md(' <ol> <li> x </li> <li> y </li> </ol> ') == '\n\n1. x\n2. y\n'
assert md(' <ul> <li> x </li> <li> y </li> </ul> ') == '\n\n* x\n* y\n'
assert md('test <pre> foo </pre> bar') == 'test\n```\n foo \n```\nbar'
assert md('test <pre> foo </pre> bar') == 'test\n\n```\n foo\n```\n\nbar'
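Note on the fenced-block tests above: the updated expectations wrap each code block in blank lines, and the callback result takes precedence over the `code_language` option. A minimal standalone sketch of that precedence, assuming a hypothetical `FakeElement` stand-in for the parsed tag and a hypothetical `fence` helper (neither is part of markdownify; this is not the library's implementation):

```python
FENCE = "`" * 3  # three backticks, built this way to keep the example fence-safe


class FakeElement:
    """Hypothetical stand-in for a parsed tag; only what the callback needs."""

    def __init__(self, **attrs):
        self.attrs = attrs

    def has_attr(self, name):
        return name in self.attrs

    def __getitem__(self, name):
        return self.attrs[name]


def callback(el):
    # Same callback as in the tests: use the first class name, if any.
    return el['class'][0] if el.has_attr('class') else None


def fence(el, text, code_language='', code_language_callback=None):
    """Assumed behaviour: a truthy callback result overrides code_language."""
    language = code_language
    if code_language_callback is not None:
        language = code_language_callback(el) or code_language
    # Blank lines before and after the block, as in the updated expectations.
    return "\n\n%s%s\n%s\n%s\n\n" % (FENCE, language, text, FENCE)


with_class = FakeElement(**{'class': ['python']})
without_class = FakeElement()
print(repr(fence(with_class, 'test', code_language_callback=callback)))
print(repr(fence(without_class, 'test', code_language='javascript',
                 code_language_callback=callback)))
```

When the callback returns `None`, the sketch falls back to the configured `code_language`, so an element without a `class` attribute still gets a labelled fence.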


@@ -2,21 +2,40 @@ from markdownify import MarkdownConverter
from bs4 import BeautifulSoup
class ImageBlockConverter(MarkdownConverter):
class UnitTestConverter(MarkdownConverter):
    """
    Create a custom MarkdownConverter that adds two newlines after an image
    Create a custom MarkdownConverter for unit tests
    """
    def convert_img(self, el, text, convert_as_inline):
        return super().convert_img(el, text, convert_as_inline) + '\n\n'
    def convert_img(self, el, text, parent_tags):
        """Add two newlines after an image"""
        return super().convert_img(el, text, parent_tags) + '\n\n'
    def convert_custom_tag(self, el, text, parent_tags):
        """Ensure conversion function is found for tags with special characters in name"""
        return "convert_custom_tag(): %s" % text
    def convert_h1(self, el, text, parent_tags):
        """Ensure explicit heading conversion function is used"""
        return "convert_h1: %s" % (text)
    def convert_hN(self, n, el, text, parent_tags):
        """Ensure general heading conversion function is used"""
        return "convert_hN(%d): %s" % (n, text)
def test_img():
def test_custom_conversion_functions():
    # Create shorthand method for conversion
    def md(html, **options):
        return ImageBlockConverter(**options).convert(html)
        return UnitTestConverter(**options).convert(html)
    assert md('<img src="/path/to/img.jpg" alt="Alt text" title="Optional title" />') == '![Alt text](/path/to/img.jpg "Optional title")\n\n'
    assert md('<img src="/path/to/img.jpg" alt="Alt text" />') == '![Alt text](/path/to/img.jpg)\n\n'
    assert md('<img src="/path/to/img.jpg" alt="Alt text" title="Optional title" />text') == '![Alt text](/path/to/img.jpg "Optional title")\n\ntext'
    assert md('<img src="/path/to/img.jpg" alt="Alt text" />text') == '![Alt text](/path/to/img.jpg)\n\ntext'
    assert md("<custom-tag>text</custom-tag>") == "convert_custom_tag(): text"
    assert md("<h1>text</h1>") == "convert_h1: text"
    assert md("<h3>text</h3>") == "convert_hN(3): text"
def test_soup():
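Note on the custom conversion tests above: they check three lookup rules. A tag name with special characters maps to a valid method name, an explicit `convert_h1` wins over the general heading handler, and other headings fall through to `convert_hN`. The sketch below illustrates one way such a lookup could work. `SketchDispatcher` and its `dispatch` method are hypothetical, take simplified arguments, and are not the library's code:

```python
import re


class SketchDispatcher:
    """Hypothetical dispatcher mirroring the behaviour the tests check."""

    def convert_custom_tag(self, text):
        return "convert_custom_tag(): %s" % text

    def convert_h1(self, text):
        return "convert_h1: %s" % text

    def convert_hN(self, n, text):
        return "convert_hN(%d): %s" % (n, text)

    def dispatch(self, tag_name, text):
        name = tag_name.lower()
        # Replace characters that cannot appear in a method name with "_".
        method_name = "convert_" + re.sub(r"[^a-z0-9_]", "_", name)
        method = getattr(self, method_name, None)
        if method is not None:
            return method(text)  # an explicit handler always wins
        heading = re.fullmatch(r"h([1-6])", name)
        if heading:
            return self.convert_hN(int(heading.group(1)), text)
        return text  # no handler: pass the text through unchanged


d = SketchDispatcher()
print(d.dispatch("custom-tag", "text"))
print(d.dispatch("h1", "text"))
print(d.dispatch("h3", "text"))
```

Checking for the explicit method before the heading pattern is what lets a subclass override a single heading level without touching the rest.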


@@ -1,6 +1,6 @@
import warnings
from bs4 import MarkupResemblesLocatorWarning
from markdownify import markdownify as md
from .utils import md
def test_asterisks():
@@ -51,7 +51,9 @@ def test_misc():
assert md('-y', escape_misc=True) == '-y'
assert md('+ x\n+ y\n', escape_misc=True) == '\\+ x\n\\+ y\n'
assert md('`x`', escape_misc=True) == r'\`x\`'
assert md('[text](link)', escape_misc=True) == r'\[text](link)'
assert md('[text](notalink)', escape_misc=True) == r'\[text\](notalink)'
assert md('<a href="link">text]</a>', escape_misc=True) == r'[text\]](link)'
assert md('<a href="link">[text]</a>', escape_misc=True) == r'[\[text\]](link)'
assert md('1. x', escape_misc=True) == r'1\. x'
# assert md('1<span>.</span> x', escape_misc=True) == r'1\. x'
assert md('<span>1.</span> x', escape_misc=True) == r'1\. x'
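Note on the bracket cases above: with `escape_misc=True` the new expectations escape both `[` and `]` in text, while the brackets that form the link syntax itself stay unescaped. A minimal sketch of that idea, assuming a hypothetical `escape_brackets` helper (the library's actual escaping covers many more characters):

```python
import re


def escape_brackets(text):
    """Backslash-escape every square bracket in plain text."""
    return re.sub(r"([\[\]])", r"\\\1", text)


def render_link(text, href):
    # Only the link *text* is escaped; the link syntax is added afterwards.
    return "[%s](%s)" % (escape_brackets(text), href)


print(escape_brackets("[text](notalink)"))
print(render_link("text]", "link"))
print(render_link("[text]", "link"))
```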


@@ -1,4 +1,4 @@
from markdownify import markdownify as md
from .utils import md
nested_uls = """
@@ -42,12 +42,13 @@ nested_ols = """
def test_ol():
assert md('<ol><li>a</li><li>b</li></ol>') == '\n\n1. a\n2. b\n'
assert md('<ol><!--comment--><li>a</li><span/><li>b</li></ol>') == '\n\n1. a\n2. b\n'
assert md('<ol start="3"><li>a</li><li>b</li></ol>') == '\n\n3. a\n4. b\n'
assert md('foo<ol start="3"><li>a</li><li>b</li></ol>bar') == 'foo\n\n3. a\n4. b\n\nbar'
assert md('<ol start="-1"><li>a</li><li>b</li></ol>') == '\n\n1. a\n2. b\n'
assert md('<ol start="foo"><li>a</li><li>b</li></ol>') == '\n\n1. a\n2. b\n'
assert md('<ol start="1.5"><li>a</li><li>b</li></ol>') == '\n\n1. a\n2. b\n'
assert md('<ol start="1234"><li><p>first para</p><p>second para</p></li><li><p>third para</p><p>fourth para</p></li></ol>') == '\n\n1234. first para\n \n second para\n1235. third para\n \n fourth para\n'
assert md('<ol start="1234"><li><p>first para</p><p>second para</p></li><li><p>third para</p><p>fourth para</p></li></ol>') == '\n\n1234. first para\n\n second para\n1235. third para\n\n fourth para\n'
def test_nested_ols():
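Note on the `start` cases in `test_ol` above: a non-negative integer `start` is honoured, while `-1`, `foo` and `1.5` all fall back to numbering from 1. A standalone sketch of that fallback, where `parse_start` and `render_ol` are hypothetical helpers and not markdownify functions:

```python
import re


def parse_start(value):
    """Return the list start number, or 1 if the value is not a plain integer."""
    if value is not None and re.fullmatch(r"[0-9]+", value):
        return int(value)
    return 1


def render_ol(items, start=None):
    first = parse_start(start)
    lines = ["%d. %s" % (first + i, item) for i, item in enumerate(items)]
    return "\n\n" + "\n".join(lines) + "\n"


for raw in ("3", "-1", "foo", "1.5", None):
    print(repr(raw), "->", repr(render_ol(["a", "b"], raw)))
```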
@@ -64,7 +65,7 @@ def test_ul():
<li> c
</li>
</ul>""") == '\n\n* a\n* b\n* c\n'
assert md('<ul><li><p>first para</p><p>second para</p></li><li><p>third para</p><p>fourth para</p></li></ul>') == '\n\n* first para\n \n second para\n* third para\n \n fourth para\n'
assert md('<ul><li><p>first para</p><p>second para</p></li><li><p>third para</p><p>fourth para</p></li></ul>') == '\n\n* first para\n\n second para\n* third para\n\n fourth para\n'
def test_inline_ul():


@@ -1,4 +1,4 @@
from markdownify import markdownify as md
from .utils import md
table = """<table>
@@ -141,6 +141,33 @@ table_head_body_missing_head = """<table>
</tbody>
</table>"""
table_head_body_multiple_head = """<table>
<thead>
<tr>
<td>Creator</td>
<td>Editor</td>
<td>Server</td>
</tr>
<tr>
<td>Operator</td>
<td>Manager</td>
<td>Engineer</td>
</tr>
</thead>
<tbody>
<tr>
<td>Bob</td>
<td>Oliver</td>
<td>Tom</td>
</tr>
<tr>
<td>Thomas</td>
<td>Lucas</td>
<td>Ethan</td>
</tr>
</tbody>
</table>"""
table_missing_text = """<table>
<thead>
<tr>
@@ -201,7 +228,10 @@ table_body = """<table>
</tbody>
</table>"""
table_with_caption = """TEXT<table><caption>Caption</caption>
table_with_caption = """TEXT<table>
<caption>
Caption
</caption>
<tbody><tr><td>Firstname</td>
<td>Lastname</td>
<td>Age</td>
@@ -237,6 +267,23 @@ table_with_undefined_colspan = """<table>
</tr>
</table>"""
table_with_colspan_missing_head = """<table>
<tr>
<td colspan="2">Name</td>
<td>Age</td>
</tr>
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>"""
def test_table():
assert md(table) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
@@ -245,10 +292,30 @@ def test_table():
assert md(table_with_linebreaks) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith Jackson | 50 |\n| Eve | Jackson Smith | 94 |\n\n'
assert md(table_with_header_column) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_head_body) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_head_body_multiple_head) == '\n\n| | | |\n| --- | --- | --- |\n| Creator | Editor | Server |\n| Operator | Manager | Engineer |\n| Bob | Oliver | Tom |\n| Thomas | Lucas | Ethan |\n\n'
assert md(table_head_body_missing_head) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_missing_text) == '\n\n| | Lastname | Age |\n| --- | --- | --- |\n| Jill | | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_missing_head) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_body) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_with_caption) == 'TEXT\n\nCaption\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n\n'
assert md(table_missing_head) == '\n\n| | | |\n| --- | --- | --- |\n| Firstname | Lastname | Age |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_body) == '\n\n| | | |\n| --- | --- | --- |\n| Firstname | Lastname | Age |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_with_caption) == 'TEXT\n\nCaption\n\n| | | |\n| --- | --- | --- |\n| Firstname | Lastname | Age |\n\n'
assert md(table_with_colspan) == '\n\n| Name | | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_with_undefined_colspan) == '\n\n| Name | Age |\n| --- | --- |\n| Jill | Smith |\n\n'
assert md(table_with_colspan_missing_head) == '\n\n| | | |\n| --- | --- | --- |\n| Name | | Age |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
def test_table_infer_header():
assert md(table, table_infer_header=True) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_with_html_content, table_infer_header=True) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| **Jill** | *Smith* | [50](#) |\n| Eve | Jackson | 94 |\n\n'
assert md(table_with_paragraphs, table_infer_header=True) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_with_linebreaks, table_infer_header=True) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith Jackson | 50 |\n| Eve | Jackson Smith | 94 |\n\n'
assert md(table_with_header_column, table_infer_header=True) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_head_body, table_infer_header=True) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_head_body_multiple_head, table_infer_header=True) == '\n\n| Creator | Editor | Server |\n| --- | --- | --- |\n| Operator | Manager | Engineer |\n| Bob | Oliver | Tom |\n| Thomas | Lucas | Ethan |\n\n'
assert md(table_head_body_missing_head, table_infer_header=True) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_missing_text, table_infer_header=True) == '\n\n| | Lastname | Age |\n| --- | --- | --- |\n| Jill | | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_missing_head, table_infer_header=True) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_body, table_infer_header=True) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_with_caption, table_infer_header=True) == 'TEXT\n\nCaption\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n\n'
assert md(table_with_colspan, table_infer_header=True) == '\n\n| Name | | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_with_undefined_colspan, table_infer_header=True) == '\n\n| Name | Age |\n| --- | --- |\n| Jill | Smith |\n\n'
assert md(table_with_colspan_missing_head, table_infer_header=True) == '\n\n| Name | | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
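Note on the two table tests above: when a table has no usable header row, the default output now starts with an empty header and keeps every row in the body, while `table_infer_header=True` promotes the first row to the header. The sketch below models only that choice. `render_table` is a hypothetical helper working on pre-extracted cell text, and its exact whitespace for empty cells is an assumption, not a statement about the library's output:

```python
def render_table(rows, header=None, infer_header=False):
    """rows: list of rows of cell strings; header: explicit header row, if known."""
    body = list(rows)
    if header is None:
        if infer_header and body:
            # Promote the first row to the header.
            header, body = body[0], body[1:]
        else:
            # Emit an empty header so every row stays in the body.
            header = [""] * (len(body[0]) if body else 0)

    def line(cells):
        return "|" + "".join(" %s |" % cell.strip() for cell in cells)

    lines = [line(header), line(["---"] * len(header))]
    lines.extend(line(row) for row in body)
    return "\n".join(lines)


rows = [["Firstname", "Lastname", "Age"],
        ["Jill", "Smith", "50"],
        ["Eve", "Jackson", "94"]]
print(render_table(rows))
print()
print(render_table(rows, infer_header=True))
```

A table with several header rows can be treated the same way in this sketch: pass all rows with `header=None`, which matches the `table_head_body_multiple_head` expectations in both modes.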

tests/types.py (new file, 70 lines)

@@ -0,0 +1,70 @@
from markdownify import markdownify, ASTERISK, BACKSLASH, LSTRIP, RSTRIP, SPACES, STRIP, UNDERLINED, UNDERSCORE, MarkdownConverter
from bs4 import BeautifulSoup
from typing import Union

markdownify("<p>Hello</p>") == "Hello"  # test default of STRIP
markdownify("<p>Hello</p>", strip_document=LSTRIP) == "Hello\n\n"
markdownify("<p>Hello</p>", strip_document=RSTRIP) == "\n\nHello"
markdownify("<p>Hello</p>", strip_document=STRIP) == "Hello"
markdownify("<p>Hello</p>", strip_document=None) == "\n\nHello\n\n"

# default options
MarkdownConverter(
    autolinks=True,
    bs4_options='html.parser',
    bullets='*+-',
    code_language='',
    code_language_callback=None,
    convert=None,
    default_title=False,
    escape_asterisks=True,
    escape_underscores=True,
    escape_misc=False,
    heading_style=UNDERLINED,
    keep_inline_images_in=[],
    newline_style=SPACES,
    strip=None,
    strip_document=STRIP,
    strip_pre=STRIP,
    strong_em_symbol=ASTERISK,
    sub_symbol='',
    sup_symbol='',
    table_infer_header=False,
    wrap=False,
    wrap_width=80,
).convert("")

# custom options
MarkdownConverter(
    strip_document=None,
    bullets="-",
    escape_asterisks=True,
    escape_underscores=True,
    escape_misc=True,
    autolinks=True,
    default_title=True,
    newline_style=BACKSLASH,
    sup_symbol='^',
    sub_symbol='^',
    keep_inline_images_in=['h3'],
    wrap=True,
    wrap_width=80,
    strong_em_symbol=UNDERSCORE,
    code_language='python',
    code_language_callback=None
).convert("")

html = '<b>test</b>'
soup = BeautifulSoup(html, 'html.parser')
MarkdownConverter().convert_soup(soup) == '**test**'


def callback(el: BeautifulSoup) -> Union[str, None]:
    return el['class'][0] if el.has_attr('class') else None


MarkdownConverter(code_language_callback=callback).convert("")
MarkdownConverter(code_language_callback=lambda el: None).convert("")

markdownify('<pre class="python">test\n foo\nbar</pre>', code_language_callback=callback)
markdownify('<pre class="python">test\n foo\nbar</pre>', code_language_callback=lambda el: None)
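Note on the `strip_document` comparisons at the top of this file: they describe four modes, each trimming the separation newlines around the converted document differently. A standalone sketch of those modes follows. The constant values and the `apply_strip_document` helper are stand-ins chosen for illustration, not imports from markdownify:

```python
from typing import Optional

# Stand-in values for illustration only; the real constants come from markdownify.
LSTRIP, RSTRIP, STRIP = "lstrip", "rstrip", "strip"


def apply_strip_document(text: str, mode: Optional[str] = STRIP) -> str:
    """Trim leading and/or trailing newlines according to the chosen mode."""
    if mode == LSTRIP:
        return text.lstrip("\n")
    if mode == RSTRIP:
        return text.rstrip("\n")
    if mode == STRIP:
        return text.strip("\n")
    if mode is None:
        return text  # keep separation newlines untouched
    raise ValueError("Invalid strip mode: %r" % (mode,))


raw = "\n\nHello\n\n"  # a converted paragraph before any document-level stripping
for mode in (STRIP, LSTRIP, RSTRIP, None):
    print(repr(mode), "->", repr(apply_strip_document(raw, mode)))
```

The outputs line up with the right-hand sides of the five comparisons above, which is why `None` has to be an accepted value in the type stubs.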

tests/utils.py (new file, 9 lines)

@@ -0,0 +1,9 @@
from markdownify import MarkdownConverter


# for unit testing, disable document-level stripping by default so that
# separation newlines are included in testing
def md(html, **options):
    options = {"strip_document": None, **options}
    return MarkdownConverter(**options).convert(html)
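Note on the helper above: it relies on dict unpacking order, so the test default comes first and any caller-supplied `strip_document` overrides it. A small stdlib-only illustration of that merge, using a hypothetical `with_test_defaults` function that returns the merged options instead of running a conversion:

```python
def with_test_defaults(**options):
    """Merge caller options over the test default; later keys win."""
    return {"strip_document": None, **options}


print(with_test_defaults())
print(with_test_defaults(wrap=True))
print(with_test_defaults(strip_document="strip", wrap=True))
```

Placing the default before `**options` is the whole trick: swapping the order would silently force `strip_document=None` even when a test asks for something else.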