Compare commits

17 Commits

Author    SHA1        Message                 Date
AlexVonB  652714859d  Merge branch 'develop'  2021-05-21 14:18:14 +02:00
AlexVonB  ea5b22824b  Merge branch 'develop'  2021-05-18 10:42:27 +02:00
AlexVonB  ec5858e42f  Merge branch 'develop'  2021-05-16 18:41:24 +02:00
AlexVonB  02bb914ef3  Merge branch 'develop'  2021-05-02 13:49:30 +02:00
AlexVonB  21c0d034d0  Merge branch 'develop'  2021-05-02 10:51:00 +02:00
AlexVonB  e3ddc789a2  Merge branch 'develop'  2021-04-22 12:43:27 +02:00
AlexVonB  2d0cd97323  Merge branch 'develop'  2021-04-22 12:13:03 +02:00
AlexVonB  ec185e2e9c  Merge branch 'develop'  2021-02-21 23:09:55 +01:00
AlexVonB  079d1721aa  Merge branch 'develop'  2021-02-21 20:58:34 +01:00
AlexVonB  bf24df3e2e  bump to v0.6.3          2021-01-12 22:43:18 +01:00
AlexVonB  15329588b1  Merge branch 'develop'  2021-01-12 22:42:58 +01:00
AlexVonB  34ad8485fa  bump to v0.6.2          2021-01-12 22:40:03 +01:00
AlexVonB  f0ce934bf8  Merge branch 'develop'  2021-01-12 22:39:47 +01:00
AlexVonB  99cd237f27  Merge branch 'develop'  2021-01-04 10:22:02 +01:00
AlexVonB  2bde8d3e8e  Merge branch 'develop'  2021-01-02 16:49:28 +01:00
AlexVonB  8c9b029756  Merge branch 'develop'  2020-09-01 18:10:07 +02:00
AlexVonB  ae50065872  Merge branch 'develop'  2020-08-18 18:53:10 +02:00
14 changed files with 389 additions and 654 deletions

@@ -23,7 +23,11 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install tox
- name: Lint and test
pip install flake8==3.8.4 pytest
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
- name: Lint with flake8
run: |
tox
python setup.py lint
- name: Test with pytest
run: |
python setup.py test

.gitignore

@@ -8,5 +8,3 @@
/MANIFEST
/venv
build/
.vscode/settings.json
.tox/

@@ -32,14 +32,14 @@ Convert some HTML to Markdown:
from markdownify import markdownify as md
md('<b>Yay</b> <a href="http://github.com">GitHub</a>') # > '**Yay** [GitHub](http://github.com)'
Specify tags to exclude:
Specify tags to exclude (blacklist):
.. code:: python
from markdownify import markdownify as md
md('<b>Yay</b> <a href="http://github.com">GitHub</a>', strip=['a']) # > '**Yay** GitHub'
\...or specify the tags you want to include:
\...or specify the tags you want to include (whitelist):
.. code:: python
@@ -53,20 +53,16 @@ Options
Markdownify supports the following options:
strip
A list of tags to strip. This option can't be used with the
A list of tags to strip (blacklist). This option can't be used with the
``convert`` option.
convert
A list of tags to convert. This option can't be used with the
A list of tags to convert (whitelist). This option can't be used with the
``strip`` option.
autolinks
A boolean indicating whether the "automatic link" style should be used when
a ``a`` tag's contents match its href. Defaults to ``True``.
default_title
A boolean to enable setting the title of a link to its href, if no title is
given. Defaults to ``False``.
a ``a`` tag's contents match its href. Defaults to ``True``
heading_style
Defines how headings should be converted. Accepted values are ``ATX``,
@@ -84,106 +80,24 @@ strong_em_symbol
*emphasized* texts. Either of these symbols can be chosen by the options
``ASTERISK`` (default) or ``UNDERSCORE`` respectively.
sub_symbol, sup_symbol
Define the chars that surround ``<sub>`` and ``<sup>`` text. Defaults to an
empty string, because this is non-standard behavior. Could be something like
``~`` and ``^`` to result in ``~sub~`` and ``^sup^``.
newline_style
Defines the style of marking linebreaks (``<br>``) in markdown. The default
value ``SPACES`` of this option will adopt the usual two spaces and a newline,
while ``BACKSLASH`` will convert a linebreak to ``\\n`` (a backslash and a
while ``BACKSLASH`` will convert a linebreak to ``\\n`` (a backslash an a
newline). While the latter convention is non-standard, it is commonly
preferred and supported by a lot of interpreters.
code_language
Defines the language that should be assumed for all ``<pre>`` sections.
Useful, if all code on a page is in the same programming language and
should be annotated with `````python`` or similar.
Defaults to ``''`` (empty string) and can be any string.
code_language_callback
When the HTML code contains ``pre`` tags that in some way provide the code
language, for example as class, this callback can be used to extract the
language from the tag and prefix it to the converted ``pre`` tag.
The callback gets one single argument, a BeautifulSoup object, and returns
a string containing the code language, or ``None``.
An example to use the class name as code language could be::
def callback(el):
return el['class'][0] if el.has_attr('class') else None
Defaults to ``None``.
escape_asterisks
If set to ``False``, do not escape ``*`` to ``\*`` in text.
Defaults to ``True``.
escape_underscores
If set to ``False``, do not escape ``_`` to ``\_`` in text.
Defaults to ``True``.
keep_inline_images_in
Images are converted to their alt-text when the images are located inside
headlines or table cells. If some inline images should be converted to
markdown images instead, this option can be set to a list of parent tags
that should be allowed to contain inline images, for example ``['td']``.
Defaults to an empty list.
wrap, wrap_width
If ``wrap`` is set to ``True``, all text paragraphs are wrapped at
``wrap_width`` characters. Defaults to ``False`` and ``80``.
Use with ``newline_style=BACKSLASH`` to keep line breaks in paragraphs.
Options may be specified as kwargs to the ``markdownify`` function, or as a
nested ``Options`` class in ``MarkdownConverter`` subclasses.
Converting BeautifulSoup objects
================================
.. code:: python
from markdownify import MarkdownConverter
# Create shorthand method for conversion
def md(soup, **options):
return MarkdownConverter(**options).convert_soup(soup)
Creating Custom Converters
==========================
If you have a special usecase that calls for a special conversion, you can
always inherit from ``MarkdownConverter`` and override the method you want to
change:
.. code:: python
from markdownify import MarkdownConverter
class ImageBlockConverter(MarkdownConverter):
"""
Create a custom MarkdownConverter that adds two newlines after an image
"""
def convert_img(self, el, text, convert_as_inline):
return super().convert_img(el, text, convert_as_inline) + '\n\n'
# Create shorthand method for conversion
def md(html, **options):
return ImageBlockConverter(**options).convert(html)
Command Line Interface
=====================
Use ``markdownify example.html > example.md`` or pipe input from stdin
(``cat example.html | markdownify > example.md``).
Call ``markdownify -h`` to see all available options.
They are the same as listed above and take the same arguments.
Development
===========
To run tests and the linter run ``pip install tox`` once, then ``tox``.
To run tests:
``python setup.py test``
To lint:
``python setup.py lint``

@@ -1,5 +1,4 @@
from bs4 import BeautifulSoup, NavigableString, Comment, Doctype
from textwrap import fill
from bs4 import BeautifulSoup, NavigableString, Comment
import re
import six
@@ -26,6 +25,12 @@ ASTERISK = '*'
UNDERSCORE = '_'
def escape(text):
if not text:
return ''
return text.replace('_', r'\_')
def chomp(text):
"""
If the text in an inline tag like b, a, or em contains a leading or trailing
@@ -61,23 +66,13 @@ def _todict(obj):
class MarkdownConverter(object):
class DefaultOptions:
autolinks = True
bullets = '*+-' # An iterable of bullet types.
code_language = ''
code_language_callback = None
convert = None
default_title = False
escape_asterisks = True
escape_underscores = True
heading_style = UNDERLINED
keep_inline_images_in = []
newline_style = SPACES
strip = None
convert = None
autolinks = True
heading_style = UNDERLINED
bullets = '*+-' # An iterable of bullet types.
strong_em_symbol = ASTERISK
sub_symbol = ''
sup_symbol = ''
wrap = False
wrap_width = 80
newline_style = SPACES
class Options(DefaultOptions):
pass
@@ -94,21 +89,15 @@ class MarkdownConverter(object):
def convert(self, html):
soup = BeautifulSoup(html, 'html.parser')
return self.convert_soup(soup)
def convert_soup(self, soup):
return self.process_tag(soup, convert_as_inline=False, children_only=True)
def process_tag(self, node, convert_as_inline, children_only=False):
text = ''
# markdown headings or cells can't include
# block elements (elements w/newlines)
# markdown headings can't include block elements (elements w/newlines)
isHeading = html_heading_re.match(node.name) is not None
isCell = node.name in ['td', 'th']
convert_children_as_inline = convert_as_inline
if not children_only and (isHeading or isCell):
if not children_only and isHeading:
convert_children_as_inline = True
# Remove whitespace-only textnodes in purely nested nodes
@@ -135,7 +124,7 @@ class MarkdownConverter(object):
# Convert the children first
for el in node.children:
if isinstance(el, Comment) or isinstance(el, Doctype):
if isinstance(el, Comment):
continue
elif isinstance(el, NavigableString):
text += self.process_text(el)
@@ -150,26 +139,22 @@ class MarkdownConverter(object):
return text
def process_text(self, el):
text = six.text_type(el) or ''
text = six.text_type(el)
# dont remove any whitespace when handling pre or code in pre
if not (el.parent.name == 'pre'
or (el.parent.name == 'code'
and el.parent.parent.name == 'pre')):
text = whitespace_re.sub(' ', text)
if (el.parent.name == 'pre'
or (el.parent.name == 'code' and el.parent.parent.name == 'pre')):
return escape(text or '')
if el.parent.name != 'code' and el.parent.name != 'pre':
text = self.escape(text)
cleaned_text = escape(whitespace_re.sub(' ', text or ''))
# remove trailing whitespaces if any of the following condition is true:
# - current text node is the last node in li
# - current text node is followed by an embedded list
if (el.parent.name == 'li'
and (not el.next_sibling
or el.next_sibling.name in ['ul', 'ol'])):
text = text.rstrip()
if el.parent.name == 'li' and (not el.next_sibling or el.next_sibling.name in ['ul', 'ol']):
return cleaned_text.rstrip()
return text
return cleaned_text
def __getattr__(self, attr):
# Handle headings
@@ -197,15 +182,6 @@ class MarkdownConverter(object):
else:
return True
def escape(self, text):
if not text:
return ''
if self.options['escape_asterisks']:
text = text.replace('*', r'\*')
if self.options['escape_underscores']:
text = text.replace('_', r'\_')
return text
def indent(self, text, level):
return line_beginning_re.sub('\t' * level, text) if text else ''
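
Review note: this hunk sets a module-level `escape` that only handles underscores against a method that consults the `escape_asterisks` and `escape_underscores` options. The option-driven behaviour can be sketched without the converter class as follows. The function name `escape_markdown` and its keyword arguments are hypothetical stand-ins for the options dictionary, not part of the library.

```python
def escape_markdown(text, escape_asterisks=True, escape_underscores=True):
    """Backslash-escape Markdown emphasis characters in plain text.

    Hypothetical standalone helper; the keyword arguments mirror the
    option names that appear in the diff.
    """
    if not text:
        # None and '' both collapse to an empty string.
        return ''
    if escape_asterisks:
        text = text.replace('*', r'\*')
    if escape_underscores:
        text = text.replace('_', r'\_')
    return text


print(escape_markdown('*hey*dude*'))
print(escape_markdown('_hey_dude_', escape_underscores=False))
```

Keeping both switches on by default matches the defaults listed in `DefaultOptions` on the side of the diff that has these options.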
@@ -217,17 +193,14 @@ class MarkdownConverter(object):
prefix, suffix, text = chomp(text)
if not text:
return ''
if convert_as_inline:
return text
href = el.get('href')
title = el.get('title')
# For the replacement see #29: text nodes underscores are escaped
if (self.options['autolinks']
and text.replace(r'\_', '_') == href
and not title
and not self.options['default_title']):
if self.options['autolinks'] and text.replace(r'\_', '_') == href and not title:
# Shortcut syntax
return '<%s>' % href
if self.options['default_title'] and not title:
title = href
title_part = ' "%s"' % title.replace('"', r'\"') if title else ''
return '%s[%s](%s%s)%s' % (prefix, text, href, title_part, suffix) if href else text
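
Review note: the interaction between `autolinks`, an explicit title, and `default_title` is easier to follow as a pure function. The sketch below is a simplified illustration: `render_link` is a hypothetical name, and it leaves out the `chomp` prefix/suffix handling and the `convert_as_inline` shortcut that the real method performs.

```python
def render_link(text, href, title=None, autolinks=True, default_title=False):
    """Render an anchor as Markdown, following the branch order in the diff.

    Hypothetical helper; `text` is assumed to be already-converted link text,
    which means underscores in it may be backslash-escaped.
    """
    if not text:
        return ''
    if not href:
        return text
    # Undo underscore escaping before comparing the text with the href.
    if (autolinks
            and text.replace(r'\_', '_') == href
            and not title
            and not default_title):
        # Shortcut ("automatic link") syntax.
        return '<%s>' % href
    if default_title and not title:
        title = href
    title_part = ' "%s"' % title.replace('"', r'\"') if title else ''
    return '[%s](%s%s)' % (text, href, title_part)


print(render_link('https://google.com', 'https://google.com'))
print(render_link('https://google.com', 'https://google.com', default_title=True))
```

Note that `default_title=True` suppresses the shortcut form, since an autolink has no place to carry a title.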
@@ -285,8 +258,7 @@ class MarkdownConverter(object):
src = el.attrs.get('src', None) or ''
title = el.attrs.get('title', None) or ''
title_part = ' "%s"' % title.replace('"', r'\"') if title else ''
if (convert_as_inline
and el.parent.name not in self.options['keep_inline_images_in']):
if convert_as_inline:
return alt
return '![%s](%s%s)' % (alt, src, title_part)
@@ -329,27 +301,17 @@ class MarkdownConverter(object):
el = el.parent
bullets = self.options['bullets']
bullet = bullets[depth % len(bullets)]
return '%s %s\n' % (bullet, (text or '').strip())
return '%s %s\n' % (bullet, text or '')
def convert_p(self, el, text, convert_as_inline):
if convert_as_inline:
return text
if self.options['wrap']:
text = fill(text,
width=self.options['wrap_width'],
break_long_words=False,
break_on_hyphens=False)
return '%s\n\n' % text if text else ''
def convert_pre(self, el, text, convert_as_inline):
if not text:
return ''
code_language = self.options['code_language']
if self.options['code_language_callback']:
code_language = self.options['code_language_callback'](el) or code_language
return '\n```%s\n%s\n```\n' % (code_language, text)
return '\n```\n%s\n```\n' % text
convert_s = convert_del
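
Review note: the paragraph wrapping in this hunk relies on `textwrap.fill` with word and hyphen breaking disabled, so long tokens such as URLs stay intact even when they exceed the width. A minimal standalone sketch, where `wrap_paragraph` is a hypothetical name and the keyword arguments stand in for the `wrap` and `wrap_width` options:

```python
from textwrap import fill


def wrap_paragraph(text, wrap=False, wrap_width=80):
    """Format converted paragraph text, optionally hard-wrapping it.

    Hypothetical helper mirroring the wrapping branch shown in the diff.
    """
    if not text:
        return ''
    if wrap:
        # Never split inside a word or at a hyphen: a long link target
        # is allowed to overflow the requested width.
        text = fill(text,
                    width=wrap_width,
                    break_long_words=False,
                    break_on_hyphens=False)
    return '%s\n\n' % text


print(wrap_paragraph('[Some long link](https://example.com)', wrap=True, wrap_width=10))
```

Because `fill` only breaks at whitespace here, a line can be longer than `wrap_width` when a single token is longer than the width.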
@@ -357,10 +319,6 @@ class MarkdownConverter(object):
convert_samp = convert_code
convert_sub = abstract_inline_conversion(lambda self: self.options['sub_symbol'])
convert_sup = abstract_inline_conversion(lambda self: self.options['sup_symbol'])
def convert_table(self, el, text, convert_as_inline):
return '\n\n' + text + '\n'
@@ -378,13 +336,8 @@ class MarkdownConverter(object):
if is_headrow and not el.previous_sibling:
# first row and is headline: print headline underline
underline += '| ' + ' | '.join(['---'] * len(cells)) + ' |' + '\n'
elif (not el.previous_sibling
and (el.parent.name == 'table'
or (el.parent.name == 'tbody'
and not el.parent.previous_sibling))):
# first row, not headline, and:
# - the parent is table or
# - the parent is tbody at the beginning of a table.
elif not el.previous_sibling and not el.parent.name != 'table':
# first row, not headline, and the parent is sth. like tbody:
# print empty headline above this row
overline += '| ' + ' | '.join([''] * len(cells)) + ' |' + '\n'
overline += '| ' + ' | '.join(['---'] * len(cells)) + ' |' + '\n'
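
Review note: the two variants of the `elif` condition in this hunk are not equivalent. The single-line form `not el.parent.name != 'table'` is a double negation that reduces to `el.parent.name == 'table'`, so, as far as the code shown here goes, it does not cover a leading `tbody` even though its comment mentions one. The multi-line form can be sketched as a pure predicate. The function and parameter names below are hypothetical, and plain booleans and strings replace the BeautifulSoup element.

```python
def needs_empty_header(is_headrow, is_first_row, parent_name,
                       parent_is_first_child=True):
    """Decide whether a blank header row must precede this table row.

    Hypothetical stand-in for the row check in the diff:
    - is_first_row          ~ `not el.previous_sibling`
    - parent_name           ~ `el.parent.name`
    - parent_is_first_child ~ `not el.parent.previous_sibling`
    """
    if is_headrow or not is_first_row:
        # Header rows get an underline instead; later rows need nothing.
        return False
    return (parent_name == 'table'
            or (parent_name == 'tbody' and parent_is_first_child))


# First data row directly under <table>: a blank header is required.
print(needs_empty_header(False, True, 'table'))
# First row of a <tbody> that follows a <thead>: no blank header.
print(needs_empty_header(False, True, 'tbody', parent_is_first_child=False))
```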

@@ -1,65 +0,0 @@
#!/usr/bin/env python
import argparse
import sys
from markdownify import markdownify
def main(argv=sys.argv[1:]):
parser = argparse.ArgumentParser(
prog='markdownify',
description='Converts html to markdown.',
)
parser.add_argument('html', nargs='?', type=argparse.FileType('r'),
default=sys.stdin,
help="The html file to convert. Defaults to STDIN if not "
"provided.")
parser.add_argument('-s', '--strip', nargs='*',
help="A list of tags to strip. This option can't be used with "
"the --convert option.")
parser.add_argument('-c', '--convert', nargs='*',
help="A list of tags to convert. This option can't be used with "
"the --strip option.")
parser.add_argument('-a', '--autolinks', action='store_true',
help="A boolean indicating whether the 'automatic link' style "
"should be used when a 'a' tag's contents match its href.")
parser.add_argument('--default-title', action='store_false',
help="A boolean to enable setting the title of a link to its "
"href, if no title is given.")
parser.add_argument('--heading-style',
choices=('ATX', 'ATX_CLOSED', 'SETEXT', 'UNDERLINED'),
help="Defines how headings should be converted.")
parser.add_argument('-b', '--bullets', default='*+-',
help="A string of bullet styles to use; the bullet will "
"alternate based on nesting level.")
parser.add_argument('--sub-symbol', default='',
help="Define the chars that surround '<sub>'.")
parser.add_argument('--sup-symbol', default='',
help="Define the chars that surround '<sup>'.")
parser.add_argument('--code-language', default='',
help="Defines the language that should be assumed for all "
"'<pre>' sections.")
parser.add_argument('--no-escape-asterisks', dest='escape_asterisks',
action='store_false',
help="Do not escape '*' to '\\*' in text.")
parser.add_argument('--no-escape-underscores', dest='escape_underscores',
action='store_false',
help="Do not escape '_' to '\\_' in text.")
parser.add_argument('-i', '--keep-inline-images-in', nargs='*',
help="Images are converted to their alt-text when the images are "
"located inside headlines or table cells. If some inline images "
"should be converted to markdown images instead, this option can "
"be set to a list of parent tags that should be allowed to "
"contain inline images.")
parser.add_argument('-w', '--wrap', action='store_true',
help="Wrap all text paragraphs at --wrap-width characters.")
parser.add_argument('--wrap-width', type=int, default=80)
args = parser.parse_args(argv)
print(markdownify(**vars(args)))
if __name__ == '__main__':
main()

setup.cfg

@@ -0,0 +1,2 @@
[flake8]
ignore = E501 W503

@@ -2,6 +2,7 @@
import codecs
import os
from setuptools import setup, find_packages
from setuptools.command.test import test as TestCommand, Command
read = lambda filepath: codecs.open(filepath, 'r', 'utf-8').read()
@@ -9,10 +10,52 @@ read = lambda filepath: codecs.open(filepath, 'r', 'utf-8').read()
pkgmeta = {
'__title__': 'markdownify',
'__author__': 'Matthew Tretter',
'__version__': '0.11.2',
'__version__': '0.8.0',
}
read = lambda filepath: codecs.open(filepath, 'r', 'utf-8').read()
class PyTest(TestCommand):
def finalize_options(self):
TestCommand.finalize_options(self)
self.test_args = ['tests', '-s']
self.test_suite = True
def run_tests(self):
import pytest
errno = pytest.main(self.test_args)
raise SystemExit(errno)
class LintCommand(Command):
"""
A copy of flake8's Flake8Command
"""
description = "Run flake8 on modules registered in setuptools"
user_options = []
def initialize_options(self):
pass
def finalize_options(self):
pass
def distribution_files(self):
if self.distribution.packages:
for package in self.distribution.packages:
yield package.replace(".", os.path.sep)
if self.distribution.py_modules:
for filename in self.distribution.py_modules:
yield "%s.py" % filename
def run(self):
from flake8.api.legacy import get_style_guide
flake8_style = get_style_guide(config_file='setup.cfg')
paths = self.distribution_files()
report = flake8_style.check_files(paths)
raise SystemExit(report.total_errors > 0)
setup(
name='markdownify',
@@ -26,9 +69,14 @@ setup(
packages=find_packages(),
zip_safe=False,
include_package_data=True,
setup_requires=[
'flake8>=3.8,<4',
],
tests_require=[
'pytest>=6.2,<7',
],
install_requires=[
'beautifulsoup4>=4.9,<5',
'six>=1.15,<2',
'beautifulsoup4>=4.9,<5', 'six>=1.15,<2'
],
classifiers=[
'Environment :: Web Environment',
@@ -44,9 +92,8 @@ setup(
'Programming Language :: Python :: 3.8',
'Topic :: Utilities'
],
entry_points={
'console_scripts': [
'markdownify = markdownify.main:main'
]
}
cmdclass={
'test': PyTest,
'lint': LintCommand,
},
)

@@ -1,17 +1,6 @@
from markdownify import markdownify as md
def test_chomp():
assert md(' <b></b> ') == ' '
assert md(' <b> </b> ') == ' '
assert md(' <b> </b> ') == ' '
assert md(' <b> </b> ') == ' '
assert md(' <b>s </b> ') == ' **s** '
assert md(' <b> s</b> ') == ' **s** '
assert md(' <b> s </b> ') == ' **s** '
assert md(' <b> s </b> ') == ' **s** '
def test_nested():
text = md('<p>This is an <a href="http://example.com/">example link</a>.</p>')
assert text == 'This is an [example link](http://example.com/).\n\n'
@@ -32,8 +21,3 @@ def test_code_with_tricky_content():
assert md('<code>/home/</code><b>username</b>') == "`/home/`**username**"
assert md('First line <code>blah blah<br />blah blah</code> second line') \
== "First line `blah blah \nblah blah` second line"
def test_special_tags():
assert md('<!DOCTYPE html>') == ''
assert md('<![CDATA[foobar]]>') == 'foobar'

@@ -1,17 +1,179 @@
from markdownify import markdownify as md, ATX, ATX_CLOSED, BACKSLASH, UNDERSCORE
def inline_tests(tag, markup):
# test template for different inline tags
assert md(f'<{tag}>Hello</{tag}>') == f'{markup}Hello{markup}'
assert md(f'foo <{tag}>Hello</{tag}> bar') == f'foo {markup}Hello{markup} bar'
assert md(f'foo<{tag}> Hello</{tag}> bar') == f'foo {markup}Hello{markup} bar'
assert md(f'foo <{tag}>Hello </{tag}>bar') == f'foo {markup}Hello{markup} bar'
assert md(f'foo <{tag}></{tag}> bar') in ['foo bar', 'foo bar'] # Either is OK
nested_uls = """
<ul>
<li>1
<ul>
<li>a
<ul>
<li>I</li>
<li>II</li>
<li>III</li>
</ul>
</li>
<li>b</li>
<li>c</li>
</ul>
</li>
<li>2</li>
<li>3</li>
</ul>"""
nested_ols = """
<ol>
<li>1
<ol>
<li>a
<ol>
<li>I</li>
<li>II</li>
<li>III</li>
</ol>
</li>
<li>b</li>
<li>c</li>
</ol>
</li>
<li>2</li>
<li>3</li>
</ul>"""
table = """<table>
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Age</th>
</tr>
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>"""
table_with_html_content = """<table>
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Age</th>
</tr>
<tr>
<td><b>Jill</b></td>
<td><i>Smith</i></td>
<td><a href="#">50</a></td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>"""
table_with_header_column = """<table>
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Age</th>
</tr>
<tr>
<th>Jill</th>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<th>Eve</th>
<td>Jackson</td>
<td>94</td>
</tr>
</table>"""
table_head_body = """<table>
<thead>
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Age</th>
</tr>
</thead>
<tbody>
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</tbody>
</table>"""
table_missing_text = """<table>
<thead>
<tr>
<th></th>
<th>Lastname</th>
<th>Age</th>
</tr>
</thead>
<tbody>
<tr>
<td>Jill</td>
<td></td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</tbody>
</table>"""
table_missing_head = """<table>
<tr>
<td>Firstname</td>
<td>Lastname</td>
<td>Age</td>
</tr>
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>"""
def test_chomp():
assert md(' <b></b> ') == ' '
assert md(' <b> </b> ') == ' '
assert md(' <b> </b> ') == ' '
assert md(' <b> </b> ') == ' '
assert md(' <b>s </b> ') == ' **s** '
assert md(' <b> s</b> ') == ' **s** '
assert md(' <b> s </b> ') == ' **s** '
assert md(' <b> s </b> ') == ' **s** '
def test_a():
assert md('<a href="https://google.com">Google</a>') == '[Google](https://google.com)'
assert md('<a href="https://google.com">https://google.com</a>', autolinks=False) == '[https://google.com](https://google.com)'
assert md('<a href="https://google.com">https://google.com</a>') == '<https://google.com>'
assert md('<a href="https://community.kde.org/Get_Involved">https://community.kde.org/Get_Involved</a>') == '<https://community.kde.org/Get_Involved>'
assert md('<a href="https://community.kde.org/Get_Involved">https://community.kde.org/Get_Involved</a>', autolinks=False) == '[https://community.kde.org/Get\\_Involved](https://community.kde.org/Get_Involved)'
@@ -27,7 +189,6 @@ def test_a_spaces():
def test_a_with_title():
text = md('<a href="http://google.com" title="The &quot;Goog&quot;">Google</a>')
assert text == r'[Google](http://google.com "The \"Goog\"")'
assert md('<a href="https://google.com">https://google.com</a>', default_title=True) == '[https://google.com](https://google.com "https://google.com")'
def test_a_shortcut():
@@ -36,7 +197,8 @@ def test_a_shortcut():
def test_a_no_autolinks():
assert md('<a href="https://google.com">https://google.com</a>', autolinks=False) == '[https://google.com](https://google.com)'
text = md('<a href="http://google.com">http://google.com</a>', autolinks=False)
assert text == '[http://google.com](http://google.com)'
def test_b():
@@ -58,31 +220,58 @@ def test_blockquote_with_paragraph():
assert md('<blockquote>Hello</blockquote><p>handsome</p>') == '\n> Hello\n\nhandsome\n\n'
def test_blockquote_nested():
def test_nested_blockquote():
text = md('<blockquote>And she was like <blockquote>Hello</blockquote></blockquote>')
assert text == '\n> And she was like \n> > Hello\n> \n> \n\n'
def test_br():
assert md('a<br />b<br />c') == 'a \nb \nc'
assert md('a<br />b<br />c', newline_style=BACKSLASH) == 'a\\\nb\\\nc'
def test_em():
assert md('<em>Hello</em>') == '*Hello*'
def test_em_spaces():
assert md('foo <em>Hello</em> bar') == 'foo *Hello* bar'
assert md('foo<em> Hello</em> bar') == 'foo *Hello* bar'
assert md('foo <em>Hello </em>bar') == 'foo *Hello* bar'
assert md('foo <em></em> bar') == 'foo bar'
def inline_tests(tag, markup):
# Basically re-use test_em() and test_em_spaces(),
assert md(f'<{tag}>Hello</{tag}>') == f'{markup}Hello{markup}'
assert md(f'foo <{tag}>Hello</{tag}> bar') == f'foo {markup}Hello{markup} bar'
assert md(f'foo<{tag}> Hello</{tag}> bar') == f'foo {markup}Hello{markup} bar'
assert md(f'foo <{tag}>Hello </{tag}>bar') == f'foo {markup}Hello{markup} bar'
assert md(f'foo <{tag}></{tag}> bar') in ['foo bar', 'foo bar'] # Either is OK
def test_code():
inline_tests('code', '`')
assert md('<code>this_should_not_escape</code>') == '`this_should_not_escape`'
def test_samp():
inline_tests('samp', '`')
def test_kbd():
inline_tests('kbd', '`')
def test_pre():
assert md('<pre>test\n foo\nbar</pre>') == '\n```\ntest\n foo\nbar\n```\n'
assert md('<pre><code>test\n foo\nbar</code></pre>') == '\n```\ntest\n foo\nbar\n```\n'
def test_del():
inline_tests('del', '~~')
def test_div():
assert md('Hello</div> World') == 'Hello World'
def test_em():
inline_tests('em', '*')
def test_s():
inline_tests('s', '~~')
def test_h1():
@@ -95,8 +284,6 @@ def test_h2():
def test_hn():
assert md('<h3>Hello</h3>') == '### Hello\n\n'
assert md('<h4>Hello</h4>') == '#### Hello\n\n'
assert md('<h5>Hello</h5>') == '##### Hello\n\n'
assert md('<h6>Hello</h6>') == '###### Hello\n\n'
@@ -132,28 +319,15 @@ def test_hn_nested_simple_tag():
def test_hn_nested_img():
assert md('<img src="/path/to/img.jpg" alt="Alt text" title="Optional title" />') == '![Alt text](/path/to/img.jpg "Optional title")'
assert md('<img src="/path/to/img.jpg" alt="Alt text" />') == '![Alt text](/path/to/img.jpg)'
image_attributes_to_markdown = [
("", "", ""),
("alt='Alt Text'", "Alt Text", ""),
("alt='Alt Text' title='Optional title'", "Alt Text", " \"Optional title\""),
("", ""),
("alt='Alt Text'", "Alt Text"),
("alt='Alt Text' title='Optional title'", "Alt Text"),
]
for image_attributes, markdown, title in image_attributes_to_markdown:
assert md('<h3>A <img src="/path/to/img.jpg" ' + image_attributes + '/> B</h3>') == '### A ' + markdown + ' B\n\n'
assert md('<h3>A <img src="/path/to/img.jpg" ' + image_attributes + '/> B</h3>', keep_inline_images_in=['h3']) == '### A ![' + markdown + '](/path/to/img.jpg' + title + ') B\n\n'
def test_hn_atx_headings():
assert md('<h1>Hello</h1>', heading_style=ATX) == '# Hello\n\n'
assert md('<h2>Hello</h2>', heading_style=ATX) == '## Hello\n\n'
def test_hn_atx_closed_headings():
assert md('<h1>Hello</h1>', heading_style=ATX_CLOSED) == '# Hello #\n\n'
assert md('<h2>Hello</h2>', heading_style=ATX_CLOSED) == '## Hello ##\n\n'
def test_head():
assert md('<head>head</head>') == 'head'
for image_attributes, markdown in image_attributes_to_markdown:
assert md('<h3>A <img src="/path/to/img.jpg " ' + image_attributes + '/> B</h3>') == '### A ' + markdown + ' B\n\n'
def test_hr():
@@ -162,44 +336,81 @@ def test_hr():
assert md('<p>Hello</p>\n<hr>\n<p>World</p>') == 'Hello\n\n\n\n\n---\n\n\nWorld\n\n'
def test_head():
assert md('<head>head</head>') == 'head'
def test_atx_headings():
assert md('<h1>Hello</h1>', heading_style=ATX) == '# Hello\n\n'
assert md('<h2>Hello</h2>', heading_style=ATX) == '## Hello\n\n'
def test_atx_closed_headings():
assert md('<h1>Hello</h1>', heading_style=ATX_CLOSED) == '# Hello #\n\n'
assert md('<h2>Hello</h2>', heading_style=ATX_CLOSED) == '## Hello ##\n\n'
def test_i():
assert md('<i>Hello</i>') == '*Hello*'
def test_ol():
assert md('<ol><li>a</li><li>b</li></ol>') == '1. a\n2. b\n'
assert md('<ol start="3"><li>a</li><li>b</li></ol>') == '3. a\n4. b\n'
def test_p():
assert md('<p>hello</p>') == 'hello\n\n'
def test_strong():
assert md('<strong>Hello</strong>') == '**Hello**'
def test_ul():
assert md('<ul><li>a</li><li>b</li></ul>') == '* a\n* b\n'
def test_nested_ols():
assert md(nested_ols) == '\n1. 1\n\t1. a\n\t\t1. I\n\t\t2. II\n\t\t3. III\n\t2. b\n\t3. c\n2. 2\n3. 3\n'
def test_inline_ul():
assert md('<p>foo</p><ul><li>a</li><li>b</li></ul><p>bar</p>') == 'foo\n\n* a\n* b\n\nbar\n\n'
def test_nested_uls():
"""
Nested ULs should alternate bullet characters.
"""
assert md(nested_uls) == '\n* 1\n\t+ a\n\t\t- I\n\t\t- II\n\t\t- III\n\t+ b\n\t+ c\n* 2\n* 3\n'
def test_bullets():
assert md(nested_uls, bullets='-') == '\n- 1\n\t- a\n\t\t- I\n\t\t- II\n\t\t- III\n\t- b\n\t- c\n- 2\n- 3\n'
def test_li_text():
assert md('<ul><li>foo <a href="#">bar</a></li><li>foo bar </li><li>foo <b>bar</b> <i>space</i>.</ul>') == '* foo [bar](#)\n* foo bar\n* foo **bar** *space*.\n'
def test_img():
assert md('<img src="/path/to/img.jpg" alt="Alt text" title="Optional title" />') == '![Alt text](/path/to/img.jpg "Optional title")'
assert md('<img src="/path/to/img.jpg" alt="Alt text" />') == '![Alt text](/path/to/img.jpg)'
def test_kbd():
inline_tests('kbd', '`')
def test_div():
assert md('Hello</div> World') == 'Hello World'
def test_p():
assert md('<p>hello</p>') == 'hello\n\n'
assert md('<p>123456789 123456789</p>') == '123456789 123456789\n\n'
assert md('<p>123456789 123456789</p>', wrap=True, wrap_width=10) == '123456789\n123456789\n\n'
assert md('<p><a href="https://example.com">Some long link</a></p>', wrap=True, wrap_width=10) == '[Some long\nlink](https://example.com)\n\n'
assert md('<p>12345<br />67890</p>', wrap=True, wrap_width=10, newline_style=BACKSLASH) == '12345\\\n67890\n\n'
assert md('<p>12345678901<br />12345</p>', wrap=True, wrap_width=10, newline_style=BACKSLASH) == '12345678901\\\n12345\n\n'
def test_pre():
assert md('<pre>test\n foo\nbar</pre>') == '\n```\ntest\n foo\nbar\n```\n'
assert md('<pre><code>test\n foo\nbar</code></pre>') == '\n```\ntest\n foo\nbar\n```\n'
assert md('<pre>this_should_not_escape</pre>') == '\n```\nthis_should_not_escape\n```\n'
def test_s():
inline_tests('s', '~~')
def test_samp():
inline_tests('samp', '`')
def test_strong():
assert md('<strong>Hello</strong>') == '**Hello**'
def test_table():
assert md(table) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_with_html_content) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| **Jill** | *Smith* | [50](#) |\n| Eve | Jackson | 94 |\n\n'
assert md(table_with_header_column) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_head_body) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_missing_text) == '\n\n| | Lastname | Age |\n| --- | --- | --- |\n| Jill | | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_missing_head) == '\n\n| | | |\n| --- | --- | --- |\n| Firstname | Lastname | Age |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
def test_strong_em_symbol():
@@ -209,25 +420,5 @@ def test_strong_em_symbol():
assert md('<i>Hello</i>', strong_em_symbol=UNDERSCORE) == '_Hello_'
def test_sub():
assert md('<sub>foo</sub>') == 'foo'
assert md('<sub>foo</sub>', sub_symbol='~') == '~foo~'
def test_sup():
assert md('<sup>foo</sup>') == 'foo'
assert md('<sup>foo</sup>', sup_symbol='^') == '^foo^'
def test_lang():
assert md('<pre>test\n foo\nbar</pre>', code_language='python') == '\n```python\ntest\n foo\nbar\n```\n'
assert md('<pre><code>test\n foo\nbar</code></pre>', code_language='javascript') == '\n```javascript\ntest\n foo\nbar\n```\n'
def test_lang_callback():
def callback(el):
return el['class'][0] if el.has_attr('class') else None
assert md('<pre class="python">test\n foo\nbar</pre>', code_language_callback=callback) == '\n```python\ntest\n foo\nbar\n```\n'
assert md('<pre class="javascript"><code>test\n foo\nbar</code></pre>', code_language_callback=callback) == '\n```javascript\ntest\n foo\nbar\n```\n'
assert md('<pre class="javascript"><code class="javascript">test\n foo\nbar</code></pre>', code_language_callback=callback) == '\n```javascript\ntest\n foo\nbar\n```\n'
def test_newline_style():
assert md('a<br />b<br />c', newline_style=BACKSLASH) == 'a\\\nb\\\nc'

@@ -1,25 +0,0 @@
from markdownify import MarkdownConverter
from bs4 import BeautifulSoup
class ImageBlockConverter(MarkdownConverter):
    """
    Create a custom MarkdownConverter that adds two newlines after an image
    """
    def convert_img(self, el, text, convert_as_inline):
        return super().convert_img(el, text, convert_as_inline) + '\n\n'
def test_img():
    # Create shorthand method for conversion
    def md(html, **options):
        return ImageBlockConverter(**options).convert(html)
    assert md('<img src="/path/to/img.jpg" alt="Alt text" title="Optional title" />') == '![Alt text](/path/to/img.jpg "Optional title")\n\n'
    assert md('<img src="/path/to/img.jpg" alt="Alt text" />') == '![Alt text](/path/to/img.jpg)\n\n'
def test_soup():
    html = '<b>test</b>'
    soup = BeautifulSoup(html, 'html.parser')
    assert MarkdownConverter().convert_soup(soup) == '**test**'

@@ -1,14 +1,8 @@
from markdownify import markdownify as md
def test_asterisks():
    assert md('*hey*dude*') == r'\*hey\*dude\*'
    assert md('*hey*dude*', escape_asterisks=False) == r'*hey*dude*'
def test_underscore():
    assert md('_hey_dude_') == r'\_hey\_dude\_'
    assert md('_hey_dude_', escape_underscores=False) == r'_hey_dude_'
def test_xml_entities():

@@ -1,81 +0,0 @@
from markdownify import markdownify as md
nested_uls = """
<ul>
<li>1
<ul>
<li>a
<ul>
<li>I</li>
<li>II</li>
<li>III</li>
</ul>
</li>
<li>b</li>
<li>c</li>
</ul>
</li>
<li>2</li>
<li>3</li>
</ul>"""
nested_ols = """
<ol>
<li>1
<ol>
<li>a
<ol>
<li>I</li>
<li>II</li>
<li>III</li>
</ol>
</li>
<li>b</li>
<li>c</li>
</ol>
</li>
<li>2</li>
<li>3</li>
</ol>"""
def test_ol():
    assert md('<ol><li>a</li><li>b</li></ol>') == '1. a\n2. b\n'
    assert md('<ol start="3"><li>a</li><li>b</li></ol>') == '3. a\n4. b\n'
def test_nested_ols():
    assert md(nested_ols) == '\n1. 1\n\t1. a\n\t\t1. I\n\t\t2. II\n\t\t3. III\n\t2. b\n\t3. c\n2. 2\n3. 3\n'
def test_ul():
    assert md('<ul><li>a</li><li>b</li></ul>') == '* a\n* b\n'
    assert md("""<ul>
<li>
a
</li>
<li> b </li>
<li> c
</li>
</ul>""") == '* a\n* b\n* c\n'
def test_inline_ul():
    assert md('<p>foo</p><ul><li>a</li><li>b</li></ul><p>bar</p>') == 'foo\n\n* a\n* b\n\nbar\n\n'
def test_nested_uls():
    """
    Nested ULs should alternate bullet characters.
    """
    assert md(nested_uls) == '\n* 1\n\t+ a\n\t\t- I\n\t\t- II\n\t\t- III\n\t+ b\n\t+ c\n* 2\n* 3\n'
def test_bullets():
    assert md(nested_uls, bullets='-') == '\n- 1\n\t- a\n\t\t- I\n\t\t- II\n\t\t- III\n\t- b\n\t- c\n- 2\n- 3\n'
def test_li_text():
    assert md('<ul><li>foo <a href="#">bar</a></li><li>foo bar </li><li>foo <b>bar</b> <i>space</i>.</ul>') == '* foo [bar](#)\n* foo bar\n* foo **bar** *space*.\n'
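The nested-list expectations above suggest that the bullet character cycles with nesting depth and that each level is indented by one tab. A minimal standalone sketch of that rule follows. It is not markdownify's implementation: `render_list`, the `(text, children)` tuple layout, and the default `'*+-'` bullet string are assumptions chosen only to reproduce the expected strings in `test_nested_uls` and `test_bullets`.

```python
def render_list(items, bullets='*+-', depth=0):
    """Render nested (text, children) pairs as tab-indented Markdown list lines.

    Hypothetical helper: the bullet for a level is picked by cycling through
    `bullets` with the nesting depth.
    """
    bullet = bullets[depth % len(bullets)]
    lines = []
    for text, children in items:
        lines.append('\t' * depth + bullet + ' ' + text)
        # Children are rendered one level deeper, so they get the next bullet.
        lines.extend(render_list(children, bullets, depth + 1))
    return lines


# Same shape as the nested_uls fixture: 1 > (a > (I, II, III), b, c), 2, 3
tree = [
    ('1', [('a', [('I', []), ('II', []), ('III', [])]), ('b', []), ('c', [])]),
    ('2', []),
    ('3', []),
]

print('\n'.join(render_list(tree)))
print('\n'.join(render_list(tree, bullets='-')))
```

Passing a single-character `bullets` string makes every level use the same marker, which is the behaviour `test_bullets` checks.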

@@ -1,171 +0,0 @@
from markdownify import markdownify as md
table = """<table>
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Age</th>
</tr>
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>"""
table_with_html_content = """<table>
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Age</th>
</tr>
<tr>
<td><b>Jill</b></td>
<td><i>Smith</i></td>
<td><a href="#">50</a></td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>"""
table_with_paragraphs = """<table>
<tr>
<th>Firstname</th>
<th><p>Lastname</p></th>
<th>Age</th>
</tr>
<tr>
<td><p>Jill</p></td>
<td><p>Smith</p></td>
<td><p>50</p></td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>"""
table_with_header_column = """<table>
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Age</th>
</tr>
<tr>
<th>Jill</th>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<th>Eve</th>
<td>Jackson</td>
<td>94</td>
</tr>
</table>"""
table_head_body = """<table>
<thead>
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Age</th>
</tr>
</thead>
<tbody>
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</tbody>
</table>"""
table_missing_text = """<table>
<thead>
<tr>
<th></th>
<th>Lastname</th>
<th>Age</th>
</tr>
</thead>
<tbody>
<tr>
<td>Jill</td>
<td></td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</tbody>
</table>"""
table_missing_head = """<table>
<tr>
<td>Firstname</td>
<td>Lastname</td>
<td>Age</td>
</tr>
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>"""
table_body = """<table>
<tbody>
<tr>
<td>Firstname</td>
<td>Lastname</td>
<td>Age</td>
</tr>
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</tbody>
</table>"""
def test_table():
    assert md(table) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
    assert md(table_with_html_content) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| **Jill** | *Smith* | [50](#) |\n| Eve | Jackson | 94 |\n\n'
    assert md(table_with_paragraphs) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
    assert md(table_with_header_column) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
    assert md(table_head_body) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
    assert md(table_missing_text) == '\n\n| | Lastname | Age |\n| --- | --- | --- |\n| Jill | | 50 |\n| Eve | Jackson | 94 |\n\n'
    assert md(table_missing_head) == '\n\n| | | |\n| --- | --- | --- |\n| Firstname | Lastname | Age |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
    assert md(table_body) == '\n\n| | | |\n| --- | --- | --- |\n| Firstname | Lastname | Age |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
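The expected strings in `test_table` all share one shape: a header row, a `---` separator row with one cell per column, the body rows, and two newlines on either side. The sketch below builds that shape from plain lists. `render_table` is a hypothetical helper written for illustration and assumes cells are already plain strings; it does not show how markdownify walks `<tr>`, `<th>`, and `<td>` elements.

```python
def render_table(header, rows):
    """Build a pipe-delimited Markdown table from a header list and row lists.

    Hypothetical helper: mirrors the layout asserted in test_table, nothing more.
    """
    def line(cells):
        return '| ' + ' | '.join(cells) + ' |'

    out = [line(header), line(['---'] * len(header))]
    out.extend(line(row) for row in rows)
    # The tests expect the table to be wrapped in blank lines on both sides.
    return '\n\n' + '\n'.join(out) + '\n\n'


print(render_table(
    ['Firstname', 'Lastname', 'Age'],
    [['Jill', 'Smith', '50'], ['Eve', 'Jackson', '94']],
))
```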

tox.ini

@@ -1,10 +0,0 @@
[tox]
envlist = py38
[testenv]
deps =
    flake8
    pytest
commands =
    flake8 --ignore=E501,W503 markdownify tests
    pytest