Compare commits

57 Commits

Author SHA1 Message Date
AlexVonB
194c646a20 Merge branch 'develop' 2022-08-28 21:43:12 +02:00
AlexVonB
9914474828 bump to v0.11.3 2022-08-28 21:42:46 +02:00
AlexVonB
6263f0e5f0 Switch to tox for tests (#73) 2022-08-28 21:40:52 +02:00
Adam Bambuch
17d8586843 don't escape text in pre tag (Fenced Code Blocks) (#67)
don't escape text in pre tag (Fenced Code Blocks)
2022-08-28 20:58:54 +02:00
AlexVonB
59eb069700 added readme for cli 2022-08-28 20:56:23 +02:00
Daniel J. Perry
e79971a7eb Add console entry point (#72)
* Add console entry point

* Make entry point conform to linter settings.
2022-08-28 20:53:15 +02:00
AlexVonB
2c533339cf Merge branch 'develop' 2022-04-24 11:01:54 +02:00
AlexVonB
5adda130b8 bump to v0.11.2 2022-04-24 11:01:29 +02:00
AlexVonB
5f1b98e25d added wrap option
closes #66
2022-04-24 11:00:04 +02:00
AlexVonB
16acd2b763 typo in readme 2022-04-24 10:59:22 +02:00
AlexVonB
2b8cf444f1 Merge branch 'develop' 2022-04-14 10:25:35 +02:00
AlexVonB
207d0f4ec6 bump to v0.11.1 2022-04-14 10:25:25 +02:00
Mikko Korpela
ebb9ea713d Fix detection of "first row, not headline" (#63)
Improved handling of "first row, not headline".

Works for tables with
1) neither thead nor tbody
2) tbody but no thead
2022-04-14 10:24:32 +02:00
AlexVonB
d375116807 Merge branch 'develop' 2022-04-13 20:47:52 +02:00
AlexVonB
87b9f6c88e bump to v0.11.0 2022-04-13 20:47:30 +02:00
AlexVonB
bda367dad9 Merge branch 'tdgroot-code_language_callback' into develop
closes #64
2022-04-13 20:44:18 +02:00
AlexVonB
61e8940486 added readme for callback 2022-04-13 20:42:38 +02:00
AlexVonB
35479d2d3b Merge branch 'code_language_callback' of https://github.com/tdgroot/python-markdownify into tdgroot-code_language_callback 2022-04-13 20:25:37 +02:00
AlexVonB
b589863715 add escaping of asterisks and option to disable it
closes #62
2022-04-13 20:04:12 +02:00
AlexVonB
423b7e948c add option to allow inline images in selected tags
fixes #61
2022-04-13 19:55:34 +02:00
Timon de Groot
0ea95de4d0 Add code language callback 2022-04-09 13:22:28 +02:00
AlexVonB
ed3eee78d2 fixed readme 2022-01-24 18:18:19 +01:00
AlexVonB
eb0330bfc6 Merge branch 'develop' 2022-01-23 11:01:45 +01:00
AlexVonB
ddda696396 bump to v0.10.3 2022-01-23 11:01:26 +01:00
AlexVonB
0a1343a538 allow BeautifulSoup objects to be converted 2022-01-23 11:00:19 +01:00
AlexVonB
9d0b839b73 wording 2022-01-23 10:59:24 +01:00
AlexVonB
28793ac0b3 Merge branch 'develop' 2022-01-18 08:56:33 +01:00
AlexVonB
d3eff11617 bump to v0.10.2 2022-01-18 08:53:33 +01:00
AlexVonB
bd6b581122 add option to not escape underscores
closes #59
2022-01-18 08:51:44 +01:00
AlexVonB
9231704988 Merge branch 'develop' 2021-12-11 14:44:58 +01:00
AlexVonB
c8f7cf63e3 bump to v0.10.1 2021-12-11 14:44:34 +01:00
AlexVonB
12a68a7d14 allow flake8 v4.x
closes #57
2021-12-11 14:43:14 +01:00
AlexVonB
1613c302bc Merge branch 'develop' 2021-11-17 17:11:01 +01:00
AlexVonB
478b1c7e13 bump to v0.10.0 2021-11-17 17:10:15 +01:00
AlexVonB
ffcf6cbcb2 fix readme for code_language 2021-11-17 17:09:47 +01:00
AlexVonB
0ab0452414 add readme for code_language 2021-11-17 17:08:14 +01:00
AlexVonB
b62b067cbd Merge branch 'Inzaniak-develop' into develop 2021-11-17 17:05:07 +01:00
AlexVonB
cb2646cd93 differentiated between text and code language 2021-11-17 17:03:31 +01:00
AlexVonB
9692b5e714 satisfy linter 2021-11-17 16:55:00 +01:00
Umberto Grando
ac68c53a7d added language for multiline code 2021-11-01 21:19:35 +01:00
AlexVonB
55c9e84f38 Merge branch 'develop' 2021-09-04 21:50:34 +02:00
AlexVonB
40dd30419c bump to v0.9.4 2021-09-04 21:50:05 +02:00
AlexVonB
da56f7f56a Merge pull request #53 from Hozhyi/fix/bullet_list_tags_in_separate_lines
Fixed issue #52 - added stripping of text to list
2021-09-04 21:48:16 +02:00
AlexVonB
8400b39dd9 remove trailing whitespace to satisfy the linter 2021-09-04 21:47:27 +02:00
Viktor Hozhyi
5fc1441fe7 Added appropriate test 2021-09-04 20:51:08 +03:00
Viktor Hozhyi
044615eff1 Fixed issue #52 - added stripping of text to list 2021-09-04 12:39:30 +03:00
AlexVonB
99875683ac Merge branch 'develop' 2021-08-25 08:53:38 +02:00
AlexVonB
dbd9f3f3d2 bump to v0.9.3 2021-08-25 08:53:17 +02:00
AlexVonB
0fdeb1ff6e convert tags inside table cells as inline
in part resolves #49
2021-08-25 08:48:30 +02:00
AlexVonB
eaeb0603eb Merge branch 'develop' 2021-07-11 13:21:20 +02:00
AlexVonB
6a2f3a4b42 fix rst syntax error 2021-07-11 13:21:02 +02:00
AlexVonB
cb73590623 Merge branch 'develop' 2021-07-11 13:14:29 +02:00
AlexVonB
22180a166d bump to v0.9.1 2021-07-11 13:13:31 +02:00
AlexVonB
16d8a0e1f7 Revert "add figure/figcaption"
This reverts commit 828e116530.
2021-07-11 13:12:16 +02:00
AlexVonB
4aa6cf2a24 rewrote text processing to not escape _ in code
fixes #47
2021-07-11 13:10:59 +02:00
AlexVonB
828e116530 add figure/figcaption
for #46
2021-06-30 13:02:42 +02:00
AlexVonB
62e9f0de02 add examples for custom converters
closes #46
2021-06-27 15:53:23 +02:00
13 changed files with 343 additions and 107 deletions


@@ -23,11 +23,7 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install flake8==3.8.4 pytest
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
- name: Lint with flake8
pip install tox
- name: Lint and test
run: |
python setup.py lint
- name: Test with pytest
run: |
python setup.py test
tox

.gitignore

@@ -8,3 +8,5 @@
/MANIFEST
/venv
build/
.vscode/settings.json
.tox/


@@ -32,14 +32,14 @@ Convert some HTML to Markdown:
from markdownify import markdownify as md
md('<b>Yay</b> <a href="http://github.com">GitHub</a>') # > '**Yay** [GitHub](http://github.com)'
Specify tags to exclude (blacklist):
Specify tags to exclude:
.. code:: python
from markdownify import markdownify as md
md('<b>Yay</b> <a href="http://github.com">GitHub</a>', strip=['a']) # > '**Yay** GitHub'
\...or specify the tags you want to include (whitelist):
\...or specify the tags you want to include:
.. code:: python
@@ -53,11 +53,11 @@ Options
Markdownify supports the following options:
strip
A list of tags to strip (blacklist). This option can't be used with the
A list of tags to strip. This option can't be used with the
``convert`` option.
convert
A list of tags to convert (whitelist). This option can't be used with the
A list of tags to convert. This option can't be used with the
``strip`` option.
autolinks
@@ -92,21 +92,98 @@ sub_symbol, sup_symbol
newline_style
Defines the style of marking linebreaks (``<br>``) in markdown. The default
value ``SPACES`` of this option will adopt the usual two spaces and a newline,
while ``BACKSLASH`` will convert a linebreak to ``\\n`` (a backslash an a
while ``BACKSLASH`` will convert a linebreak to ``\\n`` (a backslash and a
newline). While the latter convention is non-standard, it is commonly
preferred and supported by a lot of interpreters.
code_language
Defines the language that should be assumed for all ``<pre>`` sections.
Useful, if all code on a page is in the same programming language and
should be annotated with `````python`` or similar.
Defaults to ``''`` (empty string) and can be any string.
code_language_callback
When the HTML code contains ``pre`` tags that in some way provide the code
language, for example as class, this callback can be used to extract the
language from the tag and prefix it to the converted ``pre`` tag.
The callback gets one single argument, a BeautifulSoup object, and returns
a string containing the code language, or ``None``.
An example to use the class name as code language could be::
def callback(el):
return el['class'][0] if el.has_attr('class') else None
Defaults to ``None``.
escape_asterisks
If set to ``False``, do not escape ``*`` to ``\*`` in text.
Defaults to ``True``.
escape_underscores
If set to ``False``, do not escape ``_`` to ``\_`` in text.
Defaults to ``True``.
keep_inline_images_in
Images are converted to their alt-text when the images are located inside
headlines or table cells. If some inline images should be converted to
markdown images instead, this option can be set to a list of parent tags
that should be allowed to contain inline images, for example ``['td']``.
Defaults to an empty list.
wrap, wrap_width
If ``wrap`` is set to ``True``, all text paragraphs are wrapped at
``wrap_width`` characters. Defaults to ``False`` and ``80``.
Use with ``newline_style=BACKSLASH`` to keep line breaks in paragraphs.
Options may be specified as kwargs to the ``markdownify`` function, or as a
nested ``Options`` class in ``MarkdownConverter`` subclasses.
Converting BeautifulSoup objects
================================
.. code:: python
from markdownify import MarkdownConverter
# Create shorthand method for conversion
def md(soup, **options):
return MarkdownConverter(**options).convert_soup(soup)
Creating Custom Converters
==========================
If you have a special usecase that calls for a special conversion, you can
always inherit from ``MarkdownConverter`` and override the method you want to
change:
.. code:: python
from markdownify import MarkdownConverter
class ImageBlockConverter(MarkdownConverter):
"""
Create a custom MarkdownConverter that adds two newlines after an image
"""
def convert_img(self, el, text, convert_as_inline):
return super().convert_img(el, text, convert_as_inline) + '\n\n'
# Create shorthand method for conversion
def md(html, **options):
return ImageBlockConverter(**options).convert(html)
Command Line Interface
======================
Use ``markdownify example.html > example.md`` or pipe input from stdin
(``cat example.html | markdownify > example.md``).
Call ``markdownify -h`` to see all available options.
They are the same as listed above and take the same arguments.
Development
===========
To run tests:
``python setup.py test``
To lint:
``python setup.py lint``
To run tests and the linter run ``pip install tox`` once, then ``tox``.
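
The precedence between ``code_language`` and ``code_language_callback`` described in the README can be sketched without the library. This is only an illustration: ``FakeTag`` is a hypothetical stand-in for a BeautifulSoup tag, and ``fence`` is an assumed helper name that mirrors the logic of ``convert_pre`` in this diff, not part of the markdownify API.

```python
FENCE = '`' * 3  # the three backticks that open and close a fenced block


class FakeTag:
    """Hypothetical stand-in for a BeautifulSoup tag (illustration only)."""

    def __init__(self, **attrs):
        self.attrs = attrs

    def has_attr(self, name):
        return name in self.attrs

    def __getitem__(self, name):
        return self.attrs[name]


def callback(el):
    # Same callback as in the README: use the first class name, if any.
    return el['class'][0] if el.has_attr('class') else None


def fence(el, text, code_language='', code_language_callback=None):
    # The callback result wins; a None result falls back to code_language.
    if code_language_callback:
        code_language = code_language_callback(el) or code_language
    return '\n%s%s\n%s\n%s\n' % (FENCE, code_language, text, FENCE)
```

A tag with ``class="python"`` yields a fence annotated with ``python``; a tag without a class falls back to whatever ``code_language`` was given.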


@@ -1,4 +1,5 @@
from bs4 import BeautifulSoup, NavigableString, Comment, Doctype
from textwrap import fill
import re
import six
@@ -25,12 +26,6 @@ ASTERISK = '*'
UNDERSCORE = '_'
def escape(text):
if not text:
return ''
return text.replace('_', r'\_')
def chomp(text):
"""
If the text in an inline tag like b, a, or em contains a leading or trailing
@@ -68,14 +63,21 @@ class MarkdownConverter(object):
class DefaultOptions:
autolinks = True
bullets = '*+-' # An iterable of bullet types.
code_language = ''
code_language_callback = None
convert = None
default_title = False
escape_asterisks = True
escape_underscores = True
heading_style = UNDERLINED
keep_inline_images_in = []
newline_style = SPACES
strip = None
strong_em_symbol = ASTERISK
sub_symbol = ''
sup_symbol = ''
wrap = False
wrap_width = 80
class Options(DefaultOptions):
pass
@@ -92,15 +94,21 @@ class MarkdownConverter(object):
def convert(self, html):
soup = BeautifulSoup(html, 'html.parser')
return self.convert_soup(soup)
def convert_soup(self, soup):
return self.process_tag(soup, convert_as_inline=False, children_only=True)
def process_tag(self, node, convert_as_inline, children_only=False):
text = ''
# markdown headings can't include block elements (elements w/newlines)
# markdown headings or cells can't include
# block elements (elements w/newlines)
isHeading = html_heading_re.match(node.name) is not None
isCell = node.name in ['td', 'th']
convert_children_as_inline = convert_as_inline
if not children_only and isHeading:
if not children_only and (isHeading or isCell):
convert_children_as_inline = True
# Remove whitespace-only textnodes in purely nested nodes
@@ -142,22 +150,26 @@ class MarkdownConverter(object):
return text
def process_text(self, el):
text = six.text_type(el)
text = six.text_type(el) or ''
# don't remove any whitespace when handling pre or code in pre
if (el.parent.name == 'pre'
or (el.parent.name == 'code' and el.parent.parent.name == 'pre')):
return escape(text or '')
if not (el.parent.name == 'pre'
or (el.parent.name == 'code'
and el.parent.parent.name == 'pre')):
text = whitespace_re.sub(' ', text)
cleaned_text = escape(whitespace_re.sub(' ', text or ''))
if el.parent.name != 'code' and el.parent.name != 'pre':
text = self.escape(text)
# remove trailing whitespaces if any of the following conditions is true:
# - current text node is the last node in li
# - current text node is followed by an embedded list
if el.parent.name == 'li' and (not el.next_sibling or el.next_sibling.name in ['ul', 'ol']):
return cleaned_text.rstrip()
if (el.parent.name == 'li'
and (not el.next_sibling
or el.next_sibling.name in ['ul', 'ol'])):
text = text.rstrip()
return cleaned_text
return text
def __getattr__(self, attr):
# Handle headings
@@ -185,6 +197,15 @@ class MarkdownConverter(object):
else:
return True
def escape(self, text):
if not text:
return ''
if self.options['escape_asterisks']:
text = text.replace('*', r'\*')
if self.options['escape_underscores']:
text = text.replace('_', r'\_')
return text
def indent(self, text, level):
return line_beginning_re.sub('\t' * level, text) if text else ''
@@ -196,8 +217,6 @@ class MarkdownConverter(object):
prefix, suffix, text = chomp(text)
if not text:
return ''
if convert_as_inline:
return text
href = el.get('href')
title = el.get('title')
# For the replacement see #29: text nodes underscores are escaped
@@ -266,7 +285,8 @@ class MarkdownConverter(object):
src = el.attrs.get('src', None) or ''
title = el.attrs.get('title', None) or ''
title_part = ' "%s"' % title.replace('"', r'\"') if title else ''
if convert_as_inline:
if (convert_as_inline
and el.parent.name not in self.options['keep_inline_images_in']):
return alt
return '![%s](%s%s)' % (alt, src, title_part)
@@ -309,17 +329,27 @@ class MarkdownConverter(object):
el = el.parent
bullets = self.options['bullets']
bullet = bullets[depth % len(bullets)]
return '%s %s\n' % (bullet, text or '')
return '%s %s\n' % (bullet, (text or '').strip())
def convert_p(self, el, text, convert_as_inline):
if convert_as_inline:
return text
if self.options['wrap']:
text = fill(text,
width=self.options['wrap_width'],
break_long_words=False,
break_on_hyphens=False)
return '%s\n\n' % text if text else ''
def convert_pre(self, el, text, convert_as_inline):
if not text:
return ''
return '\n```\n%s\n```\n' % text
code_language = self.options['code_language']
if self.options['code_language_callback']:
code_language = self.options['code_language_callback'](el) or code_language
return '\n```%s\n%s\n```\n' % (code_language, text)
convert_s = convert_del
@@ -348,8 +378,13 @@ class MarkdownConverter(object):
if is_headrow and not el.previous_sibling:
# first row and is headline: print headline underline
underline += '| ' + ' | '.join(['---'] * len(cells)) + ' |' + '\n'
elif not el.previous_sibling and not el.parent.name != 'table':
# first row, not headline, and the parent is sth. like tbody:
elif (not el.previous_sibling
and (el.parent.name == 'table'
or (el.parent.name == 'tbody'
and not el.parent.previous_sibling))):
# first row, not headline, and:
# - the parent is table or
# - the parent is tbody at the beginning of a table.
# print empty headline above this row
overline += '| ' + ' | '.join([''] * len(cells)) + ' |' + '\n'
overline += '| ' + ' | '.join(['---'] * len(cells)) + ' |' + '\n'
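
The new ``escape`` method replaces the module-level ``escape`` function and consults two options. A standalone sketch of the same logic follows, with the options passed as plain keyword arguments (an assumption made here so the snippet runs without a ``MarkdownConverter`` instance):

```python
def escape(text, escape_asterisks=True, escape_underscores=True):
    # Mirrors MarkdownConverter.escape from this diff, minus self.options.
    if not text:
        return ''
    if escape_asterisks:
        text = text.replace('*', r'\*')
    if escape_underscores:
        text = text.replace('_', r'\_')
    return text
```

Both flags default to ``True``, so existing callers keep the old underscore escaping and additionally get asterisk escaping unless they opt out.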

markdownify/main.py Normal file

@@ -0,0 +1,65 @@
#!/usr/bin/env python
import argparse
import sys
from markdownify import markdownify
def main(argv=sys.argv[1:]):
parser = argparse.ArgumentParser(
prog='markdownify',
description='Converts html to markdown.',
)
parser.add_argument('html', nargs='?', type=argparse.FileType('r'),
default=sys.stdin,
help="The html file to convert. Defaults to STDIN if not "
"provided.")
parser.add_argument('-s', '--strip', nargs='*',
help="A list of tags to strip. This option can't be used with "
"the --convert option.")
parser.add_argument('-c', '--convert', nargs='*',
help="A list of tags to convert. This option can't be used with "
"the --strip option.")
parser.add_argument('-a', '--autolinks', action='store_true',
help="A boolean indicating whether the 'automatic link' style "
"should be used when a 'a' tag's contents match its href.")
parser.add_argument('--default-title', action='store_true',
help="A boolean to enable setting the title of a link to its "
"href, if no title is given.")
parser.add_argument('--heading-style',
choices=('ATX', 'ATX_CLOSED', 'SETEXT', 'UNDERLINED'),
help="Defines how headings should be converted.")
parser.add_argument('-b', '--bullets', default='*+-',
help="A string of bullet styles to use; the bullet will "
"alternate based on nesting level.")
parser.add_argument('--sub-symbol', default='',
help="Define the chars that surround '<sub>'.")
parser.add_argument('--sup-symbol', default='',
help="Define the chars that surround '<sup>'.")
parser.add_argument('--code-language', default='',
help="Defines the language that should be assumed for all "
"'<pre>' sections.")
parser.add_argument('--no-escape-asterisks', dest='escape_asterisks',
action='store_false',
help="Do not escape '*' to '\\*' in text.")
parser.add_argument('--no-escape-underscores', dest='escape_underscores',
action='store_false',
help="Do not escape '_' to '\\_' in text.")
parser.add_argument('-i', '--keep-inline-images-in', nargs='*',
help="Images are converted to their alt-text when the images are "
"located inside headlines or table cells. If some inline images "
"should be converted to markdown images instead, this option can "
"be set to a list of parent tags that should be allowed to "
"contain inline images.")
parser.add_argument('-w', '--wrap', action='store_true',
help="Wrap all text paragraphs at --wrap-width characters.")
parser.add_argument('--wrap-width', type=int, default=80)
args = parser.parse_args(argv)
print(markdownify(**vars(args)))
if __name__ == '__main__':
main()
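
The negative flags in the CLI rely on ``dest`` plus ``action='store_false'`` so that the parsed namespace uses the same key names as the library options. A reduced sketch of that pattern is shown below; the parser here is a hypothetical three-option subset and ``build_parser`` is an assumed name, not a function in ``markdownify/main.py``.

```python
import argparse


def build_parser():
    # Hypothetical subset of the CLI, for illustration only.
    parser = argparse.ArgumentParser(prog='demo')
    # --no-escape-asterisks stores False under the key 'escape_asterisks';
    # when the flag is absent, store_false defaults the value to True.
    parser.add_argument('--no-escape-asterisks', dest='escape_asterisks',
                        action='store_false')
    parser.add_argument('-w', '--wrap', action='store_true')
    parser.add_argument('--wrap-width', type=int, default=80)
    return parser


options = vars(build_parser().parse_args(['--no-escape-asterisks', '-w']))
```

Because the keys of ``vars(args)`` match the option names, the resulting dictionary can be passed on as keyword arguments, which is what ``main`` does with ``markdownify(**vars(args))``.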


@@ -1,2 +0,0 @@
[flake8]
ignore = E501 W503


@@ -2,7 +2,6 @@
import codecs
import os
from setuptools import setup, find_packages
from setuptools.command.test import test as TestCommand, Command
read = lambda filepath: codecs.open(filepath, 'r', 'utf-8').read()
@@ -10,52 +9,10 @@ read = lambda filepath: codecs.open(filepath, 'r', 'utf-8').read()
pkgmeta = {
'__title__': 'markdownify',
'__author__': 'Matthew Tretter',
'__version__': '0.9.0',
'__version__': '0.11.3',
}
class PyTest(TestCommand):
def finalize_options(self):
TestCommand.finalize_options(self)
self.test_args = ['tests', '-s']
self.test_suite = True
def run_tests(self):
import pytest
errno = pytest.main(self.test_args)
raise SystemExit(errno)
class LintCommand(Command):
"""
A copy of flake8's Flake8Command
"""
description = "Run flake8 on modules registered in setuptools"
user_options = []
def initialize_options(self):
pass
def finalize_options(self):
pass
def distribution_files(self):
if self.distribution.packages:
for package in self.distribution.packages:
yield package.replace(".", os.path.sep)
if self.distribution.py_modules:
for filename in self.distribution.py_modules:
yield "%s.py" % filename
def run(self):
from flake8.api.legacy import get_style_guide
flake8_style = get_style_guide(config_file='setup.cfg')
paths = self.distribution_files()
report = flake8_style.check_files(paths)
raise SystemExit(report.total_errors > 0)
read = lambda filepath: codecs.open(filepath, 'r', 'utf-8').read()
setup(
name='markdownify',
@@ -69,14 +26,9 @@ setup(
packages=find_packages(),
zip_safe=False,
include_package_data=True,
setup_requires=[
'flake8>=3.8,<4',
],
tests_require=[
'pytest>=6.2,<7',
],
install_requires=[
'beautifulsoup4>=4.9,<5', 'six>=1.15,<2'
'beautifulsoup4>=4.9,<5',
'six>=1.15,<2',
],
classifiers=[
'Environment :: Web Environment',
@@ -92,8 +44,9 @@ setup(
'Programming Language :: Python :: 3.8',
'Topic :: Utilities'
],
cmdclass={
'test': PyTest,
'lint': LintCommand,
},
entry_points={
'console_scripts': [
'markdownify = markdownify.main:main'
]
}
)


@@ -70,6 +70,7 @@ def test_br():
def test_code():
inline_tests('code', '`')
assert md('<code>this_should_not_escape</code>') == '`this_should_not_escape`'
def test_del():
@@ -131,15 +132,14 @@ def test_hn_nested_simple_tag():
def test_hn_nested_img():
assert md('<img src="/path/to/img.jpg" alt="Alt text" title="Optional title" />') == '![Alt text](/path/to/img.jpg "Optional title")'
assert md('<img src="/path/to/img.jpg" alt="Alt text" />') == '![Alt text](/path/to/img.jpg)'
image_attributes_to_markdown = [
("", ""),
("alt='Alt Text'", "Alt Text"),
("alt='Alt Text' title='Optional title'", "Alt Text"),
("", "", ""),
("alt='Alt Text'", "Alt Text", ""),
("alt='Alt Text' title='Optional title'", "Alt Text", " \"Optional title\""),
]
for image_attributes, markdown in image_attributes_to_markdown:
assert md('<h3>A <img src="/path/to/img.jpg " ' + image_attributes + '/> B</h3>') == '### A ' + markdown + ' B\n\n'
for image_attributes, markdown, title in image_attributes_to_markdown:
assert md('<h3>A <img src="/path/to/img.jpg" ' + image_attributes + '/> B</h3>') == '### A ' + markdown + ' B\n\n'
assert md('<h3>A <img src="/path/to/img.jpg" ' + image_attributes + '/> B</h3>', keep_inline_images_in=['h3']) == '### A ![' + markdown + '](/path/to/img.jpg' + title + ') B\n\n'
def test_hn_atx_headings():
@@ -177,11 +177,17 @@ def test_kbd():
def test_p():
assert md('<p>hello</p>') == 'hello\n\n'
assert md('<p>123456789 123456789</p>') == '123456789 123456789\n\n'
assert md('<p>123456789 123456789</p>', wrap=True, wrap_width=10) == '123456789\n123456789\n\n'
assert md('<p><a href="https://example.com">Some long link</a></p>', wrap=True, wrap_width=10) == '[Some long\nlink](https://example.com)\n\n'
assert md('<p>12345<br />67890</p>', wrap=True, wrap_width=10, newline_style=BACKSLASH) == '12345\\\n67890\n\n'
assert md('<p>12345678901<br />12345</p>', wrap=True, wrap_width=10, newline_style=BACKSLASH) == '12345678901\\\n12345\n\n'
def test_pre():
assert md('<pre>test\n foo\nbar</pre>') == '\n```\ntest\n foo\nbar\n```\n'
assert md('<pre><code>test\n foo\nbar</code></pre>') == '\n```\ntest\n foo\nbar\n```\n'
assert md('<pre>this_should_not_escape</pre>') == '\n```\nthis_should_not_escape\n```\n'
def test_s():
@@ -211,3 +217,17 @@ def test_sub():
def test_sup():
assert md('<sup>foo</sup>') == 'foo'
assert md('<sup>foo</sup>', sup_symbol='^') == '^foo^'
def test_lang():
assert md('<pre>test\n foo\nbar</pre>', code_language='python') == '\n```python\ntest\n foo\nbar\n```\n'
assert md('<pre><code>test\n foo\nbar</code></pre>', code_language='javascript') == '\n```javascript\ntest\n foo\nbar\n```\n'
def test_lang_callback():
def callback(el):
return el['class'][0] if el.has_attr('class') else None
assert md('<pre class="python">test\n foo\nbar</pre>', code_language_callback=callback) == '\n```python\ntest\n foo\nbar\n```\n'
assert md('<pre class="javascript"><code>test\n foo\nbar</code></pre>', code_language_callback=callback) == '\n```javascript\ntest\n foo\nbar\n```\n'
assert md('<pre class="javascript"><code class="javascript">test\n foo\nbar</code></pre>', code_language_callback=callback) == '\n```javascript\ntest\n foo\nbar\n```\n'
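
The ``wrap`` cases in ``test_p`` depend on how ``textwrap.fill`` behaves with ``break_long_words=False`` and ``break_on_hyphens=False``. The sketch below reproduces only the paragraph step from ``convert_p`` on already-converted text; ``wrap_paragraph`` is an assumed name for illustration.

```python
from textwrap import fill


def wrap_paragraph(text, wrap=False, wrap_width=80):
    # Same wrapping call as convert_p in this diff.
    if wrap:
        text = fill(text,
                    width=wrap_width,
                    break_long_words=False,
                    break_on_hyphens=False)
    return '%s\n\n' % text if text else ''
```

With ``break_long_words=False`` a token longer than ``wrap_width``, such as the tail of a Markdown link, is kept whole on its own line rather than split, which is why the link test expects the URL to stay intact.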


@@ -0,0 +1,25 @@
from markdownify import MarkdownConverter
from bs4 import BeautifulSoup
class ImageBlockConverter(MarkdownConverter):
"""
Create a custom MarkdownConverter that adds two newlines after an image
"""
def convert_img(self, el, text, convert_as_inline):
return super().convert_img(el, text, convert_as_inline) + '\n\n'
def test_img():
# Create shorthand method for conversion
def md(html, **options):
return ImageBlockConverter(**options).convert(html)
assert md('<img src="/path/to/img.jpg" alt="Alt text" title="Optional title" />') == '![Alt text](/path/to/img.jpg "Optional title")\n\n'
assert md('<img src="/path/to/img.jpg" alt="Alt text" />') == '![Alt text](/path/to/img.jpg)\n\n'
def test_soup():
html = '<b>test</b>'
soup = BeautifulSoup(html, 'html.parser')
assert MarkdownConverter().convert_soup(soup) == '**test**'


@@ -1,8 +1,14 @@
from markdownify import markdownify as md
def test_asterisks():
assert md('*hey*dude*') == r'\*hey\*dude\*'
assert md('*hey*dude*', escape_asterisks=False) == r'*hey*dude*'
def test_underscore():
assert md('_hey_dude_') == r'\_hey\_dude\_'
assert md('_hey_dude_', escape_underscores=False) == r'_hey_dude_'
def test_xml_entities():


@@ -51,6 +51,14 @@ def test_nested_ols():
def test_ul():
assert md('<ul><li>a</li><li>b</li></ul>') == '* a\n* b\n'
assert md("""<ul>
<li>
a
</li>
<li> b </li>
<li> c
</li>
</ul>""") == '* a\n* b\n* c\n'
def test_inline_ul():


@@ -39,6 +39,25 @@ table_with_html_content = """<table>
</table>"""
table_with_paragraphs = """<table>
<tr>
<th>Firstname</th>
<th><p>Lastname</p></th>
<th>Age</th>
</tr>
<tr>
<td><p>Jill</p></td>
<td><p>Smith</p></td>
<td><p>50</p></td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>"""
table_with_header_column = """<table>
<tr>
<th>Firstname</th>
@@ -120,11 +139,33 @@ table_missing_head = """<table>
</tr>
</table>"""
table_body = """<table>
<tbody>
<tr>
<td>Firstname</td>
<td>Lastname</td>
<td>Age</td>
</tr>
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</tbody>
</table>"""
def test_table():
assert md(table) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_with_html_content) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| **Jill** | *Smith* | [50](#) |\n| Eve | Jackson | 94 |\n\n'
assert md(table_with_paragraphs) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_with_header_column) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_head_body) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_missing_text) == '\n\n| | Lastname | Age |\n| --- | --- | --- |\n| Jill | | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_missing_head) == '\n\n| | | |\n| --- | --- | --- |\n| Firstname | Lastname | Age |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_body) == '\n\n| | | |\n| --- | --- | --- |\n| Firstname | Lastname | Age |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
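
The ``table_missing_head`` and ``table_body`` expectations follow from the "first row, not headline" branch, which prints an empty header row above the data. A simplified sketch of that row-building logic, operating on plain lists of cell strings instead of HTML, could look like this; ``rows_to_markdown`` and its ``first_row_is_header`` parameter are assumptions for illustration only.

```python
def rows_to_markdown(rows, first_row_is_header):
    lines = []
    for index, cells in enumerate(rows):
        row = '| ' + ' | '.join(cells) + ' |'
        separator = '| ' + ' | '.join(['---'] * len(cells)) + ' |'
        if index == 0 and first_row_is_header:
            # headline row: underline it
            lines += [row, separator]
        elif index == 0:
            # first row is data: emit an empty headline above it
            empty = '| ' + ' | '.join([''] * len(cells)) + ' |'
            lines += [empty, separator, row]
        else:
            lines.append(row)
    return '\n'.join(lines) + '\n'
```

The real converter decides which branch applies by inspecting the row's cells, siblings and parent tag (``table``, or a ``tbody`` at the start of the table); the sketch takes that decision as an input.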

tox.ini Normal file

@@ -0,0 +1,10 @@
[tox]
envlist = py38
[testenv]
deps =
flake8
pytest
commands =
flake8 --ignore=E501,W503 markdownify tests
pytest