Compare commits

..

1 Commits

Author SHA1 Message Date
AlexVonB
ae50065872 Merge branch 'develop' 2020-08-18 18:53:10 +02:00
19 changed files with 248 additions and 1209 deletions

View File

@@ -16,17 +16,18 @@ jobs:
steps:
- uses: actions/checkout@v2
- name: Set up Python 3.8
- name: Set up Python 3.6
uses: actions/setup-python@v2
with:
python-version: 3.8
python-version: 3.6
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install --upgrade setuptools setuptools_scm wheel build tox
- name: Lint and test
pip install flake8==2.5.4 pytest
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
- name: Lint with flake8
run: |
tox
- name: Build
python setup.py lint
- name: Test with pytest
run: |
python -m build -nwsx .
python setup.py test

View File

@@ -17,15 +17,15 @@ jobs:
- name: Set up Python
uses: actions/setup-python@v2
with:
python-version: '3.8'
python-version: '3.x'
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install --upgrade setuptools setuptools_scm wheel build twine
pip install setuptools wheel twine
- name: Build and publish
env:
TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
run: |
python -m build -nwsx .
python setup.py sdist bdist_wheel
twine upload dist/*

2
.gitignore vendored
View File

@@ -8,5 +8,3 @@
/MANIFEST
/venv
build/
.vscode/settings.json
.tox/

View File

@@ -1,2 +1 @@
include README.rst
prune tests

View File

@@ -1,21 +1,3 @@
|build| |version| |license| |downloads|
.. |build| image:: https://img.shields.io/github/actions/workflow/status/matthewwithanm/python-markdownify/python-app.yml?branch=develop
:alt: GitHub Workflow Status
:target: https://github.com/matthewwithanm/python-markdownify/actions/workflows/python-app.yml?query=workflow%3A%22Python+application%22
.. |version| image:: https://img.shields.io/pypi/v/markdownify
:alt: Pypi version
:target: https://pypi.org/project/markdownify/
.. |license| image:: https://img.shields.io/pypi/l/markdownify
:alt: License
:target: https://github.com/matthewwithanm/python-markdownify/blob/develop/LICENSE
.. |downloads| image:: https://pepy.tech/badge/markdownify
:alt: Pypi Downloads
:target: https://pepy.tech/project/markdownify
Installation
============
@@ -32,14 +14,14 @@ Convert some HTML to Markdown:
from markdownify import markdownify as md
md('<b>Yay</b> <a href="http://github.com">GitHub</a>') # > '**Yay** [GitHub](http://github.com)'
Specify tags to exclude:
Specify tags to exclude (blacklist):
.. code:: python
from markdownify import markdownify as md
md('<b>Yay</b> <a href="http://github.com">GitHub</a>', strip=['a']) # > '**Yay** GitHub'
\...or specify the tags you want to include:
\...or specify the tags you want to include (whitelist):
.. code:: python
@@ -53,20 +35,16 @@ Options
Markdownify supports the following options:
strip
A list of tags to strip. This option can't be used with the
A list of tags to strip (blacklist). This option can't be used with the
``convert`` option.
convert
A list of tags to convert. This option can't be used with the
A list of tags to convert (whitelist). This option can't be used with the
``strip`` option.
autolinks
A boolean indicating whether the "automatic link" style should be used when
a ``a`` tag's contents match its href. Defaults to ``True``.
default_title
A boolean to enable setting the title of a link to its href, if no title is
given. Defaults to ``False``.
a ``a`` tag's contents match its href. Defaults to ``True``
heading_style
Defines how headings should be converted. Accepted values are ``ATX``,
@@ -79,140 +57,17 @@ bullets
lists are nested. Otherwise, the bullet will alternate based on nesting
level. Defaults to ``'*+-'``.
strong_em_symbol
In markdown, both ``*`` and ``_`` are used to encode **strong** or
*emphasized* texts. Either of these symbols can be chosen by the options
``ASTERISK`` (default) or ``UNDERSCORE`` respectively.
sub_symbol, sup_symbol
Define the chars that surround ``<sub>`` and ``<sup>`` text. Defaults to an
empty string, because this is non-standard behavior. Could be something like
``~`` and ``^`` to result in ``~sub~`` and ``^sup^``. If the value starts
with ``<`` and ends with ``>``, it is treated as an HTML tag and a ``/`` is
inserted after the ``<`` in the string used after the text; this allows
specifying ``<sub>`` to use raw HTML in the output for subscripts, for
example.
newline_style
Defines the style of marking linebreaks (``<br>``) in markdown. The default
value ``SPACES`` of this option will adopt the usual two spaces and a newline,
while ``BACKSLASH`` will convert a linebreak to ``\\n`` (a backslash and a
newline). While the latter convention is non-standard, it is commonly
preferred and supported by a lot of interpreters.
code_language
Defines the language that should be assumed for all ``<pre>`` sections.
Useful, if all code on a page is in the same programming language and
should be annotated with `````python`` or similar.
Defaults to ``''`` (empty string) and can be any string.
code_language_callback
When the HTML code contains ``pre`` tags that in some way provide the code
language, for example as class, this callback can be used to extract the
language from the tag and prefix it to the converted ``pre`` tag.
The callback gets one single argument, an BeautifylSoup object, and returns
a string containing the code language, or ``None``.
An example to use the class name as code language could be::
def callback(el):
return el['class'][0] if el.has_attr('class') else None
Defaults to ``None``.
escape_asterisks
If set to ``False``, do not escape ``*`` to ``\*`` in text.
Defaults to ``True``.
escape_underscores
If set to ``False``, do not escape ``_`` to ``\_`` in text.
Defaults to ``True``.
escape_misc
If set to ``False``, do not escape miscellaneous punctuation characters
that sometimes have Markdown significance in text.
Defaults to ``True``.
keep_inline_images_in
Images are converted to their alt-text when the images are located inside
headlines or table cells. If some inline images should be converted to
markdown images instead, this option can be set to a list of parent tags
that should be allowed to contain inline images, for example ``['td']``.
Defaults to an empty list.
wrap, wrap_width
If ``wrap`` is set to ``True``, all text paragraphs are wrapped at
``wrap_width`` characters. Defaults to ``False`` and ``80``.
Use with ``newline_style=BACKSLASH`` to keep line breaks in paragraphs.
Options may be specified as kwargs to the ``markdownify`` function, or as a
nested ``Options`` class in ``MarkdownConverter`` subclasses.
Converting BeautifulSoup objects
================================
.. code:: python
from markdownify import MarkdownConverter
# Create shorthand method for conversion
def md(soup, **options):
return MarkdownConverter(**options).convert_soup(soup)
Creating Custom Converters
==========================
If you have a special usecase that calls for a special conversion, you can
always inherit from ``MarkdownConverter`` and override the method you want to
change.
The function that handles a HTML tag named ``abc`` is called
``convert_abc(self, el, text, convert_as_inline)`` and returns a string
containing the converted HTML tag.
The ``MarkdownConverter`` object will handle the conversion based on the
function names:
.. code:: python
from markdownify import MarkdownConverter
class ImageBlockConverter(MarkdownConverter):
"""
Create a custom MarkdownConverter that adds two newlines after an image
"""
def convert_img(self, el, text, convert_as_inline):
return super().convert_img(el, text, convert_as_inline) + '\n\n'
# Create shorthand method for conversion
def md(html, **options):
return ImageBlockConverter(**options).convert(html)
.. code:: python
from markdownify import MarkdownConverter
class IgnoreParagraphsConverter(MarkdownConverter):
"""
Create a custom MarkdownConverter that ignores paragraphs
"""
def convert_p(self, el, text, convert_as_inline):
return ''
# Create shorthand method for conversion
def md(html, **options):
return IgnoreParagraphsConverter(**options).convert(html)
Command Line Interface
======================
Use ``markdownify example.html > example.md`` or pipe input from stdin
(``cat example.html | markdownify > example.md``).
Call ``markdownify -h`` to see all available options.
They are the same as listed above and take the same arguments.
Development
===========
To run tests and the linter run ``pip install tox`` once, then ``tox``.
To run tests:
``python setup.py test``
To lint:
``python setup.py lint``

View File

@@ -1,14 +1,13 @@
from bs4 import BeautifulSoup, NavigableString, Comment, Doctype
from textwrap import fill
from bs4 import BeautifulSoup, NavigableString
import re
import six
convert_heading_re = re.compile(r'convert_h(\d+)')
line_beginning_re = re.compile(r'^', re.MULTILINE)
whitespace_re = re.compile(r'[\t ]+')
all_whitespace_re = re.compile(r'[\s]+')
html_heading_re = re.compile(r'h[1-6]')
whitespace_re = re.compile(r'[\r\n\s\t ]+')
FRAGMENT_ID = '__MARKDOWNIFY_WRAPPER__'
wrapped = '<div id="%s">%%s</div>' % FRAGMENT_ID
# Heading styles
@@ -17,13 +16,11 @@ ATX_CLOSED = 'atx_closed'
UNDERLINED = 'underlined'
SETEXT = UNDERLINED
# Newline style
SPACES = 'spaces'
BACKSLASH = 'backslash'
# Strong and emphasis style
ASTERISK = '*'
UNDERSCORE = '_'
def escape(text):
if not text:
return ''
return text.replace('_', r'\_')
def chomp(text):
@@ -39,53 +36,17 @@ def chomp(text):
return (prefix, suffix, text)
def abstract_inline_conversion(markup_fn):
"""
This abstracts all simple inline tags like b, em, del, ...
Returns a function that wraps the chomped text in a pair of the string
that is returned by markup_fn, with '/' inserted in the string used after
the text if it looks like an HTML tag. markup_fn is necessary to allow for
references to self.strong_em_symbol etc.
"""
def implementation(self, el, text, convert_as_inline):
markup_prefix = markup_fn(self)
if markup_prefix.startswith('<') and markup_prefix.endswith('>'):
markup_suffix = '</' + markup_prefix[1:]
else:
markup_suffix = markup_prefix
if el.find_parent(['pre', 'code', 'kbd', 'samp']):
return text
prefix, suffix, text = chomp(text)
if not text:
return ''
return '%s%s%s%s%s' % (prefix, markup_prefix, text, markup_suffix, suffix)
return implementation
def _todict(obj):
return dict((k, getattr(obj, k)) for k in dir(obj) if not k.startswith('_'))
class MarkdownConverter(object):
class DefaultOptions:
autolinks = True
bullets = '*+-' # An iterable of bullet types.
code_language = ''
code_language_callback = None
convert = None
default_title = False
escape_asterisks = True
escape_underscores = True
escape_misc = True
heading_style = UNDERLINED
keep_inline_images_in = []
newline_style = SPACES
strip = None
strong_em_symbol = ASTERISK
sub_symbol = ''
sup_symbol = ''
wrap = False
wrap_width = 80
convert = None
autolinks = True
heading_style = UNDERLINED
bullets = '*+-' # An iterable of bullet types.
class Options(DefaultOptions):
pass
@@ -101,82 +62,32 @@ class MarkdownConverter(object):
' convert, but not both.')
def convert(self, html):
# We want to take advantage of the html5 parsing, but we don't actually
# want a full document. Therefore, we'll mark our fragment with an id,
# create the document, and extract the element with the id.
html = wrapped % html
soup = BeautifulSoup(html, 'html.parser')
return self.convert_soup(soup)
return self.process_tag(soup.find(id=FRAGMENT_ID), children_only=True)
def convert_soup(self, soup):
return self.process_tag(soup, convert_as_inline=False, children_only=True)
def process_tag(self, node, convert_as_inline, children_only=False):
def process_tag(self, node, children_only=False):
text = ''
# markdown headings or cells can't include
# block elements (elements w/newlines)
isHeading = html_heading_re.match(node.name) is not None
isCell = node.name in ['td', 'th']
convert_children_as_inline = convert_as_inline
if not children_only and (isHeading or isCell):
convert_children_as_inline = True
# Remove whitespace-only textnodes in purely nested nodes
def is_nested_node(el):
return el and el.name in ['ol', 'ul', 'li',
'table', 'thead', 'tbody', 'tfoot',
'tr', 'td', 'th']
if is_nested_node(node):
for el in node.children:
# Only extract (remove) whitespace-only text node if any of the
# conditions is true:
# - el is the first element in its parent
# - el is the last element in its parent
# - el is adjacent to an nested node
can_extract = (not el.previous_sibling
or not el.next_sibling
or is_nested_node(el.previous_sibling)
or is_nested_node(el.next_sibling))
if (isinstance(el, NavigableString)
and six.text_type(el).strip() == ''
and can_extract):
el.extract()
# Convert the children first
for el in node.children:
if isinstance(el, Comment) or isinstance(el, Doctype):
continue
elif isinstance(el, NavigableString):
text += self.process_text(el)
if isinstance(el, NavigableString):
text += self.process_text(six.text_type(el))
else:
text += self.process_tag(el, convert_children_as_inline)
text += self.process_tag(el)
if not children_only:
convert_fn = getattr(self, 'convert_%s' % node.name, None)
if convert_fn and self.should_convert_tag(node.name):
text = convert_fn(node, text, convert_as_inline)
text = convert_fn(node, text)
return text
def process_text(self, el):
text = six.text_type(el) or ''
# normalize whitespace if we're not inside a preformatted element
if not el.find_parent('pre'):
text = whitespace_re.sub(' ', text)
# escape special characters if we're not inside a preformatted or code element
if not el.find_parent(['pre', 'code', 'kbd', 'samp']):
text = self.escape(text)
# remove trailing whitespaces if any of the following condition is true:
# - current text node is the last node in li
# - current text node is followed by an embedded list
if (el.parent.name == 'li'
and (not el.next_sibling
or el.next_sibling.name in ['ul', 'ol'])):
text = text.rstrip()
return text
def process_text(self, text):
return escape(whitespace_re.sub(' ', text or ''))
def __getattr__(self, attr):
# Handle headings
@@ -184,8 +95,8 @@ class MarkdownConverter(object):
if m:
n = int(m.group(1))
def convert_tag(el, text, convert_as_inline):
return self.convert_hn(n, el, text, convert_as_inline)
def convert_tag(el, text):
return self.convert_hn(n, el, text)
convert_tag.__name__ = 'convert_h%s' % n
setattr(self, convert_tag.__name__, convert_tag)
@@ -204,18 +115,6 @@ class MarkdownConverter(object):
else:
return True
def escape(self, text):
if not text:
return ''
if self.options['escape_misc']:
text = re.sub(r'([\\&<`[>~#=+|-])', r'\\\1', text)
text = re.sub(r'([0-9])([.)])', r'\1\\\2', text)
if self.options['escape_asterisks']:
text = text.replace('*', r'\*')
if self.options['escape_underscores']:
text = text.replace('_', r'\_')
return text
def indent(self, text, level):
return line_beginning_re.sub('\t' * level, text) if text else ''
@@ -223,60 +122,36 @@ class MarkdownConverter(object):
text = (text or '').rstrip()
return '%s\n%s\n\n' % (text, pad_char * len(text)) if text else ''
def convert_a(self, el, text, convert_as_inline):
def convert_a(self, el, text):
prefix, suffix, text = chomp(text)
if not text:
return ''
href = el.get('href')
title = el.get('title')
# For the replacement see #29: text nodes underscores are escaped
if (self.options['autolinks']
and text.replace(r'\_', '_') == href
and not title
and not self.options['default_title']):
if self.options['autolinks'] and text == href and not title:
# Shortcut syntax
return '<%s>' % href
if self.options['default_title'] and not title:
title = href
title_part = ' "%s"' % title.replace('"', r'\"') if title else ''
return '%s[%s](%s%s)%s' % (prefix, text, href, title_part, suffix) if href else text
convert_b = abstract_inline_conversion(lambda self: 2 * self.options['strong_em_symbol'])
def convert_b(self, el, text):
return self.convert_strong(el, text)
def convert_blockquote(self, el, text, convert_as_inline):
def convert_blockquote(self, el, text):
return '\n' + line_beginning_re.sub('> ', text) if text else ''
if convert_as_inline:
return text
def convert_br(self, el, text):
return ' \n'
return '\n' + (line_beginning_re.sub('> ', text.strip()) + '\n\n') if text else ''
def convert_em(self, el, text):
prefix, suffix, text = chomp(text)
if not text:
return ''
return '%s*%s*%s' % (prefix, text, suffix)
def convert_br(self, el, text, convert_as_inline):
if convert_as_inline:
return ""
if self.options['newline_style'].lower() == BACKSLASH:
return '\\\n'
else:
return ' \n'
def convert_code(self, el, text, convert_as_inline):
if el.parent.name == 'pre':
return text
converter = abstract_inline_conversion(lambda self: '`')
return converter(self, el, text, convert_as_inline)
convert_del = abstract_inline_conversion(lambda self: '~~')
convert_em = abstract_inline_conversion(lambda self: self.options['strong_em_symbol'])
convert_kbd = convert_code
def convert_hn(self, n, el, text, convert_as_inline):
if convert_as_inline:
return text
style = self.options['heading_style'].lower()
text = text.strip()
def convert_hn(self, n, el, text):
style = self.options['heading_style']
text = text.rstrip()
if style == UNDERLINED and n <= 2:
line = '=' if n == 1 else '-'
return self.underline(text, line)
@@ -285,31 +160,11 @@ class MarkdownConverter(object):
return '%s %s %s\n\n' % (hashes, text, hashes)
return '%s %s\n\n' % (hashes, text)
def convert_hr(self, el, text, convert_as_inline):
return '\n\n---\n\n'
convert_i = convert_em
def convert_img(self, el, text, convert_as_inline):
alt = el.attrs.get('alt', None) or ''
src = el.attrs.get('src', None) or ''
title = el.attrs.get('title', None) or ''
title_part = ' "%s"' % title.replace('"', r'\"') if title else ''
if (convert_as_inline
and el.parent.name not in self.options['keep_inline_images_in']):
return alt
return '![%s](%s%s)' % (alt, src, title_part)
def convert_list(self, el, text, convert_as_inline):
# Converting a list to inline is undefined.
# Ignoring convert_to_inline for list.
def convert_i(self, el, text):
return self.convert_em(el, text)
def convert_list(self, el, text):
nested = False
before_paragraph = False
if el.next_sibling and el.next_sibling.name not in ['ul', 'ol']:
before_paragraph = True
while el:
if el.name == 'li':
nested = True
@@ -318,15 +173,15 @@ class MarkdownConverter(object):
if nested:
# remove trailing newline if nested
return '\n' + self.indent(text, 1).rstrip()
return text + ('\n' if before_paragraph else '')
return '\n' + text + '\n'
convert_ul = convert_list
convert_ol = convert_list
def convert_li(self, el, text, convert_as_inline):
def convert_li(self, el, text):
parent = el.parent
if parent is not None and parent.name == 'ol':
if parent.get("start") and str(parent.get("start")).isnumeric():
if parent.get("start"):
start = int(parent.get("start"))
else:
start = 1
@@ -339,94 +194,23 @@ class MarkdownConverter(object):
el = el.parent
bullets = self.options['bullets']
bullet = bullets[depth % len(bullets)]
return '%s %s\n' % (bullet, (text or '').strip())
return '%s %s\n' % (bullet, text or '')
def convert_p(self, el, text, convert_as_inline):
if convert_as_inline:
return text
if self.options['wrap']:
text = fill(text,
width=self.options['wrap_width'],
break_long_words=False,
break_on_hyphens=False)
def convert_p(self, el, text):
return '%s\n\n' % text if text else ''
def convert_pre(self, el, text, convert_as_inline):
def convert_strong(self, el, text):
prefix, suffix, text = chomp(text)
if not text:
return ''
code_language = self.options['code_language']
return '%s**%s**%s' % (prefix, text, suffix)
if self.options['code_language_callback']:
code_language = self.options['code_language_callback'](el) or code_language
return '\n```%s\n%s\n```\n' % (code_language, text)
def convert_script(self, el, text, convert_as_inline):
return ''
def convert_style(self, el, text, convert_as_inline):
return ''
convert_s = convert_del
convert_strong = convert_b
convert_samp = convert_code
convert_sub = abstract_inline_conversion(lambda self: self.options['sub_symbol'])
convert_sup = abstract_inline_conversion(lambda self: self.options['sup_symbol'])
def convert_table(self, el, text, convert_as_inline):
return '\n\n' + text + '\n'
def convert_caption(self, el, text, convert_as_inline):
return text + '\n'
def convert_figcaption(self, el, text, convert_as_inline):
return '\n\n' + text + '\n\n'
def convert_td(self, el, text, convert_as_inline):
colspan = 1
if 'colspan' in el.attrs and el['colspan'].isdigit():
colspan = int(el['colspan'])
return ' ' + text.strip().replace("\n", " ") + ' |' * colspan
def convert_th(self, el, text, convert_as_inline):
colspan = 1
if 'colspan' in el.attrs and el['colspan'].isdigit():
colspan = int(el['colspan'])
return ' ' + text.strip().replace("\n", " ") + ' |' * colspan
def convert_tr(self, el, text, convert_as_inline):
cells = el.find_all(['td', 'th'])
is_headrow = (
all([cell.name == 'th' for cell in cells])
or (not el.previous_sibling and not el.parent.name == 'tbody')
or (not el.previous_sibling and el.parent.name == 'tbody' and len(el.parent.parent.find_all(['thead'])) < 1)
)
overline = ''
underline = ''
if is_headrow and not el.previous_sibling:
# first row and is headline: print headline underline
full_colspan = 0
for cell in cells:
if 'colspan' in cell.attrs and cell['colspan'].isdigit():
full_colspan += int(cell["colspan"])
else:
full_colspan += 1
underline += '| ' + ' | '.join(['---'] * full_colspan) + ' |' + '\n'
elif (not el.previous_sibling
and (el.parent.name == 'table'
or (el.parent.name == 'tbody'
and not el.parent.previous_sibling))):
# first row, not headline, and:
# - the parent is table or
# - the parent is tbody at the beginning of a table.
# print empty headline above this row
overline += '| ' + ' | '.join([''] * len(cells)) + ' |' + '\n'
overline += '| ' + ' | '.join(['---'] * len(cells)) + ' |' + '\n'
return overline + '|' + text + '\n' + underline
def convert_img(self, el, text):
alt = el.attrs.get('alt', None) or ''
src = el.attrs.get('src', None) or ''
title = el.attrs.get('title', None) or ''
title_part = ' "%s"' % title.replace('"', r'\"') if title else ''
return '![%s](%s%s)' % (alt, src, title_part)
def markdownify(html, **options):

View File

@@ -1,73 +0,0 @@
#!/usr/bin/env python
import argparse
import sys
from markdownify import markdownify, ATX, ATX_CLOSED, UNDERLINED, \
SPACES, BACKSLASH, ASTERISK, UNDERSCORE
def main(argv=sys.argv[1:]):
parser = argparse.ArgumentParser(
prog='markdownify',
description='Converts html to markdown.',
)
parser.add_argument('html', nargs='?', type=argparse.FileType('r'),
default=sys.stdin,
help="The html file to convert. Defaults to STDIN if not "
"provided.")
parser.add_argument('-s', '--strip', nargs='*',
help="A list of tags to strip. This option can't be used with "
"the --convert option.")
parser.add_argument('-c', '--convert', nargs='*',
help="A list of tags to convert. This option can't be used with "
"the --strip option.")
parser.add_argument('-a', '--autolinks', action='store_true',
help="A boolean indicating whether the 'automatic link' style "
"should be used when a 'a' tag's contents match its href.")
parser.add_argument('--default-title', action='store_false',
help="A boolean to enable setting the title of a link to its "
"href, if no title is given.")
parser.add_argument('--heading-style', default=UNDERLINED,
choices=(ATX, ATX_CLOSED, UNDERLINED),
help="Defines how headings should be converted.")
parser.add_argument('-b', '--bullets', default='*+-',
help="A string of bullet styles to use; the bullet will "
"alternate based on nesting level.")
parser.add_argument('--strong-em-symbol', default=ASTERISK,
choices=(ASTERISK, UNDERSCORE),
help="Use * or _ to convert strong and italics text"),
parser.add_argument('--sub-symbol', default='',
help="Define the chars that surround '<sub>'.")
parser.add_argument('--sup-symbol', default='',
help="Define the chars that surround '<sup>'.")
parser.add_argument('--newline-style', default=SPACES,
choices=(SPACES, BACKSLASH),
help="Defines the style of <br> conversions: two spaces "
"or backslash at the and of the line thet should break.")
parser.add_argument('--code-language', default='',
help="Defines the language that should be assumed for all "
"'<pre>' sections.")
parser.add_argument('--no-escape-asterisks', dest='escape_asterisks',
action='store_false',
help="Do not escape '*' to '\\*' in text.")
parser.add_argument('--no-escape-underscores', dest='escape_underscores',
action='store_false',
help="Do not escape '_' to '\\_' in text.")
parser.add_argument('-i', '--keep-inline-images-in', nargs='*',
help="Images are converted to their alt-text when the images are "
"located inside headlines or table cells. If some inline images "
"should be converted to markdown images instead, this option can "
"be set to a list of parent tags that should be allowed to "
"contain inline images.")
parser.add_argument('-w', '--wrap', action='store_true',
help="Wrap all text paragraphs at --wrap-width characters.")
parser.add_argument('--wrap-width', type=int, default=80)
args = parser.parse_args(argv)
print(markdownify(**vars(args)))
if __name__ == '__main__':
main()

View File

@@ -1,45 +0,0 @@
[build-system]
requires = ["setuptools>=61.2", "setuptools_scm[toml]>=3.4.3"]
build-backend = "setuptools.build_meta"
[project]
name = "markdownify"
version = "0.13.0"
authors = [{name = "Matthew Tretter", email = "m@tthewwithanm.com"}]
description = "Convert HTML to markdown."
readme = "README.rst"
classifiers = [
"Environment :: Web Environment",
"Framework :: Django",
"Intended Audience :: Developers",
"License :: OSI Approved :: MIT License",
"Operating System :: OS Independent",
"Programming Language :: Python :: 2.5",
"Programming Language :: Python :: 2.6",
"Programming Language :: Python :: 2.7",
"Programming Language :: Python :: 3.6",
"Programming Language :: Python :: 3.7",
"Programming Language :: Python :: 3.8",
"Topic :: Utilities",
]
dependencies = [
"beautifulsoup4>=4.9,<5",
"six>=1.15,<2"
]
[project.urls]
Homepage = "http://github.com/matthewwithanm/python-markdownify"
Download = "http://github.com/matthewwithanm/python-markdownify/tarball/master"
[project.scripts]
markdownify = "markdownify.main:main"
[tool.setuptools]
zip-safe = false
include-package-data = true
[tool.setuptools.packages.find]
include = ["markdownify", "markdownify.*"]
namespaces = false
[tool.setuptools_scm]

2
setup.cfg Normal file
View File

@@ -0,0 +1,2 @@
[flake8]
ignore = E501

96
setup.py Normal file
View File

@@ -0,0 +1,96 @@
#/usr/bin/env python
import codecs
import os
from setuptools import setup, find_packages
from setuptools.command.test import test as TestCommand, Command
read = lambda filepath: codecs.open(filepath, 'r', 'utf-8').read()
pkgmeta = {
'__title__': 'markdownify',
'__author__': 'Matthew Tretter',
'__version__': '0.5.2',
}
class PyTest(TestCommand):
def finalize_options(self):
TestCommand.finalize_options(self)
self.test_args = ['tests', '-s']
self.test_suite = True
def run_tests(self):
import pytest
errno = pytest.main(self.test_args)
raise SystemExit(errno)
class LintCommand(Command):
"""
A copy of flake8's Flake8Command
"""
description = "Run flake8 on modules registered in setuptools"
user_options = []
def initialize_options(self):
pass
def finalize_options(self):
pass
def distribution_files(self):
if self.distribution.packages:
for package in self.distribution.packages:
yield package.replace(".", os.path.sep)
if self.distribution.py_modules:
for filename in self.distribution.py_modules:
yield "%s.py" % filename
def run(self):
from flake8.engine import get_style_guide
flake8_style = get_style_guide(config_file='setup.cfg')
paths = self.distribution_files()
report = flake8_style.check_files(paths)
raise SystemExit(report.total_errors > 0)
setup(
name='markdownify',
description='Convert HTML to markdown.',
long_description=read(os.path.join(os.path.dirname(__file__), 'README.rst')),
version=pkgmeta['__version__'],
author=pkgmeta['__author__'],
author_email='m@tthewwithanm.com',
url='http://github.com/matthewwithanm/python-markdownify',
download_url='http://github.com/matthewwithanm/python-markdownify/tarball/master',
packages=find_packages(),
zip_safe=False,
include_package_data=True,
setup_requires=[
'flake8',
],
tests_require=[
'pytest',
],
install_requires=[
'beautifulsoup4', 'six'
],
classifiers=[
'Environment :: Web Environment',
'Framework :: Django',
'Intended Audience :: Developers',
'License :: OSI Approved :: MIT License',
'Operating System :: OS Independent',
'Programming Language :: Python :: 2.5',
'Programming Language :: Python :: 2.6',
'Programming Language :: Python :: 2.7',
'Topic :: Utilities'
],
cmdclass={
'test': PyTest,
'lint': LintCommand,
},
)

View File

@@ -1,10 +0,0 @@
{ pkgs ? import <nixpkgs> {} }:
pkgs.mkShell {
name = "python-shell";
buildInputs = with pkgs; [
python38
python38Packages.tox
python38Packages.setuptools
python38Packages.virtualenv
];
}

View File

@@ -1,39 +1,6 @@
from markdownify import markdownify as md
def test_chomp():
assert md(' <b></b> ') == ' '
assert md(' <b> </b> ') == ' '
assert md(' <b> </b> ') == ' '
assert md(' <b> </b> ') == ' '
assert md(' <b>s </b> ') == ' **s** '
assert md(' <b> s</b> ') == ' **s** '
assert md(' <b> s </b> ') == ' **s** '
assert md(' <b> s </b> ') == ' **s** '
def test_nested():
text = md('<p>This is an <a href="http://example.com/">example link</a>.</p>')
assert text == 'This is an [example link](http://example.com/).\n\n'
def test_ignore_comments():
text = md("<!-- This is a comment -->")
assert text == ""
def test_ignore_comments_with_other_tags():
text = md("<!-- This is a comment --><a href='http://example.com/'>example link</a>")
assert text == "[example link](http://example.com/)"
def test_code_with_tricky_content():
assert md('<code>></code>') == "`>`"
assert md('<code>/home/</code><b>username</b>') == "`/home/`**username**"
assert md('First line <code>blah blah<br />blah blah</code> second line') \
== "First line `blah blah \nblah blah` second line"
def test_special_tags():
assert md('<!DOCTYPE html>') == ''
assert md('<![CDATA[foobar]]>') == 'foobar'

View File

@@ -10,4 +10,4 @@ def test_soup():
def test_whitespace():
assert md(' a b \t\t c ') == ' a b c '
assert md(' a b \n\n c ') == ' a b c '

View File

@@ -1,20 +1,40 @@
from markdownify import markdownify as md, ATX, ATX_CLOSED, BACKSLASH, UNDERSCORE
from markdownify import markdownify as md, ATX, ATX_CLOSED
import re
def inline_tests(tag, markup):
# test template for different inline tags
assert md(f'<{tag}>Hello</{tag}>') == f'{markup}Hello{markup}'
assert md(f'foo <{tag}>Hello</{tag}> bar') == f'foo {markup}Hello{markup} bar'
assert md(f'foo<{tag}> Hello</{tag}> bar') == f'foo {markup}Hello{markup} bar'
assert md(f'foo <{tag}>Hello </{tag}>bar') == f'foo {markup}Hello{markup} bar'
assert md(f'foo <{tag}></{tag}> bar') in ['foo bar', 'foo bar'] # Either is OK
nested_uls = re.sub(r'\s+', '', """
<ul>
<li>1
<ul>
<li>a
<ul>
<li>I</li>
<li>II</li>
<li>III</li>
</ul>
</li>
<li>b</li>
<li>c</li>
</ul>
</li>
<li>2</li>
<li>3</li>
</ul>""")
def test_chomp():
assert md(' <b></b> ') == ' '
assert md(' <b> </b> ') == ' '
assert md(' <b> </b> ') == ' '
assert md(' <b> </b> ') == ' '
assert md(' <b>s </b> ') == ' **s** '
assert md(' <b> s</b> ') == ' **s** '
assert md(' <b> s </b> ') == ' **s** '
assert md(' <b> s </b> ') == ' **s** '
def test_a():
assert md('<a href="https://google.com">Google</a>') == '[Google](https://google.com)'
assert md('<a href="https://google.com">https://google.com</a>') == '<https://google.com>'
assert md('<a href="https://community.kde.org/Get_Involved">https://community.kde.org/Get_Involved</a>') == '<https://community.kde.org/Get_Involved>'
assert md('<a href="https://community.kde.org/Get_Involved">https://community.kde.org/Get_Involved</a>', autolinks=False) == '[https://community.kde.org/Get\\_Involved](https://community.kde.org/Get_Involved)'
assert md('<a href="http://google.com">Google</a>') == '[Google](http://google.com)'
def test_a_spaces():
@@ -27,7 +47,6 @@ def test_a_spaces():
def test_a_with_title():
text = md('<a href="http://google.com" title="The &quot;Goog&quot;">Google</a>')
assert text == r'[Google](http://google.com "The \"Goog\"")'
assert md('<a href="https://google.com">https://google.com</a>', default_title=True) == '[https://google.com](https://google.com "https://google.com")'
def test_a_shortcut():
@@ -36,7 +55,8 @@ def test_a_shortcut():
def test_a_no_autolinks():
assert md('<a href="https://google.com">https://google.com</a>', autolinks=False) == '[https://google.com](https://google.com)'
text = md('<a href="http://google.com">http://google.com</a>', autolinks=False)
assert text == '[http://google.com](http://google.com)'
def test_b():
@@ -51,72 +71,27 @@ def test_b_spaces():
def test_blockquote():
assert md('<blockquote>Hello</blockquote>') == '\n> Hello\n\n'
assert md('<blockquote>\nHello\n</blockquote>') == '\n> Hello\n\n'
assert md('<blockquote>Hello</blockquote>').strip() == '> Hello'
def test_blockquote_with_nested_paragraph():
assert md('<blockquote><p>Hello</p></blockquote>') == '\n> Hello\n\n'
assert md('<blockquote><p>Hello</p><p>Hello again</p></blockquote>') == '\n> Hello\n> \n> Hello again\n\n'
def test_blockquote_with_paragraph():
assert md('<blockquote>Hello</blockquote><p>handsome</p>') == '\n> Hello\n\nhandsome\n\n'
def test_blockquote_nested():
text = md('<blockquote>And she was like <blockquote>Hello</blockquote></blockquote>')
assert text == '\n> And she was like \n> > Hello\n\n'
def test_nested_blockquote():
text = md('<blockquote>And she was like <blockquote>Hello</blockquote></blockquote>').strip()
assert text == '> And she was like \n> > Hello'
def test_br():
assert md('a<br />b<br />c') == 'a \nb \nc'
assert md('a<br />b<br />c', newline_style=BACKSLASH) == 'a\\\nb\\\nc'
def test_caption():
assert md('TEXT<figure><figcaption>Caption</figcaption><span>SPAN</span></figure>') == 'TEXT\n\nCaption\n\nSPAN'
assert md('<figure><span>SPAN</span><figcaption>Caption</figcaption></figure>TEXT') == 'SPAN\n\nCaption\n\nTEXT'
def test_code():
inline_tests('code', '`')
assert md('<code>*this_should_not_escape*</code>') == '`*this_should_not_escape*`'
assert md('<kbd>*this_should_not_escape*</kbd>') == '`*this_should_not_escape*`'
assert md('<samp>*this_should_not_escape*</samp>') == '`*this_should_not_escape*`'
assert md('<code><span>*this_should_not_escape*</span></code>') == '`*this_should_not_escape*`'
assert md('<code>this should\t\tnormalize</code>') == '`this should normalize`'
assert md('<code><span>this should\t\tnormalize</span></code>') == '`this should normalize`'
assert md('<code>foo<b>bar</b>baz</code>') == '`foobarbaz`'
assert md('<kbd>foo<i>bar</i>baz</kbd>') == '`foobarbaz`'
assert md('<samp>foo<del> bar </del>baz</samp>') == '`foo bar baz`'
assert md('<samp>foo <del>bar</del> baz</samp>') == '`foo bar baz`'
assert md('<code>foo<em> bar </em>baz</code>') == '`foo bar baz`'
assert md('<code>foo<code> bar </code>baz</code>') == '`foo bar baz`'
assert md('<code>foo<strong> bar </strong>baz</code>') == '`foo bar baz`'
assert md('<code>foo<s> bar </s>baz</code>') == '`foo bar baz`'
assert md('<code>foo<sup>bar</sup>baz</code>', sup_symbol='^') == '`foobarbaz`'
assert md('<code>foo<sub>bar</sub>baz</code>', sub_symbol='^') == '`foobarbaz`'
def test_del():
inline_tests('del', '~~')
def test_div():
assert md('Hello</div> World') == 'Hello World'
def test_em():
inline_tests('em', '*')
assert md('<em>Hello</em>') == '*Hello*'
def test_header_with_space():
assert md('<h3>\n\nHello</h3>') == '### Hello\n\n'
assert md('<h4>\n\nHello</h4>') == '#### Hello\n\n'
assert md('<h5>\n\nHello</h5>') == '##### Hello\n\n'
assert md('<h5>\n\nHello\n\n</h5>') == '##### Hello\n\n'
assert md('<h5>\n\nHello \n\n</h5>') == '##### Hello\n\n'
def test_em_spaces():
assert md('foo <em>Hello</em> bar') == 'foo *Hello* bar'
assert md('foo<em> Hello</em> bar') == 'foo *Hello* bar'
assert md('foo <em>Hello </em>bar') == 'foo *Hello* bar'
assert md('foo <em></em> bar') == 'foo bar'
def test_h1():
@@ -129,163 +104,56 @@ def test_h2():
def test_hn():
assert md('<h3>Hello</h3>') == '### Hello\n\n'
assert md('<h4>Hello</h4>') == '#### Hello\n\n'
assert md('<h5>Hello</h5>') == '##### Hello\n\n'
assert md('<h6>Hello</h6>') == '###### Hello\n\n'
def test_hn_chained():
assert md('<h1>First</h1>\n<h2>Second</h2>\n<h3>Third</h3>', heading_style=ATX) == '# First\n\n\n## Second\n\n\n### Third\n\n'
assert md('X<h1>First</h1>', heading_style=ATX) == 'X# First\n\n'
def test_hn_nested_tag_heading_style():
assert md('<h1>A <p>P</p> C </h1>', heading_style=ATX_CLOSED) == '# A P C #\n\n'
assert md('<h1>A <p>P</p> C </h1>', heading_style=ATX) == '# A P C\n\n'
def test_hn_nested_simple_tag():
tag_to_markdown = [
("strong", "**strong**"),
("b", "**b**"),
("em", "*em*"),
("i", "*i*"),
("p", "p"),
("a", "a"),
("div", "div"),
("blockquote", "blockquote"),
]
for tag, markdown in tag_to_markdown:
assert md('<h3>A <' + tag + '>' + tag + '</' + tag + '> B</h3>') == '### A ' + markdown + ' B\n\n'
assert md('<h3>A <br>B</h3>', heading_style=ATX) == '### A B\n\n'
# Nested lists not supported
# assert md('<h3>A <ul><li>li1</i><li>l2</li></ul></h3>', heading_style=ATX) == '### A li1 li2 B\n\n'
def test_hn_nested_img():
image_attributes_to_markdown = [
("", "", ""),
("alt='Alt Text'", "Alt Text", ""),
("alt='Alt Text' title='Optional title'", "Alt Text", " \"Optional title\""),
]
for image_attributes, markdown, title in image_attributes_to_markdown:
assert md('<h3>A <img src="/path/to/img.jpg" ' + image_attributes + '/> B</h3>') == '### A ' + markdown + ' B\n\n'
assert md('<h3>A <img src="/path/to/img.jpg" ' + image_attributes + '/> B</h3>', keep_inline_images_in=['h3']) == '### A ![' + markdown + '](/path/to/img.jpg' + title + ') B\n\n'
def test_hn_atx_headings():
def test_atx_headings():
assert md('<h1>Hello</h1>', heading_style=ATX) == '# Hello\n\n'
assert md('<h2>Hello</h2>', heading_style=ATX) == '## Hello\n\n'
def test_hn_atx_closed_headings():
def test_atx_closed_headings():
assert md('<h1>Hello</h1>', heading_style=ATX_CLOSED) == '# Hello #\n\n'
assert md('<h2>Hello</h2>', heading_style=ATX_CLOSED) == '## Hello ##\n\n'
def test_head():
assert md('<head>head</head>') == 'head'
def test_hr():
assert md('Hello<hr>World') == 'Hello\n\n---\n\nWorld'
assert md('Hello<hr />World') == 'Hello\n\n---\n\nWorld'
assert md('<p>Hello</p>\n<hr>\n<p>World</p>') == 'Hello\n\n\n\n\n---\n\n\nWorld\n\n'
def test_i():
assert md('<i>Hello</i>') == '*Hello*'
def test_img():
assert md('<img src="/path/to/img.jpg" alt="Alt text" title="Optional title" />') == '![Alt text](/path/to/img.jpg "Optional title")'
assert md('<img src="/path/to/img.jpg" alt="Alt text" />') == '![Alt text](/path/to/img.jpg)'
def test_kbd():
inline_tests('kbd', '`')
def test_ol():
assert md('<ol><li>a</li><li>b</li></ol>') == '\n1. a\n2. b\n\n'
assert md('<ol start="3"><li>a</li><li>b</li></ol>') == '\n3. a\n4. b\n\n'
def test_p():
assert md('<p>hello</p>') == 'hello\n\n'
assert md('<p>123456789 123456789</p>') == '123456789 123456789\n\n'
assert md('<p>123456789 123456789</p>', wrap=True, wrap_width=10) == '123456789\n123456789\n\n'
assert md('<p><a href="https://example.com">Some long link</a></p>', wrap=True, wrap_width=10) == '[Some long\nlink](https://example.com)\n\n'
assert md('<p>12345<br />67890</p>', wrap=True, wrap_width=10, newline_style=BACKSLASH) == '12345\\\n67890\n\n'
assert md('<p>12345678901<br />12345</p>', wrap=True, wrap_width=10, newline_style=BACKSLASH) == '12345678901\\\n12345\n\n'
def test_pre():
assert md('<pre>test\n foo\nbar</pre>') == '\n```\ntest\n foo\nbar\n```\n'
assert md('<pre><code>test\n foo\nbar</code></pre>') == '\n```\ntest\n foo\nbar\n```\n'
assert md('<pre>*this_should_not_escape*</pre>') == '\n```\n*this_should_not_escape*\n```\n'
assert md('<pre><span>*this_should_not_escape*</span></pre>') == '\n```\n*this_should_not_escape*\n```\n'
assert md('<pre>\t\tthis should\t\tnot normalize</pre>') == '\n```\n\t\tthis should\t\tnot normalize\n```\n'
assert md('<pre><span>\t\tthis should\t\tnot normalize</span></pre>') == '\n```\n\t\tthis should\t\tnot normalize\n```\n'
assert md('<pre>foo<b>\nbar\n</b>baz</pre>') == '\n```\nfoo\nbar\nbaz\n```\n'
assert md('<pre>foo<i>\nbar\n</i>baz</pre>') == '\n```\nfoo\nbar\nbaz\n```\n'
assert md('<pre>foo\n<i>bar</i>\nbaz</pre>') == '\n```\nfoo\nbar\nbaz\n```\n'
assert md('<pre>foo<i>\n</i>baz</pre>') == '\n```\nfoo\nbaz\n```\n'
assert md('<pre>foo<del>\nbar\n</del>baz</pre>') == '\n```\nfoo\nbar\nbaz\n```\n'
assert md('<pre>foo<em>\nbar\n</em>baz</pre>') == '\n```\nfoo\nbar\nbaz\n```\n'
assert md('<pre>foo<code>\nbar\n</code>baz</pre>') == '\n```\nfoo\nbar\nbaz\n```\n'
assert md('<pre>foo<strong>\nbar\n</strong>baz</pre>') == '\n```\nfoo\nbar\nbaz\n```\n'
assert md('<pre>foo<s>\nbar\n</s>baz</pre>') == '\n```\nfoo\nbar\nbaz\n```\n'
assert md('<pre>foo<sup>\nbar\n</sup>baz</pre>', sup_symbol='^') == '\n```\nfoo\nbar\nbaz\n```\n'
assert md('<pre>foo<sub>\nbar\n</sub>baz</pre>', sub_symbol='^') == '\n```\nfoo\nbar\nbaz\n```\n'
def test_script():
assert md('foo <script>var foo=42;</script> bar') == 'foo bar'
def test_style():
assert md('foo <style>h1 { font-size: larger }</style> bar') == 'foo bar'
def test_s():
inline_tests('s', '~~')
def test_samp():
inline_tests('samp', '`')
def test_strong():
assert md('<strong>Hello</strong>') == '**Hello**'
def test_strong_em_symbol():
assert md('<strong>Hello</strong>', strong_em_symbol=UNDERSCORE) == '__Hello__'
assert md('<b>Hello</b>', strong_em_symbol=UNDERSCORE) == '__Hello__'
assert md('<em>Hello</em>', strong_em_symbol=UNDERSCORE) == '_Hello_'
assert md('<i>Hello</i>', strong_em_symbol=UNDERSCORE) == '_Hello_'
def test_ul():
assert md('<ul><li>a</li><li>b</li></ul>') == '\n* a\n* b\n\n'
def test_sub():
assert md('<sub>foo</sub>') == 'foo'
assert md('<sub>foo</sub>', sub_symbol='~') == '~foo~'
assert md('<sub>foo</sub>', sub_symbol='<sub>') == '<sub>foo</sub>'
def test_inline_ul():
assert md('<p>foo</p><ul><li>a</li><li>b</li></ul><p>bar</p>') == 'foo\n\n\n* a\n* b\n\nbar\n\n'
def test_sup():
assert md('<sup>foo</sup>') == 'foo'
assert md('<sup>foo</sup>', sup_symbol='^') == '^foo^'
assert md('<sup>foo</sup>', sup_symbol='<sup>') == '<sup>foo</sup>'
def test_nested_uls():
"""
Nested ULs should alternate bullet characters.
"""
assert md(nested_uls) == '\n* 1\n\t+ a\n\t\t- I\n\t\t- II\n\t\t- III\n\t+ b\n\t+ c\n* 2\n* 3\n\n'
def test_lang():
assert md('<pre>test\n foo\nbar</pre>', code_language='python') == '\n```python\ntest\n foo\nbar\n```\n'
assert md('<pre><code>test\n foo\nbar</code></pre>', code_language='javascript') == '\n```javascript\ntest\n foo\nbar\n```\n'
def test_bullets():
assert md(nested_uls, bullets='-') == '\n- 1\n\t- a\n\t\t- I\n\t\t- II\n\t\t- III\n\t- b\n\t- c\n- 2\n- 3\n\n'
def test_lang_callback():
def callback(el):
return el['class'][0] if el.has_attr('class') else None
assert md('<pre class="python">test\n foo\nbar</pre>', code_language_callback=callback) == '\n```python\ntest\n foo\nbar\n```\n'
assert md('<pre class="javascript"><code>test\n foo\nbar</code></pre>', code_language_callback=callback) == '\n```javascript\ntest\n foo\nbar\n```\n'
assert md('<pre class="javascript"><code class="javascript">test\n foo\nbar</code></pre>', code_language_callback=callback) == '\n```javascript\ntest\n foo\nbar\n```\n'
def test_img():
assert md('<img src="/path/to/img.jpg" alt="Alt text" title="Optional title" />') == '![Alt text](/path/to/img.jpg "Optional title")'
assert md('<img src="/path/to/img.jpg" alt="Alt text" />') == '![Alt text](/path/to/img.jpg)'

View File

@@ -1,25 +0,0 @@
from markdownify import MarkdownConverter
from bs4 import BeautifulSoup
class ImageBlockConverter(MarkdownConverter):
"""
Create a custom MarkdownConverter that adds two newlines after an image
"""
def convert_img(self, el, text, convert_as_inline):
return super().convert_img(el, text, convert_as_inline) + '\n\n'
def test_img():
# Create shorthand method for conversion
def md(html, **options):
return ImageBlockConverter(**options).convert(html)
assert md('<img src="/path/to/img.jpg" alt="Alt text" title="Optional title" />') == '![Alt text](/path/to/img.jpg "Optional title")\n\n'
assert md('<img src="/path/to/img.jpg" alt="Alt text" />') == '![Alt text](/path/to/img.jpg)\n\n'
def test_soup():
html = '<b>test</b>'
soup = BeautifulSoup(html, 'html.parser')
assert MarkdownConverter().convert_soup(soup) == '**test**'

View File

@@ -1,18 +1,12 @@
from markdownify import markdownify as md
def test_asterisks():
assert md('*hey*dude*') == r'\*hey\*dude\*'
assert md('*hey*dude*', escape_asterisks=False) == r'*hey*dude*'
def test_underscore():
assert md('_hey_dude_') == r'\_hey\_dude\_'
assert md('_hey_dude_', escape_underscores=False) == r'_hey_dude_'
assert md('_hey_dude_') == '\_hey\_dude\_'
def test_xml_entities():
assert md('&amp;') == r'\&'
assert md('&amp;') == '&'
def test_named_entities():
@@ -25,23 +19,4 @@ def test_hexadecimal_entities():
def test_single_escaping_entities():
assert md('&amp;amp;') == r'\&amp;'
def text_misc():
assert md('\\*') == r'\\\*'
assert md('<foo>') == r'\<foo\>'
assert md('# foo') == r'\# foo'
assert md('> foo') == r'\> foo'
assert md('~~foo~~') == r'\~\~foo\~\~'
assert md('foo\n===\n') == 'foo\n\\=\\=\\=\n'
assert md('---\n') == '\\-\\-\\-\n'
assert md('+ x\n+ y\n') == '\\+ x\n\\+ y\n'
assert md('`x`') == r'\`x\`'
assert md('[text](link)') == r'\[text](link)'
assert md('1. x') == r'1\. x'
assert md('not a number. x') == r'not a number. x'
assert md('1) x') == r'1\) x'
assert md('not a number) x') == r'not a number) x'
assert md('|not table|') == r'\|not table\|'
assert md(r'\ <foo> &amp;amp; | ` `', escape_misc=False) == r'\ <foo> &amp; | ` `'
assert md('&amp;amp;') == '&amp;'

View File

@@ -1,84 +0,0 @@
from markdownify import markdownify as md
nested_uls = """
<ul>
<li>1
<ul>
<li>a
<ul>
<li>I</li>
<li>II</li>
<li>III</li>
</ul>
</li>
<li>b</li>
<li>c</li>
</ul>
</li>
<li>2</li>
<li>3</li>
</ul>"""
nested_ols = """
<ol>
<li>1
<ol>
<li>a
<ol>
<li>I</li>
<li>II</li>
<li>III</li>
</ol>
</li>
<li>b</li>
<li>c</li>
</ol>
</li>
<li>2</li>
<li>3</li>
</ul>"""
def test_ol():
assert md('<ol><li>a</li><li>b</li></ol>') == '1. a\n2. b\n'
assert md('<ol start="3"><li>a</li><li>b</li></ol>') == '3. a\n4. b\n'
assert md('<ol start="-1"><li>a</li><li>b</li></ol>') == '1. a\n2. b\n'
assert md('<ol start="foo"><li>a</li><li>b</li></ol>') == '1. a\n2. b\n'
assert md('<ol start="1.5"><li>a</li><li>b</li></ol>') == '1. a\n2. b\n'
def test_nested_ols():
assert md(nested_ols) == '\n1. 1\n\t1. a\n\t\t1. I\n\t\t2. II\n\t\t3. III\n\t2. b\n\t3. c\n2. 2\n3. 3\n'
def test_ul():
assert md('<ul><li>a</li><li>b</li></ul>') == '* a\n* b\n'
assert md("""<ul>
<li>
a
</li>
<li> b </li>
<li> c
</li>
</ul>""") == '* a\n* b\n* c\n'
def test_inline_ul():
assert md('<p>foo</p><ul><li>a</li><li>b</li></ul><p>bar</p>') == 'foo\n\n* a\n* b\n\nbar\n\n'
def test_nested_uls():
"""
Nested ULs should alternate bullet characters.
"""
assert md(nested_uls) == '\n* 1\n\t+ a\n\t\t- I\n\t\t- II\n\t\t- III\n\t+ b\n\t+ c\n* 2\n* 3\n'
def test_bullets():
assert md(nested_uls, bullets='-') == '\n- 1\n\t- a\n\t\t- I\n\t\t- II\n\t\t- III\n\t- b\n\t- c\n- 2\n- 3\n'
def test_li_text():
assert md('<ul><li>foo <a href="#">bar</a></li><li>foo bar </li><li>foo <b>bar</b> <i>space</i>.</ul>') == '* foo [bar](#)\n* foo bar\n* foo **bar** *space*.\n'

View File

@@ -1,254 +0,0 @@
from markdownify import markdownify as md
table = """<table>
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Age</th>
</tr>
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>"""
table_with_html_content = """<table>
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Age</th>
</tr>
<tr>
<td><b>Jill</b></td>
<td><i>Smith</i></td>
<td><a href="#">50</a></td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>"""
table_with_paragraphs = """<table>
<tr>
<th>Firstname</th>
<th><p>Lastname</p></th>
<th>Age</th>
</tr>
<tr>
<td><p>Jill</p></td>
<td><p>Smith</p></td>
<td><p>50</p></td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>"""
table_with_linebreaks = """<table>
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Age</th>
</tr>
<tr>
<td>Jill</td>
<td>Smith
Jackson</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson
Smith</td>
<td>94</td>
</tr>
</table>"""
table_with_header_column = """<table>
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Age</th>
</tr>
<tr>
<th>Jill</th>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<th>Eve</th>
<td>Jackson</td>
<td>94</td>
</tr>
</table>"""
table_head_body = """<table>
<thead>
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Age</th>
</tr>
</thead>
<tbody>
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</tbody>
</table>"""
table_head_body_missing_head = """<table>
<thead>
<tr>
<td>Firstname</td>
<td>Lastname</td>
<td>Age</td>
</tr>
</thead>
<tbody>
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</tbody>
</table>"""
table_missing_text = """<table>
<thead>
<tr>
<th></th>
<th>Lastname</th>
<th>Age</th>
</tr>
</thead>
<tbody>
<tr>
<td>Jill</td>
<td></td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</tbody>
</table>"""
table_missing_head = """<table>
<tr>
<td>Firstname</td>
<td>Lastname</td>
<td>Age</td>
</tr>
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>"""
table_body = """<table>
<tbody>
<tr>
<td>Firstname</td>
<td>Lastname</td>
<td>Age</td>
</tr>
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</tbody>
</table>"""
table_with_caption = """TEXT<table><caption>Caption</caption>
<tbody><tr><td>Firstname</td>
<td>Lastname</td>
<td>Age</td>
</tr>
</tbody>
</table>"""
table_with_colspan = """<table>
<tr>
<th colspan="2">Name</th>
<th>Age</th>
</tr>
<tr>
<td colspan="1">Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>"""
table_with_undefined_colspan = """<table>
<tr>
<th colspan="undefined">Name</th>
<th>Age</th>
</tr>
<tr>
<td colspan="-1">Jill</td>
<td>Smith</td>
</tr>
</table>"""
def test_table():
assert md(table) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_with_html_content) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| **Jill** | *Smith* | [50](#) |\n| Eve | Jackson | 94 |\n\n'
assert md(table_with_paragraphs) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_with_linebreaks) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith Jackson | 50 |\n| Eve | Jackson Smith | 94 |\n\n'
assert md(table_with_header_column) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_head_body) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_head_body_missing_head) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_missing_text) == '\n\n| | Lastname | Age |\n| --- | --- | --- |\n| Jill | | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_missing_head) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_body) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_with_caption) == 'TEXT\n\nCaption\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n\n'
assert md(table_with_colspan) == '\n\n| Name | | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
assert md(table_with_undefined_colspan) == '\n\n| Name | Age |\n| --- | --- |\n| Jill | Smith |\n\n'

15
tox.ini
View File

@@ -1,15 +0,0 @@
[tox]
envlist = py38
[testenv]
passenv = PYTHONPATH
deps =
pytest==8
flake8
restructuredtext_lint
Pygments
commands =
pytest
flake8 --ignore=E501,W503 markdownify tests
restructuredtext-lint README.rst