added readme for callback

Merge branch 'code_language_callback' of https://github.com/tdgroot/python-markdownify into tdgroot-code_language_callback
add escaping of asterisks and option to disable it
2022-04-13 20:42:38 +02:00 · 2022-04-13 20:25:37 +02:00 · 2022-04-13 20:04:12 +02:00 · 2022-04-13 19:55:34 +02:00 · 2022-04-09 13:22:28 +02:00 · 2022-01-24 18:18:19 +01:00
11 changed files with 618 additions and 269 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -8,3 +8,4 @@
 /MANIFEST
 /venv
 build/
 .vscode/settings.json
--- a/README.rst
+++ b/README.rst
@@ -32,14 +32,14 @@ Convert some HTML to Markdown:
    from markdownify import markdownify as md
    md('<b>Yay</b> <a href="http://github.com">GitHub</a>')  # > '**Yay** [GitHub](http://github.com)'
-Specify tags to exclude (blacklist):
+Specify tags to exclude:
 .. code:: python
    from markdownify import markdownify as md
    md('<b>Yay</b> <a href="http://github.com">GitHub</a>', strip=['a'])  # > '**Yay** GitHub'
-\...or specify the tags you want to include (whitelist):
+\...or specify the tags you want to include:
 .. code:: python
@@ -53,16 +53,20 @@ Options
 Markdownify supports the following options:
 strip
-  A list of tags to strip (blacklist). This option can't be used with the
+  A list of tags to strip. This option can't be used with the
  ``convert`` option.
 convert
-  A list of tags to convert (whitelist). This option can't be used with the
+  A list of tags to convert. This option can't be used with the
  ``strip`` option.
 autolinks
  A boolean indicating whether the "automatic link" style should be used when
-  a ``a`` tag's contents match its href. Defaults to ``True``
+  an ``a`` tag's contents match its href. Defaults to ``True``.
 default_title
  A boolean to enable setting the title of a link to its href, if no title is
  given. Defaults to ``False``.
 heading_style
  Defines how headings should be converted. Accepted values are ``ATX``,
@@ -80,6 +84,11 @@ strong_em_symbol
  *emphasized* texts. Either of these symbols can be chosen by the options
  ``ASTERISK`` (default) or ``UNDERSCORE`` respectively.
 sub_symbol, sup_symbol
  Define the chars that surround ``<sub>`` and ``<sup>`` text. Defaults to an
  empty string, because this is non-standard behavior. Could be something like
  ``~`` and ``^`` to result in ``~sub~`` and ``^sup^``.
 newline_style
  Defines the style of marking linebreaks (``<br>``) in markdown. The default
  value ``SPACES`` of this option will adopt the usual two spaces and a newline,
@@ -87,10 +96,79 @@ newline_style
  newline). While the latter convention is non-standard, it is commonly
  preferred and supported by a lot of interpreters.
 code_language
  Defines the language that should be assumed for all ``<pre>`` sections.
  Useful, if all code on a page is in the same programming language and
  should be annotated with `````python`` or similar.
  Defaults to ``''`` (empty string) and can be any string.
 code_language_callback
  When the HTML code contains ``pre`` tags that in some way provide the code
  language, for example as class, this callback can be used to extract the
  language from the tag and prefix it to the converted ``pre`` tag.
  The callback gets a single argument, a BeautifulSoup object, and returns
  a string containing the code language, or ``None``.
  An example to use the class name as code language could be::
    def callback(el):
        return el['class'][0] if el.has_attr('class') else None
  Defaults to ``None``.
 escape_asterisks
  If set to ``False``, do not escape ``*`` to ``\*`` in text.
  Defaults to ``True``.
 escape_underscores
  If set to ``False``, do not escape ``_`` to ``\_`` in text.
  Defaults to ``True``.
 keep_inline_images_in
  Images are converted to their alt-text when the images are located inside
  headlines or table cells. If some inline images should be converted to
  markdown images instead, this option can be set to a list of parent tags
  that should be allowed to contain inline images, for example ``['td']``.
  Defaults to an empty list.
 Options may be specified as kwargs to the ``markdownify`` function, or as a
 nested ``Options`` class in ``MarkdownConverter`` subclasses.
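
A minimal sketch of how ``code_language`` and ``code_language_callback`` combine, following the ``convert_pre`` logic in this change: the callback result wins, and ``code_language`` is the fallback. ``FakeTag`` and ``fenced_block`` are hypothetical names used only for illustration; ``FakeTag`` stands in for the BeautifulSoup tag the real callback receives so that the snippet runs without bs4.

```python
class FakeTag:
    """Hypothetical stand-in for a bs4 tag; only what the callback needs."""

    def __init__(self, attrs):
        self.attrs = attrs

    def has_attr(self, name):
        return name in self.attrs

    def __getitem__(self, name):
        return self.attrs[name]


def callback(el):
    # Same callback as in the README: first class name, or None.
    return el['class'][0] if el.has_attr('class') else None


def fenced_block(el, text, code_language='', code_language_callback=None):
    # Callback first; fall back to the static code_language option.
    if code_language_callback:
        code_language = code_language_callback(el) or code_language
    fence = '`' * 3
    return '\n%s%s\n%s\n%s\n' % (fence, code_language, text, fence)
```

When the callback returns ``None`` (no ``class`` attribute), the static ``code_language`` value is used, and with neither option set the fence carries no language annotation.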
 Converting BeautifulSoup objects
 ================================
 .. code:: python
    from markdownify import MarkdownConverter
    # Create shorthand method for conversion
    def md(soup, **options):
        return MarkdownConverter(**options).convert_soup(soup)
 Creating Custom Converters
 ==========================
 If you have a special use case that calls for a special conversion, you can
 always inherit from ``MarkdownConverter`` and override the method you want to
 change:
 .. code:: python
    from markdownify import MarkdownConverter
    class ImageBlockConverter(MarkdownConverter):
        """
        Create a custom MarkdownConverter that adds two newlines after an image
        """
        def convert_img(self, el, text, convert_as_inline):
            return super().convert_img(el, text, convert_as_inline) + '\n\n'
    # Create shorthand method for conversion
    def md(html, **options):
        return ImageBlockConverter(**options).convert(html)
 Development
 ===========
--- a/markdownify/__init__.py
+++ b/markdownify/__init__.py
@@ -1,4 +1,4 @@
-from bs4 import BeautifulSoup, NavigableString, Comment
+from bs4 import BeautifulSoup, NavigableString, Comment, Doctype
 import re
 import six
@@ -25,12 +25,6 @@ ASTERISK = '*'
 UNDERSCORE = '_'
 def escape(text):
    if not text:
        return ''
    return text.replace('_', r'\_')
 def chomp(text):
    """
    If the text in an inline tag like b, a, or em contains a leading or trailing
@@ -44,19 +38,43 @@ def chomp(text):
    return (prefix, suffix, text)
 def abstract_inline_conversion(markup_fn):
    """
    This abstracts all simple inline tags like b, em, del, ...
    Returns a function that wraps the chomped text in a pair of the string
    that is returned by markup_fn. markup_fn is necessary to allow for
    references to self.strong_em_symbol etc.
    """
    def implementation(self, el, text, convert_as_inline):
        markup = markup_fn(self)
        prefix, suffix, text = chomp(text)
        if not text:
            return ''
        return '%s%s%s%s%s' % (prefix, markup, text, markup, suffix)
    return implementation
 def _todict(obj):
    return dict((k, getattr(obj, k)) for k in dir(obj) if not k.startswith('_'))
 class MarkdownConverter(object):
    class DefaultOptions:
        strip = None
        convert = None
        autolinks = True
        heading_style = UNDERLINED
        bullets = '*+-'  # An iterable of bullet types.
-        strong_em_symbol = ASTERISK
+        code_language = ''
        code_language_callback = None
        convert = None
        default_title = False
        escape_asterisks = True
        escape_underscores = True
        heading_style = UNDERLINED
        keep_inline_images_in = []
        newline_style = SPACES
        strip = None
        strong_em_symbol = ASTERISK
        sub_symbol = ''
        sup_symbol = ''
    class Options(DefaultOptions):
        pass
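
A standalone sketch of the ``abstract_inline_conversion`` factory introduced in this hunk, showing why ``markup_fn`` takes ``self``: the markup is looked up per call, so it follows the converter's options. The body of ``chomp`` is not shown in this diff, so the version below is an assumption (keep at most one leading and one trailing space outside the markup), and ``TinyConverter`` is a hypothetical class that holds only the options used here.

```python
def chomp(text):
    # Assumed behaviour: move one leading/trailing space outside the markup.
    prefix = ' ' if text and text[0] == ' ' else ''
    suffix = ' ' if text and text[-1] == ' ' else ''
    return prefix, suffix, text.strip()


def abstract_inline_conversion(markup_fn):
    def implementation(self, el, text, convert_as_inline):
        markup = markup_fn(self)  # resolved per call, so options are honoured
        prefix, suffix, text = chomp(text)
        if not text:
            return ''
        return '%s%s%s%s%s' % (prefix, markup, text, markup, suffix)
    return implementation


class TinyConverter:
    """Hypothetical converter with just the options needed for the demo."""

    def __init__(self, **options):
        self.options = {'strong_em_symbol': '*', 'sub_symbol': ''}
        self.options.update(options)

    convert_b = abstract_inline_conversion(lambda self: 2 * self.options['strong_em_symbol'])
    convert_em = abstract_inline_conversion(lambda self: self.options['strong_em_symbol'])
    convert_sub = abstract_inline_conversion(lambda self: self.options['sub_symbol'])
```

With these definitions, ``TinyConverter(strong_em_symbol='_').convert_em(None, 'Hello', False)`` yields ``_Hello_``, and ``convert_sub`` leaves the text unwrapped until ``sub_symbol`` is set.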
@@ -73,26 +91,48 @@ class MarkdownConverter(object):
    def convert(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        return self.convert_soup(soup)
    def convert_soup(self, soup):
        return self.process_tag(soup, convert_as_inline=False, children_only=True)
    def process_tag(self, node, convert_as_inline, children_only=False):
        text = ''
-        # markdown headings can't include block elements (elements w/newlines)
+
        # markdown headings or cells can't include
        # block elements (elements w/newlines)
        isHeading = html_heading_re.match(node.name) is not None
        isCell = node.name in ['td', 'th']
        convert_children_as_inline = convert_as_inline
-        if not children_only and isHeading:
+        if not children_only and (isHeading or isCell):
            convert_children_as_inline = True
-        # Remove whitespace-only textnodes in lists
+        # Remove whitespace-only textnodes in purely nested nodes
-        if node.name in ['ol', 'ul', 'li']:
+        def is_nested_node(el):
            return el and el.name in ['ol', 'ul', 'li',
                                      'table', 'thead', 'tbody', 'tfoot',
                                      'tr', 'td', 'th']
        if is_nested_node(node):
            for el in node.children:
-                if isinstance(el, NavigableString) and six.text_type(el).strip() == '':
+                # Only extract (remove) whitespace-only text node if any of the
                # conditions is true:
                # - el is the first element in its parent
                # - el is the last element in its parent
                # - el is adjacent to a nested node
                can_extract = (not el.previous_sibling
                               or not el.next_sibling
                               or is_nested_node(el.previous_sibling)
                               or is_nested_node(el.next_sibling))
                if (isinstance(el, NavigableString)
                        and six.text_type(el).strip() == ''
                        and can_extract):
                    el.extract()
        # Convert the children first
        for el in node.children:
-            if isinstance(el, Comment):
+            if isinstance(el, Comment) or isinstance(el, Doctype):
                continue
            elif isinstance(el, NavigableString):
                text += self.process_text(el)
@@ -107,10 +147,26 @@ class MarkdownConverter(object):
        return text
    def process_text(self, el):
-        text = six.text_type(el)
+        text = six.text_type(el) or ''
-        if el.parent.name == 'li':
+
-            return escape(all_whitespace_re.sub(' ', text or '')).rstrip()
+        # don't remove any whitespace when handling pre or code in pre
-        return escape(whitespace_re.sub(' ', text or ''))
+        if not (el.parent.name == 'pre'
                or (el.parent.name == 'code'
                    and el.parent.parent.name == 'pre')):
            text = whitespace_re.sub(' ', text)
        if el.parent.name != 'code':
            text = self.escape(text)
        # remove trailing whitespace if any of the following conditions is true:
        # - current text node is the last node in li
        # - current text node is followed by an embedded list
        if (el.parent.name == 'li'
                and (not el.next_sibling
                     or el.next_sibling.name in ['ul', 'ol'])):
            text = text.rstrip()
        return text
    def __getattr__(self, attr):
        # Handle headings
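
The whitespace rules in the reworked ``process_text`` can be sketched without BeautifulSoup as follows. The definition of ``whitespace_re`` is not part of this diff, so the pattern below is an assumption; ``process_text_sketch`` and its parameters are hypothetical stand-ins for the ``el.parent`` and ``el.next_sibling`` lookups, and escaping is left out to keep the focus on whitespace.

```python
import re

# Assumed pattern: collapses runs of tabs and spaces, leaves newlines alone.
whitespace_re = re.compile(r'[\t ]+')


def process_text_sketch(text, parent, grandparent=None, ends_li_text=False):
    # Text inside <pre>, or inside <code> nested in <pre>, keeps its whitespace.
    in_pre = parent == 'pre' or (parent == 'code' and grandparent == 'pre')
    if not in_pre:
        text = whitespace_re.sub(' ', text)
    # ends_li_text: the node is last in its <li> or is followed by a nested list.
    if parent == 'li' and ends_li_text:
        text = text.rstrip()
    return text
```

The indentation inside ``<pre>`` survives, while ordinary text has its runs of spaces collapsed to one.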
@@ -138,6 +194,15 @@ class MarkdownConverter(object):
        else:
            return True
    def escape(self, text):
        if not text:
            return ''
        if self.options['escape_asterisks']:
            text = text.replace('*', r'\*')
        if self.options['escape_underscores']:
            text = text.replace('_', r'\_')
        return text
    def indent(self, text, level):
        return line_beginning_re.sub('\t' * level, text) if text else ''
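
For reference, the new ``escape`` method behaves like the following free function, written here with keyword arguments in place of ``self.options`` so it can be run on its own. The function form and argument names are illustrative, not part of the library.

```python
def escape(text, escape_asterisks=True, escape_underscores=True):
    # Empty or missing text always becomes an empty string.
    if not text:
        return ''
    if escape_asterisks:
        text = text.replace('*', r'\*')
    if escape_underscores:
        text = text.replace('_', r'\_')
    return text
```

This matches the expectations in ``tests/test_escaping.py``: each flag disables exactly one kind of escaping.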
@@ -149,19 +214,21 @@ class MarkdownConverter(object):
        prefix, suffix, text = chomp(text)
        if not text:
            return ''
        if convert_as_inline:
            return text
        href = el.get('href')
        title = el.get('title')
        # For the replacement see #29: text nodes underscores are escaped
-        if self.options['autolinks'] and text.replace(r'\_', '_') == href and not title:
+        if (self.options['autolinks']
                and text.replace(r'\_', '_') == href
                and not title
                and not self.options['default_title']):
            # Shortcut syntax
            return '<%s>' % href
        if self.options['default_title'] and not title:
            title = href
        title_part = ' "%s"' % title.replace('"', r'\"') if title else ''
        return '%s[%s](%s%s)%s' % (prefix, text, href, title_part, suffix) if href else text
-    def convert_b(self, el, text, convert_as_inline):
+    convert_b = abstract_inline_conversion(lambda self: 2 * self.options['strong_em_symbol'])
        return self.convert_strong(el, text, convert_as_inline)
    def convert_blockquote(self, el, text, convert_as_inline):
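
The interaction between ``autolinks`` and the new ``default_title`` option in ``convert_a`` can be sketched as a plain function. ``link_markdown`` is a hypothetical name, and the ``chomp`` prefix/suffix handling and the inline shortcut are omitted for brevity.

```python
def link_markdown(text, href, title=None, autolinks=True, default_title=False):
    # Text nodes arrive with underscores escaped, so undo that before comparing.
    if (autolinks
            and text.replace(r'\_', '_') == href
            and not title
            and not default_title):
        return '<%s>' % href  # shortcut ("automatic link") syntax
    if default_title and not title:
        title = href
    title_part = ' "%s"' % title.replace('"', r'\"') if title else ''
    return '[%s](%s%s)' % (text, href, title_part) if href else text
```

Note that ``default_title=True`` suppresses the shortcut syntax, because an automatic link has no place to carry a title.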
@@ -179,12 +246,17 @@ class MarkdownConverter(object):
        else:
            return '  \n'
-    def convert_em(self, el, text, convert_as_inline):
+    def convert_code(self, el, text, convert_as_inline):
-        em_tag = self.options['strong_em_symbol']
+        if el.parent.name == 'pre':
-        prefix, suffix, text = chomp(text)
+            return text
-        if not text:
+        converter = abstract_inline_conversion(lambda self: '`')
-            return ''
+        return converter(self, el, text, convert_as_inline)
-        return '%s%s%s%s%s' % (prefix, em_tag, text, em_tag, suffix)
+
    convert_del = abstract_inline_conversion(lambda self: '~~')
    convert_em = abstract_inline_conversion(lambda self: self.options['strong_em_symbol'])
    convert_kbd = convert_code
    def convert_hn(self, n, el, text, convert_as_inline):
        if convert_as_inline:
@@ -200,8 +272,21 @@ class MarkdownConverter(object):
            return '%s %s %s\n\n' % (hashes, text, hashes)
        return '%s %s\n\n' % (hashes, text)
-    def convert_i(self, el, text, convert_as_inline):
+    def convert_hr(self, el, text, convert_as_inline):
-        return self.convert_em(el, text, convert_as_inline)
+        return '\n\n---\n\n'
    convert_i = convert_em
    def convert_img(self, el, text, convert_as_inline):
        alt = el.attrs.get('alt', None) or ''
        src = el.attrs.get('src', None) or ''
        title = el.attrs.get('title', None) or ''
        title_part = ' "%s"' % title.replace('"', r'\"') if title else ''
        if (convert_as_inline
                and el.parent.name not in self.options['keep_inline_images_in']):
            return alt
        return '![%s](%s%s)' % (alt, src, title_part)
    def convert_list(self, el, text, convert_as_inline):
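
A small sketch of the ``keep_inline_images_in`` decision made in ``convert_img``: inside an inline context (headings, table cells) the image collapses to its alt text unless the parent tag is listed in the option. ``image_markdown`` and its flat parameters are hypothetical, replacing the ``el`` lookups of the real method.

```python
def image_markdown(alt, src, title, parent_name, convert_as_inline,
                   keep_inline_images_in=()):
    title_part = ' "%s"' % title.replace('"', r'\"') if title else ''
    # Inline context: fall back to the alt text unless the parent is allowed.
    if convert_as_inline and parent_name not in keep_inline_images_in:
        return alt
    return '![%s](%s%s)' % (alt, src, title_part)
```

Passing ``keep_inline_images_in=['h3']`` therefore keeps the full image syntax inside an ``h3`` heading, as exercised by ``test_hn_nested_img``.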
@@ -241,49 +326,56 @@ class MarkdownConverter(object):
                el = el.parent
            bullets = self.options['bullets']
            bullet = bullets[depth % len(bullets)]
-        return '%s %s\n' % (bullet, text or '')
+        return '%s %s\n' % (bullet, (text or '').strip())
    def convert_p(self, el, text, convert_as_inline):
        if convert_as_inline:
            return text
        return '%s\n\n' % text if text else ''
-    def convert_strong(self, el, text, convert_as_inline):
+    def convert_pre(self, el, text, convert_as_inline):
        strong_tag = 2 * self.options['strong_em_symbol']
        prefix, suffix, text = chomp(text)
        if not text:
            return ''
-        return '%s%s%s%s%s' % (prefix, strong_tag, text, strong_tag, suffix)
+        code_language = self.options['code_language']
-    def convert_img(self, el, text, convert_as_inline):
+        if self.options['code_language_callback']:
-        alt = el.attrs.get('alt', None) or ''
+            code_language = self.options['code_language_callback'](el) or code_language
        src = el.attrs.get('src', None) or ''
        title = el.attrs.get('title', None) or ''
        title_part = ' "%s"' % title.replace('"', r'\"') if title else ''
        if convert_as_inline:
            return alt
-        return '![%s](%s%s)' % (alt, src, title_part)
+        return '\n```%s\n%s\n```\n' % (code_language, text)
    convert_s = convert_del
    convert_strong = convert_b
    convert_samp = convert_code
    convert_sub = abstract_inline_conversion(lambda self: self.options['sub_symbol'])
    convert_sup = abstract_inline_conversion(lambda self: self.options['sup_symbol'])
    def convert_table(self, el, text, convert_as_inline):
-        rows = el.find_all('tr')
+        return '\n\n' + text + '\n'
        text_data = []
        for row in rows:
            headers = row.find_all('th')
            columns = row.find_all('td')
            if len(headers) > 0:
                headers = [head.text.strip() for head in headers]
                text_data.append('| ' + ' | '.join(headers) + ' |')
                text_data.append('| ' + ' | '.join(['---'] * len(headers)) + ' |')
            elif len(columns) > 0:
                columns = [colm.text.strip() for colm in columns]
                text_data.append('| ' + ' | '.join(columns) + ' |')
            else:
                continue
        return '\n'.join(text_data)
-    def convert_hr(self, el, text, convert_as_inline):
+    def convert_td(self, el, text, convert_as_inline):
-        return '\n\n---\n\n'
+        return ' ' + text + ' |'
    def convert_th(self, el, text, convert_as_inline):
        return ' ' + text + ' |'
    def convert_tr(self, el, text, convert_as_inline):
        cells = el.find_all(['td', 'th'])
        is_headrow = all([cell.name == 'th' for cell in cells])
        overline = ''
        underline = ''
        if is_headrow and not el.previous_sibling:
            # first row and is headline: print headline underline
            underline += '| ' + ' | '.join(['---'] * len(cells)) + ' |' + '\n'
        elif not el.previous_sibling and el.parent.name == 'table':
            # first row, not headline, and the row sits directly in the table:
            # print empty headline above this row
            overline += '| ' + ' | '.join([''] * len(cells)) + ' |' + '\n'
            overline += '| ' + ' | '.join(['---'] * len(cells)) + ' |' + '\n'
        return overline + '|' + text + '\n' + underline
 def markdownify(html, **options):
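
The row assembly in ``convert_tr`` can be illustrated with plain strings. ``row_markdown`` is a hypothetical helper: it takes the cell texts plus three flags that the real method derives from the BeautifulSoup tree, and it emits the empty header row only when the row's parent is the ``table`` element itself.

```python
def row_markdown(cell_texts, is_headrow, is_first_row, parent_name):
    # convert_td / convert_th render each cell as ' text |'.
    text = ''.join(' %s |' % cell for cell in cell_texts)
    count = len(cell_texts)
    separator = '| ' + ' | '.join(['---'] * count) + ' |' + '\n'
    overline = ''
    underline = ''
    if is_headrow and is_first_row:
        # Header row: put the separator line below it.
        underline += separator
    elif is_first_row and parent_name == 'table':
        # No header row: synthesize an empty one above the first data row.
        overline += '| ' + ' | '.join([''] * count) + ' |' + '\n'
        overline += separator
    return overline + '|' + text + '\n' + underline
```

A first row made of ``th`` cells gets the ``---`` separator underneath, later rows are emitted as is, and a headerless table gets an empty header so that the output is still a valid Markdown table.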
--- a/setup.cfg
+++ b/setup.cfg
@@ -1,2 +1,2 @@
 [flake8]
-ignore = E501
+ignore = E501 W503
--- a/setup.py
+++ b/setup.py
@@ -10,7 +10,7 @@ read = lambda filepath: codecs.open(filepath, 'r', 'utf-8').read()
 pkgmeta = {
    '__title__': 'markdownify',
    '__author__': 'Matthew Tretter',
-    '__version__': '0.7.2',
+    '__version__': '0.10.3',
 }
@@ -70,7 +70,7 @@ setup(
    zip_safe=False,
    include_package_data=True,
    setup_requires=[
-        'flake8>=3.8,<4',
+        'flake8>=3.8,<5',
    ],
    tests_require=[
        'pytest>=6.2,<7',
--- a/tests/test_advanced.py
+++ b/tests/test_advanced.py
@@ -1,6 +1,17 @@
 from markdownify import markdownify as md
 def test_chomp():
    assert md(' <b></b> ') == '  '
    assert md(' <b> </b> ') == '  '
    assert md(' <b>  </b> ') == '  '
    assert md(' <b>   </b> ') == '  '
    assert md(' <b>s </b> ') == ' **s**  '
    assert md(' <b> s</b> ') == '  **s** '
    assert md(' <b> s </b> ') == '  **s**  '
    assert md(' <b>  s  </b> ') == '  **s**  '
 def test_nested():
    text = md('<p>This is an <a href="http://example.com/">example link</a>.</p>')
    assert text == 'This is an [example link](http://example.com/).\n\n'
@@ -14,3 +25,15 @@ def test_ignore_comments():
 def test_ignore_comments_with_other_tags():
    text = md("<!-- This is a comment --><a href='http://example.com/'>example link</a>")
    assert text == "[example link](http://example.com/)"
 def test_code_with_tricky_content():
    assert md('<code>></code>') == "`>`"
    assert md('<code>/home/</code><b>username</b>') == "`/home/`**username**"
    assert md('First line <code>blah blah<br />blah blah</code> second line') \
        == "First line `blah blah  \nblah blah` second line"
 def test_special_tags():
    assert md('<!DOCTYPE html>') == ''
    assert md('<![CDATA[foobar]]>') == 'foobar'
--- a/tests/test_conversions.py
+++ b/tests/test_conversions.py
@@ -1,130 +1,17 @@
 from markdownify import markdownify as md, ATX, ATX_CLOSED, BACKSLASH, UNDERSCORE
 import re
-nested_uls = """
+def inline_tests(tag, markup):
-    <ul>
+    # test template for different inline tags
-        <li>1
+    assert md(f'<{tag}>Hello</{tag}>') == f'{markup}Hello{markup}'
-            <ul>
+    assert md(f'foo <{tag}>Hello</{tag}> bar') == f'foo {markup}Hello{markup} bar'
-                <li>a
+    assert md(f'foo<{tag}> Hello</{tag}> bar') == f'foo {markup}Hello{markup} bar'
-                    <ul>
+    assert md(f'foo <{tag}>Hello </{tag}>bar') == f'foo {markup}Hello{markup} bar'
-                        <li>I</li>
+    assert md(f'foo <{tag}></{tag}> bar') in ['foo  bar', 'foo bar']  # Either is OK
                        <li>II</li>
                        <li>III</li>
                    </ul>
                </li>
                <li>b</li>
                <li>c</li>
            </ul>
        </li>
        <li>2</li>
        <li>3</li>
    </ul>"""
 nested_ols = """
    <ol>
        <li>1
            <ol>
                <li>a
                    <ol>
                        <li>I</li>
                        <li>II</li>
                        <li>III</li>
                    </ol>
                </li>
                <li>b</li>
                <li>c</li>
            </ol>
        </li>
        <li>2</li>
        <li>3</li>
    </ol>"""
 table = re.sub(r'\s+', '', """
 <table>
    <tr>
        <th>Firstname</th>
        <th>Lastname</th>
        <th>Age</th>
    </tr>
    <tr>
        <td>Jill</td>
        <td>Smith</td>
        <td>50</td>
    </tr>
    <tr>
        <td>Eve</td>
        <td>Jackson</td>
        <td>94</td>
    </tr>
 </table>
 """)
 table_head_body = re.sub(r'\s+', '', """
 <table>
    <thead>
            <tr>
            <th>Firstname</th>
            <th>Lastname</th>
            <th>Age</th>
            </tr>
    </thead>
    <tbody>
        <tr>
            <td>Jill</td>
            <td>Smith</td>
            <td>50</td>
        </tr>
        <tr>
            <td>Eve</td>
            <td>Jackson</td>
            <td>94</td>
        </tr>
    </tbody>
 </table>
 """)
 table_missing_text = re.sub(r'\s+', '', """
 <table>
    <thead>
            <tr>
            <th></th>
            <th>Lastname</th>
            <th>Age</th>
            </tr>
    </thead>
    <tbody>
        <tr>
            <td>Jill</td>
            <td></td>
            <td>50</td>
        </tr>
        <tr>
            <td>Eve</td>
            <td>Jackson</td>
            <td>94</td>
        </tr>
    </tbody>
 </table>
 """)
 def test_chomp():
    assert md(' <b></b> ') == '  '
    assert md(' <b> </b> ') == '  '
    assert md(' <b>  </b> ') == '  '
    assert md(' <b>   </b> ') == '  '
    assert md(' <b>s </b> ') == ' **s**  '
    assert md(' <b> s</b> ') == '  **s** '
    assert md(' <b> s </b> ') == '  **s**  '
    assert md(' <b>  s  </b> ') == '  **s**  '
 def test_a():
    assert md('<a href="https://google.com">Google</a>') == '[Google](https://google.com)'
    assert md('<a href="https://google.com">https://google.com</a>', autolinks=False) == '[https://google.com](https://google.com)'
    assert md('<a href="https://google.com">https://google.com</a>') == '<https://google.com>'
    assert md('<a href="https://community.kde.org/Get_Involved">https://community.kde.org/Get_Involved</a>') == '<https://community.kde.org/Get_Involved>'
    assert md('<a href="https://community.kde.org/Get_Involved">https://community.kde.org/Get_Involved</a>', autolinks=False) == '[https://community.kde.org/Get\\_Involved](https://community.kde.org/Get_Involved)'
@@ -140,6 +27,7 @@ def test_a_spaces():
 def test_a_with_title():
    text = md('<a href="http://google.com" title="The &quot;Goog&quot;">Google</a>')
    assert text == r'[Google](http://google.com "The \"Goog\"")'
    assert md('<a href="https://google.com">https://google.com</a>', default_title=True) == '[https://google.com](https://google.com "https://google.com")'
 def test_a_shortcut():
@@ -148,8 +36,7 @@ def test_a_shortcut():
 def test_a_no_autolinks():
-    text = md('<a href="http://google.com">http://google.com</a>', autolinks=False)
+    assert md('<a href="https://google.com">https://google.com</a>', autolinks=False) == '[https://google.com](https://google.com)'
    assert text == '[http://google.com](http://google.com)'
 def test_b():
@@ -171,24 +58,31 @@ def test_blockquote_with_paragraph():
    assert md('<blockquote>Hello</blockquote><p>handsome</p>') == '\n> Hello\n\nhandsome\n\n'
-def test_nested_blockquote():
+def test_blockquote_nested():
    text = md('<blockquote>And she was like <blockquote>Hello</blockquote></blockquote>')
    assert text == '\n> And she was like \n> > Hello\n> \n> \n\n'
 def test_br():
    assert md('a<br />b<br />c') == 'a  \nb  \nc'
    assert md('a<br />b<br />c', newline_style=BACKSLASH) == 'a\\\nb\\\nc'
 def test_code():
    inline_tests('code', '`')
    assert md('<code>this_should_not_escape</code>') == '`this_should_not_escape`'
 def test_del():
    inline_tests('del', '~~')
 def test_div():
    assert md('Hello</div> World') == 'Hello World'
 def test_em():
-    assert md('<em>Hello</em>') == '*Hello*'
+    inline_tests('em', '*')
 def test_em_spaces():
    assert md('foo <em>Hello</em> bar') == 'foo *Hello* bar'
    assert md('foo<em> Hello</em> bar') == 'foo *Hello* bar'
    assert md('foo <em>Hello </em>bar') == 'foo *Hello* bar'
    assert md('foo <em></em> bar') == 'foo  bar'
 def test_h1():
@@ -201,6 +95,8 @@ def test_h2():
 def test_hn():
    assert md('<h3>Hello</h3>') == '### Hello\n\n'
    assert md('<h4>Hello</h4>') == '#### Hello\n\n'
    assert md('<h5>Hello</h5>') == '##### Hello\n\n'
    assert md('<h6>Hello</h6>') == '###### Hello\n\n'
@@ -236,15 +132,28 @@ def test_hn_nested_simple_tag():
 def test_hn_nested_img():
    assert md('<img src="/path/to/img.jpg" alt="Alt text" title="Optional title" />') == '![Alt text](/path/to/img.jpg "Optional title")'
    assert md('<img src="/path/to/img.jpg" alt="Alt text" />') == '![Alt text](/path/to/img.jpg)'
    image_attributes_to_markdown = [
-        ("", ""),
+        ("", "", ""),
-        ("alt='Alt Text'", "Alt Text"),
+        ("alt='Alt Text'", "Alt Text", ""),
-        ("alt='Alt Text' title='Optional title'", "Alt Text"),
+        ("alt='Alt Text' title='Optional title'", "Alt Text", " \"Optional title\""),
    ]
-    for image_attributes, markdown in image_attributes_to_markdown:
+    for image_attributes, markdown, title in image_attributes_to_markdown:
-        assert md('<h3>A <img src="/path/to/img.jpg " ' + image_attributes + '/> B</h3>') == '### A ' + markdown + ' B\n\n'
+        assert md('<h3>A <img src="/path/to/img.jpg" ' + image_attributes + '/> B</h3>') == '### A ' + markdown + ' B\n\n'
        assert md('<h3>A <img src="/path/to/img.jpg" ' + image_attributes + '/> B</h3>', keep_inline_images_in=['h3']) == '### A ![' + markdown + '](/path/to/img.jpg' + title + ') B\n\n'
 def test_hn_atx_headings():
    assert md('<h1>Hello</h1>', heading_style=ATX) == '# Hello\n\n'
    assert md('<h2>Hello</h2>', heading_style=ATX) == '## Hello\n\n'
 def test_hn_atx_closed_headings():
    assert md('<h1>Hello</h1>', heading_style=ATX_CLOSED) == '# Hello #\n\n'
    assert md('<h2>Hello</h2>', heading_style=ATX_CLOSED) == '## Hello ##\n\n'
 def test_head():
    assert md('<head>head</head>') == 'head'
 def test_hr():
@@ -253,74 +162,38 @@ def test_hr():
    assert md('<p>Hello</p>\n<hr>\n<p>World</p>') == 'Hello\n\n\n\n\n---\n\n\nWorld\n\n'
 def test_head():
    assert md('<head>head</head>') == 'head'
 def test_atx_headings():
    assert md('<h1>Hello</h1>', heading_style=ATX) == '# Hello\n\n'
    assert md('<h2>Hello</h2>', heading_style=ATX) == '## Hello\n\n'
 def test_atx_closed_headings():
    assert md('<h1>Hello</h1>', heading_style=ATX_CLOSED) == '# Hello #\n\n'
    assert md('<h2>Hello</h2>', heading_style=ATX_CLOSED) == '## Hello ##\n\n'
 def test_i():
    assert md('<i>Hello</i>') == '*Hello*'
 def test_ol():
    assert md('<ol><li>a</li><li>b</li></ol>') == '1. a\n2. b\n'
    assert md('<ol start="3"><li>a</li><li>b</li></ol>') == '3. a\n4. b\n'
 def test_p():
    assert md('<p>hello</p>') == 'hello\n\n'
 def test_strong():
    assert md('<strong>Hello</strong>') == '**Hello**'
 def test_ul():
    assert md('<ul><li>a</li><li>b</li></ul>') == '* a\n* b\n'
 def test_nested_ols():
    assert md(nested_ols) == '\n1. 1\n\t1. a\n\t\t1. I\n\t\t2. II\n\t\t3. III\n\t2. b\n\t3. c\n2. 2\n3. 3\n'
 def test_inline_ul():
    assert md('<p>foo</p><ul><li>a</li><li>b</li></ul><p>bar</p>') == 'foo\n\n* a\n* b\n\nbar\n\n'
 def test_nested_uls():
    """
    Nested ULs should alternate bullet characters.
    """
    assert md(nested_uls) == '\n* 1\n\t+ a\n\t\t- I\n\t\t- II\n\t\t- III\n\t+ b\n\t+ c\n* 2\n* 3\n'
 def test_bullets():
    assert md(nested_uls, bullets='-') == '\n- 1\n\t- a\n\t\t- I\n\t\t- II\n\t\t- III\n\t- b\n\t- c\n- 2\n- 3\n'
 def test_img():
    assert md('<img src="/path/to/img.jpg" alt="Alt text" title="Optional title" />') == '![Alt text](/path/to/img.jpg "Optional title")'
    assert md('<img src="/path/to/img.jpg" alt="Alt text" />') == '![Alt text](/path/to/img.jpg)'
-def test_div():
+def test_kbd():
-    assert md('Hello</div> World') == 'Hello World'
+    inline_tests('kbd', '`')
-def test_table():
+def test_p():
-    assert md(table) == '| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |'
+    assert md('<p>hello</p>') == 'hello\n\n'
-    assert md(table_head_body) == '| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |'
+
-    assert md(table_missing_text) == '|  | Lastname | Age |\n| --- | --- | --- |\n| Jill |  | 50 |\n| Eve | Jackson | 94 |'
+
 def test_pre():
    assert md('<pre>test\n    foo\nbar</pre>') == '\n```\ntest\n    foo\nbar\n```\n'
    assert md('<pre><code>test\n    foo\nbar</code></pre>') == '\n```\ntest\n    foo\nbar\n```\n'
 def test_s():
    inline_tests('s', '~~')
 def test_samp():
    inline_tests('samp', '`')
 def test_strong():
    assert md('<strong>Hello</strong>') == '**Hello**'
 def test_strong_em_symbol():
@@ -330,5 +203,25 @@ def test_strong_em_symbol():
    assert md('<i>Hello</i>', strong_em_symbol=UNDERSCORE) == '_Hello_'
-def test_newline_style():
+def test_sub():
-    assert md('a<br />b<br />c', newline_style=BACKSLASH) == 'a\\\nb\\\nc'
+    assert md('<sub>foo</sub>') == 'foo'
    assert md('<sub>foo</sub>', sub_symbol='~') == '~foo~'
 def test_sup():
    assert md('<sup>foo</sup>') == 'foo'
    assert md('<sup>foo</sup>', sup_symbol='^') == '^foo^'
 def test_lang():
    assert md('<pre>test\n    foo\nbar</pre>', code_language='python') == '\n```python\ntest\n    foo\nbar\n```\n'
    assert md('<pre><code>test\n    foo\nbar</code></pre>', code_language='javascript') == '\n```javascript\ntest\n    foo\nbar\n```\n'
 def test_lang_callback():
    def callback(el):
        return el['class'][0] if el.has_attr('class') else None
    assert md('<pre class="python">test\n    foo\nbar</pre>', code_language_callback=callback) == '\n```python\ntest\n    foo\nbar\n```\n'
    assert md('<pre class="javascript"><code>test\n    foo\nbar</code></pre>', code_language_callback=callback) == '\n```javascript\ntest\n    foo\nbar\n```\n'
    assert md('<pre class="javascript"><code class="javascript">test\n    foo\nbar</code></pre>', code_language_callback=callback) == '\n```javascript\ntest\n    foo\nbar\n```\n'
--- a/tests/test_custom_converter.py
+++ b/tests/test_custom_converter.py
@@ -0,0 +1,25 @@
 from markdownify import MarkdownConverter
 from bs4 import BeautifulSoup
 class ImageBlockConverter(MarkdownConverter):
    """
    Create a custom MarkdownConverter that adds two newlines after an image
    """
    def convert_img(self, el, text, convert_as_inline):
        return super().convert_img(el, text, convert_as_inline) + '\n\n'
 def test_img():
    # Create shorthand method for conversion
    def md(html, **options):
        return ImageBlockConverter(**options).convert(html)
    assert md('<img src="/path/to/img.jpg" alt="Alt text" title="Optional title" />') == '![Alt text](/path/to/img.jpg "Optional title")\n\n'
    assert md('<img src="/path/to/img.jpg" alt="Alt text" />') == '![Alt text](/path/to/img.jpg)\n\n'
 def test_soup():
    html = '<b>test</b>'
    soup = BeautifulSoup(html, 'html.parser')
    assert MarkdownConverter().convert_soup(soup) == '**test**'
--- a/tests/test_escaping.py
+++ b/tests/test_escaping.py
@@ -1,8 +1,14 @@
 from markdownify import markdownify as md
 def test_asterisks():
    assert md('*hey*dude*') == r'\*hey\*dude\*'
    assert md('*hey*dude*', escape_asterisks=False) == r'*hey*dude*'
 def test_underscore():
    assert md('_hey_dude_') == r'\_hey\_dude\_'
    assert md('_hey_dude_', escape_underscores=False) == r'_hey_dude_'
 def test_xml_entities():
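The two options exercised above can be pictured as independent switches over a plain character escape. The sketch below is not markdownify's implementation; the function name `escape_text` is hypothetical, and the code only mirrors the behaviour that the assertions describe.

```python
def escape_text(text, escape_asterisks=True, escape_underscores=True):
    """Backslash-escape Markdown emphasis characters, each behind its own switch."""
    if escape_asterisks:
        text = text.replace('*', r'\*')
    if escape_underscores:
        text = text.replace('_', r'\_')
    return text
```

Keeping one flag per character lets a caller turn off underscore escaping, for instance for identifiers such as `snake_case`, while still escaping asterisks.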
--- a/tests/test_lists.py
+++ b/tests/test_lists.py
@@ -0,0 +1,81 @@
 from markdownify import markdownify as md
 nested_uls = """
    <ul>
        <li>1
            <ul>
                <li>a
                    <ul>
                        <li>I</li>
                        <li>II</li>
                        <li>III</li>
                    </ul>
                </li>
                <li>b</li>
                <li>c</li>
            </ul>
        </li>
        <li>2</li>
        <li>3</li>
    </ul>"""
 nested_ols = """
    <ol>
        <li>1
            <ol>
                <li>a
                    <ol>
                        <li>I</li>
                        <li>II</li>
                        <li>III</li>
                    </ol>
                </li>
                <li>b</li>
                <li>c</li>
            </ol>
        </li>
        <li>2</li>
        <li>3</li>
    </ol>"""
 def test_ol():
    assert md('<ol><li>a</li><li>b</li></ol>') == '1. a\n2. b\n'
    assert md('<ol start="3"><li>a</li><li>b</li></ol>') == '3. a\n4. b\n'
 def test_nested_ols():
    assert md(nested_ols) == '\n1. 1\n\t1. a\n\t\t1. I\n\t\t2. II\n\t\t3. III\n\t2. b\n\t3. c\n2. 2\n3. 3\n'
 def test_ul():
    assert md('<ul><li>a</li><li>b</li></ul>') == '* a\n* b\n'
    assert md("""<ul>
     <li>
             a
     </li>
     <li> b </li>
     <li>   c
     </li>
 </ul>""") == '* a\n* b\n* c\n'
 def test_inline_ul():
    assert md('<p>foo</p><ul><li>a</li><li>b</li></ul><p>bar</p>') == 'foo\n\n* a\n* b\n\nbar\n\n'
 def test_nested_uls():
    """
    Nested ULs should alternate bullet characters.
    """
    assert md(nested_uls) == '\n* 1\n\t+ a\n\t\t- I\n\t\t- II\n\t\t- III\n\t+ b\n\t+ c\n* 2\n* 3\n'
 def test_bullets():
    assert md(nested_uls, bullets='-') == '\n- 1\n\t- a\n\t\t- I\n\t\t- II\n\t\t- III\n\t- b\n\t- c\n- 2\n- 3\n'
 def test_li_text():
    assert md('<ul><li>foo <a href="#">bar</a></li><li>foo bar  </li><li>foo <b>bar</b>   <i>space</i>.</ul>') == '* foo [bar](#)\n* foo bar\n* foo **bar** *space*.\n'
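The expected strings in the nested-list tests imply two rules: each nesting level adds one tab of indentation, and unordered lists cycle through the bullets `*`, `+`, `-` by depth unless the `bullets` option overrides them. A minimal sketch of that rule on a plain nested Python structure follows. It is an illustration rather than markdownify's code, and `render_ul` and its `(text, children)` input shape are assumptions.

```python
def render_ul(items, bullets='*+-', depth=0):
    """Render nested (text, children) pairs as a Markdown bullet list."""
    bullet = bullets[depth % len(bullets)]  # cycle through the bullets by depth
    lines = []
    for text, children in items:
        lines.append('\t' * depth + bullet + ' ' + text)
        if children:
            lines.append(render_ul(children, bullets, depth + 1))
    return '\n'.join(lines)


# Same shape as the nested_uls fixture above.
nested = [
    ('1', [
        ('a', [('I', []), ('II', []), ('III', [])]),
        ('b', []),
        ('c', []),
    ]),
    ('2', []),
    ('3', []),
]
```

With a single-character `bullets` value such as `'-'`, the modulo always selects the same bullet, which matches what `test_bullets` expects.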
--- a/tests/test_tables.py
+++ b/tests/test_tables.py
@@ -0,0 +1,150 @@
 from markdownify import markdownify as md
 table = """<table>
    <tr>
        <th>Firstname</th>
        <th>Lastname</th>
        <th>Age</th>
    </tr>
    <tr>
        <td>Jill</td>
        <td>Smith</td>
        <td>50</td>
    </tr>
    <tr>
        <td>Eve</td>
        <td>Jackson</td>
        <td>94</td>
    </tr>
 </table>"""
 table_with_html_content = """<table>
    <tr>
        <th>Firstname</th>
        <th>Lastname</th>
        <th>Age</th>
    </tr>
    <tr>
        <td><b>Jill</b></td>
        <td><i>Smith</i></td>
        <td><a href="#">50</a></td>
    </tr>
    <tr>
        <td>Eve</td>
        <td>Jackson</td>
        <td>94</td>
    </tr>
 </table>"""
 table_with_paragraphs = """<table>
    <tr>
        <th>Firstname</th>
        <th><p>Lastname</p></th>
        <th>Age</th>
    </tr>
    <tr>
        <td><p>Jill</p></td>
        <td><p>Smith</p></td>
        <td><p>50</p></td>
    </tr>
    <tr>
        <td>Eve</td>
        <td>Jackson</td>
        <td>94</td>
    </tr>
 </table>"""
 table_with_header_column = """<table>
    <tr>
        <th>Firstname</th>
        <th>Lastname</th>
        <th>Age</th>
    </tr>
    <tr>
        <th>Jill</th>
        <td>Smith</td>
        <td>50</td>
    </tr>
    <tr>
        <th>Eve</th>
        <td>Jackson</td>
        <td>94</td>
    </tr>
 </table>"""
 table_head_body = """<table>
    <thead>
        <tr>
            <th>Firstname</th>
            <th>Lastname</th>
            <th>Age</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Jill</td>
            <td>Smith</td>
            <td>50</td>
        </tr>
        <tr>
            <td>Eve</td>
            <td>Jackson</td>
            <td>94</td>
        </tr>
    </tbody>
 </table>"""
 table_missing_text = """<table>
    <thead>
        <tr>
            <th></th>
            <th>Lastname</th>
            <th>Age</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Jill</td>
            <td></td>
            <td>50</td>
        </tr>
        <tr>
            <td>Eve</td>
            <td>Jackson</td>
            <td>94</td>
        </tr>
    </tbody>
 </table>"""
 table_missing_head = """<table>
    <tr>
        <td>Firstname</td>
        <td>Lastname</td>
        <td>Age</td>
    </tr>
    <tr>
        <td>Jill</td>
        <td>Smith</td>
        <td>50</td>
    </tr>
    <tr>
        <td>Eve</td>
        <td>Jackson</td>
        <td>94</td>
    </tr>
 </table>"""
 def test_table():
    assert md(table) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
    assert md(table_with_html_content) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| **Jill** | *Smith* | [50](#) |\n| Eve | Jackson | 94 |\n\n'
    assert md(table_with_paragraphs) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
    assert md(table_with_header_column) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
    assert md(table_head_body) == '\n\n| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
    assert md(table_missing_text) == '\n\n|  | Lastname | Age |\n| --- | --- | --- |\n| Jill |  | 50 |\n| Eve | Jackson | 94 |\n\n'
    assert md(table_missing_head) == '\n\n|  |  |  |\n| --- | --- | --- |\n| Firstname | Lastname | Age |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |\n\n'
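The expected strings in `test_table` follow one layout: a header row, a `| --- |` separator with one cell per column, then the body rows. When the first row has no `<th>` cells, an empty header is emitted and the first row becomes a body row. The sketch below reproduces that layout from plain lists of strings. It is not markdownify's implementation, `render_table` and `first_row_is_header` are hypothetical names, and the surrounding blank lines from the tests are left out.

```python
def render_table(rows, first_row_is_header=True):
    """Render a list of rows (lists of cell strings) as a Markdown pipe table."""
    def line(cells):
        return '| ' + ' | '.join(cells) + ' |'

    width = len(rows[0])
    if first_row_is_header:
        header, body = rows[0], rows[1:]
    else:
        # No header cells: emit an empty header row so the table stays valid.
        header, body = [''] * width, rows
    out = [line(header), line(['---'] * width)]
    out.extend(line(row) for row in body)
    return '\n'.join(out) + '\n'


rows = [
    ['Firstname', 'Lastname', 'Age'],
    ['Jill', 'Smith', '50'],
    ['Eve', 'Jackson', '94'],
]
```

Empty cells need no special handling: joining an empty string with ` | ` yields the doubled space seen in the `table_missing_text` expectation.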
Author	SHA1	Message	Date
AlexVonB	61e8940486	added readme for callback	2022-04-13 20:42:38 +02:00
AlexVonB	35479d2d3b	Merge branch 'code_language_callback' of https://github.com/tdgroot/python-markdownify into tdgroot-code_language_callback	2022-04-13 20:25:37 +02:00
AlexVonB	b589863715	add escaping of asterisks and option to disable it closes #62	2022-04-13 20:04:12 +02:00
AlexVonB	423b7e948c	add option to allow inline images in selected tags fixes #61	2022-04-13 19:55:34 +02:00
Timon de Groot	0ea95de4d0	Add code language callback	2022-04-09 13:22:28 +02:00
AlexVonB	ed3eee78d2	fixed readme	2022-01-24 18:18:19 +01:00
AlexVonB	ddda696396	bump to v0.10.3	2022-01-23 11:01:26 +01:00
AlexVonB	0a1343a538	allow BeautifulSoup objects to be converted	2022-01-23 11:00:19 +01:00
AlexVonB	9d0b839b73	wording	2022-01-23 10:59:24 +01:00
AlexVonB	d3eff11617	bump to v0.10.2	2022-01-18 08:53:33 +01:00
AlexVonB	bd6b581122	add option to not escape underscores closes #59	2022-01-18 08:51:44 +01:00
AlexVonB	c8f7cf63e3	bump to v0.10.1	2021-12-11 14:44:34 +01:00
AlexVonB	12a68a7d14	allow flake8 v4.x closes #57	2021-12-11 14:43:14 +01:00
AlexVonB	478b1c7e13	bump to v0.10.0	2021-11-17 17:10:15 +01:00
AlexVonB	ffcf6cbcb2	fix readme for code_language	2021-11-17 17:09:47 +01:00
AlexVonB	0ab0452414	add readme for code_language	2021-11-17 17:08:14 +01:00
AlexVonB	b62b067cbd	Merge branch 'Inzaniak-develop' into develop	2021-11-17 17:05:07 +01:00
AlexVonB	cb2646cd93	differentiated between text and code language	2021-11-17 17:03:31 +01:00
AlexVonB	9692b5e714	satisfy linter	2021-11-17 16:55:00 +01:00
Umberto Grando	ac68c53a7d	added language for multiline code	2021-11-01 21:19:35 +01:00
AlexVonB	40dd30419c	bump to v0.9.4	2021-09-04 21:50:05 +02:00
AlexVonB	da56f7f56a	Merge pull request #53 from Hozhyi/fix/bullet_list_tags_in_separate_lines Fixed issue #52 - added stripping of text to list	2021-09-04 21:48:16 +02:00
AlexVonB	8400b39dd9	remove trailing whitespace to satisfy the linter	2021-09-04 21:47:27 +02:00
Viktor Hozhyi	5fc1441fe7	Added appropriate test	2021-09-04 20:51:08 +03:00
Viktor Hozhyi	044615eff1	Fixed issue #52 - added stripping of text to list	2021-09-04 12:39:30 +03:00
AlexVonB	dbd9f3f3d2	bump to v0.9.3	2021-08-25 08:53:17 +02:00
AlexVonB	0fdeb1ff6e	convert tags inside table cells as inline in part resolves #49	2021-08-25 08:48:30 +02:00
AlexVonB	6a2f3a4b42	fix rst syntax error	2021-07-11 13:21:02 +02:00
AlexVonB	22180a166d	bump to v0.9.1	2021-07-11 13:13:31 +02:00
AlexVonB	16d8a0e1f7	Revert "add figure/figcaption" This reverts commit `828e116530`.	2021-07-11 13:12:16 +02:00
AlexVonB	4aa6cf2a24	rewrote text processing to not escape _ in code fixes #47	2021-07-11 13:10:59 +02:00
AlexVonB	828e116530	add figure/figcaption for #46	2021-06-30 13:02:42 +02:00
AlexVonB	62e9f0de02	add examples for custom converters closes #46	2021-06-27 15:53:23 +02:00
AlexVonB	cec570fc49	bump to v0.9.0	2021-05-30 19:10:31 +02:00
AlexVonB	a6a31624ad	add options for sub and sup tags fixes #44	2021-05-30 19:07:43 +02:00
AlexVonB	6f3732307d	restructured test files	2021-05-30 19:06:52 +02:00
AlexVonB	8f6d7e500d	add option 'default_title' to links fixes #39	2021-05-30 18:40:40 +02:00
AlexVonB	e96351b666	bump to v0.8.1	2021-05-30 11:20:16 +02:00
AlexVonB	129c4ef060	ignore doctype tag, test cdata tag fixes #45	2021-05-30 11:18:18 +02:00
AlexVonB	9cb940cbc0	bump to v0.8.0	2021-05-21 14:17:51 +02:00
AlexVonB	70ef9b6e48	added pre tag closes #15	2021-05-21 14:15:41 +02:00
AlexVonB	91d53ddd5a	refactor simple inline conversions	2021-05-21 13:53:00 +02:00
AlexVonB	079f32f6cd	added del and s tags	2021-05-21 12:27:49 +02:00
AlexVonB	89b577e91e	ordering functions alphabetically	2021-05-21 12:21:21 +02:00
AlexVonB	4bf2ea44fc	Merge branch 'AndrewCRichards-andrewcrichards/add_code_samp_kbd_tags' into develop	2021-05-21 12:13:48 +02:00
AlexVonB	77797ebb79	Merge branch 'andrewcrichards/add_code_samp_kbd_tags' of https://github.com/AndrewCRichards/python-markdownify into AndrewCRichards-andrewcrichards/add_code_samp_kbd_tags	2021-05-21 12:11:59 +02:00
AlexVonB	9f3c4c9fa0	bump to v0.7.4	2021-05-18 10:42:16 +02:00
AlexVonB	967db26b3a	Merge branch 'fix-headless-tables' into develop	2021-05-18 10:41:42 +02:00
AlexVonB	ea81407b87	implemented table parsing correctly instead of manually walking down the dom tree in a table, we now rely on the main descent loop and just implement conversion for rows and cells correctly. this enables the use of html inside a table cell.	2021-05-17 14:00:00 +02:00
AlexVonB	e6da15c173	allow tables with headers in first (or any) column	2021-05-17 12:36:48 +02:00
AlexVonB	7dac92e85e	Allow for tables without header row fixes #42	2021-05-16 19:02:04 +02:00
AlexVonB	fc29483899	bump to v0.7.3	2021-05-16 18:41:08 +02:00
AlexVonB	bd7a8d6990	Merge pull request #43 from jiulongw/develop Fix missing whitespaces in <li> node	2021-05-16 18:39:58 +02:00
Jiulong Wang	ddfbf6a364	Keep important spaces in <li> element	2021-05-10 16:07:54 -07:00
Jiulong Wang	91a64e3cd4	Fix missing whitespaces in <li> node	2021-05-10 14:42:05 -07:00
Andrew Richards	7685738344	Formatting tweak Change indent of continuation line; squashes a flake8 warning.	2020-11-27 14:18:08 +00:00
Andrew Richards	92a73c8dfe	Correct test_code_with_tricky_content() Result of previous test didn't check for the trailing ' ' that convert_br() adds: This is needed to ensure that the resulting markdown not only has \n for the <br> but also renders it as a newline.	2020-11-26 22:20:29 +00:00
Andrew Richards	3354f143d8	Add method for <code> tag Add method and tests for inline tag <code>.	2020-11-23 17:28:23 +00:00
@@ -1,2 +1,2 @@
 [flake8]
-ignore = E501
+ignore = E501 W503