Compare commits

37 Commits
0.2.0 ... 0.5.0

Author SHA1 Message Date
AlexVonB
0c4b856b9c Bump to 0.5.0 2020-08-09 21:22:15 +02:00
AlexVonB
e9cc01938a Merge branch 'develop' 2020-08-09 21:20:44 +02:00
AlexVonB
aceced68eb cleaning up changes with help of linter 2020-08-09 21:17:39 +02:00
AlexVonB
3b049cdb9c added egg dirs to gitignore 2020-08-09 21:13:33 +02:00
AlexVonB
b747378b52 fixed nested lists and wrote correct tests
nested lists did not work: after a nested list was over,
a new line was inserted. this leads to a large gap before
the rest of the parent list.

lists are prefixed and suffixed with a single newline,
this is now represented in the tests.
2020-08-09 21:11:16 +02:00
AlexVonB
ee73d89879 Merge pull request #14 from AlexVonB/fix-inline-spaces
remove prefixed and suffixed spaces from inline tags
2020-08-09 20:24:23 +02:00
AlexVonB
5563161c86 remove needless checks for empty text 2019-07-12 10:23:17 +02:00
AlexVonB
28e447d9ae remove prefixed and suffixed spaces from inline tags
fixes matthewwithanm#13
2019-07-11 23:27:52 +02:00
Matthew Dapena-Tretter
89d14f4487 Merge pull request #11 from AlexVonB/AlexVonB-patch-1
Add newline before and after a markdown list
2019-07-04 08:53:25 -07:00
AlexVonB
5f9243d91d added tests for matthewwithanm#11 2019-07-04 16:32:21 +02:00
AlexVonB
d0f688d2e4 Add newline before and after a markdown list
Fixes matthewwithanm#5 as well as an issue where `<p>foo<p><ul><li>bar</li></ul>` gets converted to `foo * bar` which is not correct
2019-07-04 16:26:09 +02:00
Jonathan Vanasco
5ac08522be updating classifier to mit license
issue #9
2019-06-19 16:17:47 -07:00
Thomas Lange
78afcc173e Adding MIT license file 2018-10-16 19:11:02 -07:00
Steven Skoczen
b132a6f5b3 Updates to 0.4.1, pkgmeta included directly in setup. 2017-11-28 12:07:31 +13:00
Steven Skoczen
0abe0a29e8 Merge pull request #2 from crhallberg/html-parser
Suppress BeautifulSoup warning
2017-11-13 08:48:45 +13:00
Steven Skoczen
4932df631f Merge pull request #1 from dmpayton/develop
Fixes to get tests passing in Python 3.
2017-11-13 08:48:38 +13:00
Chris Hallberg
8696e2bde1 Suppress BeautifulSoup warning
by explicitly passing in the default parser as recommended by the error message:

```
/home/challberg/.local/lib/python2.7/site-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 35 of the file unroll.py. To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "html.parser")

  markup_type=markup_type))
```
2017-06-12 16:03:04 -04:00
dmpayton
ee53d85c41 Fixes to get tests passing in Python 3. 2016-02-23 15:15:29 -08:00
Matthew Tretter
53ba0daa77 Document options 2013-07-31 23:23:44 -04:00
Matthew Tretter
fb98e9878f Bump to 0.4.0 2013-07-31 23:12:53 -04:00
Matthew Tretter
aa10053fbb Test custom bullets 2013-07-31 23:11:39 -04:00
Matthew Tretter
253a34c2d7 Test nested unordered lists 2013-07-31 23:08:39 -04:00
Matthew Tretter
3ea09609e6 Add support for "bullets" option 2013-07-31 23:08:36 -04:00
Matthew Tretter
1cd8e56c47 Test ATX and ATX_CLOSED style headings 2013-07-31 22:19:41 -04:00
Matthew Tretter
891a4a8d08 Add "heading_style" option
Allow the user to specify a heading style.
2013-07-31 22:17:22 -04:00
Matthew Tretter
e5a1784f30 Remove unneeded raw string 2013-07-31 21:59:35 -04:00
Matthew Tretter
f60d910335 Add "autolinks" option
This option allows you to disable the creation of "autolink" style
links.
2013-07-31 21:58:48 -04:00
Matthew Tretter
d707d107f6 Support inner Options class 2013-07-31 21:55:30 -04:00
Matthew Tretter
1ef4dd1468 Add shortcut link syntax 2013-07-31 19:23:39 -04:00
Matthew Tretter
934c97b342 Test img tag conversion 2013-07-31 19:23:38 -04:00
Matthew Tretter
8a1e2d9403 Add simple img conversion 2013-07-31 19:23:36 -04:00
Matthew Tretter
5563723cbc Bump to 0.3.0 2013-07-31 18:16:02 -04:00
Matthew Tretter
a9c13a56da Identify and single out HTML fragment 2013-07-31 18:13:50 -04:00
Matthew Tretter
7bdeb15b18 Use bs4
This causes a lot more tests to fail. But it'll be worth it in the end.
2013-07-31 18:01:52 -04:00
Matthew Tretter
87c8f3bd5e Add development notes to README 2013-07-31 17:20:36 -04:00
Matthew Tretter
0211ac6619 Lint code 2013-07-31 17:20:36 -04:00
Matthew Tretter
2515e9e107 Add lint command 2013-07-31 17:20:32 -04:00
8 changed files with 339 additions and 58 deletions

2
.gitignore vendored

@@ -1,5 +1,7 @@
*.pyc
*.egg
.eggs/
*.egg-info/
.DS_Store
/.env
/dist

21
LICENSE Normal file

@@ -0,0 +1,21 @@
The MIT License (MIT)
Copyright 2012-2018 Matthew Tretter
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

@@ -27,3 +27,47 @@ Specify tags to exclude (blacklist):
from markdownify import markdownify as md
md('<b>Yay</b> <a href="http://github.com">GitHub</a>', convert=['b']) # > '**Yay** GitHub'
Options
=======
Markdownify supports the following options:
strip
A list of tags to strip (blacklist). This option can't be used with the
``convert`` option.
convert
A list of tags to convert (whitelist). This option can't be used with the
``strip`` option.
autolinks
A boolean indicating whether the "automatic link" style should be used when
an ``a`` tag's contents match its href. Defaults to ``True``.
heading_style
Defines how headings should be converted. Accepted values are ``ATX``,
``ATX_CLOSED``, ``SETEXT``, and ``UNDERLINED`` (which is an alias for
``SETEXT``). Defaults to ``UNDERLINED``.
bullets
An iterable (string, list, or tuple) of bullet styles to be used. If the
iterable only contains one item, it will be used regardless of how deeply
lists are nested. Otherwise, the bullet will alternate based on nesting
level. Defaults to ``'*+-'``.
Options may be specified as kwargs to the ``markdownify`` function, or as a
nested ``Options`` class in ``MarkdownConverter`` subclasses.
Development
===========
To run tests:
``python setup.py test``
To lint:
``python setup.py lint``
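
The option precedence described in the README text above (built-in defaults, then a nested ``Options`` class, then keyword arguments) can be sketched without the library itself. This is a minimal illustration only: `Converter` and `AtxConverter` are hypothetical stand-ins for `MarkdownConverter` and a user subclass, and nothing but the option merging is modeled.

```python
def _todict(obj):
    # Collect the public (non-underscore) attributes of a class into a dict.
    return dict((k, getattr(obj, k)) for k in dir(obj) if not k.startswith('_'))


class Converter(object):  # hypothetical stand-in for MarkdownConverter
    class DefaultOptions:
        strip = None
        convert = None
        autolinks = True
        heading_style = 'underlined'
        bullets = '*+-'

    class Options(DefaultOptions):
        pass

    def __init__(self, **options):
        # Later updates win: defaults < nested Options class < kwargs.
        self.options = _todict(self.DefaultOptions)
        self.options.update(_todict(self.Options))
        self.options.update(options)
        if self.options['strip'] is not None and self.options['convert'] is not None:
            raise ValueError('You may specify either tags to strip or tags to'
                             ' convert, but not both.')


class AtxConverter(Converter):  # hypothetical subclass
    class Options:
        # Does not need to extend DefaultOptions; missing keys fall back.
        heading_style = 'atx'
        bullets = '-'


print(Converter().options['heading_style'])        # underlined
print(AtxConverter().options['heading_style'])     # atx
print(AtxConverter(bullets='*').options['bullets'])  # *
```

Because the defaults are always merged first, a subclass's ``Options`` class only has to name the values it changes.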

@@ -1,10 +1,20 @@
from lxml.html.soupparser import fromstring
from bs4 import BeautifulSoup, NavigableString
import re
import six
convert_heading_re = re.compile(r'convert_h(\d+)')
line_beginning_re = re.compile(r'^', re.MULTILINE)
whitespace_re = re.compile(r'[\r\n\s\t ]+')
FRAGMENT_ID = '__MARKDOWNIFY_WRAPPER__'
wrapped = '<div id="%s">%%s</div>' % FRAGMENT_ID
# Heading styles
ATX = 'atx'
ATX_CLOSED = 'atx_closed'
UNDERLINED = 'underlined'
SETEXT = UNDERLINED
def escape(text):
@@ -13,30 +23,66 @@ def escape(text):
return text.replace('_', r'\_')
def chomp(text):
"""
If the text in an inline tag like b, a, or em contains a leading or trailing
space, strip the string and return a space as suffix or prefix, if needed.
This function is used to prevent conversions like
<b> foo</b> => ** foo**
"""
prefix = ' ' if text and text[0] == ' ' else ''
suffix = ' ' if text and text[-1] == ' ' else ''
text = text.strip()
return (prefix, suffix, text)
def _todict(obj):
return dict((k, getattr(obj, k)) for k in dir(obj) if not k.startswith('_'))
class MarkdownConverter(object):
def __init__(self, tags_to_strip=None, tags_to_convert=None):
if tags_to_strip is not None and tags_to_convert is not None:
class DefaultOptions:
strip = None
convert = None
autolinks = True
heading_style = UNDERLINED
bullets = '*+-' # An iterable of bullet types.
class Options(DefaultOptions):
pass
def __init__(self, **options):
# Create an options dictionary. Use DefaultOptions as a base so that
# it doesn't have to be extended.
self.options = _todict(self.DefaultOptions)
self.options.update(_todict(self.Options))
self.options.update(options)
if self.options['strip'] is not None and self.options['convert'] is not None:
raise ValueError('You may specify either tags to strip or tags to'
' convert, but not both.')
self.tags_to_strip = tags_to_strip
self.tags_to_convert = tags_to_convert
' convert, but not both.')
def convert(self, html):
soup = fromstring(html)
return self.process_tag(soup)
# We want to take advantage of the html5 parsing, but we don't actually
# want a full document. Therefore, we'll mark our fragment with an id,
# create the document, and extract the element with the id.
html = wrapped % html
soup = BeautifulSoup(html, 'html.parser')
return self.process_tag(soup.find(id=FRAGMENT_ID), children_only=True)
def process_tag(self, node):
text = self.process_text(node.text)
def process_tag(self, node, children_only=False):
text = ''
# Convert the children first
for el in node.findall('*'):
text += self.process_tag(el)
for el in node.children:
if isinstance(el, NavigableString):
text += self.process_text(six.text_type(el))
else:
text += self.process_tag(el)
convert_fn = getattr(self, 'convert_%s' % node.tag, None)
if convert_fn and self.should_convert_tag(node.tag):
text = convert_fn(node, text)
text += self.process_text(node.tail)
if not children_only:
convert_fn = getattr(self, 'convert_%s' % node.name, None)
if convert_fn and self.should_convert_tag(node.name):
text = convert_fn(node, text)
return text
@@ -44,7 +90,7 @@ class MarkdownConverter(object):
return escape(whitespace_re.sub(' ', text or ''))
def __getattr__(self, attr):
# Handle heading levels > 2
# Handle headings
m = convert_heading_re.match(attr)
if m:
n = int(m.group(1))
@@ -60,22 +106,33 @@ class MarkdownConverter(object):
def should_convert_tag(self, tag):
tag = tag.lower()
if self.tags_to_strip is not None:
return tag not in self.tags_to_strip
elif self.tags_to_convert is not None:
return tag in self.tags_to_convert
strip = self.options['strip']
convert = self.options['convert']
if strip is not None:
return tag not in strip
elif convert is not None:
return tag in convert
else:
return True
def indent(self, text, level):
return line_beginning_re.sub('\t' * level, text) if text else ''
def underline(self, text, pad_char):
text = (text or '').rstrip()
return '%s\n%s\n\n' % (text, pad_char * len(text)) if text else ''
def convert_a(self, el, text):
prefix, suffix, text = chomp(text)
if not text:
return ''
href = el.get('href')
title = el.get('title')
if self.options['autolinks'] and text == href and not title:
# Shortcut syntax
return '<%s>' % href
title_part = ' "%s"' % title.replace('"', r'\"') if title else ''
return '[%s](%s%s)' % (text or '', href, title_part) if href else text or ''
return '%s[%s](%s%s)%s' % (prefix, text, href, title_part, suffix) if href else text
def convert_b(self, el, text):
return self.convert_strong(el, text)
@@ -87,35 +144,70 @@ class MarkdownConverter(object):
return ' \n'
def convert_em(self, el, text):
return '*%s*' % text if text else ''
def convert_h1(self, el, text):
return self.underline(text, '=')
def convert_h2(self, el, text):
return self.underline(text, '-')
prefix, suffix, text = chomp(text)
if not text:
return ''
return '%s*%s*%s' % (prefix, text, suffix)
def convert_hn(self, n, el, text):
return '%s %s\n\n' % ('#' * n, text.rstrip()) if text else ''
style = self.options['heading_style']
text = text.rstrip()
if style == UNDERLINED and n <= 2:
line = '=' if n == 1 else '-'
return self.underline(text, line)
hashes = '#' * n
if style == ATX_CLOSED:
return '%s %s %s\n\n' % (hashes, text, hashes)
return '%s %s\n\n' % (hashes, text)
def convert_i(self, el, text):
return self.convert_em(el, text)
def convert_list(self, el, text):
nested = False
while el:
if el.name == 'li':
nested = True
break
el = el.parent
if nested:
# remove trailing newline if nested
return '\n' + self.indent(text, 1).rstrip()
return '\n' + text + '\n'
convert_ul = convert_list
convert_ol = convert_list
def convert_li(self, el, text):
parent = el.getparent()
if parent is not None and parent.tag == 'ol':
parent = el.parent
if parent is not None and parent.name == 'ol':
bullet = '%s.' % (parent.index(el) + 1)
else:
bullet = '*'
depth = -1
while el:
if el.name == 'ul':
depth += 1
el = el.parent
bullets = self.options['bullets']
bullet = bullets[depth % len(bullets)]
return '%s %s\n' % (bullet, text or '')
def convert_p(self, el, text):
return '%s\n\n' % text if text else ''
def convert_strong(self, el, text):
return '**%s**' % text if text else ''
prefix, suffix, text = chomp(text)
if not text:
return ''
return '%s**%s**%s' % (prefix, text, suffix)
def convert_img(self, el, text):
alt = el.attrs.get('alt', None) or ''
src = el.attrs.get('src', None) or ''
title = el.attrs.get('title', None) or ''
title_part = ' "%s"' % title.replace('"', r'\"') if title else ''
return '![%s](%s%s)' % (alt, src, title_part)
def markdownify(html, strip=None, convert=None):
converter = MarkdownConverter(strip, convert)
return converter.convert(html)
def markdownify(html, **options):
return MarkdownConverter(**options).convert(html)
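
The effect of `chomp` on inline tags can be tried in isolation. The sketch below copies `chomp` from the diff above and pairs it with `wrap_inline`, a hypothetical helper (not part of the library) that mirrors what `convert_strong` and `convert_em` do with the returned prefix and suffix.

```python
def chomp(text):
    # Remember whether the text started or ended with a space, then strip it.
    prefix = ' ' if text and text[0] == ' ' else ''
    suffix = ' ' if text and text[-1] == ' ' else ''
    return prefix, suffix, text.strip()


def wrap_inline(text, marker):
    # Hypothetical helper: put the emphasis marker directly against the text
    # and move any leading/trailing space outside the marker.
    prefix, suffix, text = chomp(text)
    if not text:
        return ''
    return '%s%s%s%s%s' % (prefix, marker, text, marker, suffix)


print(repr(wrap_inline(' foo', '**')))   # ' **foo**'
print(repr(wrap_inline('s ', '**')))     # '**s** '
print(repr(wrap_inline('Hello', '*')))   # '*Hello*'
print(repr(wrap_inline('   ', '**')))    # ''
```

Moving the space outside the markers matters because `** foo**` is generally not rendered as bold by Markdown parsers, while ` **foo**` is.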

@@ -1,8 +0,0 @@
pkgmeta = dict(
__title__='markdownify',
__author__='Matthew Tretter',
__version__='0.2.0',
)
globals().update(pkgmeta)
__all__ = pkgmeta.keys()

2
setup.cfg Normal file

@@ -0,0 +1,2 @@
[flake8]
ignore = E501

@@ -2,15 +2,16 @@
import codecs
import os
from setuptools import setup, find_packages
from setuptools.command.test import test as TestCommand
from setuptools.command.test import test as TestCommand, Command
read = lambda filepath: codecs.open(filepath, 'r', 'utf-8').read()
pkgmeta = {}
execfile(os.path.join(os.path.dirname(__file__), 'markdownify', 'pkgmeta.py'),
pkgmeta)
pkgmeta = {
'__title__': 'markdownify',
'__author__': 'Matthew Tretter',
'__version__': '0.5.0',
}
class PyTest(TestCommand):
@@ -25,6 +26,37 @@ class PyTest(TestCommand):
raise SystemExit(errno)
class LintCommand(Command):
"""
A copy of flake8's Flake8Command
"""
description = "Run flake8 on modules registered in setuptools"
user_options = []
def initialize_options(self):
pass
def finalize_options(self):
pass
def distribution_files(self):
if self.distribution.packages:
for package in self.distribution.packages:
yield package.replace(".", os.path.sep)
if self.distribution.py_modules:
for filename in self.distribution.py_modules:
yield "%s.py" % filename
def run(self):
from flake8.engine import get_style_guide
flake8_style = get_style_guide(config_file='setup.cfg')
paths = self.distribution_files()
report = flake8_style.check_files(paths)
raise SystemExit(report.total_errors > 0)
setup(
name='markdownify',
description='Convert HTML to markdown.',
@@ -37,26 +69,28 @@ setup(
packages=find_packages(),
zip_safe=False,
include_package_data=True,
setup_requires=[
'flake8',
],
tests_require=[
'pytest',
],
install_requires=[
'lxml',
'BeautifulSoup',
'beautifulsoup4', 'six'
],
classifiers=[
'Environment :: Web Environment',
'Framework :: Django',
'Intended Audience :: Developers',
'License :: OSI Approved :: BSD License',
'License :: OSI Approved :: MIT License',
'Operating System :: OS Independent',
'Programming Language :: Python :: 2.5',
'Programming Language :: Python :: 2.6',
'Programming Language :: Python :: 2.7',
'Topic :: Utilities'
],
setup_requires=[],
cmdclass={
'test': PyTest,
'lint': LintCommand,
},
)

@@ -1,19 +1,75 @@
from markdownify import markdownify as md
from markdownify import markdownify as md, ATX, ATX_CLOSED
import re
nested_uls = re.sub(r'\s+', '', """
<ul>
<li>1
<ul>
<li>a
<ul>
<li>I</li>
<li>II</li>
<li>III</li>
</ul>
</li>
<li>b</li>
<li>c</li>
</ul>
</li>
<li>2</li>
<li>3</li>
</ul>""")
def test_chomp():
assert md(' <b></b> ') == ' '
assert md(' <b> </b> ') == ' '
assert md(' <b> </b> ') == ' '
assert md(' <b> </b> ') == ' '
assert md(' <b>s </b> ') == ' **s** '
assert md(' <b> s</b> ') == ' **s** '
assert md(' <b> s </b> ') == ' **s** '
assert md(' <b> s </b> ') == ' **s** '
def test_a():
assert md('<a href="http://google.com">Google</a>') == '[Google](http://google.com)'
def test_a_spaces():
assert md('foo <a href="http://google.com">Google</a> bar') == 'foo [Google](http://google.com) bar'
assert md('foo<a href="http://google.com"> Google</a> bar') == 'foo [Google](http://google.com) bar'
assert md('foo <a href="http://google.com">Google </a>bar') == 'foo [Google](http://google.com) bar'
assert md('foo <a href="http://google.com"></a> bar') == 'foo bar'
def test_a_with_title():
text = md('<a href="http://google.com" title="The &quot;Goog&quot;">Google</a>')
assert text == r'[Google](http://google.com "The \"Goog\"")'
def test_a_shortcut():
text = md('<a href="http://google.com">http://google.com</a>')
assert text == '<http://google.com>'
def test_a_no_autolinks():
text = md('<a href="http://google.com">http://google.com</a>', autolinks=False)
assert text == '[http://google.com](http://google.com)'
def test_b():
assert md('<b>Hello</b>') == '**Hello**'
def test_b_spaces():
assert md('foo <b>Hello</b> bar') == 'foo **Hello** bar'
assert md('foo<b> Hello</b> bar') == 'foo **Hello** bar'
assert md('foo <b>Hello </b>bar') == 'foo **Hello** bar'
assert md('foo <b></b> bar') == 'foo bar'
def test_blockquote():
assert md('<blockquote>Hello</blockquote>').strip() == '> Hello'
@@ -31,6 +87,13 @@ def test_em():
assert md('<em>Hello</em>') == '*Hello*'
def test_em_spaces():
assert md('foo <em>Hello</em> bar') == 'foo *Hello* bar'
assert md('foo<em> Hello</em> bar') == 'foo *Hello* bar'
assert md('foo <em>Hello </em>bar') == 'foo *Hello* bar'
assert md('foo <em></em> bar') == 'foo bar'
def test_h1():
assert md('<h1>Hello</h1>') == 'Hello\n=====\n\n'
@@ -44,12 +107,22 @@ def test_hn():
assert md('<h6>Hello</h6>') == '###### Hello\n\n'
def test_atx_headings():
assert md('<h1>Hello</h1>', heading_style=ATX) == '# Hello\n\n'
assert md('<h2>Hello</h2>', heading_style=ATX) == '## Hello\n\n'
def test_atx_closed_headings():
assert md('<h1>Hello</h1>', heading_style=ATX_CLOSED) == '# Hello #\n\n'
assert md('<h2>Hello</h2>', heading_style=ATX_CLOSED) == '## Hello ##\n\n'
def test_i():
assert md('<i>Hello</i>') == '*Hello*'
def test_ol():
assert md('<ol><li>a</li><li>b</li></ol>') == '1. a\n2. b\n'
assert md('<ol><li>a</li><li>b</li></ol>') == '\n1. a\n2. b\n\n'
def test_p():
@@ -61,4 +134,25 @@ def test_strong():
def test_ul():
assert md('<ul><li>a</li><li>b</li></ul>') == '* a\n* b\n'
assert md('<ul><li>a</li><li>b</li></ul>') == '\n* a\n* b\n\n'
def test_inline_ul():
assert md('<p>foo</p><ul><li>a</li><li>b</li></ul><p>bar</p>') == 'foo\n\n\n* a\n* b\n\nbar\n\n'
def test_nested_uls():
"""
Nested ULs should alternate bullet characters.
"""
assert md(nested_uls) == '\n* 1\n\t+ a\n\t\t- I\n\t\t- II\n\t\t- III\n\t+ b\n\t+ c\n* 2\n* 3\n\n'
def test_bullets():
assert md(nested_uls, bullets='-') == '\n- 1\n\t- a\n\t\t- I\n\t\t- II\n\t\t- III\n\t- b\n\t- c\n- 2\n- 3\n\n'
def test_img():
assert md('<img src="/path/to/img.jpg" alt="Alt text" title="Optional title" />') == '![Alt text](/path/to/img.jpg "Optional title")'
assert md('<img src="/path/to/img.jpg" alt="Alt text" />') == '![Alt text](/path/to/img.jpg)'
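
The whitespace and bullet rules that the nested-list tests check can be reproduced without an HTML parser. The following is a rough sketch under stated assumptions: `render_list` is a hypothetical function that takes a list of `(label, children)` tuples instead of parsed HTML, and it only imitates the `convert_list`, `convert_li`, and `indent` logic shown in the diff (bullet chosen by depth, nested lists indented by one tab with the trailing newline removed, top-level lists wrapped in single newlines).

```python
import re

line_beginning_re = re.compile(r'^', re.MULTILINE)


def indent(text, level):
    # Prefix every line start with `level` tabs.
    return line_beginning_re.sub('\t' * level, text) if text else ''


def render_list(items, bullets='*+-', depth=0):
    # Hypothetical model: items is a list of (label, children) tuples.
    bullet = bullets[depth % len(bullets)]
    text = ''
    for label, children in items:
        body = label
        if children:
            body += render_list(children, bullets, depth + 1)
        text += '%s %s\n' % (bullet, body)
    if depth > 0:
        # Nested list: indent and drop the trailing newline so no gap
        # appears before the rest of the parent list.
        return '\n' + indent(text, 1).rstrip()
    # Top-level list: one newline before and one after.
    return '\n' + text + '\n'


tree = [
    ('1', [('a', [('I', []), ('II', []), ('III', [])]), ('b', []), ('c', [])]),
    ('2', []),
    ('3', []),
]
print(render_list(tree))
print(render_list(tree, bullets='-'))
```

With the default bullets this produces the same string that `test_nested_uls` expects, and with `bullets='-'` the same string as `test_bullets`.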