Compare commits

126 Commits
0.1.0 ... 0.7.1

Author SHA1 Message Date
AlexVonB
21c0d034d0 Merge branch 'develop' 2021-05-02 10:51:00 +02:00
AlexVonB
f59f9f9a54 bump to v0.7.1 2021-05-02 10:50:49 +02:00
AlexVonB
bd22a16c9e Merge pull request #40 from jiulongw/jiulongw/hr
Add conversion for hr element
2021-05-02 10:47:32 +02:00
AlexVonB
55fb96e3c0 fix hr tests 2021-05-02 10:45:52 +02:00
Jiulong Wang
5f102d5223 Add conversion for hr element 2021-04-29 13:41:28 -07:00
AlexVonB
e3ddc789a2 Merge branch 'develop' 2021-04-22 12:43:27 +02:00
AlexVonB
651d5f00e8 bump to v0.7.0 2021-04-22 12:43:17 +02:00
AlexVonB
3cf324d03d Merge pull request #36 from BrunoMiguens/add-basic-support-for-tables
Add basic support for tables
2021-04-22 12:41:54 +02:00
AlexVonB
96f7e7d307 Merge branch 'develop' into add-basic-support-for-tables 2021-04-22 12:40:16 +02:00
AlexVonB
e1dbbfad42 guard table lines with pipes, resolves the empty header problem 2021-04-22 12:36:11 +02:00
AlexVonB
2d0cd97323 Merge branch 'develop' 2021-04-22 12:13:03 +02:00
AlexVonB
d4882b86b9 bump to v0.6.6 2021-04-22 12:12:51 +02:00
AlexVonB
b47d5f11c8 Merge pull request #37 from andredelft/develop
Add `strong_em_symbol` and `newline` options to the converter
2021-04-18 21:35:16 +02:00
André van Delft
29c794e17d Introduce OPTIONs for strong_em_symbol 2021-04-18 18:13:29 +02:00
André van Delft
e877602a5e Separate the strong_em_symbol and newline style tests 2021-04-05 11:28:42 +02:00
André van Delft
5580b0b51d Update README.rst 2021-04-05 11:13:52 +02:00
André van Delft
650f377b64 Fix linting 2021-04-05 11:13:19 +02:00
André van Delft
7ee87b1d32 Use .lower() on _style option fetching 2021-04-05 10:50:23 +02:00
André van Delft
16dbc471b9 Test newline_style 2021-04-05 10:47:55 +02:00
André van Delft
c04ec855dd Change option to newline_style and use variables like heading_style does 2021-04-05 10:44:20 +02:00
André van Delft
8da0bdf998 Test strong_em_symbol 2021-04-05 10:28:46 +02:00
AlexVonB
ec185e2e9c Merge branch 'develop' 2021-02-21 23:09:55 +01:00
AlexVonB
a59e4b9f48 bump to v0.6.5 2021-02-21 23:09:44 +01:00
AlexVonB
fd293a9714 use python 3.8 instead of 3.6 2021-02-21 23:08:49 +01:00
AlexVonB
99365de669 upgrading code for python 3.x
closes #38
2021-02-21 23:06:21 +01:00
AlexVonB
079d1721aa Merge branch 'develop' 2021-02-21 20:58:34 +01:00
AlexVonB
ed406d3206 bump to v0.6.4 2021-02-21 20:57:57 +01:00
AlexVonB
f320cf87ff closing #25 and #18
Adds newlines after blockquotes, allowing for paragraphs after a
blockquote.

Due to merging problems with @lucafrance 's code I had to quickly copy
and paste their code. Thanks for the contribution!
2021-02-21 20:53:44 +01:00
André van Delft
a79ed44ec3 Fix code ticks in README 2021-02-15 16:51:20 +01:00
André van Delft
29a4e551f7 Update README with the two new options 2021-02-15 16:37:13 +01:00
André van Delft
b3ac4606a6 Allow for the use of backslash for newlines 2021-02-15 16:29:14 +01:00
André van Delft
f093843f40 Allow for a custom strong or emphasis symbol 2021-02-15 16:19:19 +01:00
Bruno Miguens
de6f91af0e Revert header validation and leave possibility to empty column 2021-02-08 20:56:18 +00:00
Bruno Miguens
8c28ade348 Remove empty header validation to allow empty header 2021-02-08 20:50:15 +00:00
Bruno Miguens
a152c5b706 Fix lint 2021-02-08 19:32:35 +00:00
Bruno Miguens
292d64bbf4 Remove unnecessary tests 2021-02-08 19:26:27 +00:00
Bruno Miguens
db96eeb785 Add tests for basic and thead/tbody tables 2021-02-08 17:00:09 +00:00
Bruno Miguens
73f7644c0d Add basic support for HTML tables 2021-02-08 17:00:09 +00:00
AlexVonB
a4d134df97 Merge pull request #34 from BrunoMiguens/add-ignore-comment-tags
Add ignore comment tags
2021-02-07 19:46:49 +01:00
Bruno Miguens
457454c713 Add new line at the end of file 2021-02-05 19:49:57 +00:00
Bruno Miguens
321e9eb5f6 Add ignore comment tags 2021-02-05 19:40:43 +00:00
AlexVonB
bf24df3e2e bump to v0.6.3 2021-01-12 22:43:18 +01:00
AlexVonB
15329588b1 Merge branch 'develop' 2021-01-12 22:42:58 +01:00
AlexVonB
77d1e99bd5 satisfy linter 2021-01-12 22:42:06 +01:00
AlexVonB
34ad8485fa bump to v0.6.2 2021-01-12 22:40:03 +01:00
AlexVonB
f0ce934bf8 Merge branch 'develop' 2021-01-12 22:39:47 +01:00
AlexVonB
97c78ef55b Merge branch 'fix-extra-headline-whitespace' into develop 2021-01-12 22:38:59 +01:00
AlexVonB
99cd237f27 Merge branch 'develop' 2021-01-04 10:22:02 +01:00
AlexVonB
b7e1ab889d bump to v0.6.1 2021-01-04 10:21:27 +01:00
AlexVonB
29e86aec55 Merge branch 'fix-link-underscores' into develop 2021-01-04 10:18:05 +01:00
AlexVonB
453b604096 Fixing autolinks
When checking a links href and text for
equality, first un-escape the underscores
in the text -- because six escapes them.
This should fix #29.
2021-01-02 17:22:36 +01:00
AlexVonB
2bde8d3e8e Merge branch 'develop' 2021-01-02 16:49:28 +01:00
AlexVonB
4f8937810b dont replace newlines and tabs with spaces
this should fix #17, as all leading new lines
were replaced with a single space, which in turn
was rendered before the # of a headline
2020-12-29 10:28:50 +01:00
AlexVonB
3544322ed2 Bump Version 0.6.0 2020-12-13 23:41:56 +01:00
AlexVonB
c4d0a14ce5 Merge pull request #26 from idvorkin/develop
Add support for headings that include nested divs
2020-12-13 23:39:34 +01:00
Igor Dvorkin
05ea8dc58a Add many tests and support image tag 2020-12-13 17:40:53 +00:00
Igor Dvorkin
7780f82c30 Using a regexp to determine if a tag is a heading. 2020-12-11 16:54:14 -08:00
Igor Dvorkin
d558617cd7 Add support for headings that include nested block elements 2020-11-20 06:03:51 -08:00
AlexVonB
8c9b029756 Merge branch 'develop' 2020-09-01 18:10:07 +02:00
AlexVonB
25d68b4265 Bump version 0.5.3 2020-09-01 18:09:24 +02:00
AlexVonB
5561106991 Merge pull request #24 from SimonIT/fix-corrupt-html
Fix parsing corrupt html
2020-09-01 18:04:17 +02:00
SimonIT
1b3136ad04 Fix parsing corrupt html 2020-08-31 13:15:10 +02:00
AlexVonB
987a2a9cae Merge pull request #20 from SimonIT/badges
Add some fancy badges
2020-08-19 10:32:30 +02:00
SimonIT
a4461161bc Make badges inline 2020-08-19 10:06:21 +02:00
AlexVonB
ae50065872 Merge branch 'develop' 2020-08-18 18:53:10 +02:00
AlexVonB
19e2c3db0d Bump version 0.5.2 2020-08-18 18:52:53 +02:00
AlexVonB
ba51bbee12 Merge pull request #22 from SimonIT/ol-start-attribute
Support the start attribute for ordered lists
2020-08-18 18:44:59 +02:00
AlexVonB
9f3d497053 use python3.6 for linting 2020-08-18 18:41:46 +02:00
AlexVonB
d2fc689b66 set max flake8 version again3 2020-08-18 18:39:20 +02:00
AlexVonB
ab78385b56 set max flake8 version again2 2020-08-18 18:38:17 +02:00
AlexVonB
9ebf726e78 set max flake8 version again 2020-08-18 18:37:39 +02:00
AlexVonB
3f8403aa7a set max flake8 version 2020-08-18 18:35:31 +02:00
AlexVonB
5b6e76f984 Create python-app.yml 2020-08-18 18:30:55 +02:00
SimonIT
04711027e6 Replace downloads badge 2020-08-13 20:11:18 +02:00
SimonIT
ca98892953 Support the start attribute for ordered lists 2020-08-11 11:43:02 +02:00
AlexVonB
0dc281e6ea Bump version 0.5.1 2020-08-11 09:51:04 +02:00
AlexVonB
4e6e20e756 Merge pull request #21 from matthewwithanm/python-publish
Create python-publish.yml
2020-08-11 09:49:29 +02:00
Matthew Dapena-Tretter
9358522c73 Create python-publish.yml
Add workflow for publishing to PyPI.
2020-08-10 19:42:48 -07:00
SimonIT
28d7a22da3 Remove alt because it makes some trouble 2020-08-10 17:42:18 +02:00
SimonIT
8b882ca3c9 Add some fancy badges 2020-08-10 16:24:00 +02:00
AlexVonB
1078610066 ignore build folder 2020-08-10 13:03:12 +02:00
AlexVonB
d23dbc77e4 Merge branch 'master' into develop 2020-08-10 13:01:34 +02:00
AlexVonB
0c4b856b9c Bump to 0.5.0 2020-08-09 21:22:15 +02:00
AlexVonB
e9cc01938a Merge branch 'develop' 2020-08-09 21:20:44 +02:00
AlexVonB
aceced68eb cleaning up changes with help of linter 2020-08-09 21:17:39 +02:00
AlexVonB
3b049cdb9c added egg dirs to gitignore 2020-08-09 21:13:33 +02:00
AlexVonB
b747378b52 fixed nested lists and wrote correct tests
nested lists did not work: after a nested list was over,
a new line was inserted. this leads to a large gap before
the rest of the parent list.

lists are prefixed and suffixed with a single newline,
this is now represented in the tests.
2020-08-09 21:11:16 +02:00
AlexVonB
ee73d89879 Merge pull request #14 from AlexVonB/fix-inline-spaces
remove prefixed and suffixed spaces from inline tags
2020-08-09 20:24:23 +02:00
AlexVonB
5563161c86 remove needless checks for emtpy text 2019-07-12 10:23:17 +02:00
AlexVonB
28e447d9ae remove prefixed and suffixed spaces from inline tags
fixes matthewwithanm#13
2019-07-11 23:27:52 +02:00
Matthew Dapena-Tretter
89d14f4487 Merge pull request #11 from AlexVonB/AlexVonB-patch-1
Add newline before and after a markdown list
2019-07-04 08:53:25 -07:00
AlexVonB
5f9243d91d added tests for matthewwithanm#11 2019-07-04 16:32:21 +02:00
AlexVonB
d0f688d2e4 Add newline before and after a markdown list
Fixes matthewwithanm#5 as well as an issue where `<p>foo<p><ul><li>bar</li></ul>` gets converted to `foo * bar` which is not correct
2019-07-04 16:26:09 +02:00
Jonathan Vanasco
5ac08522be updating classifer to mit license
issue #9
2019-06-19 16:17:47 -07:00
Thomas Lange
78afcc173e Adding MIT license file 2018-10-16 19:11:02 -07:00
Steven Skoczen
b132a6f5b3 Updates to 0.4.1, pkgmeta included directly in setup. 2017-11-28 12:07:31 +13:00
Steven Skoczen
0abe0a29e8 Merge pull request #2 from crhallberg/html-parser
Suppress BeautifulSoup warning
2017-11-13 08:48:45 +13:00
Steven Skoczen
4932df631f Merge pull request #1 from dmpayton/develop
Fixes to get tests passing in Python 3.
2017-11-13 08:48:38 +13:00
Chris Hallberg
8696e2bde1 Suppress BeautifulSoup warning
by explicitly passing in the default parser as recommended by the error message:

```
/home/challberg/.local/lib/python2.7/site-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 35 of the file unroll.py. To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "html.parser")

  markup_type=markup_type))
```
2017-06-12 16:03:04 -04:00
dmpayton
ee53d85c41 Fixes to get tests passing in Python 3. 2016-02-23 15:15:29 -08:00
Matthew Tretter
53ba0daa77 Document options 2013-07-31 23:23:44 -04:00
Matthew Tretter
fb98e9878f Bump to 0.4.0 2013-07-31 23:12:53 -04:00
Matthew Tretter
aa10053fbb Test custom bullets 2013-07-31 23:11:39 -04:00
Matthew Tretter
253a34c2d7 Test nested unordered lists 2013-07-31 23:08:39 -04:00
Matthew Tretter
3ea09609e6 Add support for "bullets" option 2013-07-31 23:08:36 -04:00
Matthew Tretter
1cd8e56c47 Test ATX and ATX_CLOSED style headings 2013-07-31 22:19:41 -04:00
Matthew Tretter
891a4a8d08 Add "heading_style" option
Allow the user to specify a heading style.
2013-07-31 22:17:22 -04:00
Matthew Tretter
e5a1784f30 Remove unneeded raw string 2013-07-31 21:59:35 -04:00
Matthew Tretter
f60d910335 Add "autolinks" option
This option allows you to disable the creation of "autolink" style
links.
2013-07-31 21:58:48 -04:00
Matthew Tretter
d707d107f6 Support inner Options class 2013-07-31 21:55:30 -04:00
Matthew Tretter
1ef4dd1468 Add shortcut link syntax 2013-07-31 19:23:39 -04:00
Matthew Tretter
934c97b342 Test img tag conversion 2013-07-31 19:23:38 -04:00
Matthew Tretter
8a1e2d9403 Add simple img conversion 2013-07-31 19:23:36 -04:00
Matthew Tretter
5563723cbc Bump to 0.3.0 2013-07-31 18:16:02 -04:00
Matthew Tretter
a9c13a56da Identify and single out HTML fragment 2013-07-31 18:13:50 -04:00
Matthew Tretter
7bdeb15b18 Use bs4
This causes a lot more tests to fail. But it'll be worth it in the end.
2013-07-31 18:01:52 -04:00
Matthew Tretter
87c8f3bd5e Add development notes to README 2013-07-31 17:20:36 -04:00
Matthew Tretter
0211ac6619 Lint code 2013-07-31 17:20:36 -04:00
Matthew Tretter
2515e9e107 Add lint command 2013-07-31 17:20:32 -04:00
Matthew Tretter
ece61a5b1f Bump to 0.2.0 2013-07-31 17:11:12 -04:00
Matthew Tretter
f46fb8ebbb Add short description to README 2013-07-31 17:05:37 -04:00
Matthew Tretter
e521fd402f Add manifest template 2013-07-31 16:55:53 -04:00
Matthew Tretter
fd6f8db132 Add gitignore 2013-07-31 16:55:30 -04:00
Matthew Tretter
c2f32b8049 Switch to pytest 2013-07-31 16:54:37 -04:00
Matthew Tretter
b92428466d Change name to markdownify 2013-07-31 16:41:08 -04:00
Matthew Tretter
7f75b0bbce Update package meta 2013-07-31 16:40:56 -04:00
18 changed files with 868 additions and 198 deletions

33
.github/workflows/python-app.yml vendored Normal file

@@ -0,0 +1,33 @@
# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
name: Python application

on:
  push:
    branches: [ develop ]
  pull_request:
    branches: [ develop ]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - name: Set up Python 3.8
      uses: actions/setup-python@v2
      with:
        python-version: 3.8
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install flake8==3.8.4 pytest
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
    - name: Lint with flake8
      run: |
        python setup.py lint
    - name: Test with pytest
      run: |
        python setup.py test

31
.github/workflows/python-publish.yml vendored Normal file

@@ -0,0 +1,31 @@
# This workflow will upload a Python Package using Twine when a release is created
# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries
name: Upload Python Package

on:
  release:
    types: [created]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - name: Set up Python
      uses: actions/setup-python@v2
      with:
        python-version: '3.8'
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install setuptools wheel twine
    - name: Build and publish
      env:
        TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
        TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
      run: |
        python setup.py sdist bdist_wheel
        twine upload dist/*

10
.gitignore vendored Normal file

@@ -0,0 +1,10 @@
*.pyc
*.egg
.eggs/
*.egg-info/
.DS_Store
/.env
/dist
/MANIFEST
/venv
build/

21
LICENSE Normal file

@@ -0,0 +1,21 @@
The MIT License (MIT)
Copyright 2012-2018 Matthew Tretter
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

1
MANIFEST.in Normal file

@@ -0,0 +1 @@
include README.rst

@@ -0,0 +1,103 @@
|build| |version| |license| |downloads|
.. |build| image:: https://img.shields.io/github/workflow/status/matthewwithanm/python-markdownify/Python%20application/develop
:alt: GitHub Workflow Status
:target: https://github.com/matthewwithanm/python-markdownify/actions?query=workflow%3A%22Python+application%22
.. |version| image:: https://img.shields.io/pypi/v/markdownify
:alt: Pypi version
:target: https://pypi.org/project/markdownify/
.. |license| image:: https://img.shields.io/pypi/l/markdownify
:alt: License
:target: https://github.com/matthewwithanm/python-markdownify/blob/develop/LICENSE
.. |downloads| image:: https://pepy.tech/badge/markdownify
:alt: Pypi Downloads
:target: https://pepy.tech/project/markdownify
Installation
============
``pip install markdownify``
Usage
=====
Convert some HTML to Markdown:
.. code:: python
from markdownify import markdownify as md
md('<b>Yay</b> <a href="http://github.com">GitHub</a>') # > '**Yay** [GitHub](http://github.com)'
Specify tags to exclude (blacklist):
.. code:: python
from markdownify import markdownify as md
md('<b>Yay</b> <a href="http://github.com">GitHub</a>', strip=['a']) # > '**Yay** GitHub'
\...or specify the tags you want to include (whitelist):
.. code:: python
from markdownify import markdownify as md
md('<b>Yay</b> <a href="http://github.com">GitHub</a>', convert=['b']) # > '**Yay** GitHub'
Options
=======
Markdownify supports the following options:
strip
A list of tags to strip (blacklist). This option can't be used with the
``convert`` option.
convert
A list of tags to convert (whitelist). This option can't be used with the
``strip`` option.
autolinks
A boolean indicating whether the "automatic link" style should be used when
an ``a`` tag's contents match its href. Defaults to ``True``.
heading_style
Defines how headings should be converted. Accepted values are ``ATX``,
``ATX_CLOSED``, ``SETEXT``, and ``UNDERLINED`` (which is an alias for
``SETEXT``). Defaults to ``UNDERLINED``.
bullets
An iterable (string, list, or tuple) of bullet styles to be used. If the
iterable only contains one item, it will be used regardless of how deeply
lists are nested. Otherwise, the bullet will alternate based on nesting
level. Defaults to ``'*+-'``.
strong_em_symbol
In markdown, both ``*`` and ``_`` are used to encode **strong** or
*emphasized* texts. Either of these symbols can be chosen by the options
``ASTERISK`` (default) or ``UNDERSCORE`` respectively.
newline_style
Defines the style of marking linebreaks (``<br>``) in markdown. The default
value ``SPACES`` of this option will adopt the usual two spaces and a newline,
while ``BACKSLASH`` will convert a linebreak to ``\\n`` (a backslash and a
newline). While the latter convention is non-standard, it is commonly
preferred and supported by a lot of interpreters.
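As a rough illustration of the ``heading_style`` values listed above, the sketch below renders a heading in each style without importing the library. ``render_heading`` is a hypothetical free function written for this note; its branches follow ``convert_hn`` and ``underline`` from the converter code in this changeset, and the lowercase style strings are the values of the ``UNDERLINED``, ``ATX`` and ``ATX_CLOSED`` constants defined there.

```python
def render_heading(text, level, style='underlined'):
    """Hypothetical helper: format one heading the way convert_hn does."""
    text = text.rstrip()
    if style == 'underlined' and level <= 2:
        # Setext form only exists for levels 1 and 2; deeper levels fall
        # through to the ATX form below.
        line = '=' if level == 1 else '-'
        return '%s\n%s\n\n' % (text, line * len(text)) if text else ''
    hashes = '#' * level
    if style == 'atx_closed':
        return '%s %s %s\n\n' % (hashes, text, hashes)
    return '%s %s\n\n' % (hashes, text)


print(repr(render_heading('Hello', 1)))                # underlined with '='
print(repr(render_heading('Hello', 3)))                # falls back to ATX
print(repr(render_heading('Hello', 2, 'atx_closed')))  # closing hashes added
```

Note that with the default style, a level 3 heading still comes out in ATX form, because Setext syntax has no representation beyond level 2.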
Options may be specified as kwargs to the ``markdownify`` function, or as a
nested ``Options`` class in ``MarkdownConverter`` subclasses.
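The two ways of passing options can be pictured as three layers merged in order: built-in defaults, then a nested ``Options`` class, then keyword arguments. The following is a minimal sketch of that merge, assuming the order used in the converter's ``__init__`` in this changeset. ``SketchConverter`` and ``UnderscoreConverter`` are hypothetical stand-ins that model option handling only; they are not part of markdownify.

```python
def _todict(obj):
    # Collect the public attributes of a class into a plain dict.
    return dict((k, getattr(obj, k)) for k in dir(obj) if not k.startswith('_'))


class SketchConverter(object):
    # Hypothetical stand-in for MarkdownConverter (options only).
    class DefaultOptions:
        strip = None
        convert = None
        autolinks = True
        heading_style = 'underlined'
        bullets = '*+-'
        strong_em_symbol = '*'
        newline_style = 'spaces'

    class Options(DefaultOptions):
        pass

    def __init__(self, **options):
        # Later sources win: defaults < nested Options class < kwargs.
        self.options = _todict(self.DefaultOptions)
        self.options.update(_todict(self.Options))
        self.options.update(options)
        if self.options['strip'] is not None and self.options['convert'] is not None:
            raise ValueError('You may specify either tags to strip or tags to'
                             ' convert, but not both.')


class UnderscoreConverter(SketchConverter):
    # A subclass changes defaults by declaring its own nested Options class.
    class Options:
        strong_em_symbol = '_'
        heading_style = 'atx'


defaults = SketchConverter().options
subclassed = UnderscoreConverter().options
overridden = UnderscoreConverter(heading_style='atx_closed').options

print(defaults['heading_style'])    # value from DefaultOptions
print(subclassed['heading_style'])  # value from the nested Options class
print(overridden['heading_style'])  # value from the keyword argument
```

Because the defaults are always loaded first, a nested ``Options`` class only needs to list the values it changes and does not have to inherit from ``DefaultOptions``.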
Development
===========
To run tests:
``python setup.py test``
To lint:
``python setup.py lint``

@@ -1,10 +1,27 @@
from lxml.html.soupparser import fromstring
from bs4 import BeautifulSoup, NavigableString, Comment
import re
import six
convert_heading_re = re.compile(r'convert_h(\d+)')
line_beginning_re = re.compile(r'^', re.MULTILINE)
whitespace_re = re.compile(r'[\r\n\s\t ]+')
whitespace_re = re.compile(r'[\t ]+')
html_heading_re = re.compile(r'h[1-6]')
# Heading styles
ATX = 'atx'
ATX_CLOSED = 'atx_closed'
UNDERLINED = 'underlined'
SETEXT = UNDERLINED
# Newline style
SPACES = 'spaces'
BACKSLASH = 'backslash'
# Strong and emphasis style
ASTERISK = '*'
UNDERSCORE = '_'
def escape(text):
@@ -13,30 +30,72 @@ def escape(text):
return text.replace('_', r'\_')
def chomp(text):
"""
If the text in an inline tag like b, a, or em contains a leading or trailing
space, strip the string and return a space as suffix or prefix, if needed.
This function is used to prevent conversions like
<b> foo</b> => ** foo**
"""
prefix = ' ' if text and text[0] == ' ' else ''
suffix = ' ' if text and text[-1] == ' ' else ''
text = text.strip()
return (prefix, suffix, text)
def _todict(obj):
return dict((k, getattr(obj, k)) for k in dir(obj) if not k.startswith('_'))
class MarkdownConverter(object):
def __init__(self, tags_to_strip=None, tags_to_convert=None):
if tags_to_strip is not None and tags_to_convert is not None:
class DefaultOptions:
strip = None
convert = None
autolinks = True
heading_style = UNDERLINED
bullets = '*+-' # An iterable of bullet types.
strong_em_symbol = ASTERISK
newline_style = SPACES
class Options(DefaultOptions):
pass
def __init__(self, **options):
# Create an options dictionary. Use DefaultOptions as a base so that
# it doesn't have to be extended.
self.options = _todict(self.DefaultOptions)
self.options.update(_todict(self.Options))
self.options.update(options)
if self.options['strip'] is not None and self.options['convert'] is not None:
raise ValueError('You may specify either tags to strip or tags to'
' convert, but not both.')
self.tags_to_strip = tags_to_strip
self.tags_to_convert = tags_to_convert
' convert, but not both.')
def convert(self, html):
soup = fromstring(html)
return self.process_tag(soup)
soup = BeautifulSoup(html, 'html.parser')
return self.process_tag(soup, convert_as_inline=False, children_only=True)
def process_tag(self, node):
text = self.process_text(node.text)
def process_tag(self, node, convert_as_inline, children_only=False):
text = ''
# markdown headings can't include block elements (elements w/newlines)
isHeading = html_heading_re.match(node.name) is not None
convert_children_as_inline = convert_as_inline
if not children_only and isHeading:
convert_children_as_inline = True
# Convert the children first
for el in node.findall('*'):
text += self.process_tag(el)
for el in node.children:
if isinstance(el, Comment):
continue
elif isinstance(el, NavigableString):
text += self.process_text(six.text_type(el))
else:
text += self.process_tag(el, convert_children_as_inline)
convert_fn = getattr(self, 'convert_%s' % node.tag, None)
if convert_fn and self.should_convert_tag(node.tag):
text = convert_fn(node, text)
text += self.process_text(node.tail)
if not children_only:
convert_fn = getattr(self, 'convert_%s' % node.name, None)
if convert_fn and self.should_convert_tag(node.name):
text = convert_fn(node, text, convert_as_inline)
return text
@@ -44,13 +103,13 @@ class MarkdownConverter(object):
return escape(whitespace_re.sub(' ', text or ''))
def __getattr__(self, attr):
# Handle heading levels > 2
# Handle headings
m = convert_heading_re.match(attr)
if m:
n = int(m.group(1))
def convert_tag(el, text):
return self.convert_hn(n, el, text)
def convert_tag(el, text, convert_as_inline):
return self.convert_hn(n, el, text, convert_as_inline)
convert_tag.__name__ = 'convert_h%s' % n
setattr(self, convert_tag.__name__, convert_tag)
@@ -60,62 +119,159 @@ class MarkdownConverter(object):
def should_convert_tag(self, tag):
tag = tag.lower()
if self.tags_to_strip is not None:
return tag not in self.tags_to_strip
elif self.tags_to_convert is not None:
return tag in self.tags_to_convert
strip = self.options['strip']
convert = self.options['convert']
if strip is not None:
return tag not in strip
elif convert is not None:
return tag in convert
else:
return True
def indent(self, text, level):
return line_beginning_re.sub('\t' * level, text) if text else ''
def underline(self, text, pad_char):
text = (text or '').rstrip()
return '%s\n%s\n\n' % (text, pad_char * len(text)) if text else ''
def convert_a(self, el, text):
def convert_a(self, el, text, convert_as_inline):
prefix, suffix, text = chomp(text)
if not text:
return ''
if convert_as_inline:
return text
href = el.get('href')
title = el.get('title')
# For the replacement see #29: text nodes underscores are escaped
if self.options['autolinks'] and text.replace(r'\_', '_') == href and not title:
# Shortcut syntax
return '<%s>' % href
title_part = ' "%s"' % title.replace('"', r'\"') if title else ''
return '[%s](%s%s)' % (text or '', href, title_part) if href else text or ''
return '%s[%s](%s%s)%s' % (prefix, text, href, title_part, suffix) if href else text
def convert_b(self, el, text):
return self.convert_strong(el, text)
def convert_b(self, el, text, convert_as_inline):
return self.convert_strong(el, text, convert_as_inline)
def convert_blockquote(self, el, text):
return '\n' + line_beginning_re.sub('> ', text) if text else ''
def convert_blockquote(self, el, text, convert_as_inline):
def convert_br(self, el, text):
return ' \n'
if convert_as_inline:
return text
def convert_em(self, el, text):
return '*%s*' % text if text else ''
return '\n' + (line_beginning_re.sub('> ', text) + '\n\n') if text else ''
def convert_h1(self, el, text):
return self.underline(text, '=')
def convert_br(self, el, text, convert_as_inline):
if convert_as_inline:
return ""
def convert_h2(self, el, text):
return self.underline(text, '-')
def convert_hn(self, n, el, text):
return '%s %s\n\n' % ('#' * n, text.rstrip()) if text else ''
def convert_i(self, el, text):
return self.convert_em(el, text)
def convert_li(self, el, text):
parent = el.getparent()
if parent is not None and parent.tag == 'ol':
bullet = '%s.' % (parent.index(el) + 1)
if self.options['newline_style'].lower() == BACKSLASH:
return '\\\n'
else:
bullet = '*'
return ' \n'
def convert_em(self, el, text, convert_as_inline):
em_tag = self.options['strong_em_symbol']
prefix, suffix, text = chomp(text)
if not text:
return ''
return '%s%s%s%s%s' % (prefix, em_tag, text, em_tag, suffix)
def convert_hn(self, n, el, text, convert_as_inline):
if convert_as_inline:
return text
style = self.options['heading_style'].lower()
text = text.rstrip()
if style == UNDERLINED and n <= 2:
line = '=' if n == 1 else '-'
return self.underline(text, line)
hashes = '#' * n
if style == ATX_CLOSED:
return '%s %s %s\n\n' % (hashes, text, hashes)
return '%s %s\n\n' % (hashes, text)
def convert_i(self, el, text, convert_as_inline):
return self.convert_em(el, text, convert_as_inline)
def convert_list(self, el, text, convert_as_inline):
# Converting a list to inline is undefined.
# Ignoring convert_to_inline for list.
nested = False
while el:
if el.name == 'li':
nested = True
break
el = el.parent
if nested:
# remove trailing newline if nested
return '\n' + self.indent(text, 1).rstrip()
return '\n' + text + '\n'
convert_ul = convert_list
convert_ol = convert_list
def convert_li(self, el, text, convert_as_inline):
parent = el.parent
if parent is not None and parent.name == 'ol':
if parent.get("start"):
start = int(parent.get("start"))
else:
start = 1
bullet = '%s.' % (start + parent.index(el))
else:
depth = -1
while el:
if el.name == 'ul':
depth += 1
el = el.parent
bullets = self.options['bullets']
bullet = bullets[depth % len(bullets)]
return '%s %s\n' % (bullet, text or '')
def convert_p(self, el, text):
def convert_p(self, el, text, convert_as_inline):
if convert_as_inline:
return text
return '%s\n\n' % text if text else ''
def convert_strong(self, el, text):
return '**%s**' % text if text else ''
def convert_strong(self, el, text, convert_as_inline):
strong_tag = 2 * self.options['strong_em_symbol']
prefix, suffix, text = chomp(text)
if not text:
return ''
return '%s%s%s%s%s' % (prefix, strong_tag, text, strong_tag, suffix)
def convert_img(self, el, text, convert_as_inline):
alt = el.attrs.get('alt', None) or ''
src = el.attrs.get('src', None) or ''
title = el.attrs.get('title', None) or ''
title_part = ' "%s"' % title.replace('"', r'\"') if title else ''
if convert_as_inline:
return alt
return '![%s](%s%s)' % (alt, src, title_part)
def convert_table(self, el, text, convert_as_inline):
rows = el.find_all('tr')
text_data = []
for row in rows:
headers = row.find_all('th')
columns = row.find_all('td')
if len(headers) > 0:
headers = [head.text.strip() for head in headers]
text_data.append('| ' + ' | '.join(headers) + ' |')
text_data.append('| ' + ' | '.join(['---'] * len(headers)) + ' |')
elif len(columns) > 0:
columns = [colm.text.strip() for colm in columns]
text_data.append('| ' + ' | '.join(columns) + ' |')
else:
continue
return '\n'.join(text_data)
def convert_hr(self, el, text, convert_as_inline):
return '\n\n---\n\n'
def markdownify(html, strip=None, convert=None):
converter = MarkdownConverter(strip, convert)
return converter.convert(html)
def markdownify(html, **options):
return MarkdownConverter(**options).convert(html)
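
The inline-whitespace handling added in this diff can be tried in isolation. The sketch below copies ``chomp`` from the diff and pairs it with ``wrap_strong``, a hypothetical free function that mimics what ``convert_strong`` does with the result. It is an illustration under those assumptions, not the library's API.

```python
def chomp(text):
    # Split off one leading/trailing space so it can be re-attached
    # outside the markdown markers.
    prefix = ' ' if text and text[0] == ' ' else ''
    suffix = ' ' if text and text[-1] == ' ' else ''
    text = text.strip()
    return (prefix, suffix, text)


def wrap_strong(text, symbol='*'):
    # Hypothetical helper mirroring convert_strong: the doubled symbol hugs
    # the stripped text, and the saved spaces go outside the markers.
    prefix, suffix, text = chomp(text)
    if not text:
        return ''
    marker = 2 * symbol
    return '%s%s%s%s%s' % (prefix, marker, text, marker, suffix)


print(repr(wrap_strong(' foo')))       # space moves in front of the markers
print(repr(wrap_strong('foo ', '_')))  # underscore variant, space after
print(repr(wrap_strong('   ')))        # whitespace-only content yields ''
```

Moving the space outside the markers matters because many Markdown renderers do not treat ``** foo**`` as bold text.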

@@ -1 +0,0 @@
__version__ = '0.1.0'

@@ -1,5 +0,0 @@
#!/usr/bin/env python
from nose.core import run, collector
if __name__ == '__main__':
run()

2
setup.cfg Normal file
View File

@@ -0,0 +1,2 @@
[flake8]
ignore = E501

@@ -2,43 +2,98 @@
import codecs
import os
from setuptools import setup, find_packages
from setuptools.command.test import test as TestCommand, Command
read = lambda filepath: codecs.open(filepath, 'r', 'utf-8').read()
execfile(os.path.join(os.path.dirname(__file__), 'markdownify', 'version.py'))
pkgmeta = {
'__title__': 'markdownify',
'__author__': 'Matthew Tretter',
'__version__': '0.7.1',
}
class PyTest(TestCommand):
def finalize_options(self):
TestCommand.finalize_options(self)
self.test_args = ['tests', '-s']
self.test_suite = True
def run_tests(self):
import pytest
errno = pytest.main(self.test_args)
raise SystemExit(errno)
class LintCommand(Command):
"""
A copy of flake8's Flake8Command
"""
description = "Run flake8 on modules registered in setuptools"
user_options = []
def initialize_options(self):
pass
def finalize_options(self):
pass
def distribution_files(self):
if self.distribution.packages:
for package in self.distribution.packages:
yield package.replace(".", os.path.sep)
if self.distribution.py_modules:
for filename in self.distribution.py_modules:
yield "%s.py" % filename
def run(self):
from flake8.api.legacy import get_style_guide
flake8_style = get_style_guide(config_file='setup.cfg')
paths = self.distribution_files()
report = flake8_style.check_files(paths)
raise SystemExit(report.total_errors > 0)
setup(
name='python-markdownify',
name='markdownify',
description='Convert HTML to markdown.',
long_description=read(os.path.join(os.path.dirname(__file__), 'README.rst')),
version=__version__,
author='Matthew Tretter',
author_email='matthew@exanimo.com',
version=pkgmeta['__version__'],
author=pkgmeta['__author__'],
author_email='m@tthewwithanm.com',
url='http://github.com/matthewwithanm/python-markdownify',
download_url='http://github.com/matthewwithanm/python-markdownify/tarball/master',
packages=find_packages(),
zip_safe=False,
include_package_data=True,
setup_requires=[
'flake8>=3.8,<4',
],
tests_require=[
'nose',
'unittest2',
'pytest>=6.2,<7',
],
install_requires=[
'lxml',
'BeautifulSoup',
'beautifulsoup4>=4.9,<5', 'six>=1.15,<2'
],
classifiers=[
'Environment :: Web Environment',
'Framework :: Django',
'Intended Audience :: Developers',
'License :: OSI Approved :: BSD License',
'License :: OSI Approved :: MIT License',
'Operating System :: OS Independent',
'Programming Language :: Python :: 2.5',
'Programming Language :: Python :: 2.6',
'Programming Language :: Python :: 2.7',
'Programming Language :: Python :: 3.6',
'Programming Language :: Python :: 3.7',
'Programming Language :: Python :: 3.8',
'Topic :: Utilities'
],
setup_requires=[],
test_suite='runtests.collector',
cmdclass={
'test': PyTest,
'lint': LintCommand,
},
)

123
tests.py

@@ -1,123 +0,0 @@
import unittest
from markdownify import markdownify as md


class BasicTests(unittest.TestCase):

    def test_single_tag(self):
        self.assertEqual(md('<span>Hello</span>'), 'Hello')

    def test_soup(self):
        self.assertEqual(md('<div><span>Hello</div></span>'), 'Hello')

    def test_whitespace(self):
        self.assertEqual(md(' a b \n\n c '), ' a b c ')


class ArgTests(unittest.TestCase):

    def test_strip(self):
        self.assertEqual(
            md('<a href="https://github.com/matthewwithanm">Some Text</a>', strip=['a']),
            'Some Text')

    def test_do_not_strip(self):
        self.assertEqual(
            md('<a href="https://github.com/matthewwithanm">Some Text</a>', strip=[]),
            '[Some Text](https://github.com/matthewwithanm)')

    def test_convert(self):
        self.assertEqual(
            md('<a href="https://github.com/matthewwithanm">Some Text</a>', convert=['a']),
            '[Some Text](https://github.com/matthewwithanm)')

    def test_do_not_convert(self):
        self.assertEqual(
            md('<a href="https://github.com/matthewwithanm">Some Text</a>', convert=[]),
            'Some Text')


class EscapeTests(unittest.TestCase):

    def test_underscore(self):
        self.assertEqual(md('_hey_dude_'), '\_hey\_dude\_')

    def test_xml_entities(self):
        self.assertEqual(md('&amp;'), '&')

    def test_named_entities(self):
        self.assertEqual(md('&raquo;'), u'\xbb')

    def test_hexadecimal_entities(self):
        # This looks to be a bug in BeautifulSoup (fixed in bs4) that we have to work around.
        self.assertEqual(md('&#x27;'), '\x27')

    def test_single_escaping_entities(self):
        self.assertEqual(md('&amp;amp;'), '&amp;')


class ConversionTests(unittest.TestCase):

    def test_a(self):
        self.assertEqual(
            md('<a href="http://google.com">Google</a>'),
            '[Google](http://google.com)'
        )

    def test_a_with_title(self):
        self.assertEqual(
            md('<a href="http://google.com" title="The &quot;Goog&quot;">Google</a>'),
            r'[Google](http://google.com "The \"Goog\"")'
        )

    def test_b(self):
        self.assertEqual(md('<b>Hello</b>'), '**Hello**')

    def test_blockquote(self):
        self.assertEqual(md('<blockquote>Hello</blockquote>').strip(), '> Hello')

    def test_nested_blockquote(self):
        self.assertEqual(
            md('<blockquote>And she was like <blockquote>Hello</blockquote></blockquote>').strip(),
            '> And she was like \n> > Hello'
        )

    def test_br(self):
        self.assertEqual(md('a<br />b<br />c'), 'a  \nb  \nc')

    def test_em(self):
        self.assertEqual(md('<em>Hello</em>'), '*Hello*')

    def test_h1(self):
        self.assertEqual(md('<h1>Hello</h1>'), 'Hello\n=====\n\n')

    def test_h2(self):
        self.assertEqual(md('<h2>Hello</h2>'), 'Hello\n-----\n\n')

    def test_hn(self):
        self.assertEqual(md('<h3>Hello</h3>'), '### Hello\n\n')
        self.assertEqual(md('<h6>Hello</h6>'), '###### Hello\n\n')

    def test_i(self):
        self.assertEqual(md('<i>Hello</i>'), '*Hello*')

    def test_ol(self):
        self.assertEqual(md('<ol><li>a</li><li>b</li></ol>'), '1. a\n2. b\n')

    def test_p(self):
        self.assertEqual(md('<p>hello</p>'), 'hello\n\n')

    def test_strong(self):
        self.assertEqual(md('<strong>Hello</strong>'), '**Hello**')

    def test_ul(self):
        self.assertEqual(md('<ul><li>a</li><li>b</li></ul>'), '* a\n* b\n')


class AdvancedTests(unittest.TestCase):

    def test_nested(self):
        self.assertEqual(
            md('<p>This is an <a href="http://example.com/">example link</a>.</p>'),
            'This is an [example link](http://example.com/).\n\n'
        )

0
tests/__init__.py Normal file

16
tests/test_advanced.py Normal file

@@ -0,0 +1,16 @@
from markdownify import markdownify as md


def test_nested():
    text = md('<p>This is an <a href="http://example.com/">example link</a>.</p>')
    assert text == 'This is an [example link](http://example.com/).\n\n'


def test_ignore_comments():
    text = md("<!-- This is a comment -->")
    assert text == ""


def test_ignore_comments_with_other_tags():
    text = md("<!-- This is a comment --><a href='http://example.com/'>example link</a>")
    assert text == "[example link](http://example.com/)"

25
tests/test_args.py Normal file

@@ -0,0 +1,25 @@
"""
Test whitelisting/blacklisting of specific tags.
"""
from markdownify import markdownify as md


def test_strip():
    text = md('<a href="https://github.com/matthewwithanm">Some Text</a>', strip=['a'])
    assert text == 'Some Text'


def test_do_not_strip():
    text = md('<a href="https://github.com/matthewwithanm">Some Text</a>', strip=[])
    assert text == '[Some Text](https://github.com/matthewwithanm)'


def test_convert():
    text = md('<a href="https://github.com/matthewwithanm">Some Text</a>', convert=['a'])
    assert text == '[Some Text](https://github.com/matthewwithanm)'


def test_do_not_convert():
    text = md('<a href="https://github.com/matthewwithanm">Some Text</a>', convert=[])
    assert text == 'Some Text'
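
The four cases in tests/test_args.py pin down how `strip` and `convert` interact: `strip` acts as a blacklist, `convert` as a whitelist, and an empty list is still a meaningful value. A minimal sketch of that decision rule follows. The helper name `should_convert_tag` and its signature are assumptions made for illustration and are not taken from this diff.

```python
def should_convert_tag(tag, strip=None, convert=None):
    """Hypothetical helper: decide whether `tag` is rendered as Markdown."""
    tag = tag.lower()
    if strip is not None:
        # Blacklist mode: every tag is converted except the listed ones.
        return tag not in strip
    if convert is not None:
        # Whitelist mode: only the listed tags are converted.
        return tag in convert
    # Neither option given: convert everything.
    return True


# Mirrors the four test cases: strip=['a'], strip=[], convert=['a'], convert=[]
print(should_convert_tag('a', strip=['a']))
print(should_convert_tag('a', strip=[]))
print(should_convert_tag('a', convert=['a']))
print(should_convert_tag('a', convert=[]))
```

Comparing against `None` rather than relying on truthiness is what lets `convert=[]` mean "convert nothing" instead of falling through to the default.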

13
tests/test_basic.py Normal file

@@ -0,0 +1,13 @@
from markdownify import markdownify as md


def test_single_tag():
    assert md('<span>Hello</span>') == 'Hello'


def test_soup():
    assert md('<div><span>Hello</div></span>') == 'Hello'


def test_whitespace():
    assert md(' a b \t\t c ') == ' a b c '

311
tests/test_conversions.py Normal file

@@ -0,0 +1,311 @@
from markdownify import markdownify as md, ATX, ATX_CLOSED, BACKSLASH, UNDERSCORE
import re


nested_uls = re.sub(r'\s+', '', """
    <ul>
        <li>1
            <ul>
                <li>a
                    <ul>
                        <li>I</li>
                        <li>II</li>
                        <li>III</li>
                    </ul>
                </li>
                <li>b</li>
                <li>c</li>
            </ul>
        </li>
        <li>2</li>
        <li>3</li>
    </ul>""")


table = re.sub(r'\s+', '', """
<table>
    <tr>
        <th>Firstname</th>
        <th>Lastname</th>
        <th>Age</th>
    </tr>
    <tr>
        <td>Jill</td>
        <td>Smith</td>
        <td>50</td>
    </tr>
    <tr>
        <td>Eve</td>
        <td>Jackson</td>
        <td>94</td>
    </tr>
</table>
""")


table_head_body = re.sub(r'\s+', '', """
<table>
    <thead>
        <tr>
            <th>Firstname</th>
            <th>Lastname</th>
            <th>Age</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Jill</td>
            <td>Smith</td>
            <td>50</td>
        </tr>
        <tr>
            <td>Eve</td>
            <td>Jackson</td>
            <td>94</td>
        </tr>
    </tbody>
</table>
""")


table_missing_text = re.sub(r'\s+', '', """
<table>
    <thead>
        <tr>
            <th></th>
            <th>Lastname</th>
            <th>Age</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Jill</td>
            <td></td>
            <td>50</td>
        </tr>
        <tr>
            <td>Eve</td>
            <td>Jackson</td>
            <td>94</td>
        </tr>
    </tbody>
</table>
""")


def test_chomp():
    assert md(' <b></b> ') == ' '
    assert md(' <b> </b> ') == ' '
    assert md(' <b> </b> ') == ' '
    assert md(' <b> </b> ') == ' '
    assert md(' <b>s </b> ') == ' **s** '
    assert md(' <b> s</b> ') == ' **s** '
    assert md(' <b> s </b> ') == ' **s** '
    assert md(' <b> s </b> ') == ' **s** '


def test_a():
    assert md('<a href="https://google.com">Google</a>') == '[Google](https://google.com)'
    assert md('<a href="https://google.com">https://google.com</a>', autolinks=False) == '[https://google.com](https://google.com)'
    assert md('<a href="https://google.com">https://google.com</a>') == '<https://google.com>'
    assert md('<a href="https://community.kde.org/Get_Involved">https://community.kde.org/Get_Involved</a>') == '<https://community.kde.org/Get_Involved>'
    assert md('<a href="https://community.kde.org/Get_Involved">https://community.kde.org/Get_Involved</a>', autolinks=False) == '[https://community.kde.org/Get\\_Involved](https://community.kde.org/Get_Involved)'


def test_a_spaces():
    assert md('foo <a href="http://google.com">Google</a> bar') == 'foo [Google](http://google.com) bar'
    assert md('foo<a href="http://google.com"> Google</a> bar') == 'foo [Google](http://google.com) bar'
    assert md('foo <a href="http://google.com">Google </a>bar') == 'foo [Google](http://google.com) bar'
    assert md('foo <a href="http://google.com"></a> bar') == 'foo bar'


def test_a_with_title():
    text = md('<a href="http://google.com" title="The &quot;Goog&quot;">Google</a>')
    assert text == r'[Google](http://google.com "The \"Goog\"")'


def test_a_shortcut():
    text = md('<a href="http://google.com">http://google.com</a>')
    assert text == '<http://google.com>'


def test_a_no_autolinks():
    text = md('<a href="http://google.com">http://google.com</a>', autolinks=False)
    assert text == '[http://google.com](http://google.com)'


def test_b():
    assert md('<b>Hello</b>') == '**Hello**'


def test_b_spaces():
    assert md('foo <b>Hello</b> bar') == 'foo **Hello** bar'
    assert md('foo<b> Hello</b> bar') == 'foo **Hello** bar'
    assert md('foo <b>Hello </b>bar') == 'foo **Hello** bar'
    assert md('foo <b></b> bar') == 'foo bar'


def test_blockquote():
    assert md('<blockquote>Hello</blockquote>') == '\n> Hello\n\n'


def test_blockquote_with_paragraph():
    assert md('<blockquote>Hello</blockquote><p>handsome</p>') == '\n> Hello\n\nhandsome\n\n'


def test_nested_blockquote():
    text = md('<blockquote>And she was like <blockquote>Hello</blockquote></blockquote>')
    assert text == '\n> And she was like \n> > Hello\n> \n> \n\n'


def test_br():
    assert md('a<br />b<br />c') == 'a  \nb  \nc'


def test_em():
    assert md('<em>Hello</em>') == '*Hello*'


def test_em_spaces():
    assert md('foo <em>Hello</em> bar') == 'foo *Hello* bar'
    assert md('foo<em> Hello</em> bar') == 'foo *Hello* bar'
    assert md('foo <em>Hello </em>bar') == 'foo *Hello* bar'
    assert md('foo <em></em> bar') == 'foo bar'


def test_h1():
    assert md('<h1>Hello</h1>') == 'Hello\n=====\n\n'


def test_h2():
    assert md('<h2>Hello</h2>') == 'Hello\n-----\n\n'


def test_hn():
    assert md('<h3>Hello</h3>') == '### Hello\n\n'
    assert md('<h6>Hello</h6>') == '###### Hello\n\n'


def test_hn_chained():
    assert md('<h1>First</h1>\n<h2>Second</h2>\n<h3>Third</h3>', heading_style=ATX) == '# First\n\n\n## Second\n\n\n### Third\n\n'
    assert md('X<h1>First</h1>', heading_style=ATX) == 'X# First\n\n'


def test_hn_nested_tag_heading_style():
    assert md('<h1>A <p>P</p> C </h1>', heading_style=ATX_CLOSED) == '# A P C #\n\n'
    assert md('<h1>A <p>P</p> C </h1>', heading_style=ATX) == '# A P C\n\n'


def test_hn_nested_simple_tag():
    tag_to_markdown = [
        ("strong", "**strong**"),
        ("b", "**b**"),
        ("em", "*em*"),
        ("i", "*i*"),
        ("p", "p"),
        ("a", "a"),
        ("div", "div"),
        ("blockquote", "blockquote"),
    ]

    for tag, markdown in tag_to_markdown:
        assert md('<h3>A <' + tag + '>' + tag + '</' + tag + '> B</h3>') == '### A ' + markdown + ' B\n\n'

    assert md('<h3>A <br>B</h3>', heading_style=ATX) == '### A B\n\n'

    # Nested lists not supported
    # assert md('<h3>A <ul><li>li1</i><li>l2</li></ul></h3>', heading_style=ATX) == '### A li1 li2 B\n\n'


def test_hn_nested_img():
    assert md('<img src="/path/to/img.jpg" alt="Alt text" title="Optional title" />') == '![Alt text](/path/to/img.jpg "Optional title")'
    assert md('<img src="/path/to/img.jpg" alt="Alt text" />') == '![Alt text](/path/to/img.jpg)'

    image_attributes_to_markdown = [
        ("", ""),
        ("alt='Alt Text'", "Alt Text"),
        ("alt='Alt Text' title='Optional title'", "Alt Text"),
    ]
    for image_attributes, markdown in image_attributes_to_markdown:
        assert md('<h3>A <img src="/path/to/img.jpg " ' + image_attributes + '/> B</h3>') == '### A ' + markdown + ' B\n\n'


def test_hr():
    assert md('Hello<hr>World') == 'Hello\n\n---\n\nWorld'
    assert md('Hello<hr />World') == 'Hello\n\n---\n\nWorld'
    assert md('<p>Hello</p>\n<hr>\n<p>World</p>') == 'Hello\n\n\n\n\n---\n\n\nWorld\n\n'


def test_head():
    assert md('<head>head</head>') == 'head'


def test_atx_headings():
    assert md('<h1>Hello</h1>', heading_style=ATX) == '# Hello\n\n'
    assert md('<h2>Hello</h2>', heading_style=ATX) == '## Hello\n\n'


def test_atx_closed_headings():
    assert md('<h1>Hello</h1>', heading_style=ATX_CLOSED) == '# Hello #\n\n'
    assert md('<h2>Hello</h2>', heading_style=ATX_CLOSED) == '## Hello ##\n\n'


def test_i():
    assert md('<i>Hello</i>') == '*Hello*'


def test_ol():
    assert md('<ol><li>a</li><li>b</li></ol>') == '\n1. a\n2. b\n\n'
    assert md('<ol start="3"><li>a</li><li>b</li></ol>') == '\n3. a\n4. b\n\n'


def test_p():
    assert md('<p>hello</p>') == 'hello\n\n'


def test_strong():
    assert md('<strong>Hello</strong>') == '**Hello**'


def test_ul():
    assert md('<ul><li>a</li><li>b</li></ul>') == '\n* a\n* b\n\n'


def test_inline_ul():
    assert md('<p>foo</p><ul><li>a</li><li>b</li></ul><p>bar</p>') == 'foo\n\n\n* a\n* b\n\nbar\n\n'


def test_nested_uls():
    """
    Nested ULs should alternate bullet characters.
    """
    assert md(nested_uls) == '\n* 1\n\t+ a\n\t\t- I\n\t\t- II\n\t\t- III\n\t+ b\n\t+ c\n* 2\n* 3\n\n'


def test_bullets():
    assert md(nested_uls, bullets='-') == '\n- 1\n\t- a\n\t\t- I\n\t\t- II\n\t\t- III\n\t- b\n\t- c\n- 2\n- 3\n\n'


def test_img():
    assert md('<img src="/path/to/img.jpg" alt="Alt text" title="Optional title" />') == '![Alt text](/path/to/img.jpg "Optional title")'
    assert md('<img src="/path/to/img.jpg" alt="Alt text" />') == '![Alt text](/path/to/img.jpg)'


def test_div():
    assert md('Hello</div> World') == 'Hello World'


def test_table():
    assert md(table) == '| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |'
    assert md(table_head_body) == '| Firstname | Lastname | Age |\n| --- | --- | --- |\n| Jill | Smith | 50 |\n| Eve | Jackson | 94 |'
    assert md(table_missing_text) == '| | Lastname | Age |\n| --- | --- | --- |\n| Jill | | 50 |\n| Eve | Jackson | 94 |'


def test_strong_em_symbol():
    assert md('<strong>Hello</strong>', strong_em_symbol=UNDERSCORE) == '__Hello__'
    assert md('<b>Hello</b>', strong_em_symbol=UNDERSCORE) == '__Hello__'
    assert md('<em>Hello</em>', strong_em_symbol=UNDERSCORE) == '_Hello_'
    assert md('<i>Hello</i>', strong_em_symbol=UNDERSCORE) == '_Hello_'


def test_newline_style():
    assert md('a<br />b<br />c', newline_style=BACKSLASH) == 'a\\\nb\\\nc'
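
The nested-list expectations in tests/test_conversions.py (`*`, then `+`, then `-`, with one tab per nesting level, and a single repeated character when `bullets='-'`) are consistent with cycling through the `bullets` string by depth. The sketch below reproduces only that rule on a plain Python tree. The functions `bullet_for_depth` and `render` and the `(text, children)` tuple layout are hypothetical, chosen for illustration, and are not claimed to be how the library is implemented.

```python
def bullet_for_depth(depth, bullets='*+-'):
    """Pick the bullet for a nesting depth by cycling through `bullets`."""
    return bullets[depth % len(bullets)]


def render(items, depth=0, bullets='*+-'):
    """Render a list of (text, children) tuples as indented Markdown lines."""
    lines = []
    for text, children in items:
        # One tab per nesting level, then the bullet for that level.
        lines.append('\t' * depth + bullet_for_depth(depth, bullets) + ' ' + text)
        lines.extend(render(children, depth + 1, bullets))
    return lines


# Same shape as the nested_uls fixture.
tree = [
    ('1', [
        ('a', [('I', []), ('II', []), ('III', [])]),
        ('b', []),
        ('c', []),
    ]),
    ('2', []),
    ('3', []),
]

print('\n'.join(render(tree)))
print('\n'.join(render(tree, bullets='-')))
```

Because the index wraps with the modulo, a one-character `bullets` value gives the same bullet at every level, which matches what `test_bullets` asserts.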

22
tests/test_escaping.py Normal file

@@ -0,0 +1,22 @@
from markdownify import markdownify as md


def test_underscore():
    assert md('_hey_dude_') == r'\_hey\_dude\_'


def test_xml_entities():
    assert md('&amp;') == '&'


def test_named_entities():
    assert md('&raquo;') == u'\xbb'


def test_hexadecimal_entities():
    # This looks to be a bug in BeautifulSoup (fixed in bs4) that we have to work around.
    assert md('&#x27;') == '\x27'


def test_single_escaping_entities():
    assert md('&amp;amp;') == '&amp;'
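
The escaping tests in tests/test_escaping.py expect two things: literal underscores gain a backslash, and HTML entities are decoded exactly once. The sketch below shows both expectations using only the standard library. `escape_underscores` is a hypothetical helper written for this illustration, and `html.unescape` stands in for whatever entity decoding the HTML parser performs, so this is a model of the expected results rather than the library's code path.

```python
import html


def escape_underscores(text):
    """Hypothetical helper: backslash-escape underscores so Markdown
    renders them literally instead of treating them as emphasis."""
    return text.replace('_', r'\_')


print(escape_underscores('_hey_dude_'))

# Entities are decoded a single time, so a double-encoded ampersand
# comes out as the text '&amp;' rather than as '&'.
print(html.unescape('&amp;'))
print(html.unescape('&raquo;'))
print(html.unescape('&#x27;'))
print(html.unescape('&amp;amp;'))
```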