Python Markdown
As you know, I’m working on improving the Python Markdown implementation as part of Google Summer of Code 2008. Markdown is a markup language originally created by John Gruber and Aaron Swartz and implemented in Perl. Today there are many Markdown implementations, covering almost all popular languages. The main aim of Markdown is maximum readability: it is simpler than (X)HTML and widely used in services where users publish text (e.g. blogs where users write posts or comments in Markdown). For example, if you want italic text, you write:
*italics text*
bold text will be:
**bold text**
link:
[Example site](http://example.com)
lists:
- First item
- Second item
1. First item
2. Second item
Other syntactical constructions can be found at daringfireball.net.
The output of different Markdown implementations can be compared using Babelmark.
The main difference between Python Markdown and other Markdown implementations is that Python Markdown processes data in stages: text -> DOM tree -> (X)HTML, while the others convert text directly to (X)HTML. Python Markdown uses its own DOM implementation called NanoDOM. Because of this, Python Markdown is slower than the others, but the approach brings a lot of benefits as well as some problems. One of the main problems with the current Markdown is nested inline patterns. With the current processing mechanism we can make only one of these variants work, but not both:
**_bold and emphasized text_**
_**bold and emphasized text**_
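To make the nesting problem concrete, here is a minimal sketch of one way to handle both orderings: find the earliest inline match, create a tree node for it, and recurse into its contents. This is not the mechanism actually implemented in Python Markdown; the two regular expressions and the names PATTERNS, parse_inline and render are assumptions made up for illustration, and real Markdown has many more inline rules.

```python
import re
import xml.etree.ElementTree as etree

# Hypothetical, heavily simplified inline patterns (illustration only).
PATTERNS = [
    (re.compile(r"\*\*(.+?)\*\*"), "strong"),
    (re.compile(r"_(.+?)_"), "em"),
]


def _append_text(parent, text):
    """Attach plain text to the tree: as .text, or as the last child's .tail."""
    if not text:
        return
    if len(parent):
        last = parent[-1]
        last.tail = (last.tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def parse_inline(text, parent):
    """Add the inline content of `text` to `parent`, recursing into matches."""
    best = None
    for regex, tag in PATTERNS:
        match = regex.search(text)
        # Keep the match that starts earliest; on a tie the first pattern wins.
        if match and (best is None or match.start() < best[0].start()):
            best = (match, tag)
    if best is None:
        _append_text(parent, text)
        return
    match, tag = best
    _append_text(parent, text[:match.start()])
    child = etree.SubElement(parent, tag)
    parse_inline(match.group(1), child)      # nested patterns inside the match
    parse_inline(text[match.end():], parent)  # whatever follows the match


def render(text):
    """text -> tree -> (X)HTML for a single paragraph."""
    paragraph = etree.Element("p")
    parse_inline(text, paragraph)
    return etree.tostring(paragraph, encoding="unicode")


print(render("**_bold and emphasized text_**"))
print(render("_**bold and emphasized text**_"))
```

Because the contents of every match are parsed again, the order in which the patterns are listed no longer decides which of the two variants works.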
So the first thing I did with Markdown was to develop a new mechanism for processing inline patterns, which also gave a great performance boost. I also ported Python Markdown from NanoDOM to ElementTree. We considered adding lxml support as well, since it has the same API as ElementTree and, according to this benchmark, is much faster than ElementTree. But when I ran some tests, lxml had only a 5% speed advantage, while ElementTree used half as much memory. So I dropped lxml support for now. Of course, the tests used cElementTree, which is ElementTree implemented in C; the pure Python ElementTree is a little faster than NanoDOM and also uses less memory. For now we import ElementTree like this:
try:
    # Python 2.5+
    import xml.etree.cElementTree as etree
except ImportError:
    try:
        # Python 2.5+
        import xml.etree.ElementTree as etree
    except ImportError:
        try:
            # normal cElementTree install
            import cElementTree as etree
        except ImportError:
            try:
                # normal ElementTree install
                import elementtree.ElementTree as etree
            except ImportError:
                message(CRITICAL,
                        "Failed to import ElementTree from any known place")
                sys.exit(1)
ElementTree has been included in the standard library since Python 2.5, but if you use an older version of Python you can install it yourself; ElementTree can be used with Python 1.5.2 and later.
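Since lxml exposes the same basic API as ElementTree, tree-building code does not need to care which of the two was imported. The snippet below is only a sketch of that idea, not code from Python Markdown: it assumes a modern Python, treats lxml as an optional third-party package, and builds a made-up two-item list.

```python
try:
    # Optional third-party package; used only if it happens to be installed.
    from lxml import etree
except ImportError:
    # Standard library fallback.
    import xml.etree.ElementTree as etree

# The same calls work with either module.
root = etree.Element("ul")
for label in ("First item", "Second item"):
    item = etree.SubElement(root, "li")
    item.text = label

# tostring() returns bytes by default in both libraries.
html = etree.tostring(root).decode("ascii")
print(html)
```

Swapping the backend is then a matter of changing the import, which is what makes it cheap to compare the libraries for speed and memory use.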
So the GSoC version of Python Markdown is now 20%-30% faster and has lower memory usage. There are some other changes, but I think I’ll write about them later. The latest GSoC version of Markdown can be found in the Git repository.