A pure-Java Markdown processor based on a parboiled PEG parser supporting a number of extensions

:>>> DEPRECATION NOTE <<<:

Although still one of the most popular Markdown parsing libraries for the JVM, pegdown has reached its end of life.

The project is essentially unmaintained with tickets piling up and crucial bugs not being fixed.
pegdown's parsing performance isn't great. In some cases of pathological input runtime can even become exponential, which means that the parser either appears to "hang" completely or aborts processing after a time-out.

Therefore pegdown is not recommended anymore for use in new projects requiring a markdown parser.
Instead I suggest you turn to @vsch's flexmark-java, which appears to be an excellent replacement for these reasons:

  • Modern parser architecture (based on commonmark-java), designed from the ground up as a pegdown replacement and supporting all its features and extensions
  • 30x better average parsing performance without pathological input cases
  • Configuration options for a multitude of markdown dialects (CommonMark, pegdown, MultiMarkdown, kramdown and Markdown.pl)
  • Actively maintained and used as the basis of an IntelliJ plugin with almost 2M downloads per year
  • The author (@vsch) has actively contributed to pegdown maintenance in the last two years and is intimately familiar with pegdown's internals and quirks.

In case you need support with migrating from pegdown to flexmark-java, @vsch welcomes inquiries.


Introduction

pegdown is a pure Java library for clean and lightweight Markdown processing based on a parboiled PEG parser.

pegdown is nearly 100% compatible with the original Markdown specification and fully passes the original Markdown test suite. On top of the standard Markdown feature set pegdown implements a number of extensions similar to what other popular Markdown processors offer. You can also extend pegdown with your own plugins! Currently pegdown supports the following extensions over standard Markdown:

  • SMARTS: Beautifies apostrophes, ellipses ("..." and ". . .") and dashes ("--" and "---")
  • QUOTES: Beautifies single quotes, double quotes and double angle quotes (« and »)
  • SMARTYPANTS: Convenience extension enabling both SMARTS and QUOTES at once.
  • ABBREVIATIONS: Abbreviations in the way of PHP Markdown Extra.
  • ANCHORLINKS: Generate anchor links for headers by taking the first range of alphanumerics and spaces.
  • HARDWRAPS: Alternative handling of newlines, see Github-flavoured-Markdown
  • AUTOLINKS: Plain (undelimited) autolinks the way Github-flavoured-Markdown implements them.
  • TABLES: Tables similar to MultiMarkdown (which is in turn like the PHP Markdown Extra tables, but with colspan support).
  • DEFINITION LISTS: Definition lists in the way of PHP Markdown Extra.
  • FENCED CODE BLOCKS: Fenced Code Blocks in the way of PHP Markdown Extra or Github-flavoured-Markdown.
  • HTML BLOCK SUPPRESSION: Suppresses the output of HTML blocks.
  • INLINE HTML SUPPRESSION: Suppresses the output of inline HTML elements.
  • WIKILINKS: Support [[Wiki-style links]] with a customizable URL rendering logic.
  • STRIKETHROUGH: Support strikethroughs as supported in Pandoc and Github.
  • ATXHEADERSPACE: Require a space between the # and the header title text, as per Github-flavoured-Markdown. Frees up # without a space to be just plain text.
  • FORCELISTITEMPARA: Wrap a list item or definition term in <p> tags if it contains more than a simple paragraph.
  • RELAXEDHRULES: Allow horizontal rules without a blank line following them.
  • TASKLISTITEMS: Parses bullet lists of the form * [ ] and * [x] to create GitHub-like task list items:
    • [ ] open task item
    • [x] closed or completed task item.
    • [X] also closed or completed task item.
  • EXTANCHORLINKS: Generate anchor links for headers using complete contents of the header.
    • Spaces and non-alphanumerics replaced by -, multiple dashes trimmed to one.
    • Anchor link is added as first element inside the header with empty content: <h1><a name="header"></a>header</h1>
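
The EXTANCHORLINKS naming rule can be illustrated with a few lines of plain Java. This is only a sketch of the rule as worded above, not pegdown's actual implementation: the class and method names are hypothetical, "alphanumerics" is assumed to mean ASCII letters and digits, and details the text does not state (case handling, leading or trailing dashes) are deliberately left untouched.

```java
// Hypothetical helper illustrating the EXTANCHORLINKS naming rule.
// NOT pegdown's implementation, only a sketch of the rule as stated.
final class AnchorNameSketch {

    // Replace every run of characters that are not ASCII letters or digits
    // (spaces included) with a single dash.
    static String anchorName(String headerText) {
        return headerText.replaceAll("[^A-Za-z0-9]+", "-");
    }

    public static void main(String[] args) {
        System.out.println(anchorName("Parsing Timeouts"));     // Parsing-Timeouts
        System.out.println(anchorName("IDE  Support & Tools")); // IDE-Support-Tools
    }
}
```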

Note: pegdown differs from the original Markdown in that it ignores in-word emphasis as in

> my_cool_file.txt
> 2*3*4=5

Currently this "extension" cannot be switched off.

Installation

You have two options:

  • Download the JAR for the latest version from here. pegdown 1.6.0 has only one dependency: parboiled for Java, version 1.1.7.

  • The pegdown artifact is also available from Maven Central with group id org.pegdown and artifact id pegdown.
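
With Maven, these coordinates translate into a dependency entry along the following lines. The version shown assumes the 1.6.0 release mentioned above; adjust it if you need a different one.

```xml
<dependency>
  <groupId>org.pegdown</groupId>
  <artifactId>pegdown</artifactId>
  <version>1.6.0</version>
</dependency>
```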

Usage

Using pegdown is very simple: Just create a new instance of a PegDownProcessor and call one of its markdownToHtml methods to convert the given Markdown source to an HTML string. If you'd like to customize the rendering of HTML links (Auto-Links, Explicit-Links, Mail-Links, Reference-Links and/or Wiki-Links), e.g. for adding rel="nofollow" attributes based on some logic, you can supply your own instance of a LinkRenderer with the call to markdownToHtml.

You can also use pegdown only for the actual parsing of the Markdown source and do the serialization to the target format (e.g. XML) yourself. To do this just call the parseMarkdown method of the PegDownProcessor to obtain the root node of the Abstract Syntax Tree for the document. With a custom Visitor implementation you can do whatever serialization you want. As an example you might want to take a look at the sources of the ToHtmlSerializer.

Note that the first time you create a PegDownProcessor it can take up to a few hundred milliseconds to prepare the underlying parboiled parser instance. However, once the first processor has been built all further instantiations will be fast. Also, you can reuse an existing PegDownProcessor instance as often as you want, as long as you prevent concurrent accesses, since neither the PegDownProcessor nor the underlying parser is thread-safe.
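
One common way to reuse instances without concurrent access is to keep one instance per thread. The following is a minimal sketch of that pattern using only the JDK; the PerThread class is hypothetical, and the type parameter T stands in for PegDownProcessor. With pegdown on the classpath the factory could presumably be PegDownProcessor::new, which is an assumption and not something this sketch exercises.

```java
import java.util.function.Supplier;

// Sketch: give every thread its own instance of a non-thread-safe object.
// T stands in for PegDownProcessor (assumption); nothing here depends on pegdown.
final class PerThread<T> {

    private final ThreadLocal<T> instances;

    PerThread(Supplier<T> factory) {
        // The factory runs once per thread, on that thread's first call to get().
        this.instances = ThreadLocal.withInitial(factory);
    }

    T get() {
        return instances.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // StringBuilder is only a convenient non-thread-safe stand-in.
        PerThread<StringBuilder> perThread = new PerThread<StringBuilder>(StringBuilder::new);
        StringBuilder mine = perThread.get();

        Thread other = new Thread(() ->
            System.out.println("other thread sees same instance: " + (perThread.get() == mine)));
        other.start();
        other.join();

        System.out.println("this thread sees same instance: " + (perThread.get() == mine));
    }
}
```

Each thread pays the instantiation cost once and then reuses its own instance, so no locking is needed around the conversion calls.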

See http://sirthias.github.com/pegdown/api for the pegdown API documentation.

Plugins

Since parsing and serialisation are two different things there are two different plugin mechanisms, one for the parser, and one for the ToHtmlSerializer. Most plugins would probably implement both, but it is possible that a plugin might just implement the parser plugin interface.

For the parser there are two plugin points, one for inline plugins (inside a paragraph) and one for block plugins. These are provided to the parser using the PegDownPlugins class. For convenience of use this comes with its own builder. You can either pass individual rules to this builder (which is what you probably would do if you were using Scala rules), or you can pass it a parboiled Java parser class which implements either InlinePluginParser or BlockPluginParser or both. PegDownPlugins will enhance this parser for you, so as a user of a plugin you just need to pass the class to it (and the arguments for that class's constructor, if any). To implement the plugin, you would write a normal parboiled parser and implement the appropriate parser plugin interface. You can also extend the pegdown parser, which is useful if you want to reuse any of its rules.

For the serializer there is the ToHtmlSerializerPlugin interface. It is called when the ToHtmlSerializer encounters a node that it doesn't know how to process (i.e. one produced by a parser plugin). Its accept method is passed the node, the visitor (so if the node contains child nodes they can be rendered using the parent) and the printer for the plugin to print to. The accept method returns true if it knew how to handle the node and false otherwise. The ToHtmlSerializer loops through the plugins and stops at the first one that returns true; if none does, it throws an exception, as it used to.
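
The dispatch over serializer plugins can be modelled in a few lines. The sketch below is a simplified stand-in, not pegdown's code: NodePlugin and render are hypothetical names, nodes are plain Objects, and a StringBuilder replaces the printer. The real accept method also receives the visitor.

```java
import java.util.Arrays;
import java.util.List;

// Simplified, hypothetical model of "first plugin that accepts the node wins".
final class PluginDispatchSketch {

    interface NodePlugin {
        // Returns true if this plugin handled the node, false otherwise.
        boolean accept(Object node, StringBuilder out);
    }

    static void render(Object node, List<NodePlugin> plugins, StringBuilder out) {
        for (NodePlugin plugin : plugins) {
            if (plugin.accept(node, out)) {
                return; // stop at the first plugin that handles the node
            }
        }
        // No plugin knew how to handle the node.
        throw new IllegalStateException("Don't know how to handle node " + node);
    }

    public static void main(String[] args) {
        NodePlugin integers = (node, out) -> {
            if (!(node instanceof Integer)) return false;
            out.append("<span class=\"int\">").append(node).append("</span>");
            return true;
        };
        StringBuilder html = new StringBuilder();
        render(42, Arrays.asList(integers), html);
        System.out.println(html); // <span class="int">42</span>
    }
}
```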

As a very simple example you might want to take a look at the sources of the PluginParser test class.

Parsing Timeouts

Since Markdown has no official grammar and contains a number of ambiguities the parsing of Markdown source, especially with enabled language extensions, can be "hard" and result, in certain corner cases, in exponential parsing time. In order to provide a somewhat predictable behavior pegdown therefore supports the specification of a parsing timeout, which you can supply to the PegDownProcessor constructor.

If the parser happens to run longer than the specified timeout period it terminates itself with an exception, which causes the markdownToHtml method to return null. Your application should then deal with this case accordingly and, for example, inform the user.

The default timeout, if not explicitly specified, is 2 seconds.
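
Handling the null result might look like the following sketch. It uses only the JDK: the Function parameter stands in for a call such as processor::markdownToHtml (an assumption, with the processor created elsewhere using whatever timeout you choose), and the class name, method name and fallback text are hypothetical.

```java
import java.util.function.Function;

// Sketch of null-safe handling around a converter that may return null.
final class TimeoutFallbackSketch {

    static String renderOrFallback(Function<String, String> toHtml,
                                   String markdown,
                                   String fallbackHtml) {
        String html = toHtml.apply(markdown);
        // null signals that no HTML was produced, e.g. because the timeout elapsed
        return html != null ? html : fallbackHtml;
    }

    public static void main(String[] args) {
        Function<String, String> timedOut = source -> null; // simulates a timeout
        System.out.println(
            renderOrFallback(timedOut, "**bold**", "<p>Could not render this text.</p>"));
        // prints: <p>Could not render this text.</p>
    }
}
```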

IDE Support

The excellent idea-markdown plugin for IntelliJ IDEA, RubyMine, PhpStorm, WebStorm, PyCharm and AppCode uses pegdown as its underlying parsing engine. The plugin gives you proper syntax highlighting for Markdown source and shows you exactly how pegdown will parse your texts.

Credits

A large part of the underlying PEG grammar was developed by John MacFarlane and made available with his tool peg-markdown.

License

pegdown is licensed under Apache License 2.0.

Patch Policy

Feedback and contributions to the project, no matter what kind, are always very welcome. However, patches can only be accepted from their original author. Along with any patches, please state that the patch is your original work and that you license the work to the pegdown project under the project’s open source license.

Comments
  • Extremely slow parsing for certain pathological input

    This appears to trigger exponential parsing time:

    Blaa
    
    **SELECT @default_late_fees =
         CASE @default_days_out % @default_product_price_days_out
                   <br>WHEN 0 <br>THEN (@default_days_out / @default_product_price_days_out) * @default_amount 
                   <br>ELSE (FLOOR(@default_days_out / @default_product_price_days_out) + 1) * @default_amount 
              <br> END**
    
    Thanks,
    <br>S
    
    Bug 
    opened by sirthias 15
  • Github Flavored Markdown Fenced Code Blocks

    It would be exceptionally great if the plugin supported Github Flavored Markdown's Fenced Code Blocks as well as the PHP Markdown Extra's approach.

    Github Flavored Markdown allows:

    ```scala
    class Example(name: String) {
      val field: Option[Int] = None
    }
    ```
    

    To render as:

    class Example(name: String) {
      val field: Option[Int] = None
    }
    

    It would be awesome if pegdown could treat blocks like this the same as:

    ~~~
    class Example(name: String) {
      val field: Option[Int] = None
    }
    ~~~
    

    and render as:

    class Example(name: String) {
      val field: Option[Int] = None
    }
    

    Syntax highlighting is a bigger deal and IMHO better delegated to SyntaxHighlighter on the client-side.

    Pegdown is great and thank you for building it :)

    Improvement 
    opened by AlainODea 15
  • Updated taglist

    Replaces PR #164

    I sorted the tag lists and added a test to check a few tags. An AST test is used because HTML tests are passed through jTidy, which doesn't accept HTML5 tags.

    opened by Deraen 14
  • Added Code Syntax parsing to CodeNode

    Added code syntax parsing to CodeNode. The main change was to add parsing after the ``` for syntax.

    So

    ...
    

    will result in:

    <pre> <code data-code-syntax="java"> ... </code> </pre>

    opened by bradsdavis 14
  • Strong and emphasis tags in round brackets

    The strong and emphasis tags are not applied when they are used inside round brackets and there is no space between the bracket and the markdown tag.

    For example (_test_) is converted to <p>(_test_)</p> and similarly for **/strong tag.

    opened by md384 12
  • Process images with LinkRenderer (fixes #95)

    The proposed patch allows interception of image tag generation with the LinkRenderer allowing the same type of manipulation flexibility as already offered for explicit links, wiki links, etc.

    opened by gitblit 12
  • Invalid nested list rendering

    In a nested mixed-type list, numbered lists do not render correctly, and indentation is missing. Screenshot of the problem (expected vs actual):

    The example input:

    1. Numbered item 1
     * Sub-list
    1. Numbered item 2
     * Sub-list
    1. Numbered item 3
     * Sub-list
    

    Results in the following output:

    <ol>
        <li>Numbered item 1</li>
    </ol>
    <ul>
        <li>Sub-list</li>
    </ul>
    <ol>
        <li>Numbered item 2</li>
    </ol>
    <ul>
        <li>Sub-list</li>
    </ul>
    <ol>
        <li>Numbered item 3</li>
    </ol>
    <ul>
        <li>Sub-list</li>
    </ul></div>
    

    Compare with expected output (using Github as an example):

    <ol>
        <li>Numbered item 1
            <ul>
                <li>Sub-list</li>
            </ul>
        </li>
        <li>Numbered item 2
            <ul>
                <li>Sub-list</li>
            </ul>
        </li>
        <li>Numbered item 3
            <ul>
                <li>Sub-list</li>
            </ul>
        </li>
    </ol>
    

    In both HTML cases I changed whitespace indentation to be consistent for comparison. Note that changing the input's sub-list indentation from spaces to tabs works.

    Originally reported as a bug in Atlassian Stash: https://jira.atlassian.com/browse/STASH-3702

    opened by kofalt 10
  • A few fixes accumulated from my plugin releases.

    …the header as part of the anchor name generator text, text accumulated, special chars ignored. Fix #194

    • fix #190, two or more blank lines between list items would break the list into separate lists
    • fix #191, completed task list items could have capital X not just lowercase. [X] and [x] are equivalent.
    • fix #192, paragraph wrapping for list items would produce an AST different from how blank lines before list items. Now identical.
    • fix #193, change private to protected methods in ToHtmlSerializer so that it can be extended without duplicating code

    opened by vsch 9
  • Fenced code block must be preceded by an empty line

    I'm using a PegDownProcessor created with Extensions.ALL as a replacement for ruby code that used the github-markdown gem. Most things work seamlessly but there's a difference in how fenced code blocks are handled. This text:

    
    Code example
    
    ```ruby
    a = b
    ```
    

    formats OK using PegDown and the resulting html contains a <code class="ruby"> element. This text however:

    
    Code example
    ```ruby
    a = b
    ```
    

    will instead yield <code>ruby, i.e. the 'ruby' is no longer a class attribute; instead it becomes visible text in the output.

    The only difference between the two examples is that an empty line precedes the fenced block in the first one.

    opened by thallgren 9
  • Improved handling of emph/strong. Fixes #43, #65, #78

    • No more exponential parse times by the nature of emph/strong combined in PEG
    • Parsing of more complex (nested) emph/strong constructs is now really close to original Markdown
    opened by Elmervc 9
  • Adding extensible, type-specific serialization for code blocks.

    My goal was to allow special handling of specific code blocks. I think this is a pretty succinct solution. I toyed with the idea of using the ServiceLoader to automatically load any VerbatimSerializers registered on the classpath, but decided to go the more manual route. Let me know if you want me to add some more automated extension loading -- or if you have any other questions or concerns about it.

    opened by jbunting 9
  • Need help on confluence wiki link from pegdown to flexmark

    @sirthias, I am in the process of converting our pegdown code to flexmark. I am stuck on one issue: Confluence wiki link rendering. I need your help to reproduce our custom wiki link code exactly so that it works with flexmark as well. I raised the same issue with flexmark as well. CustomWikiLinkRenderUsingPegdown.txt

    You can find more details in that flexmark ticket as well. I know that pegdown is no longer supported, but this migration is crucial for us. I hope you can understand and help us.

    opened by rkanumola 0
  • Packrat Parsers resolve exponential parsing time

    Hi, I studied Packrat Parsing/PEG a bit. As I saw exponential parsing time, I thought that it might be resolved by packrat parsing.

    To emulate packrat parsing, I added @MemoMismatches to all Rule-returning methods that take no argument.

    As a result, the exponential parsing time problems such as #43 and #104 were resolved on my machine. Actually, it may be fine to annotate fewer methods with @MemoMismatches.

    opened by kmizu 2
  • How to Detect new lines

    Hi,

    I'm trying to create a custom pegdown serializer by implementing org.pegdown.ast.Visitor and I have a problem detecting new lines. For example, I can't distinguish between these two paragraphs:

    This is test.

    This is

    test.

    None of the org.pegdown.ast.Visitor interface methods can detect a new line. Is there any way to catch this?

    I'm new to pegdown. Thanks in advance.

    opened by omidp 1
  • Rendering escaped dot in e-mail

    In PegDown 1.6.0

    This pegDownProcessor.markdownToHtml("someemail@gmail\\.com");

    does not remove the escaping; the result is

    [email protected]

    This some\[email protected]

    is rendered as

    some.[email protected]

    It works OK for non-link text: pegDownProcessor.markdownToHtml("someemailgmail\\.com");

    renders as

    someemailgmail.com

    I'm not sure if this is a bug or the dot in an e-mail is simply expected not to be escaped.

    EDIT: the processor is constructed with AUTOLINKS extension included: PegDownProcessor pegDownProcessor = new PegDownProcessor(Extensions.ALL)

    opened by ziacik 0
  • Nested lists are broken

    Pegdown 1.6.0 has problems rendering nested lists. This issue also occurs in 1.5.0, 1.4.0 and 1.3.0, so it appears to be a long-standing issue.

    * foo
      * bar
    
    <ul>
      <li>foo</li>
      <li>bar</li>
    </ul>
    

    Similarly:

    * foo
    
      bar
    
    <ul>
      <li>foo</li>
    </ul>
    <p>bar</p>
    

    However, if I double the indentation, it works:

    * foo
        * bar
    
    <ul>
      <li>foo
        <ul>
          <li>bar</li>
        </ul>
      </li>
    </ul>
    

    Looking at the code, it seems like Pegdown treats indentation as either a tab or four spaces, but for lists any whitespace should be treated as indentation.

    opened by weavejester 4