The wonderful monadic world of Pandoc

Written by Dave Gurnell

Pandoc is a wonderful tool for converting between different markup formats. It can read documents written in Markdown, Textile, Word, and a few other input formats, and output them to a ridiculous number of output formats including HTML, PDF, and ePub. Users can also plug in filters that alter the document mid-translation. This is useful for things like inserting cross-references and reorganizing sections, and it provides an interesting look at some functional programming techniques.


This post was retroactively updated to Spandoc 0.2.1 on 20 March 2017.

We use Pandoc at Underscore to build our eBooks. We use several custom filters to do a variety of transforms, each originally hacked together by yours truly using Coffeescript and a lot of nasty mutable state. In this blog post I’ll investigate the types of filter we use and how they can be rewritten elegantly in Scala as pure functional transformations.

Playing with Pandoc

If you want to follow along at home, I recommend installing Pandoc using your favorite package manager. On OS X you can type the following. I imagine there are similar incantations on Linux and Windows:

bash$ brew install pandoc

Let’s see Pandoc in action by writing a short document on standard input. If we use a command like pandoc --to=<FORMAT>, Pandoc will read from standard input and print to standard output. Here’s an example that renders a document as HTML:

bash$ pandoc --to=html
Hello world

- Step 1
- Step 2
- Step 3
^D

<p>Hello world</p>
<ul>
<li>Step 1</li>
<li>Step 2</li>
<li>Step 3</li>
</ul>

If we use pandoc --to=latex, our document is rendered as LaTeX rather than HTML:

bash$ pandoc --to=latex
Hello world

- Step 1
- Step 2
- Step 3
^D

\begin{itemize}
\tightlist
\item
  Step 1
\item
  Step 2
\item
  Step 3
\end{itemize}

When Pandoc reads an input file, it parses it into an internal abstract syntax tree (AST). We can write filters that operate on the AST before it is written out to the target format. Pandoc is written in Haskell, but we can interoperate from other languages by implementing our filters as scripts that read the AST as JSON on standard input and write the modified AST to standard output. Here’s an example of the JSON dialect used:

bash$ pandoc --to=json
Hello world

- Step 1
- Step 2
- Step 3
^D

[
  {"unMeta": {}},
  [
    {"t": "Para", "c": [
      {"t": "Str",   "c": "Hello"},
      {"t": "Space", "c": []},
      {"t": "Str",   "c": "world"}]},
    {
      "t": "BulletList",
      "c": [
        [{"t": "Plain", "c": [
          {"t": "Str",   "c": "Step"},
          {"t": "Space", "c": []},
          {"t": "Str",   "c": "1"}
        ]}],
        [{"t": "Plain", "c": [
          {"t": "Str",   "c": "Step"},
          {"t": "Space", "c": []},
          {"t": "Str",   "c": "2"}
        ]}],
        [{"t": "Plain", "c": [
          {"t": "Str",   "c": "Step"},
          {"t": "Space", "c": []},
          {"t": "Str",   "c": "3"}
        ]}]
      ]
    }
  ]
]

The output is verbose but uniform. Every AST node has a field t storing its type and a field c storing its contents (a string for Str, an array of fields for most other nodes). We can write a program in any language that reads and writes JSON in this format, and tell Pandoc to use it with the --filter command line switch. For example, the tee command copies standard input to standard output, so we can use it as a no-op filter:

bash$ echo 'Hello world' | pandoc --filter=tee --to=html
<p>Hello world</p>
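
Because every node shares the same t/c shape, a filter can walk the tree generically without knowing every node type. The following is a minimal sketch of that idea in Scala. The Json, JStr, JArr, and JObj types are a hypothetical stand-in for a real JSON library, and the node and strings helpers are invented for illustration; none of them are part of Pandoc or Spandoc.

```scala
object JsonShape {
  // A deliberately tiny JSON model (hypothetical, not a real JSON library).
  sealed trait Json
  final case class JStr(value: String) extends Json
  final case class JArr(items: List[Json]) extends Json
  final case class JObj(fields: Map[String, Json]) extends Json

  // Build a node in the {"t": ..., "c": ...} shape shown above.
  def node(t: String, c: Json): Json =
    JObj(Map("t" -> JStr(t), "c" -> c))

  // Collect the text of every Str node, depth first.
  def strings(json: Json): List[String] = json match {
    case JObj(fields) =>
      (fields.get("t"), fields.get("c")) match {
        // A Str node carries its text directly in "c".
        case (Some(JStr("Str")), Some(JStr(text))) => List(text)
        // Any other object: keep searching inside its fields.
        case _ => fields.values.toList.flatMap(strings)
      }
    case JArr(items) => items.flatMap(strings)
    case JStr(_)     => Nil
  }
}
```

Calling JsonShape.strings on a model of the "Hello world" paragraph yields the two words in document order. A real filter would also need to parse and print JSON, which this sketch leaves out.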

Pandoc + Scala = Spandoc

I developed a little library called Spandoc that converts the unwieldy JSON ASTs to a “simple” Scala ADT. We can use Spandoc with Ammonite to create filters easily, without having to touch JSON directly from our code. Here’s the most basic of examples: a script that echoes the AST to standard error:

#!/usr/bin/env amm

interp.load.ivy("com.davegurnell" %% "spandoc" % "0.2.1")

@

import spandoc._

transformStdin { ast =>
  Console.err.println(ast)
  ast
}

Spandoc represents a Pandoc document as a case class of type Pandoc. You can see the full data type in the Spandoc source code. The transformStdin method accepts a function of type Pandoc => Pandoc and applies it to standard input: it reads the incoming AST as JSON, converts it to an instance of Pandoc, passes it through the user-supplied transform, serializes the result back to JSON, and prints it to standard out. Our script doesn’t actually change the document, but it does print the AST to standard error as a side effect:

bash$ pandoc --filter=./echo.amm --to=html
Hello world

- Step 1
- Step 2
- Step 3
^D

Pandoc(
  Meta(Map()),
  List(
    Para(List(Str(Hello), Space, Str(world))),
    BulletList(List(
      ListItem(List(Plain(List(Str(Step), Space, Str(1))))),
      ListItem(List(Plain(List(Str(Step), Space, Str(2))))),
      ListItem(List(Plain(List(Str(Step), Space, Str(3)))))))))

<p>Hello world</p>
<ul>
<li>Step 1</li>
<li>Step 2</li>
<li>Step 3</li>
</ul>
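
The read, decode, transform, encode, print pipeline described for transformStdin can be sketched in a few lines. This is not Spandoc’s implementation: Doc, decode, encode, transformString, and transformStdinSketch are hypothetical names, and the "wire format" here is plain lines of text rather than Pandoc’s JSON, purely to keep the example self-contained.

```scala
object PipelineSketch {
  // Hypothetical document type: just the lines of the input.
  final case class Doc(lines: List[String])

  // Stand-ins for JSON decoding and encoding.
  // The -1 limit keeps trailing empty strings so the round trip is lossless.
  def decode(raw: String): Doc = Doc(raw.split("\n", -1).toList)
  def encode(doc: Doc): String = doc.lines.mkString("\n")

  // The pure core: decode, apply the user's transform, encode.
  def transformString(raw: String)(f: Doc => Doc): String =
    encode(f(decode(raw)))

  // The impure shell: read all of standard input, print the result.
  def transformStdinSketch(f: Doc => Doc): Unit = {
    val raw = scala.io.Source.stdin.mkString
    print(transformString(raw)(f))
  }
}
```

Keeping the pure core separate from the I/O shell is a design choice in this sketch: the transform can be tested on strings without touching standard input.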

Now that we know how Pandoc and Spandoc work, let’s look at some simple filters we can use to get stuff done.

Context-free filters

The simplest filters are ones that implement a consistent mapping over the entire AST. For example, we can uppercase an entire document by rewriting the contents of Str nodes (the nodes that contain the body text):

#!/usr/bin/env amm

interp.load.ivy("com.davegurnell" %% "spandoc" % "0.2.1")

@

import spandoc._
import spandoc.transform.TopDown

val filter = TopDown.inline {
  case Str(text) =>
    Str(text.toUpperCase)
}

transformStdin(filter)

In this example, the TopDown.inline method creates a function of type Pandoc => Pandoc that copies the input tree, changing only the nodes matched by the partial function. We pass most nodes in the tree through unchanged. We only modify nodes of type Str, which contain the text in our document. For example, the uppercase filter would transform the following tree:

Pandoc(
  Meta(Map()),
  List(
    Para(List(Str(Hello), Space, Str(world))),
    BulletList(List(
      ListItem(List(Plain(List(Str(Step), Space, Str(1))))),
      ListItem(List(Plain(List(Str(Step), Space, Str(2))))),
      ListItem(List(Plain(List(Str(Step), Space, Str(3)))))))))

to an upper case result:

Pandoc(
  Meta(Map()),
  List(
    Para(List(Str(HELLO), Space, Str(WORLD))),
    BulletList(List(
      ListItem(List(Plain(List(Str(STEP), Space, Str(1))))),
      ListItem(List(Plain(List(Str(STEP), Space, Str(2))))),
      ListItem(List(Plain(List(Str(STEP), Space, Str(3)))))))))

resulting in an extremely shouty version of the input document:

bash$ pandoc --filter=./uppercase.amm --to=html
Hello world

- Step 1
- Step 2
- Step 3
^D

<p>HELLO WORLD</p>
<ul>
<li>STEP 1</li>
<li>STEP 2</li>
<li>STEP 3</li>
</ul>
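
To see how a partial function can be lifted into a rewrite of a whole tree, here is a minimal sketch on a toy AST. The Inline and Block types below are cut-down, hypothetical versions of the real ones, and topDownInline is an illustrative helper rather than Spandoc’s TopDown.inline.

```scala
object TopDownSketch {
  // A toy AST with far fewer node types than Pandoc's.
  sealed trait Inline
  final case class Str(text: String) extends Inline
  case object Space extends Inline
  final case class Emph(inlines: List[Inline]) extends Inline

  sealed trait Block
  final case class Para(inlines: List[Inline]) extends Block
  final case class BulletList(items: List[List[Block]]) extends Block

  // Lift a partial function on Inline nodes into a rewrite of a whole Block.
  def topDownInline(pf: PartialFunction[Inline, Inline]): Block => Block = {
    def inline(node: Inline): Inline =
      // Rewrite this node first (unmatched nodes pass through unchanged),
      // then descend into the children of the result.
      pf.applyOrElse(node, (n: Inline) => n) match {
        case Emph(children) => Emph(children.map(inline))
        case other          => other
      }

    def block(node: Block): Block = node match {
      case Para(inlines)     => Para(inlines.map(inline))
      case BulletList(items) => BulletList(items.map(_.map(block)))
    }

    block
  }

  // The same uppercase rule as the script above, on the toy AST.
  val uppercase: Block => Block =
    topDownInline { case Str(text) => Str(text.toUpperCase) }
}
```

Applying TopDownSketch.uppercase to a paragraph uppercases every Str, including those nested inside Emph, and leaves Space nodes alone.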

The partial function in a TopDown.inline is of type PartialFunction[Inline, Inline], Inline being one of the two types of node that form the majority of the tree. Block nodes are the other main type. We can operate on them with a TopDown.block. For example, here’s a script that swaps bulleted lists for ordered lists and vice versa:

#!/usr/bin/env amm

interp.load.ivy("com.davegurnell" %% "spandoc" % "0.2.1")

@

import spandoc._
import spandoc.transform.TopDown

val filter = TopDown.block {
  case OrderedList(_, items) => BulletList(items)
  case BulletList(items)     => OrderedList(ListAttributes(1, Decimal, Period), items)
}

transformStdin(filter)

As you might expect, this causes the <ul> tags in our output to be replaced with <ol> tags and vice versa:

bash$ pandoc --filter=./swaplists.amm --to=html
Hello world

- Step 1
- Step 2
- Step 3
^D

<p>Hello world</p>
<ol style="list-style-type: decimal">
<li>Step 1</li>
<li>Step 2</li>
<li>Step 3</li>
</ol>

TopDown transforms are actually combinations of a TopDown.block and a TopDown.inline. In the examples above, the function we don’t specify simply passes all nodes through untouched. We can define a full transform as follows:

#!/usr/bin/env amm

interp.load.ivy("com.davegurnell" %% "spandoc" % "0.2.1")

@

import cats.Id
import spandoc._
import spandoc.transform.TopDown

object filter extends TopDown[Id] {
  def blockTransform = {
    // Block-level transformations
  }

  def inlineTransform = {
    // Inline transformations
  }
}

transformStdin(filter)

Context-sensitive filters

The filters we’ve seen so far are “context free”, in that they apply the same transformation to any given node independent of where that node appears in the tree. This is fine for trivial examples, but to do anything interesting we need context.

For example, rather than uppercasing the whole document, what if we wanted to uppercase just the headings? To do this we need to keep a note of whether we’re inside a heading.

One way of doing this would be by building two transforms, one for inside headings and one for outside. Another way would be using mutable state to track whether or not we’re inside a heading.
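
The first option, two separate transforms, might look like the following sketch. It uses a toy AST (this Header has no attributes, unlike Pandoc’s), and the names inside, outside, and uppercaseHeadings are made up for illustration.

```scala
object TwoTransformsSketch {
  // A toy AST, much smaller than Pandoc's.
  sealed trait Inline
  final case class Str(text: String) extends Inline
  case object Space extends Inline

  sealed trait Block
  final case class Para(inlines: List[Inline]) extends Block
  final case class Header(level: Int, inlines: List[Inline]) extends Block

  // Transform used outside headings: leaves every node alone.
  val outside: Inline => Inline = identity

  // Transform used inside headings: uppercases text nodes.
  val inside: Inline => Inline = {
    case Str(text) => Str(text.toUpperCase)
    case other     => other
  }

  // Pick the transform based on the kind of block we are in.
  def uppercaseHeadings(blocks: List[Block]): List[Block] =
    blocks.map {
      case Header(level, inlines) => Header(level, inlines.map(inside))
      case Para(inlines)          => Para(inlines.map(outside))
    }
}
```

This works for one level of context, but each new piece of context needs another pair of transforms, which gets unwieldy quickly.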

We’re going to look at another method, though, using the State monad. State allows us to thread a value along from step to step as we transform each node in the tree. Most steps ignore the state. Some read it. Others modify it. This lets us emulate mutable state without actually having any variables:

#!/usr/bin/env amm

interp.load.ivy("com.davegurnell" %% "spandoc" % "0.2.1")

@

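import cats.data.State
import cats.implicits._ // provides the .sequence syntax used below
import spandoc._
import spandoc.transform.TopDown

// The state we thread through the transform: are we inside a header?
type HeaderState[A] = State[Boolean, A]

object filter extends TopDown[HeaderState] {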
  def blockTransform = {
    case Header(level, attr, inlineNodes) =>
      for {
        _       <- State.set(true)
        inlines <- inlineNodes.map(this.apply).sequence
        _       <- State.set(false)
      } yield Header(level, attr, inlines)
  }

  def inlineTransform = {
    case str @ Str(text) =>
      State.inspect { inHeader =>
        if(inHeader) Str(text.toUpperCase) else str
      }
  }
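}

// Run the transform starting outside any header (initial state false).
// This assumes filter(ast) returns a HeaderState[Pandoc].
transformStdin(ast => filter(ast).runA(false).value)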

The HeaderState type shows we’re threading a Boolean through our computation. The State.inspect step in the inline transform for Str reads the current value and either transforms or leaves the text as appropriate. The State.set steps in the block transform for Header set the state to true and false when we enter and exit a header.

This is a pretty neat pattern that can be used to do some powerful things, including numbering headings and equations, and even reorganizing the sections of a book. See the examples folder in the source code for some more compelling use cases.
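
To make the idea of threading state concrete without any library, here is a minimal hand-rolled State and a heading-numbering pass built on it. It is a sketch only: this State is a simplified stand-in for cats.data.State (no stack safety, no type classes), the Block type is a toy, and number and numberAll are hypothetical names, not taken from the Spandoc examples.

```scala
object StateSketch {
  // A State is a function from the current state to (next state, result).
  final case class State[S, A](run: S => (S, A)) {
    def map[B](f: A => B): State[S, B] =
      State { (s: S) =>
        val (s1, a) = run(s)
        (s1, f(a))
      }

    def flatMap[B](f: A => State[S, B]): State[S, B] =
      State { (s: S) =>
        val (s1, a) = run(s)
        f(a).run(s1)
      }
  }

  object State {
    def pure[S, A](a: A): State[S, A] = State((s: S) => (s, a))
    def get[S]: State[S, S]           = State((s: S) => (s, s))
    def set[S](s: S): State[S, Unit]  = State((_: S) => (s, ()))

    // Run one step per element, left to right, collecting the results.
    def traverse[S, A, B](as: List[A])(f: A => State[S, B]): State[S, List[B]] =
      as.foldRight(pure[S, List[B]](Nil)) { (a, acc) =>
        for { b <- f(a); bs <- acc } yield b :: bs
      }
  }

  // A toy AST: headers and paragraphs carrying plain text.
  sealed trait Block
  final case class Header(level: Int, text: String) extends Block
  final case class Para(text: String) extends Block

  // Headers read and bump the counter; paragraphs ignore the state.
  def number(block: Block): State[Int, Block] = block match {
    case Header(level, text) =>
      for {
        n <- State.get[Int]
        _ <- State.set(n + 1)
      } yield (Header(level, s"${n + 1}. $text"): Block)
    case para: Para =>
      State.pure[Int, Block](para)
  }

  // Start the counter at zero and discard the final state.
  def numberAll(blocks: List[Block]): List[Block] =
    State.traverse[Int, Block, Block](blocks)(number).run(0)._2
}
```

No variable is ever reassigned here: each step returns the next counter value alongside its result, and flatMap passes it on to the following step.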