Indentation sensitive parsing the easy way
Rob Zinkov
2016-01-12
Recently, I had to write an parser for Hakaru. Writing parsers in Haskell is generally a treat as there are muliple parser libraries to choose from including Happy, parsec, attoparsec, megaparsec, trifecta, and many others. The trouble occurs when you want to parse and indentation-sensitive language like Python or Haskell. For that task the choices are more limited and far less documented. Which is unfortunate as my favorite library indentation of the bunch is the least documented. The following is how to use indentation
to write an indentation-sensitive parser.
For this tutorial, I will use indentation
and parsec
.
cabal install indentation parsec
To get started import Parsec as you normally would
module Indent.Demo where
import Data.Functor ((<$>), (<$))
import Control.Applicative (Applicative(..))
import qualified Control.Monad as M
import Data.Functor.Identity
import Data.Text (Text)
import qualified Data.Text as Text
import Text.Parsec hiding (Empty)
import Text.Parsec.Text () -- instances only
import qualified Text.Parsec.Expr as Ex
import qualified Text.Parsec.Token as Tok
And then add the following modules from indentation
import Text.Parsec.Indentation
import Text.Parsec.Indentation.Char
import qualified Text.Parsec.Indentation.Token as ITok
The key thing which needs to be changed is that the lexer needs to be indentation-sensitive. Sadly, there is no easy way to extend the existing LanguageDefs, so we make one from scratch.
style :: Tok.GenLanguageDef ParserStream st Identity
style = ITok.makeIndentLanguageDef $ Tok.LanguageDef
{ Tok.commentStart = ""
, Tok.commentEnd = ""
, Tok.nestedComments = True
, Tok.identStart = letter <|> char '_'
, Tok.identLetter = alphaNum <|> oneOf "_'"
, Tok.opStart = oneOf "!#$%&*+./<=>?@\\^|-~"
, Tok.opLetter = oneOf "!#$%&*+./<=>?@\\^|-~"
, Tok.caseSensitive = True
, Tok.commentLine = "#"
, Tok.reservedOpNames = [":"]
, Tok.reservedNames = ["def", "add"]
}
lexer :: Tok.GenTokenParser ParserStream () Identity
lexer = ITok.makeTokenParser style
Once you have an indentation-sensitive lexer, you can add the primitives you need in terms of it.
integer :: Parser Integer
integer = Tok.integer lexer
identifier :: Parser String
identifier = Tok.identifier lexer
reserved :: String -> Parser ()
reserved = Tok.reserved lexer
reservedOp :: String -> Parser ()
reservedOp = Tok.reservedOp lexer
parens :: Parser a -> Parser a
parens = Tok.parens lexer . localIndentation Any
commaSep :: Parser a -> Parser [a]
commaSep = Tok.commaSep lexer
All of these are boilerplate except for parens
. You will notice, for it we call localIndentation Any
before passing the input. This function indicates that indentation rules can be ignored when using this combinator. This gives parentheses the meaning they have in python which is to suspend indentation rules. We will go into more detail how the indentation primitives work, but for now let’s define AST for our language.
type Name = String
type Args = [Name]
type ParserStream = IndentStream (CharIndentStream String)
type Parser = ParsecT ParserStream () Identity
data Expr =
Func Name Args Expr
| Var Name
| App Expr [Expr]
| Add Expr Expr
| Lit Integer
deriving (Show)
Parsing this language doesn’t involve need to involve indentation rules
int :: Parser Expr
int = Lit <$> integer
add :: Parser Expr
add = reserved "add" *> (Add <$> expr <*> expr)
var :: Parser Expr
var = Var <$> identifier
app :: Parser Expr
app = App <$> var <*> parens (commaSep expr)
def :: Parser Expr
def = do
reserved "def"
name <- identifier
args <- parens (commaSep identifier)
body <- reservedOp ":" *> expr
return (Func name args body)
expr :: Parser Expr
expr = def
<|> try app
<|> try var
<|> try add
<|> int
<|> parens expr
Let’s add some helper code.
indentConfig :: String -> ParserStream
indentConfig =
mkIndentStream 0 infIndentation True Ge . mkCharIndentStream
parse :: String -> Either ParseError [Expr]
parse =
runParser (many expr <* eof) () "[input]" . indentConfig
And this parses programs just fine.
test1 = unlines
[ "def foo(x,y):"
, " add x y"
]
parse test1
-- Right [Func "foo" ["x","y"] (Add (Var "x") (Var "y"))]
The issue is also things which feel invalid.
test2 = unlines
[ "def foo(x,y):"
, "add x y"
]
parse test2
-- Right [Func "foo" ["x","y"] (Add (Var "x") (Var "y"))]
We need to change def
so that its body must be indented at strictly greater than character where it starts.
blockExpr :: Parser Expr
blockExpr = reservedOp ":" *> localIndentation Gt expr
def :: Parser Expr
def = do
reserved "def"
name <- identifier
args <- parens (commaSep identifier)
body <- blockExpr
return (Func name args body)
If you now look, we have defined a function for the body, blockExpr
, which says we must have the body strictly greater. Now when we parse test2
we get the following.
parse test2
-- Left "[input]" (line 2, column 2):
-- expecting identifier
--
-- Invalid indentation.
-- Found a token at indentation 1.
-- Expecting a token at an indentation greater than or equal to 2.
-- IndentStream { indentationState =
-- IndentationState { minIndentation = 2
-- , maxIndentation = 9223372036854775807
-- , absMode = False
-- , tokenRel = Ge}
-- , tokenStream = ""}
localIndentation
takes two arguments, what the indentation of an expression should be relative to the current indentation, and the expression itself. Relative indentations can be greater-than and equal (Ge), strictly greater-than (Gt), equal (Eq), a specific amount (Const 5), or anything (Any).
While it seems like this is the only primitive you should need, sometimes the indentation level you want can’t be defined in terms of the parent.
For example, the following is a valid program
test3 = unlines
[ "def foo(x, y):"
, " add x"
, " y"
]
parse test4
-- Right [Func "foo" ["x","y"] (Add (Var "x") (Var "y"))]
The issue is that “y” is indented greater than the “def” but, we really want it to be indented in terms of “add”. To do this we need to use absolute indentation. This mode says indentation is defined in terms of the first token parsed, and all indentation rules apply in terms of where that first token is found.
absBlockExpr :: Parser Expr
absBlockExpr = reservedOp ":" *> localIndentation Gt (absoluteIndentation expr)
def :: Parser Expr
def = do
reserved "def"
name <- identifier
args <- parens (commaSep identifier)
body <- absBlockExpr
return (Func name args body)
We define a function absBlockExpr. You’ll notice we also used a localIndentation
. The reason for that is absolutionIndentation
normally defaults to the first token of the parent. In our case, this is def
and we want instead for it to choose add
.
parse test3
-- Left "[input]" (line 3, column 3):
-- expecting identifier
--
-- Invalid indentation.
-- Found a token at indentation 2.
-- Expecting a token at an indentation greater than or equal to 5.
-- IndentStream { indentationState =
-- IndentationState { minIndentation = 5
-- , maxIndentation = 5
-- , absMode = False
-- , tokenRel = Ge}
-- , tokenStream = ""}
Now it works as expected
test4 = unlines
[ "def foo(x, y):"
, " add x"
, " y"
]
parse test4
-- Right [Func "foo" ["x","y"] (Add (Var "x") (Var "y"))]
parse test1
-- Right [Func "foo" ["x","y"] (Add (Var "x") (Var "y"))]
This library has other bits to it, but this should give enough to figure out, how to add indentation sensitivity to your language.
Special thanks to Aleksey Kliger for helping me understand this library.