.. _parserLibraryWhitespace:

Whitespace and Comments
=======================

The previous two pages introduced the lexer and parser but have not yet
discussed how to handle whitespace and comments.

In some grammars whitespace and comments would not be considered significant and
so we might be tempted not to generate any tokens for them. However, in the
running example, we may want to make spaces significant in the future. For
example, we might want to implement function application expressed by
juxtaposition as in Haskell and Idris like this: ``f x``.

So, on this page, we will add a token for whitespace and comments. We will
then consider two ways to process these tokens:

- Filter all whitespace tokens from the lexer results before passing them to
  the parser.
- Process the whitespace tokens in the parser.

Whitespace and Comments in Lexer
--------------------------------

To start with we can use the same token for both whitespace and comments. Here
it is called ``Comment`` and added to the ``ExpressionToken`` data structure
like this:

.. code-block:: idris

  public export
  data ExpressionToken = Number Integer
           | Operator String
           | OParen
           | CParen
           | Comment String
           | EndInput

This is added to the ``TokenMap`` like this:

.. code-block:: idris

  ||| from https://github.com/edwinb/Idris2/blob/master/src/Parser/Lexer.idr
  comment : Lexer
  comment = is '-' <+> is '-' <+> many (isNot '\n')

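  expressionTokens : TokenMap ExpressionToken
  expressionTokens =
    [(digits, \x => Number (toInt' x)),
     -- The lexer uses the first entry that matches, so `comment` is listed
     -- before `operator` in case `operator` also accepts '-'.
     (comment, Comment),
     (operator, \x => Operator x),
     (is '(' ,\x => OParen),
     (is ')' ,\x => CParen),
     (spaces, Comment)]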

As you can see, the comment is defined like a single-line Idris comment:
it starts with ``--`` and continues for the remainder of the line.

We don't need to define ``spaces`` because it is already defined in
https://github.com/idris-lang/Idris-dev/blob/master/libs/contrib/Text/Lexer.idr
like this:

.. code-block:: idris

  ||| Recognise a single whitespace character
  ||| /\\s/
  space : Lexer
  space = pred isSpace

  ||| Recognise one or more whitespace characters
  ||| /\\s+/
  spaces : Lexer
  spaces = some space

If these whitespace tokens are not significant, that is, they are optional
and can appear anywhere, then we can filter them out before the parser
sees them, like this:

.. code-block:: idris

  processWhitespace : (List (TokenData ExpressionToken), Int, Int, String)
                  -> (List (TokenData ExpressionToken), Int, Int, String)
  processWhitespace (x,l,c,s) = ((filter notComment x),l,c,s) where
      notComment : TokenData ExpressionToken -> Bool
      notComment t = case tok t of
                          Comment _ => False
                          _ => True

  calc : String -> Either (ParseError (TokenData ExpressionToken))
                        (Integer, List (TokenData ExpressionToken))
  calc s = parse expr (fst (processWhitespace (lex expressionTokens s)))

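To see the effect of the filter in isolation, here is a small self-contained
sketch that is not part of the running example. It uses a hypothetical token
type ``Tok`` and relies only on ``lex``, ``digits``, ``spaces`` and the ``tok``
field from ``Text.Lexer``. It assumes the ``contrib`` package is enabled (for
example with ``idris -p contrib``).

.. code-block:: idris

  module Main

  import Text.Lexer

  ||| Hypothetical token type, used only for this illustration.
  data Tok = Num String | Blank String

  tokMap : TokenMap Tok
  tokMap = [(digits, Num), (spaces, Blank)]

  ||| True for tokens that carry no meaning for the parser.
  isBlank : TokenData Tok -> Bool
  isBlank t = case tok t of
                   Blank _ => True
                   _ => False

  main : IO ()
  main = do
    -- `lex` returns the tokens paired with the position and remaining input.
    let toks = fst (lex tokMap "1  2 3")
    putStrLn ("all tokens:      " ++ show (length toks))
    putStrLn ("after filtering: " ++ show (length (filter (not . isBlank) toks)))

Because ``spaces`` matches a whole run of whitespace, each gap between the
numbers should become a single ``Blank`` token, and only the ``Num`` tokens
remain after filtering.
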

Whitespace and Comments in Parser
---------------------------------

If whitespace is sometimes significant then we can't filter the ``Comment``
tokens out as above. Instead they are passed to the parser, which now needs
a rule to handle them:

.. code-block:: idris

  commentSpace : Rule Integer
  commentSpace = terminal (\x => case tok x of
                           Comment s => Just 0
                           _ => Nothing)

So far we don't have any syntax that requires spaces to be significant, so we
need to define the grammar so that it will parse with, or without, spaces.
This needs to be done in a systematic way. Here I have defined the grammar so
that there is an optional space to the right of every atom or operator.
First add versions of ``intLiteral``, ``openParen``, ``closeParen``
and ``op`` that allow an optional space or comment to the right of them:

.. code-block:: idris

  intLiteralC : Rule Integer
  intLiteralC = (intLiteral <* commentSpace) <|> intLiteral

  openParenC : Rule Integer
  openParenC = (openParen <* commentSpace) <|> openParen

  closeParenC : Rule Integer
  closeParenC = (closeParen <* commentSpace) <|> closeParen

  ||| like op but followed by optional comment or space
  opC : String -> Rule Integer
  opC s = ((op s) <* commentSpace) <|> (op s)

Then just use these functions instead of the original functions:

.. code-block:: idris

  expr : Rule Integer

  factor : Rule Integer
  factor = intLiteralC <|> do
                openParenC
                r <- expr
                closeParenC
                pure r

  term : Rule Integer
  term = map multInt factor <*> (
          (opC "*")
          *> factor)
       <|> factor

  expr = map addInt term <*> (
          (opC "+")
          *> term)
       <|> map subInt term <*> (
          (opC "-")
          *> term)
       <|> term

  calc : String -> Either (ParseError (TokenData ExpressionToken))
                        (Integer, List (TokenData ExpressionToken))
  calc s = parse expr (fst (lex expressionTokens s))

  lft : (ParseError (TokenData ExpressionToken)) -> IO ()
  lft (Error s lst) = putStrLn ("error:"++s)

  rht : (Integer, List (TokenData ExpressionToken)) -> IO ()
  rht i = putStrLn ("right " ++ (show i))

  main : IO ()
  main = do
    putStr "alg>"
    x <- getLine
    either lft rht (calc x) -- eliminator for Either

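One limitation of the rules above is that each accepts at most one ``Comment``
token, and only to the right of an atom or operator. Since ``spaces`` and
``comment`` produce separate tokens, a space followed by a comment gives two
``Comment`` tokens in a row, and input that starts with a space begins with a
``Comment`` token. The following is a minimal sketch of one way to relax this.
It is not part of the original example. It assumes that ``Rule`` is polymorphic
in its result type (as in ``Rule ty = Grammar (TokenData ExpressionToken) True ty``)
and that ``many`` is the combinator from ``Text.Parser``. The names ``lexeme``
and ``exprLeading`` are hypothetical.

.. code-block:: idris

  ||| Hypothetical helper: run `p`, then skip zero or more trailing
  ||| whitespace/comment tokens.
  lexeme : Rule a -> Rule a
  lexeme p = p <* many commentSpace

  ||| Hypothetical top-level rule: skip leading whitespace/comment tokens,
  ||| then parse an expression.
  exprLeading : Rule Integer
  exprLeading = many commentSpace *> expr

With these, ``intLiteralC`` could be written as ``lexeme intLiteral``, and
``calc`` could parse with ``exprLeading`` in place of ``expr``.
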

Defining Block Structure using Indents
--------------------------------------

Many languages such as Python, Haskell and Idris use indents to delimit
the block structure of the language.

We can see how Idris2 does it in
https://github.com/edwinb/Idris2/blob/master/src/Parser/Support.idr:

.. code-block:: idris

  export
  IndentInfo : Type
  IndentInfo = Int

  export
  init : IndentInfo
  init = 0

EndInput Token
--------------

So far, the parser will return a successful result even if the full input
is not consumed. To ensure that the top level syntax is fully matched we
add an ``EndInput`` token to mark the end of the token list.

``EndInput`` is added to the other tokens like this:

.. code-block:: idris

  data ExpressionToken = Number Integer
           | Operator String
           | OParen
           | CParen
           | Comment String
           | EndInput

A rule to consume this token is added:

.. code-block:: idris

  eoi : Rule Integer
  eoi = terminal (\x => case tok x of
                           EndInput => Just 0
                           _ => Nothing)

Instead of using ``expr`` at the top level of the syntax we can now define
``exprFull`` as shown here. The parse then succeeds only if ``EndInput``
is consumed:

.. code-block:: idris

  exprFull : Rule Integer
  exprFull = expr <* eoi

The following code makes sure ``EndInput`` is added to the end of the token
list:

.. code-block:: idris

  processWhitespace : (List (TokenData ExpressionToken), Int, Int, String)
                  -> (List (TokenData ExpressionToken), Int, Int, String)
  processWhitespace (x,l,c,s) = ((filter notComment x)++
                                      [MkToken l c EndInput],l,c,s) where
      notComment : TokenData ExpressionToken -> Bool
      notComment t = case tok t of
                          Comment _ => False
                          _ => True

All we have to do now is use ``exprFull`` instead of ``expr``:

.. code-block:: idris

  calc : String -> Either (ParseError (TokenData ExpressionToken))
                        (Integer, List (TokenData ExpressionToken))
  calc s = parse exprFull (fst (processWhitespace (lex expressionTokens s)))
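
The version of ``processWhitespace`` above still filters out the ``Comment``
tokens. If the parser handles whitespace itself, as in the earlier section,
the same idea applies without the filter. The following is a sketch under that
assumption, with ``exprFull`` built from the whitespace-aware ``expr``. The
names ``addEndInput`` and ``calcWithSpaces`` are hypothetical and do not appear
in the original example.

.. code-block:: idris

  ||| Hypothetical variant: keep every token and append EndInput,
  ||| positioned at the line and column where lexing stopped.
  addEndInput : (List (TokenData ExpressionToken), Int, Int, String)
              -> List (TokenData ExpressionToken)
  addEndInput (x,l,c,s) = x ++ [MkToken l c EndInput]

  calcWithSpaces : String -> Either (ParseError (TokenData ExpressionToken))
                          (Integer, List (TokenData ExpressionToken))
  calcWithSpaces s = parse exprFull (addEndInput (lex expressionTokens s))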