ABNF Grammars in Elixir
Introduction
ABNF grammars are widely used on the Internet today. They serve as the basic building blocks for many widely used protocols, such as HTTP, SIP, SMTP, and FTP. They are also very useful for designing DSLs (Domain Specific Languages).
In this article we'll learn how to use a tool called ex_abnf to quickly create grammar parsers using the Elixir language.
What is ABNF
ABNF was originally defined in RFC2234, which was obsoleted by RFC4234, which in turn was obsoleted by RFC5234. RFC7405 updates it by adding case-sensitive string support.
ABNF (Augmented Backus-Naur Form) is a notation based on Backus-Naur Form (BNF), and its purpose is to define context-free grammars. It is commonly used to specify the syntax of some languages and of many protocols.
The ABNF specification defines how to write a set of rules so that a parser can reduce a given sequence of input symbols into something our code can use. For example, a rule named "email-address" could be defined as: email-address = local-part "@" domain-part
This rule tells us that an "email address" is composed of a "local part", followed by the "@" symbol, and then followed by the "domain part".
Other rules are involved as well. For example, the rule ALPHA defines a range of allowed characters that serves as the basis for the other rules. We can also specify how many repetitions of a rule we want by using the * symbol: "1*ALPHA" means "at least 1 ALPHA, but as many as needed".
The "domain-part" rule is telling us that a domain name is composed of one or more ALPHA characters, followed by zero or more of a group, that is composed of a dot followed by at least one ALPHA character. This means that allowed values are: "domain", "domain.com", "sub.domain.com", etc.
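To make the repetition rules concrete, here is a small Elixir sketch that checks a few strings against a regular expression equivalent to the "domain-part" rule. It is only an illustration: it assumes ALPHA means the letters A-Z and a-z (as in the RFC5234 core rules), and the regular expression is a stand-in for explanation, not how ex_abnf works internally.

```elixir
# domain-part = 1*ALPHA *("." 1*ALPHA)
#   1*ALPHA         -> [A-Za-z]+        one or more letters
#   *("." 1*ALPHA)  -> (\.[A-Za-z]+)*   zero or more groups of "." plus letters
domain_part = ~r/\A[A-Za-z]+(\.[A-Za-z]+)*\z/

for candidate <- ["domain", "domain.com", "sub.domain.com", "domain.", ".com"] do
  IO.puts("#{candidate}: #{Regex.match?(domain_part, candidate)}")
end
```

The first three candidates match. The last two do not, because every dot must be followed by at least one ALPHA and the name must start with an ALPHA.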
Our code could then parse the input "user@domain.com" and give us a map structure like:
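As a hypothetical illustration, the parsed result could be an Elixir map along these lines. The key names are assumptions made for this sketch, not necessarily the ones used by the original example, and the values are shown as char lists on the assumption that the parser works on char lists.

```elixir
# Hypothetical shape of the parsed value; key names are illustrative only.
parsed = %{local_part: ~c"user", domain_part: ~c"domain.com"}

# Each component is then directly addressable by name.
IO.inspect(parsed.local_part)
IO.inspect(parsed.domain_part)
```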
In the same way the input could be something very complex, like a SIP INVITE request, or an HTTP REQUEST, etc.
What is ex_abnf: ABNF and Elixir together
ex_abnf is a library for Elixir that you can use in your own applications and libraries to parse input according to ABNF grammars in a very simple way. You can also write your own code inside the grammar to reduce or transform the parsed input into something more useful (structs, for example).
The idea behind ex_abnf is to let you focus your efforts on implementing the language or protocol in question, instead of writing the parser.
ex_abnf implements the latest definition (RFC5234, with errata #3076 and #2968) and RFC7405.
ex_abnf has been used in a wide set of proprietary solutions, and there are also some open projects that use it, like:
- ex_rfc3986: A library for Elixir that parses URIs as defined in RFC3986
- ex_rfc3966: A library for Elixir that parses TEL URIs as defined in RFC3966
Installing
ex_abnf is available on GitHub at https://github.com/marcelog/ex_abnf and on hex.pm at https://hex.pm/packages/ex_abnf.
To use it in your application, add it as a dependency in your mix.exs file.
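As a sketch, the dependency entry could look like the fragment below. The version requirement is deliberately left open because the exact release is not stated here; check hex.pm for the current version and pin it.

```elixir
# Fragment of mix.exs (goes inside your project module).
defp deps do
  [
    # ">= 0.0.0" accepts any published release; replace it with a pinned
    # requirement once you know which version you want.
    {:ex_abnf, ">= 0.0.0"}
  ]
end
```

After editing the file, running mix deps.get fetches the package.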
How it works
ex_abnf is an interpreter rather than a code generator. Tools like lex and yacc generate code that parses a given input, while ex_abnf only builds a structure that represents the grammar and then efficiently applies that structure to a given input to produce the desired result.
The first step is to load a file by using the ABNF.load_file/1 function, which accepts a string with the path to a text file containing the ABNF grammar. This text file consists of two sections:
- Optional code section
- Grammar
NOTE: The grammar file (as defined in the RFC) MUST use newlines in the DOS format (\r\n).
When ex_abnf parses the grammar file, it automatically puts all the code it finds into a dynamically created module, so all the code is precompiled. The optional code section is useful for declaring helper functions or for requiring other modules that are used by code elsewhere in the grammar. For example, our previous example could also be written as:
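The following is a reconstruction under stated assumptions rather than the original listing. The rule bodies follow the description given earlier; the `!!!` delimiters around code are assumed to match the ones documented in the ex_abnf README; the helper name build_email/2, the map keys, and the file name are hypothetical; and the core rule ALPHA is assumed to be provided by the library. The grammar is wrapped in an Elixir script only because that is a convenient way to guarantee the \r\n line endings the note above requires.

```elixir
# ~S keeps the text raw (no escapes, no interpolation).
grammar_text = ~S"""
!!!
def build_email(local, domain) do
  %{local_part: local, domain_part: domain}
end
!!!

email-address = local-part "@" domain-part !!!
  [local, _at, domain] = string_values
  {:ok, state, build_email(local, domain)}
!!!
local-part    = 1*ALPHA
domain-part   = 1*ALPHA *("." 1*ALPHA)
"""

# Heredocs use "\n"; convert every line ending to "\r\n" before writing.
crlf_text = String.replace(grammar_text, "\n", "\r\n")

# Hypothetical location; any path readable by your application works.
path = Path.join(System.tmp_dir!(), "email.abnf")
File.write!(path, crlf_text)
```

The first block between `!!!` markers is the optional code section; the block attached to "email-address" is the code that runs when that rule matches.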
Once the file is loaded and parsed, a structure will be returned that you can later on use to apply the grammar to an input with the function ABNF.apply/4:
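A minimal sketch of such a call, wrapped in a module so it can live in an application, might look like this. It assumes ex_abnf is installed as a dependency; the module name EmailExample and the path priv/email.abnf are hypothetical.

```elixir
defmodule EmailExample do
  # Assumes priv/email.abnf (hypothetical path) contains the grammar,
  # saved with \r\n line endings.
  def parse(address) when is_binary(address) do
    grammar = ABNF.load_file("priv/email.abnf")

    # The input must be a char list, so convert the binary first.
    # The last argument is the initial state; an empty map is used here.
    ABNF.apply(grammar, "email-address", String.to_charlist(address), %{})
  end
end
```

In a real application you would probably load the grammar once and reuse it, instead of reloading the file on every call as this sketch does.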
The arguments for the function are:
- The grammar obtained after calling ABNF.load_file/1
- The name of the rule to use (in this case email-address)
- The input (as a char list)
- An optional last argument used as "state". The state is passed from rule to rule and must be returned in every rule result (you will notice that in the code for the rule "email-address" we return {:ok, state, something}; "state" is precisely this variable)
The result is a structure like:
Where:
- input: Is the original input for the rule.
- rest: What the rule couldn't parse, the leftovers, in this case everything was parsed.
- state: The state after the parsing is done.
- string_text: The value for this rule as text, what got parsed.
- string_tokens: Same as above, but for each component of the rule (in this case the rule is composed of three elements, "local-part", "@", and "domain-part").
- values: The values for this rule, in this case only one value is returned that is a map structure with the parsed elements.
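To show how these fields might be consumed, here is a small sketch. The ParseOutcome module is a hypothetical helper, not part of ex_abnf, and the map below is a stand-in built from the field names listed above (the example values and the map keys inside `values` are assumptions). The sketch also assumes that a rule which does not match yields nil.

```elixir
defmodule ParseOutcome do
  # Matches on plain map keys, so it accepts a struct with these fields
  # as well as the stand-in map used below.

  # Assumption: no match at all is reported as nil.
  def unwrap(nil), do: {:error, :no_match}

  # Nothing left over: the whole input was parsed, return the values.
  def unwrap(%{rest: [], values: values}), do: {:ok, values}

  # Some input was left unparsed: report the leftovers.
  def unwrap(%{rest: rest}), do: {:error, {:trailing_input, rest}}
end

# Stand-in for the result of parsing "user@domain.com".
result = %{
  input: ~c"user@domain.com",
  rest: ~c"",
  state: %{},
  string_text: ~c"user@domain.com",
  string_tokens: [~c"user", ~c"@", ~c"domain.com"],
  values: [%{local_part: ~c"user", domain_part: ~c"domain.com"}]
}

IO.inspect(ParseOutcome.unwrap(result))
```

Checking `rest` explicitly is a design choice worth considering: a rule can match only a prefix of the input, and the leftovers are the only sign of that.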
Your reduce code: What you get
You might have noticed that we used some variables in the code for the rule "email-address". ex_abnf makes the following variables available in those code chunks:
- state: The "global state". This is initially what you pass as the last argument in the call to ABNF.apply/4, and is nil by default. Your code must always return it.
- rule: This is the "text value" for the current rule.
- string_values: This is a list where each element represents, in order, the "text value" of each one of the tokens that compose the current rule.
- values: As above, but as a deep nested list of lists. The nesting level will highly depend on how the grammar is defined (groups, repetitions, etc).
Your reduce code: What you give
From your code you can return 4 different things:
- {:ok, state, value}: The parsing continues, the given state is used for the next rules, and the returned value is the one specified by "value".
- {:ok, state}: The parsing continues, the given state is used for the next rules, and the returned result is the value of the predefined variable "rule".
- {:error, some_error}: The parsing is aborted, and the given error is returned as the result.
- throw or raise an error
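The return shapes above can be illustrated with an ordinary Elixir function that mirrors what the code for a rule might do. Everything here is hypothetical: the rule `port = 1*DIGIT`, the module name, and the error atom are made up for the sketch. Inside a grammar file the function body would sit in the rule's code chunk, where `rule` and `state` are already defined.

```elixir
defmodule PortReduce do
  # `rule` is the matched text as a char list, `state` is the global state.
  def reduce(rule, state) do
    case List.to_integer(rule) do
      # Valid: continue parsing, store the port in the state, and use the
      # integer as this rule's value.
      port when port <= 65_535 -> {:ok, Map.put(state, :port, port), port}
      # Invalid: abort the parsing with an error.
      _ -> {:error, :port_out_of_range}
    end
  end
end

IO.inspect(PortReduce.reduce(~c"8080", %{}))
IO.inspect(PortReduce.reduce(~c"70000", %{}))
```

Returning {:ok, state} instead would keep the matched text itself as the rule's value.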
Example application: US Postal Address Parser
NOTE: All the code presented here is available at GitHub at: https://github.com/marcelog/ex_abnf_example.
Let's jump right into it by implementing a parser for the grammar given as an example in the ABNF Wikipedia entry. This grammar defines (in a very simple way) how US postal addresses are written. We start by creating our grammar file, available on GitHub at https://github.com/marcelog/ex_abnf_example/blob/master/priv/postal_code.abnf.
There are only 2 differences from the original grammar posted in the wiki:
- The syntax =/ used to define incremental alternatives for a rule is currently not supported by ex_abnf. However, we can rewrite that as "alternative1 / alternative2 / alternative3". This is used in the rule name-part.
- The original grammar has an ambiguity in the definition for the rule name-part, so it is rewritten to require at least one name before the suffix.
Here is the grammar file that includes a little bit of everything to achieve a complete parsing. Note how we define (just for fun) a macro and a "global" function used throughout the reduce code in many of the rules.
Also, all the predefined variables are used as a demo (state, rule, values, string_values).
A simple test shows us a simple use case:
The first element in result.values will then be:
Start using ABNF grammars in your Elixir code right away!
ex_abnf (although a very simple tool in its inner workings) has been tested with protocols like SDP, SIP, and others. It has proven really useful to copy and paste the grammar from an RFC and have a protocol working in a matter of minutes, hours, or days. Most importantly, that time is spent implementing the protocol or language itself instead of the parser. Hopefully it will be useful for you too! :)
Don't hesitate to open pull requests or issues with any suggestions, bug reports, or improvements needed!