

Create Domain-Specific Languages with ANTLR : Page 3

The latest version of ANTLR provides the tools you need to build a parser for special-purpose languages.

Building a Grammar with TDD
Basic SPARQL RDF graph queries take the form of:

SELECT ?subject WHERE { ?subject <#named> "Joe" . }

The SPARQL specification defines the language in Backus-Naur Form (BNF), a notation for representing context-free grammars: grammars that recognize syntactically correct sentences without making any judgments about their semantics. BNF describes the hierarchy of a language's constructs. Typically there is a top-level rule representing an entire sentence or sequence of sentences; it matches input that in turn matches lower-level rules. These lower-level rules break down further still, all the way down to tokens, the most atomic bits of the language.

The task here is to create an ANTLR grammar that can recognize legal SPARQL queries. As it happens, ANTLR grammars specify the rules of the languages they recognize in a BNF-like form, which usually makes for a smooth translation from a written language specification to an ANTLR grammar specification.

Why develop the ANTLR-based recognizer in a test-driven manner? If you mechanically translate the entire SPARQL BNF specification to ANTLR and only then try to test it, tracking down errors becomes difficult. When you present a legal SPARQL query to the recognizer and it fails, it can be hard to ascertain the source of the error: Is it your fault, ANTLR's fault, or both? Which rule is the problem? How can you be sure? By instead building up the ANTLR grammar bit by bit, testing repeatedly to validate any assumptions made along the way, you can be confident that the recognizer works as intended, and you'll find it much easier to detect when future changes invalidate those assumptions.

When building a recognizer in TDD fashion, first examine the language's specification and find the lowest-level rules—those with as few (if any) dependencies on other rules as possible. Typically these are the language tokens. When you have built enough of the ANTLR grammar to recognize these atomic bits, with tests to support your claims that the recognizer exhibits the desired behavior, you can tackle successively higher-level rules.

Before you begin writing lexer and parser rules, there are some things about ANTLR grammar files that you should know. The salient parts of the SPARQL grammar created for this article can illustrate:

// src/main/antlr/com/devx/sparql/Sparql.g
grammar Sparql;

options {
    output = AST;
    k = 1;
}

@header {
    package com.devx.sparql;
}

@lexer::header {
    package com.devx.sparql;
}

@members {
    protected void mismatch( IntStream input, int tokenType, BitSet follow )
        throws RecognitionException {

        throw new MismatchedTokenException( tokenType, input );
    }

    public void recoverFromMismatchedSet( IntStream input,
        RecognitionException ex, BitSet follow ) throws RecognitionException {

        throw ex;
    }
}

@rulecatch {
    catch ( RecognitionException ex ) {
        throw ex;
    }
}

ANTLR grammars typically reside in files with a .g suffix. The first declaration (grammar Sparql;) indicates the name of the ensuing grammar. Processing this grammar file will generate the Java classes SparqlLexer for the lexer, and SparqlParser for the parser.

The options section lets you declare certain options for the grammar. In the example above, output = AST means that ANTLR will generate a parser whose rules each yield an abstract syntax tree. k = 1 means that the parser will use only one token's worth of look-ahead into the token stream to decide which rule to try to match the input against. ANTLR parsers are capable of using more look-ahead; in fact, by default, ANTLR 3 can use an arbitrary amount of look-ahead via the LL(*) recognition strategy, as opposed to LL(k) for some fixed value of k. However, because the SPARQL language is declared to be an LL(1) language, an explicit k = 1 setting in the grammar file yields a more efficient parser.
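The effect of one-token look-ahead can be sketched outside ANTLR. The toy rule below is invented for illustration and is not part of SPARQL; it picks between two alternatives by inspecting only the next token, which is all an LL(1) parser needs when the alternatives begin with distinct tokens:

```java
// Toy illustration of k = 1: one token of look-ahead suffices to choose
// an alternative when the alternatives' first tokens don't overlap.
// The "grammar" here (value ::= NUMBER | STRING) is invented for this sketch.
public class OneTokenLookahead {
    static String classify( String nextToken ) {
        char first = nextToken.charAt( 0 );
        if ( Character.isDigit( first ) )
            return "NUMBER";   // first alternative: a numeric literal
        if ( first == '"' )
            return "STRING";   // second alternative: a quoted string
        throw new IllegalArgumentException( "no alternative matches: " + nextToken );
    }

    public static void main( String[] args ) {
        System.out.println( classify( "42" ) );       // NUMBER
        System.out.println( classify( "\"Joe\"" ) );  // STRING
    }
}
```

If two alternatives shared a first token, a single token of look-ahead would no longer be enough, and the parser would need a larger k (or LL(*)) to disambiguate.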

@header and @lexer::header let you specify code, written in the target language, that should appear at the beginning of the generated parser and lexer source, respectively. The example uses these sections to place the generated lexer and parser in a Java package other than the default.

The @members command lets you write code in the same language as the generated parser; its contents become fields and methods of the generated parser class.

The @rulecatch command lets you specify an exception-handling strategy. In the example, whenever the parser or lexer raises a RecognitionException (ANTLR's top-level parsing exception), the application will simply propagate it rather than attempting error recovery. More robust grammars attempt more sophisticated error handling and reporting.

With this boilerplate out of the way, here's a first test and the corresponding lexer rule. The SPARQL specification states that the rule PN_CHARS_BASE describes which Unicode characters can be used in SPARQL-prefixed names, and that the rule depends on no other rules:

PN_CHARS_BASE ::= [A-Z] | [a-z] | [#x00C0-#x00D6] | [#x00D8-#x00F6]
    | [#x00F8-#x02FF] | [#x0370-#x037D] | [#x037F-#x1FFF]
    | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF]
    | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD]

Translated to ANTLR, the lexer rule becomes:

PN_CHARS_BASE
    : ( 'A' .. 'Z'
      | 'a' .. 'z'
      | '\u00C0' .. '\u00D6'
      | '\u00D8' .. '\u00F6'
      | '\u00F8' .. '\u02FF'
      | '\u0370' .. '\u037D'
      | '\u037F' .. '\u1FFF'
      | '\u200C' .. '\u200D'
      | '\u2070' .. '\u218F'
      | '\u2C00' .. '\u2FEF'
      | '\u3001' .. '\uD7FF'
      | '\uF900' .. '\uFDCF'
      | '\uFDF0' .. '\uFFFD'
      )
    ;

This rule will match any single character in any of the ranges listed above.

Rules in ANTLR grammars are realized as methods on the generated lexer or parser. By convention, a rule name beginning with an uppercase letter denotes a lexer rule; one beginning with a lowercase letter denotes a parser rule. Recall that lexers emit tokens that parsers consume to match higher-level grammatical structures. In this case, the generated lexer will have a method called mPN_CHARS_BASE(), which reads a character of input and checks whether it falls within any of the specified ranges. If it does, the method consumes the character and returns a token representing it. Otherwise, ANTLR raises a RecognitionException.
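To make the generated method's behavior concrete, here is a hand-written predicate performing the same range-membership test. This is a sketch, not ANTLR's actual generated code, and the class name is invented:

```java
// Sketch: the membership check that a generated mPN_CHARS_BASE() method
// amounts to, expressed as a plain predicate over a single character.
public class PnCharsBaseCheck {
    static boolean isPnCharsBase( char c ) {
        return ( c >= 'A' && c <= 'Z' )
            || ( c >= 'a' && c <= 'z' )
            || ( c >= '\u00C0' && c <= '\u00D6' )
            || ( c >= '\u00D8' && c <= '\u00F6' )
            || ( c >= '\u00F8' && c <= '\u02FF' )
            || ( c >= '\u0370' && c <= '\u037D' )
            || ( c >= '\u037F' && c <= '\u1FFF' )
            || ( c >= '\u200C' && c <= '\u200D' )
            || ( c >= '\u2070' && c <= '\u218F' )
            || ( c >= '\u2C00' && c <= '\u2FEF' )
            || ( c >= '\u3001' && c <= '\uD7FF' )
            || ( c >= '\uF900' && c <= '\uFDCF' )
            || ( c >= '\uFDF0' && c <= '\uFFFD' );
    }

    public static void main( String[] args ) {
        System.out.println( isPnCharsBase( 'a' ) );   // true
        System.out.println( isPnCharsBase( '\0' ) );  // false
        System.out.println( isPnCharsBase( '_' ) );   // false: underscore belongs to PN_CHARS_U, not PN_CHARS_BASE
    }
}
```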

Here's how the first test for this rule might look. Start by asserting that each of the letters "a" through "z" is recognized as a PN_CHARS_BASE:

// src/test/java/com/devx/sparql/lexer/PnameCharactersBaseTest.java
import org.antlr.runtime.*;
import org.junit.*;
import static org.junit.Assert.*;

public class PnameCharactersBaseTest {
    @Test
    public void shouldRecognizeLowercaseLetters() throws Exception {
        for ( char ch = 'a'; ch <= 'z'; ++ch ) {
            String input = String.valueOf( ch );

            // First way: invoke the void rule method directly.
            SparqlLexer lexer = lexerOver( input );
            lexer.mPN_CHARS_BASE();

            // Second way: ask a fresh lexer for the next token.
            Token token = lexerOver( input ).nextToken();
            assertEquals( "couldn't match " + input + "?", input, token.getText() );
            assertEquals( "token type for " + input + "?",
                SparqlLexer.PN_CHARS_BASE, token.getType() );
        }
    }

    private SparqlLexer lexerOver( String input ) {
        SparqlLexer lexer = new SparqlLexer();
        lexer.setCharStream( new ANTLRStringStream( input ) );
        return lexer;
    }
}

This test exercises the lexer rule in two ways. The first, which is how generated lexers and parsers typically invoke rules, is to call the void method corresponding to the rule name. This method consumes input if a match occurs and sets other lexer state that is difficult to sense in a test. The second way is to invoke nextToken() on the lexer and ensure that it produces a token with the desired type and text. Both are effective ways of verifying that inputs are matched or rejected as appropriate.

Here's a negative test to ensure that an exception is raised when the rule detects a non-matching character:

@Test( expected = RecognitionException.class )
public void shouldRejectTheZeroCharacter() throws Exception {
    lexerOver( "\0" ).mPN_CHARS_BASE();
}

Testing that the remaining character ranges match, and that characters outside the legal ranges are rejected, is left as an exercise for the reader. You can find the full ANTLR grammar and its tests in this article's downloadable code.

As you build up your suite of tests, you will find many opportunities to factor common code out into helper methods. The lexerOver() method above is one example of such a refactoring.

You can further leverage existing tests when certain inputs for rules they test also match other rules. For example, consider the lexer rule PN_CHARS_U:

PN_CHARS_U ::= PN_CHARS_BASE | '_'
Translated to ANTLR:

PN_CHARS_U : ( PN_CHARS_BASE | '_' ) ;
This rule matches any single character that would match PN_CHARS_BASE, plus the underscore character. Given their similarity, it would be a shame to have to duplicate all the tests written to verify PN_CHARS_BASE in a test class for PN_CHARS_U. Fortunately, you don't have to—you can leverage inheritance and exploit the way JUnit discovers tests on a test class to eliminate the duplication. Listing 1 shows a refactored abstract LexerTest, a PnameCharactersBaseTest that uses LexerTest's helpers and overrides, and a PnameCharactersPlusUnderscoreTest that derives in turn from PnameCharactersBaseTest.
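The shape of that refactoring can be sketched without JUnit or the generated lexer. In this minimal, self-contained illustration (all names invented), the base class's check runs against whatever predicate the overridable method supplies, so the subclass inherits every base check and only widens the accepted character set, much as PnameCharactersPlusUnderscoreTest inherits the tests of PnameCharactersBaseTest:

```java
// Sketch of test reuse by inheritance, with the JUnit machinery stripped
// away. The base "test" exercises whatever predicate accepts() supplies;
// the subclass reuses the inherited check and only widens the accepted set.
public class InheritedChecks {
    static class BaseChars {
        // Overridable factory method: which characters should be accepted?
        boolean accepts( char c ) { return c >= 'a' && c <= 'z'; }

        // A shared check: every lowercase letter must be accepted.
        void checkLowercaseLetters() {
            for ( char c = 'a'; c <= 'z'; ++c )
                if ( !accepts( c ) ) throw new AssertionError( "rejected " + c );
        }
    }

    static class BaseCharsPlusUnderscore extends BaseChars {
        @Override boolean accepts( char c ) { return super.accepts( c ) || c == '_'; }

        // An additional check specific to the wider rule.
        void checkUnderscore() {
            if ( !accepts( '_' ) ) throw new AssertionError( "rejected _" );
        }
    }

    public static void main( String[] args ) {
        new BaseChars().checkLowercaseLetters();

        BaseCharsPlusUnderscore sub = new BaseCharsPlusUnderscore();
        sub.checkLowercaseLetters();  // inherited check still passes
        sub.checkUnderscore();
        System.out.println( "all checks passed" );
    }
}
```

With real JUnit, the same effect falls out for free: JUnit discovers @Test methods on superclasses, so a subclass that only overrides a helper method reruns every inherited test against the wider rule.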

You follow a similar pattern to write tests for parser rules: start with the language rules that have the fewest dependencies, test those, and work upward. Here's a brief look at the test code for the parser rule triplesSameSubject.

The SPARQL grammar rule is:

TriplesSameSubject ::= VarOrTerm PropertyListNotEmpty | TriplesNode PropertyList

Translated to ANTLR (definitions of lower-level rules omitted here):

triplesSameSubject
    : varOrTerm propertyListNotEmpty
    | triplesNode propertyList
    ;

Listing 2 shows the test code.

As with lexer rules, you can use the inheritance technique above to reuse parser-rule tests. For example, in SPARQL the rule NumericExpression matches everything that the rule AdditiveExpression matches. If you already have tests for AdditiveExpression, you can reuse them for NumericExpression simply by firing a different parser rule in the subclass's tests.
