Parsing and Syntax

APIs for parsing SystemVerilog source code

Like most production compilers, slang's parser is hand-written recursive descent. It is designed to be robust in the face of errors in the source text; no matter how poorly formed, the parser should gracefully consume the text and continue working. This makes the parser usable in editing scenarios, where 90% of the time it will be operating on broken source since the user is in the middle of typing it.

The parsing process is implemented via several layers that work together to produce the final parse tree. The sections here describe each layer in turn.

Note that everything described here is operating on the level of a "parse tree", or equivalently a "concrete syntax tree". The abstract syntax tree is at a higher level of abstraction and is built in a separate step, entirely distinct from the parsing process.

Syntax Nodes

The actual parse tree is literally a tree of slang::syntax::SyntaxNode derived classes. Each derived class represents a single piece of the SystemVerilog grammar. All of the derived classes are described via a DSL in the scripts/syntax.txt file and turned into C++ classes via a Python script during the build.

Each syntax node provides members and methods for accessing its member nodes, as well as to iterate through them in a generic fashion. They also each contain a link to their parent node, which will be set for all nodes except the root.

At the leaves of the tree are slang::parsing::Token instances. Tokens are produced by lexing the source text, and represent individual syntactic elements of the language, such as keywords, identifiers, and punctuation. They can carry with them a variety of data; for example, an integer literal might carry the numeric value represented by the token's text. Each token also carries location information about where it came from in the original source, along with any preceding "trivia". Trivia is extra bits of source text that don't contribute to the final program. This might include, for example, whitespace and code comments. These are attached to the next valid token produced by the lexer. This makes it possible for any piece of the parse tree to be converted back to the original source code that created it by doing a simple depth-first traversal and examining each token and its trivia.

Other parts of the parser can create new trivia. If there are unknown or unexpected tokens in the source, for example, the parser can take them and turn them into skipped trivia and continue parsing once it reaches a valid state. Trivia is represented by the slang::parsing::Trivia class.

There are three additional special syntax nodes that are worth mentioning here. The slang::syntax::SyntaxList, slang::syntax::TokenList, and slang::syntax::SeparatedSyntaxList nodes represent lists of child nodes and provide various APIs for accessing child elements by offset. SyntaxList contains a simple list of child nodes, TokenList contains a list only of source code tokens, and SeparatedSyntaxList contains a token-delimited list of child nodes (typically the delimiters are commas).

Lexing

The slang::parsing::Lexer class implements the process of lexing SystemVerilog source code; that is, turning the source text into a stream of tokens (and trivia). The Lexer is created with a slang::SourceBuffer and then expects you to repeatedly call slang::parsing::Lexer::lex to produce each successive token.

It also contains some useful static methods that are used elsewhere in the parsing process. For example, slang::parsing::Lexer::stringify can take a list of tokens and turn them into a string literal token, and slang::parsing::Lexer::concatenateTokens can combine two tokens into one.

The lexer is fairly low level and unlikely to be useful on its own; you probably want to use it in combination with other parsing layers.

Preprocessor

The slang::parsing::Preprocessor class implements all of the SystemVerilog preprocessing rules (referred to in the spec as "compiler directives"). The preprocessor takes as input a stream of tokens (via one or more Lexer instances) and produces as output a single coherent stream of tokens with all of the directives removed and the various actions implied by those directives performed.

As an example, when an ifdef token arrives the preprocessor does not emit it downstream but instead checks whether the given macro is defined. If it's not, it will consume tokens and hide them away inside SkippedToken trivia until it sees the corresponding endif.

The preprocessor fully handles all macro expansion (which is a complicated process) and also handles include directives. Whenever such a directive is seen, the corresponding file is searched for and opened, and a Lexer for it pushed onto the preprocessor's stack of open files. Once all tokens from the include file are drained, that Lexer is popped and the original file is continued.

There are some useful methods on the preprocessor class that can be used to control its operation. slang::parsing::Preprocessor::predefine lets you programmatically define a macro, slang::parsing::Preprocessor::isDefined lets you check whether a given macro is defined at the current point in the token stream, and slang::parsing::Preprocessor::getDefinedMacros will return a list of all macros currently defined.

The output stream of tokens from the preprocessor, obtained by continually calling slang::parsing::Preprocessor::next, can actually be parsed according to the SystemVerilog language grammar.

Parser

The slang::parsing::Parser class implements a full SystemVerilog language parser. It takes as input a slang::parsing::Preprocessor instance, repeatedly obtains tokens from it, and from those tokens creates a tree of slang::syntax::SyntaxNode objects that represent the entire source text.

In a typical scenario where you're operating on a complete source file, you can simply call slang::parsing::Parser::parseCompilationUnit to get the full tree. If you know you only have a specific snippet of source in the token stream (possibly because you constructed it programmatically or for a unit test) you can call one of the more specialized methods like slang::parsing::Parser::parseExpression.

The slang::parsing::Parser::parseGuess method is useful to note; it tries to figure out the most likely code construct by examining the first few tokens in the stream. This is useful in a REPL scenario where the user is typing random bits of code and expecting the tool to figure out what they mean.

SyntaxTree

The whole lex / preprocess / parser operation, along with a BumpAllocator for arena based allocation and a Diagnostics buffer for collecting diagnostics, is wrapped up inside the slang::syntax::SyntaxTree class to provide a convenient API. While the individual components involved are designed to be usable on their own to a certain extent, the SyntaxTree class is the easiest way to just parse some code and look at the result.

The class has several static methods for constructing a syntax tree from various sources; for example, slang::syntax::SyntaxTree::fromFile for constructing from a source file on disk and slang::syntax::SyntaxTree::fromText for constructing from source text in memory. The result is a std::shared_ptr to the SyntaxTree object, which owns all of the memory for the parsed syntax tree and any diagnostics that occurred.

Parsing text is then as easy as:

auto tree = SyntaxTree::fromText("module m; endmodule");

When loading from a file the result also includes (via a std::expected object) a system error code if the load fails.

auto result = SyntaxTree::fromFile("path/to/file.sv");
if (result) {
    /* do something with *result */
}
else {
    /* do something with result.error() */
}

Manipulating the syntax tree

Once you have a syntax tree you can examine it or manipulate it in various ways as you see fit. To make the process of walking the tree easier, the slang::syntax::SyntaxVisitor class exists. You can derive from this class and implement a handle() method for the specific kinds of syntax nodes you care about, and the process of actually walking the tree and invoking the right methods will be handled automatically.

If you want to walk the tree and modify the tree, say as part of some automated refactoring process, you can instead derive from slang::syntax::SyntaxRewriter. This class adds useful helper methods for replacing parts of the tree or inserting new nodes into the tree on the fly.

A syntax tree, or subsets of a syntax tree, can be turned back into source text via the slang::syntax::SyntaxPrinter class. This class has a variety of options for how you'd like the text printed, and then will print any tokens, trivia, or syntax nodes you give it.

See the tests/unittests/VisitorTests.cpp file for an example of using all of these mechanisms together.