I am currently learning how to implement an interpreter/compiler, and I want to document this process. This first post will cover the general frontend. This frontend is source-language independent: at this stage it does not matter which language I will eventually compile or interpret.
It is more about the concept, which then has to be specialized for any language I want to compile.
The task of the frontend
A frontend takes some input and transforms it into a data structure (an Abstract Syntax Tree) which we can then either execute or compile into machine language.
The input comes from some source which provides us with a stream of characters. This is important: any source file of any programming language is simply a sequential collection of characters which by themselves have no meaning except their numerical value. These numerical values do not hold a lot of information.
In the same way, the letters of our alphabet do not hold a lot of information except their pronunciation, but if we compose letters into groups, also known as words, suddenly we have encoded information.
We have to do the exact same thing with our input/source of characters: identify the different groups of characters which make up the words of the language we try to compile/interpret. So we scan each character and, much like in our natural written language, words are differentiated by the whitespace between them.
In computer science these words have a different name: they are called tokens (sometimes also referred to as lexemes). The tokens are then stored sequentially for the next stage of our transformation, in the same order in which we encountered them during our scan over the stream of characters of our source.
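To make this concrete, here is a minimal sketch (in Python, since this post is language-agnostic) of grouping characters into tokens. It only splits on whitespace; a real scanner would also split on operators and punctuation, but the idea is the same:

```python
def tokenize(source: str) -> list[str]:
    """Group a stream of characters into whitespace-separated tokens."""
    tokens = []
    current = ""
    for ch in source:
        if ch.isspace():
            if current:            # a token just ended
                tokens.append(current)
                current = ""
        else:
            current += ch          # keep collecting characters
    if current:                    # don't forget the last token
        tokens.append(current)
    return tokens

print(tokenize("let answer = 42"))  # ['let', 'answer', '=', '42']
```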
Our parser then reads our collection of tokens sequentially, token by token, and constructs a so-called Abstract Syntax Tree, which encodes not only our tokens but also their relationships to each other.
- A Source provides us with a stream of characters
- A Scanner uses a stream of characters to identify tokens
- A Token is constructed from the same source as the scanner.
- While being constructed, the token moves the stream forward and consumes each character which is part of the token the scanner has identified.
- Part of the scanning process is moved away from the scanner to an individual token implementation
- The constructed token is stored within our scanner.
- A Parser requests a new token from a Scanner until the scanner cannot provide any more tokens
- Each token provided by a scanner is put into an AST
- A concrete implementation of a parser states the rules for how to insert tokens into an AST
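The steps above can be sketched as a simple driver loop. The class and method names here are my own assumptions, since we have not designed the actual interfaces yet:

```python
class Scanner:
    """Stand-in scanner: hands out one token at a time."""
    def __init__(self, tokens):
        self._tokens = iter(tokens)

    def next_token(self):
        # Returns None once the source is exhausted.
        return next(self._tokens, None)

class Parser:
    """Stand-in parser: pulls tokens from the scanner until it is
    exhausted and inserts each one into the AST."""
    def __init__(self, scanner):
        self.scanner = scanner
        self.ast = []  # a flat list as a stand-in for a real tree

    def parse(self):
        while (token := self.scanner.next_token()) is not None:
            # A real parser applies production rules here instead of
            # simply appending.
            self.ast.append(token)
        return self.ast

print(Parser(Scanner(["let", "x", "=", "1"])).parse())
```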
How to express this in Code?
The source-language independent part of the frontend can be split into four parts to achieve its goal. These four subtasks then form the basis for future source-language-specific parts. The diagram from earlier already gives us an idea of what the relationships between the different components will look like at runtime. To your left, you see a simplified UML class diagram. It states which class stores an instance of another class; by that we implicitly say which attributes every class must have at minimum.
It makes clear that every instance of a Parser needs an instance of a Scanner. This very same Scanner needs an instance of one Token and one Source. And every Token needs one instance of a Source.
Let’s take a look at each class, starting with the source. The source provides us with a stream of characters. Therefore we need a way to ask for the current character within the stream we work on, as well as a way to ask for the next character. Besides that, we probably want to know where the character is located within the file the stream originated from. This requires us to keep track of which line we are on, the position within that line, and probably the line itself. This becomes increasingly important once I start to implement the backend: if a syntax error occurs, I want to be able to report the position where it happened. Only the source can tell me where I am within the file.
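A Source along these lines might look as follows. This is only a sketch; the method names and the empty-string end-of-file sentinel are my own choices:

```python
class Source:
    """Provides a stream of characters and keeps track of the current
    position (line and column) for later error reporting."""
    EOF = ""  # sentinel: an empty string marks the end of the stream

    def __init__(self, text: str):
        self._lines = text.splitlines(keepends=True)
        self.line = 0      # index of the current line
        self.column = 0    # position within that line

    def current_char(self) -> str:
        if self.line >= len(self._lines):
            return Source.EOF
        return self._lines[self.line][self.column]

    def next_char(self) -> str:
        # Advance the position, moving to the next line when the
        # current one is exhausted.
        self.column += 1
        if self.line < len(self._lines) and self.column >= len(self._lines[self.line]):
            self.line += 1
            self.column = 0
        return self.current_char()
```

Keeping the bookkeeping inside the source means every consumer gets position information for free: a token created at this moment can simply copy `line` and `column`.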
The next class I want to design is the scanner. A scanner provides any parser with one token at a time. Similar to the source class, it provides a stream, but not of characters: it is a stream of tokens. Therefore it must, like the source, be able to present the token currently extracted from the source as well as a means to ask for the next token. The Scanner will be an abstract class because we only know the behaviour every source-language scanner must implement, not the concrete implementation of that behaviour; we also want to ensure that we have an attribute of type Source. If we did not want to enforce the attribute, an interface would probably suffice.
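In Python the abstract class could be sketched with the `abc` module; the `extract_token` hook name is an assumption of mine:

```python
from abc import ABC, abstractmethod

class Scanner(ABC):
    """Abstract scanner: turns a stream of characters into a stream of
    tokens. Concrete subclasses implement extract_token for one
    specific source-language."""

    def __init__(self, source):
        self.source = source        # the enforced Source attribute
        self.current_token = None   # the most recently extracted token

    def next_token(self):
        self.current_token = self.extract_token()
        return self.current_token

    @abstractmethod
    def extract_token(self):
        """Identify and construct the next token from the source."""
```

The abstract base cannot be instantiated directly, which enforces that every source-language scanner supplies its own `extract_token`.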
The next class of interest is the Token itself. A token represents the smallest unit of information our frontend can provide after scanning each individual character from our source file. Any token, no matter what it might represent in our source-language, shall store the characters representing it as well as its position within our source. This is important because certain kinds of source-language-specific tokens might be allowed to appear multiple times, and to distinguish between several occurrences of the same type of token we need more information. Some tokens might also hold a value. For instance, any numerical kind of token wants to store the numerical value itself for later use. Also, a Token holds a reference to a Source and the Scanner holds a reference to a Token; these two references were still missing in the previous diagrams.
The diagram now contains the information needed to implement a token. But as you might notice, a token is now able to do things: it can ask for a character or extract something. I mentioned the reason for this at the beginning of the post. Each created Token shares the same Source reference as the one within our instance of the Scanner. Whenever the Scanner tries to extract a Token from our source, we shift the responsibility of HOW to construct a token away from the scanner, and thereby decouple the actual creation of a token. Every concrete implementation of a scanner is then only responsible for identifying the start of a token and the “kind” of the token. The token itself then tries to construct itself from the source. This gives us the ability to add a new source-language more easily.
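A base Token that constructs itself from the source could look like this. It assumes a source object with `current_char`/`next_char` methods and `line`/`column` attributes, as sketched above; the `WordToken` subclass is a hypothetical example of a source-language-specific token:

```python
class Token:
    """Smallest unit of information the frontend provides. Stores the
    characters it represents, its start position, and optionally a
    value (e.g. the numeric value of a number token)."""

    def __init__(self, source):
        self.source = source
        self.line = source.line       # remember where the token starts
        self.column = source.column
        self.text = ""
        self.value = None
        self.extract()                # the token constructs itself

    def extract(self):
        # Default behaviour: consume a single character. Concrete
        # token classes override this and keep consuming the
        # characters that belong to them.
        self.text = self.source.current_char()
        self.source.next_char()

class WordToken(Token):
    """Hypothetical source-language token: consumes a run of letters."""
    def extract(self):
        while self.source.current_char().isalpha():
            self.text += self.source.current_char()
            self.source.next_char()
```

Note how the scanner only has to decide *which* token class to instantiate; each token class alone knows how far to advance the shared source.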
The last class we are missing in our diagram is the Parser. The parser’s responsibility is to enforce the rules of the source-language and to create an abstract syntax tree from tokens in a way which follows these rules. These rules are also known under the name “production rules”.
Because this parser is just an abstract class, it does not hold any language-specific implementation, but it describes which methods any Parser must have. The only new thing you will discover are the two interfaces the Parser holds as attributes. These interfaces will be a topic of their own, and therefore I will not explain them further for now.
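Mirroring the abstract Scanner, the abstract Parser can be sketched the same way. The two deferred interfaces are deliberately left out, since the post does not describe them yet; the `parse` method name is my assumption:

```python
from abc import ABC, abstractmethod

class Parser(ABC):
    """Abstract parser: enforces the production rules of a
    source-language while building an AST from tokens."""

    def __init__(self, scanner):
        self.scanner = scanner   # every Parser needs a Scanner

    @abstractmethod
    def parse(self):
        """Consume tokens from the scanner and return the AST root."""
```

A concrete parser subclass would implement `parse` by repeatedly requesting tokens from its scanner and inserting them into the tree according to its production rules.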