Tokenizer implemented in C#

Introduction

For a project of mine I had to implement a Tokenizer. This article discusses the code I created and why I chose what is, in my eyes, the most general way of implementing such a Tokenizer.

A Tokenizer

What does a Tokenizer do? In one sentence: “A Tokenizer converts a stream of characters into a stream of so-called Tokens!” You may then ask: what is a Token? The answer is “a marker” that tells me what kind of characters I encountered while reading through the stream of characters, so that I can use this information later for further analysis or whatever else I want to do with it. Tokens are a representation of a state with no further meaning attached to it.

Tokens

In their simplest representation, Tokens are just numbers, each representing one or more characters whose meaning is not yet defined. For example, take the following arithmetic expression:

\(1 + 2 + 3\)

 

It would result in the following tokens:

  • Number
  • Operator
  • Number
  • Operator
  • Number

Here Number stands for the numerical value of, let's say, 1 and Operator for the numerical value 2. With this information we can store the whole arithmetic expression as a sequence of numbers.
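To make the idea of "tokens as plain numbers" concrete, here is a minimal sketch. The class name NumericTokens and the method Tokenize are hypothetical names chosen for illustration, and the sketch assumes the input contains only digits, the four basic operators and whitespace.

```csharp
using System;
using System.Collections.Generic;

public static class NumericTokens
{
    // Hypothetical numeric codes; the values 1 and 2 follow the text above.
    public const int Number = 1;
    public const int Operator = 2;

    // Converts an expression into a sequence of numeric token codes.
    // Assumption: only digits, '+', '-', '*', '/' and whitespace occur.
    public static List<int> Tokenize(string input)
    {
        var tokens = new List<int>();
        int i = 0;
        while (i < input.Length)
        {
            char c = input[i];
            if (char.IsWhiteSpace(c))
            {
                i++; // whitespace produces no token
            }
            else if (char.IsDigit(c))
            {
                // Consume the whole run of digits as a single Number token.
                while (i < input.Length && char.IsDigit(input[i])) i++;
                tokens.Add(Number);
            }
            else if ("+-*/".IndexOf(c) >= 0)
            {
                tokens.Add(Operator);
                i++;
            }
            else
            {
                throw new FormatException($"Unexpected character '{c}' at position {i}.");
            }
        }
        return tokens;
    }
}
```

Calling NumericTokens.Tokenize("1 + 2 + 3") yields the sequence 1, 2, 1, 2, 1, which is exactly the Number/Operator list from above stored as numbers.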

You might already have thought: “Well, we have stored that we found a number or an operator, but we lost the information of what specific number or operator it was. How do we deal with this?”

The answer is: it depends on your specific needs. I will showcase the way I chose. My Tokens aren’t merely numbers; they are classes, where an instance of the class represents state and sometimes contains a value, e.g. for numbers or strings.

For every possible token I want to represent in my stream of tokens, I have to write a class specifically for that token.
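As a minimal sketch of the "one class per token" idea, three token classes could look like this. All names here (Token, NumberToken, OperatorToken, EndOfStreamToken) are assumptions made for illustration and not necessarily the classes used in my project.

```csharp
using System;

// Hypothetical base class; every concrete token derives from it.
public abstract class Token
{
    // Small service functionality: print the token's type name.
    public override string ToString() => GetType().Name;
}

// A token that carries a value: the number that was read.
public sealed class NumberToken : Token
{
    public int Value { get; }
    public NumberToken(int value) { Value = value; }
    public override string ToString() => $"NumberToken({Value})";
}

// A token that remembers which operator character was read.
public sealed class OperatorToken : Token
{
    public char Symbol { get; }
    public OperatorToken(char symbol) { Symbol = symbol; }
    public override string ToString() => $"OperatorToken({Symbol})";
}

// A token that is pure state: it carries no value at all.
public sealed class EndOfStreamToken : Token { }
```

The type of the instance answers "what kind of thing did I find?", and the optional property answers "which one exactly?", so nothing is lost compared to the plain-number representation.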

In the code snippet above I have written three kinds of possible tokens. They do almost nothing except for just being there! They do not implement any real behaviour except some service functionality, which in my eyes does not really count!

In total I have written 22 different kinds of tokens. Because they are just a means of storing a certain state found within a character stream, I need an object which recognizes my tokens, and that is the responsibility of a Tokenizer.

My design, which you can see at the bottom of this post, is my first attempt at generalizing such a Tokenizer. It consists of a stream, a list containing the tokens I might find, and a table of rules, which is filled in the constructor of my class.

The table is a Dictionary which associates a key with a rule, where the rule is a simple delegate called for any given character.

Quite simple: you only need to write a teeny tiny bit of code for the rule, often called the production, which can then be called by the rule’s name for the given character.
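The following is a rough sketch of such a design, not my actual class: a stream, a list of found tokens, and a Dictionary of named rules filled in the constructor. The names Tokenizer, Rule, Classify and ReadNumber, the rule keys, and the two compact token classes are all assumptions made so that the example compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public abstract class Token { }

public sealed class NumberToken : Token
{
    public int Value { get; }
    public NumberToken(int value) { Value = value; }
}

public sealed class OperatorToken : Token
{
    public char Symbol { get; }
    public OperatorToken(char symbol) { Symbol = symbol; }
}

public sealed class Tokenizer
{
    // A rule (production) is a delegate called for the current character.
    private delegate void Rule(char current);

    private readonly TextReader _stream;
    private readonly List<Token> _tokens = new List<Token>();
    private readonly Dictionary<string, Rule> _rules = new Dictionary<string, Rule>();

    public Tokenizer(TextReader stream)
    {
        _stream = stream;
        // The rule table is filled in the constructor, keyed by rule name.
        _rules["number"] = ReadNumber;
        _rules["operator"] = c => _tokens.Add(new OperatorToken(c));
        _rules["whitespace"] = c => { /* whitespace produces no token */ };
    }

    public IReadOnlyList<Token> Tokenize()
    {
        int next;
        while ((next = _stream.Read()) != -1)
        {
            char c = (char)next;
            // Look the rule up by name and call it for this character.
            _rules[Classify(c)](c);
        }
        return _tokens;
    }

    private static bool IsAsciiDigit(char c) => c >= '0' && c <= '9';

    // Maps a character to the name of the rule responsible for it.
    private static string Classify(char c)
    {
        if (IsAsciiDigit(c)) return "number";
        if (char.IsWhiteSpace(c)) return "whitespace";
        if ("+-*/".IndexOf(c) >= 0) return "operator";
        throw new FormatException($"No rule for character '{c}'.");
    }

    // Reads the remaining digits of a number, starting from its first digit.
    private void ReadNumber(char first)
    {
        var digits = new StringBuilder().Append(first);
        // Peek returns -1 at the end of the stream, which ends the loop.
        while (_stream.Peek() != -1 && IsAsciiDigit((char)_stream.Peek()))
            digits.Append((char)_stream.Read());
        _tokens.Add(new NumberToken(int.Parse(digits.ToString())));
    }
}
```

With this sketch, new Tokenizer(new StringReader("12 + 3")).Tokenize() returns three tokens: a NumberToken with Value 12, an OperatorToken with Symbol '+', and a NumberToken with Value 3. Supporting a new kind of token means adding one token class, one entry in the rule table and one case in Classify.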

 

It is nothing sophisticated but it gets the job done.

 
