Monday, April 9, 2007
Tokenize a String with C# Regular Expressions
Using C# and .NET regular expressions, it is easy to parse even a complex string into tokens.
For example, convert the following string:
"365 + 6 *(6.3 + Count)"
into something like this:
Token[0], Integer, "365"
Token[1], Plus, "+"
Token[2], Integer, "6"
Token[3], Multiply, "*"
Token[4], OpenBracket, "("
Token[5], Double, "6.3"
Token[6], Plus, "+"
Token[7], Variable, "Count"
Token[8], CloseBracket, ")"
Without regular expressions, this becomes quite a programming exercise. With regular expressions it becomes quite simple. The regular expression technique even supports simple syntax checking for invalid characters or character sequences.
Here are the steps.
1. Define a Token helper class.
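public class Token
{
    public readonly string Name;
    public readonly string Value;
    public Token(string name, string value)
    {
        Name = name;
        Value = value;
    }
}
2. Define the regular expression pattern using named groups, looking something like this.
private static string pattern =
    @"(?<whitespace>\s+)|" +
    @"(?<variable>[a-zA-Z_$][a-zA-Z0-9_$]*)|" +
    @"(?<integer>[0-9]+)|" +
    @"(?<plus>\+)|" +
    @"(?<minus>-)|" +
    @"(?<multiply>\*)|" +
    @"(?<invalid>[^\s]+)";
The regular expression explained: The round brackets "(...)" define a group that supports matching a subexpression. The "?<name>" prefix gives the group a name. The "\s" matches a whitespace character, and the "\s+" subexpression matches one or more whitespace characters in a row. The quantifier needs to be "+" rather than "*" here: "\s*" can also match the empty string, so as the first alternative it would succeed at every position and none of the other groups would ever get a chance to match.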
Overall, the pattern matches either whitespace or one of the other groups (subexpressions), such as "variable" or "integer". Notice the "|" character (shift-backslash on the keyboard) after each group. It is the logical 'or' operator in C#, but in regular expressions it is known as the 'alternation' character, and each subexpression it separates is known as an alternative. Alternatives are tried from left to right, and matching stops at the first alternative that matches.
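Because matching stops at the first alternative that succeeds, the order of the groups matters. The following is a minimal sketch of that effect, not part of the tokenizer itself. The "double" group and the FirstGroupName helper are hypothetical names introduced only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationOrderDemo
{
    // Returns "groupName:value" for the first named group that took part
    // in the first match, or an empty string if nothing matched.
    public static string FirstGroupName(string pattern, string input)
    {
        Regex regex = new Regex(pattern);
        Match match = regex.Match(input);
        if (!match.Success)
        {
            return "";
        }
        // Group 0 is the whole match; the named groups start at 1.
        for (int i = 1; i < match.Groups.Count; i++)
        {
            if (match.Groups[i].Success)
            {
                return regex.GroupNameFromNumber(i) + ":" + match.Groups[i].Value;
            }
        }
        return "";
    }

    public static void Main()
    {
        // "double" is a hypothetical group, added for illustration only.
        string integerFirst = @"(?<integer>[0-9]+)|(?<double>[0-9]+\.[0-9]+)";
        string doubleFirst = @"(?<double>[0-9]+\.[0-9]+)|(?<integer>[0-9]+)";

        Console.WriteLine(FirstGroupName(integerFirst, "6.3")); // integer:6
        Console.WriteLine(FirstGroupName(doubleFirst, "6.3"));  // double:6.3
    }
}
```

With the integer alternative listed first, "6.3" is split at the decimal point, so a more specific alternative generally needs to come before a more general one.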
The "invalid" group matches any run of non-whitespace characters that is not matched by one of the other alternative groups. This supports a simple syntax check.
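As a rough sketch of how the "invalid" group could drive such a syntax check, the snippet below collects every invalid character run in an input string. The FindInvalid helper and the class name are hypothetical, and the sketch assumes the whitespace group is written as "\s+" so that it cannot match an empty string.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class InvalidTokenCheck
{
    // Assumption: whitespace uses \s+ so that it never matches an empty string.
    private static readonly Regex TokenRegex = new Regex(
        @"(?<whitespace>\s+)|" +
        @"(?<variable>[a-zA-Z_$][a-zA-Z0-9_$]*)|" +
        @"(?<integer>[0-9]+)|" +
        @"(?<plus>\+)|" +
        @"(?<minus>-)|" +
        @"(?<multiply>\*)|" +
        @"(?<invalid>[^\s]+)");

    // Returns every substring that only the "invalid" group could match.
    public static List<string> FindInvalid(string input)
    {
        List<string> invalid = new List<string>();
        foreach (Match match in TokenRegex.Matches(input))
        {
            Group group = match.Groups["invalid"];
            if (group.Success)
            {
                invalid.Add(group.Value);
            }
        }
        return invalid;
    }

    public static void Main()
    {
        Console.WriteLine(FindInvalid("365 + 6 * Count").Count); // 0
        Console.WriteLine(FindInvalid("365 + 6 # Count")[0]);    // #
    }
}
```

An empty result means every non-whitespace part of the input was recognized by one of the other groups.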
3. Create the Regex object from the pattern.
Regex regexPattern = new Regex(pattern);
4. Perform a "foreach" on the matches to generate the tokens.
MatchCollection matches = regexPattern.Matches("365 + 6 * Count");
List<Token> tokenList = new List<Token>();
foreach (Match match in matches)
{
    int i = 0;
    foreach (Group group in match.Groups)
    {
        string matchValue = group.Value;
        bool success = group.Success;
        // ignore capture index 0 and 1 (general and whitespace)
        if (success && i > 1)
        {
            string groupName = regexPattern.GroupNameFromNumber(i);
            tokenList.Add(new Token(groupName, matchValue));
        }
        i++;
    }
}
That is all there is to it. A very small amount of code to quickly parse a string.
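Putting the four steps together, a self-contained version might look like the sketch below. The Tokenizer class and its Tokenize method are hypothetical names used to wrap the steps in something reusable, and the sketch assumes the whitespace group is written as "\s+" so that it cannot match an empty string.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class Token
{
    public readonly string Name;
    public readonly string Value;

    public Token(string name, string value)
    {
        Name = name;
        Value = value;
    }
}

public static class Tokenizer
{
    // Assumption: whitespace uses \s+ so that it never matches an empty string.
    private static readonly Regex TokenRegex = new Regex(
        @"(?<whitespace>\s+)|" +
        @"(?<variable>[a-zA-Z_$][a-zA-Z0-9_$]*)|" +
        @"(?<integer>[0-9]+)|" +
        @"(?<plus>\+)|" +
        @"(?<minus>-)|" +
        @"(?<multiply>\*)|" +
        @"(?<invalid>[^\s]+)");

    public static List<Token> Tokenize(string input)
    {
        List<Token> tokens = new List<Token>();
        foreach (Match match in TokenRegex.Matches(input))
        {
            // Skip group 0 (the whole match) and group 1 (whitespace).
            for (int i = 2; i < match.Groups.Count; i++)
            {
                Group group = match.Groups[i];
                if (group.Success)
                {
                    tokens.Add(new Token(TokenRegex.GroupNameFromNumber(i), group.Value));
                    break; // only one alternative can succeed per match
                }
            }
        }
        return tokens;
    }

    public static void Main()
    {
        foreach (Token token in Tokenize("365 + 6 * Count"))
        {
            Console.WriteLine(token.Name + ", \"" + token.Value + "\"");
        }
        // Prints:
        // integer, "365"
        // plus, "+"
        // integer, "6"
        // multiply, "*"
        // variable, "Count"
    }
}
```

The token names come straight from the group names in the pattern, so adding a new kind of token only requires adding a new named group.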
Note: With regular expressions, a single character can easily change the meaning of the whole expression, and techniques such as code inspection will rarely reveal the problem. It is important with all code, but especially code using regular expressions, to build a good set of unit tests that exercise most, if not all, of the combinations.
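A few such checks could be sketched as follows. To stay self-contained, this sketch uses a small hypothetical Expect helper instead of a unit test framework, returns tokens as "name:value" strings, and assumes the whitespace group is written as "\s+". In a real project these cases would more likely live in a proper test suite.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TokenizerChecks
{
    // Assumption: whitespace uses \s+ so that it never matches an empty string.
    private static readonly Regex TokenRegex = new Regex(
        @"(?<whitespace>\s+)|" +
        @"(?<variable>[a-zA-Z_$][a-zA-Z0-9_$]*)|" +
        @"(?<integer>[0-9]+)|" +
        @"(?<plus>\+)|" +
        @"(?<minus>-)|" +
        @"(?<multiply>\*)|" +
        @"(?<invalid>[^\s]+)");

    // Returns the tokens as "name:value" strings to keep comparisons simple.
    public static List<string> Tokenize(string input)
    {
        List<string> tokens = new List<string>();
        foreach (Match match in TokenRegex.Matches(input))
        {
            // Skip group 0 (the whole match) and group 1 (whitespace).
            for (int i = 2; i < match.Groups.Count; i++)
            {
                if (match.Groups[i].Success)
                {
                    tokens.Add(TokenRegex.GroupNameFromNumber(i) + ":" + match.Groups[i].Value);
                    break;
                }
            }
        }
        return tokens;
    }

    // Hypothetical helper: throws if the tokens differ from the expected list.
    private static void Expect(string input, params string[] expected)
    {
        string actual = string.Join(" ", Tokenize(input).ToArray());
        string wanted = string.Join(" ", expected);
        if (actual != wanted)
        {
            throw new Exception("Input \"" + input + "\": expected [" + wanted + "] but got [" + actual + "]");
        }
    }

    public static void Main()
    {
        Expect("365 + 6 * Count", "integer:365", "plus:+", "integer:6", "multiply:*", "variable:Count");
        Expect("");                       // empty input gives no tokens
        Expect("   ");                    // whitespace only gives no tokens
        Expect("a-b", "variable:a", "minus:-", "variable:b");
        Expect("_x$1", "variable:_x$1");  // '_' and '$' are allowed in variables
        Expect("6 # 7", "integer:6", "invalid:#", "integer:7");
        Console.WriteLine("All checks passed.");
    }
}
```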
Labels: C Sharp, Regular Expressions
