A Keyword Matching Algorithm
There are a number of well known algorithms out there for taking in a set of keywords and matching them against test. Aho and Corasick comes to mind, as does the Wu Manber algorithm (the latter I've implemented, and the code resides elsewhere on this site).
For another project, I didn't need something quite so fancy. Actually two projects come to mind. One is that I have a input comma separated value file which includes stock symbols, a description, and the associated exchange. I wanted to keep statitics on what is read in on an exchange basis. My first kick at the can on this was to implement a string look up table using from the Standard Template Library, with the name of the exchange being the lookup key. Another area I could use this, but with some modifications, is in longest substring matches when trying to do rate table lookups based upon country codes and area codes in voip based call control. In another part of the file, to do some special computing, and in a fit of late night programming, I went so far as to implementing a multi-stage strcmp on each exchange to do something special for each. I knew from the outset, that would be horribly inefficient, but I didn't have a nice algorithm at hand to do it with. After being reminded of Aho Corasick in another context, I decided to take another stab at improving efficiency. I ended up implementing half of Aho Corasick's algorithm, the 'go' function. Since I'm always matching from the beginning of the string, and am not stepping through text, the exclusion of the 'fail' function works quite well, and keeps the code smaller. At some point in time the code can be coverted to something using templates in order to accept various string types, various element matches, and return codes. Since the 'fail' function isn't implemented, the lookup table can be updated dynamically without being rebuilt from scratch. The library is quite easy to use. With AddPattern, add new keywords along with some sort of index or object. When using FindMatch with the target keyword, the appropriate object will be returned on a successful match. If no match is found NULL will be returned. A modification to the class would add a default object for use when no matches are found. Here is the header file. #pragma once // Copyright (2008) Ray Burkholder // Matching against multiple keywords #include <vector> #include <string> class CKeyWordMatch { public: CKeyWordMatch(void); virtual ~CKeyWordMatch(void); void ClearPatterns( void ); void AddPattern( const std::string &sPattern, void *object ); void *FindMatch( const std::string &sMatch ); protected: struct structNode { size_t ixLinkToNextLevel; // next letter of same word size_t ixLinkAtSameLevel; // look for other letters at same location void *object; // upon match, (returned when keyword found) char chLetter; // the letter at this node structNode() : ixLinkToNextLevel( 0 ), ixLinkAtSameLevel( 0 ), object( NULL ), chLetter( 0 ) {}; }; std::vector<structNode> m_vNodes; private: }; Here is the code file: #include "StdAfx.h" #include "KeyWordMatch.h" // Copyright (2008) Ray Burkholder // Matching against multiple keywords #include <stdexcept> CKeyWordMatch::CKeyWordMatch(void) { ClearPatterns(); } CKeyWordMatch::CKeyWordMatch(size_t size) { m_vNodes.reserve( size ); ClearPatterns(); } CKeyWordMatch::~CKeyWordMatch(void) { m_vNodes.clear(); } void CKeyWordMatch::ClearPatterns() { m_vNodes.clear(); structNode node; m_vNodes.push_back( node ); // root node with nothing } void CKeyWordMatch::AddPattern( const std::string &sPattern, void *object ) { std::string::const_iterator iter = sPattern.begin(); if ( sPattern.end() == iter ) { throw std::runtime_error( "zero length pattern" ); } size_t ixNode = 0; size_t ix; bool bDone = false; while ( !bDone ) { char ch = *iter; ix = m_vNodes[ ixNode ].ixLinkToNextLevel; if ( 0 == ix ) { // end of chain, so add letter structNode node; node.chLetter = ch; m_vNodes.push_back( node ); ix = m_vNodes.size() - 1; m_vNodes[ ixNode ].ixLinkToNextLevel = ix; ixNode = ix; } else { // find letter at this level //ix = m_vNodes[ ixNode ].ixLinkToNextLetter; // already set bool bLevelDone = false; size_t ixLevel = ix; // set from above while ( !bLevelDone ) { if ( ch == m_vNodes[ ixLevel ].chLetter ) { // found matching character ixNode = ixLevel; bLevelDone = true; } else { // move onto next node at this level to find character size_t ixLinkAtNextSameLevel = m_vNodes[ ixLevel ].ixLinkAtSameLevel; if ( 0 == ixLinkAtNextSameLevel ) { // add a new node at this level structNode node; node.chLetter = ch; m_vNodes.push_back( node ); ix = m_vNodes.size() - 1; m_vNodes[ ixLevel ].ixLinkAtSameLevel = ix; ixNode = ix; bLevelDone = true; } else { // check the new node, nothing to do here // check next in sequence ixLevel = ixLinkAtNextSameLevel; } } } } ++iter; if ( sPattern.end() == iter ) { if ( NULL != m_vNodes[ ixNode ].object ) { std::runtime_error( "Pattern already present" ); } m_vNodes[ ixNode ].object = object; // assign and finish bDone = true; } } } void *CKeyWordMatch::FindMatch( const std::string &sPattern ) { // traverse structure looking for matches std::string::const_iterator iter = sPattern.begin(); if ( sPattern.end() == iter ) { throw std::runtime_error( "zero length pattern" ); } void *object = NULL; size_t ixNode = 0; size_t ix; bool bOpFound = true; bool bDone = false; while ( !bDone ) { char ch = *iter; ix = m_vNodes[ ixNode ].ixLinkToNextLevel; if ( 0 == ix ) { bDone = true; // no more matches to be found so exit } else { // compare characters at this level bool bLevelDone = false; size_t ixLevel = ix; // set from above while ( !bLevelDone ) { if ( ch == m_vNodes[ ixLevel ].chLetter ) { ixNode = ixLevel; bLevelDone = true; } else { ixLevel = m_vNodes[ ixLevel ].ixLinkAtSameLevel; if ( 0 == ixLevel ) { // no match so end bLevelDone = true; bDone = true; } } } } ++iter; if ( sPattern.end() == iter ) { object = m_vNodes[ ixNode ].object; bDone = true; } } return object; }
In another part of the file, to do some special computing, and in a fit of late night programming, I went so far as to implementing a multi-stage strcmp on each exchange to do something special for each.
I knew from the outset, that would be horribly inefficient, but I didn't have a nice algorithm at hand to do it with.
After being reminded of Aho Corasick in another context, I decided to take another stab at improving efficiency. I ended up implementing half of Aho Corasick's algorithm, the 'go' function. Since I'm always matching from the beginning of the string, and am not stepping through text, the exclusion of the 'fail' function works quite well, and keeps the code smaller.
At some point in time the code can be coverted to something using templates in order to accept various string types, various element matches, and return codes.
Since the 'fail' function isn't implemented, the lookup table can be updated dynamically without being rebuilt from scratch.
The library is quite easy to use. With AddPattern, add new keywords along with some sort of index or object. When using FindMatch with the target keyword, the appropriate object will be returned on a successful match. If no match is found NULL will be returned. A modification to the class would add a default object for use when no matches are found.
Here is the header file.
#pragma once // Copyright (2008) Ray Burkholder // Matching against multiple keywords #include <vector> #include <string> class CKeyWordMatch { public: CKeyWordMatch(void); virtual ~CKeyWordMatch(void); void ClearPatterns( void ); void AddPattern( const std::string &sPattern, void *object ); void *FindMatch( const std::string &sMatch ); protected: struct structNode { size_t ixLinkToNextLevel; // next letter of same word size_t ixLinkAtSameLevel; // look for other letters at same location void *object; // upon match, (returned when keyword found) char chLetter; // the letter at this node structNode() : ixLinkToNextLevel( 0 ), ixLinkAtSameLevel( 0 ), object( NULL ), chLetter( 0 ) {}; }; std::vector<structNode> m_vNodes; private: };
Here is the code file:
#include "StdAfx.h" #include "KeyWordMatch.h" // Copyright (2008) Ray Burkholder // Matching against multiple keywords #include <stdexcept> CKeyWordMatch::CKeyWordMatch(void) { ClearPatterns(); } CKeyWordMatch::CKeyWordMatch(size_t size) { m_vNodes.reserve( size ); ClearPatterns(); } CKeyWordMatch::~CKeyWordMatch(void) { m_vNodes.clear(); } void CKeyWordMatch::ClearPatterns() { m_vNodes.clear(); structNode node; m_vNodes.push_back( node ); // root node with nothing } void CKeyWordMatch::AddPattern( const std::string &sPattern, void *object ) { std::string::const_iterator iter = sPattern.begin(); if ( sPattern.end() == iter ) { throw std::runtime_error( "zero length pattern" ); } size_t ixNode = 0; size_t ix; bool bDone = false; while ( !bDone ) { char ch = *iter; ix = m_vNodes[ ixNode ].ixLinkToNextLevel; if ( 0 == ix ) { // end of chain, so add letter structNode node; node.chLetter = ch; m_vNodes.push_back( node ); ix = m_vNodes.size() - 1; m_vNodes[ ixNode ].ixLinkToNextLevel = ix; ixNode = ix; } else { // find letter at this level //ix = m_vNodes[ ixNode ].ixLinkToNextLetter; // already set bool bLevelDone = false; size_t ixLevel = ix; // set from above while ( !bLevelDone ) { if ( ch == m_vNodes[ ixLevel ].chLetter ) { // found matching character ixNode = ixLevel; bLevelDone = true; } else { // move onto next node at this level to find character size_t ixLinkAtNextSameLevel = m_vNodes[ ixLevel ].ixLinkAtSameLevel; if ( 0 == ixLinkAtNextSameLevel ) { // add a new node at this level structNode node; node.chLetter = ch; m_vNodes.push_back( node ); ix = m_vNodes.size() - 1; m_vNodes[ ixLevel ].ixLinkAtSameLevel = ix; ixNode = ix; bLevelDone = true; } else { // check the new node, nothing to do here // check next in sequence ixLevel = ixLinkAtNextSameLevel; } } } } ++iter; if ( sPattern.end() == iter ) { if ( NULL != m_vNodes[ ixNode ].object ) { std::runtime_error( "Pattern already present" ); } m_vNodes[ ixNode ].object = object; // assign and finish bDone = true; } } } void *CKeyWordMatch::FindMatch( const std::string &sPattern ) { // traverse structure looking for matches std::string::const_iterator iter = sPattern.begin(); if ( sPattern.end() == iter ) { throw std::runtime_error( "zero length pattern" ); } void *object = NULL; size_t ixNode = 0; size_t ix; bool bOpFound = true; bool bDone = false; while ( !bDone ) { char ch = *iter; ix = m_vNodes[ ixNode ].ixLinkToNextLevel; if ( 0 == ix ) { bDone = true; // no more matches to be found so exit } else { // compare characters at this level bool bLevelDone = false; size_t ixLevel = ix; // set from above while ( !bLevelDone ) { if ( ch == m_vNodes[ ixLevel ].chLetter ) { ixNode = ixLevel; bLevelDone = true; } else { ixLevel = m_vNodes[ ixLevel ].ixLinkAtSameLevel; if ( 0 == ixLevel ) { // no match so end bLevelDone = true; bDone = true; } } } } ++iter; if ( sPattern.end() == iter ) { object = m_vNodes[ ixNode ].object; bDone = true; } } return object; }