One Unified Global Perspective
Communications with a Global Perspective
Home
Intro
Contact Us
Voice over IP
PBX Solutions
Services
Support
Glossary
Open Source
Blog
Forum

WebMail





2008 May 24 - Sat

A Keyword Matching Algorithm

There are a number of well known algorithms out there for taking in a set of keywords and matching them against test. Aho and Corasick comes to mind, as does the Wu Manber algorithm (the latter I've implemented, and the code resides elsewhere on this site).

For another project, I didn't need something quite so fancy. Actually two projects come to mind. One is that I have a input comma separated value file which includes stock symbols, a description, and the associated exchange. I wanted to keep statitics on what is read in on an exchange basis. My first kick at the can on this was to implement a string look up table using from the Standard Template Library, with the name of the exchange being the lookup key. Another area I could use this, but with some modifications, is in longest substring matches when trying to do rate table lookups based upon country codes and area codes in voip based call control.

In another part of the file, to do some special computing, and in a fit of late night programming, I went so far as to implementing a multi-stage strcmp on each exchange to do something special for each.

I knew from the outset, that would be horribly inefficient, but I didn't have a nice algorithm at hand to do it with.

After being reminded of Aho Corasick in another context, I decided to take another stab at improving efficiency. I ended up implementing half of Aho Corasick's algorithm, the 'go' function. Since I'm always matching from the beginning of the string, and am not stepping through text, the exclusion of the 'fail' function works quite well, and keeps the code smaller.

At some point in time the code can be coverted to something using templates in order to accept various string types, various element matches, and return codes.

Since the 'fail' function isn't implemented, the lookup table can be updated dynamically without being rebuilt from scratch.

The library is quite easy to use. With AddPattern, add new keywords along with some sort of index or object. When using FindMatch with the target keyword, the appropriate object will be returned on a successful match. If no match is found NULL will be returned. A modification to the class would add a default object for use when no matches are found.

Here is the header file.


#pragma once

// Copyright (2008) Ray Burkholder
// Matching against multiple keywords

#include <vector>
#include <string>

class CKeyWordMatch {
public:
  CKeyWordMatch(void);
  virtual ~CKeyWordMatch(void);
  void ClearPatterns( void );
  void AddPattern( const std::string 
&sPattern, void *object );
  void *FindMatch( const std::string 
&sMatch );
protected:
  struct structNode {
    size_t ixLinkToNextLevel;  // next letter of same word
    size_t ixLinkAtSameLevel;  // look for other letters at same location
    void *object;  // upon match, (returned when 
keyword found)
    char chLetter;  // the letter at this node
    structNode() : ixLinkToNextLevel( 0 ), ixLinkAtSameLevel( 0 ), 
      object( NULL ), chLetter( 0 ) {};
  };
  std::vector<structNode> m_vNodes;
private:
};


Here is the code file:


#include "StdAfx.h"
#include "KeyWordMatch.h"

// Copyright (2008) Ray Burkholder
// Matching against multiple keywords

#include <stdexcept>

CKeyWordMatch::CKeyWordMatch(void) {
  ClearPatterns();
}

CKeyWordMatch::CKeyWordMatch(size_t size) {
  m_vNodes.reserve( size );
  ClearPatterns();
}

CKeyWordMatch::~CKeyWordMatch(void) {
  m_vNodes.clear();
}

void CKeyWordMatch::ClearPatterns() {
  m_vNodes.clear();
  structNode node;
  m_vNodes.push_back( node ); // root node with nothing
}

void CKeyWordMatch::AddPattern( 
              const std::string &sPattern, void *object ) {
  std::string::const_iterator iter = sPattern.begin(); 
  if ( sPattern.end() == iter ) {
    throw std::runtime_error( "zero length pattern" );
  }
  size_t ixNode = 0;
  size_t ix;
  bool bDone = false;
  while ( !bDone ) {
    char ch = *iter;
    ix = m_vNodes[ ixNode ].ixLinkToNextLevel;
    if ( 0 == ix ) { // end of chain, so add letter
      structNode node;
      node.chLetter = ch;
      m_vNodes.push_back( node );
      ix = m_vNodes.size() - 1;
      m_vNodes[ ixNode ].ixLinkToNextLevel = ix;
      ixNode = ix;
    }
    else { // find letter at this level
      //ix = m_vNodes[ ixNode ].ixLinkToNextLetter;  // already set
      bool bLevelDone = false;
      size_t ixLevel = ix;  // set from above
      while ( !bLevelDone ) {
        if ( ch == m_vNodes[ ixLevel ].chLetter ) { 
          // found matching character
          ixNode = ixLevel;
          bLevelDone = true;
        }
        else {
          // move onto next node at this level to find character
          size_t ixLinkAtNextSameLevel 
            = m_vNodes[ ixLevel ].ixLinkAtSameLevel;
          if ( 0 == ixLinkAtNextSameLevel ) {
            // add a new node at this level
            structNode node;
            node.chLetter = ch;
            m_vNodes.push_back( node );
            ix = m_vNodes.size() - 1;
            m_vNodes[ ixLevel ].ixLinkAtSameLevel = ix;
            ixNode = ix;
            bLevelDone = true;
          }
          else {
            // check the new node, nothing to do here
            // check next in sequence
            ixLevel = ixLinkAtNextSameLevel;
          }
        }
      }
    }
    ++iter;
    if ( sPattern.end() == iter ) {
      if ( NULL != m_vNodes[ ixNode ].object ) {
        std::runtime_error( "Pattern already present" );
      }
      m_vNodes[ ixNode ].object = object;  // 
assign and finish
      bDone = true;
    }
  }
}

void *CKeyWordMatch::FindMatch( const std::string &sPattern ) {
  // traverse structure looking for matches
  std::string::const_iterator iter = sPattern.begin(); 
  if ( sPattern.end() == iter ) {
    throw std::runtime_error( "zero length pattern" );
  }
  void *object = NULL;
  size_t ixNode = 0;
  size_t ix;
  bool bOpFound = true;
  bool bDone = false;
  while ( !bDone ) {
    char ch = *iter;
    ix = m_vNodes[ ixNode ].ixLinkToNextLevel;
    if ( 0 == ix ) {
      bDone = true;  // no more matches to be found so exit
    }
    else {
      // compare characters at this level
      bool bLevelDone = false;
      size_t ixLevel = ix;  // set from above
      while ( !bLevelDone ) {
        if ( ch == m_vNodes[ ixLevel ].chLetter ) {
          ixNode = ixLevel;
          bLevelDone = true;
        }
        else {
          ixLevel = m_vNodes[ ixLevel ].ixLinkAtSameLevel;
          if ( 0 == ixLevel ) {  // no match so end 
            bLevelDone = true;
            bDone = true;
          }
        }
      }
    }
    ++iter;
    if ( sPattern.end() == iter ) {
      object = m_vNodes[ ixNode ].object;
      bDone = true;
    }
  }
  return object;
}




Blog Content ©2008
Ray Burkholder
All Rights Reserved
ray@oneunified.net
(441) 505 7293
Available for Contract Work
Resume

RSS: Click to see the XML version of this web page.

View Ray 
Burkholder's profile on LinkedIn
technorati
Add to Technorati Favorites



May
Su Mo Tu We Th Fr Sa
       
24


Main Links:
Monitoring Server
SSH Tools
QuantDeveloper Code

Special Links:
Frink

Blog Links:
Sergey Solyanik
Marc Andreessen
HotGigs
Micro Persuasion
... Reasonable ...
Chris Donnan
BeyondVC
lifehacker
Trader Mike
Ticker Sense
HeadRush
TraderFeed
Stock Bandit
The Daily WTF
Guy Kawaski
J. Brant Arseneau
Steve Pavlina
Matt Cutts
Kevin Scaldeferri
Joel On Software
Quant Recruiter
Blosxom User Group
Wesner Moise
Julian Dunn
Steve Yegge

2008
Months
May
Sep
Oct Nov Dec




Mason HQ

Disclaimer: This site may include market analysis. All ideas, opinions, and/or forecasts, expressed or implied herein, are for informational purposes only and should not be construed as a recommendation to invest, trade, and/or speculate in the markets. Any investments, trades, and/or speculations made in light of the ideas, opinions, and/or forecasts, expressed or implied herein, are committed at your own risk, financial or otherwise.