Many algorithms exist for searching volumes of a body of text
for a specific string. Prime examples
of this include the UNIX “grep” command and the search features included in
word processing packages such as Microsoft Word. String matching of this sort relies on a well-defined description
of the target string. This description
is defined either as a regular expression or as a text string modified by
wild-card operator. In this manner, the
search capability is expanded beyond a strict one-to-one match, and can find
matches for similar words. For example,
in Microsoft Word, a search for “?ouse” will find word or word parts that
contain the phrase “ouse” preceeded by at least one non-whitespace character.
While this is sufficient for some applications, there are times
when we do not know the exact phrase or do not want to trouble a user with
learning how to properly format regular expressions. As such, I have endeavored to explore various methods to perform
a best-match search for a given target string.
Before examining specific algorithms, it may be
advantageous to examine why simple string matching fails. Simple string matching performs a strict
comparison of the target text to substrings of the text. As such, a search for the word “goose” will
fail to find the word “geese”, as a search for the word "separate” will
miss the common misspelling “seperate.”
I have listed below factors that cause a simple string-matching
algorithm to fail:
Transposition: switching of one or more letters
Deletion
/ Addition: some
misspellings occur from adding an extra character or leaving a character out,
thus lengthening or shortening the search string.
Singular
/ Plural forms:
searching for “puppy” will fail to match an instance of the word “puppies”
Tense: searching for “he walked down the road” may
fail to find an occurrence of “he walks down the road”
Phonetic
Misspellings: searching
for “cemetery” will fail to match a misspelled, yet phonetically correct,
instance of “cemetary”
Improper
Hyphenation / word separation: searches for “weekend” may fail for instances of improperly
spelled matches of “week end” and “week-end”
Substitution: replacing a correct
character with an incorrect one, such as typing the numeral zero instead of the
letter “o”.
One benefit of using a simple string-matching algorithm is in its simplicity. Some algorithms can be devised to compensate for some of the reasons above, yet they suffer from performance problems as well as necessitating a more complex algorithm. I have coded a “simple” string matching class that performs a straight character-to-character match and returns a copy of the matched string, as well as a method “pointer2match” that returns a pointer to the match as found in the text. This particular class is used as a base class for partial match algorithms I have implemented.
The second class I have included is an improved partial match algorithm. This particular class compensates for two of the cases listed above, namely transposition and substitution. It does not address phonetic misspellings, or erroneous deletions and additions of text. It also does not address the case of mistaken plurality. I have decided not to implement an algorithm that attempts to identify matches based on plurality or tense, as I want to try and create a general method for partial string matching. In most cases, we indicate plurality or singularity by appending a “s”, “es”, or a similar modification of the last syllable. However, the English language has exceptions to this (“goose” and “geese”, etc.) and the rules are different for other languages. Other rules translate well – transpositions surely occur in other languages. The same is true for errors in tense.
This improved class compares characters in the search string to characters in the text. A running total of matches is carried through each iteration, adding one to the sum for each correct match. The sum is then divided by the length of the search string to return an average number of characters found at the end of each iteration. If this average is greater than or equal to the specified accuracy, we flag this at a match.
As mentioned, this class also handles transpositions. This is done by comparing the previous and current character in the text, with the current and previous character in the search string. Providing the current and previous characters are not the same, we increment the sum by 1.75. This indicates that we are fairly sure that this is simply an erroneous transposition.
The primary problem with this algorithm is that it does not handle cases of a string that contains extra characters, or is missing characters. The previous search method relies on the search string and the evaluated substring of the document to be the same size. For example, consider how it handles searching for the word “cemetery” and the search comes across the misspelling “cemetary”.
Search string C E M E T E R Y
Potential match C E M E T A R Y
Match 1 1 1 1 1 0 1 1
We’ve matched seven out of eight characters – 87.5% correct. However, if the middle E is dropped out giving the misspelling “cemntery” the algorithm evaluates the string as:
Search string C E M E T E R Y
Potential match C E M T A R Y
Match 1 1 1 0 0 0 0 0
…matching three out of eight characters – 37.5% correct. This is also true for the opposite – the case in which the target string contains extra characters. It should be apparent that this algorithm works for certain situations, but does not work well for all errors.
We can look at this as two separate problems. Strings may contain extra characters, and they may be missing characters. The algorithm itself works by comparing each individual character to each individual character in the target string, and walks down the list. Each time we make a comparison, we either advance both the pointers from the search string and the target string, or, advance only the target string.
A correct match of the word “ABCD”
Iteration 1 Source ABCD
Target ABCD
Iteration 2 Source ABCD
Target ABCD
Iteration 3 Source ABCD
Target ABCD
Iteration 4 Source ABCD
Target ABCD
Handling
Extra Characters
Iteration 1 Source ABCD
Target ABXCD
Iteration 2 Source ABCD
Target ABXCD
Iteration 3 Source ABCD
Target ABXCD
Iteration 4 Source ABCD
Target ABXCD
Iteration 5 Source ABCD
Target ABXCD
During iteration 3, we expect to find C but instead have found X. We assume that this is an extra character, and do not advance the target string’s pointer, as we expect the next character to be the correct match.
We can handle misspellings such as the “cemetery” “cemetary” misspelling by advancing both pointers.
The third scenario is the missing character problem. If we are searching for the word “cemetery” and we find a string “cemetry” in the text, we need to advance the search pointer first while not advancing the target pointer:
A match missing characters
Iteration 1 Source ABCD
Target ABD
Iteration 2 Source ABCD
Target ABD
Iteration 3 Source ABCD
Target ABD
Iteration 4 Source ABCD
Target ABD
In iteration 3, we fail when comparing “C” to “D”. By advancing the search string’s pointer, we then carry on our way.
This gives us the three fundamental rules we need to perform a best-match search of a text. These methods can be applied recursively to substrings of the text.
The algorithm examines two strings; the search string, and a “window” of the text. The size of the window is determined by the specified accuracy. Specifically, the size is determined by
len(string) + ( len(string) - len(string)
* accuracy )
So, if we search for “The Quick Brown Fox” with a specified accuracy of .8, our window size becomes:
19 + ( 19 – (19*.8) )
= 19 + ( 19 – 15.2)
= 19 + 3.8
= 19 + 3
= 22
The choice to round down is arbitrary.
Given this search string and the window, we then recursively compare each character in the search string with the character in the window. If we both characters match, we repeat the process. If the characters do not match, we recursively call the match routine again, except we advance the pointers in three different methods.
1.
int
2.
variable::checkstring(
char* searchstr, char* substr, int depth, int curwrong )
3.
{
4.
int
res1;
5.
int
res2;
6.
int
res3;
7.
res1
= res2 = res3 = 0;
8.
if(
depth == (searchlen + 1) )
9.
{
10.
return searchlen + curwrong;
11.
}
12.
if(
*searchstr != *substr )
13.
{
14.
curwrong++;
15.
}
16.
if(
curwrong <= maxwrong)
17.
{
18.
if(
*searchstr == *substr)
19.
res1 = checkstring( searchstr+1, substr +
1, depth+1, curwrong);
20.
else
21.
{
22.
// assume extra character
23.
res1 = checkstring( searchstr, substr +
1, depth, curwrong);
24.
// assume correct position, but wrong
character
25.
res2 = checkstring( searchstr+1, substr +
1, depth+1, curwrong);
26.
// assume missing character;
27.
res3 = checkstring( searchstr+1, substr,
depth+1, curwrong);
28.
}
29.
if(
res1 > 0 ) return res1;
30.
if(
res2 > 0) return res2;
31.
if(
res3 > 0) return res3;
32.
return
0;
33.
}
34.
else
35.
{
36.
return 0;
37.
}
38.
}
Declaration and Initialization (lines 1 – 8):
The method takes as arguments a pointer to the search string searchstr, a pointer to a substring substr the current depth of recursion depth, and the current number of wrong characters seen curwrong. We then declare three integers res1-3 to hold results from three recursive calls to the same function. Substr is a window on the text that contains a possible match. The length of the search string and the desired accuracy decide the exact size of the window.
Base Case (lines 9 – 11):
We know when we’ve made a successful match when our depth is greater than the length of the string. The depth is roughly equivalent to the number of correct (and assumed correct) characters we’ve seen as we do our comparisons.
Error Detection and Compensation (lines 22-27):
Each call to this function compares two characters – one from the search string, and one from the search window. If these characters do not match, we make the following assumptions:
1.) An extra character in the window,
2.) A missing character in the window
3.) An erroneous character in the window
The Extra Character scenario is handled by advancing the window pointer, while not advancing the search string pointer, and recursively calling the checkstring function. The Missing Character scenario is addressed by advancing the search string pointer, while not advancing the search string pointer. Advancing both pointers solves the Erroneous Character scenario.
Complexity
Typical string matching algorithms will perform much faster than the described algorithm. The Rabin-Karp algorithm, for example, runs in O(m) = O(n-m+1)m. Even a non-optimized naďve approach will perform better, at q(n-m+1)m.
Complexity for this algorithm is in part determined by the desired accuracy. The less accuracy we desire, the longer the search will take. Each additional erroneous character we are willing to accept adds an extra iteration to each failed match. For example, if we are searching for the string “1234” and we are willing to accept an erroneous character, we don’t immediately reject the string “abcd”. Instead, we compare “1” with “a”, and then need to recurs, and check for the above cases. Specifically, we then check and see if “1” = “b”, “2” = “a” and “2” = “b” by way of recursion. The worst case scenario is when we have a string that almost matches, but fails at the end of the window rather than the beginning.
In the previous example, comparing “1234” to “abcd” requires four iterations: one to check if “1” equals “a”, and then the three recursions to handle a errors. If we are willing to accept two errors, we have our initial comparision, followed by three recursions, which in turn each call their own three comparisions. As such, each erroneous string takes 1 + 3acceptable_error.
Most of the time recursion will quickly come to a conclusion. There are some exceptions. Imagine we are looking for the string “11111” (five ones) and we are willing to except one erroneous character. If we come across the string “21111”, we run into a problem. According to our rules, when we do our initial comparision of “1” to “2”, we recurse with the assumptions that
1.) “1” is absent from the window
2.) “2” is an extra character in the window
3.) “2” is an erroneous character
The first recursion to handle (1.) ends quickly, as we advance the searchstring pointer, and end up comparing the next “1” with a “2” which fails. The second and third recursions to handle (2.) and (3.) both are accurate assumptions, and both paths work until their eventual conclusion. This situation gives us our first initial comparision, one dead-end recurse, and two recursions of depth 3 to handle case (2.) and (3.) for a total of 8 recursions. This sort of exception only occurs when there is repetition within the remainder of the string.
With that said, the complexity of the worst case comes down to (length(searchstring) – acceptable_error + 3^acceptable_error) * length(document)).