Phonetic String Matching : Soundex

A phonetic algorithm is an algorithm for indexing of words by their pronunciation. Most phonetic algorithms were developed for use with the English language; consequently, applying the rules to words in other languages might not give a meaningful result.

A phonetic algorithm is an algorithm for indexing of words by their pronunciation.

A phonetic algorithm is an algorithm for indexing of words by their pronunciation.

Typically in our applications we will try to compare literal strings, which only gets you so far. Sometimes we need to be a little cleverer and compare strings based on how they sound as opposed to how they are spelt. In this series of articles I want to show some implementations for some different phonetic comparison algorithms that you can use in your applications. Feel free to take the code from these articles and use them in your software.

The algorithms I am going to cover are Soundex, Levenshtein Distance, Metaphone and Double Metaphone.

There are quite a few different uses for phonetic comparison algorithms including:

    • Spell checkers can often contain phonetic algorithms. The Metaphone algorithm, for example, can take an incorrectly spelled word and create a code. The code is then looked up in directory for words with the same or similar Metaphone. Words that have the same or similar Metaphone become possible alternative spellings.
    • Search functionality will often use phonetic algorithms to find results that don’t match exactly the term(s) used in the search. Searching for names can be difficult as there are often multiple alternative spellings for names. An example is the name Claire. It has two alternatives, Clare/Clair, which are both pronounced the same. Searching for one spelling wouldn’t show results for the two others. Using Soundex all three variations produce the same Soundex code, C460. By searching names based on the Soundex code all three variations will be returned.
    • Record Linkage and Fraud Detection refers to the task of finding records in a dataset that refer to the same entity across different data sources. You can use the soundex or metaphone algorithms to link records with similarities in names. Fraudsters may often sign up for accounts using similar names or misspelt name. Using phonetic matching you can attempt to link these records together.

Soundex

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. The algorithm mainly encodes consonants; a vowel will not be encoded unless it is the first letter. Soundex is the most widely known of all phonetic algorithms.

The Soundex code for a name consists of a letter followed by three numerical digits: the letter is the first letter of the name, and the digits encode the remaining consonants. Similar sounding consonants share the same digit so, for example, the consonants B, F, P, and V are each encoded as the number 1.

The correct value can be found as follows:

    • Retain the first letter of the name and drop other occurrences of a, e, i, o, u, y, h, w.
    • Replace consonants with digits as follows (after the first letter):
      • b, f, p, v => 1
      • c, g, j, k, q, s, x, z => 2
      • d, t => 3
      • l => 4
      • m, n => 5
      • r => 6
    • If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter; also two letters with the same number separated by ‘h’ or ‘w’ are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.
    • Iterate the previous step until you have one letter and three numbers. If you have too few letters in your word that you can’t assign three numbers, append with zeros until there are three numbers. If you have more than 3 letters, just retain the first 3 numbers.

Using this algorithm, both “Robert” and “Rupert” return the same string “R163″ while “Rubin” yields “R150″. “Ashcraft” and “Ashcroft” both yield “A261″ and not “A226″ (the chars ‘s’ and ‘c’ in the name would receive a single number of 2 and not 22 since an ‘h’ lies in between them).

Implementation

Now it is time to look at an implementation of this algorithm. First of all I have started with an interface.

namespace PhoneticMatch
{
    public interface IPhoneticMatch
    {
        string CreateToken(string word);
    }
}

Then we have the following soundex implementation that implements the IPhoneticMatch interface.

using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

namespace PhoneticMatch
{
    public class Soundex : IPhoneticMatch
    {
        public string CreateToken(string word)
        {
            if (string.IsNullOrEmpty(word))
            {
                throw new ArgumentNullException("word");
            }

            const int maxSoundexCodeLength = 4;

            var soundexCode = new StringBuilder();

            word = Regex.Replace(word.ToUpper(), @"[^\w\s]", string.Empty);

            if (string.IsNullOrEmpty(word))
            {
                return string.Empty.PadRight(maxSoundexCodeLength, '0');
            }

            soundexCode.Append(word.First());
            ApplySoundexRules(word, soundexCode);

            return PadAndFormatReturnCode(soundexCode, maxSoundexCodeLength);
        }

        private static string PadAndFormatReturnCode(StringBuilder soundexCode, int maxSoundexCodeLength)
        {
            var returnValue = soundexCode.Replace("0", string.Empty).ToString();
            returnValue = returnValue.PadRight(maxSoundexCodeLength, '0');
            returnValue = returnValue.Substring(0, maxSoundexCodeLength);
            return returnValue;
        }

        private void ApplySoundexRules(string word, StringBuilder soundexCode)
        {
            var previousWasHorW = false;

            for (var i = 1; i < word.Length; i++)
            {
                var numberCharForCurrentLetter = GetCharNumberForLetter(word[i]);

                if (i == 1 && numberCharForCurrentLetter == GetCharNumberForLetter(soundexCode[0]))
                {
                    continue;
                }

                if (soundexCode.Length > 2 && previousWasHorW &&
                    numberCharForCurrentLetter == soundexCode[soundexCode.Length - 2])
                {
                    continue;
                }

                if (soundexCode.Length > 0 && numberCharForCurrentLetter == soundexCode[soundexCode.Length - 1])
                {
                    continue;
                }

                soundexCode.Append(numberCharForCurrentLetter);

                previousWasHorW = "HW".Contains(word[i]);
            }
        }

        private char GetCharNumberForLetter(char letter)
        {
            if ("BFPV".Contains(letter))
            {
                return '1';
            }

            if ("CGJKQSXZ".Contains(letter))
            {
                return '2';
            }

            if ("DT".Contains(letter))
            {
                return '3';
            }

            if ('L' == letter)
            {
                return '4';
            }

            if ("MN".Contains(letter))
            {
                return '5';
            }

            return 'R' == letter ? '6' : '0';
        }
    }
}

I have also created a few basic unit tests for this Soundex implementation written using MSTest. This prove that the algorithm works and also shows the basic usage.

using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using PhoneticMatch;

namespace PhoneticMatchTests
{
    [TestClass]
    public class SoundExTests
    {
        [TestMethod]
        [ExpectedException(typeof(ArgumentNullException))]
        public void CreateTokenThrowsArgumentNullExceptionIfWordIsNullOrEmpty()
        {
            IPhoneticMatch phonetic = new Soundex();
            phonetic.CreateToken(null);
        }

        [TestMethod]
        public void CreateTokenCreatesTokenForStephen()
        {
            IPhoneticMatch phonetic = new Soundex();
            string token = phonetic.CreateToken("stephen");
            Assert.AreEqual("S315", token);
        }

        [TestMethod]
        public void CreateTokenCreatesTokenForRobert()
        {
            IPhoneticMatch phonetic = new Soundex();
            string token = phonetic.CreateToken("Robert");
            Assert.AreEqual("R163", token);
        }

        [TestMethod]
        public void CreateTokenCreatesTokenForRupert()
        {
            IPhoneticMatch phonetic = new Soundex();
            string token = phonetic.CreateToken("Rupert");
            Assert.AreEqual("R163", token);
        }

        [TestMethod]
        public void CreateTokenCreatesTokenForStephentoMatchSteven()
        {
            IPhoneticMatch phonetic = new Soundex();
            string token1 = phonetic.CreateToken("Steven");
            string token2 = phonetic.CreateToken("Stephen");
            Assert.AreEqual(token1, token2);
        }

        [TestMethod]
        public void CreateTokenCreatesTokenForRobertMatchRupert()
        {
            IPhoneticMatch phonetic = new Soundex();
            string token1 = phonetic.CreateToken("Robert");
            string token2 = phonetic.CreateToken("Rupert");
            Assert.AreEqual(token1, token2);
        }

    }
}

The next article in this series is about the  Levenshtein Distance algorithm which takes a completely different approach.

Participate with Coding in the Trenches on Facebook

Participate with Coding in the Trenches on Facebook by Click the button above.

This entry was posted in Coding, Phonetic String Matching and tagged , . Bookmark the permalink.

2 Responses to Phonetic String Matching : Soundex

  1. Pingback: Phonetic String Matching : Levenshtein Distance | Stephen Haunts { Coding in the Trenches }

  2. Pingback: Phonetic String Matching : Metaphone | Stephen Haunts { Coding in the Trenches }

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s