Use Commons Codec's Soundex.
Supply a surname or a word, and Soundex will produce a phonetic
encoding:
// Required import declaration
import org.apache.commons.codec.language.Soundex;

// Code body
Soundex soundex = new Soundex( );
String obrienSoundex = soundex.soundex( "O'Brien" );
String obrianSoundex = soundex.soundex( "O'Brian" );
String obryanSoundex = soundex.soundex( "O'Bryan" );

System.out.println( "O'Brien soundex: " + obrienSoundex );
System.out.println( "O'Brian soundex: " + obrianSoundex );
System.out.println( "O'Bryan soundex: " + obryanSoundex );
This will produce the following output for three similar surnames:
O'Brien soundex: O165
O'Brian soundex: O165
O'Bryan soundex: O165
Soundex.soundex( ) takes a
string, preserves the first letter as a letter code, and then
calculates the rest of the code from the consonants that follow. So, names
such as "O'Bryan", "O'Brien", and "O'Brian", all common variants
of the same Irish surname, are given the same encoding: "O165". The 1
corresponds to the B, the 6 corresponds to the R, and the 5 corresponds
to the N; the apostrophe is dropped, and vowels contribute no digits to the code.
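To make the encoding steps concrete, here is a minimal sketch of the classic American Soundex rules in plain Java. It is not the Commons Codec source; the class name SoundexSketch, the method encode, and the CODES table are hypothetical names chosen for illustration. The sketch assumes ASCII letters only, treats H and W as transparent (they do not separate two consonants with the same digit), and pads short codes with zeros.

```java
import java.util.Locale;

// Illustrative sketch only; in an application, use
// org.apache.commons.codec.language.Soundex instead.
class SoundexSketch {

    // Digit for each letter A..Z; '0' marks letters that carry no digit
    // (vowels, H, W, and Y).
    private static final String CODES = "01230120022455012623010202";

    static String encode(String name) {
        // Keep only the letters A-Z, uppercased; this drops apostrophes,
        // spaces, and (by assumption) any non-ASCII characters.
        StringBuilder letters = new StringBuilder();
        for (char c : name.toUpperCase(Locale.ROOT).toCharArray()) {
            if (c >= 'A' && c <= 'Z') {
                letters.append(c);
            }
        }
        if (letters.length() == 0) {
            return "";
        }

        // The code is one letter followed by three digits, zero-padded.
        char[] out = {'0', '0', '0', '0'};
        out[0] = letters.charAt(0);
        int count = 1;

        // Remember the previous digit so adjacent letters with the same
        // digit are encoded only once.
        char last = CODES.charAt(letters.charAt(0) - 'A');
        for (int i = 1; i < letters.length() && count < out.length; i++) {
            char c = letters.charAt(i);
            if (c == 'H' || c == 'W') {
                continue; // transparent: does not reset the previous digit
            }
            char digit = CODES.charAt(c - 'A');
            if (digit != '0' && digit != last) {
                out[count++] = digit;
            }
            last = digit; // a vowel resets this, so a repeated consonant counts again
        }
        return new String(out);
    }
}
```

Under these assumptions, SoundexSketch.encode( "O'Brien" ) yields O165: the O is kept, B, R, and N map to 1, 6, and 5, and the I and E add nothing.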
The Soundex algorithm can be
used in a number of situations, but Soundex is usually associated with surnames,
as the United States historical census records are indexed using
Soundex. In addition to the role
Soundex plays in the census, Soundex is also used in the health care
industry to index medical records and report statistics to the
government. A system that provides access to individual records should let a user
search for a person by the Soundex
code of a surname. If a user types in the name "Boswell" to search for a
patient in a hospital, the search results should include patients named
"Buswell" and "Baswol", because all three names encode to B240. Soundex provides
this capability when an application needs to locate individuals by the sound of a
surname.
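One way to support this kind of lookup is to index surnames by their phonetic code. The following is a rough sketch, not a prescribed design: PhoneticIndex and its methods are hypothetical names, and the encoder is passed in as a function so the example compiles without any library on the classpath.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: groups surnames under the code an encoder produces.
class PhoneticIndex {

    private final Function<String, String> encoder;
    private final Map<String, List<String>> byCode = new HashMap<>();

    PhoneticIndex(Function<String, String> encoder) {
        this.encoder = encoder;
    }

    // Store the surname under its phonetic code.
    void add(String surname) {
        byCode.computeIfAbsent(encoder.apply(surname), k -> new ArrayList<>())
              .add(surname);
    }

    // Return every stored surname whose code matches the query's code,
    // in insertion order; an empty list if nothing matches.
    List<String> search(String surname) {
        return byCode.getOrDefault(encoder.apply(surname),
                                   Collections.emptyList());
    }
}
```

With Commons Codec available, the index could be created as new PhoneticIndex( new Soundex( )::soundex ). After adding "Buswell" and "Baswol", a search for "Boswell" would return both names, since all three share one Soundex code.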
The Soundex of a word or name
can also be used as a primitive method to find out if two small words
rhyme. Commons Codec contains other phonetic encodings, such as RefinedSoundex, Metaphone, and DoubleMetaphone. All of these alternatives
solve a similar problem: capturing the phonemes, or
sounds, contained in a word.
For more information on the Soundex encoding, take a look at the
Dictionary of Algorithms and Data Structures at the National Institute
of Standards and Technology (NIST), http://www.nist.gov/dads/HTML/soundex.html.
There you will find links to a C implementation of the Soundex algorithm.
For more information about alternatives to Soundex encoding, read "The Double Metaphone
Search Algorithm" by Lawrence Philips (http://www.cuj.com/documents/s=8038/cuj0006philips/).
Or take a look at one of Lawrence Philips's original Metaphone algorithm
implementations at http://aspell.sourceforge.net/metaphone/.
Both the Metaphone and Double Metaphone algorithms capture the sound of
an English word; implementations of these algorithms are available in
Commons Codec as Metaphone and
DoubleMetaphone.
