In text mining, one useful tool is a similarity percentage between two words, which can be used for clustering or other purposes. I am not very familiar with text mining, but it sounds like an interesting topic and I would like to study the field further. In the meantime, I was working on a piece of code that measures how similar two names are: if the percentage is more than X (e.g. 80%), the names are considered almost identical; otherwise, the user is prompted.
I did a bit of Googling and found a bunch of useful material, as well as some sample Java code on Stack Overflow, which I have copied here with slight modifications:
public class FindSimilarityPercentage {

    public static void main(String[] args) {
        System.out.println("Similarity between Hello and Yellow is " + similarity("Hello", "Yellow"));
    }

    // Returns the similarity of s1 and s2 as a whole-number percentage (0 to 100).
    public static int similarity(String s1, String s2) {
        String longer = s1, shorter = s2;
        // longer should always have the greater length
        if (s1.length() < s2.length()) {
            longer = s2;
            shorter = s1;
        }
        int longerLength = longer.length();
        // both strings are zero length, so they are identical
        if (longerLength == 0) {
            return 100;
        }
        // integer arithmetic avoids floating-point rounding errors before truncation
        return (longerLength - editDistance(longer, shorter)) * 100 / longerLength;
    }

    // Case-insensitive Levenshtein edit distance, computed with a single row of costs.
    public static int editDistance(String s1, String s2) {
        s1 = s1.toLowerCase();
        s2 = s2.toLowerCase();
        int[] costs = new int[s2.length() + 1];
        for (int i = 0; i <= s1.length(); i++) {
            // lastValue holds the cost for column j - 1 of the current row
            int lastValue = i;
            for (int j = 0; j <= s2.length(); j++) {
                if (i == 0) {
                    // first row: turning an empty prefix into s2[0..j) takes j insertions
                    costs[j] = j;
                } else {
                    if (j > 0) {
                        // costs[j - 1] still holds the previous row's diagonal value
                        int newValue = costs[j - 1];
                        if (s1.charAt(i - 1) != s2.charAt(j - 1)) {
                            // cheapest of substitution, insertion, deletion, plus one edit
                            newValue = Math.min(Math.min(newValue, lastValue), costs[j]) + 1;
                        }
                        costs[j - 1] = lastValue;
                        lastValue = newValue;
                    }
                }
            }
            if (i > 0) {
                costs[s2.length()] = lastValue;
            }
        }
        return costs[s2.length()];
    }
}
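The threshold check mentioned at the start of this post could be sketched roughly as follows. This is only an illustration under a few assumptions: the class name `NameMatcher`, the constant `THRESHOLD_PERCENT`, and the method `isAlmostIdentical` are made up for this example, the 80% cut-off is just the example value for X, and the edit distance is written in the textbook two-row form rather than the single-array form used above.

```java
// Hypothetical helper: decides whether two names are "almost identical".
final class NameMatcher {

    // Assumed cut-off; 80 is only the example value for X.
    static final int THRESHOLD_PERCENT = 80;

    // Case-insensitive Levenshtein distance using two rows of the cost table.
    static int editDistance(String a, String b) {
        a = a.toLowerCase();
        b = b.toLowerCase();
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // j insertions turn an empty string into b[0..j)
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // i deletions turn a[0..i) into an empty string
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity as a whole-number percentage of the longer string's length.
    static int similarityPercent(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 100; // two empty strings are identical
        }
        return (longest - editDistance(a, b)) * 100 / longest;
    }

    // "More than X" is read here as strictly greater than the threshold.
    static boolean isAlmostIdentical(String a, String b) {
        return similarityPercent(a, b) > THRESHOLD_PERCENT;
    }

    public static void main(String[] args) {
        String[][] pairs = { {"Jonathan", "Jonathon"}, {"Hello", "Yellow"} };
        for (String[] p : pairs) {
            String verdict = isAlmostIdentical(p[0], p[1])
                    ? "almost identical"
                    : "ask the user to confirm";
            System.out.println(p[0] + " / " + p[1] + ": "
                    + similarityPercent(p[0], p[1]) + "% -> " + verdict);
        }
    }
}
```

With these inputs, "Jonathan" and "Jonathon" differ by one substitution over eight characters (87%), so they pass the 80% check, while "Hello" and "Yellow" need two edits over six characters (66%), so the user would be prompted.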
For further reading on text mining, please refer to the following links: