Monday, November 10, 2008

The Repeating Substring Problem

Here's a fun algorithm problem.

Given a string, find the longest repeating substring (if more than one exists of the same length, return the substring that is ordered first in accord with the natural ordering of Strings)

Example: "to be or not to be, that is the question"

Result: to be

There are various ways to solve this problem.

Here is one solution that includes an application of a suffix tree.

---------------------------------------------------------

import java.util.Arrays;

public class CommonString {
static int loopCnt = 0;
public static void main(String[] args) {
String a = "to be or not to be, that is the question";
String ret = commonString(a);
System.out.println(ret);
}
public static String commonString(String s) {

int totalLength = s.length();
String[] suffixTree = new String[totalLength];
for (int i = 0; i < totalLength; i++) {
suffixTree[i] = s.substring(i, totalLength);
}
Arrays.sort(suffixTree);
String ret = "";
for (int i = 0; i < totalLength - 1; i++) {
loopCnt++;
String temp = commonPrefix(suffixTree[i], suffixTree[i+1]);
if (temp.length() > ret.length())
ret = temp;
}
System.out.println("Length = " + totalLength + " loopCnt = " + loopCnt);
return ret;
}

public static String commonPrefix(String s, String t) {
int lower = s.length();
if(t.length() lower=t.length();
for (int i = 0; i < lower; i++) {
if (s.charAt(i) != t.charAt(i))
return s.substring(0, i);
}
return s.substring(0, lower);
}

}

-------------------------------------

Arrays.sort(suffixTree);

Since it's used in the above program, lets briefly discuss the Arrays class.

This class contains various methods for manipulating arrays (such as sorting and searching). Arrays is also a member of the Java Collection Framework. In this case, we're invoking the sort method while passing the method an array of Objects, so this method will sort the specified array of objects into ascending order, according to the natural ordering of its elements. I haven't double checked this, but I'm guessing that for Strings, this relates to the ASCII values or Unicode values of its characters.

After reflecting a bit on this problem, it seems to me that you could also solve it by hashing, as follows:

------------------------------------------------------------------------

for (i =1; i {
select all substrings of length string.length()-i and hash them, with the string as the key, and the frequencies as the value.

as soon as you find a substring where frequency = 2, finish hashing the substrings of this length and then break out of the for loop.
}

Traverse the substrings of length (string.length-i), to build a list of all substrings with frequency >=2, sort these according to the natural ordering of strings, and return the first one in your list.
------------------------------------------------------------------------

In total, there are (n(n+1))/2 possible substrings. The hashing is constant time, the traversals are linear time, and I assume that the final sorting at the end is n*log(n), so the total complexity remains n squared in a worst case.

I haven't thought about it too much, but my impression is that the above algorithm is probably just as fast as the suffix tree approach. Anyone care to comment on this?


No comments: