A simple implementation of simhash algorithm by java.
- compute the simhash of a string
- compute the similarity between all the strings by building smart index, so we can deal with big data.
- 
run Main with inputfile and outputfile. 
- 
The format of inputfile(see src/test_in): one doc eachline with the utf8 charset. 
- 
The format of outputfile(see src/test_out): 
- 
start //start flag 
- 
first line // doc 
- 
sencode lien // doc1\tdist where dist is the hamming distance between doc and doc1 
- 
end //end flag 
- Build the project to a runnable jar.
- Improve the performace under big data.
- Before run Main.java, you should choose a better analyzer instead of BinaryWordSeg!