Locality Sensitive Hash

There are cryptographic hashes, whose output is very sensitive to any change in the input, and then there are hashes which try to convey similarity between inputs. One such hash is TLSH (github page here).

Here I give a quick demonstration of how simple it is to make TLSH generate a false positive, i.e. give a far better similarity score to a text that has no resemblance to the original than to a text that is almost identical to the original.

Three texts are compared: the original text t1; a second text t2, whose only difference is that a single character is changed from a lowercase b to an uppercase B; and a third text t3, which has no resemblance to either t1 or t2.

The text t3 was generated very quickly using a genetic algorithm…
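
The post does not include the search code itself, but a minimal sketch of the idea might look something like the following. The alphabet, string length, population size and genetic operators here are my own assumptions, not the parameters actually used:

import random
import string

import tlsh

# Sketch only: these parameters are assumptions, not the ones actually used.
ALPHABET = string.ascii_letters + " .,!"
LENGTH = 440
POP_SIZE = 100
MUTATION_RATE = 0.02

def fitness(candidate, target_hash):
    # A lower TLSH difference score means the two hashes are considered more similar.
    return tlsh.diff(tlsh.hash(candidate.encode()), target_hash)

def mutate(candidate):
    # Independently replace each character with a small probability.
    return "".join(random.choice(ALPHABET) if random.random() < MUTATION_RATE else c
                   for c in candidate)

def crossover(a, b):
    # Single-point crossover of two parent strings.
    cut = random.randrange(1, LENGTH)
    return a[:cut] + b[cut:]

def evolve(target_text, generations=500):
    target_hash = tlsh.hash(target_text.encode())
    population = ["".join(random.choice(ALPHABET) for _ in range(LENGTH))
                  for _ in range(POP_SIZE)]
    for _ in range(generations):
        population.sort(key=lambda c: fitness(c, target_hash))
        parents = population[:POP_SIZE // 5]          # keep the fittest fifth
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(POP_SIZE - len(parents))]
        population = parents + children
    return min(population, key=lambda c: fitness(c, target_hash))

The comparison itself is then just a matter of hashing the three texts and printing the TLSH difference scores: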

import tlsh 

# two texts with the only difference being that the second text (t2) uses a capital B for the first occurrence of the word "building"
t1 = "Because highly fit schemata of low defining length and low order play such an important role in the action of genetic algorithms, we have already given them a special name: building blocks. Just as a child creates magnificent fortresses through the arrangement of simple blocks of wood, so does a genetic algorithm seek near optimal performance through the juxtaposition of short, low-order, high-performance schemata, or building blocks"
t2 = "Because highly fit schemata of low defining length and low order play such an important role in the action of genetic algorithms, we have already given them a special name: Building blocks. Just as a child creates magnificent fortresses through the arrangement of simple blocks of wood, so does a genetic algorithm seek near optimal performance through the juxtaposition of short, low-order, high-performance schemata, or building blocks"

# A third text, designed so that its Trend Micro Locality Sensitive Hash (TLSH) is close to the hash of the first string
t3 = "DYoFBaxhRRRzivXGOrBSglJZ,R.mEkbsf pbANUkMjnIeCwHaqdLWtyu, gqGlmiMpRwqzQmjayETZtxOfYP!Wnl..fsKHVQIzykSvJsUcbfXrLuFJmX.iBEAvDNZPcaqRVFZf ,PYQsMGkUnHKnlSgyXNtbqW.iLiMowTk!MJCNcp,AjqU!jhbGLbOenLnYElxuGKaSgmXdKMsZWtoyHRVTHPzKci.Q!H,aflYkKIFKyZgMENOsnzjKcdLtvVGDBiqCCxnPweTuPmbWShUodpZrJwWVWKGzEpgktsCproAbXufQe,LcxwPivaeFZOOZYHKnkVOPhIqDbxd.sbyzdpQQzWERIQJsvSogXutiINfJM.whcBYVCKEbx npXAFjmqQKkWe!MyzjTFBO,qTZc.ZjnAIptaxMYmSwGOx eXLuBhfoN!thQ"

h1 = tlsh.hash(t1.encode())
h2 = tlsh.hash(t2.encode())
h3 = tlsh.hash(t3.encode())

print("text 1\n--------\n%s\ntext 2\n--------\n%s\ntext 3\n--------\n%s\n" % (t1, t2, t3))
print("h1=%s\nh2=%s\nh3=%s\n" % (h1, h2, h3))

print("hash diff h1,h2 = %d\nhash diff h1,h3 = %d" % (tlsh.diff(h1, h2), tlsh.diff(h1, h3)))

The output of the above program is:

text 1
--------
Because highly fit schemata of low defining length and low order play such an important role in the action of genetic algorithms, we have already given them a special name: building blocks. Just as a child creates magnificent fortresses through the arrangement of simple blocks of wood, so does a genetic algorithm seek near optimal performance through the juxtaposition of short, low-order, high-performance schemata, or building blocks
text 2
--------
Because highly fit schemata of low defining length and low order play such an important role in the action of genetic algorithms, we have already given them a special name: Building blocks. Just as a child creates magnificent fortresses through the arrangement of simple blocks of wood, so does a genetic algorithm seek near optimal performance through the juxtaposition of short, low-order, high-performance schemata, or building blocks
text 3
--------
DYoFBaxhRRRzivXGOrBSglJZ,R.mEkbsf pbANUkMjnIeCwHaqdLWtyu, gqGlmiMpRwqzQmjayETZtxOfYP!Wnl..fsKHVQIzykSvJsUcbfXrLuFJmX.iBEAvDNZPcaqRVFZf ,PYQsMGkUnHKnlSgyXNtbqW.iLiMowTk!MJCNcp,AjqU!jhbGLbOenLnYElxuGKaSgmXdKMsZWtoyHRVTHPzKci.Q!H,aflYkKIFKyZgMENOsnzjKcdLtvVGDBiqCCxnPweTuPmbWShUodpZrJwWVWKGzEpgktsCproAbXufQe,LcxwPivaeFZOOZYHKnkVOPhIqDbxd.sbyzdpQQzWERIQJsvSogXutiINfJM.whcBYVCKEbx npXAFjmqQKkWe!MyzjTFBO,qTZc.ZjnAIptaxMYmSwGOx eXLuBhfoN!thQ

h1=50E0DC15DE2103E00AD2082113E8284263228498022001C1C0F8883021DFDB905BEBDE
h2=32E0D515EE2203E00AD2082113E8288263228898022002C2C0F8883022EEDB906BEBEE
h3=86E0DC15DE2203E01BD2402212E8284263228898032001C4C0F8882021DFDBA15BEBEE

hash diff h1,h2 = 84
hash diff h1,h3 = 17

4 Comments

    • The technique is not meant to be an attack against a hash (whether cryptographic or otherwise) but rather to show how easy it is to find similar hashes for different texts when using hash functions whose output does not change dramatically with small changes in the input text. Pearson hash is not such a function, as a small change in the input text can result in a large change in the output hash.

      I also assume you mean some modification of the Pearson hash where the hash size is larger than one byte (in the original implementation it is a single byte).
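
      To make that contrast concrete, here is a minimal classic 8-bit Pearson hash. The permutation table is arbitrary and the snippet is only an illustration (it is not part of TLSH), but it shows how a one-character change typically produces a completely different output:

      import random

      # Classic one-byte Pearson hash over an arbitrary 256-entry permutation table.
      random.seed(0)
      TABLE = list(range(256))
      random.shuffle(TABLE)

      def pearson8(data: bytes) -> int:
          h = 0
          for byte in data:
              h = TABLE[h ^ byte]   # each input byte re-scrambles the running state
          return h

      # A single-character change early in the input propagates through the rest of
      # the chain, so the two results will almost certainly be different bytes.
      print(pearson8(b"name: building blocks"), pearson8(b"name: Building blocks"))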

  1. If you measure the edit distance between the hashes, you will notice that the distance between h1 and h2 is less than that between h1 and h3.

    However, I would like to know how you generated h3.

    • Thanks for pointing out the edit distance – perhaps I should run the same process optimising for edit distance, so that the edit distance between h1 and h3 would also be smaller.

      Regarding generating the hash, the method is quite simple – basically it’s just running a genetic algorithm where the fitness function is a measure of the distance between the hash of an evolving text and the hash of the target text. I expect that if the fitness function calculated the edit distance, it could find a string for which the edit distance between h1 and h3 would be less than the edit distance between h1 and h2.

      Update: Out of curiosity, I ran the process to generate a string for which the edit distance (using Levenshtein distance) of h1,h3 would be smaller than the edit distance of h1,h2. This proved to be more challenging and (at least for the little time I let it run) the genetic algorithm could not get below the edit distance of h1,h2. Having said that, I’m quite certain that with a more refined algorithm and some time it is achievable.
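
      For reference, that kind of check can be done with a standard dynamic-programming Levenshtein distance over the hex hash strings printed earlier; the snippet below is only a sketch, not the exact code used:

      def levenshtein(a: str, b: str) -> int:
          # Standard dynamic-programming edit distance, computed row by row.
          prev = list(range(len(b) + 1))
          for i, ca in enumerate(a, 1):
              cur = [i]
              for j, cb in enumerate(b, 1):
                  cur.append(min(prev[j] + 1,                  # deletion
                                 cur[j - 1] + 1,               # insertion
                                 prev[j - 1] + (ca != cb)))    # substitution
              prev = cur
          return prev[-1]

      # Hash strings taken from the program output above.
      h1 = "50E0DC15DE2103E00AD2082113E8284263228498022001C1C0F8883021DFDB905BEBDE"
      h2 = "32E0D515EE2203E00AD2082113E8288263228898022002C2C0F8883022EEDB906BEBEE"
      h3 = "86E0DC15DE2203E01BD2402212E8284263228898032001C4C0F8882021DFDBA15BEBEE"
      print(levenshtein(h1, h2), levenshtein(h1, h3))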
