Playing with TLSH – Trend Micro Locality Sensitive Hash

There are cryptographic hashes, whose output are very sensitive to any change in the input given, and then then there are hashes which try to convey similarity between given inputs. One such hash is TLSH (github page here)

Here I make a quick demonstration on how simple it is to make TLSH generate a false positive, i.e give a far better similarity score to a text that has no resemblance to the original, than to a text that is almost identical to the original.

Three texts are compared – the original text t1, a second text t2, whose only difference is changing one character from a lowercase to an uppercase B, and a third text t3 which has no resemblance to either t1 or t2.

The text t3 was generated in very quickly using a genetic algorithm…

import tlsh 

# two texts with the only difference being that the second text (t2) uses a capital B for the first occurance of the word "building"
t1 = "Because highly fit schemata of low defining length and low order play such an important role in the action of genetic algorithms, we have already given them a special name: building blocks. Just as a child creates magnificent fortresses through the arrangement of simple blocks of wood, so does a genetic algorithm seek near optimal performance through the juxtaposition of short, low-order, high-performance schemata, or building blocks"
t2 = "Because highly fit schemata of low defining length and low order play such an important role in the action of genetic algorithms, we have already given them a special name: Building blocks. Just as a child creates magnificent fortresses through the arrangement of simple blocks of wood, so does a genetic algorithm seek near optimal performance through the juxtaposition of short, low-order, high-performance schemata, or building blocks"

# A third text, designed to generate a good Trend Micro Locality Sensitive Hash (TLSH) to the first string
t3 = "DYoFBaxhRRRzivXGOrBSglJZ,R.mEkbsf pbANUkMjnIeCwHaqdLWtyu, gqGlmiMpRwqzQmjayETZtxOfYP!Wnl..fsKHVQIzykSvJsUcbfXrLuFJmX.iBEAvDNZPcaqRVFZf ,PYQsMGkUnHKnlSgyXNtbqW.iLiMowTk!MJCNcp,AjqU!jhbGLbOenLnYElxuGKaSgmXdKMsZWtoyHRVTHPzKci.Q!H,aflYkKIFKyZgMENOsnzjKcdLtvVGDBiqCCxnPweTuPmbWShUodpZrJwWVWKGzEpgktsCproAbXufQe,LcxwPivaeFZOOZYHKnkVOPhIqDbxd.sbyzdpQQzWERIQJsvSogXutiINfJM.whcBYVCKEbx npXAFjmqQKkWe!MyzjTFBO,qTZc.ZjnAIptaxMYmSwGOx eXLuBhfoN!thQ"

h1 = tlsh.hash(t1.encode())
h2 = tlsh.hash(t2.encode())
h3 = tlsh.hash(t3.encode())

print("text 1\n--------\n%s\ntext 2\n--------\n%s\ntext 3\n--------\n%s\n" % (t1, t2, t3))
print("h1=%s\nh2=%s\nh3=%s\n" % (h1, h2, h3))

print("hash diff h1,h2 = %d\nhash diff h1,h3 = %d" % (tlsh.diff(h1, h2), tlsh.diff(h1, h3)))

the output of the above program is:

text 1
--------
Because highly fit schemata of low defining length and low order play such an important role in the action of genetic algorithms, we have already given them a special name: building blocks. Just as a child creates magnificent fortresses through the arrangement of simple blocks of wood, so does a genetic algorithm seek near optimal performance through the juxtaposition of short, low-order, high-performance schemata, or building blocks
text 2
--------
Because highly fit schemata of low defining length and low order play such an important role in the action of genetic algorithms, we have already given them a special name: Building blocks. Just as a child creates magnificent fortresses through the arrangement of simple blocks of wood, so does a genetic algorithm seek near optimal performance through the juxtaposition of short, low-order, high-performance schemata, or building blocks
text 3
--------
DYoFBaxhRRRzivXGOrBSglJZ,R.mEkbsf pbANUkMjnIeCwHaqdLWtyu, gqGlmiMpRwqzQmjayETZtxOfYP!Wnl..fsKHVQIzykSvJsUcbfXrLuFJmX.iBEAvDNZPcaqRVFZf ,PYQsMGkUnHKnlSgyXNtbqW.iLiMowTk!MJCNcp,AjqU!jhbGLbOenLnYElxuGKaSgmXdKMsZWtoyHRVTHPzKci.Q!H,aflYkKIFKyZgMENOsnzjKcdLtvVGDBiqCCxnPweTuPmbWShUodpZrJwWVWKGzEpgktsCproAbXufQe,LcxwPivaeFZOOZYHKnkVOPhIqDbxd.sbyzdpQQzWERIQJsvSogXutiINfJM.whcBYVCKEbx npXAFjmqQKkWe!MyzjTFBO,qTZc.ZjnAIptaxMYmSwGOx eXLuBhfoN!thQ

h1=50E0DC15DE2103E00AD2082113E8284263228498022001C1C0F8883021DFDB905BEBDE
h2=32E0D515EE2203E00AD2082113E8288263228898022002C2C0F8883022EEDB906BEBEE
h3=86E0DC15DE2203E01BD2402212E8284263228898032001C4C0F8882021DFDBA15BEBEE

hash diff h1,h2 = 84
hash diff h1,h3 = 17

Leave a Reply

Your email address will not be published. Required fields are marked *