Locality Sensitive Hashing (LSH) is a probabilistic hashing algorithm to find similar items from the given dataset efficiently. This algorithm is used in near duplicate detection, anomaly detection, audio fingerprinting, in search engines etc. This is my favourite algorithm among many favourites. I had implemented a variation of this at work before. This is the classical approach which I will illustrate using DreamLisp.

(defmodule lsh
  (export all))

(def dataset 
  ["hello world"
   "hello there"
   "this is a fine world"])

(def shingle-size 2)

(defun list->string (xs)
  "Converts a list of individual string characters to a string word."
  (apply str xs))

(defun range (n acc)
  (if (= n 0)
    (reverse acc)
    (range (dec n) (conj acc n))))

(defun range (n)
  "Generates a sequence of number starting from 1 till the given upper bound."
  (range n []))

(defun shingle (txt n acc)
  (if (= (count txt) n)
    (conj acc txt)
    (shingle (rest txt) n (conj acc (nth-tail 0 (dec n) txt)))))

(defun shingle (txt n)
  "Generate k-grams which is a group of k sequential tokens."
  (->> (shingle (seq txt) n [])
       (map (fn (x) (list->string x)))))

(def shingles-xs (map (fn (x) (apply set (shingle x shingle-size))) dataset))

(def vocab (into [] (apply union shingles-xs)))

(defun one-hot-encode (kgrams vocab)
  (foldl (fn (x acc) 
           (if (contains? x kgrams)
             (conj acc 1)
             (conj acc 0)))
         [] vocab))

(def one-hot-xs (map (fn (x) (one-hot-encode x vocab)) shingles-xs))

(defun get-shuffled-rand-vec ()
  "Returns a list of random indices with length same as the vocab."
  (map (fn (_) (shuffle (range (count vocab)))) (range 20)))

(defun minhash (rhash shingle-xs i)
  (if (empty? rhash)
    acc
    (let ((hash (first rhash))
          (idx-val (nth i shingle-xs)))
      (if (= idx-val 1)
        hash
        (minhash (rest rhash) shingle-xs (inc i))))))

(defun minhash (rhash shingles-xs)
  "Returns the index where signature hash has a 1 in shingles list."
  (minhash rhash shingles-xs 0))

(defun signature (encoded-xs rand-vec) 
  (map (fn (x) (minhash x encoded-xs)) rand-vec))

(def rand-vec (get-shuffled-rand-vec))
(def sig-a (signature (nth 0 one-hot-xs) rand-vec))
(def sig-b (signature (nth 1 one-hot-xs) rand-vec))
(def sig-c (signature (nth 2 one-hot-xs) rand-vec))

(print "\nsig-a: " sig-a)
(print "\nsig-b: " sig-b)
(print "\nsig-c: " sig-c "\n")

(defun jaccard (a b)
  "Find Jaccard similarity between two sets."
  (/ (count (intersect a b)) (count (union a b))))

(def j-ab (jaccard (into (set) [(nth 0 dataset)]) (into (set) [(nth 1 dataset)])))
(def j-sig-ab (jaccard (into (set) sig-a) (into (set) sig-b)))
(println "\nj(a, b), j(sig-a, sig-b)")
(println j-ab j-sig-ab)

(def j-ac (jaccard (into (set) [(nth 0 dataset)]) (into (set) [(nth 2 dataset)])))
(def j-sig-ac (jaccard (into (set) sig-a) (into (set) sig-c)))
(println "j(a, c), j(sig-a, sig-c)")
(println j-ac j-sig-ac)

(def j-bc (jaccard (into (set) [(nth 1 dataset)]) (into (set) [(nth 2 dataset)])))
(def j-sig-bc (jaccard (into (set) sig-b) (into (set) sig-c)))
(println "j(b, c), j(sig-b, sig-c)")
(println j-bc j-sig-bc)

(defun band (sig-xs size len acc)
  (if (= size len)
    (conj acc (nth-tail 0 (dec len) sig-xs))
    (band (nth-tail size (dec len) sig-xs)
          size 
          (- len size) 
          (conj acc (take size sig-xs)))))

(defun band (sig-xs size)
  "Split the signature vector into rows or bands of the given size."
  (band sig-xs size (count sig-xs) []))

(def band-a (band sig-a 2))
(def band-b (band sig-b 2))
(def band-c (band sig-c 2))

(println "\nband-a" band-a)
(println "band-b" band-b)
(println "band-c" band-c "\n")

(defun lsh (band-x band-y)
  "Checks for any collision for elements in the given bands."
  (if (empty? band-x) 
    false
    (if (= (first band-x) (first band-y))
      true
      (lsh (rest band-x) (rest band-y)))))

(println (lsh band-a band-b))  ; true
(println (lsh band-a band-c))  ; false
(println (lsh band-b band-c))  ; false

There are few parts to this algorithm.

  1. Shingling or generating k-grams for each item in the dataset.
  2. Doing one-hot encoding to reduce the dimensionality.
  3. Combining all of these to a vocab set.
  4. Minhashing to generate the signature.
  5. Banding the signature and finding matching pair which gives probable matches.

Shingling

Shingling is the process of generating k-grams for the given string length. Means, if we have a text "hello" and k as 2, we will have ["he" "el" "ll" "lo"].

The shingles list from the above example is:

[#{lo  w o  rl wo ld or he ll el}
 #{lo th o  er re  t he ll el} 
 #{in hi is a  or th fi ne  i  w wo e  rl ld  a s   f}]

Vocab

Shingles are all combined into a set called vocab. Since it's a set there are no duplicate k-grams.

The vocab here is:

[fi el  i  a or he o  is ne a  ll  t ld rl wo s  re in th hi  f er lo  w e ]

One-hot encoding

Next we do one-hot encoding for each k-grams. Here we go through each item in the vocab and checks if that k-gram is in the given shingles list. If present we add a 1 to the list else 0. We do this for each shingles list. This reduces the dimensionality of the k-grams vector.

The one-hot encoded list:

[[0 1 0 0 1 1 1 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 1 1 0] 
 [0 1 0 0 0 1 1 0 0 0 1 1 0 0 0 0 1 0 1 0 0 1 1 0 0] 
 [1 0 1 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 1 1 0 0 1 1]]

Minhashing

Here first we generate a list of indices starting from 1 to the length of the vocab. Since there are 25 items in the vocab, our list will be from [1, 25] inclusive.

Then we shuffle this vector to get a shuffled random vector. Minhash signature is a list of indices. One random vector will give only one index. We take the random vector and loop through the index. For an index we check if the item at that index is a 1 in the shingles list. If 1, that index is one hash in the signature list. We need more signatures to make the hash to collide. So we will construct 20 hashes. These are the parameters along with shingle size that we can tune.

So we generate 20 random index vectors and find the hash where the first index that is a 1 for the given shingle set. This is our signature list.

sig-a: [24 5 16 24 16 13 8 18 18 19 16 2 12 24 9 10 17 17 16 20]
sig-b: [24 5 16 24 16 13 8 18 18 19 16 2 12 24 9 10 17 17 16 20]
sig-c: [2 12 10 18 17 2 11 12 11 23 20 16 6 2 5 22 16 24 18 13]

Jaccard Similarity

We can use Jaccard Similarity to see how close the signature set is from the shingles set. This is only used as a measure and is not directly used in the algorithm.

          |A ∩ B|
J(A, B) = ———————
          |A ∪ B|
j(a, b), j(sig-a, sig-b)
= 0.5, 1

j(a, c), j(sig-a, sig-c)
= 0.5, 0.76470588235294117647058823529411764705

j(b, c), j(sig-b, sig-c)
= 0.5, 0.76470588235294117647058823529411764705

Banding

Now that we have our signature hash vectors, we need to split them into bands for comparison. If we find two bands to be the same, means this is a hash collision, which means a probable match.

For this we first construct our bands by splitting the signature vector into into n rows. Here we are choosing 2. This can be adjusted as needed.

band-a:
[[24 5] [16 24] [16 13] [8 18] [18 19] [16 2] [12 24] [9 10] [17 17] [16 20]]

band-b:
[[24 5] [16 24] [16 13] [8 18] [18 19] [16 2] [12 24] [9 10] [17 17] [16 20]]

band-c:
[[2 12] [10 18] [17 2] [11 12] [11 23] [20 16] [6 2] [5 22] [16 24] [18 13]]

Then we compare each row to find the probable match. We need to find only one match to infer that there is match between the two data elements.

Here we can see band-a and band-b has a collision on the first row. So the strings "hello world" and "hello there" are similar. It's not an exact match but that's what this algorithm is about. This reduces the search space. We can then do other comparisons on the resultant set to further refine as per our use case.

Locality Sensitive Hashing is like k-Nearest Neighbors but probabilistic in nature.