The smlar Plug-in for Effective Retrieval of Massive Volumes of SimHash Data

Published in

DataDrivenInvestor

14 min readFeb 11, 2020

Background

SimHash is an algorithm that determines the similarity between data sets. Google Crawler uses SimHash to find similar pages and avoid content duplication.

With reference to the article mentioned above, SimHash algorithm helps to generate a vector fingerprint for each web page. So, if you are wondering how can we determine the similarity between documents using SimHash, then let’s take a quick look at the following steps to get an overview of the application using Hamming distance:

What Is Hamming Distance?
The Hamming distance between codewords is the number of bits with different values. The minimum Hamming distance between codewords in a valid code set is the Hamming distance of the code set. For example, 10101 and 00110 have different values in the first, fourth, and fifth bits, and therefore the Hamming distance is 3.

Geometric Significance of the Hamming Distance
An n-bit codeword can be presented by a hypercube vertex in the n-dimensional space. The Hamming distance between codewords is the edge of the shortest distance between hypercube vertices.

Hamming Distance Application Scenarios
The Hamming distance applies to code error check and correction.

Once fingerprints are extracted by using the SimHash algorithm, the Hamming distance is used to calculate the similarity. The SimHash algorithm is more suitable for long documents with more than 500 words. For short documents, the difference may be large, depending on the application scenario. According to data published in Google’s paper, two documents are considered similar or duplicate if the signature is 64-bit and the Hamming distance is 3. In this example, the Hamming distance of 3 is only for reference; different test values may be provided in different application scenarios.

Infographic: Journey to the Clouds | Data Driven Investor

Smart business leaders understand the value of utilizing the cloud. As data storage needs grow, as they have been…

www.datadriveninvestor.com

The technique explained above helps to obtain similarity. However, there is an efficiency issue when there are tens of billions of data records. When data is added continuously, the efficiency deteriorates with each new data record being compared with all the existing data in the database.

To improve deduplication efficiency for massive data, we can divide 64-bit fingerprints into four 16-bit data blocks. Based on the drawer principle, if two documents are similar when the Hamming distance is 3, they must have the same data in a block.

The next question that arises here is whether there are any databases that efficiently support such retrieval of massive data volumes.

Thanks to the smlar plug-in, the PostgreSQL database supports such efficient retrieval.

1. Requirements

To begin with, quickly search for SimHash data records that are similar to a specific SimHash data record from the pool of massive SimHash data.

2. Architecture Design

PostgreSQL supports several design methods to search for data whose Hamming distance is less than N within the massive data pool. Each design method has a different energy efficiency ratio (EER). You can select any of the following methods based on your own requirements.

a. Brute Force Computing

Single-node Multi-core Parallel Computing for Brute Force Scanning
The multi-core parallel computing provided by Alibaba Cloud ApsaraDB RDS for PostgreSQL 10 is used for brute force scanning.

Multi-node Multi-core Parallel Computing for Brute Force Scanning
The multi-level concurrent computing provided by Alibaba Cloud HybridDB for PostgreSQL is used for brute force scanning.

GPU and FPGA Capabilities for Brute Force Computing Acceleration
PostgreSQL provides extended interfaces that allow you to use the GPU and FPGA capabilities for data computing.

CPU Vector Computing Commands for Brute Force Computing
PostgreSQL provides extended interfaces that allow you to use CPU vector computing commands to accelerate computing.

b. Indexing

Indexing is an efficient retrieval method. For example, the PostgreSQL smlar plug-in uses indexes on the Alibaba shopping guide platform to query similar shopping guide articles from the pool of massive shopping guide articles in a timely manner.

To enable the smlar plug-in to accelerate data search based on the Hamming distance, it is required to employ more scientific methods, such as slicing.

Directly using positions causes problems because the first process of the smlar plug-in is block-level convergence. However, the Hamming code is 64-bit. A data block contains several records, and 0 and 1 may occur in any position at the same time. Since all data blocks contain 0 and 1, filtering fails in the first process.

We can use slicing to reduce the possibility of filtering failures. For example, every two, four, or more bits are classified as a slice.

Typically, if the Hamming distance between records is greater than 3, both records are not correlated with each other.

3. Demo and Performance

a. Brute Force Computing

To illustrate full scanning and parallel scanning, first, create a test table.

create table hm (  
  id int,        -- id  
  hmval bit(64)  -- 海明HASH  
);

Now, write 10 million test data records to the table.

postgres=# insert into hm select id, val::int8::bit(64) from (select id, sqrt(random())::numeric*9223372036854775807*2-9223372036854775807::numeric as val from generate_series(1,10000000) t(id)) t;  
INSERT 0 10000000  
  
postgres=# select * from hm limit 10;  
 id |                              hmval                                 
----+------------------------------------------------------------------  
  1 | 0000101001110110110101010111101011100110101010000111100011110111  
  2 | 0110011100110101101000001010101111010001011101100111111011001110  
  3 | 1010110111001011011110110000111111101101101111010111111100101110  
  4 | 0110011110110000001011000010010000101011100101010100111000101001  
  5 | 0101110100101111010110010110000000101110000010001011010110110000  
  6 | 0011010000100000101011011100000101111110010110111101100001100001  
  7 | 1011110011101101101000011101011101010111011001011010110111101000  
  8 | 1110010011000101001101110010001111110100001101010101111101110010  
  9 | 0110111111110011101001001000101101011011111100010010111010001111  
 10 | 0011100011000010101011010001111000000110100011100100111011011001  
(10 rows)

Set the brute force computing parallelism.

postgres=# set force_parallel_mode = on;  
postgres=# set min_parallel_table_scan_size = 0;  
postgres=# set parallel_setup_cost = 0;  
postgres=# set parallel_tuple_cost = 0;  
postgres=# alter table hm set (parallel_workers = 128);  
postgres=# set max_parallel_workers_per_gather = 64;

Concurrently query records whose Hamming distance is less than 4. This takes 463 ms to complete.

postgres=# select * from hm where length(replace(bitxor(bit'0011110001011010110010001011010101001000111110000111110010010110', hmval)::text,'0','')) < 4;  
 id |                              hmval                                 
----+------------------------------------------------------------------  
 16 | 0011110001011010110010001011010101001000111110000111110010010110  
(1 row)  
  
Time: 463.314 ms

Non-concurrently query records whose Hamming distance is less than 4. This takes 16 seconds to complete.

postgres=# select * from hm where length(replace(bitxor(bit'0011110001011010110010001011010101001000111110000111110010010110', hmval)::text,'0','')) < 4;  
 id |                              hmval                                 
----+------------------------------------------------------------------  
 16 | 0011110001011010110010001011010101001000111110000111110010010110  
(1 row)  
  
Time: 16791.215 ms (00:16.791)

There are more efficient methods to query the number of bits with different values between strings. Theoretically, this takes less than 100 ms to complete.

https://www.postgresql.org/message-id/flat/ab1ea6540903121110l2a3021d4h6632b206e2419898%40mail.gmail.com#ab1ea6540903121110l2a3021d4h6632b206e2419898@mail.gmail.com

b. Indexing

Alibaba Cloud RDS PostgreSQL provides a smlar plug-in to efficiently query the similarity between arrays. Therefore, we need to convert SimHash data records into arrays.

According to the previous design, classify eight bits into one array to search for values with the Hamming distance less than or equal to 8.

Before slicing, verify filtering after data is sliced.

postgres=# select relpages from pg_class where relname='hm';  
 relpages   
----------  
    63695  
(1 row)  
  
1、单个片为1时，不用说，每个块都包含。  
  
postgres=# select count(*) from (select substring(ctid::text,'(\d+),') from hm where substring(hmval,1,1)='0' group by 1)t;  
 count   
-------  
 63695  
(1 row)  
  
2、单个片为8时，有接近一半的块包含。  
  
postgres=# select count(*) from (select substring(ctid::text,'(\d+),') from hm where substring(hmval,1,8)='00000000' group by 1)t;  
 count   
-------  
 29100  
(1 row)  
  
3、单个片为16时，只有100多个块包含了。  
  
postgres=# select count(*) from (select substring(ctid::text,'(\d+),') from hm where substring(hmval,1,16)='0000000000000000' group by 1)t;  
 count   
-------  
   160  
(1 row)

Performance Verification of the 8-Array Slicing Method

Follow the steps listed below to test the performance:

Step 1. Create the smlar plug-in.

create extension smlar;

Step 2. Create a test table.

create table hm1 (id int, hmval bit(64), hmarr text[]);

Step 3. Now, generate 10 million test data records. Slice these data records based on the 8-array slicing method and record the sliced records as text arrays.

insert into hm1   
select   
  id,   
  val::bit(64),   
  regexp_split_to_array('1_'||substring(val,1,8)||',2_'||substring(val,9,8)||',3_'||substring(val,17,8)||',4_'||substring(val,25,8)||',5_'||substring(val,33,8)||',6_'||substring(val,41,8)||',7_'||substring(val,49,8)||',8_'||substring(val,57,8), ',')    
from   
(select id, (sqrt(random())::numeric*9223372036854775807*2-9223372036854775807::numeric)::int8::bit(64)::text as val from generate_series(1,10000000) t(id)) t;  
  
postgres=# select * from hm1 limit 10;  
 id |                              hmval                               |                                           hmarr                                             
----+------------------------------------------------------------------+-------------------------------------------------------------------------------------------  
  1 | 0000001110101101100110011000100111100100001100100101101010010011 | {1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}  
  2 | 0001001000010101001100100010101010111001001000000110101101100100 | {1_00010010,2_00010101,3_00110010,4_00101010,5_10111001,6_00100000,7_01101011,8_01100100}  
  3 | 0011111111010100011001001010110110100010101110101001101111010000 | {1_00111111,2_11010100,3_01100100,4_10101101,5_10100010,6_10111010,7_10011011,8_11010000}  
  4 | 1100110010011001001110101110111111111111010000100000010011000010 | {1_11001100,2_10011001,3_00111010,4_11101111,5_11111111,6_01000010,7_00000100,8_11000010}  
  5 | 0011000011010001011111010101010111100110000110000011101100000101 | {1_00110000,2_11010001,3_01111101,4_01010101,5_11100110,6_00011000,7_00111011,8_00000101}  
  6 | 0111101101111110101000010110101101110011011110100100010111011001 | {1_01111011,2_01111110,3_10100001,4_01101011,5_01110011,6_01111010,7_01000101,8_11011001}  
  7 | 0010001011111111100010101011110001001101001011100100011000010000 | {1_00100010,2_11111111,3_10001010,4_10111100,5_01001101,6_00101110,7_01000110,8_00010000}  
  8 | 1110001111100011011110110111101111010101000111000100111111111101 | {1_11100011,2_11100011,3_01111011,4_01111011,5_11010101,6_00011100,7_01001111,8_11111101}  
  9 | 0111110010111000010111001000000101111000000110110110000011101110 | {1_01111100,2_10111000,3_01011100,4_10000001,5_01111000,6_00011011,7_01100000,8_11101110}  
 10 | 0111001101100010001101101111000000100100000000010001010011100101 | {1_01110011,2_01100010,3_00110110,4_11110000,5_00100100,6_00000001,7_00010100,8_11100101}  
(10 rows)

Step 4. Create a smlar index.

postgres=# create index idx_hm1 on hm1 using gin(hmarr _text_sml_ops );

Step 5. Now, use the smlar index to search for values with the Hamming distance less than or equal to 1. It takes 63 ms to complete the query after the smlar index is used.

postgres=# set smlar.type = overlap;  
postgres=# set smlar.threshold = 7;  
  
select    
    *,    
    smlar( hmarr, '{1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}')    
  from    
    hm1    
  where    
    hmarr % '{1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}'      
    and length(replace(bitxor(bit'0000001110101101100110011000100111100100001100100101101010010011', hmval)::text,'0','')) < 2  
  limit 100;  
  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select    
    *,    
    smlar( hmarr, '{1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}')    
  from    
    hm1    
  where    
    hmarr % '{1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}'      
    and length(replace(bitxor(bit'0000001110101101100110011000100111100100001100100101101010010011', hmval)::text,'0','')) < 2  
  limit 100;  
                                                                            QUERY PLAN                                                                               
-------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=117.83..420.48 rows=100 width=169) (actual time=62.928..62.929 rows=1 loops=1)  
   Output: id, hmval, hmarr, (smlar(hmarr, '{1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}'::text[]))  
   Buffers: shared hit=166  
   ->  Bitmap Heap Scan on public.hm1  (cost=117.83..10205.17 rows=3333 width=169) (actual time=62.927..62.927 rows=1 loops=1)  
         Output: id, hmval, hmarr, smlar(hmarr, '{1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}'::text[])  
         Recheck Cond: (hm1.hmarr % '{1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}'::text[])  
         Filter: (length(replace((bitxor(B'0000001110101101100110011000100111100100001100100101101010010011'::"bit", hm1.hmval))::text, '0'::text, ''::text)) < 2)  
         Heap Blocks: exact=1  
         Buffers: shared hit=166  
         ->  Bitmap Index Scan on idx_hm1  (cost=0.00..117.00 rows=10000 width=0) (actual time=62.898..62.898 rows=1 loops=1)  
               Index Cond: (hm1.hmarr % '{1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}'::text[])  
               Buffers: shared hit=165  
 Planning time: 0.147 ms  
 Execution time: 62.975 ms  
(14 rows)  
  
postgres=# select                  
    *,    
    smlar( hmarr, '{1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}')    
  from    
    hm1    
  where    
    hmarr % '{1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}'      
    and length(replace(bitxor(bit'0000001110101101100110011000100111100100001100100101101010010011', hmval)::text,'0','')) < 2  
  limit 100;  
 id |                              hmval                               |                                           hmarr                                           | smlar   
----+------------------------------------------------------------------+-------------------------------------------------------------------------------------------+-------  
  1 | 0000001110101101100110011000100111100100001100100101101010010011 | {1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011} |     8  
(1 row)  
Time: 61.227 ms

To query values with the Hamming distance less than or equal to 4, we can use the 16-array slicing method or a combination of different slicing methods.

Performance Verification of the Hybrid 6-Array Slicing Method

The configuration of the 6-array slicing method is 8-bit, 16-bit, 8-bit, 8-bit, 16-bit, and 8-bit. This slicing method can query values with the Hamming distance less than or equal to 6.

create table hm2 (id int, hmval bit(64), hmarr text[]);  
  
insert into hm2   
select   
  id,   
  val::bit(64),   
  regexp_split_to_array('1_'||substring(val,1,8)||',2_'||substring(val,9,16)||',3_'||substring(val,25,8)||',4_'||substring(val,33,8)||',5_'||substring(val,41,16)||',6_'||substring(val,57,8), ',')    
from   
(select id, (sqrt(random())::numeric*9223372036854775807*2-9223372036854775807::numeric)::int8::bit(64)::text as val from generate_series(1,10000000) t(id)) t;  
  
postgres=# select * from hm2 limit 10;  
 id |                              hmval                               |                                        hmarr                                          
----+------------------------------------------------------------------+-------------------------------------------------------------------------------------  
  1 | 1100111011000001100100100111111110100011100111111101101001101010 | {1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}  
  2 | 0111111000101011000111010011011000000010010001111001000111011101 | {1_01111110,2_0010101100011101,3_00110110,4_00000010,5_0100011110010001,6_11011101}  
  3 | 0111111000101111000101011100100000001111011101101100110100000101 | {1_01111110,2_0010111100010101,3_11001000,4_00001111,5_0111011011001101,6_00000101}  
  4 | 0111010101010010100000110001100011110010111000001011000010010010 | {1_01110101,2_0101001010000011,3_00011000,4_11110010,5_1110000010110000,6_10010010}  
  5 | 1111101100110100101111000011001011111110111000100110101001100001 | {1_11111011,2_0011010010111100,3_00110010,4_11111110,5_1110001001101010,6_01100001}  
  6 | 0011110000100010101001000001100010000010111011100010011001000110 | {1_00111100,2_0010001010100100,3_00011000,4_10000010,5_1110111000100110,6_01000110}  
  7 | 0000111111001110100110011110000110001101110111111111111010111001 | {1_00001111,2_1100111010011001,3_11100001,4_10001101,5_1101111111111110,6_10111001}  
  8 | 0110100010010100111100110110000011101110101001001111010101011111 | {1_01101000,2_1001010011110011,3_01100000,4_11101110,5_1010010011110101,6_01011111}  
  9 | 0111001111001100101011001001100100000000111100000110110001000011 | {1_01110011,2_1100110010101100,3_10011001,4_00000000,5_1111000001101100,6_01000011}  
 10 | 1101111101011000111100101010101000100001101100101110100001111000 | {1_11011111,2_0101100011110010,3_10101010,4_00100001,5_1011001011101000,6_01111000}  
(10 rows)  
  
create index idx_hm2 on hm2 using gin(hmarr _text_sml_ops );

The query time is reduced to 2 ms when values with a Hamming distance less than or equal to 1 are queried.

postgres=# set smlar.type = overlap;  
postgres=# set smlar.threshold = 5;  
  
postgres=# select    
    *,    
    smlar( hmarr, '{1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}')    
  from    
    hm2   
  where    
    hmarr % '{1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}'      
    and length(replace(bitxor(bit'1100111011000001100100100111111110100011100111111101101001101010', hmval)::text,'0','')) < 2  
  limit 100;  
 id |                              hmval                               |                                        hmarr                                        | smlar   
----+------------------------------------------------------------------+-------------------------------------------------------------------------------------+-------  
  1 | 1100111011000001100100100111111110100011100111111101101001101010 | {1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010} |     6  
(1 row)  
Time: 1.954 ms  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select    
    *,    
    smlar( hmarr, '{1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}')    
  from    
    hm2   
  where    
    hmarr % '{1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}'      
    and length(replace(bitxor(bit'1100111011000001100100100111111110100011100111111101101001101010', hmval)::text,'0','')) < 2  
  limit 100;  
                                                                            QUERY PLAN                                                                               
-------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=103.83..406.06 rows=100 width=153) (actual time=2.414..2.416 rows=1 loops=1)  
   Output: id, hmval, hmarr, (smlar(hmarr, '{1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}'::text[]))  
   Buffers: shared hit=102  
   ->  Bitmap Heap Scan on public.hm2  (cost=103.83..10177.17 rows=3333 width=153) (actual time=2.414..2.415 rows=1 loops=1)  
         Output: id, hmval, hmarr, smlar(hmarr, '{1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}'::text[])  
         Recheck Cond: (hm2.hmarr % '{1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}'::text[])  
         Filter: (length(replace((bitxor(B'1100111011000001100100100111111110100011100111111101101001101010'::"bit", hm2.hmval))::text, '0'::text, ''::text)) < 2)  
         Heap Blocks: exact=1  
         Buffers: shared hit=102  
         ->  Bitmap Index Scan on idx_hm2  (cost=0.00..103.00 rows=10000 width=0) (actual time=2.374..2.374 rows=1 loops=1)  
               Index Cond: (hm2.hmarr % '{1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}'::text[])  
               Buffers: shared hit=101  
 Planning time: 0.149 ms  
 Execution time: 2.463 ms  
(14 rows)

Performance Verification of the 4-Array Slicing Method

create table hm3 (id int, hmval bit(64), hmarr text[]);  
  
insert into hm3   
select   
  id,   
  val::bit(64),   
  regexp_split_to_array('1_'||substring(val,1,16)||',2_'||substring(val,17,16)||',3_'||substring(val,33,16)||',4_'||substring(val,41,16), ',')    
from   
(select id, (sqrt(random())::numeric*9223372036854775807*2-9223372036854775807::numeric)::int8::bit(64)::text as val from generate_series(1,10000000) t(id)) t;  
  
postgres=# select * from hm3 limit 10;  
 id |                              hmval                               |                                     hmarr                                       
----+------------------------------------------------------------------+-------------------------------------------------------------------------------  
  1 | 0101011111111010000001001011101101100011111101111101101100000011 | {1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}  
  2 | 1101011000010000000000000000111011011111011101110100000010101011 | {1_1101011000010000,2_0000000000001110,3_1101111101110111,4_0111011101000000}  
  3 | 0101000010110110110010001010100010101001001010111111011000110011 | {1_0101000010110110,2_1100100010101000,3_1010100100101011,4_0010101111110110}  
  4 | 0111000111100011111000100111000011101111110000011110101101000100 | {1_0111000111100011,2_1110001001110000,3_1110111111000001,4_1100000111101011}  
  5 | 0010111010101011111010011110110010011110111111110011101110010011 | {1_0010111010101011,2_1110100111101100,3_1001111011111111,4_1111111100111011}  
  6 | 0110111110011100100110010111010000000011100011000011110001010110 | {1_0110111110011100,2_1001100101110100,3_0000001110001100,4_1000110000111100}  
  7 | 0100110100111001110011011110100111101110101001000101010110110110 | {1_0100110100111001,2_1100110111101001,3_1110111010100100,4_1010010001010101}  
  8 | 0110010111001100111000011011011100001100111111101111011010100010 | {1_0110010111001100,2_1110000110110111,3_0000110011111110,4_1111111011110110}  
  9 | 0110111010110000001010101111000101110000010011100011100101000100 | {1_0110111010110000,2_0010101011110001,3_0111000001001110,4_0100111000111001}  
 10 | 0101101000000110100101100011111111000101110001010011100110101011 | {1_0101101000000110,2_1001011000111111,3_1100010111000101,4_1100010100111001}  
(10 rows)  
  
create index idx_hm3 on hm3 using gin(hmarr _text_sml_ops );

The query time is reduced to 0.2 ms when values with a Hamming distance less than or equal to 1 are queried.

postgres=# set smlar.type = overlap;  
postgres=# set smlar.threshold = 3;  
  
postgres=# select    
    *,    
    smlar( hmarr, '{1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}')    
  from    
    hm3  
  where    
    hmarr % '{1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}'      
    and length(replace(bitxor(bit'0101011111111010000001001011101101100011111101111101101100000011', hmval)::text,'0','')) < 2  
  limit 100;  
 id |                              hmval                               |                                     hmarr                                     | smlar   
----+------------------------------------------------------------------+-------------------------------------------------------------------------------+-------  
  1 | 0101011111111010000001001011101101100011111101111101101100000011 | {1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011} |     4  
(1 row)  
  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select    
    *,    
    smlar( hmarr, '{1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}')    
  from    
    hm3  
  where    
    hmarr % '{1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}'      
    and length(replace(bitxor(bit'0101011111111010000001001011101101100011111101111101101100000011', hmval)::text,'0','')) < 2  
  limit 100;  
                                                                            QUERY PLAN                                                                               
-------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=99.83..401.19 rows=100 width=134) (actual time=0.169..0.170 rows=1 loops=1)  
   Output: id, hmval, hmarr, (smlar(hmarr, '{1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}'::text[]))  
   Buffers: shared hit=14  
   ->  Bitmap Heap Scan on public.hm3  (cost=99.83..10144.17 rows=3333 width=134) (actual time=0.168..0.169 rows=1 loops=1)  
         Output: id, hmval, hmarr, smlar(hmarr, '{1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}'::text[])  
         Recheck Cond: (hm3.hmarr % '{1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}'::text[])  
         Filter: (length(replace((bitxor(B'0101011111111010000001001011101101100011111101111101101100000011'::"bit", hm3.hmval))::text, '0'::text, ''::text)) < 2)  
         Heap Blocks: exact=1  
         Buffers: shared hit=14  
         ->  Bitmap Index Scan on idx_hm3  (cost=0.00..99.00 rows=10000 width=0) (actual time=0.145..0.145 rows=1 loops=1)  
               Index Cond: (hm3.hmarr % '{1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}'::text[])  
               Buffers: shared hit=13  
 Planning time: 0.101 ms  
 Execution time: 0.200 ms  
(14 rows)

The query time remains several milliseconds when values with a Hamming distance less than or equal to 4 are queried.

postgres=# set smlar.type = overlap;  
postgres=# set smlar.threshold = 0;  
  
postgres=# select    
    *,    
    smlar( hmarr, '{1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}')    
  from    
    hm3  
  where    
    hmarr % '{1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}'      
    and length(replace(bitxor(bit'0101011111111010000001001011101101100011111101111101101100000011', hmval)::text,'0','')) < 5  
  limit 100;  
 id |                              hmval                               |                                     hmarr                                     | smlar   
----+------------------------------------------------------------------+-------------------------------------------------------------------------------+-------  
  1 | 0101011111111010000001001011101101100011111101111101101100000011 | {1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011} |     4  
(1 row)  
Time: 6.983 ms

The query takes 23 seconds to complete if no index is used.

postgres=# select * from hm3 where length(replace(bitxor(bit'0101011111111010000001001011101101100011111101111101101100000011', hmval)::text,'0','')) < 5;  
 id |                              hmval                               |                                     hmarr                                       
----+------------------------------------------------------------------+-------------------------------------------------------------------------------  
  1 | 0101011111111010000001001011101101100011111101111101101100000011 | {1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}  
(1 row)  
  
Time: 22954.686 ms

Compared with scenarios in which no index is used, the performance is improved by 114,800 times from 23 seconds to 0.2 milliseconds.

Automatic Slicing

If you are wondering how to generate arrays once the SimHash data is sliced, then don’t worry. PostgreSQL provides powerful user-defined functions (UDFs) to generate arrays.

You can create UDFs and triggers, and once the data is written, arrays are generated automatically upon slicing. The following example illustrates how automatic slicing works:

create table hm4 (id int, hmval bit(64), hmarr text[]);  create or replace function sp(val bit(64)) returns text[] as $$
select regexp_split_to_array('1_'||substring(val::text,1,10)||',2_'||substring(val::text,11,10)||',3_'||substring(val::text,21,10)||',4_'||substring(val::text,31,10)||',5_'||substring(val::text,41,10)||',6_'||substring(val::text,51,14), ',') ;            
$$ language sql strict;postgres=# select sp(123::bit(64));
                                         sp                                          
-------------------------------------------------------------------------------------
 {1_0000000000,2_0000000000,3_0000000000,4_0000000000,5_0000000000,6_00000001111011}
(1 row)-- 写入1亿记录insert into hm4  
select   
  id,   
  val::bit(64),   
  sp(val::bit(64))    
from   
(select id, (sqrt(random())::numeric*9223372036854775807*2-9223372036854775807::numeric)::int8::bit(64)::text as val from generate_series(1,100000000) t(id)) t;  -- 索引create index idx_hm4 on hm4 using gin(hmarr _text_sml_ops );  -- 查询海明距离小于等于1的记录，性能杠杠的postgres=# set smlar.type = overlap;  
postgres=# set smlar.threshold = 5; select    
    *,    
    smlar( hmarr, sp(123::bit(64)))     -- 查询与123::bit(64)海明距离小于2的记录
  from    
    hm3  
  where    
    hmarr % sp(123::bit(64))      -- 查询与123::bit(64)海明距离小于2的记录
    and length(replace(bitxor(123::bit(64), hmval)::text,'0','')) < 2      -- 查询与123::bit(64)海明距离小于2的记录
  limit 100;  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select
    *,
    smlar( hmarr, sp(123::bit(64)))     -- 查询与123::bit(64)海明距离小于2的记录
  from
    hm3
  where
    hmarr % sp(123::bit(64))      -- 查询与123::bit(64)海明距离小于2的记录
    and length(replace(bitxor(123::bit(64), hmval)::text,'0','')) < 2      -- 查询与123::bit(64)海明距离小于2的记录
  limit 100;
                                                                             QUERY PLAN                                                                              
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=109.83..411.19 rows=100 width=134) (actual time=0.078..0.078 rows=0 loops=1)
   Output: id, hmval, hmarr, (smlar(hmarr, '{1_0000000000,2_0000000000,3_0000000000,4_0000000000,5_0000000000,6_00000001111011}'::text[]))
   Buffers: shared hit=19
   ->  Bitmap Heap Scan on public.hm3  (cost=109.83..10154.17 rows=3333 width=134) (actual time=0.076..0.076 rows=0 loops=1)
         Output: id, hmval, hmarr, smlar(hmarr, '{1_0000000000,2_0000000000,3_0000000000,4_0000000000,5_0000000000,6_00000001111011}'::text[])
         Recheck Cond: (hm3.hmarr % '{1_0000000000,2_0000000000,3_0000000000,4_0000000000,5_0000000000,6_00000001111011}'::text[])
         Filter: (length(replace((bitxor(B'0000000000000000000000000000000000000000000000000000000001111011'::bit(64), hm3.hmval))::text, '0'::text, ''::text)) < 2)
         Buffers: shared hit=19
         ->  Bitmap Index Scan on idx_hm3  (cost=0.00..109.00 rows=10000 width=0) (actual time=0.074..0.074 rows=0 loops=1)
               Index Cond: (hm3.hmarr % '{1_0000000000,2_0000000000,3_0000000000,4_0000000000,5_0000000000,6_00000001111011}'::text[])
               Buffers: shared hit=19
 Planning time: 0.592 ms
 Execution time: 0.117 ms
(13 rows)

Next, you need to create a trigger. When SimHash data is written, arrays resulting from slicing are written automatically.

create or replace function tg() returns trigger as $$
declare
begin
  NEW.hmarr := sp(NEW.hmval);
  return NEW;
end;
$$ language plpgsql strict;postgres=# create trigger tg before insert or update on hm4 for each row execute procedure tg();
CREATE TRIGGER-- 效果很赞postgres=# truncate hm4;
TRUNCATE TABLE
postgres=# insert into hm4 values (1,1::bit(64));
INSERT 0 1
postgres=# select * from hm4;
 id |                              hmval                               |                                        hmarr                                        
----+------------------------------------------------------------------+-------------------------------------------------------------------------------------
  1 | 0000000000000000000000000000000000000000000000000000000000000001 | {1_0000000000,2_0000000000,3_0000000000,4_0000000000,5_0000000000,6_00000000000001}
(1 row)postgres=# update hm4 set hmval=123456::bit(64);
UPDATE 1
postgres=# select * from hm4;
 id |                              hmval                               |                                        hmarr                                        
----+------------------------------------------------------------------+-------------------------------------------------------------------------------------
  1 | 0000000000000000000000000000000000000000000000011110001001000000 | {1_0000000000,2_0000000000,3_0000000000,4_0000000000,5_0000000111,6_10001001000000}
(1 row)

Introduction to GIN-Index-Based Similarity Algorithms

Let’s take a look at the various GIN-index-based similarity algorithms:

1. Cosine

相似值算法
                case ST_COSINE:
                        {
                                int                             cnt;
                                double                  power;                                power = ((double)(sa->nelems)) * ((double)(sb->nelems));
                                cnt = numOfIntersect(sa, sb);                                PG_RETURN_FLOAT4(  ((double)cnt) / sqrt( power ) );
                        }
GIN一重过滤算法
                                case ST_COSINE:
                                                {
                                                        double                  power;                                                        power = sqrt( ((double)(sa->nelems)) * ((double)(cnt)) );                                                        if (  ((double)cnt) / power >= GetSmlarLimit()  )
                                                                res = true;
                                                }

2. Overlap

相似值算法
                case ST_OVERLAP:
                        {
                                float4 res = (float4)numOfIntersect(sa, sb);                                PG_RETURN_FLOAT4(res);
                        }
GIN一重过滤算法
                                case ST_OVERLAP:
                                                if (cnt >= GetSmlarLimit())
                                                        res = true;

3. TF-IDF

Currently, data retrieval based on the GIN index is supported only when the TF value is equal to the IDF value. Additionally, the IDF value is written to the table. This implies tha an index supports the IDF value but not the TF value. The algorithm itself supports both TF and IDF.

To enable a GIN index to support TF, you need to modify the smlar code. However, since GiST indexes already support TF, you can use GiST indexes instead. You can refer the blog -
Application of PostgreSQL Together with Cosine and Linear Correlation Algorithms in the Text, Image, and Array Similarity Fields (2) — smlar Plug-in Introduction for more information.

The smlar plug-in can be used to determine the similarity between short documents.

相似值算法
考虑了IDF的权重(一张IDF TABLE来记录权重)。相比cosine算法，更适合短文本相似判断。 但是请注意不考虑TF。
例如一个短文的关键词向量为 (我1, 爱1, 中国5),这里的tf 115全部不计。    static double
TFIDFSml(SimpleArray *a, SimpleArray *b)
{
        int                             cmp;
        Datum                   *aptr = a->elems,
                                        *bptr = b->elems;
        ProcTypeInfo    info = a->info;
        double                  res = 0.0;
        double                  suma = 0.0, sumb = 0.0;        Assert( a->info->typid == b->info->typid );
        Assert( a->df );
        Assert( b->df );        getFmgrInfoCmp(info);        while( aptr - a->elems < a->nelems && bptr - b->elems < b->nelems )
        {
                cmp = cmpArrayElem(aptr, bptr, info);
                if ( cmp < 0 )
                {
                        suma += a->df[ aptr - a->elems ] * a->df[ aptr - a->elems ];
                        aptr++;
                }
                else if ( cmp > 0 )
                {
                        sumb += b->df[ bptr - b->elems ] * b->df[ bptr - b->elems ];
                        bptr++;
                }
                else
                {
                        res += a->df[ aptr - a->elems ] * b->df[ bptr - b->elems ];
                        suma += a->df[ aptr - a->elems ] * a->df[ aptr - a->elems ];
                        sumb += b->df[ bptr - b->elems ] * b->df[ bptr - b->elems ];
                        aptr++;
                        bptr++;
                }
        }        /*
         * Compute last elements
         */
        while( aptr - a->elems < a->nelems )
        {
                suma += a->df[ aptr - a->elems ] * a->df[ aptr - a->elems ];
                aptr++;
        }        while( bptr - b->elems < b->nelems )
        {
                sumb += b->df[ bptr - b->elems ] * b->df[ bptr - b->elems ];
                bptr++;
        }        if ( suma > 0.0 && sumb > 0.0 )
                res = res / sqrt( suma * sumb );
        else
                res = 0.0;        return res;
}
GIN一重过滤算法
                                case ST_TFIDF:
                                                {
                                                        double  weight = 0.0, /* exact weight of union */
                                                                        saSum = 0.0,  /* exact length of query */
                                                                        siSum = 0.0;  /* lower limit of length of indexed value */                                                         if ( getTFMethod() != TF_CONST )
                                                                elog(ERROR,"GIN supports only smlar.tf_method = \"const\"" );                                                        Assert(sa->df);                                                        for(i=0; i<sa->nelems; i++)
                                                        {
                                                                /*
                                                                 * With smlar.tf_method = "const"   sa->df[i] is 
                                                                 * equal to its idf, so lookup of StatElem is not needed
                                                                 */
                                                                if ( check[i] )
                                                                {
                                                                        weight += sa->df[i] * sa->df[i];
                                                                        siSum += sa->df[i] * sa->df[i];
                                                                }
                                                                saSum += sa->df[i] * sa->df[i];
                                                        }                                                        if ( saSum > 0.0 && siSum > 0.0 && weight / sqrt(saSum * siSum ) > GetSmlarLimit() )
                                                                res = true;
                                                }

Query similar documents based on TF-IDF.

4. Technical Points

The smlar plug-in for Alibaba Cloud ApsaraDB RDS for PostgreSQL uses GIN indexes, block-level convergence, and double filtering. It takes less than 0.2 ms to retrieve records with a Hamming distance less than or equal to 2 from 10 million data records.
Alibaba Cloud ApsaraDB RDS for PostgreSQL 10 uses multi-core parallel computing and brute force scanning. It takes less than 450 ms to retrieve records with a Hamming distance less than or equal to N from 10 million data records.
Alibaba Cloud HybridDB for PostgreSQL uses multi-host parallel computing to horizontally expand the computing capability and perform brute force scanning. Query efficiency depends on the number of nodes.

5. Conclusion

You can use the smlar plug-in with Alibaba Cloud ApsaraDB RDS for PostgreSQL to retrieve similar data from tens of millions of SimHash data records. Information about the test for a larger data volume will be provided later, and the response speed should be several milliseconds. Compared to the scenarios that do not use an index, the performance improves by 114,800 times from 23 seconds to 0.2 milliseconds.

Build your own PostgreSQL solution on Alibaba Cloud with ApsaraDB for RDS PostgreSQL.

The smlar Plug-in for Effective Retrieval of Massive Volumes of SimHash Data

Background

Infographic: Journey to the Clouds | Data Driven Investor

Smart business leaders understand the value of utilizing the cloud. As data storage needs grow, as they have been…

1. Requirements

2. Architecture Design

a. Brute Force Computing

b. Indexing

3. Demo and Performance

a. Brute Force Computing

b. Indexing

Performance Verification of the 8-Array Slicing Method

Performance Verification of the Hybrid 6-Array Slicing Method

Performance Verification of the 4-Array Slicing Method

Automatic Slicing

Introduction to GIN-Index-Based Similarity Algorithms

4. Technical Points

5. Conclusion

Original Source:

The smlar Plug-in for Effective Retrieval of Massive Volumes of SimHash Data

This article explains how the PostgreSQL database along with smlar plug-in helps in efficiently retrieving massive…

Written by Alibaba Cloud