TencentDB for PostgreSQL Searching in High Numbers of Tags Based on pg

Many business use cases have the tag-based query feature. If the data volume and tag value quantity are high, a large amount of storage capacity will be used, and the performance will be poor. Therefore, how to filter target resources efficiently and quickly without taking up too much storage space has become a challenge for business management optimization.
This document describes how to easily search in an ultra high number of tags based on the pg_roaringbitmap extension.
pg_roaringbitmap Overview
pg_roaringbitmap is a compressed bitmap storage extension based on roaring bitmap. It supports roaring bitmap access as well as set, aggregation, and other operations.
Roaring Bitmap Usage
Roaring bitmap is commonly used in businesses to store user attribute tags. It allows adding, deleting, modifying, and querying these attribute tags. By using union, intersection, and other methods based on the stored user tags, you can precisely and quickly search for specific users in ultra-large-scale attribute data. This not only improves performance but also reduces storage space, making it an excellent application practice in big data analysis scenarios.
For example, in a traditional model, there is a user tag table for a music application, as shown below:
Audio playback + text display in user ID
Username
Interest Tags
1
John Smith
{Classical, Jazz, R&B, Country}
2
Jane Smith
{Folk Songs, Chinese Style, Instrumental Music}
3
Chris
{HipHop, Jazz, R&B, Rap, Reggae}
...
...
...
1000000000
Text 2
{Rock,}
To find all users who like pure music, you need to search based on the interest tag column, find rows containing the pure music tag, and return this data to the application. The simplest approach is to create a user interest table in the database according to the table structure and then execute an array query statement to find the interest tags for inclusion search. However, this method has a problem: in scenarios with large data volumes and numerous tag values, not only will more storage capacity be used, but the performance will also be extremely poor. Therefore, we need to adopt a different implementation plan by splitting this table into three tables, with interest tags as the primary key and users containing these interest tags stored as bitmaps, as shown in the table below:
User Table:
Audio playback + text display in user ID
Username
1
John Smith
2
Jane Smith
N
...
Tag Table:
Tag ID
Username
1
Classic
2
Folk Songs
N
...
User Tag Table:
Tag ID
Username
1
[ 1,3,7,123,423 ]
2
[ 5,31]
N
...
When you need to find users who like listening to classical and folk music at the same time, you can directly perform a bitmap query by user ID in the user tag table. This can greatly improve the performance and reduce the capacity usage.
Query Performance Comparison Between Traditional Method and Roaring Bitmap Method
Preparing test scenario
1. Create a function for random character generation.
create or replace function random_string(length integer) returns text as
$$
declare
chars text[] := '{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}';
result text := '';
i integer := 0;
length2 integer := (select trunc(random() * length + 1));
begin
if length2 < 0 then
raise exception 'Given length cannot be less than 0';
end if;
for i in 1..length2 loop
result := result || chars[1+random()*(array_length(chars, 1)-1)];
end loop;
return result;
end;
$$ language plpgsql;
2. Create a function that generates an array of random integers.
create or replace function random_int_array(int, int)
returns int[] language sql as
$$
select array_agg(round(random()* $1)::int)
from generate_series(1, $2)
$$;
3. Create a function that generates an array of random characters.
create or replace function random_string_array(int, int)
returns TEXT[] language sql as
$$
select array_agg(random_string($1)) from generate_series(1, $2);
$$;
Scheme 1: Traditional method
One table does it all.
1. Create a table containing all the data.
create table account(
uin bigint primary KEY,
name varchar,
tag TEXT []
);
2. Insert the data of ten million simulated accounts with the function as described in the preparations, and then create a GIN index.
insert into account select generate_series(1,10000000), random_string(20),random_string_array(5,10);
create index tag_inx on account USING GIN(tag);
3. Run a query to list users with tags GN and o.
explain analyze select uin,name from account where tag @>ARRAY['GN','o'];
﻿
QUERY PLAN
﻿
----------------------------------------------------------------------------------------------------
-----------------
Bitmap Heap Scan on account (cost=52.81..466.86 rows=105 width=19) (actual time=4.263..4.502 rows=
184 loops=1)
Recheck Cond: (tag @> '{GN,o}'::text[])
Heap Blocks: exact=184
-> Bitmap Index Scan on tag_inx (cost=0.00..52.78 rows=105 width=0) (actual time=4.240..4.240 r
ows=184 loops=1)
Index Cond: (tag @> '{GN,o}'::text[])
Planning Time: 0.108 ms
Execution Time: 4.528 ms
4. Run a query to list xx users with tags lvXe and Zt (the query will be slow when executed for the first time).
explain analyze select count(uin) from account where tag && ARRAY['lvXe','Zt'];
﻿
----------------------------------------------------------------------------------------------------
--------------------------
Aggregate (cost=21816.39..21816.40 rows=1 width=8) (actual time=8.236..8.238 rows=1 loops=1)
-> Bitmap Heap Scan on account (cost=109.08..21800.56 rows=6332 width=8) (actual time=1.655..7.
901 rows=5390 loops=1)
Recheck Cond: (tag && '{lvXe,Zt}'::text[])
Heap Blocks: exact=5327
-> Bitmap Index Scan on tag_inx (cost=0.00..107.49 rows=6332 width=0) (actual time=0.962.
.0.962 rows=5390 loops=1)
Index Cond: (tag && '{lvXe,Zt}'::text[])
Planning Time: 0.110 ms
Execution Time: 8.270 ms
Scheme 2: Optimized scheme
In order to reduce the performance loss caused by the types of tag fields in the query, change the actual tag in the above table to tagid.
1. Introduce a new tag dictionary table.
create table tag_dict (  
tagid int primary key,
taginfo text
);
2. Suppose there are 100,000 dictionary types in total.
insert into tag_dict select generate_series(1,100000), md5(random()::text);
3. Create another table to store user and tag information.
create table account1(
uin bigint primary KEY,
name varchar,
tag INT []
);
4. Insert the data of ten million accounts.
insert into account1 select generate_series(1,10000000), random_string(20),random_int_array(100000,10);
5. List users with both tag IDs 100 and 5711.
Before Indexing:
test=> explain analyze select uin,name from account1 where tag @> ARRAY[100,5711];
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..191007.68 rows=250 width=19) (actual time=982.585..1000.806 rows=0 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on account1 (cost=0.00..189982.68 rows=104 width=19) (actual time=962.640..962.640 rows=0 loops=3)
Filter: (tag @> '{100,5711}'::integer[])
Rows Removed by Filter: 3333333
Planning Time: 0.205 ms
JIT:
Functions: 12
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 2.280 ms, Inlining 0.000 ms, Optimization 1.176 ms, Emission 14.189 ms, Total 17.645 ms
Execution Time: 1001.574 ms
(12 rows)
Add Index:
create index tag_inx_2 on account1 USING GIN(tag);
After Indexing:
test=> explain analyze select uin,name from account1 where tag @> ARRAY[100,5711];
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on account1 (cost=49.94..1021.13 rows=250 width=19) (actual time=0.126..0.127 rows=0 loops=1)
Recheck Cond: (tag @> '{100,5711}'::integer[])
-> Bitmap Index Scan on tag_inx_2 (cost=0.00..49.87 rows=250 width=0) (actual time=0.124..0.124 rows=0 loops=1)
Index Cond: (tag @> '{100,5711}'::integer[])
Planning Time: 0.410 ms
Execution Time: 0.171 ms
(6 rows)
6. List users with both tag IDs 61568 and 97350.
test=> explain analyze select uin,name from account1 where tag @> ARRAY[61568,97350];
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on account1 (cost=49.94..1021.13 rows=250 width=19) (actual time=0.130..0.131 rows=1 loops=1)
Recheck Cond: (tag @> '{61568,97350}'::integer[])
Heap Blocks: exact=1
-> Bitmap Index Scan on tag_inx_2 (cost=0.00..49.87 rows=250 width=0) (actual time=0.125..0.125 rows=1 loops=1)
Index Cond: (tag @> '{61568,97350}'::integer[])
Planning Time: 0.071 ms
Execution Time: 0.151 ms
(7 rows)
7. List xx users who share interests with xx users (tag IDs 100 and 5711).
test=> explain analyze select count(uin) from account1 where tag && ARRAY[61568,97350];
QUERY PLAN
﻿
---------------------------------------------------------------------------------------------------------------------------------
------
Gather (cost=1961.06..173801.15 rows=99750 width=19) (actual time=5.020..28.885 rows=2066 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Bitmap Heap Scan on account1 (cost=961.06..162826.15 rows=41562 width=19) (actual time=1.623..3.305 rows=689 loo
ps=3)
Recheck Cond: (tag && '{61568,97350}'::integer[])
Heap Blocks: exact=2053
-> Bitmap Index Scan on tag_inx_2 (cost=0.00..936.12 rows=99750 width=0) (actual time=0.685..0.685 rows=2066 loops=1)
Index Cond: (tag && '{61568,97350}'::integer[])
Planning Time: 0.082 ms
JIT:
Functions: 12
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 2.078 ms, Inlining 0.000 ms, Optimization 0.270 ms, Emission 3.489 ms, Total 5.836 ms
Execution Time: 29.725 ms
(14 rows)
Scheme 3: Roaring bitmap
1. Create the extension first. It is integrated in TencentDB for PostgreSQL natively, so you don't need to care about compilation and other operations. You can directly create it in the database.
create extension roaringbitmap;
2. Create a tag-user mapping table.
create table tag_uin_list(
tagid int primary key,
uin_offset int,
uinbits roaringbitmap
);
3. Insert 100,000 tags and corresponding user data according to the previously created tag table.
insert into tag_uin_list
select tagid, uin_offset, rb_build_agg(uin::int) as uinbits from
(
select
unnest(tag) as tagid,
(uin / (2^31)::int8) as uin_offset,
mod(uin, (2^31)::int8) as uin
from account1
) t
group by tagid, uin_offset;
4. Query the number of users with tags 1, 3, 10, and 200.
explain analyze select sum(ub) from
(
select uin_offset,rb_or_cardinality_agg(uinbits) as ub
from tag_uin_list
where tagid in (1,3,10,200)
group by uin_offset
) t;
﻿
﻿
QUERY PLAN
﻿
---------------------------------------------------------------------------------------------------------------------------------
------------
Aggregate (cost=32.47..32.48 rows=1 width=32) (actual time=0.964..0.966 rows=1 loops=1)
-> GroupAggregate (cost=32.42..32.46 rows=1 width=12) (actual time=0.955..0.956 rows=1 loops=1)
Group Key: tag_uin_list.uin_offset
-> Sort (cost=32.42..32.43 rows=4 width=22) (actual time=0.107..0.109 rows=4 loops=1)
Sort Key: tag_uin_list.uin_offset
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on tag_uin_list (cost=17.20..32.38 rows=4 width=22) (actual time=0.044..0.067 rows=4 loops=1
)
Recheck Cond: (tagid = ANY ('{1,3,10,200}'::integer[]))
Heap Blocks: exact=4
-> Bitmap Index Scan on tag_uin_list_pkey (cost=0.00..17.20 rows=4 width=0) (actual time=0.031..0.031 rows
=4 loops=1)
Index Cond: (tagid = ANY ('{1,3,10,200}'::integer[]))
Planning Time: 0.289 ms
Execution Time: 1.083 ms
(13 rows)
5. View the list of users with tags 1, 3, 10, and 200.
explain analyze select uin_offset,rb_or_agg(uinbits) as ub
from tag_uin_list
where tagid in (1,3,10,200)
group by uin_offset;
QUERY PLAN
﻿
---------------------------------------------------------------------------------------------------------------------------------
------
GroupAggregate (cost=32.42..32.46 rows=1 width=36) (actual time=0.246..0.246 rows=1 loops=1)
Group Key: uin_offset
-> Sort (cost=32.42..32.43 rows=4 width=22) (actual time=0.043..0.045 rows=4 loops=1)
Sort Key: uin_offset
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on tag_uin_list (cost=17.20..32.38 rows=4 width=22) (actual time=0.029..0.036 rows=4 loops=1)
Recheck Cond: (tagid = ANY ('{1,3,10,200}'::integer[]))
Heap Blocks: exact=4
-> Bitmap Index Scan on tag_uin_list_pkey (cost=0.00..17.20 rows=4 width=0) (actual time=0.021..0.021 rows=4 loops=1)
Index Cond: (tagid = ANY ('{1,3,10,200}'::integer[]))
Planning Time: 0.119 ms
Execution Time: 0.310 ms
(12 rows)
Viewing index size and table size
test=> select relname, pg_size_pretty(pg_relation_size(relid)) from pg_stat_user_tables where schemaname='public' order by pg_relation_size(relid) desc;
   relname    | pg_size_pretty 
--------------+----------------
 account      | 1545 MB
 account1     | 1077 MB
 t_user       | 651 MB
 tag_dict     | 6672 kB
 tag_uin_list | 5888 kB
(5 rows)
﻿
test=> select indexrelname, pg_size_pretty(pg_relation_size(relid)) from pg_stat_user_indexes where schemaname='public' order by pg_relation_size(relid) desc;
   indexrelname    | pg_size_pretty 
-------------------+----------------
 tag_inx           | 1545 MB
 account_pkey      | 1545 MB
 tag_inx_2         | 1077 MB
 account1_pkey     | 1077 MB
 t_user_pkey       | 651 MB
 tag_dict_pkey     | 6672 kB
 tag_uin_list_pkey | 5888 kB
(7 rows)
Conclusion
The query performance comparison of different schemes is as follows:
Query Item
Option 1
Method 2
Roaring Bitmap Solution
Querying the user list containing specified tags
4.528ms
0.151ms
0.310ms
Determine the number of users with common tags
8.27ms
29.725ms
1.083ms
Storage Capacity Statistics
4635MB
3244.344MB
1237.12MB
As can be clearly seen from the above three schemes, the optimization effect is very obvious. The roaring bitmap scheme works very well in terms of query speed and storage capacity usage.
﻿
Searching in High Numbers of Tags Based on pg_roaringbitmap

On this page:

pg_roaringbitmap Overview

Roaring Bitmap Usage

Query Performance Comparison Between Traditional Method and Roaring Bitmap Method

Preparing test scenario

Scheme 1: Traditional method

Scheme 2: Optimized scheme

Scheme 3: Roaring bitmap

Viewing index size and table size

Conclusion

Audio playback + text display in user ID	Username	Interest Tags
1	John Smith	{Classical, Jazz, R&B, Country}
2	Jane Smith	{Folk Songs, Chinese Style, Instrumental Music}
3	Chris	{HipHop, Jazz, R&B, Rap, Reggae}
...	...	...
1000000000	Text 2	{Rock,}
Query Item	Option 1	Method 2	Roaring Bitmap Solution
Querying the user list containing specified tags	4.528ms	0.151ms	0.310ms
Determine the number of users with common tags	8.27ms	29.725ms	1.083ms
Storage Capacity Statistics	4635MB	3244.344MB	1237.12MB