s
It's possible to use the `similarity` extension like this:

```sql
SELECT * FROM wines WHERE
  extensions.similarity(wines.description::TEXT, search_value) > 0.25
```
You can also use something like the distance operator `<->` (from the `pg_trgm` extension, I believe):

```sql
SELECT * FROM wines WHERE
  1 - (wines.description::TEXT <-> search_value) > 0.25
```
With enough fine-tuning of the values (e.g. `0.25`), you can get something akin to fuzzy searching, but the specific value you should use will depend on your dataset and your own testing. If you need something more advanced to search across multiple columns in a single function, `to_tsvector` and `to_tsquery` can be combined like this:

```sql
SELECT * FROM wines WHERE
  to_tsvector(wines.name || ' ' || wines.description || ' ' || wines.manufacturer || ' ' || wines.country) -- concat columns, but be sure to include a space to separate them!
  @@ to_tsquery(search_value)
```
In all 3 cases, `search_value` is a `TEXT` parameter passed into the function. You can of course combine any or all of the above options if you really need to, using `AND`/`OR`.
c
Indeed, great tips @User, thank you! Just want to mention that Supabase also supports the https://www.postgresql.org/docs/8.3/fuzzystrmatch.html module, which has similar functionality (I'm not an expert in either, but a quick Google search shows a few results about the tradeoffs between fuzzystrmatch and pg_trgm).
@User do you have experience with both? Do you think you can give us your take on their tradeoffs?
g
Just a note from the fuzzystrmatch docs:
> Caution: At present, the soundex, metaphone, dmetaphone, and dmetaphone_alt functions do not work well with multibyte encodings (such as UTF-8).
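For reference, the phonetic functions that caution refers to look like this (a sketch; it assumes the extension is enabled with `CREATE EXTENSION fuzzystrmatch;`, and reuses the `wines`/`search_value` names from the earlier examples):

```sql
SELECT soundex('Anne');               -- 'A500'
SELECT difference('Anne', 'Andrew');  -- 2: matching Soundex code positions, on a 0-4 scale

-- A "sounds alike" filter on a name column:
SELECT * FROM wines
WHERE difference(wines.name, search_value) >= 3;
```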
s
From my experience, using `similarity` and the distance operator (`<->`) in combination is good enough for many use cases, but their performance becomes a big problem as you add more of them (with `AND`/`OR`). For example, adding 3 of each might result in a query that takes 500ms to execute, which is generally acceptable for most users. If you increase that to 6 of each, the execution time can often be closer to 4000ms. Another option is `%`, a wildcard for multiple characters used in conjunction with `LIKE` and `ILIKE`. If I wanted to find `London` but only typed `Lond`, it should still find it. This can also be rather slow, but it's handy if you have a column where all the data has a prefix and/or suffix and you want to fine-tune. In some cases I do use `levenshtein` (in combination with `similarity`) from fuzzystrmatch, but only on very small datasets (~4000 rows) where I need to search across multiple columns and fine-tune the match ratio for each column. For example, in a dataset of train station locations, a user is more likely to search by station name than by station code, so I can weight the results towards a `location_name` column more than towards a `station_code` column. Using it on large datasets will lead to queries that can take upwards of 8000ms, and in some cases, several minutes.
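That per-column weighting can be sketched roughly like this (a hypothetical `stations` table; the weights are arbitrary and would need tuning against your own data):

```sql
-- Lower levenshtein distance = closer match. The larger multiplier on
-- location_name penalizes name mismatches more heavily, so rows whose
-- name matches the query rank ahead of rows whose code matches.
SELECT *,
       levenshtein(lower(location_name), lower(search_value)) * 2.0
     + levenshtein(lower(station_code),  lower(search_value)) * 1.0 AS score
FROM stations
ORDER BY score
LIMIT 10;
```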
For some datasets I use `similarity` and `<->`, or `%`, but for others I use `to_tsvector` and `to_tsquery`. `tsvector` is an index over words, while the functions in `pg_trgm` are indexes over characters. `tsvector` is much quicker for searching across columns of word-based data; good use cases include place names, product names, etc. A specific use case I had was finding airport data based upon things like the town, country, airport name, etc. `similarity` and `<->` are more in-depth, but better suited to searching irregular 'non-human' data, such as ID codes in single columns. The downside is the performance hit if you add too many of these operators/functions.
TLDR: The most suitable option depends on the number of columns you want to search, the number of rows in the dataset, and the complexity of the data in the set.
c
Great summary, thank you @User!