s
It's possible to use the `similarity` extension like this:

```sql
SELECT * FROM wines WHERE
  extensions.similarity(wines.description::TEXT, search_value) > 0.25
```
You can also use something like the distance operator `<->` (from the `pg_trgm` extension, I believe):

```sql
SELECT * FROM wines WHERE
  1 - (wines.description::TEXT <-> search_value) > 0.25
```
With enough fine-tuning of the values (e.g. `0.25`), you can get something akin to fuzzy searching, but the specific value you should use will depend on your dataset and your own testing. If you need something more advanced to search across multiple columns in a single function, `to_tsvector` and `to_tsquery` can be combined like this:

```sql
SELECT * FROM wines WHERE
  to_tsvector(wines.name || ' ' || wines.description || ' ' || wines.manufacturer || ' ' || wines.country) -- concat columns, but be sure to include a space to separate them!
  @@ to_tsquery(search_value)
```
In all 3 cases, `search_value` is a `TEXT` parameter passed into the function. You can of course combine any or all of the above options if you really need to, using `AND`/`OR`.
c
Indeed, great tips @User, thank you! Just want to mention that Supabase also supports the https://www.postgresql.org/docs/8.3/fuzzystrmatch.html module, which has similar functionality (I'm not an expert in either, but a quick Google search shows a few results about the tradeoffs between fuzzystrmatch and pg_trgm).
@User do you have experience with both? Do you think you can give us your take on their tradeoffs?
g
Just a note from the fuzzystrmatch docs:
> Caution: At present, the soundex, metaphone, dmetaphone, and dmetaphone_alt functions do not work well with multibyte encodings (such as UTF-8).
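For reference, the phonetic functions that caution refers to look like this (a sketch; it assumes the extension is enabled with `CREATE EXTENSION fuzzystrmatch;`, and reuses the `wines`/`search_value` names from the earlier examples):

```sql
SELECT soundex('Anne');               -- 'A500'
SELECT difference('Anne', 'Andrew');  -- 2: matching Soundex code positions, on a 0-4 scale

-- A "sounds alike" filter on a name column:
SELECT * FROM wines
WHERE difference(wines.name, search_value) >= 3;
```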
s
From my experience, using `similarity` and the distance operator (`<->`) in combination is good enough for many use cases, but their performance becomes a big problem as you add more of them (with `AND`/`OR`). For example, adding 3 of each might result in a query that takes 500ms to execute, which is generally acceptable for most users. If you increase that to 6 of each, the execution time can often be closer to 4000ms. Another option is `%`, a wildcard for multiple characters used in conjunction with `LIKE` and `ILIKE`. If I wanted to find `London` but only typed `Lond`, it should still find it. This can also be rather slow, but it's handy if you have a column where all the data has a prefix and/or suffix and you want to fine-tune. In some cases I do use `levenshtein` (in combination with `similarity`) from fuzzystrmatch, but only on very small datasets (~4000 rows) where I need to search across multiple columns and fine-tune the match ratio for each column. For example, in a dataset of train station locations, a user is more likely to search by station name than by station code, so I can weight the results towards a `location_name` column more than towards a `station_code` column. Using it on large datasets will lead to queries that can take upwards of 8000ms, and in some cases, several minutes.
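That per-column weighting can be sketched roughly like this (a hypothetical `stations` table; the weights are arbitrary and would need tuning against your own data):

```sql
-- Lower levenshtein distance = closer match. The larger multiplier on
-- location_name penalizes name mismatches more heavily, so rows whose
-- name matches the query rank ahead of rows whose code matches.
SELECT *,
       levenshtein(lower(location_name), lower(search_value)) * 2.0
     + levenshtein(lower(station_code),  lower(search_value)) * 1.0 AS score
FROM stations
ORDER BY score
LIMIT 10;
```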
For some datasets I use `similarity` and `<->`, or `%`, but for others I use `to_tsvector` and `to_tsquery`. `tsvector` is an index over words, while the functions in `pg_trgm` are indexes over characters. `tsvector` is much quicker for searching across columns of word-based data; good use cases include place names, product names, etc. A specific use case I had was finding airport data based upon things like the town, country, airport name, etc. `similarity` and `<->` are more in-depth, but better suited to searching irregular 'non-human' data, such as ID codes in single columns. The downside is the performance hit if you add too many of these operators/functions.
TLDR: The most suitable option depends on the number of columns you want to search, the number of rows in the dataset, and the complexity of the data in the set.
c
Great summary, thank you @User!