Scott P
10/29/2021, 10:19 PM
You can use the `similarity` function (from the `pg_trgm` extension) like this:
```sql
SELECT * FROM wines WHERE
  extensions.similarity(wines.description::TEXT, search_value) > 0.25
```
You can also use something like the distance operator (from the pg_trgm extension, I believe):
```sql
SELECT * FROM wines WHERE
  1 - (wines.description::TEXT <-> search_value) > 0.25
```
With enough fine-tuning of the values (e.g. `0.25`), you can have something that is akin to fuzzy searching, but the specific value you should use will depend on your dataset and your own testing.
If you need something more advanced to search across multiple columns in a single function, `to_tsvector` and `to_tsquery` can be combined like this:
```sql
SELECT * FROM wines WHERE
  to_tsvector(wines.name || ' ' || wines.description || ' ' || wines.manufacturer || ' ' || wines.country) -- concat columns, but be sure to include a space to separate them!
  @@ to_tsquery(search_value)
```
In all 3 cases, `search_value` is a `TEXT` parameter passed into the function.
You can of course combine any or all of the above options if you really need to, using `AND` / `OR`.
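For context, a minimal sketch of wrapping the first query in such a function (the function name here is made up, and the `wines` table is assumed from the examples above):
```sql
-- Minimal sketch: expose the similarity search via a function so that
-- search_value arrives as a TEXT parameter. "search_wines" is a made-up name;
-- assumes the pg_trgm extension lives in the "extensions" schema as above.
CREATE OR REPLACE FUNCTION search_wines(search_value TEXT)
RETURNS SETOF wines
LANGUAGE sql
STABLE
AS $$
  SELECT * FROM wines
  WHERE extensions.similarity(wines.description::TEXT, search_value) > 0.25;
$$;

-- Usage:
SELECT * FROM search_wines('fruity red');
```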
chipilov
10/30/2021, 8:48 AM

chipilov
10/30/2021, 8:49 AM

garyaustin
10/30/2021, 4:14 PM

Scott P
10/30/2021, 6:36 PM
`similarity` and distance (`<->`) in combination are good enough for many use cases, but their performance becomes a big problem as you add more of them (with `AND` / `OR`).
For example, adding 3 of each might result in a query that takes 500ms to execute, which is generally acceptable for most users. If you increase that to 6 of each, the execution time can often be closer to 4000ms.
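A sketch of the kind of combined filter being described, to make the cost concrete (the extra column names are assumptions based on the earlier `wines` example; `search_value` is the function's `TEXT` parameter as above):
```sql
-- Sketch of a combined filter; each extra similarity()/<-> predicate adds cost.
-- Column names beyond "description" are assumptions, not from the thread.
SELECT * FROM wines
WHERE extensions.similarity(wines.name::TEXT, search_value) > 0.25
   OR extensions.similarity(wines.description::TEXT, search_value) > 0.25
   OR 1 - (wines.manufacturer::TEXT <-> search_value) > 0.25;
```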
Another option is using `%`, which is a wildcard for multiple characters, used in conjunction with `LIKE` and `ILIKE`. If I wanted to find `London` but only typed `Lond`, it should find it. This can also be rather slow, but it's handy if you have a column where all data has a prefix and/or suffix and you want to fine-tune.
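A minimal illustration of that prefix pattern (the table and column names here are hypothetical):
```sql
-- Prefix match: finds 'London' when the user only typed 'Lond'.
-- "stations" and "location_name" are hypothetical names.
SELECT * FROM stations
WHERE location_name ILIKE 'Lond' || '%';  -- or: ILIKE search_value || '%'
```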
In some cases, I do use `levenshtein` (in combination with `similarity`) from fuzzystrmatch, but only on very small datasets (~4000 rows) where I need to search across multiple columns and fine-tune the match ratio for each column. For example, in a dataset of train station locations, a user is more likely to search by station name than by station code, so I can weight the results towards a `location_name` column more than towards a `station_code` column. Using it on large datasets will lead to queries that can take upwards of 8000ms, and in some cases, several minutes.
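For illustration, a rough sketch of that kind of per-column weighting (the `stations` table, its columns, and the weight values are assumptions, not taken from the thread):
```sql
-- Rough sketch of weighting name matches above code matches.
-- Requires pg_trgm (similarity) and fuzzystrmatch (levenshtein); adjust schema
-- qualification (e.g. extensions.similarity) to your setup. Table, columns and
-- weights are made up; search_value is the function's TEXT parameter.
SELECT stations.*,
       2.0 * similarity(location_name, search_value)              -- favour the name column
     + 1.0 * similarity(station_code, search_value)
     - 0.1 * levenshtein(lower(location_name), lower(search_value)) AS score
FROM stations
ORDER BY score DESC
LIMIT 20;
```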
Scott P
10/30/2021, 6:36 PM
Then there's `to_tsvector` and `to_tsquery`. `tsvector` is an index over words, while functions in `pg_trgm` are indexes over characters.
`tsvector` is much quicker for searching across columns of word-based data; good use cases include place names, product names, etc. A specific use case I had was finding airport data based upon things like the town, country, airport name, etc.
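A sketch of how that kind of word-based, multi-column search can be indexed (the `airports` table and columns are assumptions; `to_tsvector` needs an explicit config to be usable in an index, and `coalesce` keeps a NULL column from blanking the whole concatenation):
```sql
-- Hypothetical expression index for word-based search across several columns.
-- to_tsvector() needs an explicit config ('simple' here) to be immutable enough
-- for an index; coalesce() stops a NULL column turning the whole string NULL.
CREATE INDEX airports_fts_idx ON airports USING GIN (
  to_tsvector('simple',
    coalesce(town, '') || ' ' || coalesce(country, '') || ' ' || coalesce(airport_name, ''))
);

-- Queries must repeat the same expression and config to use the index:
SELECT * FROM airports
WHERE to_tsvector('simple',
        coalesce(town, '') || ' ' || coalesce(country, '') || ' ' || coalesce(airport_name, ''))
      @@ to_tsquery('simple', 'heathrow');
```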
`similarity` and `<->` are more in-depth, but more suitable for searching for irregular 'non-human' data, such as ID codes in single columns. The downside is the performance hit if you add too many of these operators/functions.

Scott P
10/30/2021, 6:40 PM

chipilov
10/30/2021, 8:11 PM