https://pinot.apache.org/ logo
#general
Title
# general
t

troywinter

05/11/2021, 3:43 AM
What kind of index should I use if I have datetime string column to enable faster ranged query? Will a ranged index help?
x

Xiang Fu

05/11/2021, 3:49 AM
range index only works for numbers.
for date string, you can try to have a separated column to convert the datestring to seconds or days value
also if you have the data partitioned by date for segments, then range query should mostly fall into to the individual segment
t

troywinter

05/11/2021, 3:54 AM
I see. But the problem for our use case is, superset time column can only be datetime, that’s why we converted epoch_ms to datetime during ingestion.
x

Xiang Fu

05/11/2021, 4:00 AM
oh?
I thought superset can have epoch_ms and epoch_s?
something like:
t

troywinter

05/11/2021, 4:04 AM
For temporal column, it has to datetime or datetime like.
I guess a long type can’t work as temporal?
x

Xiang Fu

05/11/2021, 4:05 AM
you can check mark
is temporal
this is one example
message has been deleted
this is the query superset generated:
t

troywinter

05/11/2021, 4:07 AM
It works, the description on superset is a little misleading, I thought it must be datetime type.
👍 1
Why using epoch_ms time column is much slower than a datetime string column?
The same superset cost 3675 ms using epoch_ms column, but only cost 575 ms using a datetime string column.
x

Xiang Fu

05/11/2021, 4:15 AM
hmm, do you have the query generated for the graph?
it’s possible that the group by transformation make the computation complicated
t

troywinter

05/11/2021, 4:17 AM
The only difference for the query is the time filter.
Copy code
SELECT count(DISTINCT device_sn) AS count_1
FROM pad_interaction
WHERE "time" >= 1620705960000
  AND "time" < 1620706560000
  AND company_name = 'xxx'
  AND action_value = 'mqtt_ping'
  AND action_message = 'success'
  AND ((regexp_like(device_sn, '^WAL.*')))
ORDER BY count(DISTINCT device_sn) DESC;
x

Xiang Fu

05/11/2021, 4:18 AM
oh? can you paste the queries?
t

troywinter

05/11/2021, 4:18 AM
Copy code
SELECT count(DISTINCT device_sn) AS count_1
FROM pad_interaction
WHERE ds >= '2021-05-11 12:03:00'
  AND ds < '2021-05-11 12:13:00'
  AND company_name = 'xxx'
  AND action_value = 'mqtt_ping'
  AND action_message = 'success'
  AND ((regexp_like(device_sn, '^WAL.*')))
ORDER BY count(DISTINCT device_sn) DESC
x

Xiang Fu

05/11/2021, 4:21 AM
hmm
this query seems only hit the most recent data
it’s weird that there are some big difference
t

troywinter

05/11/2021, 4:24 AM
numEntriesScannedInFilter difference is huge, datetime string has only 9370339 entries scanned, epoch_ms has 262577613 entries scanned.
x

Xiang Fu

05/11/2021, 4:38 AM
oh, because it’s ms not seconds
then add a second epoch value column should help
also,
do you have any sort index?
t

troywinter

05/11/2021, 4:43 AM
No, no index is configured.
x

Xiang Fu

05/11/2021, 4:45 AM
ic, then adding inverted index may help as well for columns in filtering
t

troywinter

05/11/2021, 4:56 AM
Thanks, I will add some inverted index.
k

Kishore G

05/11/2021, 5:29 AM
Why not range?
t

troywinter

05/11/2021, 5:59 AM
Can range index used for string columns? I didn’t see any description in the documentation.
x

Xiang Fu

05/11/2021, 6:04 AM
I feel sortIndex on time and inverted index on company_name, action_value, action_message may help
t

troywinter

05/11/2021, 6:07 AM
inverted index can’t apply on multivalue columns?
x

Xiang Fu

05/11/2021, 5:30 PM
inv index works for multi-value column