# troubleshooting
y
hey, any good way to optimize such query
```sql
SELECT hour_start_timestamp_utc
FROM downtime
WHERE secondsSinceEpoch > 1606247126
ORDER BY secondsSinceEpoch DESC, hour_start_timestamp_utc DESC
LIMIT 1
```
it scans the past 1 week of data but returns only 1 record. since the table is large, it ends up scanning about 100 million records per query and takes seconds. the query output is like
```json
{
  "selectionResults": {
    "columns": [
      "hour_start_timestamp_utc"
    ],
    "results": [
      [
        "2020-12-01 09:00:00"
      ]
    ]
  },
  "exceptions": [],
  "numServersQueried": 9,
  "numServersResponded": 9,
  "numSegmentsQueried": 1059,
  "numSegmentsProcessed": 1059,
  "numSegmentsMatched": 18,
  "numConsumingSegmentsQueried": 0,
  "numDocsScanned": 142101504,
  "numEntriesScannedInFilter": 0,
  "numEntriesScannedPostFilter": 284203008,
  "numGroupsLimitReached": false,
  "totalDocs": 7374174837,
  "timeUsedMs": 3522,
  "segmentStatistics": [],
  "traceInfo": {},
  "minConsumingFreshnessTimeMs": 0
}
```
m
Perhaps add another predicate on secondsSinceEpoch < xxx
That should allow for pruning more segments. If the code doesn't do that today, it would be a good feature to add.
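A sketch of the bounded query being suggested (the upper bound value here is illustrative; pick it from whatever window you actually care about):

```sql
-- With both bounds present, segment metadata (min/max of secondsSinceEpoch)
-- can prune segments whose range falls entirely outside the window.
SELECT hour_start_timestamp_utc
FROM downtime
WHERE secondsSinceEpoch > 1606247126
  AND secondsSinceEpoch < 1606333526   -- illustrative upper bound
ORDER BY secondsSinceEpoch DESC, hour_start_timestamp_utc DESC
LIMIT 1
```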
y
you mean the range indexing?
m
Well, range-based pruning (based on metadata)
so if minTime = a and maxTime = b (in metadata), then segments where the query range does not overlap [a, b] can be pruned out
We do have a time segment pruner; I don't recall if it can handle ranges
y
got it
does it require to have both lower and upper bounds?
x
use max(hour_start_timestamp_utc)?
k
If the segments are time-based, and you set the table config’s columnMinMaxValueGeneratorMode to something that gave you min/max for the secondsSinceEpoch column, then would Pinot optimize out the segments that contain no possible rows?
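For reference, a minimal sketch of the table config fragment being described (field placement assumed from the Pinot docs; verify the supported mode values against your version):

```json
{
  "tableIndexConfig": {
    "columnMinMaxValueGeneratorMode": "ALL"
  }
}
```

With min/max values recorded in segment metadata for secondsSinceEpoch, a bounded range predicate would let segments whose [min, max] range doesn't intersect the query range be skipped without scanning.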
m
@Yupeng Fu I assumed your predicate timestamp was for 1 week ago. If so, that would prune out data older than 1 week, but would still scan the last week of data
@Xiang Fu using `max` would still scan, unless we can somehow prune out segments
y
@Xiang Fu tried max, but got error
```
"message": "QueryExecutionError:
java.lang.NumberFormatException: For input string: \"2020-10-15 09:00:00\"
	at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
	at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
	at java.lang.Double.parseDouble(Double.java:538)
	at org.apache.pinot.core.segment.index.readers.StringDictionary.getDoubleValue(StringDictionary.java:58)
	at org.apache.pinot.core.operator.query.DictionaryBasedAggregationOperator.getNextBlock(DictionaryBasedAggregationOperator.java:66)
	at org.apache.pinot.core.operator.query.DictionaryBasedAggregationOperator.getNextBlock(DictionaryBasedAggregationOperator.java:43)
	at org.apache.pinot.core.operator.BaseOperator.nextBlock(BaseOperator.java:49)
	at org.apache.pinot.core.operator.CombineOperator$1.runJob(CombineOperator.java:105)
	at org.apache.pinot.core.util.trace.TraceRunnable.run(TraceRunnable.java:40)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)"
```
m
Yeah, seems that only works on numeric data
x
it’s not for strings; if that column is a number then it should be ok
then you can use the time converter function to convert it to a human-readable format
y
that would lead to scan as well?
x
if it’s a full segment match, then we should just read it from dictionary
y
@Mayank that’s right, we wanted to get latest record, but not sure when it was last generated
m
I don't think max helps here, unless we find a way to avoid segment scans (pruning)
y
so use a wider range to be safe
m
Why do you go back 7 days for the latest record?
y
to be safe
it’s used to track the most recent down time
m
Your data is time ordered?
y
need to check with the user, but i hope we can assume so
m
If you just say max without any predicates then it will work (I think that is what Xiang was suggesting)?
without any predicates, we don't scan, we just look at the metadata
y
right
we tried that query, and got the error above
m
Try time-convert as Xiang suggested