# general
p
Hi, I have time series data flowing in a Kafka stream that I am ingesting into Pinot using real-time ingestion. I have a 48-hour data retention, and data volume is around 6 TB for those 48 hours. We have created an inverted index for one of the filter attributes; query performance is about 18 seconds when we query the data with that filter. However, there is another query parameter on timestamps that uses range filters and has no index currently, as we thought the segments would be created based on the time attribute defined in the table config. Is that a correct assumption, or do you suggest creating a sorted index for the timestamp attribute? Also, what kind of hardware would be ideal for getting <100 ms query times? Query performance runs into seconds initially and then drops significantly to around 200 ms, so holding the data in memory is key, but we can't have 6 TB of memory across the 43 pods we have allocated to Pinot servers. Each Pinot server is configured with about 28-32 GB, of which 50% is allocated to the JVM heap and the rest to memory mapping.
r
Hi @User, you don't need to sort on time. Try to reserve sorting for your most-filtered column, since it's generally the best option in terms of space/time.
for time you can add a range index
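For example, in the tableIndexConfig section, something like this (a sketch assuming your time column is timestamp_ms and your most-filtered column is a hypothetical memberId; swap in your real column names):

"tableIndexConfig": {
  "sortedColumn": ["memberId"],
  "rangeIndexColumns": ["timestamp_ms"]
}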
also how is your kafka topic partitioned? Have you configured partitioning so queries can make use of it?
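If the topic is partitioned on a column you filter by, you can declare it so the broker can prune by partition as well. A rough sketch, assuming a hypothetical partition column memberId, Murmur partitioning on the producer side, and 32 Kafka partitions:

"tableIndexConfig": {
  "segmentPartitionConfig": {
    "columnPartitionMap": {
      "memberId": {
        "functionName": "Murmur",
        "numPartitions": 32
      }
    }
  }
},
"routing": {
  "segmentPrunerTypes": ["partition"]
}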
p
Currently, no, but that's in the works.
So, do you think we need to add a range index for the timestamp even though we configured the timestamp field as the time attribute in the table config?
r
Yes, I believe you need to add it to the range index columns.
However, there's a new feature available on the latest builds which might be interesting to you: the timestamp index, which @User wrote recently.
Maybe you could experiment with it; otherwise the range index should be good.
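On those builds it's configured per column via fieldConfigList, roughly like this (the granularities here are just an example):

"fieldConfigList": [
  {
    "name": "timestamp_ms",
    "indexTypes": ["TIMESTAMP"],
    "timestampConfig": {
      "granularities": ["HOUR", "DAY"]
    }
  }
]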
p
We cannot upgrade Pinot atm
@User what is the significance of this configuration in the schema?
"dateTimeFieldSpecs": [
    {
      "name": "timestamp_ms",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:HOURS"
    }
]
I thought this was used for dividing the data into different segments, and hence there wouldn't be a need to create a range index for the datetime field, which in our case is the timestamp field.
m
That config is specifying the time column(s). It is not used to time partition the segments.
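For reference, the column the segment metadata is based on is declared under segmentsConfig in the table config; with your 48-hour retention it would look roughly like this (column name taken from your schema):

"segmentsConfig": {
  "timeColumnName": "timestamp_ms",
  "retentionTimeUnit": "HOURS",
  "retentionTimeValue": "48"
}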
m
Do you use the timeColumn in your queries' WHERE clause, like
time > 'value' and time < 'value'
If so, you can configure your table to have:
"routing": {
    "segmentPrunerTypes": ["time"]
}
The broker will then only query the relevant segments.
p
Thank you
is this part of the table config?
should we add the name of the field that is used as the time field, or the keyword "time"?
m
The literal "time" keyword. Yes, this is part of the table config.
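For completeness, it goes at the top level of the table config, something like this (table name and type assumed for illustration):

{
  "tableName": "myTable_REALTIME",
  "tableType": "REALTIME",
  "routing": {
    "segmentPrunerTypes": ["time"]
  }
}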
s
@Mohemmad Zaid Khan @Padma Malladi
"routing": {
    "segmentPrunerTypes": ["time"]
}
This configuration is supposed to create 'n' segments based on the "dateTimeFieldSpecs", right? Test case: I have a dataset of 76K records where I'm trying this routing configuration. I have the following configuration in the table schema for a real-time table configured to consume from a single Kafka topic partition:
{
  "name": "PDate",
  "dataType": "STRING",
  "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd",
  "granularity": "1:DAYS"
}
The PDate column contains 3600-odd unique values. When I inspect, I can see only two segments created for the table. Shouldn't there be more segments created based on the values? Or am I missing something?
m
"segmentPrunerTypes": ["time"]
This setting is not for partitioning segments. It is just used to determine whether to prune each segment or not, based on the segment's startTime and endTime.
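For a realtime table, the number of segments is driven by the flush thresholds in streamConfigs rather than by the number of distinct time values, e.g. (values illustrative):

"streamConfigs": {
  "realtime.segment.flush.threshold.rows": "100000",
  "realtime.segment.flush.threshold.time": "6h"
}

So with 76K records in a single partition, seeing only a couple of segments is expected.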