# general
p
Hi, I have time series data flowing in a Kafka stream that I am ingesting into Pinot using real-time ingestion. I have a 48-hour data retention, and data volume is around 6 TB for those 48 hours. We have created an inverted index for one of the filter attributes; query performance is about 18 seconds when we query the data with that filter. However, there is another query parameter on timestamps that uses range filters and has no index currently, as we thought the segments would be created based on the time attribute defined in the table config. Is that a correct assumption, or do you suggest creating a sorted index for the timestamp attribute? Also, what kind of hardware would be ideal for getting <100 ms query times? Query performance runs into seconds initially and then drops significantly to around 200 ms, so holding the data in memory is key, but we can't have 6 TB of memory across the 43 pods we have allocated to Pinot servers. Each Pinot server is configured with about 28-32 GB, of which 50% is allocated to the JVM heap and the rest to memory mapping.
r
Hi @User, you don't need to sort on time. Try to reserve sorting for your most-filtered column, since it's generally the best option in terms of space/time.
for time you can add a range index
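For example, in the tableIndexConfig section, something like this (a sketch assuming your time column is timestamp_ms and your most-filtered column is a hypothetical memberId; swap in your real column names):

"tableIndexConfig": {
  "sortedColumn": ["memberId"],
  "rangeIndexColumns": ["timestamp_ms"]
}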
also how is your kafka topic partitioned? Have you configured partitioning so queries can make use of it?
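If the topic is partitioned on a column you filter by, you can declare it so the broker can prune by partition as well. A rough sketch, assuming a hypothetical partition column memberId, Murmur partitioning on the producer side, and 32 Kafka partitions:

"tableIndexConfig": {
  "segmentPartitionConfig": {
    "columnPartitionMap": {
      "memberId": {
        "functionName": "Murmur",
        "numPartitions": 32
      }
    }
  }
},
"routing": {
  "segmentPrunerTypes": ["partition"]
}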
p
Currently, no, but that's in the works.
So, do you think we need to add a range index for the timestamp even though we configured the timestamp field as the time attribute in the table config?
r
Yes, I believe you need to add it to the range index columns.
However, there's a new feature available on the latest builds which might be interesting to you: the timestamp index, which @User wrote recently.
Maybe you could experiment with it; otherwise the range index should be good.
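On those builds it's configured per column via fieldConfigList, roughly like this (the granularities here are just an example):

"fieldConfigList": [
  {
    "name": "timestamp_ms",
    "indexTypes": ["TIMESTAMP"],
    "timestampConfig": {
      "granularities": ["HOUR", "DAY"]
    }
  }
]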
p
We cannot upgrade Pinot atm
@User what is the significance of this configuration in the schema?
"dateTimeFieldSpecs": [
    {
      "name": "timestamp_ms",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:HOURS"
    }
]
I thought this was used for dividing the data into different segments, and hence there wouldn't be a need to create a range index for the datetime field, which in our case is the timestamp field.
m
That config is specifying the time column(s). It is not used to time partition the segments.
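For reference, the column the segment metadata is based on is declared under segmentsConfig in the table config; with your 48-hour retention it would look roughly like this (column name taken from your schema):

"segmentsConfig": {
  "timeColumnName": "timestamp_ms",
  "retentionTimeUnit": "HOURS",
  "retentionTimeValue": "48"
}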
m
Do you use the timeColumn in your queries' WHERE clause, like
time > 'value' and time < 'value'
If so, you can configure your table to have:
"routing": {
    "segmentPrunerTypes": ["time"]
}
The broker will then only query the relevant segments.
p
Thank you
is this part of the table config?
should we add the name of the field that is used as the time field, or the keyword "time"?
m
The literal "time" keyword. Yes, this is part of the table config.
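For completeness, it goes at the top level of the table config, something like this (table name and type assumed for illustration):

{
  "tableName": "myTable_REALTIME",
  "tableType": "REALTIME",
  "routing": {
    "segmentPrunerTypes": ["time"]
  }
}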
s
@Mohemmad Zaid Khan @Padma Malladi
"routing": {
    "segmentPrunerTypes": ["time"]
}
This configuration is supposed to create 'n' segments based on the "dateTimeFieldSpecs", right? Test case: I have a dataset of 76K records where I'm trying this routing configuration. I have the following configuration in the table schema for a real-time table configured to consume from a single Kafka topic partition:
{
  "name": "PDate",
  "dataType": "STRING",
  "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd",
  "granularity": "1:DAYS"
}
The PDate column contains 3600-odd unique values. When I inspect, I can see only two segments created for the table. Shouldn't there be more segments created based on the values? Or am I missing something?
m
"segmentPrunerTypes": ["time"]
This setting is not for partitioning segments. It is just used to determine whether to prune each segment or not, based on the segment's startTime and endTime.
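For a realtime table, the number of segments is driven by the flush thresholds in streamConfigs rather than by the number of distinct time values, e.g. (values illustrative):

"streamConfigs": {
  "realtime.segment.flush.threshold.rows": "100000",
  "realtime.segment.flush.threshold.time": "6h"
}

So with 76K records in a single partition, seeing only a couple of segments is expected.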