# general
k
Hi All, I have a Realtime table consuming from Kafka. As of now it has 5 billion records. I’m performing lookups [predicates] on the inverted index columns [userid, eventcategory, eventlabel], and using the “metricFieldSpecs” column timestampist for the range condition. My query is taking too long to finish, almost > 10 seconds. How can I configure it for the best optimised setup? Query:
select userid,eventlabel,sessionid, MIN(timestampist) as mint, MAX(timestampist) as maxt, (MAX(timestampist) - MIN(timestampist)) as diff_time  from default.click_stream where eventlabel !='null' and timestampist between 1615833000000 and 1616225312000 group by userid,eventlabel,sessionid
m
How long does it take if you just do count(*) without the group by but using the same where clause? And how many records does it select? My guess is your query selects too many rows, so the group by is slow. In this case, maybe try star-tree indexing.
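For reference, a minimal sketch of that diagnostic, reusing the table name, filter columns, and literal bounds from the original query above:
-- same where clause as the slow query, but without the group by
select count(*)
from default.click_stream
where eventlabel != 'null'
  and timestampist between 1615833000000 and 1616225312000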
k
@User: select count(*) from click_stream is taking 250ms on average
@User: It’s taking 5 seconds for: select MIN(timestampist) as mint, MAX(timestampist) as maxt, (MAX(timestampist) - MIN(timestampist)) as diff_time from click_stream where timestampist between 1612264456000 and 1619962345000. numDocsScanned: 442963466
select userid,eventlabel, MIN(timestampist) as mint, MAX(timestampist) as maxt, (MAX(timestampist) - MIN(timestampist)) as diff_time  from click_stream where  timestampist between 1612264456000 and 1619962345000 group by userid,eventlabel
This is taking 60 seconds. numDocsScanned: 221866025
m
Yeah, you are trying to group by 442M records. How many nodes, and what’s the JVM heap size? Your query is essentially very expensive.
k
I have re-created the table with a star-tree index:
{
  "tableName": "click_stream_REALTIME",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
    "timeColumnName": "created_at",
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "90",
    "segmentPushType": "APPEND",
    "timeType": "DAYS",
    "replication": "1",
    "replicasPerPartition": "1",
    "schemaName": "click_stream"
  },
  "tenants": {
    "broker": "DefaultTenant",
    "server": "DefaultTenant"
  },
  "tableIndexConfig": {
    "autoGeneratedInvertedIndex": true,
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "simple",
      "stream.kafka.topic.name": "vedantu.dp.click_stream.click_stream",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.hlc.zk.connect.string": "zk1:2181/kafka",
      "stream.kafka.zk.broker.url": "zk1:2181/kafka",
      "stream.kafka.broker.list": "kafka01:9092",
      "realtime.segment.flush.threshold.time": "3600000",
      "realtime.segment.flush.threshold.size": "50000",
      "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
    },
    "aggregateMetrics": false,
    "enableDefaultStarTree": false,
    "enableDynamicStarTreeCreation": false,
    "nullHandlingEnabled": true,
    "createInvertedIndexDuringSegmentGeneration": false,
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": [
          "eventcategory",
          "eventlabel",
          "userid",
          "sessionid"
        ],
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": [
          "MIN__timestampist",
          "MAX__timestampist"
        ]
      }
    ],
    "rangeIndexColumns": [
      "timestampist",
      "created_at"
    ]
  },
  "metadata": {
    "customConfigs": {}
  },
  "routing": {
    "instanceSelectorType": "replicaGroup"
  },
  "isDimTable": false
}
@User but it doesn’t help here. What should I do next? Please suggest.
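For context, a star-tree index with this dimensionsSplitOrder and these functionColumnPairs can generally only serve queries whose group-by columns and predicates stay within the listed dimensions and whose aggregations match the configured MIN/MAX pairs; the range filter on timestampist (a metric, not in the split order) likely prevents the index from kicking in. A sketch of a query shape this particular config could serve (the eventcategory value is a hypothetical example, not from the source):
-- group-by and filter columns taken from dimensionsSplitOrder,
-- aggregations from functionColumnPairs (MIN/MAX on timestampist);
-- 'page_view' is an illustrative value only
select userid, eventlabel, sessionid,
       MIN(timestampist) as mint,
       MAX(timestampist) as maxt
from click_stream
where eventcategory = 'page_view'
group by userid, eventlabel, sessionid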
m
The query is really expensive.
How many servers, and what’s the JVM heap?
k
Yeah, there are 2 servers with 100GB heap space
m
What’s the use case? Is this going to be a real production query? Or are you experimenting
k
This is going to be a prod query:
select userid,eventlabel,sessionid, MIN(timestampist) as mint, MAX(timestampist) as maxt, (MAX(timestampist) - MIN(timestampist)) as diff_time  from default.click_stream where sessionid !='null' and timestampist between 1615833000000 and 1616225312000 group by userid,eventlabel,sessionid
Scanning 3 months of data.
m
What’s your time column granularity?
k
1 minute
m
Hmm, can you live with 1 day?
Since you are going back 3 months?
And use aggregateMetrics. This will reduce the number of rows and improve latency.
k
I’ll try now
"dateTimeFieldSpecs": [
    {
      "name": "created_at",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:DAY"
    }
  ]
@User should I try this…
m
Also enable metrics aggregation.
k
OK
"aggregateMetrics": false ---> true
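Putting the two suggestions together, a minimal sketch of the fragments being changed, assuming the same schema and table config shown above (during realtime consumption, aggregateMetrics sums the metric columns of rows that share identical dimension and time values, so a coarser time granularity increases how many rows collapse):
In the schema's dateTimeFieldSpecs:
  "granularity": "1:DAYS"
In tableIndexConfig:
  "aggregateMetrics": true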
Let’s move to troubleshooting
k
@User yes
@User The only aggregation supported right now with “aggregateMetrics”: true is SUM, which we aren’t using as part of the query. Now my queries are not even responding within 60 seconds with the new table definition with the star-tree index.