another question related to importing data we have imported Apache Pinot #troubleshooting

Luis Fernandez

06/08/2022, 7:39 PM

Another question related to importing data: we have imported our last 2 years' worth of data into Pinot using the standalone job in dev. However, we are observing different behavior for the same query between prod and dev (prod doesn't have this historical data yet, but it does have data for this particular time range). Performance in dev is way slower. This is the query:

SELECT product_id,
       SUM(impression_count) AS impression_count,
       SUM(click_count) AS click_count,
       SUM(cost) AS spent_total
FROM metrics
WHERE user_id = xxx
  AND serve_time BETWEEN 1651363200 AND 1654012799
GROUP BY product_id
LIMIT 6000

production metadata response:

"numServersQueried": 4,
  "numServersResponded": 4,
  "numSegmentsQueried": 97,
  "numSegmentsProcessed": 31,
  "numSegmentsMatched": 31,
  "numConsumingSegmentsQueried": 1,
  "numDocsScanned": 15109,
  "numEntriesScannedInFilter": 0,
  "numEntriesScannedPostFilter": 60436,
  "numGroupsLimitReached": false,
  "totalDocs": 493642793,
  "timeUsedMs": 32,
  "offlineThreadCpuTimeNs": 0,
  "realtimeThreadCpuTimeNs": 0,
  "offlineSystemActivitiesCpuTimeNs": 0,
  "realtimeSystemActivitiesCpuTimeNs": 0,
  "offlineResponseSerializationCpuTimeNs": 0,
  "realtimeResponseSerializationCpuTimeNs": 0,
  "offlineTotalCpuTimeNs": 0,
  "realtimeTotalCpuTimeNs": 0,
  "segmentStatistics": [],
  "traceInfo": {},
  "minConsumingFreshnessTimeMs": 1654715649414,
  "numRowsResultSet": 9708

dev metadata response:

"exceptions": [],
  "numServersQueried": 4,
  "numServersResponded": 4,
  "numSegmentsQueried": 11703,
  "numSegmentsProcessed": 31,
  "numSegmentsMatched": 31,
  "numConsumingSegmentsQueried": 1,
  "numDocsScanned": 15117,
  "numEntriesScannedInFilter": 0,
  "numEntriesScannedPostFilter": 60468,
  "numGroupsLimitReached": false,
  "totalDocs": 51283295726,
  "timeUsedMs": 580,
  "offlineThreadCpuTimeNs": 0,
  "realtimeThreadCpuTimeNs": 0,
  "offlineSystemActivitiesCpuTimeNs": 0,
  "realtimeSystemActivitiesCpuTimeNs": 0,
  "offlineResponseSerializationCpuTimeNs": 0,
  "realtimeResponseSerializationCpuTimeNs": 0,
  "offlineTotalCpuTimeNs": 0,
  "realtimeTotalCpuTimeNs": 0,
  "segmentStatistics": [],
  "traceInfo": {},
  "minConsumingFreshnessTimeMs": 1654716958681,
  "numRowsResultSet": 9708

Number of segments in prod: 1600. Number of segments in dev: 13000. My question is about numSegmentsQueried: it is far higher in dev, almost equal to the number of segments that exist in the cluster, while prod only queries a tiny portion of its segments. I'm wondering why that is, and whether it is the reason the query performs slower in dev. Do you have an idea as to what may be happening?
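To make the comparison above concrete, here is a minimal Python sketch that puts the two metadata responses side by side. The numbers are copied from the responses quoted in this thread; the helper names `ratio_report` and `processed_fraction` are made up for illustration and are not part of any Pinot API.

```python
# Fields copied from the prod and dev broker responses quoted above.
prod = {
    "numSegmentsQueried": 97,
    "numSegmentsProcessed": 31,
    "numSegmentsMatched": 31,
    "numDocsScanned": 15109,
    "numEntriesScannedPostFilter": 60436,
    "totalDocs": 493642793,
    "timeUsedMs": 32,
}
dev = {
    "numSegmentsQueried": 11703,
    "numSegmentsProcessed": 31,
    "numSegmentsMatched": 31,
    "numDocsScanned": 15117,
    "numEntriesScannedPostFilter": 60468,
    "totalDocs": 51283295726,
    "timeUsedMs": 580,
}


def ratio_report(a, b):
    """Return {field: b/a} for every shared field whose value differs."""
    return {
        key: b[key] / a[key]
        for key in a
        if key in b and a[key] != b[key] and a[key] != 0
    }


def processed_fraction(stats):
    """Fraction of the queried segments that were actually processed."""
    return stats["numSegmentsProcessed"] / stats["numSegmentsQueried"]


if __name__ == "__main__":
    for field, ratio in sorted(ratio_report(prod, dev).items()):
        print(f"{field}: dev is {ratio:.2f}x prod")
    print(f"prod processed fraction: {processed_fraction(prod):.4f}")
    print(f"dev processed fraction:  {processed_fraction(dev):.4f}")
```

Both clusters process and match the same 31 segments and scan nearly the same number of documents, so the fields that differ by a large factor are numSegmentsQueried, totalDocs, and timeUsedMs.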

Mayank

06/08/2022, 9:26 PM

Probably your dev data is not partitioned.

Luis Fernandez

06/08/2022, 9:29 PM

i thought partitioning was only for QPS

Mayank

06/08/2022, 9:30 PM

Is your question on "numSegmentsQueried": 11703?

Luis Fernandez

06/08/2022, 9:30 PM

yes

Mayank

06/08/2022, 9:31 PM

Partitioning improves QPS by reducing numSegmentsQueried

Mayank

06/08/2022, 9:31 PM

Also, your VM configs may differ between dev and prod

Luis Fernandez

06/08/2022, 9:31 PM

dev is a replica of prod in terms of specs

Luis Fernandez

06/08/2022, 9:32 PM

the only difference i see between the 2 metadata responses is the numSegmentsQueried

Mayank

06/08/2022, 9:32 PM

Num nodes, jvm configs, all other variables match exactly?

Luis Fernandez

06/08/2022, 9:32 PM

yes everything is the same

Mayank

06/08/2022, 9:32 PM

Check segment metadata in dev to see if it shows 1 partition or multiple

Luis Fernandez

06/08/2022, 9:33 PM

dev however has way more data because we haven’t run that import

Luis Fernandez

06/08/2022, 9:33 PM

on prod

Luis Fernandez

06/08/2022, 9:34 PM

"segment.partition.metadata": "{\"columnPartitionMap\":{\"user_id\":{\"numPartitions\":8,\"partitions\":[0,1,2,3,4,5,6,7],\"functionName\":\"Murmur\",\"functionConfig\":null}}}"

Mayank

06/08/2022, 9:35 PM

It is not partitioned ^^

Luis Fernandez

06/08/2022, 9:35 PM

that means it's not partitioned?

Mayank

06/08/2022, 9:35 PM

Yes
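The metadata check discussed above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not Pinot's pruning code: it assumes a partition-based pruner can skip a segment only when the partition the query value maps to is absent from that segment's `partitions` list. The function name `prunable_partitions` and the second, single-partition metadata string are hypothetical.

```python
import json

# Value of "segment.partition.metadata" as quoted in the thread (unescaped).
dev_metadata = (
    '{"columnPartitionMap":{"user_id":{"numPartitions":8,'
    '"partitions":[0,1,2,3,4,5,6,7],"functionName":"Murmur",'
    '"functionConfig":null}}}'
)

# Hypothetical metadata for a segment that holds a single partition.
single_partition_metadata = (
    '{"columnPartitionMap":{"user_id":{"numPartitions":8,'
    '"partitions":[3],"functionName":"Murmur","functionConfig":null}}}'
)


def prunable_partitions(metadata_json, column):
    """Return the partition ids for which this segment could be skipped.

    A segment can be skipped for a query value only if that value's
    partition id is NOT listed in the segment's "partitions" array.
    """
    info = json.loads(metadata_json)["columnPartitionMap"][column]
    present = set(info["partitions"])
    return sorted(set(range(info["numPartitions"])) - present)


if __name__ == "__main__":
    print(prunable_partitions(dev_metadata, "user_id"))               # []
    print(prunable_partitions(single_partition_metadata, "user_id"))  # [0, 1, 2, 4, 5, 6, 7]
```

A segment that lists all 8 partitions can never be skipped for any user_id, which is why it counts as not partitioned even though the partition metadata is present.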

Luis Fernandez

06/08/2022, 9:35 PM

would that explain that bottleneck we are seeing?

Luis Fernandez

06/08/2022, 9:35 PM

like that difference in response times?

Mayank

06/08/2022, 9:37 PM

It might, depending on main memory vs total segment size on disk, whether you are running other load at the time, and other factors.

Luis Fernandez

06/08/2022, 9:38 PM

all the queries in general are way slower in dev than they are on prod

Luis Fernandez

06/08/2022, 9:39 PM

for the same time windows that prod has data for

Luis Fernandez

06/08/2022, 9:39 PM

and the only variable i see is the numSegmentsQueried

Mayank

06/08/2022, 9:39 PM

And that’s what I am explaining above.

Luis Fernandez

06/08/2022, 9:41 PM

i have this config on offline

"segmentPartitionConfig": {
  "columnPartitionMap": {
    "user_id": {
      "functionName": "Murmur",
      "numPartitions": 8
    }
  }
},
"routing": {
  "segmentPrunerTypes": [
    "partition"
  ]
},

Luis Fernandez

06/08/2022, 9:42 PM

Does this mean that whatever generates the data, which offline then ingests through the standalone job, is responsible for partitioning it, and not Pinot itself?
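The thread as captured here does not include an answer to this question. If partitioning does have to happen upstream of the standalone job, the idea would be to split the rows by user_id so that each input file, and therefore each generated segment, holds rows from a single partition. Below is a rough Python sketch under that assumption. The hash is a stand-in: `zlib.crc32` is used only to keep the example self-contained and is NOT Pinot's Murmur function, so a real pipeline would need a hash that matches the `functionName` in the table config. The names `stand_in_partition` and `bucket_rows` and the sample rows are hypothetical.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 8  # matches numPartitions in the table config above


def stand_in_partition(user_id, num_partitions=NUM_PARTITIONS):
    """Map a user_id to a partition id.

    Stand-in only: crc32 is NOT Pinot's Murmur partition function.
    """
    return zlib.crc32(str(user_id).encode("utf-8")) % num_partitions


def bucket_rows(rows, key="user_id"):
    """Group rows so that each bucket contains exactly one partition id."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[stand_in_partition(row[key])].append(row)
    return dict(buckets)


if __name__ == "__main__":
    # Hypothetical sample rows shaped like the metrics table in the query.
    rows = [
        {"user_id": uid, "product_id": pid, "click_count": 1}
        for uid in range(100, 120)
        for pid in range(3)
    ]
    for partition_id, bucket in sorted(bucket_rows(rows).items()):
        # Each bucket would be written to its own input file.
        print(partition_id, len(bucket))
```

Each bucket would then be written to its own input file before segment generation, so that a segment's partitions list contains one id instead of all eight.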
