# troubleshooting
l
another question related to importing data: we have imported our last 2 years' worth of data into Pinot using the standalone job in dev, however we are observing two different behaviors for the same query in prod/dev (prod doesn't have this historical data yet, but it does have data for this particular time range). Performance in dev is way slower. this query:
```sql
SELECT product_id, SUM(impression_count) AS impression_count, SUM(click_count) AS click_count, SUM(cost) AS spent_total
FROM metrics
WHERE user_id = xxx AND serve_time BETWEEN 1651363200 AND 1654012799
GROUP BY product_id
LIMIT 6000
```
production metadata response:
```json
"numServersQueried": 4,
  "numServersResponded": 4,
  "numSegmentsQueried": 97,
  "numSegmentsProcessed": 31,
  "numSegmentsMatched": 31,
  "numConsumingSegmentsQueried": 1,
  "numDocsScanned": 15109,
  "numEntriesScannedInFilter": 0,
  "numEntriesScannedPostFilter": 60436,
  "numGroupsLimitReached": false,
  "totalDocs": 493642793,
  "timeUsedMs": 32,
  "offlineThreadCpuTimeNs": 0,
  "realtimeThreadCpuTimeNs": 0,
  "offlineSystemActivitiesCpuTimeNs": 0,
  "realtimeSystemActivitiesCpuTimeNs": 0,
  "offlineResponseSerializationCpuTimeNs": 0,
  "realtimeResponseSerializationCpuTimeNs": 0,
  "offlineTotalCpuTimeNs": 0,
  "realtimeTotalCpuTimeNs": 0,
  "segmentStatistics": [],
  "traceInfo": {},
  "minConsumingFreshnessTimeMs": 1654715649414,
  "numRowsResultSet": 9708
```
dev metadata response:
```json
"exceptions": [],
  "numServersQueried": 4,
  "numServersResponded": 4,
  "numSegmentsQueried": 11703,
  "numSegmentsProcessed": 31,
  "numSegmentsMatched": 31,
  "numConsumingSegmentsQueried": 1,
  "numDocsScanned": 15117,
  "numEntriesScannedInFilter": 0,
  "numEntriesScannedPostFilter": 60468,
  "numGroupsLimitReached": false,
  "totalDocs": 51283295726,
  "timeUsedMs": 580,
  "offlineThreadCpuTimeNs": 0,
  "realtimeThreadCpuTimeNs": 0,
  "offlineSystemActivitiesCpuTimeNs": 0,
  "realtimeSystemActivitiesCpuTimeNs": 0,
  "offlineResponseSerializationCpuTimeNs": 0,
  "realtimeResponseSerializationCpuTimeNs": 0,
  "offlineTotalCpuTimeNs": 0,
  "realtimeTotalCpuTimeNs": 0,
  "segmentStatistics": [],
  "traceInfo": {},
  "minConsumingFreshnessTimeMs": 1654716958681,
  "numRowsResultSet": 9708
```
amount of segments in prod: 1600; amount of segments in dev: 13000. I guess my question is: I see numSegmentsQueried be way higher in dev and I'm wondering why, and whether that's the reason the query is just performing slower in dev. It's almost equal to the amount of segments that exist in the cluster, while prod is only querying a tiny portion. Do you have an idea as to what may be happening?
m
Probably your dev data is not partitioned.
l
i thought partitioning was only for QPS
m
Is your question on
"numSegmentsQueried": 11703,
?
l
yes
m
Partitioning improves QPS by reducing
numSegmentsQueried
Also, your VM configs may differ between dev and prod
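A minimal sketch of what partition-based segment pruning buys you (hypothetical Python, not Pinot's actual code; the hash is a stand-in for Pinot's Murmur partition function): each segment records which `user_id` partition(s) it contains, and the broker only routes the query to segments whose partition set contains `hash(user_id) % numPartitions`.

```python
# Hypothetical illustration of partition-based segment pruning (not Pinot's code).
NUM_PARTITIONS = 8

def partition_of(user_id: int) -> int:
    # Stand-in hash purely for illustration; Pinot uses a Murmur hash
    # for the "Murmur" partition function.
    return user_id % NUM_PARTITIONS

def prune(segments, user_id):
    # Keep only segments that can contain rows for this user_id.
    target = partition_of(user_id)
    return [s for s in segments if target in s["partitions"]]

# Well-partitioned: each segment holds exactly one partition -> ~1/8 queried.
well = [{"name": f"seg{i}", "partitions": {i % NUM_PARTITIONS}} for i in range(80)]
# Unpartitioned: every segment lists all 8 partitions -> nothing can be pruned.
flat = [{"name": f"seg{i}", "partitions": set(range(NUM_PARTITIONS))} for i in range(80)]

print(len(prune(well, user_id=123)))  # 10
print(len(prune(flat, user_id=123)))  # 80
```

This mirrors what the metadata shows: 97 of 1600 segments queried in prod versus 11703 of 13000 in dev.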
l
dev is a replica of prod in terms of specs
the only difference i see between the 2 metadata responses is the
numSegmentsQueried
m
Num nodes, jvm configs, all other variables match exactly?
l
yes everything is the same
m
Check segment metadata in dev to see if it shows 1 partition or multiple
l
dev however has way more data because we haven’t run that import
on prod
"segment.partition.metadata": "{\"columnPartitionMap\":{\"user_id\":{\"numPartitions\":8,\"partitions\":[0,1,2,3,4,5,6,7],\"functionName\":\"Murmur\",\"functionConfig\":null}}}"
m
It is not partitioned ^^
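For comparison (an illustrative fragment, not from this cluster), a segment whose data was actually partitioned would list a single partition in its metadata rather than all 8, e.g.:

```json
"segment.partition.metadata": "{\"columnPartitionMap\":{\"user_id\":{\"numPartitions\":8,\"partitions\":[3],\"functionName\":\"Murmur\",\"functionConfig\":null}}}"
```

Because the dev segment lists partitions `[0,1,2,3,4,5,6,7]`, the partition pruner cannot exclude it for any `user_id`.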
l
that means it’s not partitioned
m
Yes
l
would that explain the bottleneck we are seeing?
like that difference in response times?
m
It might, depending on main memory vs total segment size on disk, whether you are running other load at the time, and other factors.
l
all the queries in general are way slower in dev than they are on prod
for the same time windows that prod has data for
and the only variable i see is the numSegmentsQueried
m
And that’s what I am explaining above.
l
i have this config on offline
```json
"segmentPartitionConfig": {
        "columnPartitionMap": {
          "user_id": {
            "functionName": "Murmur",
            "numPartitions": 8
          }
        }
      },
"routing": {
      "segmentPrunerTypes": [
        "partition"
      ]
    },
```
this means that whatever generates the data that offline then ingests through the standalone job is responsible for partitioning it, not Pinot itself, yes?
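For illustration, the kind of upstream pre-partitioning step being asked about might look like the sketch below (hypothetical Python; `zlib.crc32` is a stand-in hash, so to line up with Pinot's "Murmur" function you would need the same Murmur2 hash Pinot applies). The idea is to split input rows into one bucket per `user_id` partition before segment generation, so each generated segment carries a single partition id in its metadata.

```python
# Hypothetical pre-partitioning sketch: group rows by user_id partition so
# each bucket (and thus each segment built from it) holds one partition.
import zlib
from collections import defaultdict

NUM_PARTITIONS = 8

def partition_of(user_id: str) -> int:
    # Stand-in hash for illustration only; Pinot's "Murmur" function uses
    # a Murmur2 hash, which must match here for pruning to work correctly.
    return zlib.crc32(user_id.encode()) % NUM_PARTITIONS

def split_by_partition(rows):
    buckets = defaultdict(list)
    for row in rows:
        buckets[partition_of(row["user_id"])].append(row)
    return buckets

rows = [{"user_id": f"u{i}", "clicks": i} for i in range(100)]
buckets = split_by_partition(rows)
# Every row for a given user_id lands in exactly one bucket, so a segment
# built from one bucket spans a single partition.
```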