Dan Hill
07/01/2020, 10:41 PMmetrics
is slow to query but the individual metrics_OFFLINE
and metrics_REALTIME
are quick to query separately. Any ideas?
select utc_date, sum(impressions) from metrics_OFFLINE where utc_date >= 1591142400000 and utc_date < 1593648000000 group by utc_date order by utc_date ASC limit 1831
This returns pretty fast (200ms) over a lot of 400mil rows.
If I switch to metrics_REALTIME
, it's also fast and returns zero rows.
select utc_date, sum(impressions) from metrics_REALTIME where utc_date >= 1591142400000 and utc_date < 1593648000000 group by utc_date order by utc_date ASC limit 1831
However, if I query metrics
, it's very slow.
select utc_date, sum(impressions) from metrics where utc_date >= 1591142400000 and utc_date < 1593648000000 group by utc_date order by utc_date ASC limit 1831
pinot-server
include the following timeout. I modified the query above to simplify it (in case you see differences).
2020/07/01 22:28:11.526 ERROR [CombineGroupByOrderByOperator] [pqr-0] Timed out while combining group-by results after 9996ms, brokerRequest = BrokerRequest(querySource:QuerySource(tableName:metrics_OFFLINE), filterQuery:FilterQuery(id:0, value:null, operator:AND, nestedFilterQueryIds:[1, 2, 3]), aggregationsInfo:[AggregationInfo(aggregationType:SUM, aggregationParams:{column=clicks}, isInSelectList:true, expressions:[clicks]), AggregationInfo(aggregationType:SUM, aggregationParams:{column=impressions}, isInSelectList:true, expressions:[impressions]), AggregationInfo(aggregationType:SUM, aggregationParams:{column=cost_usd_micros}, isInSelectList:true, expressions:[cost_usd_micros])], groupBy:GroupBy(topN:1831, expressions:[utc_date]), filterSubQueryMap:FilterQueryMap(filterQueryMap:{0=FilterQuery(id:0, value:null, operator:AND, nestedFilterQueryIds:[1, 2, 3]), 1=FilterQuery(id:1, column:utc_date, value:[[1591142400000 *)], operator:RANGE, nestedFilterQueryIds:[]), 2=FilterQuery(id:2, column:utc_date, value:[(* 1593648000000)], operator:RANGE, nestedFilterQueryIds:[]), 3=FilterQuery(id:3, column:timestamp, value:[(* 1593561599986]], operator:RANGE, nestedFilterQueryIds:[])}), queryOptions:{responseFormat=sql, groupByMode=sql, timeoutMs=9997}, pinotQuery:PinotQuery(dataSource:DataSource(tableName:metrics), selectList:[Expression(type:IDENTIFIER, identifier:Identifier(name:utc_date)), Expression(type:FUNCTION, functionCall:Function(operator:SUM, operands:[Expression(type:IDENTIFIER, identifier:Identifier(name:clicks))])), Expression(type:FUNCTION, functionCall:Function(operator:SUM, operands:[Expression(type:IDENTIFIER, identifier:Identifier(name:impressions))])), Expression(type:FUNCTION, functionCall:Function(operator:SUM, operands:[Expression(type:IDENTIFIER, identifier:Identifier(name:cost_usd_micros))]))], filterExpression:Expression(type:FUNCTION, functionCall:Function(operator:AND, operands:[Expression(type:FUNCTION, functionCall:Function(operator:GREATER_THAN_OR_EQUAL, operands:[Expression(type:IDENTIFIER, identifier:Identifier(name:utc_date)), Expression(type:LITERAL, literal:<Literal longValue:1591142400000>)])), Expression(type:FUNCTION, functionCall:Function(operator:LESS_THAN, operands:[Expression(type:IDENTIFIER, identifier:Identifier(name:utc_date)), Expression(type:LITERAL, literal:<Literal longValue:1593648000000>)]))])), groupByList:[Expression(type:IDENTIFIER, identifier:Identifier(name:utc_date))], orderByList:[Expression(type:FUNCTION, functionCall:Function(operator:ASC, operands:[Expression(type:IDENTIFIER, identifier:Identifier(name:utc_date))]))], limit:1831), orderBy:[SelectionSort(column:utc_date, isAsc:true)], limit:1831)
{
"exceptions": [],
"numServersQueried": 2,
"numServersResponded": 0,
"numSegmentsQueried": 0,
"numSegmentsProcessed": 0,
"numSegmentsMatched": 0,
"numConsumingSegmentsQueried": 0,
"numDocsScanned": 0,
"numEntriesScannedInFilter": 0,
"numEntriesScannedPostFilter": 0,
"numGroupsLimitReached": false,
"totalDocs": 0,
"timeUsedMs": 10000,
"segmentStatistics": [],
"traceInfo": {},
"minConsumingFreshnessTimeMs": 0
}
Here are the fast ones. metrics_OFFLINE
"exceptions": [],
"numServersQueried": 1,
"numServersResponded": 1,
"numSegmentsQueried": 418,
"numSegmentsProcessed": 418,
"numSegmentsMatched": 319,
"numConsumingSegmentsQueried": 0,
"numDocsScanned": 319,
"numEntriesScannedInFilter": 0,
"numEntriesScannedPostFilter": 638,
"numGroupsLimitReached": false,
"totalDocs": 398507752,
"timeUsedMs": 262,
"segmentStatistics": [],
"traceInfo": {},
"minConsumingFreshnessTimeMs": 0
metrics_REALTIME
{
"resultTable": {
"dataSchema": {
"columnDataTypes": [
"STRING",
"DOUBLE"
],
"columnNames": [
"utc_date",
"sum(impressions)"
]
},
"rows": []
},
"exceptions": [],
"numServersQueried": 1,
"numServersResponded": 1,
"numSegmentsQueried": 1,
"numSegmentsProcessed": 0,
"numSegmentsMatched": 0,
"numConsumingSegmentsQueried": 1,
"numDocsScanned": 0,
"numEntriesScannedInFilter": 0,
"numEntriesScannedPostFilter": 0,
"numGroupsLimitReached": false,
"totalDocs": 0,
"timeUsedMs": 16,
"segmentStatistics": [],
"traceInfo": {},
"minConsumingFreshnessTimeMs": 9223372036854776000
}
Xiang Fu
Dan Hill
07/01/2020, 11:11 PMXiang Fu
select utc_date, sum(impressions) from metrics_OFFLINE where utc_date >= 1591142400000 and utc_date < 1593648000000 and utc_date < 1593561599986 group by utc_date order by utc_date ASC limit 1831
Dan Hill
07/01/2020, 11:18 PMXiang Fu
Dan Hill
07/01/2020, 11:18 PMXiang Fu
Dan Hill
07/01/2020, 11:19 PMXiang Fu
2020/07/01 22:28:11.526 ERROR [CombineGroupByOrderByOperator] [pqr-0] Timed out while combining group-by results after 9996ms, brokerRequest = BrokerRequest(querySource:QuerySource(tableName:metrics_OFFLINE), filterQuery:FilterQuery(id:0, value:null, operator:AND, nestedFilterQueryIds:[1, 2, 3]), aggregationsInfo:[AggregationInfo(aggregationType:SUM, aggregationParams:{column=clicks}, isInSelectList:true, expressions:[clicks]), AggregationInfo(aggregationType:SUM, aggregationParams:{column=impressions}, isInSelectList:true, expressions:[impressions]), AggregationInfo(aggregationType:SUM, aggregationParams:{column=cost_usd_micros}, isInSelectList:true, expressions:[cost_usd_micros])], groupBy:GroupBy(topN:1831, expressions:[utc_date]), filterSubQueryMap:FilterQueryMap(filterQueryMap:{0=FilterQuery(id:0, value:null, operator:AND, nestedFilterQueryIds:[1, 2, 3]), 1=FilterQuery(id:1, column:utc_date, value:[[1591142400000 *)], operator:RANGE, nestedFilterQueryIds:[]), 2=FilterQuery(id:2, column:utc_date, value:[(* 1593648000000)], operator:RANGE, nestedFilterQueryIds:[]), 3=FilterQuery(id:3, column:timestamp, value:[(* 1593561599986]], operator:RANGE, nestedFilterQueryIds:[])}), queryOptions:{responseFormat=sql, groupByMode=sql, timeoutMs=9997}, pinotQuery:PinotQuery(dataSource:DataSource(tableName:metrics), selectList:[Expression(type:IDENTIFIER, identifier:Identifier(name:utc_date)), Expression(type:FUNCTION, functionCall:Function(operator:SUM, operands:[Expression(type:IDENTIFIER, identifier:Identifier(name:clicks))])), Expression(type:FUNCTION, functionCall:Function(operator:SUM, operands:[Expression(type:IDENTIFIER, identifier:Identifier(name:impressions))])), Expression(type:FUNCTION, functionCall:Function(operator:SUM, operands:[Expression(type:IDENTIFIER, identifier:Identifier(name:cost_usd_micros))]))], filterExpression:Expression(type:FUNCTION, functionCall:Function(operator:AND, operands:[Expression(type:FUNCTION, functionCall:Function(operator:GREATER_THAN_OR_EQUAL, operands:[Expression(type:IDENTIFIER, identifier:Identifier(name:utc_date)), Expression(type:LITERAL, literal:<Literal longValue:1591142400000>)])), Expression(type:FUNCTION, functionCall:Function(operator:LESS_THAN, operands:[Expression(type:IDENTIFIER, identifier:Identifier(name:utc_date)), Expression(type:LITERAL, literal:<Literal longValue:1593648000000>)]))])), groupByList:[Expression(type:IDENTIFIER, identifier:Identifier(name:utc_date))], orderByList:[Expression(type:FUNCTION, functionCall:Function(operator:ASC, operands:[Expression(type:IDENTIFIER, identifier:Identifier(name:utc_date))]))], limit:1831), orderBy:[SelectionSort(column:utc_date, isAsc:true)], limit:1831)
3=FilterQuery(id:3, column:timestamp, value:[(* 1593561599986]], operator:RANGE, nestedFilterQueryIds:[])
Dan Hill
07/01/2020, 11:22 PMXiang Fu
Dan Hill
07/01/2020, 11:22 PMXiang Fu
Dan Hill
07/01/2020, 11:23 PM2020/07/01 22:26:55.306 ERROR [QueryRouter] [jersey-server-managed-async-executor-19] Caught exception while sending request 1496 to server: pinot-server-0_O, marking query failed
java.net.UnknownHostException: pinot-server-0.pinot-server-headless.pinot-metrics-dev.svc.cluster.local: Name or service not known
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method) ~[?:1.8.0_252]
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929) ~[?:1.8.0_252]
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324) ~[?:1.8.0_252]
at java.net.InetAddress.getAllByName0(InetAddress.java:1277) ~[?:1.8.0_252]
at java.net.InetAddress.getAllByName(InetAddress.java:1193) ~[?:1.8.0_252]
at java.net.InetAddress.getAllByName(InetAddress.java:1127) ~[?:1.8.0_252]
at java.net.InetAddress.getByName(InetAddress.java:1077) ~[?:1.8.0_252]
Xiang Fu
select utc_date, sum(impressions), sum(cost_usd_micros) from metrics where utc_date >= 1591142400000 and utc_date < 1593648000000 and utc_date < 1593561599986 group by utc_date order by utc_date LIMIT 1831
Dan Hill
07/01/2020, 11:31 PMXiang Fu
Dan Hill
07/02/2020, 12:01 AMJackie
07/02/2020, 1:52 AMtimestamp
as the time column?select utc_date, sum(impressions), sum(cost_usd_micros) from metrics where utc_date >= 1591142400000 and utc_date < 1593648000000 and timestamp < 1593561599986 group by utc_date order by utc_date LIMIT 1831
(notice that the last filter is on timestamp
)timestamp
?Dan Hill
07/02/2020, 3:58 AMmetrics_realtime_table_config.json: |-
{
"tableName": "metrics",
"tableType": "REALTIME",
"segmentsConfig": {
"timeColumnName": "timestamp",
"timeType": "MILLISECONDS",
"retentionTimeUnit": "DAYS",
"retentionTimeValue": "1",
"segmentPushType": "APPEND",
"segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
"schemaName": "metrics",
"replication": "1",
"replicasPerPartition": "1"
},
"tableIndexConfig": {
"loadMode" : "MMAP",
"aggregateMetrics": true,
"noDictionaryColumns": ["insertions", "impressions", "clicks", "cost_usd_micros"],
"starTreeIndexConfigs": [
{
"dimensionsSplitOrder": [
"utc_date",
"platform_id",
"customer_id",
"account_id",
"campaign_id",
"promotion_id"
],
"skipStarNodeCreationForDimensions": [
],
"functionColumnPairs": [
"SUM__insertions",
"SUM__impressions",
"SUM__clicks",
"SUM__cost_usd_micros"
]
},
{
"dimensionsSplitOrder": [
"utc_date",
"platform_id",
"content_id"
],
"skipStarNodeCreationForDimensions": [
],
"functionColumnPairs": [
"SUM__insertions",
"SUM__impressions",
"SUM__clicks",
"SUM__cost_usd_micros"
]
}
],
"streamConfigs": {
"streamType": "kafka",
"stream.kafka.consumer.type": "simple",
"stream.kafka.topic.name": "metrics-realtime",
"stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
"stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
"stream.kafka.hlc.zk.connect.string": "kafka-zookeeper:2181",
"stream.kafka.zk.broker.url": "kafka-zookeeper:2181",
"stream.kafka.broker.list": "kafka:9092",
"realtime.segment.flush.threshold.time": "3600000",
"realtime.segment.flush.threshold.size": "50000",
"stream.kafka.consumer.prop.auto.offset.reset": "smallest"
}
},
"tenants": {},
"metadata": {
"customConfigs": {}
}
}
metrics_schema.json: |-
{
"schemaName": "metrics",
"dimensionFieldSpecs": [
{
"name": "utc_date",
"dataType": "LONG"
},
{
"name": "platform_id",
"dataType": "LONG"
},
{
"name": "customer_id",
"dataType": "LONG"
},
{
"name": "account_id",
"dataType": "LONG"
},
{
"name": "campaign_id",
"dataType": "LONG"
},
{
"name": "promotion_id",
"dataType": "LONG"
},
{
"name": "content_id",
"dataType": "LONG"
}
],
"metricFieldSpecs": [
{
"name": "insertions",
"dataType": "LONG"
},
{
"name": "impressions",
"dataType": "LONG"
},
{
"name": "clicks",
"dataType": "LONG"
},
{
"name": "cost_usd_micros",
"dataType": "LONG"
}
],
"timeFieldSpec": {
"incomingGranularitySpec": {
"name": "timestamp",
"dataType": "LONG",
"timeFormat" : "EPOCH",
"timeType": "MILLISECONDS"
}
}
}
select utc_date, sum(impressions) from metrics where utc_date >= 1591142400000 and utc_date < 1593648000000 and "timestamp" >= 1591142400000 and "timestamp" < 1593648000000 group by utc_date order by utc_date ASC limit 1831
Jackie
07/02/2020, 9:06 PMtimestamp
is not part of the dimensionsSplitOrder
select utc_date, sum(impressions) from metrics_OFFLINE where utc_date >= 1591142400000 and utc_date < 1593648000000 and timestamp < 1593648000000 group by utc_date order by utc_date ASC limit 1831
utc_date
as the time column?Dan Hill
07/02/2020, 9:36 PMselect utc_date, sum(impressions) from metrics_OFFLINE where utc_date >= 1591142400000 and utc_date < 1593648000000 and "timestamp" < 1593648000000 group by utc_date order by utc_date ASC limit 1831
Are there any issues with making utc the time column? I'll eventually support times in some other timezones (as separate dimensions).Jackie
07/02/2020, 9:48 PMDan Hill
07/02/2020, 9:57 PMutc_date
and is slow.select utc_date, sum(impressions) from metrics_OFFLINE where utc_date >= 1591142400000 and utc_date < 1593648000000 group by utc_date order by utc_date ASC limit 1831
Jackie
07/02/2020, 11:45 PMDateTimeField
, @Neha Pawar can give more context on thatDan Hill
07/02/2020, 11:53 PM{
"schemaName": "metrics",
"dimensionFieldSpecs": [
{
"name": "utc_date",
"dataType": "LONG"
},
{
"name": "platform_id",
"dataType": "LONG"
},
{
"name": "customer_id",
"dataType": "LONG"
},
{
"name": "account_id",
"dataType": "LONG"
},
{
"name": "campaign_id",
"dataType": "LONG"
},
{
"name": "promotion_id",
"dataType": "LONG"
},
{
"name": "content_id",
"dataType": "LONG"
}
],
"metricFieldSpecs": [
{
"name": "insertions",
"dataType": "LONG"
},
{
"name": "impressions",
"dataType": "LONG"
},
{
"name": "clicks",
"dataType": "LONG"
},
{
"name": "cost_usd_micros",
"dataType": "LONG"
}
],
"timeFieldSpec": {
"incomingGranularitySpec": {
"name": "timestamp",
"dataType": "LONG",
"timeFormat": "EPOCH",
"timeType": "MILLISECONDS"
}
}
}
Jackie
07/03/2020, 2:38 AMutc_date
which uses DAYS
granularitytimestamp
column{
"schemaName": "metrics",
"dimensionFieldSpecs": [
{
"name": "platform_id",
"dataType": "LONG"
},
{
"name": "customer_id",
"dataType": "LONG"
},
{
"name": "account_id",
"dataType": "LONG"
},
{
"name": "campaign_id",
"dataType": "LONG"
},
{
"name": "promotion_id",
"dataType": "LONG"
},
{
"name": "content_id",
"dataType": "LONG"
}
],
"metricFieldSpecs": [
{
"name": "insertions",
"dataType": "LONG"
},
{
"name": "impressions",
"dataType": "LONG"
},
{
"name": "clicks",
"dataType": "LONG"
},
{
"name": "cost_usd_micros",
"dataType": "LONG"
}
],
"dateTimeFieldSpecs": [
{
"name": "utc_date",
"dataType": "LONG",
"format": "1:MILLISECONDS:EPOCH",
"granularity": "1:DAYS"
}
]
}
Dan Hill
07/03/2020, 2:43 AMJackie
07/03/2020, 2:48 AMtableConfig.segmentsConfig.timeColumnName
to create the extra filter and merge the resulttimestamp
is not in the star-tree split order, so your query won't utilize the star-tree indextimestamp
to utc_date
, it is part of the split order, so your query can utilize the star-treeDan Hill
07/03/2020, 3:08 AMutc_date
in the star tree index? Should I remove it?segmentsConfig
?
"segmentsConfig" : {
...
"timeColumnName": "utc_date",
"timeType": "MILLISECONDS",
...
},