Luis Fernandez
05/31/2022, 2:33 PM
Raluca Lazar
06/01/2022, 2:05 PM
I'm trying to toggle peerSegmentDownloadScheme on a realtime table. If I do this manually by either adding or removing the line in the table config, it works, but doing it via an environment variable does not (passing an empty env var vs. passing this string to be replaced: , "peerSegmentDownloadScheme": "http"
). My question is: is there any other way to disable the peer segment download scheme besides removing the setting entirely? I tried "peerSegmentDownloadScheme": ""
and it failed with this message:
{
"code": 400,
"error": "Invalid value '' for peerSegmentDownloadScheme. Must be one of http or https"
}
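One possible workaround, sketched below under assumptions: rather than templating an empty value into the config (which trips the validation above), render the config in a small script and drop the key entirely when the env var is unset. The env var name and the `segmentsConfig` location of the setting are assumptions, not confirmed from the thread.

```python
import json
import os

# Hypothetical sketch: Pinot rejects "" for peerSegmentDownloadScheme,
# so remove the key from the rendered table config when the env var is
# unset, instead of substituting an empty string.
def render_table_config(template):
    config = json.loads(json.dumps(template))  # deep copy
    scheme = os.environ.get("PEER_SEGMENT_DOWNLOAD_SCHEME", "")
    segments = config.setdefault("segmentsConfig", {})
    if scheme in ("http", "https"):
        segments["peerSegmentDownloadScheme"] = scheme
    else:
        # Removing the key entirely is what disables the feature;
        # an empty string fails validation as shown above.
        segments.pop("peerSegmentDownloadScheme", None)
    return config
```

The rendered dict can then be POSTed to the controller as the table config.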
Luis Fernandez
06/01/2022, 2:16 PM
SELECT product_id, SUM(impression_count) as impression_count, SUM(click_count) as click_count, SUM(cost) as spent_total FROM metrics
WHERE user_id = xx AND serve_time BETWEEN 1641013200 AND 1654092017
GROUP BY product_id
LIMIT 100000
this is an example of a query we are running
"numServersQueried": 2,
"numServersResponded": 2,
"numSegmentsQueried": 1317,
"numSegmentsProcessed": 168,
"numSegmentsMatched": 117,
"numConsumingSegmentsQueried": 0,
"numDocsScanned": 69212,
"numEntriesScannedInFilter": 1165155303,
"numEntriesScannedPostFilter": 415272,
"numGroupsLimitReached": false,
"totalDocs": 10362679599,
"timeUsedMs": 4623,
these are some of the stats that come back. Our data resolution is hourly for this data. Do you all have any idea how to make a query like this perform better?
We have an idea of changing records older than a certain period from hourly to daily resolution, so that the data is compressed even further, but wanted to ask if there are any existing methods we could use for this and whether that approach makes sense to you all.
Diogo Baeder
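The hourly-to-daily rollup idea above is essentially a merge/rollup pass; a minimal sketch of the aggregation, using the column names from the query (the record layout itself is an assumption):

```python
from collections import defaultdict

SECONDS_PER_DAY = 86400

# Sketch of the proposed rollup: collapse hourly rows into one row per
# (user_id, product_id, day), summing the metric columns.
def rollup_to_daily(rows):
    daily = defaultdict(lambda: {"impression_count": 0, "click_count": 0, "cost": 0})
    for r in rows:
        day = r["serve_time"] // SECONDS_PER_DAY * SECONDS_PER_DAY
        key = (r["user_id"], r["product_id"], day)
        agg = daily[key]
        agg["impression_count"] += r["impression_count"]
        agg["click_count"] += r["click_count"]
        agg["cost"] += r["cost"]
    return [
        {"user_id": u, "product_id": p, "serve_time": d, **agg}
        for (u, p, d), agg in daily.items()
    ]
```

Note that Pinot's built-in MergeRollupTask can perform this kind of rollup on offline segments without custom code, which may be worth evaluating before writing anything bespoke.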
06/01/2022, 4:39 PM
Bruno Brandão
06/01/2022, 9:36 PM
pinot.service.role=CONTROLLER
controller.port=9001
controller.zk.str=localhost:2181
controller.access.protocols.http.port=9001
pinot.cluster.name=MyClusterName
controller.vip.host=localhost
controller.vip.port=9001
controller.data.dir=/tmp/pinot/data/controller
controller.helix.cluster.name=MyClusterName
pinot.set.instance.id.to.hostname=true
controller.admin.access.control.principals=admin,user
controller.admin.access.control.principals.user.password=admin
controller.admin.access.control.principals.user.permissions=READ
controller.admin.access.control.principals.admin.password=admin
controller.admin.access.control.factory.class=org.apache.pinot.controller.api.access.BasicAuthAccessControlFactory
I'm executing with Docker Compose, and the following call is the one I use to initiate the controller:
StartController -configFileName /tmp/conf/pinot-controller.conf
The following error appears:
2022/06/01 15:48:50.463 INFO [StartControllerCommand] [main] Executing command: StartController -configFileName /tmp/conf/pinot-controller.conf
pinot-controller | 2022/06/01 15:48:50.541 ERROR [StartControllerCommand] [main] Caught exception while starting controller, exiting.
pinot-controller | java.lang.NullPointerException: null
pinot-controller | at org.apache.pinot.tools.admin.command.StartControllerCommand.getControllerConf(StartControllerCommand.java:207) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
pinot-controller | at org.apache.pinot.tools.admin.command.StartControllerCommand.execute(StartControllerCommand.java:183) [pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
pinot-controller | at org.apache.pinot.tools.Command.call(Command.java:33) [pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
pinot-controller | at org.apache.pinot.tools.Command.call(Command.java:29) [pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
pinot-controller | at picocli.CommandLine.executeUserObject(CommandLine.java:1953) [pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
pinot-controller | at picocli.CommandLine.access$1300(CommandLine.java:145) [pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
pinot-controller | at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352) [pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
pinot-controller | at picocli.CommandLine$RunLast.handle(CommandLine.java:2346) [pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
pinot-controller | at picocli.CommandLine$RunLast.handle(CommandLine.java:2311) [pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
pinot-controller | at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179) [pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
pinot-controller | at picocli.CommandLine.execute(CommandLine.java:2078) [pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
pinot-controller | at org.apache.pinot.tools.admin.PinotAdministrator.execute(PinotAdministrator.java:161)
[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
pinot-controller | at org.apache.pinot.tools.admin.PinotAdministrator.main(PinotAdministrator.java:192) [pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
pinot-controller exited with code 255
Does that error happen due to some account configuration problem, or is it some sort of internal bug in Apache Pinot?
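The NullPointerException in getControllerConf is consistent with the -configFileName path not resolving inside the container. A hypothetical pre-flight check like the one below can rule out a bad volume mount or a malformed properties file before suspecting Pinot itself; which keys are strictly required is an assumption here.

```python
from pathlib import Path

# Keys assumed to be required for a standalone controller; adjust as needed.
REQUIRED = ("controller.zk.str", "controller.port")

def check_controller_conf(path):
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"{path} not found -- check the Docker volume mount")
    conf = {}
    for line in p.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            conf[key.strip()] = value.strip()
    missing = [k for k in REQUIRED if k not in conf]
    if missing:
        raise ValueError(f"missing required properties: {missing}")
    return conf
```

Running this against /tmp/conf/pinot-controller.conf inside the container will quickly show whether the file is actually visible at that path.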
Sumit Lakra
06/02/2022, 11:04 AM
Stuart Millholland
06/02/2022, 4:04 PM
Abhijeet Kushe
06/03/2022, 6:36 PM
Priyank Bagrecha
06/03/2022, 10:54 PM
>>> import trino
>>> conn = trino.dbapi.connect(host='<redacted>', port=8443, catalog='pinot', schema='default', http_scheme='https', auth=trino.auth.BasicAuthentication("xxx", "yyyy"))
>>> cur = conn.cursor()
>>> cur.execute('SELECT * FROM mytable LIMIT 10')
<trino.client.TrinoResult object at 0x10428d160>
>>> rows = cur.fetchall()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.8/site-packages/trino/dbapi.py", line 558, in fetchall
return list(self.genall())
File "/usr/local/lib/python3.8/site-packages/trino/client.py", line 509, in __iter__
rows = self._query.fetch()
File "/usr/local/lib/python3.8/site-packages/trino/client.py", line 677, in fetch
status = self._request.process(response)
File "/usr/local/lib/python3.8/site-packages/trino/client.py", line 440, in process
raise self._process_error(response["error"], response.get("id"))
trino.exceptions.TrinoQueryError: TrinoQueryError(type=INTERNAL_ERROR, name=GENERIC_INTERNAL_ERROR, message="Failed communicating with server: <http://pinot-broker-1.pinot-broker-headless.pinot-dev-ns.svc.cluster.local:8099/debug/routingTable/mytable>", query_id=20220603_211510_00025_9srer)
I am using the external IP of the load balancer, i.e. service/pinot-controller-external, with port 9000 for pinot.controller-urls. If it helps, I am using the community-provided Helm chart to stand up the Pinot infrastructure on AWS EKS.
Alice
06/05/2022, 12:50 AM
Alice
06/05/2022, 3:50 AM
Ali Atıl
06/06/2022, 8:03 AM
Tommaso Peresson
06/06/2022, 10:23 AM
{
"OFFLINE": {
"tableName": "DailyUniqHll_OFFLINE",
"tableType": "OFFLINE",
"segmentsConfig": {
"timeType": "DAYS",
"retentionTimeUnit": "DAYS",
"retentionTimeValue": "365",
"replication": "1",
"timeColumnName": "partition",
"allowNullTimeValue": false
},
"tenants": {
"broker": "DefaultTenant",
"server": "DefaultTenant"
},
"tableIndexConfig": {
"enableDefaultStarTree": false,
"starTreeIndexConfigs": [
{
"dimensionsSplitOrder": [
"partition",
"fields.1",
"fields.2",
"fields.3",
"fields.4",
"fields.5",
"fields.6",
"fields.7",
"fields.8",
"fields.9"
],
"functionColumnPairs": [
"SUM__counters.c",
"DISTINCTCOUNTHLL__hllState"
],
"maxLeafRecords": 1000
}
],
"enableDynamicStarTreeCreation": true,
"aggregateMetrics": false,
"nullHandlingEnabled": false,
"rangeIndexVersion": 2,
"autoGeneratedInvertedIndex": false,
"createInvertedIndexDuringSegmentGeneration": false
},
"metadata": {},
"ingestionConfig": {
"batchIngestionConfig": {
"segmentIngestionType": "APPEND",
"segmentIngestionFrequency": "DAILY"
},
"complexTypeConfig": {
"fieldsToUnnest": [
"fields",
"counters"
],
"delimiter": ".",
"collectionNotUnnestedToJson": "NON_PRIMITIVE"
}
},
"isDimTable": false
}
}
Schema:
{
"schemaName": "ViewElementDailyUniqHll",
"dimensionFieldSpecs": [
{
"name": "fields.1",
"dataType": "STRING"
},
{
"name": "fields.2",
"dataType": "STRING"
},
{
"name": "fields.3",
"dataType": "STRING"
},
{
"name": "fields.4",
"dataType": "STRING"
},
{
"name": "fields.5",
"dataType": "STRING"
},
{
"name": "fields.6",
"dataType": "STRING"
},
{
"name": "fields.7",
"dataType": "STRING"
},
{
"name": "fields.8",
"dataType": "STRING"
},
{
"name": "fields.9",
"dataType": "STRING"
},
{
"name": "cubeName",
"dataType": "STRING"
},
{
"name": "list",
"dataType": "LONG",
"singleValueField": false
},
{
"name": "hllState",
"dataType": "BYTES"
},
{
"name": "counters.c",
"dataType": "INT"
}
],
"dateTimeFieldSpecs": [
{
"name": "partition",
"dataType": "STRING",
"format": "1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd",
"granularity": "1:DAYS"
}
]
}
When I ingest some data I get a ~10x size increase because of DISTINCTCOUNTHLL__hllState in the star-tree index. Is this expected? Is there something misconfigured?
Luis Fernandez
06/06/2022, 1:25 PM
{"numPartitions":8,"partitions":[0,1,2,3,4,5,6,7]
however in our prod system, which has a hybrid setup, I always see one number in the partitions column:
{"numPartitions":8,"partitions":[1]
is this something I should be concerned about?
Mayank
Varagini Karthik
06/06/2022, 4:02 PM
sudo docker run --rm -ti \
--network=pinot-demo_default \
-v /home/XXXX/dna/pinot/lookup2/pinot-quick-start:/home/XXXX/dna/pinot/lookup2/pinot-quick-start \
--name pinot-batch-table-creation \
apachepinot/pinot:latest AddTable \
-schemaFile /home/XXXX/dna/pinot/lookup2/pinot-quick-start/orders-schema.json \
-tableConfigFile /home/XXXX/dna/pinot/lookup2/pinot-quick-start/orders-table-offline.json \
-controllerHost manual-pinot-controller \
-controllerPort 9000 -exec
sudo docker run --rm -ti \
--network=pinot-demo_default \
-v /home/XXXX/dna/pinot/lookup/pinot-quick-start:/home/XXXX/dna/pinot/lookup/pinot-quick-start \
--name pinot-data-ingestion-job \
apachepinot/pinot:latest LaunchDataIngestionJob \
-jobSpecFile /home/XXXX/dna/pinot/lookup/pinot-quick-start/docker-job-spec.yml
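Between the two docker commands above, it can help to confirm the AddTable call actually registered the table before launching the ingestion job. A sketch against the controller's standard GET /tables endpoint; the host and port come from the -controllerHost and -controllerPort flags in the command:

```python
import json
from urllib.request import urlopen

# Parse the {"tables": [...]} payload returned by GET /tables.
def parse_tables(payload):
    return json.loads(payload).get("tables", [])

# Check whether a table name is registered with the controller.
def table_exists(name, controller="http://manual-pinot-controller:9000"):
    with urlopen(f"{controller}/tables") as resp:
        return name in parse_tables(resp.read())
```

If table_exists("orders") returns False after AddTable, the ingestion job will have nothing to push to, so it is worth checking the AddTable output first.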
Mathieu Druart
06/06/2022, 4:30 PM
select distinct myMultiValuedColumn from MyTable where otherColumn in ('MY_VALUE') limit 1000
I have this error :
"message": "QueryExecutionError:\njava.lang.UnsupportedOperationException\n\tat org.apache.pinot.segment.spi.index.reader.ForwardIndexReader.readDictIds(ForwardIndexReader.java:84)\n\tat org.apache.pinot.core.common.DataFetcher$ColumnValueReader.readDictIds(DataFetcher.java:418)\n\tat org.apache.pinot.core.common.DataFetcher.fetchDictIds(DataFetcher.java:89)\n\tat org.apache.pinot.core.common.DataBlockCache.getDictIdsForSVColumn(DataBlockCache.java:109)",
"errorCode": 200
If I remove the distinct or the where clause, I have no issue. Am I missing something? Thank you!
Alice
06/07/2022, 2:26 PM
Diogo Baeder
06/07/2022, 4:07 PM
Priyank Bagrecha
06/07/2022, 6:17 PM
Prashant Pandey
06/08/2022, 5:04 AM
Sowmya Gowda
06/08/2022, 6:57 AM
{
"OFFLINE": {
"tableName": "test_transcript_OFFLINE",
"tableType": "OFFLINE",
"segmentsConfig": {
"schemaName": "test_transcript",
"replication": "1",
"timeColumnName": "timestamp",
"segmentPushFrequency": "HOURLY",
"segmentPushType": "APPEND",
"replicasPerPartition": "1"
},
"tenants": {
"broker": "DefaultTenant",
"server": "DefaultTenant"
},
"tableIndexConfig": {
"invertedIndexColumns": [],
"noDictionaryColumns": [],
"rangeIndexColumns": [],
"rangeIndexVersion": 2,
"autoGeneratedInvertedIndex": false,
"createInvertedIndexDuringSegmentGeneration": false,
"sortedColumn": [],
"bloomFilterColumns": [],
"loadMode": "MMAP",
"onHeapDictionaryColumns": [],
"varLengthDictionaryColumns": [],
"enableDefaultStarTree": false,
"enableDynamicStarTreeCreation": false,
"aggregateMetrics": false,
"nullHandlingEnabled": false
},
"metadata": {},
"quota": {},
"task": {
"taskTypeConfigsMap": {
"SegmentGenerationAndPushTask": {
"schedule": "/5 * * * * ?",
"tableMaxNumTasks": "10"
}
}
},
"routing": {},
"query": {},
"ingestionConfig": {
"batchIngestionConfig": {
"batchConfigMaps": [
{
"input.fs.className": "org.apache.pinot.plugin.filesystem.S3PinotFS",
"input.fs.prop.region": "us-east-1",
"input.fs.prop.secretKey": "*****",
"input.fs.prop.accessKey": "*****",
"inputDirURI": "<s3://pp-airflow-qa/dremio_test_files/jsonfiles/>",
"includeFileNamePattern": "glob:**/*.json",
"excludeFileNamePattern": "glob:**/*.tmp",
"inputFormat": "json"
}
],
"segmentIngestionType": "APPEND",
"segmentIngestionFrequency": "HOURLY"
}
},
"isDimTable": false
}
}
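With a task config like the one above, it can be useful to trigger SegmentGenerationAndPushTask once by hand, rather than waiting on the Quartz schedule, to verify the S3 batch settings. A sketch using the controller's POST /tasks/schedule endpoint; the controller URL here is an assumption:

```python
from urllib.request import Request, urlopen

# Build the standard controller URL for scheduling a task type on a table.
def build_schedule_url(controller, task_type, table):
    return f"{controller}/tasks/schedule?taskType={task_type}&tableName={table}"

# Ask the controller to schedule the task immediately.
def schedule_task(controller="http://localhost:9000",
                  task_type="SegmentGenerationAndPushTask",
                  table="test_transcript_OFFLINE"):
    req = Request(build_schedule_url(controller, task_type, table), method="POST")
    with urlopen(req) as resp:
        return resp.read()
```

The controller's /tasks endpoints can then be used to inspect the generated task's state if nothing shows up in the table.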
Kevin Liu
06/08/2022, 8:09 AM
Luis Fernandez
06/08/2022, 5:21 PM
abhinav wagle
06/08/2022, 6:48 PM
I run a query from the Query Console and check the logs on the Pinot broker pod, where I see the following log. Is there a way to identify the source_ip of which user/host issued the query? My goal is to have a trace of logs on the broker that can provide info on who is querying Pinot.
requestId=130,table=<redacted>,timeMs=23,docs=72/108615312,entries=2741105/792,segments(queried/processed/matched/consuming/unavailable):256/253/8/32/0,consumingFreshnessTimeMs=1654707653215,servers=6/6,groupLimitReached=false,brokerReduceTimeMs=0,exceptions=0,serverStats=(Server=SubmitDelayMs,ResponseDelayMs,ResponseSize,DeserializationTimeMs,RequestSentDelayMs);pinot-server-2_R=0,4,1817,0,1;pinot-server-5_R=0,5,1817,1,1;pinot-server-1_R=0,20,1820,0,1;pinot-server-4_R=1,21,1820,0,1;pinot-server-0_R=1,4,1818,0,1;pinot-server-3_R=1,5,1817,0,1,offlineThreadCpuTimeNs(total/thread/sysActivity/resSer):0/0/0/0,realtimeThreadCpuTimeNs(total/thread/sysActivity/resSer):0/0/0/0,query=<redacted>
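Worth noting that the sample request log line above carries no caller IP, so user/host identity would have to come from somewhere upstream (an ingress or proxy access log, or broker auth if enabled). If the goal is feeding these lines into a log pipeline anyway, a sketch of pulling a few numeric fields out of that format:

```python
import re

# Extract a few numeric fields from a broker request log line shaped like
# the sample above. The field list is illustrative, not exhaustive; the
# pattern is case-sensitive so e.g. brokerReduceTimeMs is not captured.
FIELDS = re.compile(r"(requestId|timeMs|exceptions)=(\d+)")

def parse_broker_log(line):
    return {key: int(value) for key, value in FIELDS.findall(line)}
```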
Luis Fernandez
06/08/2022, 7:02 PM
Luis Fernandez
06/08/2022, 7:39 PM
SELECT product_id, SUM(impression_count) as impression_count, SUM(click_count) as click_count, SUM(cost) as spent_total FROM metrics WHERE user_id = xxx AND serve_time BETWEEN 1651363200 AND 1654012799 GROUP BY product_id LIMIT 6000
production metadata response:
"numServersQueried": 4,
"numServersResponded": 4,
"numSegmentsQueried": 97,
"numSegmentsProcessed": 31,
"numSegmentsMatched": 31,
"numConsumingSegmentsQueried": 1,
"numDocsScanned": 15109,
"numEntriesScannedInFilter": 0,
"numEntriesScannedPostFilter": 60436,
"numGroupsLimitReached": false,
"totalDocs": 493642793,
"timeUsedMs": 32,
"offlineThreadCpuTimeNs": 0,
"realtimeThreadCpuTimeNs": 0,
"offlineSystemActivitiesCpuTimeNs": 0,
"realtimeSystemActivitiesCpuTimeNs": 0,
"offlineResponseSerializationCpuTimeNs": 0,
"realtimeResponseSerializationCpuTimeNs": 0,
"offlineTotalCpuTimeNs": 0,
"realtimeTotalCpuTimeNs": 0,
"segmentStatistics": [],
"traceInfo": {},
"minConsumingFreshnessTimeMs": 1654715649414,
"numRowsResultSet": 9708
dev metadata response:
"exceptions": [],
"numServersQueried": 4,
"numServersResponded": 4,
"numSegmentsQueried": 11703,
"numSegmentsProcessed": 31,
"numSegmentsMatched": 31,
"numConsumingSegmentsQueried": 1,
"numDocsScanned": 15117,
"numEntriesScannedInFilter": 0,
"numEntriesScannedPostFilter": 60468,
"numGroupsLimitReached": false,
"totalDocs": 51283295726,
"timeUsedMs": 580,
"offlineThreadCpuTimeNs": 0,
"realtimeThreadCpuTimeNs": 0,
"offlineSystemActivitiesCpuTimeNs": 0,
"realtimeSystemActivitiesCpuTimeNs": 0,
"offlineResponseSerializationCpuTimeNs": 0,
"realtimeResponseSerializationCpuTimeNs": 0,
"offlineTotalCpuTimeNs": 0,
"realtimeTotalCpuTimeNs": 0,
"segmentStatistics": [],
"traceInfo": {},
"minConsumingFreshnessTimeMs": 1654716958681,
"numRowsResultSet": 9708
amount of segments in prod: 1600
amount of segments in dev: 13000
I guess my question is: I see numSegmentsQueried being way higher in dev, and I'm wondering why, and whether that's the reason the query is simply slower there. In dev it's almost equal to the total number of segments in the cluster, while prod only queries a tiny portion. Do you have an idea as to what may be happening?
Alice
06/09/2022, 12:16 AM
ingestionConfig.batchIngestionConfig.segmentIngestionFrequency
in the offline table”.
I'm wondering how the time boundary is computed for an offline table without this config set, if RealtimeToOfflineSegmentsTask is configured in the corresponding realtime table?
Alice
06/09/2022, 3:05 AM
Luis Fernandez
06/09/2022, 4:05 PM