Apache Pinot #general

Jackie

02/20/2020, 1:19 AM

But smaller segments can hurt performance in general, as you need to process and merge more segments

Hemavathi

02/20/2020, 9:11 AM

We have a column which contains “json file content” as its value. It seems the Pinot string type has a length limit, so when we write this value to the string column, it gets truncated. Is there any configuration available to resolve this?

Sidd

02/20/2020, 11:12 AM

The default is 512 chars. Also, I am not sure json file content qualifies as a string column value. It is more blob/clob territory

Adli

02/20/2020, 11:22 AM

Hi @here, could anyone tell me the current status of this feature? https://cwiki.apache.org/confluence/display/PINOT/%5BProposal%5D+Pinot+upsert+design+doc

Kishore G

02/20/2020, 4:09 PM

@User @User can give you more info on that. That feature is available on a branch.

Xiang Fu

02/20/2020, 11:01 PM

@User do we have any tooling for this? ^^

Subbu Subramaniam

02/21/2020, 12:41 AM

no tooling, you can update zk metadata. Tooling for this (and other stuff in realtime) is an ask for contribution, I believe I have an issue filed somewhere

Hemavathi

02/21/2020, 6:35 AM

@User Thanks for the details. The schema allows changing the size of the string column; we will update once we have loaded the data.
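
For reference, a minimal sketch of what raising the string length limit in the schema might look like. It assumes the string field spec accepts a maxLength property that overrides the 512-char default mentioned above; the schema name, column name, and the value 10000 are made up for illustration. Python is used only to build and print the JSON.

```python
import json

# Hypothetical schema fragment: the names and the maxLength value are illustrative.
schema = {
    "schemaName": "myTable",
    "dimensionFieldSpecs": [
        {
            "name": "jsonContent",
            "dataType": "STRING",
            # Assumption: maxLength overrides the 512-char default for this column.
            "maxLength": 10000,
        }
    ],
}

schema_json = json.dumps(schema, indent=2)
print(schema_json)
```

Whether very large JSON payloads belong in a string column at all is a separate question, as noted earlier in the thread.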

Hemavathi

02/21/2020, 6:56 AM

We have configured “Column1” under both “noDictionaryColumns” and “bloomFilterColumns” and tried to filter a query on “Column1”, but got a timeout error. Our understanding is that for a high-cardinality string column that needs to be filtered by value, the combination of “noDictionaryColumns” and “bloomFilterColumns” is preferred, but it didn’t work as expected. If we instead keep “Column1” dictionary-encoded and apply “bloomFilterColumns”, we cannot see any performance difference. We need your help to understand this better.

"tableIndexConfig" : {
  "noDictionaryColumns": [ "Column1" ],
  "bloomFilterColumns" : ["Column1"],
  "loadMode" : "MMAP",
  "lazyLoad" : "false"
}

select * from table where Column1 = 'XXX'

Seunghyun

02/21/2020, 7:15 AM

@User try adding an inverted index on Column1 and try out the same query
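
A rough sketch of the suggested change, based on the index config posted in the question. It assumes Column1 is removed from noDictionaryColumns so that it is dictionary-encoded, and listed under invertedIndexColumns instead; dropping bloomFilterColumns here is only a simplification for the example. Python is used just to build and print the JSON.

```python
import json

# Index config from the question, with Column1 moved out of
# noDictionaryColumns and into invertedIndexColumns.
# Assumption: the inverted index is built on a dictionary-encoded column.
table_index_config = {
    "tableIndexConfig": {
        "invertedIndexColumns": ["Column1"],
        "loadMode": "MMAP",
        "lazyLoad": "false",
    }
}

config_json = json.dumps(table_index_config, indent=2)
print(config_json)
```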

Hemavathi

02/21/2020, 7:16 AM

Ya, it works fine with an inverted index. We need to understand the exact use case for the bloom filter

Seunghyun

02/21/2020, 7:17 AM

if your cardinality is very high, bloom filter may not perform well

Seunghyun

02/21/2020, 7:17 AM

what’s the cardinality of the column per segment?

Seunghyun

02/21/2020, 7:18 AM

we currently put a 1MB limit on bloom filter size per segment

Seunghyun

02/21/2020, 7:20 AM

so it will work well up to ~1M cardinality

Seunghyun

02/21/2020, 7:22 AM

https://krisives.github.io/bloom-calculator/ set error = 0.05, and play with count
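
The same estimate can be done without the calculator. The sketch below uses the standard Bloom filter sizing formula, n = m * (ln 2)^2 / -ln p, where m is the filter size in bits and p is the false positive rate. Treating the 1MB limit as 1 MiB is an assumption, and the function name is made up for this example.

```python
import math


def max_items(size_bytes: int, false_positive_rate: float) -> int:
    """Largest item count an optimally tuned Bloom filter of size_bytes
    can hold at the given false positive rate: n = m * (ln 2)^2 / -ln p."""
    bits = size_bytes * 8
    return int(bits * math.log(2) ** 2 / -math.log(false_positive_rate))


ONE_MB = 1024 * 1024  # assumption: the 1MB limit means 1 MiB

capacity = max_items(ONE_MB, 0.05)
print(capacity)  # roughly 1.3 million
```

With error = 0.05 this lands a little above one million distinct values, in line with the ~1M cardinality figure above.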

Hemavathi

02/21/2020, 7:23 AM

in our case the segment size is around 240 MB and i think the bloomfilter size may exceed 1 MB

Seunghyun

02/21/2020, 7:24 AM

so if the cardinality is too high, our current implementation makes bloom filter not useful

Seunghyun

02/21/2020, 7:24 AM

this is because our bloom filter implementation is on-heap based
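
To illustrate why a fixed size cap makes the filter less useful as cardinality grows, here is an idealised sketch. It computes the best-case false positive rate of a generic Bloom filter held at 1 MiB (an assumption about what the 1MB limit means) using p = exp(-(m / n) * (ln 2)^2). It does not model how Pinot itself behaves once the limit is exceeded.

```python
import math


def best_case_fpp(size_bytes: int, num_items: int) -> float:
    """False positive rate of a Bloom filter with the optimal number of
    hash functions: p = exp(-(m / n) * (ln 2)^2)."""
    bits = size_bytes * 8
    return math.exp(-(bits / num_items) * math.log(2) ** 2)


ONE_MB = 1024 * 1024  # assumption: the 1MB limit means 1 MiB

for n in (1_000_000, 5_000_000, 10_000_000):
    print(n, round(best_case_fpp(ONE_MB, n), 3))
```

At around a million distinct values the rate stays below 5%, but at ten million it is well over half, so most lookups would fall through to a scan anyway.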

Hemavathi

02/21/2020, 7:24 AM

ok got it, will try with some small size

Hemavathi

02/21/2020, 7:24 AM

thank you

Seunghyun

02/21/2020, 7:25 AM

yeah, we can improve the bloom filter feature by making the size limit configurable, but that should come along with an off-heap implementation

Sidd

02/21/2020, 12:09 PM

@here, as part of working on PR (https://github.com/apache/incubator-pinot/pull/5074), I hit a bug where adding a new column and then enabling inverted index with V1 segment format is not supported on the segment reload path. We hit NPE. I don't know if this is intentionally not supported. In my PR I was adding tests for supporting text index reload for both V1 and V3 and that's when I discovered this.

Sidd

02/21/2020, 12:10 PM

I have put a fix here -- https://github.com/apache/incubator-pinot/pull/5087

👍 1

Seunghyun

02/21/2020, 6:14 PM

we changed our distribution to shade everything?

Kishore G

02/21/2020, 10:02 PM

yes, is that failing?

Xiang Fu

02/21/2020, 10:04 PM

I think we also changed quick-start-offline.sh / quick-start-batch.sh

Seunghyun

02/21/2020, 11:16 PM

let me retry with a clean checkout

Seunghyun

02/21/2020, 11:16 PM

maybe old file didn’t get deleted

Seunghyun

02/21/2020, 11:40 PM


~/workspace/pinot/pinot-distribution/target/apache-pinot-incubating-0.3.0-SNAPSHOT-bin/apache-pinot-incubating-0.3.0-SNAPSHOT-bin/bin master* 1m 10s
❯ ./quick-start-batch.sh
***** Starting Zookeeper, controller, broker and server *****
Executing command: StartZookeeper -zkPort 2123 -dataDir /var/folders/1s/11z0n1j9057dk1nhgjdgfcp0000mp7/T//PinotAdmin/zkData
Start zookeeper at localhost:2123 in thread main
Executing command: StartController -clusterName QuickStartCluster -controllerHost 172.25.113.39 -controllerPort 9000 -dataDir /var/folders/1s/11z0n1j9057dk1nhgjdgfcp0000mp7/T//PinotController -zkAddress localhost:2123
Invalid instance setup, missing znode path: /QuickStartCluster/CONFIGS/PARTICIPANT/Controller_172.25.113.39_9000
Invalid instance setup, missing znode path: /QuickStartCluster/INSTANCES/Controller_172.25.113.39_9000/MESSAGES
Invalid instance setup, missing znode path: /QuickStartCluster/INSTANCES/Controller_172.25.113.39_9000/CURRENTSTATES
Invalid instance setup, missing znode path: /QuickStartCluster/INSTANCES/Controller_172.25.113.39_9000/STATUSUPDATES
Invalid instance setup, missing znode path: /QuickStartCluster/INSTANCES/Controller_172.25.113.39_9000/ERRORS
Feb 21, 2020 3:37:50 PM org.glassfish.grizzly.http.server.NetworkListener start
INFO: Started listener bound to [0.0.0.0:9000]
Feb 21, 2020 3:37:50 PM org.glassfish.grizzly.http.server.HttpServer start
INFO: [HttpServer] Started.
Executing command: StartBroker -brokerHost null -brokerPort 8000 -zkAddress localhost:2123
Feb 21, 2020 3:37:58 PM org.glassfish.grizzly.http.server.NetworkListener start
INFO: Started listener bound to [0.0.0.0:8000]
Feb 21, 2020 3:37:58 PM org.glassfish.grizzly.http.server.HttpServer start
INFO: [HttpServer-1] Started.
Invalid instance setup, missing znode path: /QuickStartCluster/CONFIGS/PARTICIPANT/Broker_172.25.113.39_8000
Invalid instance setup, missing znode path: /QuickStartCluster/INSTANCES/Broker_172.25.113.39_8000/MESSAGES
Invalid instance setup, missing znode path: /QuickStartCluster/INSTANCES/Broker_172.25.113.39_8000/CURRENTSTATES
Invalid instance setup, missing znode path: /QuickStartCluster/INSTANCES/Broker_172.25.113.39_8000/STATUSUPDATES
Invalid instance setup, missing znode path: /QuickStartCluster/INSTANCES/Broker_172.25.113.39_8000/ERRORS
Executing command: StartServer -clusterName QuickStartCluster -serverHost 172.25.113.39 -serverPort 7000 -serverAdminPort 7500 -dataDir /tmp/1582328263750/PinotServerData0 -segmentDir /tmp/1582328263750/PinotServerSegment0 -zkAddress localhost:2123
Invalid instance setup, missing znode path: /QuickStartCluster/CONFIGS/PARTICIPANT/Server_172.25.113.39_7000
Invalid instance setup, missing znode path: /QuickStartCluster/INSTANCES/Server_172.25.113.39_7000/MESSAGES
Invalid instance setup, missing znode path: /QuickStartCluster/INSTANCES/Server_172.25.113.39_7000/CURRENTSTATES
Invalid instance setup, missing znode path: /QuickStartCluster/INSTANCES/Server_172.25.113.39_7000/STATUSUPDATES
Invalid instance setup, missing znode path: /QuickStartCluster/INSTANCES/Server_172.25.113.39_7000/ERRORS
Feb 21, 2020 3:38:04 PM org.glassfish.grizzly.http.server.NetworkListener start
INFO: Started listener bound to [0.0.0.0:7500]
Feb 21, 2020 3:38:04 PM org.glassfish.grizzly.http.server.HttpServer start
INFO: [HttpServer-2] Started.
***** Adding baseballStats table *****
Executing command: AddTable -tableConfigFile /Users/snlee/workspace/pinot/pinot-distribution/target/apache-pinot-incubating-0.3.0-SNAPSHOT-bin/apache-pinot-incubating-0.3.0-SNAPSHOT-bin/bin/quickStartData1582328263666/baseballStats_offline_table_config.json -schemaFile /Users/snlee/workspace/pinot/pinot-distribution/target/apache-pinot-incubating-0.3.0-SNAPSHOT-bin/apache-pinot-incubating-0.3.0-SNAPSHOT-bin/bin/quickStartData1582328263666/baseballStats_schema.json -controllerHost 172.25.113.39 -controllerPort 9000 -exec
{"status":"Table baseballStats_OFFLINE succesfully added"}
***** Launch data ingestion job to build index segment for baseballStats and push to controller *****
Exception in thread "main" java.lang.IllegalStateException: PinotFS for scheme: jar has not been initialized
	at shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:518)
	at org.apache.pinot.spi.filesystem.PinotFSFactory.create(PinotFSFactory.java:78)
	at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.run(SegmentGenerationJobRunner.java:115)
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:96)
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.runIngestionJob(IngestionJobLauncher.java:77)
	at org.apache.pinot.tools.admin.command.QuickstartRunner.launchDataIngestionJob(QuickstartRunner.java:183)
	at org.apache.pinot.tools.Quickstart.execute(Quickstart.java:154)
	at org.apache.pinot.tools.Quickstart.main(Quickstart.java:209)