
Alexander Vivas

02/11/2021, 3:04 PM
Guys, I merged the latest changes into our fork, rebuilt the Docker image, and redeployed. Now when we try to create a table we see this error in the controller:
ClassNotFoundException: org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory
Any suggestions?

Daniel Lavoie

02/11/2021, 3:05 PM
Are you running a customized pluginsDir value?

Alexander Vivas

02/11/2021, 3:07 PM
Nope, actually we're using almost everything as it was in the first place
pluginsDir: /opt/pinot/plugins

Daniel Lavoie

02/11/2021, 3:08 PM
What about plugins.include?

Alexander Vivas

02/11/2021, 3:08 PM
This is the portion of values.yaml used to deploy in a Kubernetes environment using Helm:

Daniel Lavoie

02/11/2021, 3:12 PM
Can you list the contents of /opt/pinot/plugins within your forked image?
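For instance, reusing the same kubectl exec pattern (pod name is a placeholder):

kubectl exec <controller-pod-name> -- ls /opt/pinot/plugins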
Also, can you extract your controller logs using this command?
kubectl exec <controller-pod-name> -- cat pinotController.log > controller.log
The startup logs list all plugins being loaded

Alexander Vivas

02/11/2021, 3:21 PM
This is what I see in /opt/pinot/plugins:

Daniel Lavoie

02/11/2021, 3:22 PM
If you can share your controller logs we’ll have a better understanding of what is going on

Alexander Vivas

02/11/2021, 3:40 PM

Daniel Lavoie

02/11/2021, 3:42 PM
Somehow, you have provided pinot-gcs as plugins.include. That plugin is loaded by default, so you don’t need to specify it. Overriding plugins.include will disable all other plugins, including the Kafka one.
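If you really do need to override plugins.include, you have to list every plugin you depend on; assuming the plugin folder names shipped in the official image, that would be something like:

-Dplugins.include=pinot-gcs,pinot-kafka-2.0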

Alexander Vivas

02/11/2021, 3:42 PM

Daniel Lavoie

02/11/2021, 3:43 PM
Yes, the documentation may lead you astray, since it doesn’t mention that 1) other plugins will be disabled, and 2) it’s already part of the Docker image.
I don’t see anything in your values.yaml, so I guess it is part of your Docker image fork?

Alexander Vivas

02/11/2021, 3:45 PM
It's in the jvmOpts section of values.yaml

Daniel Lavoie

02/11/2021, 3:45 PM
Not for the controller
You only shared the controller values

Alexander Vivas

02/11/2021, 3:49 PM
Sorry, wrong file
This is the one

Daniel Lavoie

02/11/2021, 3:50 PM
Anyways, just remove -Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-gcs from all your jvmOpts.
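After that, each component section of values.yaml only needs its usual flags, roughly like this (heap sizes here are just illustrative):

controller:
  jvmOpts: "-Xms256M -Xmx1G"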

Alexander Vivas

02/11/2021, 3:51 PM
Okay, I'll try that
I did that and I just deployed our table in Pinot. For that table I set these properties:
"realtime.segment.flush.threshold.time":"24h"
"realtime.segment.flush.threshold.size":"0"
"realtime.segment.flush.desired.size":"500M"
The thing is, despite having set 24 hours and 500MB as our limits in time and size for segments, I see this behavior in GCS, and I'm not sure if that's a good sign or not:
It's been consuming for roughly ten minutes or so and there are already 9 segments (one of them is not showing here yet), but look at those sizes, is that okay?

Daniel Lavoie

02/12/2021, 1:17 PM
I’m not sure but I don’t think threshold.size 0 is helping out
realtime.segment.flush.threshold.segment.size
realtime.segment.flush.threshold.time
realtime.segment.flush.threshold.rows
So you want something like
"realtime.segment.flush.threshold.time":"24h",
"realtime.segment.flush.threshold.size":"500M",
"realtime.segment.flush.threshold.rows": "500000000",
As for the default values: my guess is that 500M records will always be bigger than 500MB 😛 So I would leave rows at the default value.

Alexander Vivas

02/12/2021, 1:30 PM
Nice! I'll try those out
Thanks
So, just to confirm, this should be the properties to go on the table config, right?
"realtime.segment.flush.threshold.time":"24h"
"realtime.segment.flush.threshold.segment.size":"500M"
"realtime.segment.flush.threshold.rows": "5000000"

Daniel Lavoie

02/12/2021, 2:09 PM
Yes, next to your Kafka config
The rows config is not required since 5 million is the default
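For example, inside streamConfigs in the realtime table config, next to the Kafka settings (topic and broker here are placeholders, and the Kafka config is abbreviated):

"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.topic.name": "my-topic",
  "stream.kafka.broker.list": "kafka:9092",
  "realtime.segment.flush.threshold.time": "24h",
  "realtime.segment.flush.threshold.segment.size": "500M"
}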

Alexander Vivas

02/12/2021, 2:10 PM
Ah, okay, nice
Interesting, now it consumed 5 million messages from kafka and stopped consuming, is that okay?

Daniel Lavoie

02/12/2021, 2:34 PM
What are your logs saying?

Alexander Vivas

02/12/2021, 2:34 PM
It created only one segment. Every time I try to get the number of records in Pinot I get different results; I presume this is because the query gets routed to a server that doesn't yet hold the whole segment. We have 2 replicas configured.

Daniel Lavoie

02/12/2021, 2:35 PM
Check the server logs

Alexander Vivas

02/12/2021, 2:38 PM

Daniel Lavoie

02/12/2021, 2:40 PM
Zookeeper is down.
21/02/12 10:26:44.302 WARN [ClientCnxn] [Start a Pinot [SERVER]-SendThread(mls-zookeeper.production.svc.cluster.local:2181)] Session 0x100722c6f210005 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
Unless that is an old message. Can you share the controller logs too?
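To rule that out, you can check whether the ZooKeeper pods are up; the namespace here is taken from the hostname in that log line:

kubectl -n production get pods | grep zookeeper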

Alexander Vivas

02/12/2021, 2:41 PM
No, actually it's been quite stable for now
I think that might be because I redeployed everything to use GCS in an attempt to configure it as the deep store
Another thing: it stopped consuming after reaching 5 million messages, but it still hasn't stored the segment in GCS as it was doing previously

Daniel Lavoie

02/12/2021, 2:46 PM
Can you share the controller logs?
there’s a high chance it’s simply failing to store the segment
General tip, whenever something is not working as expected, always check all the logs first

Alexander Vivas

02/12/2021, 2:56 PM
But I don't see any errors regarding gcs here

Daniel Lavoie

02/12/2021, 2:57 PM
How many controllers do you have?

Alexander Vivas

02/12/2021, 2:57 PM
3

Daniel Lavoie

02/12/2021, 2:57 PM
get the logs from all of them 🙂
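For example, assuming the default StatefulSet pod naming:

for i in 0 1 2; do
  kubectl exec pinot-controller-$i -- cat pinotController.log > controller-$i.log
done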

Alexander Vivas

02/12/2021, 2:58 PM
Okay, hold on
I see more data in controller 2.
It took it quite some time to upload the segment data to gcs, but now I see it there. It didn't reach the 500MB goal though
I saw a log regarding the minion component, I only started 1 instance with the default config values because I wasn't sure how we were supposed to use it
Now I see pinot started consuming again
Didn't change anything, I was just getting the logs out of every controller instance

Daniel Lavoie

02/12/2021, 3:22 PM
how many servers do you have?

Alexander Vivas

02/12/2021, 3:22 PM
We have 3 servers

Daniel Lavoie

02/12/2021, 3:22 PM
pinot-server

Alexander Vivas

02/12/2021, 3:23 PM
Yep, 3 pinot server instances

Daniel Lavoie

02/12/2021, 3:23 PM
OK, keep monitoring and tell me if you keep witnessing the same behavior. You could also increase the rows value to 10,000,000; your segments should double in size.
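i.e. in streamConfigs:

"realtime.segment.flush.threshold.rows": "10000000"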

Alexander Vivas

02/12/2021, 3:24 PM
Okay, and then regarding the minion instance, should I also provide the same resources to it? How many minion instances should we have with this architecture?

Daniel Lavoie

02/12/2021, 3:25 PM
Minion is used for operational tasks such as scheduled batch ingestion.
If you only need realtime streaming, you don’t need minion instances

Alexander Vivas

02/12/2021, 3:26 PM
Okay, then if I understood correctly, if we were to have offline tables and ingest data from any other source then we should increase the minion resources and instances. Is that correct?

Daniel Lavoie

02/12/2021, 3:27 PM
Per your ingestion needs, yes
Typically, a bunch of workers for an initial load, then just the right amount for the periodic new files to ingest

Alexander Vivas

02/12/2021, 3:28 PM
Ah, okay, so it depends on the workload for the batch ingestion
Many thanks!

Daniel Lavoie

02/12/2021, 3:29 PM
Yes, it scales horizontally pretty well, it’s usually CPU bottlenecked because it’s responsible for generating the segments.

Alexander Vivas

02/12/2021, 6:28 PM
It works perfectly!
Last question, can we use this https://docs.pinot.apache.org/basics/data-import/pinot-file-system/import-from-gcp#job-spec to import data exported from BigQuery?

Daniel Lavoie

02/12/2021, 6:30 PM
As long as you can export the data with a schema matching the Pinot one, yes you can
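For reference, a job spec along the lines of that page could look roughly like this, assuming you export from BigQuery to Avro files in a GCS bucket (bucket, paths, project, and table name are placeholders):

executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 'gs://my-bucket/bigquery-export/'
includeFileNamePattern: 'glob:**/*.avro'
outputDirURI: 'gs://my-bucket/pinot-segments/'
pinotFSSpecs:
  - scheme: gs
    className: org.apache.pinot.plugin.filesystem.GcsPinotFS
    configs:
      projectId: 'my-project'
      gcpKey: '/path/to/credentials.json'
recordReaderSpec:
  dataFormat: 'avro'
  className: 'org.apache.pinot.plugin.inputformat.avro.AvroRecordReader'
tableSpec:
  tableName: 'myTable'
pinotClusterSpecs:
  - controllerURI: 'http://pinot-controller:9000'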

Alexander Vivas

02/12/2021, 6:30 PM
Thanks