# getting-started
n
You don't need the group id or any of the properties that say "hlc". Your tables might be out of sync because you've set the offset criteria to "largest": each table will start consuming from the latest message in the topic, so if your event rate is high, the second table will miss the events that were emitted between the creation of the first table and the second.
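(For context: the offset criteria is the `stream.kafka.consumer.prop.auto.offset.reset` property in the table's `streamConfigs`. A minimal sketch, with placeholder topic and broker values; setting it to `smallest` makes every new table start from the beginning of the topic instead of the latest message:)

```json
{
  "streamConfigs": {
    "streamType": "kafka",
    "stream.kafka.topic.name": "my-events-topic",
    "stream.kafka.broker.list": "kafka-broker:9092",
    "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
  }
}
```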
p
I tried with smallest instead of largest first, and that's where I was seeing the difference; then I started using largest after that. I did see in the code that Pinot uses <table_name>_<timestamp> as a default group id. I am still confused why I don't see it in the list of consumer groups. I'll try again today.
n
The concept of consumer group is not used in low level consumer
p
And I am creating tables in both clusters at the same time using the same topic. If anything, I would expect the difference to be smaller, not 2-3x apart, since the event rate is low.
I see. Also, I forgot to mention that I am using Pinot 0.7.1 and Kafka 2.x.
How do I use a consumer group with the high-level consumer? Clearly I am missing something when configuring that as well.
Do I need to use `stream.kafka.hlc.zk.connect.string` and `stream.kafka.zk.broker.url`? I see those in the example table configs in the GitHub repo for the high-level consumer. The Kafka cluster has its own ZooKeeper, and each Pinot cluster has its own ZooKeeper as well.
n
you shouldn’t be using the high-level consumer, and hence shouldn’t have to worry about consumer groups
p
I see. Could you please go into a little bit about why you recommend that?
n
we’ve stopped actively developing the high-level consumer and will likely deprecate it soon. All the properties you need are listed here: https://docs.pinot.apache.org/basics/data-import/pinot-stream-ingestion/import-from-apache-kafka
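(For reference, a minimal low-level `streamConfigs` block along the lines of that page; this is a sketch only, with placeholder topic and broker values. The consumer factory class shown is the Kafka 2.x plugin that ships with Pinot 0.7.1, and the decoder assumes JSON messages:)

```json
{
  "streamConfigs": {
    "streamType": "kafka",
    "stream.kafka.consumer.type": "lowlevel",
    "stream.kafka.topic.name": "my-events-topic",
    "stream.kafka.broker.list": "kafka-broker:9092",
    "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
    "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
    "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
  }
}
```

Note there are no `hlc` or ZooKeeper properties here; the low-level consumer talks to the Kafka brokers directly.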
p
Got it. Thank you so much once again for your help and time.
n
still doesn’t solve your missing events issue though.. is there a way for you to run some queries (like min/max timestamps, or a `count(*) group by timestamp`) to verify that you’re indeed seeing events being missed?
p
Yeah let me try those queries and share results with you.
I am setting up everything to be able to run those queries. In the meantime I have a few more questions: does the low-level consumer use a group id by itself, or am I wrong in understanding that it uses a default group id based on table name and timestamp? If it is doing that, would merely using a different table name help? If it is using a group id internally, I don't understand why kafka-consumer-groups doesn't show it; I do see an empty space as one of the consumer groups. If it is not using a group id, then wouldn't the two tables compete with each other to consume from the same topic in the same Kafka cluster?
output for `select min(upload_time), max(upload_time) from table` for the table with the inverted index: [results attached]
output for `select min(upload_time), max(upload_time) from table` for the table with the star-tree index: [results attached]
looks like the one with the star-tree index is lagging behind.
used `largest` instead of `smallest` and they tend to be doing more or less similarly well. I think it also helped that I used different table names for the table with the inverted index vs. the table with the star-tree index. I don't have any proof other than what I am seeing 😂. Thank you Neha for all the help, your time, and patience. Much appreciated!
n
oh cool..
regarding `does the low-level consumer use a group id by itself, or am I wrong in understanding that it uses a default group id based on table name and timestamp? If it is doing that, would merely using a different table name help? If it is using a group id internally, I don't understand why kafka-consumer-groups doesn't show it; I do see an empty space as one of the consumer groups.` - we don't use a group id even internally.
`if it is not using a group id, then wouldn't the two tables compete with each other to consume from the same topic in the same Kafka cluster?` - Not sure what you mean by the two tables competing with each other. If you’re saying that the two tables will split the topic between them, such that the messages each one receives are not seen by the other - then no, that is not what happens. We consume directly from offsets inside the pinot-server, maintaining our own checkpointing.
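(For illustration: each consuming segment's metadata in Pinot's own ZooKeeper records the range of Kafka offsets it covers, which is how checkpointing works without any Kafka consumer group. A rough sketch of such realtime segment ZK metadata; the field names follow Pinot's RealtimeSegmentZKMetadata, and the offset values are made up:)

```json
{
  "segment.realtime.startOffset": "1234500",
  "segment.realtime.endOffset": "1237800",
  "segment.realtime.status": "DONE"
}
```

When a segment completes at `segment.realtime.endOffset`, the next consuming segment for that partition starts from exactly that offset, so each table tracks its own position in the topic independently.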
this might help: https://www.confluent.io/resources/kafka-summit-2020/apache-pinot-case-study-building-distributed-analytics-systems-using-apache-kafka/ It talks about how and why we moved away from the high-level consumer to the low-level one, and how it works internally.
p
Thank you. I do have questions around consuming from Kafka and offset management; I'll go through this case study first.