#troubleshooting

sunny

09/22/2022, 6:30 AM
After adding a Kafka partition, data is not visible when querying the Pinot partitioned table with a WHERE clause.

1. Create a Pinot partitioned table (Kafka topic partitions = 3)
- Queries succeed.

2. Add a Kafka partition to the topic (3 -> 4)
- A new consuming segment (3__0) is created in Pinot.

3. Produce data to the new partition
- The row is returned by a plain query (select * from).
- But the row is not returned by a filtered query (select * from ... where ... in).
- The row becomes visible only once the segment completes. The data coming into the new consuming segment (3__1) doesn't behave the same as before.

It may happen that Kafka partitions are increased during operation, so please check this issue. 😊

Mayank

09/22/2022, 2:15 PM
A new partition should be detected automatically (either periodically or when flushing the current batch; I need to check which). But you may force it by running a rebalance.

Subbu Subramaniam

09/22/2022, 5:26 PM
A better way to force it may be to run the realtime periodic job on demand.
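For reference, the controller exposes an endpoint for running a periodic task on demand; the snippet below just builds the request URL. The controller address and table name are assumptions, and the endpoint path (/periodictask/run) should be checked against your Pinot version's controller API.

```python
# Sketch: build the URL to trigger RealtimeSegmentValidationManager on demand.
# Host/port and table name are hypothetical; verify the endpoint against your
# controller's REST API before relying on it.
from urllib.parse import urlencode

CONTROLLER = "http://localhost:9000"  # assumed controller address
params = {
    "taskname": "RealtimeSegmentValidationManager",
    "tableName": "myTable_REALTIME",  # hypothetical table name
}
url = f"{CONTROLLER}/periodictask/run?{urlencode(params)}"
print(url)
# Against a live cluster, issue the request with e.g. urllib.request.urlopen(url).
```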

Mayank

09/22/2022, 8:57 PM
Perhaps a good one to document, unless already there.

Subbu Subramaniam

09/22/2022, 9:20 PM
ok, let me see where to add this
@Mayank I have submitted the doc change for review. Not sure how to tag you as the reviewer, but can you take a look please?

Mayank

09/23/2022, 12:15 AM
Thanks @Subbu Subramaniam, is that a PR on the pinot-docs repo? I didn’t find one there, if you can link it here that would help

sunny

09/23/2022, 12:15 AM
Thank you for checking. Pinot detected the new partition automatically (a new consuming segment was created). But the problem is that the rows in the new consuming segment are not shown. If I run the realtime periodic job, will the rows become visible in a WHERE query?

Mayank

09/23/2022, 12:16 AM
Yes, that’s what we are suggesting

sunny

09/23/2022, 12:17 AM
OK. When the doc is updated, please let me know :)

Mayank

09/23/2022, 1:56 PM
The default LIMIT is 10; you need to increase the limit or run other queries to check whether the data is available.

sunny

09/29/2022, 7:30 AM
@Mayank When I increase the limit, the data is still not visible. But when I query the broker directly (select ... from ... where ...), the data comes back normally. I don't know why the rows can't be found when querying from the controller web UI.

curl -u admin:verysecret -H "Content-Type: application/json" -X POST http://pay-poc-pinot-b2.ay1.krane.9rum.cc:7001/query/sql -d @pinot.json | python -m json.tool

However, numSegmentsQueried equals the total number of segments. In other words, it seems that broker pruning does not work on the Pinot partitioned table.
I have 2 questions.
1. ☆☆☆☆ Even if I increase the number of Kafka topic partitions, can the Pinot partitioned table still be used normally?
2. After adding the Kafka topic partition, the data is not visible when the query (select ... from ... where ...) is executed through the controller, but it is visible when executed through the broker. What causes this difference? Please check :) Thank you
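For context on question 2, pruning on a partitioned table depends on two pieces of table config: the segment partition config and the broker's segment pruner. A minimal sketch of the relevant sections, assuming a hypothetical partition column memberId and the Murmur partition function:

```json
{
  "tableIndexConfig": {
    "segmentPartitionConfig": {
      "columnPartitionMap": {
        "memberId": { "functionName": "Murmur", "numPartitions": 4 }
      }
    }
  },
  "routing": {
    "segmentPrunerTypes": ["partition"]
  }
}
```

If routing.segmentPrunerTypes does not include "partition", the broker queries every segment regardless of the WHERE clause, which would also show up as numSegmentsQueried equal to the total segment count.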

Subbu Subramaniam

09/30/2022, 2:27 AM
AFAIK if you have partitioned the ingestion topic according to a key, and then added ONE partition, things should not work, since the existing keys partition and new keys partition will not match. The only way to do this without compromising on continuous operation is to double the number of partitions in the stream (if you need more partitions). @Seunghyun and @Mayank can correct me if I am wrong.
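Subbu's doubling argument can be sketched numerically. Assuming a modulo-style assignment (hash(key) % numPartitions, which is effectively what Kafka's default partitioner does), going from 3 to 4 partitions remaps most keys, while going from 3 to 6 sends each key either to its old partition p or to p + 3, so old per-partition metadata stays interpretable:

```python
# Sketch: effect of changing the partition count under hash(key) % numPartitions.
def partition(key_hash: int, num_partitions: int) -> int:
    return key_hash % num_partitions

hashes = range(1000)  # stand-ins for key hashes

# 3 -> 4: most keys land in a different partition than before.
moved = sum(partition(h, 3) != partition(h, 4) for h in hashes)
print(f"3 -> 4 partitions: {moved}/1000 keys remapped")

# 3 -> 6: every key stays in its old partition p, or moves to p + 3,
# so a segment tagged "partition p" now corresponds to exactly p and p + 3.
for h in hashes:
    assert partition(h, 6) in (partition(h, 3), partition(h, 3) + 3)
print("3 -> 6 partitions: every key maps to its old partition or old + 3")
```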
Btw, regarding the document update, I could not figure out how to assign you to the doc change (I edited it and submitted it for review on GitBook), nor could I get a link to a PR. So I went ahead and committed it. Please take a look and see if it needs changes. https://docs.pinot.apache.org/basics/data-import/pinot-stream-ingestion#handling-partition-changes-in-streams

sunny

10/04/2022, 1:41 AM
Thanks for the reply and the link :) Btw, the RealtimeSegmentValidationManager runs so that Pinot recognizes partitions added in Kafka. Even in that case, is there a problem with using the Pinot partitioned table?

Subbu Subramaniam

10/04/2022, 2:43 AM
In a partitioned table, when pinot segments are created, the segment metadata indicates the set of keys that fall into that segment. So, when a query comes in with a certain value for the key, we look up all segments, and send the query only to those segments where the key may be a match. Now, if you suddenly change the number of partitions, then all the metadata that you have saved before for the segments will not be relevant any more, right?
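A toy sketch of the pruning logic described above (names are illustrative, not Pinot internals):

```python
# Sketch: broker-side partition pruning via per-segment metadata.
def prune(segments, key_hash, num_partitions):
    """Return only the segments whose recorded partition matches the key."""
    target = key_hash % num_partitions
    return [s for s in segments if s["partition"] == target]

# Two segments per partition, for 3 partitions.
segments = [{"name": f"seg_{p}__{i}", "partition": p}
            for p in range(3) for i in range(2)]

# With 3 partitions, a lookup for key hash 7 (7 % 3 == 1) touches only
# partition 1's segments.
hits = prune(segments, key_hash=7, num_partitions=3)
print([s["name"] for s in hits])  # ['seg_1__0', 'seg_1__1']

# If the stream suddenly has 4 partitions, the same key hashes to
# 7 % 4 == 3 -- a partition no old segment claims -- so pruning against
# stale metadata would route the query away from the segments that
# actually hold the key's older rows.
print(prune(segments, key_hash=7, num_partitions=4))  # []
```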

Mayank

10/04/2022, 3:01 AM
IIRC, changes in number of partitions or partition function should not lead to any functional issues, the results will still be correct (we store numPartitions, as well as partition function in the segment). It is just that some optimizations for partition pruning may not apply if you changed partition function.
Note, the exception is upsert, which requires you to not change partition function or count.

sunny

10/04/2022, 6:57 AM
@Mayank @Subbu Subramaniam In summary:
1. Increasing the number of Kafka partitions causes no functional issues for the Pinot partitioned table (the result data is correct). When I tested it, once RealtimeSegmentValidationManager ran, a new segment was created and the results came out properly. However, according to Subbu, if ONE partition is added, things should not work, since the existing keys' partitions and the new keys' partitions will not match. How should I understand what Subbu said?
2. If the partition function is changed, optimizations such as partition pruning may not be applied. We will not change the partition function; we are only considering changing the number of partitions. Can pruning still not be applied even if only the number of partitions is changed? After the Kafka partition count was increased and the new segment was created, queries were executed against all segments (numSegmentsQueried).
3. In upsert mode, the number of partitions and the partition function must not be changed. So are you suggesting we set the number of partitions twice as large as needed when creating an upsert table?
Again, this question is a bit long. Thanks for the detailed replies. Additional answers would be much appreciated. Thanks a lot.

Mayank

10/04/2022, 1:01 PM
1. Yes, increasing partitions won't cause a functionality issue (unless you use upsert).
2. Have you enabled partition pruning? If so, the one segment that was consuming at the time of the change will have data across multiple partitions, and because of that, query routing will not be able to use partitioning until that segment ages out due to retention or the time boundary.
3. I recommend having enough partitions upfront. Also note that if you have 16 partitions, the max limit for scaling servers is 16 (one per partition) per replica, which means each server will need to hold 1/16 of the keys. You'll want to do the math based on this.
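Point 2 above can be sketched as follows: the segment that was consuming when the partition count changed holds rows from multiple partitions, so it carries no single partition id and can never be pruned; it is queried for every key until it ages out (illustrative structures, not Pinot internals):

```python
# Sketch: a mixed-partition segment defeats pruning for every query.
def segments_to_query(segments, target_partition):
    """A segment with no recorded partition can never be pruned."""
    return [s["name"] for s in segments
            if s["partition"] is None or s["partition"] == target_partition]

segments = [
    {"name": "seg_0", "partition": 0},
    {"name": "seg_1", "partition": 1},
    {"name": "seg_mixed", "partition": None},  # was consuming during the change
]
print(segments_to_query(segments, target_partition=0))  # ['seg_0', 'seg_mixed']
# seg_mixed is included for every key, inflating numSegmentsQueried until it
# ages out via retention or the time boundary.
```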