# troubleshooting
Padma Malladi
Hi, is there a way for me to get the consuming rows (yet to be persisted) with a query? I am seeing that the Kafka messages are coming through, but the data is not yet persisted in Pinot for some of the indexed attributes. I set the segment threshold size to a much lower value and it doesn't help. I set the threshold time to 1 min and it persists every minute, but that's not something I like, as ZK's storage is increasing rapidly. I would also like the data returned to me much sooner than 1 min.
Xiaobing
Not sure if there is a way to find the consuming rows. Data becomes queryable after it's ingested, even though it's not persisted yet. But if you are actually looking to persist data sooner, then tuning down either the segment size or the flush interval should work. However, both would lead to more overhead on ZK. I see there are two knobs to tune down segment size: rows and size (in bytes), and it sounds like the former overrides the latter, so you may want to double-check that (docs).
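For reference, a minimal sketch of those knobs as they appear in the table's streamConfigs, assuming the property names from recent Pinot docs (older versions call the row threshold realtime.segment.flush.threshold.size); the values are illustrative only. Setting the row threshold to 0 is what lets the size threshold take effect, which is the override behavior mentioned above:

```json
{
  "tableIndexConfig": {
    "streamConfigs": {
      "realtime.segment.flush.threshold.rows": "0",
      "realtime.segment.flush.threshold.segment.size": "150M",
      "realtime.segment.flush.threshold.time": "24h"
    }
  }
}
```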
Mayank
@Padma Malladi as @Xiaobing mentioned, each row is immediately queryable as soon as it is ingested; you don't need to wait for it to be persisted. Also, please don't set a 1 min threshold: there is no upside to doing that whatsoever, but there are definite downsides.
Padma Malladi
@Mayank @Xiaobing Yes, I was only trying those values out to make sure that the data is being ingested. I changed them back to the original value of 24 hours now. My question is why I am not seeing the data from the Pinot query until 24 hours after I see the data in Kafka.
If I set the threshold.time to 5 min, I was able to see the data much sooner, i.e., within 5 min.
If the data is ingested right away even though it's not persisted, I should see that ingested data, right?
are there any settings I am missing in the segment config?
Xiaobing
yup, data should be visible to queries as soon as it’s ingested.
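One quick way to sanity-check that, as a sketch: query the newest ingested row and compare it against the latest Kafka message. Both myTable and the timestamp column eventTimeMillis here are hypothetical names, not from the original thread:

```sql
-- A recent MAX value here means consuming rows are already queryable;
-- eventTimeMillis and myTable are hypothetical, substitute your own.
SELECT MAX(eventTimeMillis) FROM myTable
```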
Mayank
I can't think of a reason that explains what you are describing, so I'd like to verify your claim first. Can you look at the external view, pick one segment that is in CONSUMING state, and then run a query to count the rows in that segment? You can use something like
where $segmentName = '<nameOfConsumingSegment>'
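A sketch of that check, assuming the same hypothetical myTable; copy the segment name from a segment the external view shows as CONSUMING:

```sql
-- $segmentName is Pinot's built-in virtual column; the segment name
-- below is a made-up example of the realtime naming convention.
SELECT COUNT(*)
FROM myTable
WHERE $segmentName = 'myTable__3__42__20240101T0000Z'
```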
Padma Malladi
Is there a way to know which segment is consuming the data for a specific value of the indexed (id) column?
Mayank
You mean partition?
The name of the segment has the partition id in it.
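For reference, realtime (LLC) segment names follow this pattern, so the Kafka partition id is the second token (the example name is made up):

```
<tableName>__<kafkaPartitionId>__<sequenceNumber>__<creationTime>
e.g. myTable__5__12__20240101T0000Z  ->  Kafka partition 5
```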
Padma Malladi
Yes. If my partitioning_column is x and the value is 5, how would I know the partition for the value x=5?
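One empirical way to answer that, as a sketch (again with the hypothetical myTable; x stands in for the partitioning column): ask Pinot which segments hold the matching rows, then read the partition id out of the returned segment names.

```sql
-- Lists the segments containing rows with x = 5; the second token of
-- each returned segment name is the Kafka partition id.
SELECT $segmentName, COUNT(*)
FROM myTable
WHERE x = 5
GROUP BY $segmentName
```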
Xiaobing
Since you asked about a specific partition and value, I'd like to clarify whether all the recently ingested data was not queryable, or just a subset of the data, as those would be different problems.
Padma Malladi
Subset of it (for certain values of the partitioning column).
I am assuming this property is set for the persistence
segment.flush.threshold.size
and this property is set for the consumption from Kafka
realtime.segment.flush.threshold.size
And if I change this value in the console, does it start consuming the data right away as soon as it reaches the size I specify? It's currently at 5M and I believe it's not reaching that quickly enough.
I am assuming that whichever threshold value is reached first would be applied? E.g., if the size reaches 5M before the 24 h specified in the config, it will ingest that data as soon as it reaches 5M, right?
Xiaobing
I feel persisting data very quickly might have worked around some hidden issue that made the subset of data not queryable as it was ingested. Do you mind sharing the tableConfig? It includes some partitioning settings which might help us understand the problem.
But for the questions above: yes, whichever threshold is reached first flushes the segment.
Padma Malladi
Flush the segment to persistence?
Or flush the consuming data to be ingested?
They are two different things, right?
And the realtime flush threshold is for consuming and the segment flush threshold is for persisting?
Xiaobing
Flush the segment to persistence.
realtime.segment.flush.threshold.size
controls when to seal a consuming segment and persist it on disk. Then a new consuming segment is created, which continues to ingest data from Kafka almost immediately. Basically, data is ingested into Pinot continuously. As to
segment.flush.threshold.size
I don't know that one; where did you find it?
Mayank
@Padma Malladi how many partitions do you have?
You might run a query that includes only consuming segments
(where $segmentName IN (…))
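A sketch of that query; the segment names are hypothetical placeholders for the ones your external view lists in CONSUMING state:

```sql
-- Restrict the count to the consuming segments only; replace the
-- names below with your actual CONSUMING segment names.
SELECT COUNT(*)
FROM myTable
WHERE $segmentName IN (
  'myTable__3__42__20240101T0000Z',
  'myTable__5__42__20240101T0000Z'
)
```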
Padma Malladi
I did
Mayank
So that data is not present in the consuming segments?