# troubleshooting
Padma Malladi
Hi, is there a way for me to get the consuming rows (yet to be persisted) with a query? I am seeing that the Kafka messages are coming through, but the data is not yet persisted in Pinot for some of the indexed attributes. I set the segment threshold size to a much lower value and it doesn't help. I set the threshold time to 1 min and it persists every minute, but that's not something I like, as ZK's storage is increasing rapidly. I would also like the data returned to me much sooner than 1 min.
Xiaobing
Not sure if there is a way to find the consuming rows. Data becomes queryable after it's ingested, even though it's not persisted yet. But if you are actually looking to persist data sooner, then tuning down either the segment size or the flush interval should work. However, both would lead to more overhead on ZK. I see there are two knobs to tune down segment size: rows and size (in bytes), and it sounds like the former overrides the latter, so you may want to double-check that (docs).
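For reference, a minimal sketch of those knobs as they appear in the table's streamConfigs, assuming the property names from recent Pinot docs (older versions call the row threshold realtime.segment.flush.threshold.size); the values are illustrative only. Setting the row threshold to 0 is what lets the size threshold take effect, which is the override behavior mentioned above:

```json
{
  "tableIndexConfig": {
    "streamConfigs": {
      "realtime.segment.flush.threshold.rows": "0",
      "realtime.segment.flush.threshold.segment.size": "150M",
      "realtime.segment.flush.threshold.time": "24h"
    }
  }
}
```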
Mayank
@Padma Malladi as @Xiaobing mentioned, each row is immediately queryable as soon as it is ingested; you don't need to wait for it to be persisted. Also, please don't set a 1 min threshold: there is no upside to doing that whatsoever, but there are definite downsides.
Padma Malladi
@Mayank @Xiaobing Yes, I was only trying those values out to make sure that the data is being ingested. I changed them back to the original value of 24 hours now. My question is why I am not seeing the data from the Pinot query until 24 hours after I see the data in Kafka.
If I set the threshold.time to 5 min, I was able to see the data much sooner, i.e., within 5 min.
If the data is ingested right away even though it's not persisted, I should see that ingested data, right?
are there any settings I am missing in the segment config?
Xiaobing
yup, data should be visible to queries as soon as it’s ingested.
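One quick way to sanity-check that, as a sketch: query the newest ingested row and compare it against the latest Kafka message. Both myTable and the timestamp column eventTimeMillis here are hypothetical names, not from the original thread:

```sql
-- A recent MAX value here means consuming rows are already queryable;
-- eventTimeMillis and myTable are hypothetical, substitute your own.
SELECT MAX(eventTimeMillis) FROM myTable
```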
Mayank
I can't think of a reason that explains what you are describing, so I'd like to verify your claim first. Can you look at the external view, pick one segment that is in CONSUMING state, and then run a query to count the rows in that segment? You can use something like
where $segmentName = '<nameOfConsumingSegment>'
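A sketch of that check, assuming the same hypothetical myTable; copy the segment name from a segment the external view shows as CONSUMING:

```sql
-- $segmentName is Pinot's built-in virtual column; the segment name
-- below is a made-up example of the realtime naming convention.
SELECT COUNT(*)
FROM myTable
WHERE $segmentName = 'myTable__3__42__20240101T0000Z'
```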
Padma Malladi
Is there a way to know which segment is consuming the data for a specific value of the indexed (id) column?
Mayank
You mean partition?
The name of the segment has the partition id in it.
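For reference, realtime (LLC) segment names follow this pattern, so the Kafka partition id is the second token (the example name is made up):

```
<tableName>__<kafkaPartitionId>__<sequenceNumber>__<creationTime>
e.g. myTable__5__12__20240101T0000Z  ->  Kafka partition 5
```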
Padma Malladi
Yes. If my partitioning_column is x and the value is 5, how would I know the partition for the value x=5?
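One empirical way to answer that, as a sketch (again with the hypothetical myTable; x stands in for the partitioning column): ask Pinot which segments hold the matching rows, then read the partition id out of the returned segment names.

```sql
-- Lists the segments containing rows with x = 5; the second token of
-- each returned segment name is the Kafka partition id.
SELECT $segmentName, COUNT(*)
FROM myTable
WHERE x = 5
GROUP BY $segmentName
```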
Xiaobing
Since you asked about a specific partition and value, I'd like to clarify whether all the recently ingested data was not queryable, or just a subset of the data, as those would be different problems.
Padma Malladi
Subset of it (for certain values of the partitioning column).
I am assuming this property is set for the persistence
segment.flush.threshold.size
and this property is set for the consumption from Kafka
realtime.segment.flush.threshold.size
And if I change this value in the console, does it start consuming the data right away as soon as it reaches the size I specify? It's currently at 5M and I believe it's not reaching that quickly enough.
I am assuming that whichever threshold value is reached first would be applied? E.g., if the size reaches 5M before the 24 h specified in the config, it will ingest that data as soon as it reaches 5M, right?
Xiaobing
I feel persisting data very quickly might have worked around some hidden issue that made the subset of data not queryable as it was ingested. Do you mind sharing the tableConfig? It includes some partitioning settings which might help us understand the problem.
But for the questions above: yes, whichever threshold is reached first flushes the segment.
Padma Malladi
Flush the segment to persistence?
Or flush the consuming data to be ingested?
They are two different things, right?
And the realtime flush threshold is for consuming and the segment flush threshold is for persisting?
Xiaobing
Flush the segment to persistence.
realtime.segment.flush.threshold.size
controls when to seal a consuming segment and persist it on disk. Then a new consuming segment is created, which continues to ingest data from Kafka almost immediately. Basically, data is ingested into Pinot continuously. As to
segment.flush.threshold.size
I don't know that one; where did you find it?
Mayank
@Padma Malladi how many partitions do you have?
You might run a query that includes only consuming segments
(where $segmentName IN (…))
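A sketch of that query; the segment names are hypothetical placeholders for the ones your external view lists in CONSUMING state:

```sql
-- Restrict the count to the consuming segments only; replace the
-- names below with your actual CONSUMING segment names.
SELECT COUNT(*)
FROM myTable
WHERE $segmentName IN (
  'myTable__3__42__20240101T0000Z',
  'myTable__5__42__20240101T0000Z'
)
```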
Padma Malladi
I did
Mayank
So that data is not present in the consuming segments?