# troubleshooting
s
A SQL query filtering on the field used for partitioning returns nothing. Filtering on other fields is fine. I do not see anything worth mentioning in the logs. What's going on?
k
what's the query?
s
select * from abc_test where service_slug='xyz'
very simple query like this
service_slug is the column used for partitioning
k
I don't think there is any data in that table
totalDocs is 0
s
This is the result of
select count(*) from …
however, if I add that where clause, everything is zero
n
how about partition ‘xyz’, are you certain that exists in the data?
s
yes. I can see it from select *
"dimensionFieldSpecs": [
    {
      "name": "service_slug",
      "dataType": "STRING"
    },
n
can you do
select count(*), service_slug from abc_test group by service_slug order by count(*) limit 10
and use one of those?
s
"segmentPartitionConfig": {
    "columnPartitionMap": {
      "service_slug": {
        "functionName": "HashCode",
        "numPartitions": 16
      }
    }
  },
k
can you paste the metadata of the segment
looks like it's pruning all the segments
s
Untitled
this query works
select count(*) from oas_log_test where service_slug='ofo4'
this returns nothing
k
can you paste the metadata of a segment?
s
what’s that?
REST GET?
k
yes, or you can use the cluster manager UI to navigate
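For reference (not from the original thread), the controller also serves segment metadata over REST; the exact route varies by Pinot version, so check the controller's Swagger UI, but it is roughly:
Copy code
GET http://<controller-host>:9000/segments/<tableName>/<segmentName>/metadata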
n
Screen Shot 2020-09-16 at 9.20.25 AM.png
s
Untitled
n
was this exact logic used to partition the stream:
Copy code
return Math.abs(value.hashCode()) % _numPartitions;
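For illustration (not from the original thread), a minimal sketch of how that function maps a value to a partition, assuming 16 partitions as in the table config above:
Copy code
public class PartitionCheck {
    // Mirrors the HashCode partition function quoted above: abs(hashCode) % numPartitions.
    static int partitionOf(String value, int numPartitions) {
        return Math.abs(value.hashCode()) % numPartitions;
    }

    public static void main(String[] args) {
        // Prints the partition the function assigns to "ofo4" with 16 partitions.
        System.out.println(partitionOf("ofo4", 16));
    }
}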
s
I do not write any code to partition, do I?
And how do I get the Zookeeper browser UI?
There are 1557 records with null in the service_slug column used for partitioning. How does Pinot handle this?
n
Data partitioning won’t happen in Pinot. The data needs to be pre-partitioned. From this doc: https://docs.pinot.apache.org/operators/operating-pinot/tuning/routing#partitioning
Copy code
After setting the above config, data needs to be partitioned with the same partition function and number of partitions before running Pinot segment build and push job for offline push. Realtime partitioning depends on the kafka for partitioning. When emitting an event to kafka, a user need to feed partitioning key and partition function for Kafka producer API
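To make the quoted doc concrete, here is a minimal producer-side sketch (not from the original thread); the topic name, bootstrap server, and payload are placeholders, and NUM_PARTITIONS must match both the Kafka topic's partition count and numPartitions in the Pinot table config:
Copy code
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PartitionedProducer {
    // Must match the Kafka topic partition count and the Pinot segmentPartitionConfig.
    static final int NUM_PARTITIONS = 16;

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        String serviceSlug = "ofo4";                      // partitioning key
        String payload = "{\"service_slug\":\"ofo4\"}";   // event payload (placeholder)

        // Same logic as Pinot's HashCode partition function, so the Kafka partition
        // lines up with the partition Pinot expects for this key.
        int partition = Math.abs(serviceSlug.hashCode()) % NUM_PARTITIONS;

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("oas_log_topic", partition, serviceSlug, payload));
        }
    }
}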
k
@Mayank ^^
s
Ah, I did not know Kafka needs to use the same partitioning. Should Pinot at least return some data or an error in such a case?
m
Pinot will treat it as unpartitioned.
k
yes, it should return data if it's unpartitioned
n
Mayank, Kishore, we put this in the metadata, so it looks like we set partitions based on whatever data was received
Copy code
{\"columnPartitionMap\":{\"service_slug\":{\"functionName\":\"HashCode\",\"numPartitions\":16,\"partitions\":[10]}}}
m
Is this in the Pinot segment?
s
select count(*) from oas_log_test where service_slug='ofo1'
This query on ofo1 returns a bit more info
n
basically partitioning is mismatched. The stream was not partitioned with the hashcode, but Pinot is expecting it to be. Pinot has stored partition info in the metadata based on whatever it is seeing, so at query time we're seeing a mismatch
yes @Mayank that is in segment metadata. Shen has posted some metadata above
m
I don't think partitioning setup issues can cause empty results
n
it will if there’s no matching partition found right?
m
I think it is from table config and not segment metadata?
s
preparing lunch. Will be back after lunch. Lemme know what other info you guys need.
m
Segment metadata should look like:
Copy code
column.service_slug.partitionFunction = Murmur
column.service_slug.numPartitions = 32
column.service_slug.partitionValues = 24
@Neha Pawar So during consumption, we identify all the partitions that the rows of a consuming segment fall into. If they belong to different partitions, then we either write multiple partitions in the metadata (or don't write it at all, I can't recall). So during pruning, a segment won't be pruned as long as there is either no partition info, or one of the partition ids in the metadata matches
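Roughly, the pruning decision described above looks like this (a sketch, not the actual Pinot code):
Copy code
import java.util.Set;

class PartitionPruneSketch {
    // A segment is skipped only when it has partition info recorded and the
    // partition computed from the predicate value is not among the recorded ids.
    static boolean canPrune(Set<Integer> segmentPartitions, String predicateValue, int numPartitions) {
        if (segmentPartitions == null || segmentPartitions.isEmpty()) {
            return false; // no partition info -> never pruned
        }
        int queryPartition = Math.abs(predicateValue.hashCode()) % numPartitions;
        return !segmentPartitions.contains(queryPartition);
    }
}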
n
oh i see. so not partitioning the stream will simply cause non-optimal querying, but there won't be any incorrectness. Got it
m
Yep
k
Is this a bug?
m
No, why
@Shen Wan Could you modify the query as
where service_slug in ('ofo1')
? I want to validate a theory
n
ofo1 is the one that returns results. You mean ofo4?
m
yeah
s
select count(*) from oas_log_test where service_slug in ('ofo4')
returns nothing
m
So I think it may not be related to partitioning
IIRC, partition pruning kicks in for equality predicates.
Is there any query that returns ofo4?
n
also can you share the broker logs from around that time, even if there are no errors you see? there might be something that pops up for us
s
recent broker logs
select distinct service_slug from oas_log_test where service_slug <> 'null'
n
could you share segment metadata from a few other segments, of different partitions? (for example, the previously shared metadata was for kafka partition 10)
s
like this?
n
yes, maybe a few more for other partitions?
trying to validate something
s
Untitled
Untitled
{
  "segment.realtime.endOffset": "322913",
  "segment.time.unit": "MILLISECONDS",
  "segment.start.time": "1600245140167",
  "segment.flush.threshold.size": "113905",
  "segment.realtime.startOffset": "209008",
  "segment.end.time": "1600258808629",
  "segment.total.docs": "113905",
  "segment.table.name": "oas_log_test_REALTIME",
  "segment.realtime.numReplicas": "2",
  "segment.creation.time": "1600246011143",
  "segment.realtime.download.url": "http://pinot-logging-controller-2.pinot-logging-controller-headless.pinot-logging.svc.cluster.local:9000/segments/oas_log_test/oas_log_test__0__8__20200916T0846Z",
  "segment.name": "oas_log_test__0__8__20200916T0846Z",
  "segment.index.version": "v3",
  "custom.map": null,
  "segment.flush.threshold.time": null,
  "segment.type": "REALTIME",
  "segment.crc": "1038864885",
  "segment.partition.metadata": "{\"columnPartitionMap\":{\"service_slug\":{\"functionName\":\"HashCode\",\"numPartitions\":16,\"partitions\":[0]}}}",
  "segment.realtime.status": "DONE"
}
n
thank you
is it possible that your stream is already partitioned by HashCode on service_slug? Or are you certain the stream has no partitioning whatsoever? Just trying to verify why the kafka partition number always matches the "partitions" in the partition metadata.
s
I do not know. This is the config
Do you have example code that shows how to set up partitioning while sending messages to Kafka?
I hope this works.
n
Hey @Shen Wan we have identified a bug in the realtime partitioning logic. Please give us some time to figure out a fix/workaround.
👍 1
@Shen Wan if you want to use partitioning, unfortunately the only way forward is to recreate this table. And before doing that, set the partitioning logic in the Kafka stream to match the logic in the Pinot table config
s
I see. This is a test table. So it is OK. What is the bug about?
k
we assume that the kafka stream is partitioned on that key (in your case, service_slug)
n
In realtime, Pinot assumes that the stream is partitioned, so the Kafka partition number is directly used as the available partition in the segment metadata. When consuming data from the partitions and creating segments, no validation is done to ensure that the data actually matches that partition, based on the column.
s
So you guys are going to make Pinot query all partitions when partition info is incorrect?
And BTW, before I drop the table and recreate it, I'd like to get some stats, like storage usage per column. Where can I get them?
n
i’m not sure we have per column storage stats. @Kishore G?
k
we do, it's in the segment directory, it's called index_map
s
So not a REST API but a file?
k
yes, for now. please file an issue; we can add that as part of segment metadata
s
in pinot server? what directory?
n
this will be whatever directory you used when starting server as -dataDir
s
I find nothing under
/var/pinot/server/data/segment
something under
…/data/index
n
do you see directories for each segment there?
s
no
actually yes, found
index_map
Are all the sizes in bytes? I add them all up and get ~60% of diskSizeInBytes. Is the remaining ~40% raw data? Does this look reasonable?
And I wonder how repartitioning is supposed to work: updating the Kafka and Pinot configs cannot be atomic, so there will be a period when Kafka's partitioning and Pinot's are out of sync, right?
n
yes it is in bytes
Which is why deleting the table was suggested: delete the Pinot table, correct the partitioning in the stream, then recreate the table.
s
That’s infeasible in prod.
k
in prod, you will have to remove the partition info from the metadata
s
rebalance does not help?
k
no, the segment processing framework that @Neha Pawar is building can help but it's not ready
s
so removing partition info will cause Pinot to treat all data as one partition?
k
yes
broker is basically looking at the segment metadata in ZK and thinks that this segment is partitioned
s
then update Kafka partition, then update Pinot partition to be consistent, right?
k
and applies the partitioning function; if it does not match, it excludes the segment from query execution
yes
is this already in production?
s
my table? no, it’s just a test.
so segments created during the interim will have bad query performance, right?
k
got it
correct. By the way, how many services do you have?
s
you mean pinot servers? 12
k
no, what is the cardinality for the partition column
s
up to 100 I think
And to my previous question: the forward index is the data, right? So why do all the sizes in index_map add up to just ~60% of diskSizeInBytes?
k
you are probably missing inverted index
s
I included that.
I included dict size, fwd index size, inv index size, range index size, bloomfilter size: all that I can find in index_map
k
can you paste the output
s
index_map
REST response
k
it does not add up?
can you do ls -l on the segment file as well
s
not any more, I already dropped the table to repartition
will try to get some stats again tomorrow
k
ok
these things should match.
s
I also wonder where the text index info is? I set a text index for columns req and resp but do not see anything related.
I deleted the table oas_log_test and created a new table oas_log_test_v2 with a new schema. But the new table contains 12 million very old records and new records are not flowing in. Do we need to reset Kafka?
n
are you using the same kafka topic? and that kafka topic has all this old data? As soon as the table is created, pinot will ingest whatever is already in the topic
you could change that to consume only from the latest messages post table creation: in the streamConfigs section, change the "offset" field from smallest to largest
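Assuming the standard Kafka consumer property key, that streamConfigs change would look roughly like:
Copy code
"streamConfigs": {
  ...
  "stream.kafka.consumer.prop.auto.offset.reset": "largest"
}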
another possibility is that the table didn't get deleted completely and the new table was created. after deleting, check the external view to make sure everything is gone
s
external view of oas_log_test is 404. external view of oas_log_test_v2 is stuck on CONSUMING.
n
why do you say it is stuck on CONSUMING? it looks like a valid EV
s
because I’m expecting new segments generated for the new data I sent to Kafka
n
you cannot see the new data in the queries?
segments are created only occasionally
s
no. always the same 1.2 million records from over 30 hours ago
even after I tried your suggestion to update the table to largest
n
updating to largest will not remove older data from the table. that signal just tells a new table where to start consumption
s
I do not see new data coming in.
n
could you start with a clean kafka topic and table?
or post any exceptions that you see
s
And I do not understand why the old data is still there even after I deleted the table and recreated it.
n
you used the same topic right? and that topic has all the data?
s
yes
I did not touch Kafka
n
then Pinot is going to ingest all the data from the topic, if you had it set to "smallest"
s
that's fine. but only 1.2M were ingested.
and nothing changed after I set it to "largest"
n
like i said above, updating an existing table to “largest” will have no effect
i cannot tell why newer events aren’t getting ingested. will need to see logs
s
I’ll delete table and recreate with “largest”
now ingestion is active. It should exceed 1.2M records soon.
You mentioned that maybe the old table was not deleted completely. How would that affect the new table consuming data from Kafka? And how can we verify that a table is completely deleted?
n
if deleting and recreating with largest fixed it for you, then it was probably not about un-deleted data. When a table is deleted, the directories for that table on the server and controller get deleted. If a new table create is issued before the delete is done, the old directories could interfere with the new table. But again, it doesn't appear to be the case for you
s
select count(*) from oas_log_test_v2
With this table, the number of docs ingested halted at 800k
no new segments created
I feel Pinot is still in an unhealthy/stuck state.
2020/09/17 18:47:41.613 ERROR [LLRealtimeSegmentDataManager_oas_log_test_v2__11__0__20200917T1810Z] [oas_log_test_v2__11__0__20200917T1810Z] Could not build segment
This log confirms the issue but provides no insight.
n
can you share the whole log
what version of Pinot are you using?
s
Untitled
And that was the whole log line.
logs around that error
Is this exception the culprit? What does it mean?
inverted index must be built on columns with dictionary?
n
can i see the full table confg and full schema?
also why does the exception get skipped in your logs? The log line is actually
Copy code
} catch (Exception e) {
        segmentLogger.error("Could not build segment", e);
but i dont see the exception
s
That's what I complained about yesterday and thought was a Pinot bug. Maybe Stackdriver is truncating?
schema
config
n
afaik, you cannot put inv index column as noDictionary
also, timeFieldSpec is deprecated
suggest you put all time fields as dateTimeFieldSpecs
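A rough shape of a dateTimeFieldSpecs entry (the field name here is a placeholder; adjust to the actual time column and granularity):
Copy code
"dateTimeFieldSpecs": [
  {
    "name": "timestamp_ms",
    "dataType": "LONG",
    "format": "1:MILLISECONDS:EPOCH",
    "granularity": "1:MILLISECONDS"
  }
]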
s
I thought an inv index would make it unnecessary to also build a dictionary?
If it has to be built on top of a dictionary, the config structure should represent that logical relationship, or it should be documented.
n
why do you want to make it no dictionary?
s
It will be UUIDs in prod.
It does not make sense to me to build a dictionary for UUIDs.
I'd like the UUID itself to be the key.