# troubleshooting
s
A SQL query filtering on the field used for partitioning returns nothing. Filtering on other fields is fine. I do not see anything worth mentioning in the logs. What's going on?
k
what's the query?
s
select * from abc_test where service_slug='xyz'
very simple query like this
service_slug is the column used for partitioning
k
I don't think there is any data in that table
totalDocs is 0
s
This is the result of
select count(*) from …
however, if I add that where clause, everything is zero
n
how about partition ‘xyz’, are you certain that exists in the data?
s
yes. I can see it from select *
"dimensionFieldSpecs": [
    {
      "name": "service_slug",
      "dataType": "STRING"
    },
n
can you do
select count(*), service_slug from abc_test group by service_slug order by count(*) limit 10
and use one of those?
s
"segmentPartitionConfig": {
    "columnPartitionMap": {
      "service_slug": {
        "functionName": "HashCode",
        "numPartitions": 16
      }
    }
  },
k
can you paste the metadata of the segment
looks like it's pruning all the segments
s
Untitled
this query works
select count(*) from oas_log_test where service_slug='ofo4'
this returns nothing
k
can you paste the metadata of a segment?
s
what’s that?
REST GET?
k
yes, or you can use the cluster manager UI to navigate
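For reference (not from the original thread), the controller also serves segment metadata over REST; the exact route varies by Pinot version, so check the controller's Swagger UI, but it is roughly:
Copy code
GET http://<controller-host>:9000/segments/<tableName>/<segmentName>/metadata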
n
Screen Shot 2020-09-16 at 9.20.25 AM.png
s
Untitled
n
was this exact logic used to partition the stream:
Copy code
return Math.abs(value.hashCode()) % _numPartitions;
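For illustration (not from the original thread), a minimal sketch of how that function maps a value to a partition, assuming 16 partitions as in the table config above:
Copy code
public class PartitionCheck {
    // Mirrors the HashCode partition function quoted above: abs(hashCode) % numPartitions.
    static int partitionOf(String value, int numPartitions) {
        return Math.abs(value.hashCode()) % numPartitions;
    }

    public static void main(String[] args) {
        // Prints the partition the function assigns to "ofo4" with 16 partitions.
        System.out.println(partitionOf("ofo4", 16));
    }
}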
s
I do not write any code to partition, do I?
And how do I get the Zookeeper browser UI?
There are 1557 records with null in the service_slug column used for partitioning. How does Pinot handle this?
n
Data partitioning won’t happen in Pinot. The data needs to be pre-partitioned. From this doc: https://docs.pinot.apache.org/operators/operating-pinot/tuning/routing#partitioning
Copy code
After setting the above config, data needs to be partitioned with the same partition function and number of partitions before running Pinot segment build and push job for offline push. Realtime partitioning depends on the kafka for partitioning. When emitting an event to kafka, a user need to feed partitioning key and partition function for Kafka producer API
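To make the quoted doc concrete, here is a minimal producer-side sketch (not from the original thread); the topic name, bootstrap server, and payload are placeholders, and NUM_PARTITIONS must match both the Kafka topic's partition count and numPartitions in the Pinot table config:
Copy code
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PartitionedProducer {
    // Must match the Kafka topic partition count and the Pinot segmentPartitionConfig.
    static final int NUM_PARTITIONS = 16;

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        String serviceSlug = "ofo4";                      // partitioning key
        String payload = "{\"service_slug\":\"ofo4\"}";   // event payload (placeholder)

        // Same logic as Pinot's HashCode partition function, so the Kafka partition
        // lines up with the partition Pinot expects for this key.
        int partition = Math.abs(serviceSlug.hashCode()) % NUM_PARTITIONS;

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("oas_log_topic", partition, serviceSlug, payload));
        }
    }
}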
k
@Mayank ^^
s
Ah, I did not know Kafka needs to use the same partitioning. Should Pinot at least return some data or an error in such a case?
m
Pinot will treat it as unpartitioned.
k
yes, it should return data if it's unpartitioned
n
Mayank, Kishore, we put this in the metadata, so it looks like we set partitions based on whatever data was received
Copy code
{\"columnPartitionMap\":{\"service_slug\":{\"functionName\":\"HashCode\",\"numPartitions\":16,\"partitions\":[10]}}}
m
Is this in the Pinot segment?
s
select count(*) from oas_log_test where service_slug='ofo1'
This query on ofo1 returns a bit more info
n
basically partitioning is mismatched. The stream was not partitioned with the hashcode, but Pinot is expecting it to be. Pinot has stored partition info in the metadata based on whatever it is seeing, so at query time we're seeing a mismatch
yes @Mayank that is in segment metadata. Shen has posted some metadata above
m
I don't think partitioning setup issues can cause empty results
n
it will if there’s no matching partition found right?
m
I think it is from table config and not segment metadata?
s
preparing lunch. Will be back after lunch. Lemme know what other info you guys need.
m
Segment metadata should look like:
Copy code
column.service_slug.partitionFunction = Murmur
column.service_slug.numPartitions = 32
column.service_slug.partitionValues = 24
@Neha Pawar So during consumption, we identify all the partitions that the rows of a consuming segment fall into. If they belong to different partitions, then we either write multiple partitions in the metadata (or don't write it at all, I can't recall). So during pruning, a segment won't be pruned as long as there is either no partition info, or one of the partition ids in the metadata matches
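Roughly, the pruning decision described above looks like this (a sketch, not the actual Pinot code):
Copy code
import java.util.Set;

class PartitionPruneSketch {
    // A segment is skipped only when it has partition info recorded and the
    // partition computed from the predicate value is not among the recorded ids.
    static boolean canPrune(Set<Integer> segmentPartitions, String predicateValue, int numPartitions) {
        if (segmentPartitions == null || segmentPartitions.isEmpty()) {
            return false; // no partition info -> never pruned
        }
        int queryPartition = Math.abs(predicateValue.hashCode()) % numPartitions;
        return !segmentPartitions.contains(queryPartition);
    }
}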
n
oh i see. so not partitioning the stream will simply cause non-optimal querying, but there won't be any incorrectness. Got it
m
Yep
k
Is this a bug?
m
No, why
@Shen Wan Could you modify the query as
where service_slug in ('ofo1')
? I want to validate a theory
n
ofo1 is the one that returns results. You mean ofo4?
m
yeah
s
select count(*) from oas_log_test where service_slug in ('ofo4')
returns nothing
m
So I think it may not be related to partitioning
IIRC, partition pruning kicks in for equality predicates.
Is there any query that returns ofo4?
n
also can you share the broker logs from around that time, even if there are no errors you see? there might be something that pops up for us
s
recent broker logs
select distinct service_slug from oas_log_test where service_slug <> 'null'
n
could you share segment metadata from a few other segments, of different partitions? (for example, the previously shared metadata was for kafka partition 10)
s
like this?
n
yes, maybe a few more for other partitions?
trying to validate something
s
Untitled
Untitled
{
  "segment.realtime.endOffset": "322913",
  "segment.time.unit": "MILLISECONDS",
  "segment.start.time": "1600245140167",
  "segment.flush.threshold.size": "113905",
  "segment.realtime.startOffset": "209008",
  "segment.end.time": "1600258808629",
  "segment.total.docs": "113905",
  "segment.table.name": "oas_log_test_REALTIME",
  "segment.realtime.numReplicas": "2",
  "segment.creation.time": "1600246011143",
  "segment.realtime.download.url": "http://pinot-logging-controller-2.pinot-logging-controller-headless.pinot-logging.svc.cluster.local:9000/segments/oas_log_test/oas_log_test__0__8__20200916T0846Z",
  "segment.name": "oas_log_test__0__8__20200916T0846Z",
  "segment.index.version": "v3",
  "custom.map": null,
  "segment.flush.threshold.time": null,
  "segment.type": "REALTIME",
  "segment.crc": "1038864885",
  "segment.partition.metadata": "{\"columnPartitionMap\":{\"service_slug\":{\"functionName\":\"HashCode\",\"numPartitions\":16,\"partitions\":[0]}}}",
  "segment.realtime.status": "DONE"
}
n
thank you
is it possible that your stream is already partitioned by HashCode on service_slug? Or are you certain the stream has no partitioning whatsoever? Just trying to verify why the kafka partition number always matches the "partitions" in the partition metadata.
s
I do not know. This is the config
Do you have example code that shows how to set up partitioning while sending messages to Kafka?
I hope this works.
n
Hey @Shen Wan we have identified a bug in the realtime partitioning logic. Please give us some time to figure out a fix/workaround.
👍 1
@Shen Wan if you want to use partitioning, unfortunately the only way forward is to recreate this table. And before doing that, set the partitioning logic in the Kafka stream to match the logic in the Pinot table config
s
I see. This is a test table. So it is OK. What is the bug about?
k
we assume that the kafka stream is partitioned on that key (in your case, service_slug)
n
In realtime, Pinot assumes that the stream is partitioned, so the Kafka partition number is directly used as the available partition in the segment metadata. When consuming data from the partitions and creating segments, no validation is done to ensure that the data actually matches that partition, based on the column.
s
So you guys are going to make Pinot query all partitions when partition info is incorrect?
And BTW, before I drop the table and recreate it, I'd like to get some stats, like storage usage per column. Where can I get them?
n
i’m not sure we have per column storage stats. @Kishore G?
k
we do, it's in the segment directory, it's called index_map
s
So not a REST API but a file?
k
yes, for now. please file an issue; we can add that as part of segment metadata
s
in pinot server? what directory?
n
this will be whatever directory you used when starting server as -dataDir
s
I find nothing under
/var/pinot/server/data/segment
something under
…/data/index
n
do you see directories for each segment there?
s
no
actually yes, found
index_map
Are all the sizes in bytes? I add them all up and get ~60% of diskSizeInBytes. Is the remaining ~40% raw data? Does this look reasonable?
And I wonder how repartitioning is supposed to work: updating the Kafka and Pinot configs cannot be atomic, so there will be a period when Kafka's partitioning and Pinot's are out of sync, right?
n
yes it is in bytes
Which is why deleting the table was suggested: delete the Pinot table, correct the partitioning in the stream, then recreate the table.
s
That’s infeasible in prod.
k
in prod, you will have to remove the partition info from the metadata
s
rebalance does not help?
k
no, the segment processing framework that @Neha Pawar is building can help but it's not ready
s
so removing partition info will cause Pinot to treat all data as one partition?
k
yes
broker is basically looking at the segment metadata in ZK and thinks that this segment is partitioned
s
then update Kafka partition, then update Pinot partition to be consistent, right?
k
and applies the partitioning function; if it does not match, it excludes the segment from query execution
yes
is this already in production?
s
my table? no, it’s just a test.
so segments created during the interim will have bad query performance, right?
k
got it
correct. By the way, how many services do you have?
s
you mean pinot servers? 12
k
no, what is the cardinality for the partition column
s
up to 100 I think
And to my previous question: the forward index is the data, right? So why do all the sizes in index_map add up to just ~60% of diskSizeInBytes?
k
you are probably missing inverted index
s
I included that.
I included dict size, fwd index size, inv index size, range index size, bloomfilter size: all that I can find in index_map
k
can you paste the output
s
index_map
REST response
k
it does not add up?
can you do ls -l on the segment file as well
s
not any more, I already dropped the table to repartition
will try to get some stats again tomorrow
k
ok
these things should match.
s
I also wonder where the text index info is? I set a text index for columns req and resp but do not see anything related.
I deleted the table oas_log_test and created a new table oas_log_test_v2 with a new schema. But the new table contains 12 million very old records and new records are not flowing in. Do we need to reset Kafka?
n
are you using the same kafka topic? and that kafka topic has all this old data? As soon as the table is created, pinot will ingest whatever is already in the topic
you could change that to consume only from the latest messages post table creation: in the streamConfigs section, change the "offset" field from smallest to largest
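Assuming the standard Kafka consumer property key, that streamConfigs change would look roughly like:
Copy code
"streamConfigs": {
  ...
  "stream.kafka.consumer.prop.auto.offset.reset": "largest"
}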
another possibility is that the table didn't get deleted completely and the new table was created. after deleting, check the external view to make sure everything is gone
s
external view of oas_log_test is 404. external view of oas_log_test_v2 is stuck on CONSUMING.
n
why do you say it is stuck on CONSUMING? it looks like a valid EV
s
because I’m expecting new segments generated for the new data I sent to Kafka
n
you cannot see the new data in the queries?
segments are created only occasionally
s
no. always the same 1.2 million records from over 30 hours ago
even after I tried your suggestion to update the table to largest
n
updating to largest will not remove older data from the table. that signal just tells a new table where to start consumption
s
I do not see new data coming in.
n
could you start with a clean kafka topic and table?
or post any exceptions that you see
s
And I do not understand why the old data is still there even after I deleted the table and recreated it.
n
you used the same topic right? and that topic has all the data?
s
yes
I did not touch Kafka
n
then Pinot is going to ingest all the data from the topic, if you had it set to "smallest"
s
that's fine. but only 1.2M were ingested.
and nothing changed after I set it to "largest"
n
like i said above, updating an existing table to “largest” will have no effect
i cannot tell why newer events aren’t getting ingested. will need to see logs
s
I’ll delete table and recreate with “largest”
now ingestion is active. It should exceed 1.2M records soon.
You mentioned that maybe the old table was not deleted completely. How would that affect the new table consuming data from Kafka? And how can we verify that a table is completely deleted?
n
if deleting and recreating with largest fixed it for you, then it was probably not about un-deleted data. When a table is deleted, the directories for that table on the server and controller get deleted. If a new table create is issued before the delete is done, the old directories could interfere with the new table. But again, it doesn't appear to be the case for you
s
select count(*) from oas_log_test_v2
With this table, the number of docs ingested halted at 800k
no new segments created
I feel Pinot is still in an unhealthy/stuck state.
2020/09/17 18:47:41.613 ERROR [LLRealtimeSegmentDataManager_oas_log_test_v2__11__0__20200917T1810Z] [oas_log_test_v2__11__0__20200917T1810Z] Could not build segment
This log confirms the issue but provides no insight.
n
can you share the whole log
what version of Pinot are you using?
s
Untitled
And that was the whole log line.
logs around that error
Is this exception the culprit? What does it mean?
inverted index must be built on columns with dictionary?
n
can i see the full table confg and full schema?
also why does the exception get skipped in your logs? The log line is actually
Copy code
} catch (Exception e) {
        segmentLogger.error("Could not build segment", e);
but i dont see the exception
s
That's what I complained about yesterday and thought was a Pinot bug. Maybe Stackdriver is truncating?
schema
config
n
afaik, you cannot put inv index column as noDictionary
also, timeFieldSpec is deprecated
suggest you put all time fields as dateTimeFieldSpecs
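A rough shape of a dateTimeFieldSpecs entry (the field name here is a placeholder; adjust to the actual time column and granularity):
Copy code
"dateTimeFieldSpecs": [
  {
    "name": "timestamp_ms",
    "dataType": "LONG",
    "format": "1:MILLISECONDS:EPOCH",
    "granularity": "1:MILLISECONDS"
  }
]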
s
I thought an inv index would make it unnecessary to also build a dictionary?
If it has to be built on top of a dictionary, the config structure should represent that logical relationship, or it should be documented.
n
why do you want to make it no dictionary?
s
It will be UUIDs in prod.
It does not make sense to me to build a dictionary for UUIDs.
I'd like the UUID itself to be the key.