Hi folks let me ask for your opinion on modeling tables in P Apache Pinot #troubleshooting

Hi folks, let me ask for your opinion on modeling ...

Diogo Baeder

05/02/2022, 1:53 PM

Hi folks, let me ask for your opinion on modeling tables in Pinot. Suppose (just a fake case for simple illustration) that you had a data source where you have users having different objects at home, where the types and names of these objects are dynamic, and you wanted to have a way to store them in such a way that you could be able to query them by objects amounts, like finding users that have 2 cars and 2 TVs. Considering that you don't know what objects would be coming in beforehand, how would you model this? JSON field for the objects, to keep them in a single row that represents the individual user? Spreading the objects as different rows and then aggregating and filtering at the application side? How would you guys model this?

Kishore G

05/02/2022, 2:00 PM

Model as json type columns as long the objects a single user holds does not run into hundreds of thousands

Kishore G

05/02/2022, 2:01 PM

This will be the fastest and most efficient

Diogo Baeder

05/02/2022, 2:02 PM

Hmm... In some cases I might have about 100 items or so, but I hope this doesn't turn out to be a problem

Mayank

05/02/2022, 2:06 PM

Should be fine

Kishore G

05/02/2022, 2:09 PM

One thing missing in Pinot right now is ability to have configure indexes for each field within a json

Kishore G

05/02/2022, 2:09 PM

We only do inverted index rt now by default

Kishore G

05/02/2022, 2:10 PM

Will be great if you can file an issue for this

Diogo Baeder

05/02/2022, 2:18 PM

I can, yes. Will do ASAP. Thanks again! 🙂

Mayank

05/02/2022, 2:34 PM

Also is there structure to it, or you just want to do text match on bunch of strings?

Diogo Baeder

05/02/2022, 2:46 PM

Something like, imagine a user has: • TVs: 2 • Cars: 1 And another user has: • TVs: 1 • Dogs: 4 So each user has a certain "thing" and then a certain amount of that thing. Just one level, no complex structure really. But the problem is that users would have different things, hence me not being able to define them as columns.

Kishore G

05/02/2022, 2:47 PM

yeah json is right

Diogo Baeder

05/02/2022, 2:58 PM

Initially I structured this as each "thing" being a separate row, but it turned out to have an obvious problem: it would be impossible to filter users that have "2 TVs and 1 car", for example.

Alice

05/02/2022, 3:11 PM

Maybe you can transform one json to multiple records with TVs/Dogs/.. stored in a field named as type and 2/1/.. stored in a field named as amount.

Kishore G

05/02/2022, 3:13 PM

yes, thats also another idea and commonly used. the only drawback with that is if you want to get count of users who have TVs and Dogs. That will require distinctCount vs Count with json.

Diogo Baeder

05/02/2022, 3:13 PM

Thanks for the hint, Alice! But I was doing that already, and it didn't solve my problem because I ended up not being able to correlate different rows in the same query (e.g. "users that have 2 TVs, and either 2 dogs or 2 cars")

Kishore G

05/02/2022, 3:13 PM

how big is the dataset

Diogo Baeder

05/02/2022, 3:14 PM

I don't know yet how big the dataset will be, in total it will probably be in the order of a few terabytes.

Open in Slack

Previous Next