# general
l
Hi, does Pinot have a limitation on the number of tables? Would it support millions of tables?
k
I don’t think we have seen millions of tables in production. The max is around 10k.
Why do you ask?
l
we have a use case in which our users can create their own dynamic-schema tables, and we need to allow them to query (filter/sort) by any column, so we expect to have a lot of those dynamic tables
m
Is the entire schema dynamic? Or users can add dynamic columns to a fixed schema?
l
yes, each table has its own structure (columns)
m
I see
k
When we had a similar requirement (using Solr) we modeled it as one physical table (in our case, index) per user, and then the virtual table columns were <virtual table name>-<column name>, with one more field (the virtual table name) per row. No idea if something like that would work well for Pinot, or your use case.
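A minimal sketch of that naming scheme, assuming a simple document model (the function and field names here are made up, not from Solr or Pinot):

```python
# Hypothetical sketch of the one-physical-table-per-user idea:
# virtual-table columns get prefixed with the virtual table name,
# and every row carries the virtual table name in a dedicated field.

def to_physical_doc(virtual_table: str, row: dict) -> dict:
    """Flatten a virtual-table row into a document for the shared physical table."""
    doc = {"virtual_table": virtual_table}
    for col, value in row.items():
        doc[f"{virtual_table}-{col}"] = value
    return doc

doc = to_physical_doc("orders", {"amount": 12345, "paid": True})
# doc == {"virtual_table": "orders", "orders-amount": 12345, "orders-paid": True}
```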
m
Oh that’s a nice idea.
k
That should work with Pinot as well, with a json index
l
Why json?
k
Schema: tablename, json
The json column is the actual row..
l
Oh, got you. I think Ken actually meant having multiple actual columns
k
Create tablename as a sorted index
You can, but the table will become too wide
l
So the json would be a flat object such as
{ Field1: 12345, Field2: true, Field3: "some text" }
Or must they all share the same properties and have something like
{ Values: [ { col: 'field1', indexedValue: 12345 }, { col: 'field2', indexedValue: true } ] }
k
So the json would be a flat object such as { Field1: 12345, Field2: true, Field3: "some text" }
This will be a bit more storage than creating millions of tables but a scalable solution
You can partition and sort by tablename
k
I didn’t think you could use star-tree indexes with JSON fields, but maybe I’m misremembering from the presentation
k
You can’t use StarTree with json fields. My assumption here is millions of tables, with each table probably having millions of rows
A StarTree index won’t be needed here
k
My experience with user-created tables (from back in 2000) was that the vast majority were very small, like a few hundred to a few thousand rows. And then you’d get a few outliers with many millions of rows. It would be great if you could use JSON for the common case, and either individual tables or the one-user-many-tables approach for the big ones.
k
yes, you can always create separate tables for outliers
l
@User if we go with your idea, does it mean it's like not having a column store in practice? I'm trying to understand if we'll lose the benefits of a column store, or if there are other cons in general. The only one I could think of is losing the ability to have different kinds of indexes
k
You can still get the benefit of indexes; see the json index
We are working on storing json in a columnar format
So you will start benefiting from that automatically
Would you mind creating an issue?
👍 1
This is an important feature and will be good to document and track
l
Sure, I will. How exactly would you store it in a columnar format? Would each field be treated like a different column in storage?
Regarding losing the benefits of different kinds of indexes, I meant that we'll lose the ability to match an appropriate index to the usage. For example, an index appropriate for high cardinality, or an index for full-text search, etc. Instead we would have just the inverted index (but maybe that's enough)
k
I agree with you, but we have the right primitives to support a different index for each field in the json.
So while you may not get the best of Pinot today, I think this is the right direction
Btw, I am super impressed with your questions.
🙏 1
l
Thanks 🙂
And thank you so much for the help
@User looking at the documentation, I see the index creation command expects a
jsonIndexColumns
field. So since our "tables" are dynamic, how can we pass a flat object with dynamic fields? It sounds more like we need what I suggested with the
indexedValue
field for having general-purpose dynamic-schema "tables":
jsonIndexColumns: ['indexedValue'] vs jsonIndexColumns: ['field1', 'field2', 'field3', ....]
k
thanks
l
following your json index suggestion, I wonder: what are its advantages over having a regular table with a schema of <table id>, <row id>, <column id>, <cell value>?
k
this is another approach and it should work, but you will lose filtering across multiple columns:
select count(*) from T where c1 = v1 and c2 = v2
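A toy illustration (made-up data) of why that query breaks down in the <table id>, <row id>, <column id>, <cell value> model: each physical row holds exactly one column/value pair, so no single row can satisfy predicates on two different columns at once.

```python
# Rows in the one-(column, value)-pair-per-physical-row model.
rows = [
    {"table_id": "t1", "row_id": 1, "col": "c1", "value": "v1"},
    {"table_id": "t1", "row_id": 1, "col": "c2", "value": "v2"},
    {"table_id": "t1", "row_id": 2, "col": "c1", "value": "v1"},
]

# Naive translation of `WHERE c1 = v1 AND c2 = v2` matches nothing,
# because both predicates can never hold on the same physical row.
both = [r for r in rows
        if (r["col"] == "c1" and r["value"] == "v1")
        and (r["col"] == "c2" and r["value"] == "v2")]
assert both == []  # the intersection is empty
```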
l
we can't lose that.. but why is that so? Can't the index be used multiple times for the same query?
how is it different from
c1 = v1 or c1 = v2
k
it can, but in the model you described the intersection will be an empty set: each physical row holds only one column, so the rows matching c1 = v1 and the rows matching c2 = v2 are disjoint
l
mm, I got you now. I think c1 = v1 and c2 = v2 is possible, but it would require using "or" and then running application-side logic
or maybe using "group by" on the "row id"
k
right
group by + having, maybe
which might not be a bad idea if you have very few rows per tableId
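A sketch of that OR + group by + having rewrite, simulated in Python on made-up rows; the SQL shape would be roughly `... WHERE (col = 'c1' AND value = 'v1') OR (col = 'c2' AND value = 'v2') GROUP BY row_id HAVING COUNT(DISTINCT col) = 2`:

```python
from collections import defaultdict

# Rows in the <row id>, <column id>, <cell value> model (made-up data).
rows = [
    {"row_id": 1, "col": "c1", "value": "v1"},
    {"row_id": 1, "col": "c2", "value": "v2"},
    {"row_id": 2, "col": "c1", "value": "v1"},
]

# WHERE clause: the OR of the per-column predicates; track which
# wanted columns each row id matched.
wanted = {("c1", "v1"), ("c2", "v2")}
hits = defaultdict(set)
for r in rows:
    if (r["col"], r["value"]) in wanted:
        hits[r["row_id"]].add(r["col"])

# HAVING clause: keep only row ids that matched every wanted column.
matching_row_ids = sorted(rid for rid, cols in hits.items() if len(cols) == len(wanted))
# matching_row_ids == [1]
```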
l
with the json index I can see how the inverted indexes would be used for filters. But how would it be used for sorts, or numeric range filters ("greater than..." conditions)?
k
you can use the json_extract UDF
l
I mean, not as a user of Pinot
I'm asking how Pinot would use the index internally
k
today, it's not possible, but the way the json index is built, we can support any index on individual fields
today, by default, it adds an inverted index
my plan was to support a different index on each field
l
so our use case cannot be supported today with the json index, since we need to allow range filters and sorts?
k
it will work in Pinot, since you can use json_extract expression-based filtering, but you can't capitalize on the json index
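A sketch of what expression-based filtering means in practice, simulated in Python on made-up data: every candidate row's JSON is parsed and the predicate is evaluated row by row, while an indexed predicate such as tableId = 'X' first narrows the candidate set. (In Pinot SQL the per-row filter would look something like `JSON_EXTRACT_SCALAR(payload, '$.field1', 'LONG') > 100`; the column and field names here are assumptions.)

```python
import json

rows = [
    {"tableId": "X", "payload": json.dumps({"field1": 150})},
    {"tableId": "X", "payload": json.dumps({"field1": 50})},
    {"tableId": "Y", "payload": json.dumps({"field1": 999})},
]

# Indexed predicate: narrows the scan to one virtual table's rows.
candidates = [r for r in rows if r["tableId"] == "X"]

# Expression-based predicate: parse and evaluate per row, no index help.
matches = [r for r in candidates
           if json.loads(r["payload"]).get("field1", 0) > 100]
assert len(matches) == 1
```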
l
You mean the query would work, but just by doing pure parallel CPU computation on all the rows, without taking advantage of the index?
k
yes
not all rows; I am guessing you will have some predicate on other columns
like at least tableId = X
l
Yeah, I meant all the "virtual table" rows