# general
l
Hi, does Pinot have a limitation on the number of tables? Would it support millions of tables?
k
I don’t think we have seen millions of tables in production. The max is around 10k.
Why do you ask?
l
we have a use case in which our users can create their own dynamic-schema tables, and we need to allow them to query (filter/sort) by any column, so we expect to have a lot of those dynamic tables
m
Is the entire schema dynamic? Or users can add dynamic columns to a fixed schema?
l
yes, each table has its own structure (columns)
m
I see
k
When we had a similar requirement (using Solr) we modeled it as one physical table (in our case, index) per user, and then the virtual table columns were <virtual table name>-<column name>, with one more field (the virtual table name) per row. No idea if something like that would work well for Pinot, or your use case.
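A minimal sketch of that naming scheme, assuming a simple document model (the function and field names here are made up, not from Solr or Pinot):

```python
# Hypothetical sketch of the one-physical-table-per-user idea:
# virtual-table columns get prefixed with the virtual table name,
# and every row carries the virtual table name in a dedicated field.

def to_physical_doc(virtual_table: str, row: dict) -> dict:
    """Flatten a virtual-table row into a document for the shared physical table."""
    doc = {"virtual_table": virtual_table}
    for col, value in row.items():
        doc[f"{virtual_table}-{col}"] = value
    return doc

doc = to_physical_doc("orders", {"amount": 12345, "paid": True})
# doc == {"virtual_table": "orders", "orders-amount": 12345, "orders-paid": True}
```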
m
Oh that’s a nice idea.
k
That should work with Pinot as well, with a json index
l
Why json?
k
Schema: tablename, json
The json column is the actual row..
l
Oh, got you. I think Ken actually meant having multiple actual columns
k
Create tablename as a sorted index
You can, but the table will become too wide
l
So the json would be a flat object such as
{ Field1: 12345, Field2: true, Field3: "some text" }
Or must they all share the same properties and have something like
{ Values: [ { col: 'field1', indexedValue: 12345 }, { col: 'field2', indexedValue: true } ] }
k
So the json would be a flat object such as { Field1: 12345, Field2: true, Field3: "some text" }
This will be a bit more storage than creating millions of tables but a scalable solution
You can partition and sort by tablename
k
I didn’t think you could use star-tree indexes with JSON fields, but maybe I’m misremembering from the presentation
k
You can’t use StarTree with json fields. My assumption here is millions of tables, with each table probably having millions of rows
A StarTree index won’t be needed here
k
My experience with user-created tables (from back in 2000) was that the vast majority were very small, like a few hundred to a few thousand rows. And then you’d get a few outliers with many millions of rows. It would be great if you could use JSON for the common case, and either individual tables or the one-user-many-tables approach for the big ones.
k
yes, you can always create separate tables for outliers
l
@User if we go with your idea, does it mean it's like not having a column store in practice? I'm trying to understand if we'll lose the benefits of a column store, or if there are other cons in general. The only one I could think of is losing the ability to have different kinds of indexes
k
You can still get the benefit of indexes; see the json index
We are working on storing json in a columnar format
So you will start benefiting from that automatically
Would you mind creating an issue?
👍 1
This is an important feature and will be good to document and track
l
Sure, I will. How exactly would you store it in a columnar format? Would each field be treated like a different column in storage?
Regarding losing the benefits of different kinds of indexes, I meant that we'll lose the ability to match an appropriate index to the usage. For example, an index appropriate for high cardinality, or an index for full-text search, etc. Instead we would have just the inverted index (but maybe that's enough)
k
I agree with you, but we have the right primitives to support a different index for each field in the json.
So while you may not get the best of Pinot today, I think this is the right direction
Btw, I am super impressed with your questions.
🙏 1
l
Thanks 🙂
And thank you so much for the help
@User looking at the documentation, I see the index creation command expects a
jsonIndexColumns
field. So since our "tables" are dynamic, how can we pass a flat object with dynamic fields? It sounds more like we need what I suggested with the
indexedValue
field for having general-purpose dynamic-schema "tables":
jsonIndexColumns: ['indexedValue'] vs jsonIndexColumns: ['field1', 'field2', 'field3', ....]
k
thanks
l
following your json index suggestion, I wonder: what are its advantages over having a regular table with a schema of <table id>, <row id>, <column id>, <cell value>?
k
this is another approach and it should work, but you will lose filtering across multiple columns:
select count(*) from T where c1 = v1 and c2 = v2
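A toy illustration (made-up data) of why that query breaks down in the <table id>, <row id>, <column id>, <cell value> model: each physical row holds exactly one column/value pair, so no single row can satisfy predicates on two different columns at once.

```python
# Rows in the one-(column, value)-pair-per-physical-row model.
rows = [
    {"table_id": "t1", "row_id": 1, "col": "c1", "value": "v1"},
    {"table_id": "t1", "row_id": 1, "col": "c2", "value": "v2"},
    {"table_id": "t1", "row_id": 2, "col": "c1", "value": "v1"},
]

# Naive translation of `WHERE c1 = v1 AND c2 = v2` matches nothing,
# because both predicates can never hold on the same physical row.
both = [r for r in rows
        if (r["col"] == "c1" and r["value"] == "v1")
        and (r["col"] == "c2" and r["value"] == "v2")]
assert both == []  # the intersection is empty
```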
l
we can't lose that.. but why is that so? Can't the index be used multiple times for the same query?
how is it different from
c1 = v1 or c1 = v2
k
it can, but in the model you described the intersection will be an empty set: each physical row holds only one column, so the rows matching c1 = v1 and the rows matching c2 = v2 are disjoint
l
mm, I got you now. I think c1 = v1 and c2 = v2 is possible, but it would require using "or" and then running application-side logic
or maybe using "group by" on the "row id"
k
right
group by + having, maybe
which might not be a bad idea if you have very few rows per tableId
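A sketch of that OR + group by + having rewrite, simulated in Python on made-up rows; the SQL shape would be roughly `... WHERE (col = 'c1' AND value = 'v1') OR (col = 'c2' AND value = 'v2') GROUP BY row_id HAVING COUNT(DISTINCT col) = 2`:

```python
from collections import defaultdict

# Rows in the <row id>, <column id>, <cell value> model (made-up data).
rows = [
    {"row_id": 1, "col": "c1", "value": "v1"},
    {"row_id": 1, "col": "c2", "value": "v2"},
    {"row_id": 2, "col": "c1", "value": "v1"},
]

# WHERE clause: the OR of the per-column predicates; track which
# wanted columns each row id matched.
wanted = {("c1", "v1"), ("c2", "v2")}
hits = defaultdict(set)
for r in rows:
    if (r["col"], r["value"]) in wanted:
        hits[r["row_id"]].add(r["col"])

# HAVING clause: keep only row ids that matched every wanted column.
matching_row_ids = sorted(rid for rid, cols in hits.items() if len(cols) == len(wanted))
# matching_row_ids == [1]
```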
l
with the json index I can see how the inverted indexes would be used for filters. But how would it be used for sorts, or numeric range filters ("greater than..." conditions)?
k
you can use the json_extract UDF
l
I mean, not as a user of Pinot
I'm asking how Pinot would use the index internally
k
today, it's not possible, but the way the json index is built, we can support any index on individual fields
today, by default, it adds an inverted index
my plan was to support a different index on each field
l
so our use case cannot be supported today with the json index, since we need to allow range filters and sorts?
k
it will work in Pinot, since you can use json_extract expression-based filtering, but you can't capitalize on the json index
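A sketch of what expression-based filtering means in practice, simulated in Python on made-up data: every candidate row's JSON is parsed and the predicate is evaluated row by row, while an indexed predicate such as tableId = 'X' first narrows the candidate set. (In Pinot SQL the per-row filter would look something like `JSON_EXTRACT_SCALAR(payload, '$.field1', 'LONG') > 100`; the column and field names here are assumptions.)

```python
import json

rows = [
    {"tableId": "X", "payload": json.dumps({"field1": 150})},
    {"tableId": "X", "payload": json.dumps({"field1": 50})},
    {"tableId": "Y", "payload": json.dumps({"field1": 999})},
]

# Indexed predicate: narrows the scan to one virtual table's rows.
candidates = [r for r in rows if r["tableId"] == "X"]

# Expression-based predicate: parse and evaluate per row, no index help.
matches = [r for r in candidates
           if json.loads(r["payload"]).get("field1", 0) > 100]
assert len(matches) == 1
```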
l
You mean the query would work, but just by doing pure parallel CPU computation on all the rows, without taking advantage of the index?
k
yes
not all rows; I am guessing you will have some predicate on other columns
like at least tableId = X
l
Yeah, I meant all the "virtual table" rows