# general
g
Coming from a relational database perspective, I've had some difficulty conceptualizing what my data might look like in Pinot. Is the standard to have multiple tables like in an RDBMS and query them relationally using something like Presto, or should I strive to have fewer tables with more columns that remove the need for relational querying? If the latter is preferable, is that still the case when the table would have to contain many columns to replace the relational structure, and many of those columns would need to contain things like arrays or JSON objects?
b
Definitely the latter
I am new to Pinot as well, but from what I understand, wide, denormalized tables are the norm
t
Thanks @User! So it's preferable, within reason, to have some duplicated data, since table size in terms of rows is less of an issue for query speed than it would be in an RDBMS?
m
Thanks @User, yes, that is the more common usage. Having said that, Pinot does support lookup joins (on dimension tables). And folks have also used the Presto/Trino connector for Pinot to do more complex queries (joins, nested queries, etc.)
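For illustration, a lookup join could look something like this; a minimal sketch using the Python `pinotdb` client, where the `orders` fact table, the `customers` dimension table, and all column names are hypothetical:

```python
# Minimal sketch of a Pinot lookup join (hypothetical tables/columns).
# Assumes `customers` was created as a dimension table with a primary
# key on customerId.
from pinotdb import connect  # pip install pinotdb

# Connect to the Pinot broker (quickstart defaults assumed).
conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
curs = conn.cursor()

# LOOKUP('dimTable', 'columnToFetch', 'dimJoinKey', factJoinKeyValue)
# decorates each fact row with a value from the dimension table.
curs.execute("""
    SELECT orderId,
           amount,
           LOOKUP('customers', 'customerName', 'customerId', customerId) AS customerName
    FROM orders
    LIMIT 10
""")
for row in curs:
    print(row)
```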
@User yes, that is correct. Pinot is built for performance, and can scale very well with the size of the data (number of rows, or otherwise)
g
And it's still best practice even when those columns become more complicated? Like, would it be a problem to have an array column with 100 entries in it? What about 1,000? 10,000? Is it still preferable to have that data stored in a column at that point instead of in its own table and relationally joined? Or (and I assume this is not the right answer) is a middle-ground solution to just duplicate data across rows to avoid large array-column values?
k
Hi @User - due to how Pinot can use a dictionary to compress columnar data, “duplicate data across rows” typically doesn’t add a lot to the size of the table; at least that’s been our experience with denormalized tables.
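To make that concrete, here’s a rough sketch of dictionary encoding in plain Python (not Pinot’s actual implementation): each distinct value is stored once, and the column becomes an array of small integer ids, so a duplicated value costs one id per row rather than a full copy.

```python
# Rough sketch of dictionary encoding (not Pinot's actual code):
# each distinct value lives once in a dictionary, and the column
# itself becomes a list of small integer ids.
def dictionary_encode(column):
    dictionary = []  # distinct values, stored once
    ids = {}         # value -> dictionary index
    encoded = []     # per-row integer ids
    for value in column:
        if value not in ids:
            ids[value] = len(dictionary)
            dictionary.append(value)
        encoded.append(ids[value])
    return dictionary, encoded

# A heavily duplicated column: 6 rows, only 2 distinct values.
country = ["US", "US", "DE", "US", "DE", "US"]
dictionary, encoded = dictionary_encode(country)
print(dictionary)  # ['US', 'DE']          -> each value stored once
print(encoded)     # [0, 0, 1, 0, 1, 0]    -> one small id per row
```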
g
So then, @User, would it be a good idea to have multiple "rows" with duplicated data and a single-value column instead of one row with an array column?
k
If nothing else is changing but the value in that one column, then we use an MV (multi-value) column and have a single row.
e.g. we have a column with the unique terms, derived from another column containing a blob of text. That’s stored as an MV column, and we can easily query against those terms to filter to a subset of rows.
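A filter on that kind of column might look like this; a sketch with a hypothetical `documents` table and MV `terms` column. In Pinot, an equality predicate on an MV column matches a row if any of the row’s values match:

```python
# Sketch of filtering on a multi-value (MV) column
# (hypothetical table and column names).
from pinotdb import connect

conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
curs = conn.cursor()

# Matches every row whose `terms` array contains 'pinot'.
curs.execute("""
    SELECT docId, title
    FROM documents
    WHERE terms = 'pinot'
    LIMIT 10
""")
for row in curs:
    print(row)
```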
g
Our pipeline at present is that we have a single immutable data type represented in a base table, and then through multiple steps in our processing pipeline we add data to various tables that relate to the base/core table. When you say "If nothing else is changing but the value in that one column", are you implying that the rest of the columns should be largely immutable as well?
k
If you have, say, two MV columns A & B, and you’ve collapsed multiple rows’ values into those two columns, then you’ve lost the ability to filter to rows where column A = x and column B = y, since those values could have come from two different pre-collapse rows. But it sounds like your use case is different, in that you’re adding additional attributes to a base row, thus there’s no row collapsing going on.
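A tiny sketch of that pitfall in plain Python, with made-up data (mimicking Pinot’s MV predicate semantics, where A = x matches if any value in A equals x):

```python
# Two original rows get collapsed into one row with MV columns
# A and B, and the pairing between A and B values is lost.
original_rows = [
    {"A": "x1", "B": "y1"},
    {"A": "x2", "B": "y2"},
]

# After collapsing: one row with two MV columns.
collapsed = {"A": ["x1", "x2"], "B": ["y1", "y2"]}

# MV predicate semantics: "A = v" matches if ANY value in A equals v.
def mv_match(row, a, b):
    return a in row["A"] and b in row["B"]

# No original row had A = 'x1' AND B = 'y2' ...
print(any(r["A"] == "x1" and r["B"] == "y2" for r in original_rows))  # False
# ... but the collapsed row matches that filter anyway.
print(mv_match(collapsed, "x1", "y2"))  # True
```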
g
Awesome, thanks for your input 🙂
k
this is such an amazing thread. Thanks Ken!