# general
g
Coming from a relational database perspective, I've had some difficulty conceptualizing what my data might look like in Pinot. Is the standard to have multiple tables like in an RDBMS and query them relationally using something like Presto, or should I strive to have fewer tables with more columns that remove the need for relational querying? If the latter is preferable, is that still the case when the table would have to contain many columns to replace the relational structure, and many of those columns would need to contain things like arrays or JSON objects?
b
Definitely the latter
I am new to Pinot as well, but from what I understand, wide, denormalized tables are the norm
t
Thanks @User! So it's preferable, within reason, to have some duplicated data, since table size in terms of rows is less of an issue for query speed than it would be in an RDBMS?
m
Thanks @User, yes, that is the more common usage. Having said that, Pinot does support lookup joins (on dimension tables). And folks have also used the Presto/Trino connector for Pinot to do more complex queries (joins, nested queries, etc.)
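For illustration, a lookup join could look something like this; a minimal sketch using the Python `pinotdb` client, where the `orders` fact table, the `customers` dimension table, and all column names are hypothetical:

```python
# Minimal sketch of a Pinot lookup join (hypothetical tables/columns).
# Assumes `customers` was created as a dimension table with a primary
# key on customerId.
from pinotdb import connect  # pip install pinotdb

# Connect to the Pinot broker (quickstart defaults assumed).
conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
curs = conn.cursor()

# LOOKUP('dimTable', 'columnToFetch', 'dimJoinKey', factJoinKeyValue)
# decorates each fact row with a value from the dimension table.
curs.execute("""
    SELECT orderId,
           amount,
           LOOKUP('customers', 'customerName', 'customerId', customerId) AS customerName
    FROM orders
    LIMIT 10
""")
for row in curs:
    print(row)
```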
@User yes, that is correct. Pinot is built for performance, and can scale very well with the size of the data (number of rows, or otherwise)
g
And it's still best practice even when those columns become more complicated? Like, would it be a problem to have an array column with 100 entries in it? What about 1,000? 10,000? Is it still preferable to have that data stored in a column at that point instead of in its own table and relationally joined? Or (and I assume this is not the right answer) is a middle-ground solution to just duplicate data across rows to avoid large array-column values?
k
Hi @User - due to how Pinot can use a dictionary to compress columnar data, “duplicate data across rows” typically doesn’t add a lot to the size of the table; at least that’s been our experience with denormalized tables.
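To make that concrete, here’s a rough sketch of dictionary encoding in plain Python (not Pinot’s actual implementation): each distinct value is stored once, and the column becomes an array of small integer ids, so a duplicated value costs one id per row rather than a full copy.

```python
# Rough sketch of dictionary encoding (not Pinot's actual code):
# each distinct value lives once in a dictionary, and the column
# itself becomes a list of small integer ids.
def dictionary_encode(column):
    dictionary = []  # distinct values, stored once
    ids = {}         # value -> dictionary index
    encoded = []     # per-row integer ids
    for value in column:
        if value not in ids:
            ids[value] = len(dictionary)
            dictionary.append(value)
        encoded.append(ids[value])
    return dictionary, encoded

# A heavily duplicated column: 6 rows, only 2 distinct values.
country = ["US", "US", "DE", "US", "DE", "US"]
dictionary, encoded = dictionary_encode(country)
print(dictionary)  # ['US', 'DE']          -> each value stored once
print(encoded)     # [0, 0, 1, 0, 1, 0]    -> one small id per row
```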
g
So then, @User, would it be a good idea to have multiple "rows" with duplicated data and a single-value column instead of one row with an array column?
k
If nothing else is changing but the value in that one column, then we use an MV (multi-value) column and have a single row.
e.g. we have a column with the unique terms, derived from another column containing a blob of text. That’s stored as an MV column, and we can easily query against those terms to filter to a subset of rows.
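A filter on that kind of column might look like this; a sketch with a hypothetical `documents` table and MV `terms` column. In Pinot, an equality predicate on an MV column matches a row if any of the row’s values match:

```python
# Sketch of filtering on a multi-value (MV) column
# (hypothetical table and column names).
from pinotdb import connect

conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
curs = conn.cursor()

# Matches every row whose `terms` array contains 'pinot'.
curs.execute("""
    SELECT docId, title
    FROM documents
    WHERE terms = 'pinot'
    LIMIT 10
""")
for row in curs:
    print(row)
```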
g
Our pipeline at present is that we have a single immutable data type represented in a base table, and then through multiple steps in our processing pipeline we add data to various tables that relate to the base/core table. When you say "If nothing else is changing but the value in that one column", are you implying that the rest of the columns should be largely immutable as well?
k
If you have, say, two MV columns A & B, and you’ve collapsed multiple rows’ values into those two columns, then you’ve lost the ability to filter to rows where column A = x and column B = y, since those values could have come from two different pre-collapse rows. But it sounds like your use case is different, in that you’re adding additional attributes to a base row, thus there’s no row collapsing going on.
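A tiny sketch of that pitfall in plain Python, with made-up data (mimicking Pinot’s MV predicate semantics, where A = x matches if any value in A equals x):

```python
# Two original rows get collapsed into one row with MV columns
# A and B, and the pairing between A and B values is lost.
original_rows = [
    {"A": "x1", "B": "y1"},
    {"A": "x2", "B": "y2"},
]

# After collapsing: one row with two MV columns.
collapsed = {"A": ["x1", "x2"], "B": ["y1", "y2"]}

# MV predicate semantics: "A = v" matches if ANY value in A equals v.
def mv_match(row, a, b):
    return a in row["A"] and b in row["B"]

# No original row had A = 'x1' AND B = 'y2' ...
print(any(r["A"] == "x1" and r["B"] == "y2" for r in original_rows))  # False
# ... but the collapsed row matches that filter anyway.
print(mv_match(collapsed, "x1", "y2"))  # True
```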
g
Awesome, thanks for your input 🙂
k
this is such an amazing thread. Thanks Ken!