# general
e
hi folks 🙂 Good to be here! I’d really love to learn more about the capabilities of the star-tree index. In particular, I’m curious to know how it might enable ingesting raw data and creating “materialized views” for specific use cases later, once those use cases are known. This might be considered an anti-pattern for Pinot, and if so that would be good to know too 👍
m
Currently, the star-tree index is created based on user configs: https://docs.pinot.apache.org/basics/indexing/star-tree-index
👀 1
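For reference, a star-tree index is declared inside the table config linked above. Below is a minimal sketch only; the dimensions, metric, and threshold shown (`country`, `browser`, `impressions`, `maxLeafRecords`) are hypothetical placeholders, not anything discussed in this thread.

```json
{
  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["country", "browser"],
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": ["COUNT__*", "SUM__impressions"],
        "maxLeafRecords": 10000
      }
    ]
  }
}
```

Each `functionColumnPairs` entry is pre-aggregated across the chosen dimension order, which is why the index pays off once you know what your "materialized view" queries look like.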
k
Hi Evan! It's definitely not an anti-pattern. You've come to the right place. Creating materialized views by transforming raw data sources can be done using a variety of techniques during ingestion. Specifically, transform functions can mutate your raw data using Groovy scripts, and there are other transform capabilities worth checking out in the docs. As for the star-tree index, it's a good way to speed up your query responses once you've figured out what your materialized view is going to look like.
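As a rough illustration of the Groovy transform functions mentioned above, a table's ingestion config can declare transforms like the sketch below; the column names (`firstName`, `lastName`, `fullName`) are made-up placeholders.

```json
{
  "ingestionConfig": {
    "transformConfigs": [
      {
        "columnName": "fullName",
        "transformFunction": "Groovy({firstName + ' ' + lastName}, firstName, lastName)"
      }
    ]
  }
}
```

The Groovy snippet runs per record at ingestion time, so the derived column is materialized in the segment rather than computed at query time.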
What does your domain data look like? Are you looking to do metric aggregation or more focused on building query models with multiple dimensions?
e
@User that’s interesting to know RE Groovy scripts. Thinking about it from a traditional RDBMS perspective and using a classic “blog posts” example, would it make sense to ingest `posts`, `comments`, `likes`, etc. as distinct data sets, then later make use of Groovy scripts to effectively join those data sets at ingestion time (rather than search time)?
k
There are multiple strategies here. If you're streaming your data to a real-time table using Kafka, you have two potential approaches. You can implement your messaging at the application level, which would send your domain objects as payloads to three separate topics. At that point, you will need to use something like Flink to join those three streams together into a single materialized view (rather simple in practice). The joined materialized view is then sent to a view topic that is ingested into Pinot and ready for query.
At the point where your materialized view gets ingested, you can do more transform functions and pre-aggregations as you stream in. This can be helpful if you need to mutate your materialized view for different consumers or dashboards (multiple tables but same joined view from RDBMS).
There's also an emerging practice that is preferable when it comes to implementation and maintenance down the road. You can use CDC to stream out data changes at the database level on a per-table basis. So your `posts`, `comments`, and `likes`, as they sit in different tables... whenever they are updated, deleted, or created, a Kafka event is sent per table to a respective topic. Then the rest is as I said before with Flink.
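A rough sketch of what the Flink side could look like, assuming Debezium-formatted CDC events on topics named `posts` and `comments`, plus a `post_view` topic that the Pinot real-time table consumes. All table names, fields, and connector settings here are hypothetical (`likes` would be declared and joined the same way):

```sql
-- Source tables over the per-table CDC topics (fields are illustrative)
CREATE TABLE posts (
  post_id BIGINT,
  title   STRING,
  author  STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'posts',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'debezium-json'
);

CREATE TABLE comments (
  comment_id BIGINT,
  post_id    BIGINT
) WITH (
  'connector' = 'kafka',
  'topic' = 'comments',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'debezium-json'
);

-- Sink table over the joined "materialized view" topic ingested by Pinot
CREATE TABLE post_view (
  post_id       BIGINT,
  title         STRING,
  author        STRING,
  comment_count BIGINT,
  PRIMARY KEY (post_id) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'post_view',
  'properties.bootstrap.servers' = 'localhost:9092',
  'key.format' = 'json',
  'value.format' = 'json'
);

-- Join the change streams into one denormalized row per post
INSERT INTO post_view
SELECT p.post_id, p.title, p.author, COUNT(c.comment_id) AS comment_count
FROM posts p
LEFT JOIN comments c ON c.post_id = p.post_id
GROUP BY p.post_id, p.title, p.author;
```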
Does that explanation help?
(Also, with upsert, you can make sure that only the most current version of a domain object is made available for query using SQL on Pinot)
🙌 1
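For context on the upsert remark above, upsert is turned on in the real-time table config. A minimal sketch, with the rest of the table config omitted (the `strictReplicaGroup` routing is what the Pinot docs call for alongside upsert):

```json
{
  "upsertConfig": { "mode": "FULL" },
  "routing": { "instanceSelectorType": "strictReplicaGroup" }
}
```

The table's schema also has to declare the key, e.g. `"primaryKeyColumns": ["primaryKey"]`, and the Kafka topic should be partitioned by that key, as the example further down assumes.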
e
ok ya, that makes sense using an ETL platform to do the work of joining data sets. I suppose my concern is more in the area of product evolution and the maintenance that would go along with it. For example, with an RDBMS, maybe a new table is added later on, resulting in a new dimension for `posts`. To then start making use of the new dimension, the new table can be joined; it might not be very efficient, but it can start answering questions about the data right away. And testing/local development can be done relatively cheaply by inserting data and joining at query time. What does the developer workflow look like to support the same kind of feature evolution in Pinot? It seems complex to mimic an ETL pipeline for local development, for example.
k
Just double checking with Mayank to make sure I answer this right. One sec.
🙏 1
Example:
• `fooTable` has columns `a, b, c` in the Pinot schema configuration, as well as `primaryKey`
• Upsert is enabled and partitioned on the `primaryKey`
• The table is real-time and has been populated with 1,000 records
• Now I change the Pinot schema to `a, b, c, d`
• The Kafka payload has been modified to stream in the new column for `d`
• To make queries return correctly after making this change in Pinot, you need to issue a `reload` on the segments of the `fooTable`
• This will populate the `d` column with the value null for the 1,000 existing rows
• Because upsert is enabled, when you populate the `d` column in your RDBMS, the old 1,000 rows will be updated with the current version of the `d` value
So, with that workflow in mind, your total work is to update the schema of your RDBMS table that is configured to use CDC to stream record updates to Kafka. A simple modification of the Pinot schema configuration adds the new field from the database. To operationalize the change, you simply reload the segments on your table (a simple and safe command). Then you're off to the races.
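As a concrete sketch of the Pinot side of that change (the data types here are assumptions, not from the thread), the updated `fooTable` schema simply gains the `d` column:

```json
{
  "schemaName": "fooTable",
  "primaryKeyColumns": ["primaryKey"],
  "dimensionFieldSpecs": [
    { "name": "primaryKey", "dataType": "STRING" },
    { "name": "a", "dataType": "STRING" },
    { "name": "b", "dataType": "STRING" },
    { "name": "c", "dataType": "STRING" },
    { "name": "d", "dataType": "STRING" }
  ]
}
```

After pushing the updated schema through the controller, the segment reload described above backfills `d` with null for the existing rows.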
Does that make sense?
e
Yep, all makes sense 👍 Thanks for sticking with me.  I wonder if you can comment on another area of developer experience. As a developer who would be building dashboards consuming the data (ie making queries), making use of Postgres (just as a concrete example RDBMS) for local development is pretty feasible: spin up a Docker container of Postgres, apply migrations as applicable, insert some data, and start making queries.  With what’s described above, local development sounds a bit heavy. One would need all the described components running locally in order to test changes and develop new features. Is there a trick to that, or established pattern for local development with an analytics DB which requires ETL? Maybe just synthesizing data locally that adheres to the expected format which the ETL job will produce/input to the DB? Thanks again for all your time 🙏
Sorry to create a thread within a thread, but I was also curious to follow up on your comments about `upsert`. What are the semantics of upserts in Pinot? Is it partial upsert or full overwrite? Ex. in the example you gave above RE adding column `d` to an existing schema, could the value for `d` be added to existing entries with the same `primaryKey` and values for `a, b, c` already present, by sending a payload with only the `primaryKey` and value for `d`? Or would the payload need to include `a, b, c, d` with a given `primaryKey`?
k
I got caught up today in some other things. Let me get back to you by tomorrow morning. Thanks Evan!
❤️ 1
@User To your first point about the developer experience: that sounds about right. We need to come up with ways to make that easier and more fluid, and feedback like this is what helps us get there. In the coming months I am going to focus heavily on developer experience, so we'll be looking to optimize as much as possible to make that workflow faster and more iterative.
For upserts, we save the log of previous updates and you can query them by disabling upsert at query time. https://twitter.com/kennybastani/status/1407339535667798021?s=20
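If it helps, this is roughly what that looks like in SQL. The table and key values are made up, and the exact query-option syntax has varied across Pinot versions, so check the docs for yours:

```sql
-- With upsert applied (the default): only the latest version per primary key
SELECT * FROM fooTable WHERE primaryKey = 'abc123';

-- Skip upsert at query time to see the retained log of previous versions
SELECT * FROM fooTable WHERE primaryKey = 'abc123' OPTION(skipUpsert=true);
```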
On the second part of your question about upserts: your updates will be applied in full to the record with that primary key, so you won't be able to update just a single column's value.
In the future we may be able to do something like a fold left on the log of updates by primary key, which would merge the previous updates with current updates, allowing you to fold in column-level updates into a single materialized view.
e
Thanks for all the answers here! I still need to digest the content 🙏 Thanks for all your time here
👍 1
m
👍