hi everyone, I’m wondering how I can assess whethe...
# getting-started
m
Hi everyone, I’m wondering how I can assess whether Pinot is the right choice for me. I have a very large table with ~1 GB of data ingested each day. I want to write several thousand queries on top of this data that run continuously, transform the data in Python, and then write a table (each active query has its own table/sink). There may then be other queries that continuously query the table that was just written. I’ve started looking into Pinot and read the streaming pages in the docs. I see a lot of support for streaming on the ingestion side, but not so much for running queries? I have also looked into Delta Live Tables on Databricks, which seems fairly close to what I have in mind, but I couldn’t tell whether running thousands of queries continuously on one large table would cause issues.
k
Hi Matt, thanks for considering Pinot. It looks like you are trying to use Pinot as an ETL tool? I would not use Pinot for that, and I don't think Delta Live Tables is right either. We might be able to help you better if you first describe the use case and the key goals: is it freshness, query latency, or query concurrency?
m
Hi Kishore, thanks for the reply! I am building a platform for querying web traffic. The large table contains network requests (typically HTTP transactions), and I let my end users write queries against this table. Each query declares the traffic patterns it is interested in, and is typically paired with an ETL job that queries and/or transforms the network data. Some examples include:
• log the status code and timestamp (typical ping-as-a-service scenario)
• parse an HTML response body and grab the data at selector X (typical web scraping scenario)
• parse and transform a JSON response body
• compare the HTTP response body with the value of the Content-Length header to detect errors
• aggregate queries, e.g. report analytics on the number of HTTP responses that included the Vary header
There are multiple layers of queries. For example, there may be queries that extract Product data from various SERP listings, and then another query that listens to the output of multiple lower-level queries and performs Product Matching using ML.
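To make two of the examples above concrete, here is a rough Python sketch of the Content-Length check and the Vary-header count. It is only an illustration: the `HttpTransaction` record shape and its field names are hypothetical, not an actual schema, and it assumes `body` holds the raw bytes as transferred (before any decompression).

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional


@dataclass
class HttpTransaction:
    # Hypothetical shape of one row in the network-request table.
    status: int
    headers: Dict[str, str] = field(default_factory=dict)
    body: bytes = b""  # assumed to be the raw body bytes as transferred


def get_header(tx: HttpTransaction, name: str) -> Optional[str]:
    """Case-insensitive header lookup; returns None when the header is absent."""
    wanted = name.lower()
    for key, value in tx.headers.items():
        if key.lower() == wanted:
            return value
    return None


def content_length_mismatch(tx: HttpTransaction) -> bool:
    """True when a numeric Content-Length is present and differs from the body size."""
    declared = get_header(tx, "Content-Length")
    if declared is None or not declared.strip().isdigit():
        # Nothing to compare against (e.g. chunked responses omit the header).
        return False
    return int(declared.strip()) != len(tx.body)


def count_with_vary(transactions: Iterable[HttpTransaction]) -> int:
    """Aggregate: number of responses that carried a Vary header."""
    return sum(1 for tx in transactions if get_header(tx, "Vary") is not None)
```

Each end-user query would then be roughly a filter (the traffic pattern) plus one of these small functions, with the result written to that query's own sink.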