# general
  • David Cromberge

    11/01/2022, 9:45 AM
    Hi Everyone, I am completely new to Pinot and am a little more familiar with Druid. Is it accurate to say that Pinot does not have the concept of pre-aggregations but rather relies on indices to compute aggregates from the raw data instead?
  • abhinav wagle

    11/01/2022, 4:49 PM
    Hello, is there a reason why the Trino connector connects to Pinot via a controller URL and not a broker URL, given that all Trino is doing is querying data?
  • Alfredo Prates

    11/01/2022, 10:11 PM

    https://www.youtube.com/watch?v=Gu-2HJSfMis

  • Lee Wei Hern Jason

    11/02/2022, 7:19 AM
    Hi Team, can I check if Pinot has a function to bucket time based on a specified start time? E.g. every 5 minutes: 00:02-00:06, 00:07-00:11, 00:12-00:16, similar to the bin_at function in ADX.
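    For reference, a sketch of one way to do this with Pinot's DATETIMECONVERT function, bucketing into 5-minute windows and shifting by the desired start offset (here 2 minutes = 120000 ms); the table and column names are hypothetical:
    Copy code
    -- Buckets anchored at hh:02, hh:07, ... rather than hh:00, hh:05, ...
    SELECT DATETIMECONVERT(eventTs - 120000, '1:MILLISECONDS:EPOCH',
                           '1:MILLISECONDS:EPOCH', '5:MINUTES') + 120000 AS bucketStart,
           COUNT(*) AS events
    FROM myTable
    GROUP BY DATETIMECONVERT(eventTs - 120000, '1:MILLISECONDS:EPOCH',
                             '1:MILLISECONDS:EPOCH', '5:MINUTES') + 120000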
  • Bobby Richard

    11/02/2022, 8:13 PM
    Hello Pinot friends, can someone help me understand the use cases for the noDictionaryColumns config? Would it make sense to specify high-cardinality columns such as event_id, or large string columns, as noDictionary?
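    For context, no-dictionary (raw) storage is configured per column under tableIndexConfig; a minimal sketch of the relevant table-config fragment, with hypothetical column names:
    Copy code
    "tableIndexConfig": {
      "noDictionaryColumns": ["event_id", "large_text_col"]
    }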
  • Ryhan Sunny

    11/02/2022, 8:46 PM
    Hi folks! We’ve just locked down the full schedule for this year’s Open Source Analytics Conference – OSA Con 2022 – and what a schedule it’s turned out to be: 20+ talks, 30+ speakers, 15 community sponsors, 2 keynotes, 2 panel discussions, and 2 tracks, all on the same day! Thanks to the many people who have already signed up to join us on November 15th, and if you haven’t had a chance to register yet, please secure your spot for free as soon as possible!
  • Anita Jas

    11/03/2022, 11:56 AM
    Hello! The schema I am supposed to use has a column named Count. Pinot is defaulting all values for this column to 0 (and not picking up the actual value from the table). Need some insights here!
  • vishal

    11/04/2022, 9:20 AM
    Hi Team, I am trying to move data from realtime to offline using the minion task "RealtimeToOfflineSegmentsTask", and am getting the logs below:
    Copy code
    Start generating task configs for table: events2_REALTIME for task: RealtimeToOfflineSegmentsTask
    No realtime-completed segments found for table: events2_REALTIME, skipping task generation: RealtimeToOfflineSegmentsTask
    Finished CronJob: table - events2_REALTIME, task - RealtimeToOfflineSegmentsTask, next runtime is 2022-11-04T07:04:00.000+0000
    I've pushed a large amount of data and it's creating multiple segments, but nothing is being converted from the realtime to the offline table. Thanks.
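    A minimal sketch of the relevant table-config fragment for the task (the periods are illustrative); note the log above says no realtime-completed segments were found, and the task only moves segments that have finished consuming:
    Copy code
    "task": {
      "taskTypeConfigsMap": {
        "RealtimeToOfflineSegmentsTask": {
          "bucketTimePeriod": "1d",
          "bufferTimePeriod": "1d"
        }
      }
    }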
  • Sonit Rathi

    11/05/2022, 11:04 AM
    Does anyone know how we can set the consumer group name in Kafka stream ingestion?
  • Ashish Kumar

    11/07/2022, 5:09 AM
    Does anyone know if the Pinot S3 FS implementation has signer support (i.e. for how you sign an AWS API request with the credential)?
  • Diogo Baeder

    11/07/2022, 11:51 AM
    Hi folks! Quick question: does Pinot support Redpanda as a replacement for Kafka? Or does Pinot rely on metadata from Kafka to be available in ZooKeeper, e.g. for management purposes?
  • Dhwanil Ditani

    11/08/2022, 5:46 AM
    For Pinot deep store with S3, can we configure it in such a way that after the retention time is completed the data is deleted from the Pinot servers but not from the S3 bucket?
  • vishal

    11/08/2022, 9:58 AM
    Hi Team,
    Copy code
    Window data overflows into CONSUMING segments for partition of segment: events16__0__165__20221108T0953Z. Skipping task generation: RealtimeToOfflineSegmentsTask
    Finished CronJob: table - events16_REALTIME, task - RealtimeToOfflineSegmentsTask, next runtime is 2022-11-08T09:58:00.000+0000
    How can I solve this overflow issue?
  • Dhwanil Ditani

    11/08/2022, 1:20 PM
    Hi Team, I have a small doubt: when the Pinot deep store (AWS S3) is configured, will ingestion from Kafka be paused while a segment is being uploaded to S3, or will it continue? Will enabling split commit make a difference?
  • Abdelhakim Bendjabeur

    11/08/2022, 3:25 PM
    Hello, has anyone used the JSON_INDEX and gotten a better idea of how much storage it may take?
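    For reference, the JSON index is enabled per column under tableIndexConfig (the column name here is hypothetical); its size depends heavily on the number of distinct keys and values in the documents:
    Copy code
    "tableIndexConfig": {
      "jsonIndexColumns": ["payload"]
    }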
  • Raja Kirshnamoorthi

    11/08/2022, 8:14 PM
    Hi Team - I am new to Pinot. We would like to create a real-time dashboard from raw data stored in S3 and Kinesis, with the transformed data stored in Pinot. What is the most optimised way to transform the data? Which tools/components/languages can be used for the compute or transformation? Or should we use a Pinot SQL query for the transformation?
  • Gaurav Sinha

    11/09/2022, 8:41 AM
    Hi Team, we are doing a POC of Pinot for our company. We have it set up on K8s on GCP. Today I am observing an error while trying to query through the Pinot UI (4 out of 6 segments are showing as unavailable):
    Copy code
    [
      {
        "message": "null:\n4 segments [user_impressions_v1_stg__3__0__20221107T1247Z, user_impressions_v1_stg__0__0__20221107T1247Z, user_impressions_v1_stg__1__0__20221107T1247Z, user_impressions_v1_stg__4__0__20221107T1247Z] unavailable",
        "errorCode": 305
      }
    ]
  • Gaurav Sinha

    11/09/2022, 8:42 AM
    Can someone help me out with this? I tried Rebalance Server and Rebalance Brokers without any success.
  • Mamlesh

    11/09/2022, 10:08 AM
    Hi Team, I have some questions. 1. How can we page through query results? E.g. a query result has 2 million records, but I want to fetch the first 500k, then the next 500k, and so on, like we did in Solr. 2. Can we update a record's column in Pinot via an API or some other way, like in Solr where we update records directly by record ID?
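    On question 1, a sketch of offset-based paging, assuming a Pinot version whose SELECT supports a "LIMIT offset, count" form (the table name is hypothetical; deep offsets over millions of rows can be expensive, so filtering on a range column may scale better):
    Copy code
    -- Second page of 500k rows
    SELECT * FROM myTable LIMIT 500000, 500000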
  • Dan Caragea

    11/09/2022, 10:44 AM
    Hi all. I have a question about how best to proceed with my use case; I'd appreciate any suggestions you might have. I have my raw source data in the following format: Raw events:
    Copy code
    id, type, dimension1, dimension2
     1    t1     a                  
     1    t1                        
     1    t2                  b
    
     2    t1     a
    Rows with the same id are part of the same "session" so in the example above I have 2 sessions: one with 3 events and another with 1 event. If I were to reconstitute the sessions from these individual events, they would look like this: Sessions:
    Copy code
    id, dimension1, dimension2, countT1, countT2
     1      a           b          2        1   
     2      a                      1        0
    Questions I have to answer are "how many t1 types in sessions where dimension1 is a?" or "how many sessions had more than one t1 type?". As far as I can tell, my options are:
    1. Storing the raw data in Pinot as-is and figuring out the queries for the above questions. TBH this would be my preferred route, but can you help me with a sample SQL for the questions above? Also, can these queries be reasonably fast? The bit I am struggling with (my SQL skills are really rusty atm) is that a naive query like select count(type) where dimension1='a' and type='t1' would return 1 for id 1, yet it should be 2 (see the reconstituted session for id 1). So I probably need some sort of join, but I am not sure what's the best way to do it.
    2. I could try to use the upsert feature of Pinot to reconstitute and store the sessions in Pinot instead of the raw data. This could work, although I am not sure I can do the counts (countT1/countT2) with upserts. Also, since I'll have to reconstitute multiple session types (based on various other ids) and Pinot requires the Kafka topic to be keyed, I'll have to duplicate topics in Kafka just to use a different key. It seems a bit wasteful to me atm.
    3. I could try to reconstitute in a job outside of Pinot and insert only the final version of each session into Pinot. This has the downside that I'd have to wait for a session to be complete before inserting it, which means less fresh data and even lost events if they come very late (there's no end-of-session marker, and events can come out of order anyway).
    So what would you recommend, and can you help with #1 above?
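    On question #1, a sketch in standard SQL of the session-then-filter shape, assuming a table raw_events and an engine that supports subqueries (e.g. Pinot's multi-stage engine, or Trino on top of Pinot); it rolls events up to sessions first, then filters:
    Copy code
    -- "How many t1 events in sessions where dimension1 is 'a'?"
    SELECT SUM(countT1)
    FROM (
      SELECT id,
             MAX(dimension1) AS dim1,  -- picks the non-blank value, assuming blanks sort lower
             SUM(CASE WHEN type = 't1' THEN 1 ELSE 0 END) AS countT1
      FROM raw_events
      GROUP BY id
    ) sessions
    WHERE dim1 = 'a'
    The second question follows the same shape, with an outer COUNT(*) and WHERE countT1 > 1.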
  • Lars-Kristian Svenøy

    11/09/2022, 10:58 AM
    Hey team, quick question. I have a table which is roughly 60 GB in size. It contains fields which I commonly have to join on, and I am currently using Trino to accomplish this. I have been considering if it makes sense to make this a dimension table. I know there are size constraints, but I could easily store it on disk. My worry though is that I've seen a comment mentioning the entire table may be stored on the heap. Is that the case, or can I get away with a 60GB dimension table?
  • Weixiang Sun

    11/09/2022, 9:54 PM
    Just curious: how does Pinot set the limit on the number of groups to return for an aggregation query? I always get 500,000 rows at most.
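    The cap can usually be raised per query with the numGroupsLimit query option (there is also a matching server-side default config); a sketch, assuming a Pinot version that accepts SET-style query options and hypothetical names:
    Copy code
    SET numGroupsLimit = 1000000;
    SELECT dim1, COUNT(*) FROM myTable GROUP BY dim1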
  • Qiaochu Liu

    11/10/2022, 8:54 PM
    Hello team, quick question about distinctCountHLL (https://docs.pinot.apache.org/configuration-reference/functions/distinctcounthll): DISTINCTCOUNTHLL(colName, log2m). If Pinot users want to leverage the log2m parameter at query time, do we need to emit the HLL object with the given precision at ingestion time? Or will it work with no changes on the ingestion side?
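    Per the linked page, log2m is just the optional second argument at query time; a sketch with a hypothetical column and table:
    Copy code
    SELECT DISTINCTCOUNTHLL(user_id, 12) FROM myTable
    Note that when the column holds pre-serialized HLL objects rather than raw values, their precision is presumably fixed at ingestion time.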
  • abhinav wagle

    11/10/2022, 11:10 PM
    Hello folks, looking for recommendations around logging for a Pinot production cluster. There seem to be multiple configs to control this, but we want to make sure we don't overdo the logging while still having the flexibility to inspect Pinot logs for exceptions/errors.
  • Ashish Kumar

    11/11/2022, 4:11 AM
    Hello team, I am trying to use pinot-batch-ingestion to ingest S3 Parquet data into Pinot, and I am getting a java.lang.IllegalArgumentException: INT96 not implemented and is deprecated error. Does someone know the right way to read Parquet using the pinot-parquet record reader? It seems like by default my batch job is using parquetAvroFormatReader, which doesn't implement INT96. What could be another way of reading the Parquet files? Stack trace:
  • vishal

    11/11/2022, 5:30 AM
    Hi Team, I've set up the realtime-to-offline flow. I pushed some data on day 1 and it got converted to offline on day 2. Then I pushed more data, but it's not being pushed to the offline table even though I can see 6 completed segments. Can somebody help me with this? Thanks.
  • Abhishek Dubey

    11/11/2022, 7:02 AM
    Hi Team, is it possible to create Pinot segments partitioned on an attribute other than the time field? This is for offline tables where the data volume grows in a controlled way.
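    Related: segments can be partitioned on a non-time column for query pruning via segmentPartitionConfig (the column name here is hypothetical), though Pinot still uses the time column for retention and segment management:
    Copy code
    "tableIndexConfig": {
      "segmentPartitionConfig": {
        "columnPartitionMap": {
          "memberId": {
            "functionName": "Murmur",
            "numPartitions": 4
          }
        }
      }
    }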
  • Ashish Kumar

    11/12/2022, 2:15 PM
    Hi Team, can we use the pinot-batch-ingestion jar with Spark 3.1.2?
  • Abdelhakim Bendjabeur

    11/14/2022, 10:06 AM
    Hello 👋 I am looking into the details of a self-hosted Pinot on k8s, and trying to answer a few questions to decide whether it's worth it. In the docs, there is 👇 "Dynamic configuration changes: Operations such as adding new tables, expanding a cluster, ingesting data, modifying indexing config, and re-balancing must be performed without impacting query availability or performance." Does expanding a cluster mean we can add disk and CPU/memory resources to all components seamlessly?
  • Abdelhakim Bendjabeur

    11/14/2022, 10:40 AM
    Also, another question, maybe a dumb one as I am not a DevOps expert: from the architecture docs, servers host segments; does this mean that the Persistent Volume of a server (in k8s) can hold hundreds of GB of data? In the case of realtime ingestion (Kafka), if the server restarts before committing the data to storage, will the consumer be aware of this and avoid committing the Kafka offset, to ensure data consistency?