Apache Pinot #general

Raj

03/10/2023, 2:36 PM

Hi, any thoughts on this OLAP benchmark? I'm not sure how reliable/verified it is: https://benchmark.clickhouse.com/ Pinot is in the middle of the pack.

Saubhagya Awaneesh

03/11/2023, 2:31 AM

Hi team, how can we 1) limit user access to the Pinot REST APIs, so that only selected user IDs can query via the API? 2) set up an API token per user ID, so that each user who can access the API has a dedicated API token? 3) (optional) apply ACLs at the column/row level?

Ashish Kumar

03/14/2023, 8:23 AM

Hi Team, is it possible to recover deleted realtime segments back into a REALTIME table? Context: we had a REALTIME table, DRIVER_METRIC, and we accidentally deleted the table from the Pinot console, which used the 7d default retention of segments. So we can still see the deleted segments in S3, but the table is no longer available for query. We have now created the DRIVER_METRIC REALTIME table again; it is available for query and is consuming data from Kafka, but it only has the last 1 day of data (the Kafka topic retention). We want to push the old data into this table from the deleted segments folder in S3. Is that possible, and if so, how? cc: @Lee Wei Hern Jason

Yarden Rokach

03/14/2023, 12:42 PM

Apache Pinot Roadmap 2023 meetup | March 23. For the community, by the community 🍷
• In this meetup we'll be featuring the Apache Pinot roadmap for 2023.
• Hear from LinkedIn, Uber, StarTree and more about what they have in store this year for Pinot.
• Explore what other community members are working on and hear what the community wants to see in Pinot.
RSVP here >> https://www.meetup.com/apache-pinot/events/291954166/?isFirstPublish=true

Weixiang Sun

03/15/2023, 1:39 AM

Quick question: how is `EXPLAIN PLAN FOR` generated for a hybrid table? In my testing, the `EXPLAIN PLAN FOR` output for a hybrid table is the same as for the realtime table, which is different from the offline table. Is this expected?

Rohit Yadav

03/16/2023, 10:16 AM

Hi community, I am trying to use the lucene text index for a hybrid table setup. I was able to set it up for realtime table without much effort. For offline table part, we rely on a spark job to generate and upload the segment URIs. Do I need to create segments with text index and then upload or does Pinot create the text indexes automatically after the segments without indexes are uploaded?

piby

03/16/2023, 2:34 PM

Hi community, Is there any way to specify table and column description in the schema json? We ideally want to store some table metadata right within the schema and not use an external solution for it.

abhinav wagle

03/16/2023, 9:40 PM

Hellos, are there docs available on how to make UDFs work with a Helm setup?

Sameer Awasekar

03/17/2023, 4:20 AM

Hi Community, I am exploring the RealtimeToOffline Minion Task. I wanted to confirm whether the upload of converted offline segments and the update of the `watermark metadata` are atomic. I do see the segment replacement protocol, but I think it comes into the picture for the Merge task and not for RealtimeToOfflineTask.

vishal

03/17/2023, 7:14 AM

Can we download segments and export the data to CSV? @saurabh dubey

Ashish Kumar

03/17/2023, 3:28 PM

Hi Team, what's the difference between `LaunchSparkDataIngestionJobCommand` and `LaunchDataIngestionJobCommand`? When using a batch ingestion job (https://docs.pinot.apache.org/basics/data-import/batch-ingestion/spark), which one should be the main class?

Nizar Hejazi

03/17/2023, 11:52 PM

I have a Kafka topic with Avro encoding. The time column is of type long, with logical type timestamp-micros. Is there any way to convert it to milliseconds and define a datetime field spec like the following (without having to create a new column)?

{
  "name": "event_time_ms",
  "dataType": "TIMESTAMP",
  "format": "1:MILLISECONDS:TIMESTAMP",
  "granularity": "1:MILLISECONDS"
}
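
For reference, the unit conversion being asked about is an integer division by 1,000. The sketch below shows only that arithmetic in Python; it is not Pinot configuration, and the function name and sample value are hypothetical.

```python
from datetime import datetime, timezone


def micros_to_millis(epoch_micros: int) -> int:
    """Convert epoch microseconds to epoch milliseconds.

    Floor division drops the sub-millisecond part (rounding down).
    """
    return epoch_micros // 1000


# Hypothetical value, as an Avro long with logical type timestamp-micros.
sample_micros = 1_679_097_120_000_123
sample_millis = micros_to_millis(sample_micros)

print(sample_millis)
# Interpret the millisecond value as a UTC timestamp for a sanity check.
print(datetime.fromtimestamp(sample_millis / 1000, tz=timezone.utc).isoformat())
```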

Jason MacLulich

03/18/2023, 6:18 AM

Hi guys, where is the best place to ask about the Pinot SQL query engine, specifically how it behaves with a very long array expression for the `IN` operator?

Deena Dhayalan

03/20/2023, 7:48 AM

Hi, can anyone put together and share a complete doc on how to start Pinot with Docker with an HDFS setup?

Pratik Tibrewal

03/20/2023, 5:29 PM

Hey, recently we saw very high disk usage on some of our hosts. On investigating, we found directories like this on our servers for a table:

_tmp/tmp-<segment_name>-<timestamp>/tmp-<uuid>

The segment name in this path no longer exists for that table (it was deleted by retention). The contents of the directory look like this:

0	col1.sv.sorted.fwd
0	col2.mv.fwd
0	col3.sv.sorted.fwd
0	col4.sv.sorted.fwd
0	col5.sv.sorted.fwd
0	col6.sv.sorted.fwd
4.0K	col1.dict
4.0K	col2.dict
4.0K	col3.dict
26G	col4.dict
132G	col5.dict
148G	col6.dict

Any idea what this `_tmp` folder signifies and why these directories are getting created?
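
To see which leftover directories hold the space, something like the following Python sketch could be run against the `_tmp` directory. It only sums apparent file sizes per immediate subdirectory (so numbers may differ from `du` for sparse files); the function names are hypothetical and nothing here is Pinot-specific.

```python
import os


def subdir_sizes(root: str) -> dict[str, int]:
    """Return total apparent file size in bytes for each immediate subdirectory of root."""
    sizes = {}
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    if not os.path.islink(path):
                        total += os.path.getsize(path)
                except OSError:
                    # The file may disappear while we are scanning; skip it.
                    continue
        sizes[entry.name] = total
    return sizes


def report(root: str) -> None:
    """Print subdirectories of root, largest first, with sizes in GiB."""
    ranked = sorted(subdir_sizes(root).items(), key=lambda kv: kv[1], reverse=True)
    for name, size in ranked:
        print(f"{size / 1024**3:10.2f} GiB  {name}")
```

Calling `report(...)` with the server's `_tmp` path (whatever it is in a given deployment) would list the `tmp-<segment_name>-<timestamp>` directories by size.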

Andi Miller

03/20/2023, 5:37 PM

Is there a recommended way to apply rollups to an offline data import that's come in with `SegmentGenerationAndPushTask`? Do I need to trigger a `MergeRollupTask` and hope it does it?

abhinav wagle

03/20/2023, 8:00 PM

Hellos, any ideas on how folks are providing `ssl.truststore.location`, as mentioned here, as part of the Pinot deployment using Helm? Is it a local `jks` file packaged as part of the Docker image, or is it added after cluster deployment? Any ideas/best practices around this? Thanks!

Bobby Richard

03/20/2023, 8:44 PM

Is there any way to backfill segments in a realtime-only table?

Mingmin Xu

03/20/2023, 9:51 PM

Hello team, I'm looking for suggestions on how to set up graceful shutdown properly on brokers and servers, similar to how Trino/Presto works. Our Pinot cluster is deployed on K8s. To avoid downtime when a pod is restarted:
• a server node needs to commit any consuming segments and be marked as inactive so that no new queries come in;
• a broker needs to be marked as inactive to avoid new queries, and wait until active queries have finished.
cc @Grace Lu

Tim Berglund

03/21/2023, 5:27 PM

If you haven’t seen the silly parody videos my team has been making, today is your day: https://www.linkedin.com/posts/startreedata_pinot-s-kafka-hes-a-friend-from-work-activity-7043973647237042177-tzKb/

🍷 2

Tim Berglund

03/21/2023, 5:30 PM

All of this madness is to remind you of rtasummit.com. Go there and check out the details, look at the program, and register. There's a lot of Pinot content, and I'd love to see this community there in force.

Tim Berglund

03/21/2023, 5:31 PM

PM me if you want a discount code. 💥

Grace Lu

03/21/2023, 9:06 PM

Hi team, we want to consult about one of our high-cardinality use cases and see how to set it up properly with Pinot, or whether it is a proper use case for Pinot at all. We have a metrics table that contains hundreds of daily-level metrics columns associated with uuids. The data is updated daily, adding more than 50 million unique rows every day (one row per uuid per day, and there are millions of uuids). A simplified table schema looks like:

uuid    date    group    metrics_1    metrics_2    …    metrics_xxxx

And a typical simplified query we want to run on this table is selecting a bunch of metrics aggregation for certain groups of uuids across days and then aggregate them again by group, eg:

select 
   group,
   avg(m1),
   sum(m2),
   ...
   avg(mxxx)
from 
(
    select 
        uuid,
        group,
        avg(metrics_1) as m1,
        sum(metrics_2) as m2,
        …
        avg(metrics_xxx) as mxxx
    from metrics_table where group in (xxx) and date between aa and bb
    group by 1, 2
) group by 1

When we did preliminary testing previously, we ran into issues where a simple aggregation query on uuid took very long to return, or queries returned inaccurate approximations due to the high cardinality. We would like some suggestions on whether this is a good use case for Pinot and, if it is, how to model it with the proper cluster config and index config. Thank you! cc @Mingmin Xu
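
To make the intended semantics of the nested query concrete, here is a small Python sketch of the same two-level aggregation on toy data: first per (uuid, group), then per group. The rows are hypothetical, only two metrics are shown, and the `where` filter on group and date is omitted for brevity.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (uuid, date, group, metrics_1, metrics_2)
rows = [
    ("u1", "2023-03-01", "g1", 1.0, 10),
    ("u1", "2023-03-02", "g1", 3.0, 20),
    ("u2", "2023-03-01", "g1", 5.0, 30),
    ("u3", "2023-03-01", "g2", 7.0, 40),
]


def two_level_aggregate(rows):
    # Inner query: collect metrics per (uuid, group) for avg(metrics_1), sum(metrics_2).
    per_uuid = defaultdict(lambda: ([], []))
    for uuid, _date, group, m1, m2 in rows:
        per_uuid[(uuid, group)][0].append(m1)
        per_uuid[(uuid, group)][1].append(m2)

    # Outer query: avg(m1) and sum(m2) per group, computed over the per-uuid results.
    per_group = defaultdict(lambda: ([], []))
    for (_uuid, group), (m1_values, m2_values) in per_uuid.items():
        per_group[group][0].append(mean(m1_values))
        per_group[group][1].append(sum(m2_values))

    return {g: (mean(m1s), sum(m2s)) for g, (m1s, m2s) in per_group.items()}


result = two_level_aggregate(rows)
print(result)
```

For group g1 the average of per-uuid averages is 3.5, whereas a flat average over the three g1 rows would be 3.0, which is why the query cannot be collapsed into a single `group by group`.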

Ashish Kumar

03/22/2023, 2:04 PM

Hi, 1. What's the difference between building pinot-0.12.0 from source with `-Pbuild-shaded-jar` and without it? 2. Is it possible to shade `org.apache.hadoop`, which is used in the main pom.xml in pinot-0.12.0? It seems to use a different version than the Hadoop used in our team's cluster. I believe that if we can shade it and build Pinot from source, it should be fine.

Yarden Rokach

03/22/2023, 3:54 PM

Join us TOMORROW: Apache Pinot Roadmap 2023 meetup 💥 🍷 In this meetup we'll be featuring the Apache Pinot roadmap for 2023. Hear from LinkedIn, Uber, StarTree and more about what they have in store this year for Pinot. Explore what other community members are working on and hear what the community wants to see in Pinot. Meet, chat, and deepen your knowledge of real-time analytics with Pinot. RSVP here: https://www.meetup.com/apache-pinot/events/291954166/?isFirstPublish=true

Yarden Rokach

03/22/2023, 4:15 PM

https://www.linkedin.com/posts/startreedata_pinot-s-kafka-hes-a-friend-from-work-acti[…]647237042177-tzKb?utm_source=share&utm_medium=member_desktop Just making sure you all saw the latest release… 🤣 Lord of Pinot is in the house @Tim Berglund

Ken Krugler

03/22/2023, 9:54 PM

So, way less fun than a Thor remix - https://www.thenile.dev/blog/things-dbs-dont-do. Interesting input for the Pinot roadmap…

🔥 1

👀 1

David G. Simmons

03/23/2023, 11:48 AM

Speaking of less interesting than a Thor parody, I thought I'd point y'all to my first blog post for StarTree... Enjoy!

💥 2

David G. Simmons

03/23/2023, 11:51 AM

I also wrote a thing for DZone on Pinot and IoT, if you're at all interested. 🙂 https://dzone.com/articles/real-time-analytics-for-iot

Tim Berglund

03/23/2023, 4:03 PM

Time for the Apache Pinot Roadmap Meetup! https://www.meetup.com/apache-pinot/events/291954166/