Apache Pinot #general

Tim Berglund

03/23/2023, 4:04 PM

It’s happening now if you can join. 🙂

vishal

03/24/2023, 9:44 AM

Hi Team, I am pushing some data through CSV. Some columns contain ";" and I am pushing those columns as strings, but they are not parsed properly and an error is returned.
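One possible cause, offered as an assumption rather than a diagnosis: Pinot's CSV record reader uses `;` as its default multi-value delimiter, so a string cell containing semicolons may be split into several values. The Python sketch below only illustrates that effect with the standard `csv` module. The sample row and the `split_multi_value` helper are hypothetical and are not Pinot code.

```python
import csv
import io

# Hypothetical sample row: the second column is one string that contains ';'.
SAMPLE = 'id,tags\n1,"red;green;blue"\n'


def read_rows(text, delimiter=","):
    """Parse CSV text into a list of dicts using the standard csv module."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def split_multi_value(value, multi_value_delimiter=";"):
    """Mimic a reader that splits one cell into several values on a delimiter."""
    return value.split(multi_value_delimiter)


rows = read_rows(SAMPLE)
cell = rows[0]["tags"]
print(cell)                          # the cell as the CSV parser sees it
print(split_multi_value(cell))       # split on ';', the string is broken up
print(split_multi_value(cell, "|"))  # a different delimiter leaves it intact
```

If this is what is happening, choosing a multi-value delimiter that never occurs in the data (the reader's `multiValueDelimiter` setting) may be worth trying.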

Yarden Rokach

03/24/2023, 11:26 AM

Thanks to everyone who joined the Apache Pinot Roadmap 2023 meetup yesterday 🍷 It was great to see the high engagement and enthusiasm around what’s in store for Pinot this year! In case you missed it:

Event Recording

Apache Pinot 2023 roadmap deck

Wishing you a wonderful day ❤️

Nikhil Srivastava

03/25/2023, 6:35 AM

Hi Team, can someone share a good resource that covers the engineering differences between real-time analytical stores (like Pinot, Druid) and emerging streaming databases (like RisingWave etc.), and which to use for different use cases?

Tim Berglund

03/27/2023, 11:56 PM

My guess is many of you have seen the videos we’ve been making to generate awareness of the Real-Time Analytics Summit next month. @Kishore G had some good things to say about Pinot+Kafka in his post about last week’s video: https://www.linkedin.com/feed/update/urn:li:activity:7046200749961199616/

🌟 3

Sonit Rathi

03/29/2023, 6:00 AM

Hi Team, I have a Pinot cluster with only 1 ZooKeeper node. I want to increase it to an ensemble of 3. But creating the ensemble will mean bringing down the only running ZooKeeper first. How do I handle this gracefully?

➕ 2

Shreeram Goyal

03/30/2023, 8:37 AM

Hi, we have been querying Pinot via Presto and have found that broker pushdown happens very rarely: complex queries are run directly on the servers via Presto, and a lot of data transfer happens. When I tried Pinot query functions such as jsonextractscalar in Presto, I found that they don't work there. Is there any way or workaround to increase query pushdown to the broker?

Yarden Rokach

03/30/2023, 1:27 PM

In 2016, Stephen Strange traveled many miles to repair his nerve-damaged hands and acquire unthinkable powers in the Marvel blockbuster “Doctor Strange.” Today, our VP of Developer Relations, Tim Berglund (aka Mr. Batch Data), returns to Kamar-Taj to find out how The Ancient One put a petabyte of data into a single table at Stripe and achieved subsecond query latency throughout the multiverse. And on April 25-26, you need only travel as far as San Francisco for Real-Time Analytics Summit 2023 to learn about many other powerful examples of the transformation the data world is undergoing from batch to real-time. Network with the pros and hear directly from industry leaders at companies like LinkedIn, Confluent, Uber, Cloudera, DoorDash, Just Eat Takeaway.com, and many more. Join us and enter the realm of real-time analytics. Use Code TEACHME for 30% off your ticket: https://stree.ai/3ZOXmQ7

Yarden Rokach

03/30/2023, 1:27 PM

https://www.linkedin.com/posts/startreedata_the-data-multiverse-doctor-strange-parody[…]455313960961-sho-?utm_source=share&utm_medium=member_desktop

Tim Berglund

03/30/2023, 2:55 PM

True story about this one: I just couldn’t get the screaming part right (my colleagues were right outside the studio when we were filming), so Peter Furia, Head of Video at StarTree, had to record himself screaming in his home office editing suite. He nailed it!

Tim Berglund

03/30/2023, 2:56 PM

(This is a process called ADR, and is common in actual movies.)

David G. Simmons

03/30/2023, 7:19 PM

In case you missed these outstanding new guides: https://stree.ai/3ZnzAdn https://stree.ai/3lU0njG

kodalien kodalien

04/03/2023, 4:17 PM

Hi, everyone

🙌 3

👋 3

Yarden Rokach

04/04/2023, 11:11 AM

Ola Ola everyone! 🌼🌸☘️ The spring is here, and so are some community updates!

📢 Major News
• Sovrn is Revolutionizing AdTech with Real-Time Analytics Powered by StarTree Cloud - Read on Datanami
• Real-Time Analytics Summit - check out this blog to learn why you should register and join us in San Francisco, April 25-26
• The Real-Time Analytics Podcast is out now! Listen to the first episode “Real-Time Analytics and Why You Need It” with Tim Berglund
• Introducing the Apache Pinot™ Playground, a new, lighter-weight option for playing around with Pinot - check it out!
• StarTree Extends Leadership Position in Real-time Analytics with 4x revenue growth, 10x customer growth, and 2x employee growth - Read the Press Release

📖 To Read
• Real-Time Analytics for IoT by @David G. Simmons
• The Power of Automated Anomaly Detection for Rideshare Booking Metrics by Madhumita Mantri, Jackson Argo
• Best Practices for Designing Tables in Apache Pinot™ by Sandeep Dabade, Kulbir Nijjer
• A Journey into Apache Pinot™ and Real-Time Analytics by David G. Simmons

▶️ To Watch
• Multi-Volume Support in Apache Pinot | StarTree Recipes
• Tracking Ingestion Lag from Apache Kafka | StarTree Recipes
• Meetup: Real-Time Analytics Using Apache Pinot
• Meetup: Apache Pinot Roadmap 2023

🗓️ Events
• In-person Real-Time Analytics Summit April 25-26 - Hotel Nikko San Francisco
• Virtual Meetup on April 6 - Stream Processing vs. Real-Time OLAP: Which One Would You Choose
• Virtual Meetup on April 6 - Getting Started with Apache Pinot [Hands-on Workshop]
• Virtual Meetup on April 12 - Getting Started with Indexing on Apache Pinot - Brazil
• Virtual Meetup on April 19 - Real-Time Analytics for Retail

View the full newsletter on LinkedIn here!!!

Barkha Herman

04/04/2023, 4:15 PM

All - I am hosting a hands-on workshop this Thursday for getting started with Pinot. Hope to see you all there! https://www.meetup.com/meetup-group-cmutoian/events/292258207/

🍵 2

🍷 4

🎉 2

Ismail Mohammed

04/05/2023, 4:38 PM

Are there any best practices or advice on how Pinot infrastructure can be set up to achieve SOC/X-compliant access controls etc.? Any input will be much appreciated. I have seen Okta used to manage access to Snowflake, but whether Pinot supports Okta is unclear to me. I am a novice on this, so any relevant doc will be very helpful.

Doris Zhang

04/06/2023, 5:53 PM

Hi team, does Pinot Helix support authentication with ZooKeeper? We are hoping to have Pinot Helix authenticate with ZK using Kerberos or password authentication.

Rohit Yadav

04/10/2023, 4:27 AM

I am aware of the recommendation that segment sizes should be ~200MB and that the number of segments in a table should be a few thousand, as both can have a negative impact. I have a use case where, in a REALTIME table, one column is blob data (the other 2 columns will have strings with inverted index only) and its average size is ~20KB. This leads to a situation where I have either too many small segments or a moderate number of large segments (2GB). Going the route of too many segments is out of the question as it affects ZooKeeper, but is there a way I can support large segments? Based on this commit, it looks like large segments (>2GB) can be created. Is there any suggestion for how to handle a table with large segments?
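The trade-off described above can be put in rough numbers. This Python sketch assumes the blob column dominates row size, uses decimal units, and ignores compression, dictionaries, and index overhead. The total row count is a hypothetical figure chosen only for illustration.

```python
AVG_ROW_BYTES = 20_000        # ~20KB per row, from the message above
TOTAL_ROWS = 100_000_000      # hypothetical table size, for illustration only


def rows_per_segment(segment_bytes, avg_row_bytes=AVG_ROW_BYTES):
    """Approximate number of rows that fit in a segment of the given size."""
    return segment_bytes // avg_row_bytes


def segments_needed(total_rows, segment_bytes, avg_row_bytes=AVG_ROW_BYTES):
    """Approximate segment count, rounded up."""
    per_segment = rows_per_segment(segment_bytes, avg_row_bytes)
    return -(-total_rows // per_segment)  # ceiling division


for label, size in (("200MB", 200_000_000), ("2GB", 2_000_000_000)):
    print(label, rows_per_segment(size), segments_needed(TOTAL_ROWS, size))
```

Under these assumptions a 200MB segment holds about 10,000 rows and a 2GB segment about 100,000, so the same table needs ten times fewer segments at the larger size.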

Arush Kharbanda

04/10/2023, 11:25 AM

Hey, is there a good talk about any production implementation of Pinot? We are in the process of setting up a Pinot cluster and are looking to understand the best practices and learnings from anyone who has deployed Pinot in production.

Barkha Herman

04/11/2023, 10:42 PM

The latest in the series of Pinot related RTA Video is out and it's great! Check it out: https://www.linkedin.com/posts/startreedata_pinot-kafka-flink-spideys-learn-about-act[…]631856697344-dwet?utm_source=share&utm_medium=member_desktop

Sumit Lakra

04/13/2023, 10:28 AM

Hi team, what options do we have for visualisation tools other than Superset? We have tried using Power BI via the Presto interface, but it doesn’t seem to have an option for direct queries.

Malte Granderath

04/14/2023, 7:11 AM

Hey everyone 👋 We are currently setting up our production orchestration for Pinot and we have to decide between focusing on a single cluster vs multi cluster setup. Generally the standard approach at our company for any infrastructure is to go for multi cluster setup because of the increased fault tolerance and less possible cross use-case impact. What are the main advantages of going for a single cluster with multiple tenants approach that are not directly obvious?

Vladyslav Shamaida

04/17/2023, 9:05 AM

Hi 👋 Please advise on a better solution for a pretty standard case. I have a table:

dim1, dim2, dim3, dim4, daysSinceEpoch, hoursSinceEpoch, metric

I need to query `sum(metric)` grouped by days OR by hours, with optional dimensions. The cardinality of the dimensions is pretty low: from 3 to a few hundred unique values. I decided to use a star-tree index for that. Is having both `daysSinceEpoch` and `hoursSinceEpoch` in `dimensionsSplitOrder` a good approach here?

"dimensionsSplitOrder": [dim1, dim2, dim3, dim4, daysSinceEpoch, hoursSinceEpoch]

TTL of data is 3 months.
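For reference, the configuration being asked about can be written out in full. The Python sketch below builds a star-tree entry with the split order from the message and a `SUM__metric` aggregation. The `maxLeafRecords` value is a placeholder assumption, and the sketch does not settle whether including both time columns is the better design. The `days_from_hours` helper is hypothetical and only shows that the day value is derivable from the hour value.

```python
import json

SPLIT_ORDER = ["dim1", "dim2", "dim3", "dim4", "daysSinceEpoch", "hoursSinceEpoch"]


def star_tree_config(split_order, metric="metric", max_leaf_records=10000):
    """Build one star-tree index entry; max_leaf_records is a placeholder."""
    return {
        "dimensionsSplitOrder": list(split_order),
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": [f"SUM__{metric}"],
        "maxLeafRecords": max_leaf_records,
    }


def days_from_hours(hours_since_epoch):
    """Each day bucket covers 24 consecutive hour buckets."""
    return hours_since_epoch // 24


table_index_config = {"starTreeIndexConfigs": [star_tree_config(SPLIT_ORDER)]}
print(json.dumps({"tableIndexConfig": table_index_config}, indent=2))
```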

Zhengfei

04/18/2023, 4:46 AM

Hi team, is there any timeline for fixing the thread safety issue in https://github.com/apache/pinot/pull/9802? We would like to use

{
  "dimensionTableConfig": {
    "disablePreload": true
  }
}

but we are a bit worried about this thread safety issue.

Padma Malladi

04/18/2023, 6:13 PM

Hi, how can I get the Pinot version from a running Pinot instance?

Padma Malladi

04/18/2023, 6:14 PM

Also, how is the pagination API expected to work in Pinot? When I ran "select * from x where foo > 3 order by foo desc limit 5, 10", it gave me all 10 results and did not start from offset 5.
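A minimal Python sketch of what `limit 5, 10` would return, assuming the MySQL-style reading `LIMIT offset, count` (skip 5 rows, then return 10). The `paginate` helper and the sample values are hypothetical. They only model the semantics and do not query Pinot.

```python
def paginate(sorted_rows, offset, count):
    """Return `count` rows starting at zero-based position `offset`."""
    return sorted_rows[offset:offset + count]


# Hypothetical data: 20 rows with foo > 3, ordered by foo descending (23 .. 4).
foo_values = sorted(range(4, 24), reverse=True)

page = paginate(foo_values, offset=5, count=10)
print(len(page), page[0], page[-1])
```

Under that reading, getting 10 rows back is expected, and the first row of the page is the sixth row of the full ordered result.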

Ryan Tomczik

04/19/2023, 5:06 PM

Hello, we will be evaluating Apache Pinot for a real-time analytics use case, but one of the requirements is to be able to retrieve large result sets of IDs (1 million+) for a particular query. We have seen that query result pagination is a requested feature but hasn't been implemented yet. Is this a use case that Pinot is used for?

Scott deRegt

04/19/2023, 8:15 PM

❓ We occasionally do `server` reboots to install the latest security patches on instances. We've noticed elevated failure rates on Pinot queries during these times. We use a replication factor of 3 to ensure HA. My understanding was that the `broker` would gracefully handle a `dead` server and scatter queries to a replica group that contains only `alive` servers. Is that not the case? If not, is there a recommended path for taking one server at a time offline to perform a reboot while maintaining full availability of the cluster? Does `disable`/`enable`-ing the server state (using `/instances/{instanceName}/state`) during the reboot cycle help here?
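The one-server-at-a-time sequence described above can be sketched as follows. This Python outline only shows the ordering (disable, reboot, health check, enable). The `set_state`, `reboot`, and `is_healthy` callables and the server names are hypothetical stand-ins supplied by the caller, and how the controller endpoint is actually invoked (HTTP method, body) is not assumed here. Whether disabling a server avoids the query failures is exactly the open question, so treat this as a sketch rather than a recommendation.

```python
def state_path(instance_name):
    """Controller path quoted in the message above."""
    return f"/instances/{instance_name}/state"


def rolling_reboot(servers, set_state, reboot, is_healthy):
    """Disable, reboot, verify, and re-enable servers strictly one at a time."""
    steps = []
    for server in servers:
        set_state(state_path(server), "disable")
        steps.append((server, "disable"))
        reboot(server)
        steps.append((server, "reboot"))
        if not is_healthy(server):
            # Stop before touching the next replica so the others keep serving.
            raise RuntimeError(f"{server} did not come back; stopping the rollout")
        set_state(state_path(server), "enable")
        steps.append((server, "enable"))
    return steps


# In-memory stand-ins so the sketch runs without a cluster.
calls = []
steps = rolling_reboot(
    ["Server_a", "Server_b"],
    set_state=lambda path, state: calls.append((path, state)),
    reboot=lambda server: None,
    is_healthy=lambda server: True,
)
for step in steps:
    print(step)
```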

Ashish Koirala

04/19/2023, 11:03 PM

I am interested to learn what people have been doing for encrypting data and backing up data for disaster recovery.
• I learned that S3 can be used as a deep store where data can be stored encrypted.
• Is there a provided solution to encrypt data on disk and in transit?
• How often is the deep store updated? I am wondering whether data can be lost if the cluster goes down when the data has only been written to local disk and not yet to the deep store. Is that true?
• Can the deep store segments be kept as a backup by copying them to a different location? How can we consume them later?

Adam Erickson

04/20/2023, 12:04 AM

Hey all, quick question that I'm hearing conflicting answers about. If I set a column as `"JSON"`, do I have to worry about `maxLength` and the JSON getting truncated, or not? Is it internally stored as a string with a fixed max length, or does it grow for variable-length JSON? Here are some links I found online that touch on the topic but seem to conflict.

1) https://github.com/apache/pinot/issues/7051, which says

You may set the json field as data type JSON (introduced in #6878) and the value won't be truncated

2) https://docs.pinot.apache.org/basics/data-import/complex-type, which says

...Additionally, you need to overwrite the maxLength of the field group_json on the schema, because by default, a string column has a limited length. For example,...
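To make the two options concrete, here is a small Python sketch that builds both kinds of field spec: a `STRING` column with an explicit `maxLength` override, as the second link describes, and a `JSON` column with no override, as the first link suggests. The schema name, the `payload` column, and the chosen `maxLength` value are illustrative assumptions. The sketch does not settle which statement is current behavior.

```python
import json


def field_spec(name, data_type, max_length=None):
    """Build a dimension field spec, adding maxLength only when given."""
    spec = {"name": name, "dataType": data_type}
    if max_length is not None:
        spec["maxLength"] = max_length
    return spec


schema = {
    "schemaName": "example",  # hypothetical schema name
    "dimensionFieldSpecs": [
        # STRING column holding JSON text, with the length limit raised.
        field_spec("group_json", "STRING", max_length=2147483647),
        # JSON-typed column with no explicit maxLength (hypothetical column).
        field_spec("payload", "JSON"),
    ],
}
print(json.dumps(schema, indent=2))
```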