# general

    Qianbo Wang

    09/01/2021, 4:29 PM
Hi Pinot experts, I’m new to the analytics realm with Pinot and I have a general question: does Pinot support something like a “view”, as is common in OLTP databases? What I’m looking for is a way to optimize frequently used queries that aggregate over data entries, e.g., the sum of total sales for the past 30, 60, and 90 days, aggregated on a designated time column. Another option I’m considering is to create a separate table for this aggregation, derived from the fact table, and use a scheduled job to update it. Any ideas? Thanks in advance!
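(Pinot’s usual answer to this kind of pre-aggregation need is the star-tree index, which pre-aggregates along chosen dimensions at segment-build time; but even without it, the trailing windows can be computed in one query. A minimal sketch, assuming a hypothetical `sales` table with an epoch-millis time column `sale_ts`, a metric `amount`, and Pinot’s `ago()` scalar function from recent releases:

```sql
-- One scan computing all three trailing windows via conditional sums
SELECT
  SUM(CASE WHEN sale_ts >= ago('P30D') THEN amount ELSE 0 END) AS sales_30d,
  SUM(CASE WHEN sale_ts >= ago('P60D') THEN amount ELSE 0 END) AS sales_60d,
  SUM(CASE WHEN sale_ts >= ago('P90D') THEN amount ELSE 0 END) AS sales_90d
FROM sales
```

If this is still too slow, the separate pre-aggregated table updated by a scheduled job, as described above, is a common fallback.)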

    Charles

    09/06/2021, 1:26 AM
Hi @User, do we have any plans to support ingestion via JDBC or Flink? That would let us save resources in some cases.

    Shishpal Vishnoi

    09/06/2021, 6:33 AM
# Kubernetes deployment
1. git clone https://github.com/apache/pinot.git
2. cd pinot/kubernetes/helm
3. helm install -n pinot-test pinot .
Error: Chart.yaml file is missing
Should I run
helm install -n pinot-test pinot ./pinot
instead? Or am I making a mistake in the above steps, which are from the documentation?

    Grace Lu

    09/07/2021, 10:58 PM
Hello Pinot experts, I wonder if anyone here running Pinot on k8s in production has suggestions for a Pinot disaster recovery plan covering k8s cluster downtime. Assume we are in an environment with multiple k8s clusters running. Which of the following would you recommend to make Pinot resilient to a k8s-cluster-level outage or maintenance?
1. Setting up a Pinot cluster across multiple k8s environments, with each of them holding one set of data replication. (Not sure if this is feasible or easy to do.)
2. Setting up fully replicated, redundant Pinot clusters in different k8s environments, also replicating the data ingestion and anything we did in the main cluster. (Seems costly.)
3. Running Pinot in only one k8s cluster; in the case of a k8s cluster outage, rebuild the servers, controllers, and brokers in another healthy k8s cluster and let it pick up the old state from Kafka, ZooKeeper, S3, etc. (How hard is it for a newly built Pinot cluster to inherit and resume the old state?)
Any experience handling this in a prod environment is much appreciated 🙏🏻. Thanks in advance!

    Karin Wolok

    09/09/2021, 3:49 PM
👋 Please help us welcome our newest community members 👋 (also, help us celebrate reaching 1,700 members 🍷!! 🥳) ➡️ New members, please tell us who you are and what brought you here! How did you stumble across Apache Pinot? @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User @User

    Carl

    09/09/2021, 7:59 PM
Hi team, does Pinot currently support pagination with GROUP BY using “limit a,b”? If not, is there any plan to support this?
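(For context, the syntax in question is MySQL-style offset pagination over an aggregation. A sketch against a hypothetical `orders` table, illustrating what is being asked for rather than what Pinot is confirmed to support:

```sql
SELECT customer_id, SUM(amount) AS total
FROM orders
GROUP BY customer_id
ORDER BY total DESC
LIMIT 20, 10  -- skip the first 20 groups, return the next 10
```

Note that offset pagination over GROUP BY results is only meaningful with a deterministic ORDER BY.)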

    Ashish

    09/09/2021, 9:10 PM
Question on how GROUP BY aggregations are implemented in Pinot. From what I understand, group keys are generated from the group-by columns’ dictionary ids. How does this work for inter-segment and inter-server aggregations?

    prateek nigam

    09/14/2021, 12:25 PM
Hi Team, we need to do some preprocessing of the records before running the ingestion job (create Pinot segments outside and push them to the Pinot data store). What is the way in Pinot to run a Spark job (preprocessing) before running the actual ingestion job for the Pinot segments?

    RZ

    09/15/2021, 10:26 AM
Hello friends, I want to test the ThirdEye solution for Pinot anomaly detection, so I followed the documentation at https://docs.pinot.apache.org/integrations/thirdeye, but I failed to connect to http://localhost:1426/

    prateek nigam

    09/16/2021, 8:30 AM
Hello All, I am following https://docs.pinot.apache.org/basics/data-import/batch-ingestion/spark to configure a Spark job for ingestion. Can we give the input staging directory as an HDFS directory? I don't want to use the Hadoop utility jar to ingest data.

    Sheetal

    09/16/2021, 3:35 PM
Hi all, I want to insert a single record into Pinot from an application. Setting up Kafka for real-time ingestion seems too complicated for very-small-volume insert calls. Is there some other way?

    Dunith Dhanushka

    09/17/2021, 3:12 AM
    Hello folks, I’m looking for any articles, blog posts, or conference talks on using Pinot for real-time personalization. For example, how to use Pinot to do real-time content recommendations, product recommendations, etc. If you come across anything like that, please post them here :)

    Carl

    09/17/2021, 7:28 PM
Hi, does Pinot support queries like these: “select count(case when boolean_field then id else null end) as cnt” and “select * from table where nvl(field1, field2) < 100”?
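(For reference, the two queries spelled with standard-SQL equivalents, against a hypothetical table `t`; `COALESCE` is the portable spelling of `NVL`, and `COUNT` already skips NULLs, so the `ELSE NULL` is implicit:

```sql
SELECT COUNT(CASE WHEN boolean_field THEN id END) AS cnt FROM t;
SELECT * FROM t WHERE COALESCE(field1, field2) < 100;
```

Whether Pinot accepts these exact forms depends on the release in use.)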

    Bowen Wan

    09/17/2021, 8:48 PM
Hi, I was trying to use Spark to do batch ingestion. From this doc, https://docs.pinot.apache.org/users/tutorials/ingest-parquet-files-from-s3-using-spark, it seems Pinot supported Spark 2.x, at least as of Pinot 0.4. There seems to be a dependency issue when using Spark 2.2.3 as in the tutorial, but I was able to use Spark 2.4.8 to do the ingestion. Since the latest version of Spark is 3.x and Pinot is already at 0.8, I'm wondering what the currently recommended compatible Spark version is?

    Varun B

    09/18/2021, 6:49 PM
Can I do batch inserts, or does it need to be through Kafka? What would be the fastest way to do it?

    arun muralidharan

    09/19/2021, 3:28 PM
Hello Folks, Pinot is competing almost entirely in the same space as ClickHouse, yet there are not many direct comparisons between them, which is very surprising to me. I am designing an analytics pipeline + platform and have been investigating many big-data technologies. I initially selected ClickHouse for its installation simplicity, scale, and performance when tested with our dataset. I am now wondering what Pinot has to offer and would really love to evaluate it as well. If anybody else has been on this journey or has thoughts about the architectural and feature differences between these two systems, please do share. Thanks in advance!

    Dan DC

    09/20/2021, 9:39 AM
Hello there, the docs say that a shared volume is required for controllers if more than one is to be deployed. Can someone shed some light on why this is needed, instead of each controller having its own storage? Will all the controllers be active? Will they all write to the volume simultaneously? Any considerations we should take into account in a multi-controller environment? My deployment is on k8s.

    Map

    09/22/2021, 5:53 PM
Hi there, with 0.8.0 Pinot has complex type support, per https://docs.pinot.apache.org/basics/data-import/complex-type. My question is: can the unnested fields like “group.group_topics.urlkey” in the meetupRsvp example be supplied as arguments to transform functions? The answer seems to be no in my testing.

    Ken Krugler

    09/23/2021, 9:34 PM
    We have about 1500 segments in our HDFS deep store directory. We push these to our Pinot cluster via a metadata push, so only the URI is sent to the controller, which works well. But when we add a single new segment, our push job still has to download/untar all 1500 segments, because we can’t specify a pattern to filter the output directory files to only the new file. We could add per-month subdirectories in HDFS to restrict the number of files being processed this way, but is there a better approach? Note that the files in HDFS can’t be moved around, as their deep store URIs are part of the Zookeeper state.

    Sheetal

    09/23/2021, 10:21 PM
I have an everyday load into Pinot tables and am planning to create a segment every day. The problem is that on some days I expect very few records, so the resulting segment will be too small and will affect performance in the long run. One solution is to do periodic segment merges. Any suggestions?

    Weixiang Sun

    09/25/2021, 12:10 AM
Hi Pinot team, I am trying to create a realtime Pinot table ingesting data from a Kafka topic.
1. The Kafka stream data has two time columns: processed_at and created_at.
2. The processed_at column is in order inside the Kafka stream.
3. The created_at column is out of order inside the Kafka stream.
The retention of the realtime Pinot table depends on created_at. If we use created_at as timeColumnName, then since created_at can be very old, a lot of stale segments can be created. If we use processed_at as timeColumnName, a lot of old orders can live in the realtime table. Do you have any suggestions about which one to choose as timeColumnName?

    Rajesh Narayan

    09/27/2021, 2:32 PM
Hey Guys, I am looking to explore Pinot for some of our use cases. I'm also looking for enterprise-level support. Who can help?

    ನಾಗೇಶ್ ಸುಬ್ರಹ್ಮಣ್ಯ

    09/27/2021, 3:43 PM
I am new to Pinot. I am trying to understand whether a Pinot query in, say, a Java client (https://docs.pinot.apache.org/users/clients/java) can be made to work similarly to the KStream example (https://kafka.apache.org/28/documentation/streams/tutorial). That is, the KStream example does not "loop" to look for new messages to apply transformations, whereas it is not clear to me whether the Pinot example will keep running until stopped. My scenario is as follows: for every new message that arrives, I want a batch of records between [current_timestamp - 1 minute, current_timestamp], where current_timestamp corresponds to the most recently arrived message. Can a Pinot query client be written to run as soon as a new message arrives? Thanks.
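(The polling half of this scenario could be sketched as a query the client re-issues on each arrival, assuming a hypothetical `events` table with an epoch-millis column `event_ts`, and `ago()`/`now()` scalar functions available in recent Pinot releases. Pinot itself does not push results, so the trigger-on-arrival part would have to live in the consumer application:

```sql
SELECT *
FROM events
WHERE event_ts BETWEEN ago('PT1M') AND now()
LIMIT 1000
```
)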

    Ken Krugler

    09/27/2021, 9:12 PM
    I thought this was a good write-up of how TimescaleDB works with approximate percentiles (focusing on time-series data): https://blog.timescale.com/blog/how-percentile-approximation-works-and-why-its-more-useful-than-averages/

    Gaurav Jindal

    09/28/2021, 3:04 AM
I have a question on Pinot. In our company we capture user behavioral data as events in our application. Since it’s clickstream data, we are looking for a platform that enables ad-hoc queries on this data with low latency. Is Pinot a recommended option? Has anyone tried it? Product analytics tools such as Amplitude, Mixpanel, etc. also enable real-time analytics on clickstream data, so I'm wondering whether Pinot has similar or better technology for ad-hoc analysis of event data.

    Dan DC

    09/28/2021, 9:13 AM
Hello, it looks like the minion doesn't expose any Pinot metrics at the moment (release 0.8.0). I'm interested in monitoring task failures; is there an easy way to do this other than monitoring logs?

    Dan DC

    09/28/2021, 10:28 AM
I've also got a question about partitioning in hybrid tables. If I understand correctly, this only applies to offline tables. Does segmentPartitionConfig play together with the time column? The field only accepts one value at the moment, and I was wondering whether segments are generated using timeColumnName and further partitioned using segmentPartitionConfig? If I don't specify partitioning, are the segments effectively replicated to all the servers?

    Saoirse Amarteifio

    09/28/2021, 11:22 AM
Hello - is there a way to use the controller's REST interface to submit OFFLINE table ingestion tasks? Details:
• Parquet files on S3
• Schema and table spec already created on Pinot (which is deployed on K8s via Argo/Helm)
I have seen that an ingestion task can be triggered, e.g., using the scripts/utils in the Pinot distribution, but I would like to do this directly via REST commands. I would like to apply the SegmentCreationAndUriPush strategy, and I would either set up something running on a daily or hourly schedule or just trigger one-off tasks myself. Either works.

    Saoirse Amarteifio

    09/28/2021, 5:58 PM
Hello... I have a question related to Kafka and SSL specifically. I submitted schema and REALTIME table specs, but I can see that my SSL configuration is not correct. For a standard deployment of Pinot using Helm on K8s, I would like to understand where the SSL cert location is expected to be, so I can configure SSL correctly for my table. I'm adding a segment of my spec to the thread.