# getting-started

    Nickel Fang

    05/21/2024, 7:05 AM
    Hi, Pinot team. How can I add a new metric field to an existing realtime table? I get the error below when I add a new metric field.
    Copy code
    "error": "Invalid schema: **. Reason: Schema is incompatible with tableConfig with name: ** and type: REALTIME"
    Thanks!
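A minimal sketch of the usual flow for this (assuming a controller at localhost:9000; the table name, field names, and default value are illustrative, not from the thread): new columns need a default value, the schema is updated in place, and segments are then reloaded so the new column is default-filled in existing data.

```shell
# Illustrative schema: "clicks" is the newly added metric; the other fields
# are hypothetical stand-ins for the real schema.
cat > /tmp/myTable-schema.json <<'EOF'
{
  "schemaName": "myTable",
  "dimensionFieldSpecs": [{"name": "country", "dataType": "STRING"}],
  "metricFieldSpecs": [
    {"name": "impressions", "dataType": "LONG"},
    {"name": "clicks", "dataType": "LONG", "defaultNullValue": 0}
  ],
  "dateTimeFieldSpecs": [{
    "name": "ts",
    "dataType": "LONG",
    "format": "1:MILLISECONDS:EPOCH",
    "granularity": "1:MILLISECONDS"
  }]
}
EOF
python3 -m json.tool < /tmp/myTable-schema.json > /dev/null && echo "schema json ok"

# Against a live cluster (controller assumed at localhost:9000):
# curl -X PUT "http://localhost:9000/schemas/myTable" \
#      -H "Content-Type: application/json" -d @/tmp/myTable-schema.json
# curl -X POST "http://localhost:9000/segments/myTable_REALTIME/reload"
```

The "incompatible with tableConfig" error often points at something else (e.g. a time-column or table-name mismatch), so check those before assuming the new field is the problem.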

    Amit Singh

    05/24/2024, 8:27 AM
    Hi team, what's the right way to increase the retention period for a realtime table that is already running?
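A hedged sketch of the usual approach (table name and values are illustrative): retention lives in the table config under `segmentsConfig`, and the controller's periodic RetentionManager applies the new value on its next run.

```shell
# Illustrative segmentsConfig fragment: merge these values into the full
# table config and PUT that back; do not PUT the fragment alone.
cat > /tmp/retention-snippet.json <<'EOF'
{
  "segmentsConfig": {
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "90"
  }
}
EOF
python3 -m json.tool < /tmp/retention-snippet.json > /dev/null && echo "snippet ok"

# Against a live cluster (controller assumed at localhost:9000):
# curl "http://localhost:9000/tables/myTable" > /tmp/tableconfig.json
# ...edit retentionTimeUnit/retentionTimeValue inside the REALTIME config...
# curl -X PUT "http://localhost:9000/tables/myTable" \
#      -H "Content-Type: application/json" -d @/tmp/tableconfig.json
```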

    Amit Singh

    05/24/2024, 1:19 PM
    Hey team, I have restarted the server, broker, and controller via `helm upgrade` (memory upgraded), and after that the broker is giving errors:
    Copy code
    2024/05/24 13:08:23.908 INFO [ClusterChangeMediator] [ClusterChangeHandlingThread] Finish processing EXTERNAL_VIEW change in 16ms
    2024/05/24 13:08:28.618 INFO [ClusterChangeMediator] [ZkClient-EventThread-70-10.108.97.183:2181] Enqueueing EXTERNAL_VIEW change
    2024/05/24 13:08:28.619 INFO [ClusterChangeMediator] [ClusterChangeHandlingThread] Start processing EXTERNAL_VIEW change
    2024/05/24 13:08:28.619 INFO [BrokerRoutingManager] [ClusterChangeHandlingThread] Processing segment assignment change
    2024/05/24 13:08:28.633 INFO [BaseInstanceSelector] [ClusterChangeHandlingThread] Got 0 new segments: {} for table: CaRealTimeEvent_REALTIME by processing existing states, current time: 1716556108633
    2024/05/24 13:08:28.633 WARN [BaseInstanceSelector] [ClusterChangeHandlingThread] Failed to find servers hosting old segment: CaRealTimeEvent__4__1__20240522T0858Z for table: CaRealTimeEvent_REALTIME (all candidate instances: [] are disabled, counting segment as unavailable)
    2024/05/24 13:08:28.634 WARN [BaseInstanceSelector] [ClusterChangeHandlingThread] Failed to find servers hosting old segment: CaRealTimeEvent__0__6__20240523T1235Z for table: CaRealTimeEvent_REALTIME (all candidate instances: [] are disabled, counting segment as unavailable)
    2024/05/24 13:08:28.634 WARN [BaseInstanceSelector] [ClusterChangeHandlingThread] Failed to find servers hosting old segment: CaRealTimeEvent__4__0__20240522T0801Z for table: CaRealTimeEvent_REALTIME (all candidate instances: [] are disabled, counting segment as unavailable)
    What could be the possible cause? A few of the segments are in BAD status; even after waiting for an hour the segments are still bad, and a segment reload doesn't help.
    Edit: below is the info from ZooKeeper. Under INSTANCES => server-0:
    Copy code
    "mapFields": {
      "CaRealTimeEvent__1__0__20240522T0801Z": {
        "CURRENT_STATE": "ERROR",
        "END_TIME": "1716555897488",
        "INFO": "",
        "PREVIOUS_STATE": "OFFLINE",
        "START_TIME": "1716555897422",
        "TRIGGERED_BY": "pinot-controller-0.pinot-controller-headless.gci-pinot.svc.cluster.local_9000"
      },
      "CaRealTimeEvent__1__1__20240522T0858Z": {
        "CURRENT_STATE": "ERROR",
        "END_TIME": "1716555897489",
        "INFO": "",
        "PREVIOUS_STATE": "OFFLINE",
        "START_TIME": "1716555897422",
        "TRIGGERED_BY": "pinot-controller-0.pinot-controller-headless.gci-pinot.svc.cluster.local_9000"
      },
    But under LIVEINSTANCES -> SERVER-0:
    Copy code
    { "mapFields": {}, "listFields": {} }
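One hedged avenue to try (table and segment names taken from the logs above; controller host is illustrative, and this is not a confirmed diagnosis): segments stuck in ERROR after a rolling restart can often be recovered with the controller's segment reset API, which replays the OFFLINE -> ONLINE transition, rather than a reload.

```shell
# Build the reset call for the table from the logs (controller host assumed).
TABLE="CaRealTimeEvent_REALTIME"
cat > /tmp/reset-segments.sh <<EOF
curl -X POST "http://localhost:9000/segments/${TABLE}/reset"
EOF
cat /tmp/reset-segments.sh

# Worth checking first: server-side error details for one stuck segment, e.g.
# curl "http://localhost:9000/segments/${TABLE}/CaRealTimeEvent__1__0__20240522T0801Z/metadata"
```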

    Benito

    05/31/2024, 1:54 PM
    Hi, I have just landed in Pinot. My use case is storing time series of metrics in real time. Is there any doc (or thread here) describing best practices for doing that? E.g. how to get the values back sorted even though out-of-order arrival can happen, and how to define the schema and/or the table?
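Not official best-practice guidance, but a common starting point (all names here are hypothetical): model the event time as a `dateTimeFieldSpec` and let `ORDER BY` handle out-of-order arrival at query time.

```shell
# Illustrative schema for per-metric time series; names are hypothetical.
cat > /tmp/ts-schema.json <<'EOF'
{
  "schemaName": "metrics",
  "dimensionFieldSpecs": [{"name": "metric_name", "dataType": "STRING"}],
  "metricFieldSpecs": [{"name": "value", "dataType": "DOUBLE"}],
  "dateTimeFieldSpecs": [{
    "name": "ts",
    "dataType": "TIMESTAMP",
    "format": "1:MILLISECONDS:TIMESTAMP",
    "granularity": "1:MILLISECONDS"
  }]
}
EOF
python3 -m json.tool < /tmp/ts-schema.json > /dev/null && echo "schema ok"

# Out-of-order arrival is then handled at query time, e.g.:
#   SELECT ts, value FROM metrics WHERE metric_name = 'cpu' ORDER BY ts
```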

    Shubham

    06/05/2024, 6:02 AM
    Hi all, getting this error while creating a schema; Kafka is running locally.

    Shobhita Agarwal

    06/10/2024, 9:22 AM
    Hi team, I am looking for help assessing whether Pinot is right for our production use case, as below. We process data from various tenants and store and consume it in two ways:
    1. user-facing analytics: read queries with filters, search, and aggregation on tenant data
    2. BI and ML: jobs that run complex queries across tenants (currently run on Databricks in a medallion architecture)
    We have a data size of ~10 TB and currently use MongoDB and Databricks for these use cases. Can we write data once to Pinot, use it for both of the above use cases, and get rid of MongoDB and Databricks completely? Can we create bronze, silver, and gold layers of the data (per Databricks terminology) on Pinot?

    Benito

    06/10/2024, 11:18 AM
    Hi team. I have a Java process that needs to retrieve data from Apache Pinot. The queries, up to now, are not very complicated. Which is the recommended way to go: REST, JDBC, or the native Java client?

    Benito

    06/10/2024, 11:18 AM
    The result of the queries can be very large in terms of the number of records.
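A hedged sketch of the plain REST option (broker host/port and table name are illustrative); for very large results it is usually better to bound each request with LIMIT than to pull everything in one response:

```shell
# Compose a bounded query payload for the broker's SQL endpoint.
printf '{"sql": "select * from myTable limit 1000"}' > /tmp/query.json
python3 -m json.tool < /tmp/query.json > /dev/null && echo "payload ok"

# Against a live cluster (broker assumed at localhost:8099):
# curl -X POST "http://localhost:8099/query/sql" \
#      -H "Content-Type: application/json" -d @/tmp/query.json
```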

    Vojtech Honzik

    06/11/2024, 7:18 AM
    Hi team, I am trying to read Pinot offline table data from Spark 3.5.1 and got stuck with pinot-spark-3-connector, as the https scheme is not supported yet. What is the best practice to read/write Pinot data within a Spark Java job, if not by using pinot-spark-3-connector? And for pinot-spark-3-connector, how do I pass basic-auth credentials to the Pinot controller? There is no such option in the documentation (https://github.com/apache/pinot/blob/master/pinot-connectors/pinot-spark-3-connector/documentation/read_model.md). The best error message I got so far is:
    Copy code
    24/06/11 09:15:49 INFO RetryExec: I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://node2.clouderalab.test:9443: The target server failed to respond
    Thank you, these are my first steps in Pinot and it still seems a bit complicated.

    Enda Sexton

    06/13/2024, 11:56 AM
    Anyone have a guide for exporting to ADLS from Pinot with a minion? I had one and lost it.

    raghav

    06/18/2024, 1:54 PM
    Hey, I am trying to ingest datasketches into Pinot, but after ingestion percentileKLL is not giving correct results. Is there an example of what the table config/schema should be, and of how to convert the KLL sketch object from the Apache DataSketches lib? Thanks

    Kiril Kalchev

    06/18/2024, 6:14 PM
    Hey. Does anyone know how to start the Minion Task Manager? If I start the cluster manually I run ZooKeeper, controller, broker, and minion, but I don't see the Minion Task Manager in the dashboard.
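A hedged sketch (file path illustrative): the task manager runs inside the controller rather than as a separate process, and per the Pinot minion docs its scheduler is enabled through a controller config property.

```shell
# Append the scheduler property to the controller config (path illustrative)
# and restart the controller; tasks then show up under the task-manager APIs.
CONF=/tmp/pinot-controller.conf
echo "controller.task.scheduler.enabled=true" >> "$CONF"
grep 'controller.task.scheduler.enabled' "$CONF"
```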

    raghav

    06/19/2024, 12:03 PM
    Hey, we have a use case where we need to ingest KLL sketches into Pinot. Currently we ingest raw data into Pinot, from which we will create the sketch. Is it possible/recommended to create the sketches inside Pinot as a minion task, or should we ingest prebuilt sketches? Thanks!

    Hugo Gonçalves

    06/22/2024, 5:30 PM
    Hello everyone, Quick question, can Pinot ingest the record timestamp from Kafka or does the timestamp have to be in the record value? I'm using protobuf serialized values and trusting Kafka's creation timestamp. Thanks, HG
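A hedged sketch, to be verified against the docs for your Pinot version: newer releases can populate Kafka record metadata (offset, key, record timestamp) as ingestable fields when `stream.kafka.metadata.populate` is enabled in the stream configs.

```shell
# Illustrative streamConfigs fragment; the resulting metadata column names
# (e.g. __metadata$recordTimestamp) are version-dependent, so verify first.
cat > /tmp/stream-metadata-snippet.json <<'EOF'
{
  "streamConfigs": {
    "stream.kafka.metadata.populate": "true"
  }
}
EOF
python3 -m json.tool < /tmp/stream-metadata-snippet.json > /dev/null && echo "ok"
```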

    Enda Sexton

    06/23/2024, 3:12 AM
    If I want to scale storage to store more segments do I scale the server pvc or controller?

    Mahesh Venugopal

    07/03/2024, 5:31 PM
    Hello, I was exploring different databases for an analytical use case at work. I started exploring Apache Pinot last week and came across the special Star-Tree index, which appeared fascinating to me and seemed a great fit to drive the insights queries for the use case.

    The use case is showing insights based on the performance of ads and the tags associated with them. We pull the ad performance metrics daily via a cron task from the ad platform, generate tags from the ad assets using GPT APIs, and store the ad metrics against the generated tags, duplicated per ad and vice versa. One use case is showing the tag performance metrics (impressions, spend, ctr, cpm, cpa, roas, cpc, etc.) across the sector, spanning multiple workspaces within the sector and not just the current workspace, and it could span around 600 million rows. The performance data we currently have is not exactly incremental but cumulative metrics that we pull daily, though I am testing with both incremental and cumulative data to see how the Star-Tree index performs.

    The queries will have both pre- and post-aggregation filters, and post-aggregation sort as well. For example, with incremental data, given a time-period window, we first sum the metrics for a tag across different days (within the window) for the same ad, and then average that across different ads for the same tag to obtain the tag metrics for that period.

    For the tests I am performing, I have omitted the post-filtering and sort requirements, just to get familiar with the Star-Tree index and see how it performs in the ideal, most favourable case for the index. So although the actual query is different, the query I am trying to optimise with the Star-Tree index is:
    Copy code
    select sum(impressions) as t_impressions, sum(spend) as t_spend, channel, arraySliceString(split(distinct_tag_group, ':'), 0, 1) as ad_id,
     arraySliceString(split(distinct_tag_group, ':'), 1, 2) as tag_type,
      arraySliceString(split(distinct_tag_group, ':'), 2, 3) as tag
    from large_table
    where created_date >= '2024-03-01' and created_date < '2024-06-01'
       and ad_start_date < '2024-06-01'
     group by distinct_tag_group, channel
    Here created_date is the time-series date, ad_start_date is when the ad went live, channel is the platform we pull the ad from, and distinct_tag_group is a derived column concatenating ad_id, tag_type, and tag. I originally had those as separate columns/dimensions for the index, but since they are only needed in the grouping to uniquely identify a tag (not for filtering), and too many levels in the tree would affect performance, I decided to combine them.

    The query was taking anywhere between 800 ms and 1 s in almost all cases, across different configurations of segment size and maxLeafRecords. There are also more filter columns that are not included in the tests.

    Based on my tests, my understanding/observation is that Star-Tree index performance depends on (but is not limited to):
    i. segment size / number of segments
    ii. the value of maxLeafRecords
    iii. the number of dimension columns/levels in the star tree
    iv. the cardinality of the dimension columns

    The most recent star-tree index config used is the one shown in the sample response below. The cardinality of distinct_tag_group was 50,000, delivery_start_date and created_date were around 100, and channel was just 6; the scan size for the query was close to 120 million records, with segments of 7.5 million records. I tried different segment sizes and maxLeafRecords values, but the response time hovered between 800 ms and a little over a second. I understand that one big reason for the slowness would be the high cardinality, i.e. the number of results returned by the query given the grouping condition.

    Please validate whether my understanding and observations make sense, and advise whether this is expected or there are ways to optimise this use case with the Star-Tree index: what would be optimum values for segment size and maxLeafRecords (understanding there is no one-size-fits-all), and should they be relative to the total number of records, or at least to the number of records the queries need to scan? Including one sample response from the different config combinations tried:
    Copy code
    With data of 120M records with segments of 7.5M records and star tree index of form 
    
    "starTreeIndexConfigs": [
            {
              "dimensionsSplitOrder": [
                "distinct_tag_group",
                "ad_start_date",
                "created_date",
                "channel"
              ],
              "functionColumnPairs": [
                "SUM__impressions",
                "SUM__spend",
                "SUM__ctr",
                "SUM__cpc",
                "SUM__cpa",
                "SUM__cpm",
                "SUM__roas",
                "COUNT__*",
                "MAX__created_at"
              ],
              "maxLeafRecords": 10000
            }
          ],
          
          No. of segments: 18/18
          Avg Segment size: 479MB
          Storage size: 14.45GB
    
    Query:
     select sum(impressions) as t_impressions, sum(spend) as t_spend, channel, arraySliceString(split(distinct_tag_group, ':'), 0, 1) as ad_id,
     arraySliceString(split(distinct_tag_group, ':'), 1, 2) as tag_type,
      arraySliceString(split(distinct_tag_group, ':'), 2, 3) as tag
    from large_table
    where created_date >= '2024-03-01' and created_date < '2024-06-01'
       and delivery_start_date < '2024-06-01'
     group by distinct_tag_group, channel
     
     Query stats:
     timeUsedMs: 833
     numDocsScanned: 81960042
     totalDocs: 120000000
     numServersQueried: 1
     numServersResponded: 1
     numSegmentsQueried: 18
     numSegmentsProcessed: 16
     numSegmentsMatched: 16
     numConsumingSegmentsQueried: 2
     numEntriesScannedInFilter: 219123287
     numEntriesScannedPostFilter: 327840168
     numGroupsLimitReached: false
     partialResponse: -
     minConsumingFreshnessTimeMs: 1720002700442
     offlineThreadCpuTimeNs: 0
     realtimeThreadCpuTimeNs: 0
     offlineSystemActivitiesCpuTimeNs: 0
     realtimeSystemActivitiesCpuTimeNs: 0
     offlineResponseSerializationCpuTimeNs: 0
     realtimeResponseSerializationCpuTimeNs: 0
     offlineTotalCpuTimeNs: 0
     realtimeTotalCpuTimeNs: 0

    Mahesh Venugopal

    07/06/2024, 5:50 AM
    Hi @Mayank, thanks a lot for the discussion we had the other day. I am attaching the different queries relating to the same single query, the response metadata for each, and the explain-plan result as well. Kindly go through them and advise on next steps at your convenience.
    (attachment: PinotFreshMetrics)

    Vineeth Modon

    07/15/2024, 7:02 AM
    Trying to configure S3 as deep store. Segments are getting created in S3 in a loop: a single segment is multiplied and stored with unique file extensions. Appreciate any pointers on this problem.
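For reference, a sketch of the controller-side S3 deep-store properties from the standard setup (bucket, region, and paths are illustrative; the servers need matching pinot.server.* settings). A mismatch between these and the server-side config is a common source of segment-upload oddities.

```shell
# Illustrative controller properties for an S3 deep store.
cat > /tmp/controller-s3.conf <<'EOF'
controller.data.dir=s3://my-pinot-bucket/pinot-data/controller
controller.local.temp.dir=/tmp/pinot-tmp-data
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.controller.storage.factory.s3.region=us-east-1
pinot.controller.segment.fetcher.protocols=file,http,s3
pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
EOF
grep 'S3PinotFS' /tmp/controller-s3.conf
```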

    Vineeth Modon

    07/15/2024, 9:39 AM
    This is how my S3 bucket looks, while in ZK only this segment is created: "id": "Clicks__0__0__20240715T0917Z"

    Vineeth Modon

    07/16/2024, 11:24 AM
    https://github.com/apache/pinot/issues/12264 — this issue relates to what I am facing.

    Indrajeet Ray

    07/18/2024, 4:06 PM
    Need some guidance on the best practice for setting up disaster recovery with Apache Pinot. 1. We could have multiple instances of controllers, brokers, servers, and ZooKeeper at different sites, all part of the same cluster, and use a replication factor that ensures all data is replicated to at least one server at each site; or

    Indrajeet Ray

    07/18/2024, 4:07 PM
    or... we set up completely independent clusters and have some mechanism to replicate the data across them.

    Indrajeet Ray

    07/18/2024, 4:08 PM
    Can we get a new set of Pinot components instantiated at a new site, with some way to get the old data to the new site too?

    Matias Guerson

    07/21/2024, 10:28 PM
    Hi community, how are you? I have a question. I'm testing Pinot with 3 particular analytical queries. When I check the plans and trace the queries, I can see that they use the star-tree index and that most of the time is spent in the GroupByCombineOperator. I guess this makes sense, since I have some segments per day and I'm trying to get metrics for different time ranges (7d, 30d, 90d). As I increase the number of concurrent requests, the server logs show that totalExecMs remains practically constant and below 100 ms, while schedulerWaitMs starts to increase, reaching a second and then remaining stable around that value. Checking cluster resources, in particular the CPU usage and memory of the brokers and servers, I see that the server CPU usage reaches its limit, but only after executing for about a minute at around 100 QPS (I have two servers with 14 cores assigned to each). I saw this article, https://startree.ai/blog/capacity-planning-in-apache-pinot-part-1 , and even though I understand it is a guideline, my query execution time is below 100 ms and I'm assigning much more than the suggested 4 cores in order to process around 100 QPS. Is there any configuration or advice to improve the throughput?

    Baseer Baheer

    07/24/2024, 7:43 AM
    This command returns "image not found". May I know what image I can use for an Apple M2 chip?
    Copy code
    docker run -p 9000:9000 \
    apachepinot/pinot:1.1.0-arm64 \
    QuickStart -type hybrid

    ulagaraja j

    08/07/2024, 5:10 AM
    Hey all, we have configured Pinot and it's working fine now. We want to scale the Pinot servers. We have deployed in GKE using Helm; how do we scale the servers? Should we scale with HPA or VPA, and is there proper documentation on this? We tried HPA for the servers and they scaled up, but when we cleared the Kafka topics and started from scratch, those servers kept running; the pods are not scaling down.
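A hedged sketch of the manual-scaling route (release name, replica count, and table name are illustrative): Pinot servers are stateful, so rather than relying on HPA, bump the server replica count through Helm values and then rebalance tables so segment assignment reflects the new server set.

```shell
# Illustrative Helm values bump for the server StatefulSet.
cat > /tmp/pinot-server-values.yaml <<'EOF'
server:
  replicaCount: 5
EOF
cat /tmp/pinot-server-values.yaml

# Then:
# helm upgrade pinot pinot/pinot -n pinot -f /tmp/pinot-server-values.yaml --reuse-values
# and rebalance each table so segments are assigned to the new servers:
# curl -X POST "http://localhost:9000/tables/myTable/rebalance?type=REALTIME&downtime=false"
# Before scaling DOWN, rebalance first so no replica lives only on the servers
# being removed -- this is also why HPA alone won't shrink the pods safely.
```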

    Islombek Toshev

    08/07/2024, 10:36 AM
    👋 Hello, everyone! Can I ask questions in this channel?

    Peter Corless

    08/07/2024, 5:54 PM
    Any time you mention "Aggregations" it's a good use case for Apache Pinot. You can also do query-time JOINs with Pinot.

    Peter Corless

    08/07/2024, 5:55 PM
    Check out this blog (and video): https://startree.ai/blog/query-time-joins-in-apache-pinot-1-0

    Sharon

    08/08/2024, 1:12 AM
    Hi all. I have looked at many sites trying to find out how data is stored in Pinot. A few sites say the table data is stored in memory, and others talk about memory-mapped disk storage. Could someone please clarify which mechanism is used internally to store the table record data?