# random
  • v

    Vik Gamov

    03/25/2024, 2:50 PM
    Started to play with Kotlin Notebooks (the Kotlin kernel for Jupyter notebooks) using the DataFrame library and wanted to integrate it with Pinot. Apparently they require explicit support for it. Could you kindly vote for this issue: https://github.com/Kotlin/dataframe/issues/637
    👀 2
  • v

    Vansh Goel

    03/29/2024, 2:31 PM
    TimeoutException: Timeout expired while fetching topic metadata
    [2024-03-29 12:17:53,773] ERROR [Worker clientId=connect-2, groupId=A-mm2] Uncaught exception in herder work thread, exiting: (org.apache.kafka.connect.runtime.distributed.DistributedHerder:324) org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata
    [2024-03-29 12:17:53,775] INFO Kafka MirrorMaker stopping (org.apache.kafka.connect.mirror.MirrorMaker:184)
    [2024-03-29 12:17:53,776] INFO [Worker clientId=connect-1, groupId=B-mm2] Herder stopping (org.apache.kafka.connect.runtime.distributed.DistributedHerder:718)
    [2024-03-29 12:17:53,776] INFO [Worker clientId=connect-1, groupId=B-mm2] Stopping connectors and tasks that are still assigned to this worker. (org.apache.kafka.connect.runtime.distributed.DistributedHerder:683)
    [2024-03-29 12:17:53,778] INFO Stopping connector MirrorHeartbeatConnector (org.apache.kafka.connect.runtime.Worker:387)
    [2024-03-29 12:17:53,778] INFO Scheduled shutdown for WorkerConnector{id=MirrorHeartbeatConnector} (org.apache.kafka.connect.runtime.WorkerConnector:248)
    [2024-03-29 12:17:53,778] INFO Stopping connector MirrorSourceConnector (org.apache.kafka.connect.runtime.Worker:387)
    [2024-03-29 12:17:53,779] INFO Scheduled shutdown for WorkerConnector{id=MirrorSourceConnector} (org.apache.kafka.connect.runtime.WorkerConnector:248)
    [2024-03-29 12:17:53,779] INFO Stopping connector MirrorCheckpointConnector (org.apache.kafka.connect.runtime.Worker:387)
    [2024-03-29 12:17:53,782] INFO Scheduled shutdown for WorkerConnector{id=MirrorCheckpointConnector} (org.apache.kafka.connect.runtime.WorkerConnector:248)
    [2024-03-29 12:17:53,784] INFO Completed shutdown for WorkerConnector{id=MirrorCheckpointConnector} (org.apache.kafka.connect.runtime.WorkerConnector:268)
    [2024-03-29 12:17:53,785] INFO Completed shutdown for WorkerConnector{id=MirrorSourceConnector} (org.apache.kafka.connect.runtime.WorkerConnector:268)
    [2024-03-29 12:17:53,792] INFO Completed shutdown for WorkerConnector{id=MirrorHeartbeatConnector} (org.apache.kafka.connect.runtime.WorkerConnector:268)
    [2024-03-29 12:17:53,792] INFO Stopping task MirrorHeartbeatConnector-0 (org.apache.kafka.connect.runtime.Worker:836)
    The MirrorMaker Connect configuration is the usual one for unidirectional replication from cluster A to cluster B. Replication was working earlier, but now I am facing this error. The issue isn't related to SASL or the JDK; I have checked that. No SSL protocol is implemented; the config uses PLAINTEXT. Please help resolve. @Xiang Fu Can you please help?
    x
    h
    • 3
    • 3
  • h

    Harish Bohara

    04/03/2024, 8:09 AM
    Why do we see the table status as "BAD" in the Pinot controller view, with a diff between "Ideal State = ONLINE" and "External View = OFFLINE"? Any idea why this happens? • All controllers, brokers, and servers show as Alive.
    h
    c
    • 3
    • 2
  • v

    Vishal Bisht

    04/11/2024, 10:08 AM
    I am using ZkBasicAuthAccessControlFactory for authentication. Is there a way we can change the default admin password for this?
    e
    • 2
    • 1
  • p

    prasanna

    04/26/2024, 9:07 AM
    Hi all, does Pinot support logging log4j2 messages in JSON format? We are using the Pinot Docker image and for some reason cannot maintain a separate image with the JSON-format changes. I tried replacing the log4j2.xml properties with JSON-layout ones for the server component, but the logs still do not come out in JSON format. If this works, we plan to replace the existing log4j2 config with the required settings via a ConfigMap or a similar mechanism. Has anyone tried something like this?
    x
    • 2
    • 1
  • z

    Zach Lorimer

    04/30/2024, 9:01 PM
    Just curious, can Pinot use workload identity with GCS instead of a JSON key?
    k
    • 2
    • 4
  • a

    ayush

    05/02/2024, 5:37 AM
    Hi All, Can we have a TTL for segments of Offline tables also?
    h
    • 2
    • 2
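    A minimal sketch of the relevant settings, assuming the question is about table-level retention: offline tables accept the same segmentsConfig retention fields as realtime ones, and the controller's retention manager drops segments whose time range falls outside the window. Written here as a Python-style dict to match the config shared later in this channel; the time column name is an assumption.
    Copy code
    # Hedged sketch: table-level retention for an OFFLINE table via segmentsConfig.
    # The retention manager purges segments older than the configured window;
    # the time column name below is illustrative, not from the thread.
    segments_config = {
        "timeColumnName": "event_time_millis",
        "retentionTimeUnit": "DAYS",
        "retentionTimeValue": "30",
        "replication": "1",
    }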
  • a

    Ashley Allen

    05/02/2024, 2:40 PM
    Has anyone tried using an object relational mapper or persistence layer with Pinot? This would be for a read only use case. Ingestion and upserts would use the utilities / framework provided with Pinot.
    h
    • 2
    • 1
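    Not an ORM as such, but for a read-only path a lighter option is the community pinotdb DB-API driver (which also ships a SQLAlchemy dialect). A minimal sketch, assuming a broker at localhost:8099 and an illustrative table name:
    Copy code
    # Hedged sketch: read-only queries through the `pinotdb` DB-API driver.
    # Broker host/port and the table name are assumptions.
    from pinotdb import connect

    conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
    curs = conn.cursor()
    curs.execute("SELECT Category, COUNT(*) FROM tickets GROUP BY Category LIMIT 10")
    print(curs.fetchall())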
  • r

    Rashpal Singh

    05/02/2024, 6:11 PM
    Hi all, I have a quick question: can I somehow enable the Helix logs in Pinot 1.1?
    h
    • 2
    • 1
  • p

    piby

    05/12/2024, 6:31 PM
    Hey all! Can we have different retention times for different segments based on a filename pattern? The background is that we don't like the MergeRollup task, because we then no longer have a 1:1 segment-to-file mapping; with merge rollup we lose the freedom to replace a file and regenerate a segment with the same name. We want to merge our daily files into one monthly file ourselves and push it to the same table, and then have different retention periods for the daily and monthly files.
    k
    • 2
    • 1
  • y

    Yusuf Nar

    07/15/2024, 10:23 AM
    Hi, what options do we have to connect to Pinot from Ruby clients?
    h
    • 2
    • 1
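    If no dedicated Ruby client is available, one option is that the broker's query endpoint is plain HTTP + JSON, so any HTTP client (Ruby's Net::HTTP included) can use it directly. A sketch of the request shape, shown here in Python with an assumed broker URL and table name:
    Copy code
    # Hedged sketch of the broker's HTTP query contract; the same POST can be
    # issued from Ruby. Broker URL and table name are assumptions.
    import requests

    resp = requests.post(
        "http://localhost:8099/query/sql",
        json={"sql": "SELECT COUNT(*) FROM myTable"},
        timeout=10,
    )
    print(resp.json()["resultTable"]["rows"])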
  • d

    Dor Levi

    07/30/2024, 9:31 PM
    appreciation post: Earlier today we launched our product www.sim.io (post). Pinot has been pretty central to us and will be to our users. Thanks @Kishore G @Mayank @Jackie and the rest of the team for your support.
    🍷 11
    m
    p
    x
    • 4
    • 4
  • p

    Philippe Noël

    08/08/2024, 11:54 PM
    For those interested in search, we wrote a post on our learnings on full text search in Postgres vs Elastic: https://blog.paradedb.com/pages/elasticsearch_vs_postgres
  • s

    Sumitra Saksham

    09/03/2024, 5:30 PM
    Hi everyone, I have a question: how should I decide (1) the number of rows, (2) the segment size, and (3) the flush threshold time? What should these settings be based on?
    b
    • 2
    • 2
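    For reference, these three knobs map to the streamConfigs flush thresholds. A minimal sketch with illustrative values; per the docs, setting the row threshold to 0 lets Pinot auto-size segments toward the target segment size instead of a fixed row count:
    Copy code
    # Hedged sketch of the flush-threshold settings being asked about; the
    # values are illustrative, not recommendations.
    stream_configs = {
        "realtime.segment.flush.threshold.rows": "0",            # 0 = auto-tune by segment size
        "realtime.segment.flush.threshold.time": "24h",          # commit at least once a day
        "realtime.segment.flush.threshold.segment.size": "300M",
    }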
  • s

    Sumitra Saksham

    09/04/2024, 2:24 PM
    I have one question. Say my consuming segment configuration is set to certain values: 1. size: 300 MB, 2. rows: 500k, 3. time: 24 hours. Now I want to lower the row threshold to 100k, and I make that change. Will a consuming segment that has already consumed more than that number of rows commit? And if so, does that only happen after I rebalance the servers?
    j
    m
    • 3
    • 5
  • h

    Hassan Ait Brik

    10/09/2024, 12:54 PM
    Hi community, I'm currently using Pinot 1.0.0 with a Kafka real-time table setup, and I have a use case where I would like to trigger an action immediately after a message is consumed by the Pinot real-time table. Is there a way for Pinot to acknowledge the consumption of a Kafka message, allowing me to execute a specific action upon successful ingestion of that message? Any suggestions or best practices on how to achieve this would be greatly appreciated!
    m
    x
    • 3
    • 5
  • a

    Ayoade Abel Adegbite

    11/09/2024, 11:27 PM
    https://behindthedata.substack.com/p/issue-004-meet-chris-love?r=4jib1l&utm_campaign=post&utm_medium=web&triedRedirect=true
  • c

    Cristina Munteanu

    11/13/2024, 9:40 PM
    📣 Hello all! The Open Source Analytics Conference 2024 is next week! 🚀 Some really cool talks on databases, orchestration, and BI/visualization tools! • When: Nov 19-21 • Where: Online! • More info: osacon.io Hope to see you there!
  • i

    Indira Vashisth

    11/15/2024, 2:20 AM
    Hi, does anyone know of ways to dynamically construct SQL queries for Pinot? Our data is spread across multiple tables, and we're looking for an efficient way to query it. Thanks!
    f
    f
    • 3
    • 2
  • y

    Yusuf Nar

    11/28/2024, 7:13 PM
    Hey all! We use a relational database as the source of truth, and we are looking for options to replicate the data to Pinot as close to real time as possible. We may need some transformations in between. We don't have a "large" dataset, mostly under 1 TB. What options do we have? As a side note, we want to avoid CDC.
    k
    x
    • 3
    • 10
  • a

    Akhil Dubey

    01/05/2025, 7:38 AM
    Hi everyone, I am new here and just want to explore the features of Pinot. I have a use case where customers' real-time transactions arrive in Kafka, and I would like to show the total transaction amount and count for the last few seconds/minutes/hours on a dashboard. How can I achieve this with Apache Pinot?
    m
    a
    p
    • 4
    • 3
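    A minimal sketch of the kind of query a dashboard could run against the broker for this, assuming a realtime table named transactions with an amount column and an epoch-millis time column (all names are assumptions):
    Copy code
    # Hedged sketch: rolling "last minute" totals from the broker. Table and
    # column names (transactions, txn_amount, event_time_millis) are assumptions.
    import requests

    sql = """
        SELECT COUNT(*) AS txn_count, SUM(txn_amount) AS txn_total
        FROM transactions
        WHERE event_time_millis > ago('PT1M')
    """
    resp = requests.post("http://localhost:8099/query/sql", json={"sql": sql}, timeout=10)
    print(resp.json()["resultTable"]["rows"])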
  • s

    Saurabh Lambe

    01/23/2025, 3:24 AM
    @Max
  • n

    Navdeep Gaur

    03/11/2025, 11:18 AM
    Cheers, Pinot 🍷 aficionados! I am trying to convince people in my org to consider Pinot for an analytics platform that can serve results directly to a user-facing app (web/iOS/Android).
    Description: For an online marketplace, the use cases we are trying to achieve are: 1. view counts for a post, 2. the count of connection requests that originated after viewing a particular post, 3. a list of the 10 most recently viewed posts. We intend to show the view count each time a post is served. The source for this information is a stream of events saying which user viewed which post.
    Load details: We get 25,000 such events every minute; that is also roughly the query traffic we expect against this data.
    Questions: • Is Pinot even a good candidate for just these three use cases? (From what I have read so far, use case 3 is not possible with sub-second performance without a key-value store to serve the data, but the other two are achievable.) • If someone has built this kind of use case, can you please correct any incorrect assumptions I am making? • What kind of cost can we estimate for a Pinot cluster holding this data? I am told the ingested stream will total under 1 TB over a year. • I saw StarTree pricing, and their premium tier supports 3 TB of storage, so is their managed service sufficient for us? Is this the correct lens for the problem?
    This is my first time designing a system like this and trying to get buy-in for it. All guidance and critique is highly appreciated; thanks for taking the time to read 🙂
  • p

    Peter Corless

    03/11/2025, 4:34 PM
    So, 25,000 requests per minute ≈ 420 queries per second. Though it also depends on whether there is a 1:1 correlation between an "event" and a "query"; it might be that for each "event" you actually need to fire off multiple queries. In any regard, that is precisely the kind of aggregation and filtering that Apache Pinot does super well. LinkedIn does exactly that sort of thing with Pinot for post views and likes, and they have a billion+ users worldwide. I'm just spitballing here, but for the 10 recently viewed posts, might you be able to solve this with a timestamp index, then an ORDER BY and LIMIT? (SQL experts, please chime in if I'm off-base.)
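    A minimal sketch of that query shape, assuming a post_views table with user_id, post_id, and an epoch-millis view time (all names are assumptions); sub-second latency would typically lean on a timestamp or range index on the time column:
    Copy code
    # Hedged sketch of the "10 most recently viewed posts" query suggested above;
    # table and column names are assumptions.
    from pinotdb import connect

    conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
    curs = conn.cursor()
    curs.execute(
        "SELECT post_id, view_time_millis FROM post_views "
        "WHERE user_id = 12345 "
        "ORDER BY view_time_millis DESC LIMIT 10"
    )
    print(curs.fetchall())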
  • r

    Roshan

    03/18/2025, 3:35 PM
    hi guys i am facing data ingestion issue after adding authentication to the pinot. does it have anything to do with the table config? i am getting error: Upload response: 500 Error: {"code":500,"error":"Caught exception when ingesting file into table: test_tenant_OFFLINE. Caught exception while uploading segments. Push mode: TAR, segment tars: [[file:/opt/pinot/tmp/ingestion_dir/working_dir_test_tenant_OFFLINE_1742312090194_FRKYyJJrdp/segment_tar_dir/test_tenant_1742312090217.tar.gz]]"} table_config = { "tableName": f"{tenant_name}", "tableType": "OFFLINE", "segmentsConfig": { "minimizeDataMovement": False, "deletedSegmentsRetentionPeriod": "7d", "retentionTimeUnit": "DAYS", "retentionTimeValue": "30", "segmentPushType": "REFRESH", "replication": "1" }, "tenants": { "broker": "DefaultTenant", "server": "DefaultTenant" }, "tableIndexConfig": { "invertedIndexColumns": [ "Time_of_Resolution", "Category", "Description", "Escalate", "Ticket_Status", "Customer_Gender", "Ticket_Subject", "product_purchased", "Source", "Relevance", "Customer_Email", "Customer_Name", "Tags", "Agent", "Number_of_conversations", "Intent", "Ticket_Priority", "Sentiment", "Type", "Created_date", "First_Response_Time", "Customer_Satisfaction_Rating", "Date_of_Purchase", "Resolution", "Created_date_month", "Created_date_year" ], "autoGeneratedInvertedIndex": False, "varLengthDictionaryColumns": [ "Time_of_Resolution", "Category", "Description", "Escalate", "Ticket_Status", "Customer_Gender", "Ticket_Subject", "product_purchased", "Source", "Relevance", "Customer_Email", "Customer_Name", "Tags", "Agent", "Intent", "Ticket_Priority", "Sentiment", "Type", "Created_date", "First_Response_Time", "Date_of_Purchase", "Resolution" ], "enableDefaultStarTree": False, "nullHandlingEnabled": True, "createInvertedIndexDuringSegmentGeneration": True, "rangeIndexVersion": 2, "aggregateMetrics": False, "optimizeDictionary": False, "loadMode": "MMAP", "enableDynamicStarTreeCreation": False, "columnMajorSegmentBuilderEnabled": False, "optimizeDictionaryForMetrics": False, "noDictionarySizeRatioThreshold": 0 }, "metadata": {}, "ingestionConfig": { "batchIngestionConfig": { "segmentIngestionType": "APPEND", "consistentDataPush": False }, "segmentTimeValueCheck": False, "transformConfigs": [ { "columnName": "First_Response_Time_epoch_millis", "transformFunction": "CASEWHEN(strcmp(Number_of_conversations,'null')=0,null,DATETIMECONVERT(First_Response_Time, 'SIMPLE_DATE_FORMAT|dd-MM-yyyy HH:mm', '1MILLISECONDSEPOCH', 'MINUTES|1'))" }, { "columnName": "Time_of_Resolution_epoch_millis", "transformFunction": "CASEWHEN(strcmp(Ticket_Status,'Closed')=0,DATETIMECONVERT(Time_of_Resolution, 'SIMPLE_DATE_FORMAT|dd-MM-yyyy HH:mm', '1MILLISECONDSEPOCH', 'MINUTES|1'),null)" }, { "columnName": "Created_date_timestamp", "transformFunction": "DATETIMECONVERT(Created_date, 'SIMPLE_DATE_FORMAT|dd-MM-yyyy HH:mm', 'SIMPLE_DATE_FORMAT|yyyy-MM-dd HH:mm', 'MINUTES|1')" }, { "columnName": "Created_date_epoch_millis", "transformFunction": "DATETIMECONVERT(Created_date, 'SIMPLE_DATE_FORMAT|dd-MM-yyyy HH:mm', '1MILLISECONDSEPOCH', 'MINUTES|1')" }, { "columnName": "Created_date_month", "transformFunction": "Month(Created_date_epoch_millis, 'UTC')" }, { "columnName": "Created_date_year", "transformFunction": "YEAR(Created_date_epoch_millis)" }, { "columnName": "Sentiment_score", "transformFunction": "CASEWHEN(strcmp(Sentiment,'Positive')=0,2,CASEWHEN(strcmp(Sentiment,'Neutral')=0,1,0))" }, { "columnName": "Sentiment_Positive", "transformFunction": 
"CASEWHEN(strcmp(Sentiment,'Positive')=0,2,0)" }, { "columnName": "Sentiment_Neutral", "transformFunction": "CASEWHEN(strcmp(Sentiment,'Neutral')=0,1,0)" }, { "columnName": "Sentiment_Negative", "transformFunction": "CASEWHEN(strcmp(Sentiment,'Negative')=0,0,0)" }, { "columnName": "Customer_Satisfaction_Rating", "transformFunction": "\"Customer Satisfaction Rating\"" }, { "columnName": "Number_of_conversations", "transformFunction": "\"Number of conversations\"" }, { "columnName": "Ticket_Subject", "transformFunction": "\"Ticket Subject\"" }, { "columnName": "Customer_Name", "transformFunction": "\"Customer Name\"" }, { "columnName": "Time_of_Resolution", "transformFunction": "\"Time to Resolution\"" }, { "columnName": "Customer_Age", "transformFunction": "\"Customer Age\"" }, { "columnName": "First_Response_Time", "transformFunction": "\"First Response Time\"" }, { "columnName": "Customer_Email", "transformFunction": "\"Customer Email\"" }, { "columnName": "Ticket_Status", "transformFunction": "\"Ticket Status\"" }, { "columnName": "Created_date", "transformFunction": "\"Created date\"" }, { "columnName": "Date_of_Purchase", "transformFunction": "\"Date of Purchase\"" }, { "columnName": "Agent_overall_performance", "transformFunction": "\"Agent overall performance\"" }, { "columnName": "Agent_politeness", "transformFunction": "\"Agent politeness\"" }, { "columnName": "Customer_Gender", "transformFunction": "\"Customer Gender\"" }, { "columnName": "Agent_communication", "transformFunction": "\"Agent communication\"" }, { "columnName": "Agent_Patience", "transformFunction": "\"Agent Patience\"" }, { "columnName": "Ticket_ID", "transformFunction": "\"Ticket ID\"" }, { "columnName": "Ticket_Priority", "transformFunction": "\"Ticket Priority\"" } ], "continueOnError": True, "rowTimeValueCheck": True }, "isDimTable": False }
    x
    • 2
    • 1
  • r

    Roshan

    03/18/2025, 4:41 PM
    Please, can anybody help? I have been stuck on this for 4 days now. The code I used was working fine with a namespace that didn't have authentication set up; only after I enabled authorization did I start facing this issue. Schema creation and table creation still work fine with authentication enabled in Pinot, but data ingestion has issues. These are the config files. When I create the schema and table for a sample table and then try to ingest a sample CSV file, I can't; I get the error mentioned above in the chat. Upload response: 500 Error: {"code":500,"error":"Caught exception when ingesting file into table: test_tenant_OFFLINE. Caught exception while uploading segments. Push mode: TAR, segment tars: [[file:/opt/pinot/tmp/ingestion_dir/working_dir_test_tenant_OFFLINE_1742312090194_FRKYyJJrdp/segment_tar_dir/test_tenant_1742312090217.tar.gz]]"} Ingestion Python code:
    Copy code
    import requests
    import json
    from pathlib import Path
    import time
    
    PINOT_CONTROLLER = "********"
    PINOT_BROKER = "*********"
    AUTH_HEADERS = {
        "Authorization": "Basic YWRtaW46dmVyeXNlY3JldA==",
        "Content-Type": "application/json"
    }
    
    def verify_segment(tenant_name, segment_name):
        max_retries = 10
        retry_interval = 2  # seconds
        
        for i in range(max_retries):
            print(f"\nChecking segment status (attempt {i+1}/{max_retries})...")
            
            response = requests.get(f"{PINOT_CONTROLLER}/segments/{tenant_name}/{segment_name}/metadata",headers=AUTH_HEADERS)
            if response.status_code == 200:
                print("Segment is ready!")
                return True
                
            print(f"Segment not ready yet, waiting {retry_interval} seconds...")
            time.sleep(retry_interval)
        
        return False
    
    def simple_ingest(tenant_name):
        # Verify schema and table exist
        schema_response = requests.get(f'{PINOT_CONTROLLER}/schemas/{tenant_name}',headers=AUTH_HEADERS)
        table_response = requests.get(f'{PINOT_CONTROLLER}/tables/{tenant_name}',headers=AUTH_HEADERS)
        
        if schema_response.status_code != 200 or table_response.status_code != 200:
            print(f"Schema or table missing for tenant {tenant_name}. Please run create_schema.py and create_table.py first")
            return
    
        csv_path = Path(f"data/{tenant_name}_data.csv")
        
        print(f"\nUploading data for tenant {tenant_name}...")
        with open(csv_path, 'rb') as f:
            files = {'file': (f'{tenant_name}_data.csv', f, 'text/csv')}
            
            # Using a dictionary for column mapping first
            column_map = {
                "Ticket ID": "Ticket_ID",
                "Customer Name": "Customer_Name",
                "Customer Email": "Customer_Email",
                "Company_name": "Company_name",
                "Customer Age": "Customer_Age",
                "Customer Gender": "Customer_Gender",
                "Product purchased": "product_purchased",
                "Date of Purchase": "Date_of_Purchase",
                "Ticket Subject": "Ticket_Subject",
                "Description": "Description",
                "Ticket Status": "Ticket_Status",
                "Resolution": "Resolution",
                "Ticket Priority": "Ticket_Priority",
                "Source": "Source",
                "Created date": "Created_date",
                "First Response Time": "First_Response_Time",
                "Time to Resolution": "Time_of_Resolution",
                "Number of conversations": "Number_of_conversations",
                "Customer Satisfaction Rating": "Customer_Satisfaction_Rating",
                "Category": "Category",
                "Intent": "Intent",
                "Type": "Type",
                "Relevance": "Relevance",
                "Escalate": "Escalate",
                "Sentiment": "Sentiment",
                "Tags": "Tags",
                "Agent": "Agent",
                "Agent politeness": "Agent_politeness",
                "Agent communication": "Agent_communication",
                "Agent Patience": "Agent_Patience",
                "Agent overall performance": "Agent_overall_performance",
                "Conversations": "Conversations"
            }
    
            config = {
                "inputFormat": "csv",
                "header": "true",
                "delimiter": ",",
                "fileFormat": "csv",
                "multiValueDelimiter": ";",
                "skipHeader": "false",
                # Convert dictionary to proper JSON string
                "columnHeaderMap": json.dumps(column_map)
            }
            
            params = {
                'tableNameWithType': f'{tenant_name}_OFFLINE',
                'batchConfigMapStr': json.dumps(config)
            }
    
            upload_headers = {
                "Authorization": AUTH_HEADERS["Authorization"]
            }
            
            response = requests.post(
                f'{PINOT_CONTROLLER}/ingestFromFile',
                files=files,
                params=params,
                headers=upload_headers
            )
        
        print(f"Upload response: {response.status_code}")
        if response.status_code != 200:
            print(f"Error: {response.text}")
            return
    
        if response.status_code == 200:
            try:
                response_data = json.loads(response.text)
                print(f"Response data: {response_data}")
                segment_name = response_data["status"].split("segment: ")[1]
                print(f"\nWaiting for segment {segment_name} to be ready...")
                
                if verify_segment(tenant_name, segment_name):
                    query = {
                        "sql": f"SELECT COUNT(*) FROM {tenant_name}_OFFLINE",
                        "trace": False
                    }
                    
                    query_response = requests.post(
                        f"{PINOT_BROKER}/query/sql",
                        json=query,
                        headers=AUTH_HEADERS
                    )
                    
                    print("\nQuery response:", query_response.status_code)
                    if query_response.status_code == 200:
                        print(json.dumps(query_response.json(), indent=2))
                else:
                    print("Segment verification timed out")
            except Exception as e:
                print(f"Error processing segment: {e}")
                print(f"Full response text: {response.text}")
    
    if __name__ == "__main__":
        simple_ingest("test_tenant")
  • t

    telugu bharadwaj

    04/04/2025, 12:43 PM
    Hello team, I am trying to set up S3 as the deep store for Pinot, but I'm facing issues. The configuration provided in the documentation is for version 0.6.0, and in that version joins do not work as expected. I want to use the latest version, but the same configuration isn't working there the way it does on 0.6.0. Can you please assist me with this?
    x
    v
    • 3
    • 3
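    For what it's worth, the S3 deep-store wiring should carry over largely unchanged to recent releases. A minimal controller.conf sketch (bucket, paths, and region are placeholders); the servers need the matching pinot.server.storage.factory.* and segment-fetcher entries:
    Copy code
    # Hedged sketch of controller.conf entries for an S3 deep store; adjust
    # bucket, paths, and region for your environment.
    controller.data.dir=s3://my-pinot-bucket/pinot-data
    controller.local.temp.dir=/tmp/pinot-controller-tmp
    pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
    pinot.controller.storage.factory.s3.region=us-east-1
    pinot.controller.segment.fetcher.protocols=file,http,s3
    pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher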
  • c

    Cristina Munteanu

    04/30/2025, 8:00 PM
    🎤 Got something exciting to share? The OSACon 2025 CFP is now officially open! 🚀 We're going online Nov 4–5, and we want YOU to be a part of it! Submit your proposal and be a speaker at the leading event for open-source analytics. 👉 Submit here: https://sessionize.com/osacon-2025/
    p
    • 2
    • 1
  • s

    Slackbot

    05/12/2025, 7:34 PM
    This message was deleted.
    m
    • 2
    • 1
  • b

    Blux Chang

    05/13/2025, 6:11 PM
    Hello, everyone. I hope you are doing well. I am Blux Chang, and I am looking for a US-based developer. Feel free to contact me via DM.