# general
  • s

    San Kumar

    06/03/2025, 7:05 PM
    Hello Team, we are doing batch ingestion into an offline table based on the event_time column. We receive updated or new records for a particular event_time and push them to the offline table via a job spec. Is the configuration below really required for the offline table?
    ingestionConfig": {
        "batchIngestionConfig": {
          "segmentIngestionType": "APPEND",
          "segmentIngestionFrequency": "DAILY"
        }
      },
    Segments are pushed even without this configuration. What is the purpose of this property?
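    For contrast, a minimal sketch of the REFRESH alternative (hypothetical values, assuming a standard OFFLINE table config); with APPEND each push adds new segments for the push window, while REFRESH is intended for pushes that replace the table's existing data:
    "ingestionConfig": {
      "batchIngestionConfig": {
        "segmentIngestionType": "REFRESH",
        "segmentIngestionFrequency": "DAILY"
      }
    }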
  • m

    Mannoj

    06/04/2025, 12:58 PM
    A quick question:
    does the cluster config setting or the client setting take priority in Pinot?
    i.e., I am setting pinot.broker.timeoutMs = 60000 in the broker and server config files, but if a client wants to override this, can they set something like pinot.broker.timeoutMs = 1200000 at the client level and have the query execute with that timeout?
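    A minimal sketch of a per-query override, assuming the query-option route is what the client would use (timeoutMs passed as a query option rather than a broker config; myTable is a placeholder):
    SET timeoutMs = 1200000;
    SELECT COUNT(*) FROM myTable;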
  • a

    Alexander Maniates

    06/04/2025, 3:53 PM
    Hello, I am looking to understand a bit more about how the combination of segmentPrunerTypes works when pruning segments for a query: 🧵
  • g

    guru

    06/04/2025, 5:32 PM
    Super excited to share the MCP Server for Apache Pinot, and we open sourced it! Give it a try and share any feedback you may have. Have fun querying Pinot with MCP 🙂 https://startree.ai/resources/startree-mcp-server-for-apache-pinot
  • s

    San Kumar

    06/04/2025, 6:24 PM
    Hi, I am able to truncate some timestamp values to the hour using the function below: SELECT toEpochSeconds(DATETRUNC('hour', event_time)) AS hourly_event_time. I want 30-minute granularity of event_time. How can I achieve this? Can you provide an example?
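    One possible sketch, assuming event_time is stored as epoch milliseconds (myTable is a placeholder): DATETIMECONVERT buckets the value to 30-minute granularity while keeping milliseconds, and ToEpochMinutesRounded gives the same bucket expressed in epoch minutes:
    SELECT DATETIMECONVERT(event_time,
                           '1:MILLISECONDS:EPOCH',
                           '1:MILLISECONDS:EPOCH',
                           '30:MINUTES') AS halfHourBucketMillis,
           ToEpochMinutesRounded(event_time, 30) AS halfHourBucketMinutes
    FROM myTable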
  • s

    San Kumar

    06/05/2025, 1:27 AM
    Duplicate Records Issue in Pinot Segment Push
    Hello Team, I am encountering a duplicate record issue in the Pinot segment push process. I am loading the same file with incremental data into an offline table, but I am getting duplicate records.
    Details:
    Columns: state, identifier, event_time (this is the segment/time column)
    Issue: I am observing that each segment's file has the same epoch milliseconds: 1749007120076. I would like to know if there is a way to create hourly segments to avoid this issue.
    Example:
    First push (record1.csv):
    Outstanding, 9937825, 1749007120076
    A query on the Pinot table returns 1 record.
    Second push (secondfile.csv):
    Outstanding, 919210623, 1739852763827
    Outstanding, 9937825, 1748987320076  (previous record)
    Now the query returns 3 records instead of the expected 2.
    Job specification: here is the job specification I am using:
    executionFrameworkSpec:
      name: 'standalone'
      segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
      segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
      segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
    jobType: SegmentCreationAndTarPush
    inputDirURI: '$input_dir_path'
    includeFileNamePattern: 'glob:**/*.csv'
    outputDirURI: '$out_dir_path'
    overwriteOutput: true
    pushJobSpec:
      pushFileNamePattern: 'glob:**/*.tar.gz'
      pushParallelism: 10
      pushAttempts: 2
      pushRetryIntervalMillis: 1000
    pinotFSSpecs:
      - scheme: file
        className: org.apache.pinot.spi.filesystem.LocalPinotFS
    recordReaderSpec:
      dataFormat: 'csv'
      className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
      configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
    tableSpec:
      tableName: '$dest_table'
      schemaURI: '$pinot_ctl_url/tables/$dest_table/schema'
      tableConfigURI: '$pinot_ctl_url/tables/$dest_table'
    pinotClusterSpecs:
      - controllerURI: '$pinot_ctl_url'
    Could you please help me understand why this process is producing duplicate records? Any insights or suggestions on how to resolve this would be greatly appreciated.
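    One possible sketch of an addition to the job spec above so that re-pushing the same data replaces the earlier segment instead of appending a new one (assumption: with APPEND tables a pushed segment only overwrites an existing segment when the segment name matches; 'mySegmentName' is a placeholder):
    segmentNameGeneratorSpec:
      type: fixed
      configs:
        # every push produces the same segment name, so the new push replaces the previous segment
        segment.name: 'mySegmentName'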
  • s

    San Kumar

    06/05/2025, 12:53 PM
    Hello Team, we created an offline table whose time column is epoch time in milliseconds. When we push records, we see that 1 segment is created per millisecond. We have planned for 20 billion records, so as per my understanding 20 billion segment files will be created. How can we optimize this? Is it a limitation of Pinot?
  • s

    San Kumar

    06/05/2025, 12:54 PM
    How many segments can an offline table support? Can you help me with creating a minimal number of segments?
  • s

    Suresh PERUML

    06/07/2025, 11:51 AM
    Hi Team, I have the scenario below. I am using the URL http://localhost:9000/admin/segments/upload to upload my segments, passing a tarred segment file, and it works fine. Note: on this Pinot server there is no Kafka ingestion or any other data ingestion; I am only running the upload task. Problem: at times, after some time (say 1 hour, 2 hours, or 24 hours), we see an additional segment with a file size of 1.2 KB (basically the num-rows / object count shows "0"). Because of this auto-generated segment, our further upload sequence for the next iterations ends up with a 235 error code. Can I get the reason, and how do I stop this segment with empty rows from being created?
  • s

    Suresh PERUML

    06/09/2025, 6:51 AM
    Hi All, I have configured REALTIME tables with the configuration below:
    "stream.kafka.topic.name": "JOB_summary",
    "realtime.segment.flush.threshold.segment.rows": "60000000",
    "realtime.segment.flush.threshold.time": "6h",
    "stream.kafka.consumer.type": "lowlevel",
    For some tables, I have the configuration below:
    "stream.kafka.topic.name": "pm_datas",
    "realtime.segment.flush.threshold.rows": "2000000",
    "realtime.segment.flush.threshold.time": "6h",
    "stream.kafka.consumer.type": "lowlevel",
    Query:
    1. No events were ingested into Kafka on one server, but after 6h I can see segments with empty ("0" record) counts being created.
    2. At the same time, this is not happening for the pm_datas table; it happens only on the JOB_summary table.
    Why is an empty segment created even when there is no data streaming into Kafka? What is the design of Pinot here, and in what scenarios other than the above configuration are segments created in the deepstore? Thanks, SP.
  • a

    Ankit Kumar

    06/09/2025, 9:27 AM
    Hi Team, we have received CVE-2025-30065 in parquet-avro for our Pinot setup. We are using 1.1.0. I want to know whether upgrading Pinot to the new version 1.3.0 will resolve the vulnerability. Asking here because I am unable to find it in the release notes.
  • k

    kranthi kumar

    06/10/2025, 3:44 AM
    Hi team, there are 5 tables in my Pinot cluster, and I ran batch ingestion into these tables using Spark jobs. The deepstore is an S3 bucket. The issue is that after I ran the ingestion jobs, table A's segments are getting pushed into table B's deepstore folder, even though their schemas are different. I have also verified in the Pinot controller UI that table A's segments are visible under table B's segments, so somehow the controller is registering them and the servers are reading them into table B. I am confident that my execution spec files are correct and verified. Please let me know of any plausible explanation for this.
  • u

    Utsav Jain

    06/10/2025, 7:21 AM
    Hi Team, can someone please explain this scenario: I am executing 2 queries that logically mean the same thing based on a date filter, but I am getting different results. • For the 1st, it returns output within the asked range. • For the 2nd, it returns records shifted by a +5:30 date range. Is there some timezone setting we have to set up? Inherently, the dateTime in the payload is always in IST. Query 1:
    SELECT recordId,
           identityElement,
           transactionDate
    FROM table1
    WHERE trailId = 'xyz'
      AND partnerId = 'zyr'
      AND reconcilableAttribute = 'kk'
      AND transactionDate >= '2025-06-03 00:00:00'
      AND transactionDate < '2025-06-03 01:00:00'
    ORDER BY transactionDate ASC;
    
    results
    "rows": [
          [
            "record1",
            "tid1",
            "2025-06-03 00:00:00.0"
          ]
    Query 2
    SELECT recordId,
           updatedAt,
           identityElement,
           transactionDate,
           ROW_NUMBER() OVER (PARTITION BY recordId ORDER BY updatedAt ASC) AS rn
    FROM orbis_baggage
    WHERE trailId = 'xyz'
      AND partnerId = 'zyr'
      AND reconcilableAttribute = 'TOTAL_AMOUNT'
      AND transactionDate >= '2025-06-03 00:00:00'
      AND transactionDate < '2025-06-03 01:00:00'
    ORDER BY transactionDate ASC
    
    result :
    [
            "record2",
            "tide2",
            "2025-06-03 05:30:00.0",
            1
          ],
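    A possible way to narrow this down (hedged: query 2 uses a window function and therefore runs on a different query engine than query 1, and the two engines may not interpret the bare string literals against the TIMESTAMP column identically); casting the literals explicitly keeps the comparison unambiguous, reusing the column and table names from query 1 above:
    SELECT recordId,
           identityElement,
           transactionDate
    FROM table1
    WHERE trailId = 'xyz'
      AND transactionDate >= CAST('2025-06-03 00:00:00' AS TIMESTAMP)
      AND transactionDate < CAST('2025-06-03 01:00:00' AS TIMESTAMP)
    ORDER BY transactionDate ASC;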
  • i

    Indrajeet Ray

    06/11/2025, 5:29 AM
    I have a use case to set up independent Pinot clusters as disaster recovery in active/standby mode. To make sure that both clusters have all the data when they become active, I am thinking of two approaches: 1. Send the Kafka messages to both clusters so that data is ingested into both. 2. Periodically transfer the segments from the deepstore of the active cluster to the standby cluster and upload them there. What steps are recommended if we want to go with the second option? Is it OK to upload the segments from the active cluster (realtime table) into an offline table on the standby cluster? I think uploading into a realtime table leads to some problems and sometimes missing-segment issues, probably because offsets do not follow the same sequence in independent clusters. Any guidance is appreciated! Thank you.
  • a

    Anish Nair

    06/13/2025, 9:33 AM
    Hi Team, does Pinot support/use vectorized query execution or a tuple-at-a-time model?
  • j

    Jeremy Aguilon

    06/13/2025, 9:51 PM
    Hello, I'm very interested in querying Pinot via the gRPC service and returning Apache Arrow tables: https://docs.pinot.apache.org/users/api/broker-grpc-api. The trouble is that the relevant settings don't seem to be settable on 1.3.0:
    > Add gRPC port config in pinot broker to enable the `BrokerGrpcServer`:
    pinot.broker.grpc.port=8010
    Is this something that's available in 1.3.0? Or is this a planned feature that isn't in a stable version of Pinot yet?
  • z

    Zhuangda Z

    06/15/2025, 2:00 PM
    Hi team, we have a use case where we want to get the latest event, and that event may have just occurred or may have occurred months (or even years) ago. What would be the recommended approach to make such a query performant, if possible?
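    A minimal sketch of the query shape in question, with hypothetical table/column names (events, entity_id, event_time, payload); the second form uses the LASTWITHTIME aggregation as a possible alternative to sorting:
    SELECT *
    FROM events
    WHERE entity_id = 'abc'
    ORDER BY event_time DESC
    LIMIT 1

    SELECT LASTWITHTIME(payload, event_time, 'STRING') AS latestPayload
    FROM events
    WHERE entity_id = 'abc'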
  • s

    San Kumar

    06/16/2025, 4:39 AM
    Hello Team
  • s

    San Kumar

    06/16/2025, 4:42 AM
    Is there any direct Pinot query to round an input epoch time in milliseconds to the nearest 15-minute epoch milliseconds? We did not find such a thing in the documentation. Can you help me with a query?
  • s

    San Kumar

    06/16/2025, 4:47 AM
    Thanks a lot. If I want to round to the nearest 15 minutes, then the query below should do it:
    select ToEpochMinutesRounded(1613472303000, 15) AS epochMins
    FROM ignoreMe
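    One caveat, as far as I understand it: ToEpochMinutesRounded returns the bucketed value in epoch minutes, so if the result is needed back in epoch milliseconds (as in the original question), it would have to be scaled, along these lines:
    select ToEpochMinutesRounded(1613472303000, 15) * 60000 AS epochMillisRoundedTo15Min
    FROM ignoreMe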
  • s

    Suresh PERUML

    06/16/2025, 11:06 AM
    Dear Team, this is regarding the Pinot server catch-up mechanism. I am using a single-node setup where the Pinot server had a downtime of about 1 hour. During this window the source system kept publishing data into the Kafka pipe, and I see around 10 million records. Once the server started, reading all the offset data from Kafka is not working at all; I am not able to see the Kafka topic data being consumed or segments being created. What are the best options to improve the catch-up mechanism here? Good options are welcome. Partition: the current partition count is 0; I would be okay going with 5 partitions as well. Configurations used so far to improve catch-up:
    "stream.kafka.consumer.prop.max.poll.records": "5000",
    "stream.kafka.consumer.prop.auto.commit.interval.ms": "1000",
    "stream.kafka.num.consumer.fetch.threads": "5",
    "realtime.segment.flush.threshold.segment.rows": "10000000"
    I need some good options to fix the catch-up issue. Please don't say "use a cluster" or "move the offset to LATEST and let the data get lost". Thanks.
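    A hedged sketch of additional consumer tuning that could be passed through the same stream.kafka.consumer.prop.* prefix already used above (these are standard Kafka client settings, but the specific values here are illustrative assumptions, not verified recommendations):
    "stream.kafka.consumer.prop.fetch.min.bytes": "1048576",
    "stream.kafka.consumer.prop.max.partition.fetch.bytes": "8388608",
    "stream.kafka.consumer.prop.fetch.max.bytes": "52428800"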
  • k

    kaushik

    06/16/2025, 11:39 AM
    Hi Team, I need your help to join the dots and correct my understanding; any generic suggestion is welcome. We have a realtime table XYZ with the config values below (partially provided):
    "segmentsConfig": {
      "replication": "1",
      "replicasPerPartition": "1",
      "segmentPushType": "APPEND",
      "segmentPushFrequency": "HOURLY",
      "retentionTimeUnit": "DAYS",
      "retentionTimeValue": "2"
    },
    "streamConfigs": {
      "realtime.segment.flush.threshold.time": "24h",
      "realtime.segment.flush.threshold.segment.size": "1G"
    }
    We have not modified any config to alter the default purge job, which runs every 6 hrs.
    Case 1: the table was created at 02:00:00 June 10 UTC; the first record was ingested at 02:15:00 June 10 UTC, the second record at 01:20:00 June 11 UTC, and the third record at 02:20:00 June 11 UTC.
    Q1: Is it correct that the first record is expected to be deleted from the database sometime between 02:15:00 and 08:15:00 on June 13 UTC? My current understanding is that the segment created at 02:15:00 June 10 UTC will be closed at 02:15:00 June 11 UTC (due to realtime.segment.flush.threshold.time = 24h), and then 2 days of retention are counted from there (due to segmentsConfig.retentionTimeValue = 2 with segmentsConfig.retentionTimeUnit = DAYS). Then, per the Retention Manager schedule, at most another 6 hrs pass before the record is permanently deleted from the table. Since the second record should fall into the same segment as the first (1G won't be crossed), it would be purged at the same time as the first record.
    Case 2: We created the table and records as in Case 1, but we altered the segment retention to 1 day at 04:30:00 June 12 UTC.
    Q2: Will the Retention Manager drop the segment holding the first and second records anytime between 04:30:00 and 10:30:00 on June 12 UTC?
    Q3: What happens if we restart Pinot at 01:00:00 on June 12 UTC; does Pinot run the Retention Manager immediately, or does it wait another 6 hrs?
    Thanks for your inputs in advance!
  • m

    Mannoj

    06/17/2025, 11:26 AM
    A quick question: is it possible to give multiple data directories to a Pinot server?
    # Pinot Server Data Directory
    pinot.server.instance.dataDir=/data/hdp09/pinotdata,/data/hdp08/pinotdata,/data/hdp07/pinotdata,/data/hdp06/pinotdata
  • m

    Mannoj

    06/17/2025, 11:27 AM
    We have set it, but we see that data is growing only on /data/hdp09; the others are not in use. It looks like Pinot doesn't support multiple directories, and I don't find an option for it in the documentation either. If you can confirm this, we will instead work on an LVM-based approach.
  • t

    T Dev Kumar

    06/18/2025, 8:51 AM
    đź‘‹ Hi everyone!,
  • t

    T Dev Kumar

    06/18/2025, 8:56 AM
    Hi All, how can we persist the data in a Pinot table for the long term? I am using an upsert table. My requirement is that Pinot should not delete a record if it is the only record that exists for a primary key. Thanks.
  • s

    Suresh PERUML

    06/19/2025, 1:38 AM
    Hi All, the Pinot server stopped or crashed for 1 hour. Kafka ingestion from clients was in progress and was not stopped, and there is a large data inflow, say a backlog of 10 million records of offsets. I am using one Pinot server. After an hour, the server got restarted. Problem: it takes the Pinot server a long time, 2-4 hours or more, to catch up on those 10M Kafka offset records for a topic. How can this be avoided? I am not planning to lose the Kafka data either; the retention period for Kafka is 6 hrs. Also, the 10M records continue to grow, since the Kafka client ingestion is never stopped. What kind of design or solution does Pinot support to overcome waiting 2-5 hours for the catch-up mechanism?
  • s

    Suresh PERUML

    06/19/2025, 1:53 AM
    Hi all, I am planning to start multiple instances of Pinot Server on the same host machine, which means my host name is the same. Below is the configuration. I am using Pinot 1.3.0.
    # Server Instance 1
    ./pinot-admin.sh StartServer \
      -clusterName=wip-server1 \
      -config=../conf/pinot-server.conf \
      -dataDir=/wipro/server/1/index \
      -serverAdminPort=9255 \
      -serverGrpcPort=9260 \
      -serverHost=wipserver-demo \
      -zkAddress=localhost:2191 \
      -serverMultiStageRunnerPort=43456 \
      -serverMultiStageServerPort=43457 &
    # Server Instance 2
    ./pinot-admin.sh StartServer \
      -clusterName=wip-server1 \
      -config=../conf/pinot-server.2.conf \
      -dataDir=/wipro/server/2/index \
      -serverAdminPort=9355 \
      -serverGrpcPort=9361 \
      -serverHost=wipserver-demo \
      -zkAddress=localhost:2191 \
      -serverMultiStageRunnerPort=53456 \
      -serverMultiStageServerPort=53457 &
  • a

    Anish Nair

    06/19/2025, 10:34 AM
    Hey Team, we're currently observing an inconsistency in the formatting of DOUBLE values in the results of simple aggregation queries using Apache Pinot version 1.0.
    Details:
    Use case: running a simple aggregation query (e.g., SELECT SUM(metric) FROM table).
    Data type: the metric column is of type DOUBLE.
    Query type: single-stage.
    Observation: for the same underlying value (e.g., SUM(metric) = 1.0), the result sometimes appears as 1.000000 and other times as 1.00000000000000.
    Note: the issue is intermittent and seems to happen randomly, without a consistent pattern.
    We're trying to understand whether this is expected behavior, whether it relates to the default float/double precision settings, and whether there is a recommended way to consistently format or round the DOUBLE results in responses.
  • p

    Patrick Stevens

    06/27/2025, 3:20 PM
    good morning! Does Pinot have any spatiotemporal indexing or query functions today or planned this year?
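    For the spatial half, a hedged sketch of what already exists as far as I know (Pinot has geospatial types and functions such as ST_Point and ST_Distance plus an H3-based geo index; the table and column names below are hypothetical, and the temporal part is just an ordinary filter on a time column):
    -- events within ~5 km of a point, after an epoch-millis cutoff
    SELECT driverId, event_time
    FROM rides
    WHERE ST_Distance(location, ST_Point(-122.4194, 37.7749, 1)) < 5000
      AND event_time >= 1748736000000
    LIMIT 10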