# general
  • p

    Prathamesh

    08/09/2025, 9:52 AM
    Hello Team, we are exploring Apache Pinot to move away from a Postgres DB and leverage Pinot's capabilities. We use Hive data as the raw layer and Iceberg data at the final layer, which is loaded into Postgres using dbt-trino. We now want to ingest that final data into Pinot and use it to power a UI. A couple of questions:
    1. Is Pinot capable of handling Iceberg data?
    2. For now it is a batch upload and the table/schema structure needs to be built. Is it feasible to use "batchIngestionConfig": { "segmentIngestionType": "REFRESH", "segmentIngestionFrequency": "DAILY" }?
    Happy to take suggestions as we are still in the exploratory phase. Thanks
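    For reference, a minimal sketch of where such a config sits in an OFFLINE table config, assuming a daily full refresh of the final layer (values are illustrative, not a recommendation):
    "ingestionConfig": {
      "batchIngestionConfig": {
        "segmentIngestionType": "REFRESH",
        "segmentIngestionFrequency": "DAILY"
      }
    }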
  • s

    San Kumar

    08/12/2025, 3:25 AM
    Hello Team, we want to replace/create a segment whose name is a combination of dd-mm-yy-hh-<productid>-<country> in an offline table. Is it possible to do so, and can you help me with how to define the segment name?
  • z

    Zaeem Arshad

    08/12/2025, 3:47 AM
    Are there any docs/videos exploring the architecture of Pinot, what makes it so performant, and what its scaling/performance boundaries are?
  • a

    Arnav

    08/12/2025, 4:23 AM
    Hello team I’m currently aiming to keep all segments generated during batch ingestion close to 256MB in size. To achieve this, I’ve implemented a logic that sets a maximum document count per segment, which I adjust dynamically based on the characteristics of the data, so that the segment size stays approximately within the target. I’m wondering if there’s a more efficient or standardized approach to achieve this?
  • a

    arnavshi

    08/12/2025, 7:05 AM
    Hi team, I’ve set up an EKS cluster for the Pinot stack in our ArrowEverywhereCDK package. The cluster is already running, and I’m now trying to configure Deep Store for a Pinot table using this guide. While deploying the changes, I’m encountering the following error:
    Forbidden: updates to statefulset spec for fields other than 'replicas', 'ordinals', 'template', 'updateStrategy', 'persistentVolumeClaimRetentionPolicy' and 'minReadySeconds' are forbidden
    While I understand that this is a Kubernetes issue/limitation, I wanted your guidance on what can be done to resolve this.
  • s

    San Kumar

    08/12/2025, 11:09 AM
    Hello team, how can I create a segment named like 1hour_product_id? Can we create this segment and then append to it when we get more data for the same product ID within the same hour?
  • a

    am_developer

    08/12/2025, 11:31 AM
    Creating one big realtime table in Pinot for all analytics use cases. How big is too big for Pinot in terms of the number of columns in one table? In this case there are 250 columns.
  • a

    Abdulaziz Alqahtani

    08/14/2025, 11:02 AM
    Hey, I’m trying to measure ingestion lag and came across two metrics:
    • availabilityLagMsMap from /consumingSegmentsInfo → reports ~200–400 ms for me.
    • endToEndRealtimeIngestionDelayMs from Prometheus → shows a “saw-tooth” pattern, peaking around 5 seconds.
    Can someone explain the difference between these two metrics, why they report different values, and whether the saw-tooth pattern is expected?
  • i

    Idlan Amran

    08/18/2025, 2:38 AM
    hi team. At the moment we have a POC that rolls up/dedups data from our realtime table: we query historical data with Python for a fixed time range (e.g. the last 1 week), group it, flush it to JSON, and push segments to our historical offline table using an ingestion job spec. We managed to reduce 130GB+ of segments on the realtime table to 13GB+ of segments on the offline table.
    I guess this is an unconventional way of doing things; it is hard for us to use an upsert table because it is pretty memory consuming and has taken down our server a few times. Has anyone done this kind of workaround or something similar to support their use case?
    Our server spec: EC2 m7a.xlarge, 4 vCPU, 16GB RAM, running all components: ZK, Kafka, 1 controller, 1 broker, 1 server, 1 minion. We are targeting a modest query volume, roughly 10-15 QPS and not that frequent, since this is historical data that is rarely used, only during debugging and a handful of use cases in our application.
    We are resorting to this because there are too many duplicates, and the only difference between the duplicates is two columns: the timestamp and a log ID column (the log ID refers back to our main Postgres DB). So I group the data with the query below and flush the response to JSON per profile; each JSON has around 5M rows, which keeps the JSON and segment sizes consistent:
    SELECT shop, svid, spid, type, profile, "key", message, product,
                       CAST(MAX(created_at) AS TIMESTAMP) AS created_at,
                       ARRAY_AGG(product_log, 'STRING', TRUE) AS product_log
                FROM   product_tracking
                WHERE  profile = {profile}
                  AND  created_at >= CAST(DATE_TRUNC('DAY', timestampAdd(DAY,{-lookback_days},NOW()), 'MILLISECONDS','GMT-04:00') AS TIMESTAMP)
                  AND  created_at <  CAST(DATE_TRUNC('DAY', timestampAdd(DAY,0,NOW()), 'MILLISECONDS','GMT-04:00') AS TIMESTAMP)
                GROUP BY shop, svid, spid, type, profile, "key", message, product
                LIMIT 999999999
    Would appreciate any insights/feedback from other Pinot OSS users, thanks.
  • r

    Rishabh Sharma

    08/18/2025, 12:37 PM
    Hi Team, we have an analytics use case with a special requirement: we expose dynamic columns to the user, which need not be defined beforehand when deciding the schema, and we provide querying capabilities on those fields as well. We have been exploring Pinot and it fits well except for these dynamic fields. To solve this we first explored JSON-type columns, but the performance was not up to the mark. We are now looking into dynamically changing the schema and adding a column whenever we see a new dynamic field (which should not happen frequently) while processing the record, and then putting that record into Pinot. I have a few questions:
    1. The table may already hold hundreds of millions of records when a new field appears. Would that be an issue when the schema changes or when segments are reloaded after the schema change?
    2. We plan to keep sending records that do not contain new fields to Pinot even while we are processing a schema change for a record that does. The Pinot docs contain instructions to pause data consumption while changing the schema. We hold back the records with new fields, but if a record has no new field we continue pushing it to the Pinot Kafka topic. Can this result in corrupt data?
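    For the add-a-column-then-reload path, a minimal sketch of what a newly added dimension entry might look like in the schema's dimensionFieldSpecs (the field name is hypothetical); after the schema update, existing segments serve the default value for the new column once they are reloaded:
    {
      "name": "new_dynamic_field",
      "dataType": "STRING",
      "defaultNullValue": ""
    }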
  • s

    San Kumar

    08/19/2025, 5:28 AM
    Hello Team, in our offline table we have many, many small segments, i.e. one segment created per hour, and sometimes an hour only has 20 to 50 records. Is there any Minion task configuration to merge the smaller segments into larger segments once they are older than 30 days? Also, how does the minion job get triggered, and what configuration do I need to follow?
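    A minimal sketch of the MergeRollupTask table config typically used for this kind of small-segment compaction (the bucket, buffer and record-count values here are illustrative, not recommendations); the task also has to be scheduled, either by enabling the controller's periodic task scheduler or via the per-table schedule cron key shown below:
    "task": {
      "taskTypeConfigsMap": {
        "MergeRollupTask": {
          "1day.mergeType": "concat",
          "1day.bucketTimePeriod": "1d",
          "1day.bufferTimePeriod": "30d",
          "1day.maxNumRecordsPerSegment": "5000000",
          "schedule": "0 0 2 * * ?"
        }
      }
    }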
  • s

    San Kumar

    08/19/2025, 5:54 AM
    Is merge rollup supported for OFFLINE tables on APPEND only, or does it also work with REFRESH? Can we schedule MergeRollupTask with a cron expression? Can you please help me with this?
  • k

    kranthi kumar

    08/19/2025, 1:29 PM
    Hi team, I want to understand how the consumption flow works when a server restarts after a crash or dead state. For my use case each individual record is critical, and I want no duplicates in my Pinot table. As per my understanding, when a server crashes, the segment that is actively consuming is paused, and when the server restarts, the paused segment starts re-consuming from the last committed offset in ZK; this way duplicates might creep in. Is that the correct flow, and if yes, are there ways to avoid duplicates without losing any records?
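    For reference, a hedged sketch of the dedup table config that is often used to drop exact duplicates on re-consumption (it requires primaryKeyColumns to be defined in the schema and the Kafka topic to be partitioned by that key; an upsert table is the alternative when later records should overwrite earlier ones). Treat this as an illustration, not a drop-in config:
    "dedupConfig": {
      "dedupEnabled": true,
      "hashFunction": "NONE"
    }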
  • m

    Milind Chaudhary

    08/20/2025, 5:49 AM
    Hi Team, can I override a field's value to blank in an ingestion transformation?
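    One hedged way this can be done is with a Groovy transform function that ignores the source value and always emits an empty string; the column names below are hypothetical:
    "ingestionConfig": {
      "transformConfigs": [
        {
          "columnName": "masked_col",
          "transformFunction": "Groovy({''}, raw_col)"
        }
      ]
    }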
  • i

    Indira Vashisth

    08/21/2025, 12:52 PM
    Hi team, we are planning to eliminate the intermediate step where the server sends the segment to the controller, and the controller pushes it to deepstore. Instead, the proposal is for the server to write directly to deepstore. Could someone help us understand the pros and cons of both approaches so that we can make a more informed decision?
  • s

    Shubham Kumar

    08/21/2025, 1:00 PM
    Hi Team, I have a couple of queries regarding Apache Pinot:
    1. Does Pinot support segment compression formats other than tar.gz, such as zstd or Snappy?
    2. I created an index on a column (col1) and ingested data. Suppose a segment contains 50 records, and I run a query with the condition col1 = 'xyz'. In this case, does Pinot load the entire segment into memory and then filter the records, or does it directly fetch only the matching data from the segment?
  • s

    Sandeep R

    08/25/2025, 11:36 PM
    Pinot server: single big LV vs multiple mount points for segment storage?
  • j

    Jan Siekierski

    08/27/2025, 11:33 AM
    I understand that Iceberg support on Apache Pinot is only available in StarTree cloud right now, correct? Are there plans to add this to Apache Pinot in the near future?
  • j

    John Solomon J

    08/28/2025, 7:17 PM
    Hi all, I have opened apache/pinot#16707 to add cursor pagination in pinot-java-client. I don’t have label permission; could you please review & apply required labels?
  • v

    Vatsal Agrawal

    08/29/2025, 5:43 AM
    Hi Team, We are facing an issue with MergeRollupTask in our Pinot cluster. After the task runs, the original segments are not getting deleted, and we end up with both the original and the merged segments in the table. Retention properties: left as default. Any guidance on what we might be missing would be super helpful. Adding task, table and segments related details in the thread.
  • a

    Arnav

    08/29/2025, 5:52 PM
    Hi team, is there a way to parse the Kafka event below and ingest it into a realtime Pinot table?
    {
      "start_time_new": {
        "long": 1756489188000
      },
      "event_time_new": {
        "long": 1756489188000
      }
    }
    I tried the configuration below, but it is not parsing:
    "ingestionConfig": {
        "transformConfigs": [
          {
            "columnName": "start_time_new",
            "transformFunction": "jsonPathLong(__raw__start_time_new, '$.long', 0)"
          },
          {
            "columnName": "event_time_new",
            "transformFunction": "jsonPathLong(__raw__event_time_new, '$.long', 0)"
          }
        ],
        "continueOnError": false,
        "rowTimeValueCheck": false,
        "segmentTimeValueCheck": true
      }
  • r

    Rajkumar

    08/30/2025, 6:23 PM
    Hi All, I'm very impressed with what Apache Pinot can do and am considering it for a critical use case, but we are not Java experts on our team. Would Java be a key skill for adopting Pinot successfully? An additional question: will a join between two realtime tables work? Information online seems to suggest that joins between two realtime tables are not recommended for production; just checking if anyone here has experience with this. Thanks.
  • a

    Arnav

    09/01/2025, 7:07 AM
    Hi team, I enabled "stream.kafka.metadata.populate": "true" to get the fields below, and I have also added these fields to the schema: __key, __metadata$offset, __metadata$partition, __metadata$recordTimestamp. When querying the table, __metadata$offset, __metadata$partition and __metadata$recordTimestamp are populated properly, but __key comes back blank. Since my Kafka event and key are Avro encoded, I used the following config:
    "stream.kafka.decoder.prop.format": "AVRO",
    "stream.kafka.decoder.prop.schema.registry.schema.name": "schema-name",
    "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.confluent.KafkaConfluentSchemaRegistryAvroMessageDecoder",
    "stream.kafka.decoder.prop.schema.registry.rest.url": "schema-url",
    "stream.kafka.decoder.prop.key.format": "AVRO",
    "stream.kafka.decoder.prop.key.schema.registry.schema.name": "schema-name-key",
    "stream.kafka.decoder.prop.key.schema.registry.rest.url": "schema-url",
    The data itself is properly deserialised; only __key is blank. My guess is that the key-related configs I added below are not able to deserialise it. Is there any other way to deserialise the key?
    "stream.kafka.decoder.prop.key.format": "AVRO",
    "stream.kafka.decoder.prop.key.schema.registry.schema.name": "schema-name-key",
    "stream.kafka.decoder.prop.key.schema.registry.rest.url": "schema-url",
  • a

    Abdulaziz Alqahtani

    09/01/2025, 7:17 PM
    Hi team, we have a multi-tenant hybrid table where each row has a tenant_id (ULID). The column is low cardinality, and most queries include a tenant_id predicate. What’s the best way to index this column?
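    For reference, a hedged sketch of the kind of table-config fragment often used for a low-cardinality filter column like this (an inverted index, optionally paired with a bloom filter for segment pruning); whether sorting or partitioning on tenant_id would serve better depends on the data, so treat this only as a starting point:
    "tableIndexConfig": {
      "invertedIndexColumns": ["tenant_id"],
      "bloomFilterColumns": ["tenant_id"]
    }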
  • c

    cesho

    09/04/2025, 2:16 PM
    Can someone explain how Apache Pinot integrates with Confluent Schema Registry during Kafka stream ingestion? Specifically:
    1. Does Pinot use Schema Registry only for deserialization of Avro/Protobuf messages, or can it automatically generate Pinot table schemas from the registered schemas?
    2. If auto-generation is supported, what are the limitations or required configurations?
    3. How does Pinot handle schema evolution in Schema Registry (e.g., backward/forward compatibility) during ingestion?
    4. Are there any best practices for defining Pinot schemas when using Schema Registry to avoid data type mismatches?
    Context: I’m setting up real-time ingestion from Kafka topics with Avro schemas stored in Schema Registry and want to minimize manual schema mapping work.
  • a

    Abdulaziz Alqahtani

    09/07/2025, 8:34 PM
    Hi team, What’s the recommended approach for one-off batch ingestion of data from S3 into Pinot, Minion-based ingestion vs standalone ingestion? For context: • I currently have a real-time table. • I want to import historical data into a separate offline table. • My source data is in PostgreSQL, and I can export and chunk it into S3 first.
  • m

    mg

    09/08/2025, 8:09 PM
    Hi Team, I'm running into an issue with the Pinot Controller UI and the Swagger REST API when using an NGINX Ingress with a subpath; I'm hoping someone has encountered this and can help. Problem summary: I've configured my ingress to expose the Pinot Controller at https://example.com/pinot/. The main UI works fine and most links are correctly routed; the ones that work open under https://example.com/pinot/#/... However, the Swagger REST API link is not: clicking the Swagger API button tries to access https://example.com/help instead of https://example.com/pinot/help, resulting in a 404 Not Found error. I don't see an obvious way to force the Swagger link onto a subpath other than /. I am using helm and have looked through the options in https://github.com/apache/pinot/blob/master/helm/pinot/README.md, but nothing worked. Thanks in advance.
  • s

    Soon

    09/11/2025, 5:19 PM
    Hello team! I had a quick question: if the query plan shows FILTER_SORTED_INDEX, would that be the same as using FILTER_INVERTED_INDEX, i.e. like a sorted inverted index?
  • i

    Indira Vashisth

    09/15/2025, 9:57 AM
    Hi team, I triggered a server rebalance in my Pinot cluster with 3 servers, but the segment reassignment shows only one server as the target server for all the segments. How can I make it assign the data to all 3 servers?
  • i

    Indira Vashisth

    09/15/2025, 10:02 AM
    Also, what is the recommended amount of data we should be storing per server? We will need to store more than 150TB of data and hit it with complex queries including distinct, JSON match and sorting.