# getting-started

  • Manish G (03/22/2025, 3:46 PM)
    I have defined a schema as:
    {
      "schemaName": "my-test-schema",
      "enableColumnBasedNullHandling": true,
      "dimensionFieldSpecs": [
        {
          "name": "field",
          "dataType": "FLOAT",
          "fieldType": "DIMENSION"
        }
      ]
    }
    I want to insert a null value into the column field (sample values: 1.43, null), but it throws this error:
    Caused by: java.lang.NumberFormatException: For input string: "null"
            at java.base/jdk.internal.math.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2054)
            at java.base/jdk.internal.math.FloatingDecimal.parseFloat(FloatingDecimal.java:122)
            at java.base/java.lang.Float.parseFloat(Float.java:570)
            at org.apache.pinot.common.utils.PinotDataType$11.toFloat(PinotDataType.java:617)
            at org.apache.pinot.common.utils.PinotDataType$7.convert(PinotDataType.java:425)
            at org.apache.pinot.common.utils.PinotDataType$7.convert(PinotDataType.java:375)
            at org.apache.pinot.segment.local.recordtransformer.DataTypeTransformer.transform(DataTypeTransformer.java:118)
    What is the correct way to ingest null values?
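    For reference, with enableColumnBasedNullHandling the record must carry a real null value rather than the literal string "null"; that string is what the FLOAT parser is failing on in the stack trace above. A minimal sketch, assuming JSON input (field name taken from the schema above):
    {"field": 1.43}
    {"field": null}
    For CSV-like formats, check the record reader's null-value handling: passing the text "null" through will hit the same conversion error.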

  • baarath (04/23/2025, 6:02 AM)
    Hi all, could someone share a job spec or relevant documentation for setting up Pinot offline batch ingestion from CSV files? I tried to follow the job spec given in the official docs but got stuck with the error below.
    java.lang.RuntimeException: Caught exception during running - org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
            at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:152)
            at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.runIngestionJob(IngestionJobLauncher.java:121)
            at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.execute(LaunchDataIngestionJobCommand.java:132)
            at org.apache.pinot.tools.Command.call(Command.java:33)
            at org.apache.pinot.tools.Command.call(Command.java:29)
            at picocli.CommandLine.executeUserObject(CommandLine.java:2045)
            at picocli.CommandLine.access$1500(CommandLine.java:148)
            at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465)
            at picocli.CommandLine$RunLast.handle(CommandLine.java:2457)
            at picocli.CommandLine$RunLast.handle(CommandLine.java:2419)
            at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
            at picocli.CommandLine$RunLast.execute(CommandLine.java:2421)
            at picocli.CommandLine.execute(CommandLine.java:2174)
            at org.apache.pinot.tools.admin.PinotAdministrator.execute(PinotAdministrator.java:173)
            at org.apache.pinot.tools.admin.PinotAdministrator.main(PinotAdministrator.java:204)
    Caused by: java.lang.IllegalArgumentException
            at java.base/sun.nio.fs.UnixFileSystem.getPathMatcher(UnixFileSystem.java:286)
            at org.apache.pinot.common.segment.generation.SegmentGenerationUtils.listMatchedFilesWithRecursiveOption(SegmentGenerationUtils.java:263)
            at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.run(SegmentGenerationJobRunner.java:177)
            at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:150)
            ... 14 more
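    For context, the IllegalArgumentException from UnixFileSystem.getPathMatcher usually means the file-name pattern is missing its syntax prefix (e.g. glob: or regex:). A minimal standalone CSV job spec sketch, with placeholder paths and table/controller names:
    executionFrameworkSpec:
      name: 'standalone'
      segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
      segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
    jobType: SegmentCreationAndTarPush
    inputDirURI: '/path/to/csv-input'
    includeFileNamePattern: 'glob:**/*.csv'   # note the glob: prefix
    outputDirURI: '/path/to/segment-output'
    overwriteOutput: true
    pinotFSSpecs:
      - scheme: file
        className: org.apache.pinot.spi.filesystem.LocalPinotFS
    recordReaderSpec:
      dataFormat: 'csv'
      className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
      configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
    tableSpec:
      tableName: 'myTable'
      schemaURI: 'http://localhost:9000/tables/myTable/schema'
      tableConfigURI: 'http://localhost:9000/tables/myTable'
    pinotClusterSpecs:
      - controllerURI: 'http://localhost:9000'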

  • baarath (04/24/2025, 10:41 AM)
    Hi everyone, I'm trying to do batch ingestion using Spark, since standalone mode is not recommended for production. I'm following this guide: https://dev.startree.ai/docs/pinot/recipes/ingest-parquet-files-from-s3-using-spark. But I'm not sure whether Spark needs to be installed on the same instance as Pinot, or whether the job can run on EMR, where I usually run my Spark jobs. Can anyone help me understand?
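    For what it's worth, Spark does not have to run on the Pinot hosts: the ingestion job only needs the Pinot batch-ingestion jars/plugins on its classpath plus network access to the controller and the deep store, so running it on EMR is fine. A rough sketch of the documented spark-submit invocation (jar path, plugins dir and spec file are placeholders; see the Spark batch ingestion docs for the full classpath flags):
    spark-submit \
      --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
      --master yarn --deploy-mode cluster \
      --conf "spark.driver.extraJavaOptions=-Dplugins.dir=/path/to/pinot/plugins" \
      /path/to/pinot-all-<version>-jar-with-dependencies.jar \
      -jobSpecFile /path/to/sparkIngestionJobSpec.yaml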

  • baarath (04/28/2025, 8:54 AM)
    Hi all, I have CSV data whose header column names are in camelCase, and I want them to be snake_case in the Pinot offline table. How do I map the columns during batch ingestion?
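    One option is to declare the snake_case names in the Pinot schema and map each source field with a transform config in the table config's ingestionConfig. A sketch with a hypothetical column, assuming a plain column reference is accepted as the transform expression (otherwise wrap it in a Groovy or inbuilt function):
    "ingestionConfig": {
      "transformConfigs": [
        { "columnName": "customer_id", "transformFunction": "customerId" }
      ]
    }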

  • Ram Makireddi (04/28/2025, 3:17 PM)
    Hi, this is Ram. I have been asked to build a data analytics platform in my PG, and after extensive research I decided to go with Pinot. I am curious which orchestrator tools it can seamlessly integrate with to orchestrate data pipelines. Any recommendations?

  • Vivekanand (05/12/2025, 7:17 AM)
    Hi, how does Pinot compare to Cassandra? Just curious, thanks.

  • Vishruth Raj (06/09/2025, 1:40 AM)
    Hi, I'm interested in deploying a Ceph cluster as the deep storage layer for my Apache Pinot setup. Are there any docs on how to do this?
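    There is no Ceph-specific guide, but Ceph's RADOS Gateway exposes an S3-compatible API, so one approach is pointing Pinot's S3PinotFS at the RGW endpoint, the same way other S3-compatible object stores are wired up. A hedged controller-config sketch (endpoint, bucket and credentials are placeholders; servers and minions need the matching settings):
    controller.data.dir=s3://pinot-deepstore/controller-data
    pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
    pinot.controller.storage.factory.s3.endpoint=http://ceph-rgw.example.com:7480
    pinot.controller.storage.factory.s3.region=us-east-1
    pinot.controller.storage.factory.s3.accessKey=<access-key>
    pinot.controller.storage.factory.s3.secretKey=<secret-key>
    pinot.controller.segment.fetcher.protocols=file,http,s3
    pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher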

  • Ross Morrow (06/09/2025, 1:14 PM)
    Hey all, are there any general sizing guidelines or practices published? I'm experimenting with tables at orders of magnitude of 1M, 10M, and 100M (the last one in progress) and varying PVC size, heap/pod memory, segment size, server count, etc. to build some intuition. I thought I would ask whether there are docs on how to think about this in general.

  • Ajinkya (06/17/2025, 8:50 AM)
    Hello team, I am exploring whether Apache Pinot would be a good fit for our use case and would appreciate your insights please :) We are currently on AWS and use S3 as our data lake. After cleaning and transforming the data with Spark, we write the output to MySQL RDS tables, which are then used in various API calls. The challenge we are facing is CPU spikes on RDS when Spark writes to it, especially since these jobs run every 3 hours. Slowing down the writes helps somewhat, but as we add more of these jobs it becomes a bottleneck, both on RDS and EMR. Scaling RDS isn't a viable option due to cost concerns. Would it make sense to write the output from Spark directly to Pinot instead of MySQL, and serve the API reads from Pinot? Could this be a cost-effective and scalable strategy? Thanks in advance!

  • Nithish (07/04/2025, 9:29 AM)
    Hey team, I am trying Pinot offline table ingestion using the spec file below. The tar files get created, but the Spark job fails after the 3 push attempts specified in the config. Notes:
    • SegmentCreationAndTarPush worked, but returns a 413 error for large tar files.
    • SegmentCreationAndMetadataPush works fine, but has a known issue as per this thread: https://apache-pinot.slack.com/archives/CDRCA57FC/p1715293105121389
    jobType: SegmentCreationAndUriPush
    
    inputDirURI: 'gs://bucket-name/warehouse/dataengineering.db/ems_attributes/data'
    includeFileNamePattern: 'glob:**/*.parquet'
    excludeFileNamePattern: 'glob:**/_SUCCESS,glob:**/*.crc,glob:**/*metadata*,glob:**/*.json'
    outputDirURI: 'gs://bucket-name/pinot-segments/poc_ems_attributes'
    overwriteOutput: true
    
    # Execution Framework
    executionFrameworkSpec:
      name: 'spark'
    
      # replace spark with spark3 for versions > 3.2.0
      segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentGenerationJobRunner'
      segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentTarPushJobRunner'
      segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentUriPushJobRunner'
      segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentMetadataPushJobRunner'
      extraConfigs:
        stagingDir: 'gs://bucket-name/pinot-batch-ingestion/staging'
    
    # Record Reader Configuration for Parquet
    recordReaderSpec:
      dataFormat: 'parquet'
      className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
    
    # Pinot File System
    pinotFSSpecs:
      - scheme: 'gs'
        className: 'org.apache.pinot.plugin.filesystem.GcsPinotFS'
    
    # Table Configuration
    tableSpec:
      tableName: 'poc_ems_attributes'
      schemaURI: 'https://prod-dp-pinot-controller.in/schemas/poc_ems_attributes'
      tableConfigURI: 'https://prod-dp-pinot-controller.in/tables/poc_ems_attributes'
    
    # Segment Name Generation
    segmentNameGeneratorSpec:
      type: simple
      configs:
        segment.name.prefix: 'poc_ems_attributes'
        segment.name.postfix: 'uri_push'
        exclude.sequence.id: false
    
    # Pinot Cluster Configuration
    pinotClusterSpecs:
      - controllerURI: 'https://prod-dp-pinot-controller.in'
    
    # Push Job Configuration
    pushJobSpec:
      pushAttempts: 3
      pushRetryIntervalMillis: 15000
      pushParallelism: 2
    ERROR:
    java.lang.RuntimeException: org.apache.pinot.spi.utils.retry.AttemptsExceededException: Operation failed after 3 attempts
    	at org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentUriPushJobRunner$1.call(SparkSegmentUriPushJobRunner.java:130)
    	at org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentUriPushJobRunner$1.call(SparkSegmentUriPushJobRunner.java:118)
    	at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1(JavaRDDLike.scala:352)
    	at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1$adapted(JavaRDDLike.scala:352)
    	at scala.collection.Iterator.foreach(Iterator.scala:943)
    	at scala.collection.Iterator.foreach$(Iterator.scala:943)
    	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
    	at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1028)
    	at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1028)
    	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2455)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
    	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
    	at org.apache.spark.scheduler.Task.run(Task.scala:141)
    	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
    	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
    	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
    	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:95)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
    	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    	at java.base/java.lang.Thread.run(Thread.java:829)
    Caused by: org.apache.pinot.spi.utils.retry.AttemptsExceededException: Operation failed after 3 attempts
    	at org.apache.pinot.spi.utils.retry.BaseRetryPolicy.attempt(BaseRetryPolicy.java:65)
    	at org.apache.pinot.segment.local.utils.SegmentPushUtils.sendSegmentUris(SegmentPushUtils.java:231)
    	at org.apache.pinot.segment.local.utils.SegmentPushUtils.sendSegmentUris(SegmentPushUtils.java:115)
    	at org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentUriPushJobRunner$1.call(SparkSegmentUriPushJobRunner.java:128)
    	... 20 more

  • Vitor Mattioli (07/04/2025, 3:03 PM)
    Hi all, regarding the use of stream.kafka.consumer.type: the Pinot documentation states that the high-level Kafka consumer (HLC) is not supported, but we believe it works. Is there any other documentation about the usage of the high-level consumer in any application? If it actually works, what limitations would we have?
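    For reference, the partition-level ("lowlevel") consumer is the supported mode, and the high-level consumer has been deprecated for a while, so building on it is risky. A typical streamConfigs sketch (topic, broker and plugin version are placeholders):
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.topic.name": "my-topic",
      "stream.kafka.broker.list": "kafka:9092",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder"
    }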

  • Kamil (07/10/2025, 8:23 PM)
    Hi, does Pinot automatically drop nodes from discovery when they become unreachable? What I have been seeing is that dead nodes are kept in the cluster and remain part of the ideal state of an offline table, so I can't drop them. I tried a rebalance to move segments from the dead nodes to the live ones (segments are ultimately stored in the deep store) and get the dead nodes out of the ideal state, but it didn't work: the cluster refused to rebalance, saying the segments are already well balanced.
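    A hedged cleanup path (assuming the segments are safe in deep store and the live servers carry the right tenant tag): force a rebalance that reassigns instances, then drop the dead instance once it no longer appears in any ideal state. Endpoints are from the controller REST API; host and instance names are placeholders:
    # force instance reassignment; downtime=true lets segments move even if replicas drop below the minimum
    curl -X POST "http://CONTROLLER:9000/tables/myTable/rebalance?type=OFFLINE&reassignInstances=true&downtime=true"
    # after the ideal state no longer references the dead server, remove it from the cluster
    curl -X DELETE "http://CONTROLLER:9000/instances/Server_dead-host_8098"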

  • Kamil (07/11/2025, 6:38 PM)
    Hi, I'm curious: do segments usually weigh about 10x more on the servers compared to what we have in the deep store? Is there a default compression applied to dimensions? I read that they should use LZ4 by default, but something makes my data explode in size on the servers.
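    For what it's worth, the deep-store copy is a gzip-compressed segment tarball, while the loaded segment on a server is uncompressed forward indexes, dictionaries and any extra indexes, so some inflation is normal; LZ4 only applies to columns stored with raw (non-dictionary) encoding. A sketch of forcing raw LZ4 storage for a hypothetical heavy dimension via fieldConfigList in the table config:
    "fieldConfigList": [
      {
        "name": "my_large_dimension",
        "encodingType": "RAW",
        "compressionCodec": "LZ4"
      }
    ]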

  • Kamil (07/15/2025, 9:02 PM)
    Hi, how fragile is MergeRollupTask? I have run this task twice and both attempts failed with a timeout. That is fine, with some tuning it would get better, but in the former case the table ended up with too much data, and in the latter some data went missing.
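    For reference, a typical MergeRollupTask block in the table config looks like the sketch below (bucket, buffer and record-count values are illustrative); most of the tuning that avoids timeouts happens here and in the minion task timeout settings:
    "task": {
      "taskTypeConfigsMap": {
        "MergeRollupTask": {
          "1day.mergeType": "rollup",
          "1day.bucketTimePeriod": "1d",
          "1day.bufferTimePeriod": "1d",
          "1day.maxNumRecordsPerSegment": "5000000"
        }
      }
    }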

  • Kamil (07/17/2025, 1:18 PM)
    Hi, what is the easiest way to copy or back up a table in Pinot? I want to run some experiments on my table, but I don't want to load it from the source again. Is there an easy way to duplicate a table?

  • Nitish Goyal (07/30/2025, 1:07 AM)
    Hi, I'm going to start a POC with Pinot and have a question about the setup. We want to ingest from a single Kafka topic into hundreds of Pinot tables (one table per tenant); data for all tables arrives on a single topic for ease of management. What are our options for this use case?
    1. Re-ingest the data from the single Kafka topic into hundreds of Kafka topics, then ingest 1:1 from each Kafka topic into its Pinot table.
    2. Use Flink to fan out from the Kafka topic into multiple Pinot tables via HTTP requests to /ingest/batch.
    3. Use Flink to fan out from the Kafka topic into multiple Pinot tables and push segments directly.
    4. Use Spark for options 2 and 3 above.
    I have put more details and the pros and cons of each approach in the attached document. Can someone guide me on the right way forward?
    Attachment: Pinot-Ingestion-Approaches.pdf

  • Zhuangda Z (08/03/2025, 3:06 AM)
    Hi team, I would like to hear your thoughts on our table design: when does a table become too big? Currently we use one single table for all customers, and the daily volume is approaching 1B. With a segment size of 1M-1.5M, we generate ~1K segments a day, and if we were to keep 2 years of data the table would reach about 730K segments at the current traffic. We have been exploring the idea of sharding it into smaller tables by customer hash, but that would introduce new operational/maintenance load, e.g. one Kafka topic per shard and expanding/shrinking the number of shards (consistent hashing). Is there any prior example of a similar design, or do you have recommendations for simplifying the design if possible? 🙏

  • Boris Tashkulov (08/11/2025, 2:38 PM)
    Hi team, I have a local Docker Compose Pinot setup; ZooKeeper, the controller and the server all have mounted volumes. I restarted the Docker machine (I'm using a Mac) and now the cluster contains each component twice, one dead and one alive. Queries don't work because all the data was on the dead server; the live server has the same data thanks to the mounted volume, but Pinot doesn't know that. After I assigned static IPs to each cluster node and repeated the restart, everything worked: the cluster started, all the data was there, and ingestion works fine too. It looks like I have a misconfiguration, could you help me with it? What is the best practice here?
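    A likely cause is that Pinot registers instances by container IP, so new IPs after a restart show up as brand-new instances while the old ones stay behind as dead ones. A hedged fix, in addition to pinning hostname/container_name in the compose file, is to make each component use its hostname as the instance ID (config file names here are illustrative):
    # controller.conf / broker.conf / server.conf
    pinot.set.instance.id.to.hostname=true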

  • San Kumar (08/12/2025, 3:24 AM)
    Hello team, we want to replace/create segments in an offline table named with a combination of dd-mm-yy-hh-<productid>-<country>. Is it possible to do so, and can you help me understand how to define the segment name?
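    Segment names for batch ingestion are controlled by segmentNameGeneratorSpec in the job spec; there is no generator that injects arbitrary product/country tokens, but if you partition the input by product and country and run one job per partition, a prefix per run gets you that naming. A sketch with placeholder values:
    segmentNameGeneratorSpec:
      type: simple
      configs:
        segment.name.prefix: '12-08-25-03-prod123-US'
        exclude.sequence.id: false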

  • Yeshwanth (08/14/2025, 10:11 AM)
    Hi team, does Pinot publish a list of recommended alerts to configure for proper observability? I couldn't find anything in the docs.

  • Boris Tashkulov (08/15/2025, 8:56 AM)
    Hey team, how can I make sure a star-tree index is being used for a query? To double-check, I used the star-tree recipe, and here are the plans for queries on the inverted-index table (FILTER_INVERTED_INDEX) and on the star-tree table (FILTER_FULL_SCAN).
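    For reference, when the star-tree is actually used the EXPLAIN PLAN output contains a FILTER_STARTREE_INDEX operator; FILTER_FULL_SCAN suggests the query shape did not match the star-tree definition (aggregation functions, group-by dimensions and filter columns all have to fit it). A hedged check against a hypothetical table:
    EXPLAIN PLAN FOR
    SELECT country, SUM(sales)
    FROM myStarTreeTable
    WHERE country = 'US'
    GROUP BY country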

  • Rishika (09/11/2025, 3:05 AM)
    Hello, I am trying to load data from a CSV file into Apache Pinot. I have mounted my local directory into my Podman VM, but now I'm stuck writing the job-spec.yml file. Has anyone ingested CSV data? If so, please help.

  • Rishika (09/17/2025, 4:24 AM)
    Hello, I was following the docker-compose file available in pinot-docs, but every time I get timeout issues. Has anyone been able to spin up Pinot on their local machine? If yes, could you please share your docker-compose.yml?
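    A minimal compose sketch in the spirit of the quickstart docs (image tags, ports and JVM sizing are placeholders and may need adjusting for your machine):
    services:
      zookeeper:
        image: zookeeper:3.9
        ports:
          - "2181:2181"
      pinot-controller:
        image: apachepinot/pinot:latest
        command: "StartController -zkAddress zookeeper:2181"
        ports:
          - "9000:9000"
        depends_on:
          - zookeeper
      pinot-broker:
        image: apachepinot/pinot:latest
        command: "StartBroker -zkAddress zookeeper:2181"
        ports:
          - "8099:8099"
        depends_on:
          - pinot-controller
      pinot-server:
        image: apachepinot/pinot:latest
        command: "StartServer -zkAddress zookeeper:2181"
        depends_on:
          - pinot-broker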

  • Neha (09/18/2025, 5:20 PM)
    Hello team, I am trying to integrate an SSL-authenticated Kafka topic with Apache Pinot using the attached schema and table configurations. I can create the schema successfully, but I keep getting "Timeout expired while fetching topic metadata" no matter how much I increase the timeout. I've also verified schema-registry accessibility from inside my controller (see the docker compose image); it is reachable just fine. Could someone help me figure out whether I am missing something in the process?
    Attachments: Test-TableConfig.json, Test-Schema.json
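    One note: that timeout usually points at the SSL handshake with the brokers rather than at the schema registry. The Kafka client SSL properties need to be present in streamConfigs as well (assuming your Pinot version passes raw Kafka client properties through, as the SSL ingestion examples do; paths and passwords are placeholders):
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.broker.list": "broker1:9093",
      "security.protocol": "SSL",
      "ssl.truststore.location": "/opt/pinot/certs/truststore.jks",
      "ssl.truststore.password": "<truststore-password>",
      "ssl.keystore.location": "/opt/pinot/certs/keystore.jks",
      "ssl.keystore.password": "<keystore-password>"
    }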

  • Tarek Salha (09/23/2025, 5:39 AM)
    Hi guys, I am new to Apache Pinot and am evaluating whether it is a good replacement for our Power BI database, which powers all analytics queries throughout the company. Reading through the docs, it seems the tool does not allow storing measure definitions centrally (e.g. "Total Amount" := SUM(SalesTable[myAmountColumn])). I consider this a crucial part of centralized analytics, to prevent every department from defining "Total Amount" differently. Of course the given example is trivial, but you get the idea. What is Pinot's strategy for achieving consistency in how these business measures are provided across the company? Or is it simply not the right tool for this kind of metadata? Thanks for your thoughts on this topic 🙂

  • Rajkumar (10/08/2025, 1:53 PM)
    Hi All,

  • Rajkumar (10/08/2025, 1:54 PM)
    We are trying to deploy Pinot to AKS following the documentation here: https://docs.pinot.apache.org/basics/getting-started/kubernetes-quickstart, and are running into issues with the ZooKeeper image: Error: ImagePullBackOff Back-off pulling image "docker.io/bitnami/zookeeper:3.9.3-debian-12-r21".

  • Rajkumar (10/08/2025, 1:56 PM)
    It looks like the earlier Bitnami images have been moved under a legacy namespace and a commercial version is available now. Is there a feasible way to get this working? I have been trying to use the repo at https://github.com/apache/pinot to get Pinot set up.
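    One hedged workaround is to override the ZooKeeper image so the chart pulls from the Bitnami legacy repository (or any other ZooKeeper image you trust); the exact value keys depend on the chart version, so check helm show values pinot/pinot first:
    helm install pinot pinot/pinot -n pinot-quickstart \
      --set zookeeper.image.repository=bitnamilegacy/zookeeper \
      --set zookeeper.image.tag=3.9.3-debian-12-r21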

  • RANJITH KUMAR (10/17/2025, 12:37 PM)
    Hi team, we're seeing task drops that cause segment creation failures and data loss. Can you help explain:
    1. Why do tasks get dropped?
    2. Do dropped tasks retry automatically?
    3. How can we prevent task drops?
    4. What happens to minion jobs when subtasks are dropped?

  • Alaa Halawani (10/21/2025, 6:47 PM)
    Hi everyone, I've recently started using Apache Pinot 1.4 and set up a real-time table with upsert enabled, consuming data from Kafka. I ingested about 1.7 million rows across 12 segments, and during the initial load test query performance was blazing fast. However, after restarting the server I noticed:
    • The server's memory usage dropped noticeably
    • A significant spike in query latency, especially in schedulerWaitMs
    Additional details:
    • Ingestion is stopped (so no extra Kafka load)
    • Increasing pinot.query.scheduler.query_runner_threads helped slightly, but performance is still slower than before the restart
    • I tried both MMAP and HEAP loading modes with similar results
    • I am running the Pinot cluster on k8s nodes
    Has anyone run into similar behavior after a restart? Any recommendations or configuration tips to improve performance would be much appreciated.