Venkat
03/18/2025, 7:36 AMKasinath Reddy
03/18/2025, 10:53 AMKasinath Reddy
03/18/2025, 10:53 AM
Current upsert config
"upsertConfig": {
"mode": "PARTIAL",
"comparisonColumns": [
"processed_at"
],
"metadataTTL": 0,
"dropOutOfOrderRecord": false,
"enableSnapshot": true,
"deletedKeysTTL": 0,
"partialUpsertStrategies": {},
"enablePreload": false,
"consistencyMode": "SNAPSHOT",
"upsertViewRefreshIntervalMs": 3000,
"allowPartialUpsertConsumptionDuringCommit": false,
"hashFunction": "NONE",
"defaultPartialUpsertStrategy": "OVERWRITE"
},
"isDimTable": false
}
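As a quick illustration of what this config does (a sketch only, not Pinot code; the primary key column "ticket_id" and the values are made up): with mode PARTIAL and comparisonColumns ["processed_at"], the copy of a key with the larger processed_at is treated as the latest, and with defaultPartialUpsertStrategy OVERWRITE its column values replace the previously stored ones.
# Sketch: how the comparison column decides which version of a primary key survives.
stored = {"ticket_id": 42, "status": "OPEN", "processed_at": 1710750000000}
incoming = {"ticket_id": 42, "status": "RESOLVED", "processed_at": 1710760000000}

def resolve(existing, new, comparison_column="processed_at"):
    # The record with the larger processed_at wins; ties go to the newly arrived
    # record in this sketch. OVERWRITE then replaces the stored column values.
    return new if new[comparison_column] >= existing[comparison_column] else existing

print(resolve(stored, incoming))  # keeps the RESOLVED record, whose processed_at is newer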
Roshan
03/18/2025, 3:35 PMRoshan
03/18/2025, 4:40 PM
import requests
import json
from pathlib import Path
import time
PINOT_CONTROLLER = "********"
PINOT_BROKER = "*********"
AUTH_HEADERS = {
"Authorization": "Basic YWRtaW46dmVyeXNlY3JldA==",
"Content-Type": "application/json"
}
def verify_segment(tenant_name, segment_name):
max_retries = 10
retry_interval = 2 # seconds
for i in range(max_retries):
print(f"\nChecking segment status (attempt {i+1}/{max_retries})...")
response = requests.get(f"{PINOT_CONTROLLER}/segments/{tenant_name}/{segment_name}/metadata",headers=AUTH_HEADERS)
if response.status_code == 200:
print("Segment is ready!")
return True
print(f"Segment not ready yet, waiting {retry_interval} seconds...")
time.sleep(retry_interval)
return False
def simple_ingest(tenant_name):
# Verify schema and table exist
schema_response = requests.get(f'{PINOT_CONTROLLER}/schemas/{tenant_name}',headers=AUTH_HEADERS)
table_response = requests.get(f'{PINOT_CONTROLLER}/tables/{tenant_name}',headers=AUTH_HEADERS)
if schema_response.status_code != 200 or table_response.status_code != 200:
print(f"Schema or table missing for tenant {tenant_name}. Please run create_schema.py and create_table.py first")
return
csv_path = Path(f"data/{tenant_name}_data.csv")
print(f"\nUploading data for tenant {tenant_name}...")
with open(csv_path, 'rb') as f:
files = {'file': (f'{tenant_name}_data.csv', f, 'text/csv')}
# Using a dictionary for column mapping first
column_map = {
"Ticket ID": "Ticket_ID",
"Customer Name": "Customer_Name",
"Customer Email": "Customer_Email",
"Company_name": "Company_name",
"Customer Age": "Customer_Age",
"Customer Gender": "Customer_Gender",
"Product purchased": "product_purchased",
"Date of Purchase": "Date_of_Purchase",
"Ticket Subject": "Ticket_Subject",
"Description": "Description",
"Ticket Status": "Ticket_Status",
"Resolution": "Resolution",
"Ticket Priority": "Ticket_Priority",
"Source": "Source",
"Created date": "Created_date",
"First Response Time": "First_Response_Time",
"Time to Resolution": "Time_of_Resolution",
"Number of conversations": "Number_of_conversations",
"Customer Satisfaction Rating": "Customer_Satisfaction_Rating",
"Category": "Category",
"Intent": "Intent",
"Type": "Type",
"Relevance": "Relevance",
"Escalate": "Escalate",
"Sentiment": "Sentiment",
"Tags": "Tags",
"Agent": "Agent",
"Agent politeness": "Agent_politeness",
"Agent communication": "Agent_communication",
"Agent Patience": "Agent_Patience",
"Agent overall performance": "Agent_overall_performance",
"Conversations": "Conversations"
}
config = {
"inputFormat": "csv",
"header": "true",
"delimiter": ",",
"fileFormat": "csv",
"multiValueDelimiter": ";",
"skipHeader": "false",
# Convert dictionary to proper JSON string
"columnHeaderMap": json.dumps(column_map)
}
params = {
'tableNameWithType': f'{tenant_name}_OFFLINE',
'batchConfigMapStr': json.dumps(config)
}
upload_headers = {
"Authorization": AUTH_HEADERS["Authorization"]
}
response = requests.post(
f'{PINOT_CONTROLLER}/ingestFromFile',
files=files,
params=params,
headers=upload_headers
)
print(f"Upload response: {response.status_code}")
if response.status_code != 200:
print(f"Error: {response.text}")
return
if response.status_code == 200:
try:
response_data = json.loads(response.text)
print(f"Response data: {response_data}")
segment_name = response_data["status"].split("segment: ")[1]
print(f"\nWaiting for segment {segment_name} to be ready...")
if verify_segment(tenant_name, segment_name):
query = {
"sql": f"SELECT COUNT(*) FROM {tenant_name}_OFFLINE",
"trace": False
}
query_response = requests.post(
f"{PINOT_BROKER}/query/sql",
json=query,
headers=AUTH_HEADERS
)
print("\nQuery response:", query_response.status_code)
if query_response.status_code == 200:
print(json.dumps(query_response.json(), indent=2))
else:
print("Segment verification timed out")
except Exception as e:
print(f"Error processing segment: {e}")
print(f"Full response text: {response.text}")
if __name__ == "__main__":
simple_ingest("test_tenant")
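The segment-name parsing above (split("segment: ")[1]) assumes the controller's status text always contains that exact token; a slightly more defensive version of the same idea (a sketch, same assumption about the status string format):
import re

def extract_segment_name(status_text):
    # Pulls the token after "segment: " out of the controller's status message,
    # returning None instead of raising IndexError when the token is missing.
    match = re.search(r"segment:\s*(\S+)", status_text)
    return match.group(1) if match else None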
Slack Conversation
Sandeep Jain
03/20/2025, 4:33 AMKasinath Reddy
03/20/2025, 4:58 AM
segment uri: "http://pinot-controller:9000/segments/test/test__0__101__20250315T0506Z"
Mike Stewart
03/20/2025, 5:08 PM"starTreeIndexConfigs": [
{
"dimensionsSplitOrder": ["event_type"],
"functionColumnPairs": ["SUM__event_currency_amount"],
"maxLeafRecords": 10000
}
]
When I execute the following query:
SELECT event_type, SUM(event_currency_amount)
FROM [table]
GROUP BY event_type;
the Star Tree index is used, and only 9 documents are read.
However, when I introduce a time filter of any kind:
SELECT event_type, SUM(event_currency_amount)
FROM [table]
WHERE event_time >= 0 AND event_time < 1800000000000
GROUP BY event_type;
the Star Tree index is ignored, resulting in 14 million documents being scanned instead.
Since the event_time range in the WHERE clause fully encompasses the segment's event_time range, I expected the Star Tree index to still be utilized. Based on previous discussions, I was under the impression that this should be the case.
I also reviewed the below link, which I understood might specifically address this scenario in the product.
For fuller context, I currently have a single offline table with a single segment. Initially, I had a Hybrid table, but I removed it to minimize variables while troubleshooting this issue.
Any guidance on why this is happening and how I might ensure the Star Tree index is used would be greatly appreciated. Thanks in advance!
Improve star-tree to use star-node when the predicate matches all the non-star nodes by Jackie-Jiang · Pull Request #9667 · apache/pinot · GitHub
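One thing worth checking (a sketch, not a confirmed fix): a star-tree can only serve filters on columns that are part of dimensionsSplitOrder, so event_time being absent from the split order would explain the fall-back to a full scan. Adding a coarser time bucket to the split order may let the filtered query use the index; the column name below is a placeholder for a derived, day-granularity column.
# Sketch: star-tree config with a time bucket added to the split order so filters
# on event_time can be answered from the tree. "event_time_day" is hypothetical;
# using the raw millisecond column directly would blow up the tree's cardinality.
star_tree_index_configs = [
    {
        "dimensionsSplitOrder": ["event_type", "event_time_day"],
        "functionColumnPairs": ["SUM__event_currency_amount"],
        "maxLeafRecords": 10000,
    }
]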
Manish G
03/22/2025, 3:04 PMManish G
03/22/2025, 3:46 PM
{
"schemaName": "my-test-schema",
"enableColumnBasedNullHandling": true,
"dimensionFieldSpecs": [
{
"name": "field",
"dataType": "FLOAT",
"fieldType": "DIMENSION"
}
]
}
I want to insert a null value in the column:
field
1.43
null
It throws an error:
Caused by: java.lang.NumberFormatException: For input string: "null"
at java.base/jdk.internal.math.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2054)
at java.base/jdk.internal.math.FloatingDecimal.parseFloat(FloatingDecimal.java:122)
at java.base/java.lang.Float.parseFloat(Float.java:570)
at org.apache.pinot.common.utils.PinotDataType$11.toFloat(PinotDataType.java:617)
at org.apache.pinot.common.utils.PinotDataType$7.convert(PinotDataType.java:425)
at org.apache.pinot.common.utils.PinotDataType$7.convert(PinotDataType.java:375)
at org.apache.pinot.segment.local.recordtransformer.DataTypeTransformer.transform(DataTypeTransformer.java:118)
What is the correct way of having null values?
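The NumberFormatException suggests the literal string "null" is reaching the FLOAT converter rather than an actual null value; with enableColumnBasedNullHandling the missing value should arrive as a real null (or be omitted), not as the text "null". A minimal sketch of the difference, assuming records are produced as JSON before ingestion:
import json

bad_record = {"field": "null"}   # the 4-character string "null" -> NumberFormatException above
good_record = {"field": None}    # serialises to a real JSON null
also_fine = {}                   # or omit the field entirely

print(json.dumps(bad_record), json.dumps(good_record), json.dumps(also_fine))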
baarath
04/23/2025, 6:02 AM
java.lang.RuntimeException: Caught exception during running - org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:152)
at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.runIngestionJob(IngestionJobLauncher.java:121)
at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.execute(LaunchDataIngestionJobCommand.java:132)
at org.apache.pinot.tools.Command.call(Command.java:33)
at org.apache.pinot.tools.Command.call(Command.java:29)
at picocli.CommandLine.executeUserObject(CommandLine.java:2045)
at picocli.CommandLine.access$1500(CommandLine.java:148)
at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2457)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2419)
at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
at picocli.CommandLine$RunLast.execute(CommandLine.java:2421)
at picocli.CommandLine.execute(CommandLine.java:2174)
at org.apache.pinot.tools.admin.PinotAdministrator.execute(PinotAdministrator.java:173)
at org.apache.pinot.tools.admin.PinotAdministrator.main(PinotAdministrator.java:204)
Caused by: java.lang.IllegalArgumentException
at java.base/sun.nio.fs.UnixFileSystem.getPathMatcher(UnixFileSystem.java:286)
at org.apache.pinot.common.segment.generation.SegmentGenerationUtils.listMatchedFilesWithRecursiveOption(SegmentGenerationUtils.java:263)
at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.run(SegmentGenerationJobRunner.java:177)
at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:150)
... 14 more
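UnixFileSystem.getPathMatcher throws IllegalArgumentException when the pattern string lacks a syntax prefix, so a likely cause here is an includeFileNamePattern or excludeFileNamePattern in the job spec missing its glob: (or regex:) prefix. A small pre-flight check one could run on the spec values (a sketch; the helper name is made up):
def check_file_name_pattern(pattern):
    # java.nio's getPathMatcher expects "syntax:pattern", e.g. "glob:**/*.parquet";
    # a bare "**/*.parquet" raises exactly this IllegalArgumentException.
    if not pattern.startswith(("glob:", "regex:")):
        raise ValueError(f"pattern {pattern!r} must start with 'glob:' or 'regex:'")

check_file_name_pattern("glob:**/*.parquet")   # passes
# check_file_name_pattern("**/*.parquet")      # would raise ValueError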
baarath
04/24/2025, 10:41 AMbaarath
04/28/2025, 8:54 AMRam Makireddi
04/28/2025, 3:17 PMVivekanand
05/12/2025, 7:17 AMVishruth Raj
06/09/2025, 1:40 AMRoss Morrow
06/09/2025, 1:14 PMAjinkya
06/17/2025, 8:50 AMNithish
07/04/2025, 9:29 AM
• SegmentCreationAndTarPush - getting 413 error for large tar files
• SegmentCreationAndMetadataPush - this works fine, but it has a known issue as per the thread: https://apache-pinot.slack.com/archives/CDRCA57FC/p1715293105121389
jobType: SegmentCreationAndUriPush
inputDirURI: 'gs://bucket-name/warehouse/dataengineering.db/ems_attributes/data'
includeFileNamePattern: 'glob:**/*.parquet'
excludeFileNamePattern: 'glob:**/_SUCCESS,glob:**/*.crc,glob:**/*metadata*,glob:**/*.json'
outputDirURI: 'gs://bucket-name/pinot-segments/poc_ems_attributes'
overwriteOutput: true
# Execution Framework
executionFrameworkSpec:
name: 'spark'
# replace spark with spark3 for versions > 3.2.0
segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentGenerationJobRunner'
segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentTarPushJobRunner'
segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentUriPushJobRunner'
segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentMetadataPushJobRunner'
extraConfigs:
stagingDir: 'gs://bucket-name/pinot-batch-ingestion/staging'
# Record Reader Configuration for Parquet
recordReaderSpec:
dataFormat: 'parquet'
className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
# Pinot File System
pinotFSSpecs:
- scheme: 'gs'
className: 'org.apache.pinot.plugin.filesystem.GcsPinotFS'
# Table Configuration
tableSpec:
tableName: 'poc_ems_attributes'
schemaURI: 'https://prod-dp-pinot-controller.in/schemas/poc_ems_attributes'
tableConfigURI: 'https://prod-dp-pinot-controller.in/tables/poc_ems_attributes'
# Segment Name Generation
segmentNameGeneratorSpec:
type: simple
configs:
segment.name.prefix: 'poc_ems_attributes'
segment.name.postfix: 'uri_push'
exclude.sequence.id: false
# Pinot Cluster Configuration
pinotClusterSpecs:
- controllerURI: 'https://prod-dp-pinot-controller.in'
# Push Job Configuration
pushJobSpec:
pushAttempts: 3
pushRetryIntervalMillis: 15000
pushParallelism: 2
ERROR:
java.lang.RuntimeException: org.apache.pinot.spi.utils.retry.AttemptsExceededException: Operation failed after 3 attempts
at org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentUriPushJobRunner$1.call(SparkSegmentUriPushJobRunner.java:130)
at org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentUriPushJobRunner$1.call(SparkSegmentUriPushJobRunner.java:118)
at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1(JavaRDDLike.scala:352)
at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1$adapted(JavaRDDLike.scala:352)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1028)
at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1028)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2455)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:95)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.pinot.spi.utils.retry.AttemptsExceededException: Operation failed after 3 attempts
at org.apache.pinot.spi.utils.retry.BaseRetryPolicy.attempt(BaseRetryPolicy.java:65)
at org.apache.pinot.segment.local.utils.SegmentPushUtils.sendSegmentUris(SegmentPushUtils.java:231)
at org.apache.pinot.segment.local.utils.SegmentPushUtils.sendSegmentUris(SegmentPushUtils.java:115)
at org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentUriPushJobRunner$1.call(SparkSegmentUriPushJobRunner.java:128)
... 20 more
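Since AttemptsExceededException swallows the controller's actual HTTP responses, it can help to check what, if anything, was registered after the three attempts; a quick sketch against the controller's segment-listing endpoint (credentials are placeholders):
import requests

CONTROLLER = "https://prod-dp-pinot-controller.in"   # controller from the job spec
AUTH = ("user", "password")                          # placeholder credentials

# Lists the segments the controller currently knows about for the table, to see
# whether any of the URI-push attempts partially succeeded before failing.
resp = requests.get(f"{CONTROLLER}/segments/poc_ems_attributes", auth=AUTH, timeout=30)
print(resp.status_code, resp.json() if resp.ok else resp.text)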
Vitor Mattioli
07/04/2025, 3:03 PMKamil
07/10/2025, 8:23 PMKamil
07/11/2025, 6:38 PMKamil
07/15/2025, 9:02 PMKamil
07/17/2025, 1:18 PMNitish Goyal
07/30/2025, 1:07 AM
/ingest/batch
3. Use flink to do fanout from Kafka topic into multiple Pinot tables and push segments directly
4. Use Spark for options 2 and 3 listed above
I have put more details and the pros and cons of each approach in the attached document. Can someone guide me on the right way forward?
Zhuangda Z
08/03/2025, 3:06 AMBoris Tashkulov
08/11/2025, 2:38 PMSan Kumar
08/12/2025, 3:24 AMYeshwanth
08/14/2025, 10:11 AMBoris Tashkulov
08/15/2025, 8:56 AM