Manish G
03/22/2025, 3:46 PM
{
  "schemaName": "my-test-schema",
  "enableColumnBasedNullHandling": true,
  "dimensionFieldSpecs": [
    {
      "name": "field",
      "dataType": "FLOAT",
      "fieldType": "DIMENSION"
    }
  ]
}
I want to insert a null value into this column:
field
1.43
null
It throws this error:
Caused by: java.lang.NumberFormatException: For input string: "null"
at java.base/jdk.internal.math.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2054)
at java.base/jdk.internal.math.FloatingDecimal.parseFloat(FloatingDecimal.java:122)
at java.base/java.lang.Float.parseFloat(Float.java:570)
at org.apache.pinot.common.utils.PinotDataType$11.toFloat(PinotDataType.java:617)
at org.apache.pinot.common.utils.PinotDataType$7.convert(PinotDataType.java:425)
at org.apache.pinot.common.utils.PinotDataType$7.convert(PinotDataType.java:375)
at org.apache.pinot.segment.local.recordtransformer.DataTypeTransformer.transform(DataTypeTransformer.java:118)
What is the correct way of having null values?
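On the null question above: with enableColumnBasedNullHandling the column can store nulls, but the incoming record has to carry a real null rather than the literal string "null", which is what DataTypeTransformer is failing to parse as a FLOAT here. For JSON or Avro input that means emitting the field as an actual null (or omitting it); for CSV input the reader has to be told which marker string maps to null. A minimal sketch, assuming CSV ingestion and a Pinot version whose CSV record reader exposes the nullStringValue option:
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    # treat the literal string "null" in the input file as a null value
    # (assumes this option is available in the Pinot version in use)
    nullStringValue: 'null'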
baarath
04/23/2025, 6:02 AM
java.lang.RuntimeException: Caught exception during running - org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:152)
at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.runIngestionJob(IngestionJobLauncher.java:121)
at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.execute(LaunchDataIngestionJobCommand.java:132)
at org.apache.pinot.tools.Command.call(Command.java:33)
at org.apache.pinot.tools.Command.call(Command.java:29)
at picocli.CommandLine.executeUserObject(CommandLine.java:2045)
at picocli.CommandLine.access$1500(CommandLine.java:148)
at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2457)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2419)
at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
at picocli.CommandLine$RunLast.execute(CommandLine.java:2421)
at picocli.CommandLine.execute(CommandLine.java:2174)
at org.apache.pinot.tools.admin.PinotAdministrator.execute(PinotAdministrator.java:173)
at org.apache.pinot.tools.admin.PinotAdministrator.main(PinotAdministrator.java:204)
Caused by: java.lang.IllegalArgumentException
at java.base/sun.nio.fs.UnixFileSystem.getPathMatcher(UnixFileSystem.java:286)
at org.apache.pinot.common.segment.generation.SegmentGenerationUtils.listMatchedFilesWithRecursiveOption(SegmentGenerationUtils.java:263)
at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.run(SegmentGenerationJobRunner.java:177)
at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:150)
... 14 more
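A note on this trace: java.nio's getPathMatcher throws IllegalArgumentException when the pattern string is not of the form "syntax:pattern", so in SegmentGenerationUtils.listMatchedFilesWithRecursiveOption this usually points to an includeFileNamePattern or excludeFileNamePattern in the job spec that is missing its 'glob:' (or 'regex:') prefix. A minimal sketch of the expected form (file extensions are placeholders):
includeFileNamePattern: 'glob:**/*.parquet'
excludeFileNamePattern: 'glob:**/*.tmp'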
baarath
04/24/2025, 10:41 AM
baarath
04/28/2025, 8:54 AM
Ram Makireddi
04/28/2025, 3:17 PM
Vivekanand
05/12/2025, 7:17 AM
Vishruth Raj
06/09/2025, 1:40 AM
Ross Morrow
06/09/2025, 1:14 PM
Ajinkya
06/17/2025, 8:50 AM
Nithish
07/04/2025, 9:29 AM
• SegmentCreationAndTarPush: getting a 413 error for large tar files (see the note after this list)
• SegmentCreationAndMetadataPush: this works fine, but it has a known issue as per the thread - https://apache-pinot.slack.com/archives/CDRCA57FC/p1715293105121389
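A note on the 413: a 413 Payload Too Large on tar push often comes from a reverse proxy or ingress in front of the controller rather than from Pinot itself. If, hypothetically, the controller is exposed through an nginx ingress on Kubernetes, the body-size limit can be raised with the standard annotation (a sketch, not specific to this setup):
metadata:
  annotations:
    # "0" disables the limit; a bounded value such as "512m" also works
    nginx.ingress.kubernetes.io/proxy-body-size: "0"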
jobType: SegmentCreationAndUriPush
inputDirURI: 'gs://bucket-name/warehouse/dataengineering.db/ems_attributes/data'
includeFileNamePattern: 'glob:**/*.parquet'
excludeFileNamePattern: 'glob:**/_SUCCESS,glob:**/*.crc,glob:**/*metadata*,glob:**/*.json'
outputDirURI: 'gs://bucket-name/pinot-segments/poc_ems_attributes'
overwriteOutput: true

# Execution Framework
executionFrameworkSpec:
  name: 'spark'
  # replace spark with spark3 for versions > 3.2.0
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentUriPushJobRunner'
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentMetadataPushJobRunner'
  extraConfigs:
    stagingDir: 'gs://bucket-name/pinot-batch-ingestion/staging'

# Record Reader Configuration for Parquet
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'

# Pinot File System
pinotFSSpecs:
  - scheme: 'gs'
    className: 'org.apache.pinot.plugin.filesystem.GcsPinotFS'

# Table Configuration
tableSpec:
  tableName: 'poc_ems_attributes'
  schemaURI: 'https://prod-dp-pinot-controller.in/schemas/poc_ems_attributes'
  tableConfigURI: 'https://prod-dp-pinot-controller.in/tables/poc_ems_attributes'

# Segment Name Generation
segmentNameGeneratorSpec:
  type: simple
  configs:
    segment.name.prefix: 'poc_ems_attributes'
    segment.name.postfix: 'uri_push'
    exclude.sequence.id: false

# Pinot Cluster Configuration
pinotClusterSpecs:
  - controllerURI: 'https://prod-dp-pinot-controller.in'

# Push Job Configuration
pushJobSpec:
  pushAttempts: 3
  pushRetryIntervalMillis: 15000
  pushParallelism: 2
ERROR:
java.lang.RuntimeException: org.apache.pinot.spi.utils.retry.AttemptsExceededException: Operation failed after 3 attempts
at org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentUriPushJobRunner$1.call(SparkSegmentUriPushJobRunner.java:130)
at org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentUriPushJobRunner$1.call(SparkSegmentUriPushJobRunner.java:118)
at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1(JavaRDDLike.scala:352)
at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1$adapted(JavaRDDLike.scala:352)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1028)
at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1028)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2455)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:95)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.pinot.spi.utils.retry.AttemptsExceededException: Operation failed after 3 attempts
at org.apache.pinot.spi.utils.retry.BaseRetryPolicy.attempt(BaseRetryPolicy.java:65)
at org.apache.pinot.segment.local.utils.SegmentPushUtils.sendSegmentUris(SegmentPushUtils.java:231)
at org.apache.pinot.segment.local.utils.SegmentPushUtils.sendSegmentUris(SegmentPushUtils.java:115)
at org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentUriPushJobRunner$1.call(SparkSegmentUriPushJobRunner.java:128)
... 20 more
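For SegmentCreationAndUriPush, the AttemptsExceededException above only wraps the real failure: each push is an HTTP call handing the gs:// URI to the controller, and the controller then has to download the segment itself, so it needs the GCS filesystem plugin and credentials configured, and its logs will show the underlying error for each failed attempt. A sketch of the controller-side settings, following the documented GCS deep-store configuration (exact keys and values depend on the Pinot version and environment; the values below are placeholders):
# controller must be able to read segments from GCS
pinot.controller.storage.factory.class.gs=org.apache.pinot.plugin.filesystem.GcsPinotFS
pinot.controller.storage.factory.gs.projectId=<gcp-project>
pinot.controller.storage.factory.gs.gcpKey=<path-to-service-account-json>
# allow the controller to fetch segments over the gs scheme
pinot.controller.segment.fetcher.protocols=file,http,gs
pinot.controller.segment.fetcher.gs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher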
Vitor Mattioli
07/04/2025, 3:03 PM
Kamil
07/10/2025, 8:23 PM
Kamil
07/11/2025, 6:38 PM
Kamil
07/15/2025, 9:02 PM
Kamil
07/17/2025, 1:18 PM
Nitish Goyal
07/30/2025, 1:07 AM
/ingest/batch
3. Use flink to do fanout from Kafka topic into multiple Pinot tables and push segments directly
4. Use Spark for options 2 and 3 listed above
I have put more details and the pros and cons of each approach in the attached document. Can someone guide me on the right way forward?
Zhuangda Z
08/03/2025, 3:06 AM
Boris Tashkulov
08/11/2025, 2:38 PM
San Kumar
08/12/2025, 3:24 AM
Yeshwanth
08/14/2025, 10:11 AM
Boris Tashkulov
08/15/2025, 8:56 AM
Rishika
09/11/2025, 3:05 AM
Rishika
09/17/2025, 4:24 AM
Neha
09/18/2025, 5:20 PM
Tarek Salha
09/23/2025, 5:39 AM
Rajkumar
10/08/2025, 1:53 PM
Rajkumar
10/08/2025, 1:54 PM
Rajkumar
10/08/2025, 1:56 PM
RANJITH KUMAR
10/17/2025, 12:37 PM
Alaa Halawani
10/21/2025, 6:47 PM
schedulerWaitMs
Additional details:
• Ingestion is stopped (so there is no extra Kafka load)
• Increasing pinot.query.scheduler.query_runner_threads helped slightly, but performance is still slower than before the restart
• Tried both MMAP and HEAP loading modes with similar results
• I am running the Pinot cluster on k8s nodes
Has anyone run into similar behavior after a restart?
Any recommendations or configuration tips to improve performance would be much appreciated
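On schedulerWaitMs: that metric is the time a query spends queued before a runner thread picks it up, so sustained high values generally mean the scheduler's runner/worker pools are saturated. A sketch of the server-side settings to review, using the commonly documented keys (names, defaults, and behaviour may vary by Pinot version; the values here are placeholders):
# threads that pick up queued queries (the config already tried in the thread)
pinot.query.scheduler.query_runner_threads=<num-runner-threads>
# threads that execute per-segment work for running queries
pinot.query.scheduler.query_worker_threads=<num-worker-threads>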