# troubleshooting
w
Hello, I have some trouble ingesting data from a `csv` file. I have the following configs:
tableConfig
{
  "tableName": "transactions",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "replication": 1,
    "timeColumnName": "Timestamp",
    "schemaName": "transactions"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP"
  },
  "ingestionConfig": {
    "batchIngestionConfig": {
      "segmentIngestionType": "APPEND",
      "segmentIngestionFrequency": "DAILY"
    },
    "transformConfigs": [
      {
        "columnName": "Timestamp",
        "transformFunction": "fromDateTime(dateTimeStr, 'yyyy-MM-dd''T''HH:mm:ss''Z')"
      }
    ]
  },
  "metadata": {}
}
schema
{
  "schemaName": "transactions",
  "metricFieldSpecs": [
    {
      "name": "TakerAmount",
      "dataType": "DOUBLE"
    },
    {
      "name": "TakerVolumeUSD",
      "dataType": "DOUBLE"
    },
    {
      "name": "MakerAmount",
      "dataType": "DOUBLE"
    },
    {
      "name": "MakerVolumeUSD",
      "dataType": "DOUBLE"
    },
    {
      "name": "GasLimit",
      "dataType": "LONG"
    },
    {
      "name": "GasUsed",
      "dataType": "LONG"
    },
    {
      "name": "GasPrice",
      "dataType": "LONG"
    },
    {
      "name": "GasFees",
      "dataType": "DOUBLE"
    },
    {
      "name": "TipGasFees",
      "dataType": "DOUBLE"
    },
    {
      "name": "BurntGasFees",
      "dataType": "DOUBLE"
    },
    {
      "name": "ReimbursedGasFees",
      "dataType": "DOUBLE"
    },
    {
      "name": "GasFeesUSD",
      "dataType": "DOUBLE"
    },
    {
      "name": "TipGasFeesUSD",
      "dataType": "DOUBLE"
    },
    {
      "name": "BurntGasFeesUSD",
      "dataType": "DOUBLE"
    },
    {
      "name": "ReimbursedGasFeesUSD",
      "dataType": "DOUBLE"
    },
    {
      "name": "RakerTokenPriceUSD",
      "dataType": "DOUBLE"
    },
    {
      "name": "MakerTokenPriceUSD",
      "dataType": "DOUBLE"
    },
    {
      "name": "VolumeUSD",
      "dataType": "DOUBLE"
    }
  ],
  "dimensionFieldSpecs": [
    {
      "name": "TransactionHash",
      "dataType": "STRING"
    },
    {
      "name": "LockNumber",
      "dataType": "LONG"
    },
    {
      "name": "ChainName",
      "dataType": "STRING"
    },
    {
      "name": "TransactionFrom",
      "dataType": "STRING"
    },
    {
      "name": "TransactionTo",
      "dataType": "STRING"
    },
    {
      "name": "Affiliate",
      "dataType": "STRING"
    },
    {
      "name": "FeeRecipient",
      "dataType": "STRING"
    },
    {
      "name": "Taker",
      "dataType": "STRING"
    },
    {
      "name": "Maker",
      "dataType": "STRING"
    },
    {
      "name": "LiquiditySource",
      "dataType": "STRING"
    },
    {
      "name": "App",
      "dataType": "STRING"
    },
    {
      "name": "Router",
      "dataType": "STRING"
    },
    {
      "name": "TakerToken",
      "dataType": "STRING"
    },
    {
      "name": "TakerTokenSymbol",
      "dataType": "STRING"
    },
    {
      "name": "MakerToken",
      "dataType": "STRING"
    },
    {
      "name": "MakerTokenSymbol",
      "dataType": "STRING"
    },
    {
      "name": "IsGasless",
      "dataType": "BOOLEAN"
    },
    {
      "name": "IsMutihop",
      "dataType": "BOOLEAN"
    },
    {
      "name": "IsMultiplex",
      "dataType": "BOOLEAN"
    },
    {
      "name": "HasRFQ",
      "dataType": "BOOLEAN"
    },
    {
      "name": "HasLimitOrder",
      "dataType": "BOOLEAN"
    },
    {
      "name": "HasDirect",
      "dataType": "BOOLEAN"
    },
    {
      "name": "NativeOrderType",
      "dataType": "STRING"
    },
    {
      "name": "TransformerFeeRecipient",
      "dataType": "STRING"
    },
    {
      "name": "TransformerFeeToken",
      "dataType": "STRING"
    },
    {
      "name": "TransformerFeeTokenSymbol",
      "dataType": "STRING"
    },
    {
      "name": "TransformerFeeTokenAmount",
      "dataType": "STRING"
    },
    {
      "name": "TransformerFeeVolumeUSD",
      "dataType": "STRING"
    },
    {
      "name": "CalledFunction",
      "dataType": "STRING"
    },
    {
      "name": "MaxFeePerGas",
      "dataType": "STRING"
    },
    {
      "name": "MaxPriorityFeePerGas",
      "dataType": "STRING"
    },
    {
      "name": "BaseFeePerGas",
      "dataType": "STRING"
    },
    {
      "name": "Type",
      "dataType": "INT"
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "Timestamp",
      "dataType": "TIMESTAMP",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:DAYS"
    }
  ]
}
Ingestion job throws
Caused by: java.lang.NumberFormatException: For input string: "2022-09-08T22:22:09Z"
If I do not have a transformFunction for the Timestamp column, there’s an exception about an invalid segment name for "2022-09-08T22:22:09Z".
Not sure what the right approach is here. Thanks for any hints.
n
Can you share the whole stack trace and a couple of sample records?
w
stack trace with transformationFn
java.lang.RuntimeException: Caught exception during running - org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:152)
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.runIngestionJob(IngestionJobLauncher.java:121)
	at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.execute(LaunchDataIngestionJobCommand.java:130)
	at org.apache.pinot.tools.Command.call(Command.java:33)
	at org.apache.pinot.tools.Command.call(Command.java:29)
	at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
	at picocli.CommandLine.access$1300(CommandLine.java:145)
	at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
	at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
	at picocli.CommandLine.execute(CommandLine.java:2078)
	at org.apache.pinot.tools.admin.PinotAdministrator.execute(PinotAdministrator.java:167)
	at org.apache.pinot.tools.admin.PinotAdministrator.main(PinotAdministrator.java:198)
Caused by: java.lang.RuntimeException: Failed to generate Pinot segment for file - <s3://0x-wojciech/transactions-small.csv>
	at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.lambda$submitSegmentGenTask$1(SegmentGenerationJobRunner.java:286)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.RuntimeException: Caught exception while transforming data type for column: Timestamp
	at org.apache.pinot.segment.local.recordtransformer.DataTypeTransformer.transform(DataTypeTransformer.java:146)
	at org.apache.pinot.segment.local.recordtransformer.CompositeTransformer.transform(CompositeTransformer.java:83)
	at org.apache.pinot.segment.local.segment.creator.TransformPipeline.processPlainRow(TransformPipeline.java:97)
	at org.apache.pinot.segment.local.segment.creator.TransformPipeline.processRow(TransformPipeline.java:92)
	at org.apache.pinot.segment.local.segment.creator.RecordReaderSegmentCreationDataSource.gatherStats(RecordReaderSegmentCreationDataSource.java:67)
	at org.apache.pinot.segment.local.segment.creator.RecordReaderSegmentCreationDataSource.gatherStats(RecordReaderSegmentCreationDataSource.java:37)
	at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:181)
	at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:153)
	at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:102)
	at org.apache.pinot.plugin.ingestion.batch.common.SegmentGenerationTaskRunner.run(SegmentGenerationTaskRunner.java:118)
	at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.lambda$submitSegmentGenTask$1(SegmentGenerationJobRunner.java:264)
	... 5 more
Caused by: java.lang.NumberFormatException: For input string: "2022-09-08T22:22:09Z"
	at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.base/java.lang.Long.parseLong(Long.java:692)
	at java.base/java.lang.Long.parseLong(Long.java:817)
	at org.apache.pinot.spi.data.DateTimeFormatSpec.fromFormatToMillis(DateTimeFormatSpec.java:300)
	at org.apache.pinot.segment.local.recordtransformer.DataTypeTransformer.transform(DataTypeTransformer.java:94)
	... 15 more
stack trace without transformationFn
java.lang.RuntimeException: Caught exception during running - org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:152)
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.runIngestionJob(IngestionJobLauncher.java:121)
	at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.execute(LaunchDataIngestionJobCommand.java:130)
	at org.apache.pinot.tools.Command.call(Command.java:33)
	at org.apache.pinot.tools.Command.call(Command.java:29)
	at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
	at picocli.CommandLine.access$1300(CommandLine.java:145)
	at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
	at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
	at picocli.CommandLine.execute(CommandLine.java:2078)
	at org.apache.pinot.tools.admin.PinotAdministrator.execute(PinotAdministrator.java:167)
	at org.apache.pinot.tools.admin.PinotAdministrator.main(PinotAdministrator.java:198)
Caused by: java.lang.RuntimeException: Failed to generate Pinot segment for file - <s3://0x-wojciech/transactions-small.csv>
	at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.lambda$submitSegmentGenTask$1(SegmentGenerationJobRunner.java:286)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.IllegalArgumentException: Invalid partial or full segment name: 2022-08-20T00:00:07Z
	at org.apache.pinot.segment.spi.creator.name.SegmentNameUtils.validatePartialOrFullSegmentName(SegmentNameUtils.java:40)
	at org.apache.pinot.segment.spi.creator.name.SimpleSegmentNameGenerator.generateSegmentName(SimpleSegmentNameGenerator.java:63)
	at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.handlePostCreation(SegmentIndexCreationDriverImpl.java:279)
	at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.build(SegmentIndexCreationDriverImpl.java:269)
	at org.apache.pinot.plugin.ingestion.batch.common.SegmentGenerationTaskRunner.run(SegmentGenerationTaskRunner.java:119)
	at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.lambda$submitSegmentGenTask$1(SegmentGenerationJobRunner.java:264)
	... 5 more
n
There’s an issue with the function definition. I changed it to this, and it works:
"transformConfigs": [
  {
    "columnName": "TimestampMillis",
    "transformFunction": "fromDateTime(\"Timestamp\", 'yyyy-MM-dd''T''HH:mm:ss''Z')"
  }
]
I also had to change Timestamp to TimestampMillis in the schema and in the tableConfig's timeColumnName. For the function, the input arguments must be fields that exist in the record, and the result is collected into a different field.
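As a rough illustration of that input-field/output-field split, here is a minimal Python sketch of the conversion the transform is expected to perform: read the raw string from `Timestamp` and write epoch milliseconds into `TimestampMillis`. This is not Pinot code; `iso_utc_to_epoch_millis` is a hypothetical helper, and the record is a made-up single row using the value from the stack trace.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def iso_utc_to_epoch_millis(value: str) -> int:
    """Parse a UTC string such as 2022-09-08T22:22:09Z into epoch milliseconds."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    # Integer division of timedeltas keeps the result an exact int.
    return (parsed - EPOCH) // timedelta(milliseconds=1)


# Hypothetical record: the source column holds the string as read from the CSV.
record = {"Timestamp": "2022-09-08T22:22:09Z"}

# The derived value goes into a separate column, so the column declared as
# 1:MILLISECONDS:EPOCH only ever holds a number, never the raw string.
record["TimestampMillis"] = iso_utc_to_epoch_millis(record["Timestamp"])
```

With this split, the epoch-millis column is never asked to parse "2022-09-08T22:22:09Z" as a long, which is the step that raised the NumberFormatException above.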
As for why it didn't work in the first place without the function, tagging @Kartik Khare and @Tim Santos ^. There’s some segmentNameGenerator setting needed.
@Tim Santos could we automatically detect this and use that setting instead of failing?
t
I believe the batchConfig is missing
"segmentNameGenerator.type" : "normalizedDate"
Since the dateTimeStr was in simple date format.
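To illustrate why the raw time string is rejected as part of a segment name, and what a date-normalized name could look like instead, here is a rough Python sketch. It does not use Pinot code: the set of rejected characters and the `table_minDate_maxDate_sequence` layout are assumptions for illustration, and `is_valid_name_part` and `normalized_date_name` are hypothetical helpers.

```python
import re
from datetime import datetime

# Assumption: characters that are unsafe in file-like names (illustrative list only).
INVALID_NAME_CHARS = re.compile(r'[\\/:*?"<>|]')


def is_valid_name_part(part: str) -> bool:
    """Return True if the string contains none of the assumed invalid characters."""
    return INVALID_NAME_CHARS.search(part) is None


def normalized_date_name(table: str, min_time: str, max_time: str, sequence_id: int) -> str:
    """Build a name from the date portion (yyyy-MM-dd) of the min and max time values."""
    def day(value: str) -> str:
        return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").strftime("%Y-%m-%d")

    return f"{table}_{day(min_time)}_{day(max_time)}_{sequence_id}"


raw_value = "2022-08-20T00:00:07Z"
raw_is_valid = is_valid_name_part(raw_value)  # the colons make this False
name = normalized_date_name("transactions", raw_value, "2022-09-08T22:22:09Z", 0)
```

In this sketch the raw value fails the check because of its colons, while the normalized name contains only the table name, dates, and a sequence number.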
n
Thanks Tim, could you also paste the docs link where this is mentioned?
Thank you @Neha Pawar and @Tim Santos
s
@Neha Pawar what should we do in scenarios where the timestamps arrive with a varying number of decimal digits?
k
Can you share an example, Sid?
s
I currently get this error when the timestamp has 9 decimal digits followed by Z:
Caused by: java.lang.IllegalArgumentException: Invalid format: "2023-03-24T121920.496884592Z" is malformed at "884592Z"
I also get the error when there are 8 decimal digits:
Caused by: java.lang.IllegalArgumentException: Invalid format: "2023-03-24T055225.49329297Z" is malformed at "29297Z"
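Both messages point just past the third fractional digit, which suggests the pattern in use accepts exactly three. One possible workaround, sketched here in Python as a preprocessing step outside Pinot, is to truncate or pad the fraction to milliseconds before converting. The helper name `to_epoch_millis` is hypothetical, and the sample inputs assume the usual colon-separated `HH:mm:ss` form rather than the exact strings in the error messages.

```python
import re
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Seconds-resolution prefix, then an optional fraction of any length, then a literal Z.
ISO_WITH_FRACTION = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z$")


def to_epoch_millis(value: str) -> int:
    """Convert a UTC timestamp with 0..n fractional digits to epoch milliseconds."""
    match = ISO_WITH_FRACTION.match(value)
    if match is None:
        raise ValueError(f"Unsupported timestamp: {value!r}")
    seconds_part, fraction = match.groups()
    # Pad short fractions and truncate long ones so exactly three digits remain.
    millis = int((fraction or "").ljust(3, "0")[:3])
    parsed = datetime.strptime(seconds_part, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
    return (parsed - EPOCH) // timedelta(milliseconds=1) + millis


nine_digits = to_epoch_millis("2023-03-24T12:19:20.496884592Z")
eight_digits = to_epoch_millis("2023-03-24T12:19:20.49329297Z")
no_fraction = to_epoch_millis("2023-03-24T12:19:20Z")
```

Truncating rather than rounding keeps the result consistent with how the first three digits would be read by a millisecond pattern; digits beyond milliseconds are discarded, which loses precision if sub-millisecond ordering matters.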