Hello everyone, I am trying to ingest some data lo...
# troubleshooting
a
Hello everyone, I am trying to ingest some data locally into Pinot, but my date field keeps getting set to null. All other fields are ingested properly. Relevant schema section:
"dateTimeFieldSpecs": [
    {
        "name": "orderingDate",
        "dataType": "STRING",
        "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd",
        "granularity": "1:DAYS"
      }
  ]
Table specs
{
  "tableName": "sales_by_order_table",
  "segmentsConfig": {
    "timeColumnName": "orderingDate",
    "timeType": "DAYS",
    "replication": "1",
    "schemaName": "sales_by_order"
  },
  "tableIndexConfig": {
    "invertedIndexColumns": [],
    "loadMode": "MMAP"
  },
  "tenants": {
    "broker": "DefaultTenant",
    "server": "DefaultTenant"
  },
  "tableType": "OFFLINE",
  "metadata": {}
}
So I identified the issue: apparently, when I run the ingestion job, 3 segments are created, 2 of which have a null date and one of which has the proper date. After deleting the two extra segments and rebalancing, the correct segment is used, but the question remains: why is this happening?
k
How are you creating the segments? And what version of Pinot are you using?
a
I'm using the Docker image for version 0.9.3. Ingestion was batch: I created a job-spec file, put my CSV data in a customer folder in bin/data2, and then simply ran the ingestion job. 3 segments are created whenever I run batch ingestion, no idea why.
k
Do you have one CSV file, or 3?
a
One
k
What does your job spec file look like?
a
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 'bin/data2/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/data/segments/order_sales/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'sales_by_order_table'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
k
In your `inputDirURI` field, provide an absolute path to the directory containing your input CSV file. Then in your `includeFileNamePattern` field, use `glob:*.csv`. Delete/recreate your `sales_by_order_table`, and re-run the job. This will help determine if you have additional CSV files somewhere on the input path that are being processed, and thus creating extra segments.
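Before re-running the job, it can help to list what a CSV pattern would pick up under the input directory. The following is a minimal sketch in Python, not Pinot's own file matching: `find_input_files` is a hypothetical helper, and Python's glob rules only approximate the Java-style `glob:` patterns used in the job spec, so treat the result as a hint rather than an exact preview.

```python
from pathlib import Path


def find_input_files(input_dir: str, pattern: str = "*.csv", recursive: bool = False) -> list[Path]:
    """List files under input_dir whose names match pattern.

    recursive=False roughly mirrors a top-level-only pattern such as *.csv;
    recursive=True roughly mirrors a pattern that descends into subfolders.
    This is an approximation of the job spec's glob, not the same matcher.
    """
    base = Path(input_dir).resolve()
    matches = base.rglob(pattern) if recursive else base.glob(pattern)
    # Keep regular files only and sort for stable, readable output.
    return sorted(p for p in matches if p.is_file())
```

Calling `find_input_files("/absolute/path/to/input", recursive=True)` (the path is a placeholder) and seeing more than one entry would point to stray CSV files being turned into extra segments.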
a
Tried this, but the ingestion is still creating three segments. The data isn't even divided among the segments: all segments have 349 records, and the CSV has the full 349 records. The segment with correct dates is `sales_by_order_table_OFFLINE_2020-08-22_2022-06-28_0`
k
The `sales_by_order_schema_OFFLINE_0` and `sales_by_order_table_OFFLINE_0` segment names make no sense to me at all.
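The names in this thread differ in one visible way: the segment with correct dates carries a start and end date before the trailing sequence number, and the other two do not. A small sketch for sorting segment names by that trait follows. The pattern is inferred only from the names quoted here (`..._<yyyy-MM-dd>_<yyyy-MM-dd>_<n>`), not from Pinot's naming rules, and `time_range` is a hypothetical helper.

```python
import re

# Matches a trailing _<start>_<end>_<sequence> with yyyy-MM-dd dates,
# as seen in the one correctly dated segment name in this thread.
TIME_RANGE = re.compile(r"_(\d{4}-\d{2}-\d{2})_(\d{4}-\d{2}-\d{2})_(\d+)$")


def time_range(segment_name: str):
    """Return (start, end) if the name ends with a date range, else None."""
    match = TIME_RANGE.search(segment_name)
    return (match.group(1), match.group(2)) if match else None


names = [
    "sales_by_order_table_OFFLINE_2020-08-22_2022-06-28_0",
    "sales_by_order_table_OFFLINE_0",
    "sales_by_order_schema_OFFLINE_0",
]
# Segment names that carry no date range under this assumed pattern.
undated = [n for n in names if time_range(n) is None]
```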
a
`sales_by_order_schema_OFFLINE_0` is from a now-deleted table; I accidentally added "schema" to the name. Will the table config help?
{
  "tableName": "sales_by_order_table",
  "segmentsConfig": {
    "timeColumnName": "ordering_date",
    "timeType": "DAYS",
    "replication": "1",
    "schemaName": "sales_by_order"
  },
  "tableIndexConfig": {
    "invertedIndexColumns": [],
    "loadMode": "MMAP"
  },
  "tenants": {
    "broker": "DefaultTenant",
    "server": "DefaultTenant"
  },
  "tableType": "OFFLINE",
  "metadata": {}
}
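The schema section posted at the start of the thread names the field `orderingDate`, while this table config sets `timeColumnName` to `ordering_date`. Whether that mismatch explains the null dates is not confirmed anywhere in the thread, but a consistency check is cheap. A minimal sketch, assuming both configs are loaded as plain dicts; `check_time_column` is a hypothetical helper and not part of Pinot:

```python
def check_time_column(schema: dict, table_config: dict) -> list[str]:
    """Report problems with the table's timeColumnName relative to the schema."""
    problems = []
    time_col = table_config.get("segmentsConfig", {}).get("timeColumnName")
    # Names of all date-time fields declared in the schema.
    declared = [spec.get("name") for spec in schema.get("dateTimeFieldSpecs", [])]
    if time_col is None:
        problems.append("table config has no timeColumnName")
    elif time_col not in declared:
        problems.append(
            f"timeColumnName {time_col!r} is not among schema dateTimeFieldSpecs {declared}"
        )
    return problems


# Values copied from the configs pasted in this thread.
schema = {"dateTimeFieldSpecs": [{"name": "orderingDate", "dataType": "STRING"}]}
table_config = {"segmentsConfig": {"timeColumnName": "ordering_date"}}
issues = check_time_column(schema, table_config)
```

With the values above, `issues` holds one entry flagging `ordering_date`; with matching names it would be empty.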
k
If you change the schema, you want to delete the table (to get rid of any residual segments) before rebuilding the table.
I had previously said “Delete/recreate your `sales_by_order_table`, and re-run the job…” Did you not delete the table?
a
I did
k
You said “`sales_by_order_schema_OFFLINE_0` is from a now deleted table”. If the table was deleted, the segments would normally also be removed. Can you make sure you have a clean setup (no tables, schemas, or segments) and then recreate the table?
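One way to confirm a clean setup is to ask the controller what it still knows about. The sketch below assumes the controller at the job spec's `controllerURI` serves `GET /tables`, `GET /schemas`, and `GET /segments/{tableName}`, with the response shapes noted in the comments. Those paths and shapes are assumptions to verify against your controller's API documentation, and `leftover_state` is a hypothetical helper.

```python
import json
from urllib.request import urlopen


def http_get_json(url: str):
    """Fetch a URL and decode the body as JSON."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def leftover_state(controller: str, fetch=http_get_json) -> dict:
    """Collect the tables, schemas, and segments the controller still lists."""
    # Assumed shape: {"tables": ["name", ...]}
    tables = fetch(f"{controller}/tables").get("tables", [])
    # Assumed shape: ["schemaName", ...]
    schemas = fetch(f"{controller}/schemas")
    segments = {}
    for table in tables:
        names = []
        # Assumed shape: [{"OFFLINE": ["segmentName", ...]}, ...]
        for entry in fetch(f"{controller}/segments/{table}"):
            for segment_names in entry.values():
                names.extend(segment_names)
        segments[table] = names
    return {"tables": tables, "schemas": schemas, "segments": segments}
```

Running `leftover_state("http://localhost:9000")` against a freshly cleaned cluster should, under these assumptions, return empty lists and an empty segment map before the table is recreated.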
a
Okay, I'll do a cleanup