I'm trying to get Pinot running on a linux VM outside of doc Apache Pinot #troubleshooting

David Cyze

04/05/2022, 4:01 PM

I’m trying to get Pinot running on a Linux VM, outside of Docker, with the quick start “transcript” data. Then I want to query the data using the Presto connector.

Last week, the docs recommended version 0.9.3. There were two issues I needed to resolve to get this to work:
• The timestamp column in the schema needed to be renamed (I chose timestamparoo), because Presto queries interpreted timestamp as a casting function as opposed to a column
• The timeFieldSpec field in the table schema needed to change to dateTimeFieldSpec

After making these changes, I could ingest and query (mostly) fine.

The docs have since recommended changing to 0.10.0, which I have tried doing. However, now when I run

./bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile ~/pinot-tutorial/transcript/batch-job-spec.yml

I get an exception related to the timestamp column:

Exception while collecting stats for column:timestamparoo in row:{
  "fieldToValueMap" : {
    "studentID" : 200,
    "firstName" : "Lucy",
    "lastName" : "Smith",
    "score" : 3.8,
    "gender" : "Female",
    "subject" : "Maths",
    "timestamparoo" : null
  },
  "nullValueFields" : [ ]
}

or.collect(LongColumnPreIndexStatsCollector.java:50) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
        at org.apache.pinot.segment.local.segment.creator.impl.stats.SegmentPreIndexStatsCollectorImpl.collectRow(SegmentPreIndexStatsCollectorImpl.java:96) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]

It seems Pinot isn’t parsing the values for this column from the CSV. Why would that be? (More supporting files in thread)
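
The two schema edits described above (renaming the time column and moving from timeFieldSpec to a dateTimeFieldSpecs entry) can be sketched in Python. This is only an illustration: the legacy layout used here (an incomingGranularitySpec with name, dataType, timeType, and optional timeFormat and timeSize) is an assumption about what the older schema looked like, and upgrade_time_field is a hypothetical helper, not part of Pinot.

```python
import copy
import json


def upgrade_time_field(schema, new_name=None):
    """Return a copy of `schema` with a legacy timeFieldSpec rewritten as a
    dateTimeFieldSpecs entry. Hypothetical helper, not a Pinot API."""
    upgraded = copy.deepcopy(schema)
    legacy = upgraded.pop("timeFieldSpec", None)
    if legacy is None:
        return upgraded  # nothing to convert

    # Assumed legacy layout: the column details live in incomingGranularitySpec.
    spec = legacy["incomingGranularitySpec"]
    size = spec.get("timeSize", 1)
    unit = spec["timeType"]
    fmt = spec.get("timeFormat", "EPOCH")

    upgraded.setdefault("dateTimeFieldSpecs", []).append({
        "name": new_name or spec["name"],
        "dataType": spec["dataType"],
        "format": f"{size}:{unit}:{fmt}",   # e.g. 1:MILLISECONDS:EPOCH
        "granularity": f"{size}:{unit}",    # e.g. 1:MILLISECONDS
    })
    return upgraded


# Assumed shape of the older schema's time field (illustrative only).
legacy_schema = {
    "schemaName": "transcript",
    "timeFieldSpec": {
        "incomingGranularitySpec": {
            "name": "timestamp",
            "dataType": "LONG",
            "timeFormat": "EPOCH",
            "timeType": "MILLISECONDS",
        }
    },
}

upgraded_schema = upgrade_time_field(legacy_schema, new_name="timestamparoo")
print(json.dumps(upgraded_schema["dateTimeFieldSpecs"], indent=2))
```

Under those assumptions, the resulting entry matches the dateTimeFieldSpecs block in the table schema posted later in the thread.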

David Cyze

04/05/2022, 4:03 PM

job spec:

executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/home/vagrant/pinot-tutorial/transcript/rawData/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/home/vagrant/pinot-tutorial/transcript/segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    fileFormat: 'default'
tableSpec:
  tableName: 'transcript'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'

csv:

studentID,firstName,lastName,gender,subject,score,timestamparoo
200,Lucy,Smith,Female,Maths,3.8,1570863600000
200,Lucy,Smith,Female,English,3.5,1571036400000
201,Bob,King,Male,Maths,3.2,1571900400000
202,Nick,Young,Male,Physics,3.6,1572418800000

Table schema:

{
  "schemaName": "transcript",
  "dimensionFieldSpecs": [
    {
      "name": "studentID",
      "dataType": "INT"
    },
    {
      "name": "firstName",
      "dataType": "STRING"
    },
    {
      "name": "lastName",
      "dataType": "STRING"
    },
    {
      "name": "gender",
      "dataType": "STRING"
    },
    {
      "name": "subject",
      "dataType": "STRING"
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "score",
      "dataType": "FLOAT"
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "timestamparoo",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}
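
One cheap thing to rule out when a column comes through as null is a mismatch between the CSV header and the schema's column names. The sketch below is a standalone Python check, not a Pinot tool; compare_header and schema_columns are hypothetical helper names, and the inputs are the header and field names as pasted in this thread. Whether the files on disk match what was pasted is not known.

```python
import csv
import io


def schema_columns(schema):
    """Collect column names from the three field-spec lists used in the schema."""
    names = []
    for key in ("dimensionFieldSpecs", "metricFieldSpecs", "dateTimeFieldSpecs"):
        names.extend(field["name"] for field in schema.get(key, []))
    return names


def compare_header(csv_text, schema):
    """Return (schema columns missing from the CSV header,
    CSV header columns not declared in the schema)."""
    header = next(csv.reader(io.StringIO(csv_text)))
    columns = schema_columns(schema)
    missing = [c for c in columns if c not in header]
    extra = [h for h in header if h not in columns]
    return missing, extra


# Field names copied from the schema posted above (data types omitted).
schema = {
    "dimensionFieldSpecs": [
        {"name": "studentID"}, {"name": "firstName"}, {"name": "lastName"},
        {"name": "gender"}, {"name": "subject"},
    ],
    "metricFieldSpecs": [{"name": "score"}],
    "dateTimeFieldSpecs": [{"name": "timestamparoo"}],
}

csv_text = (
    "studentID,firstName,lastName,gender,subject,score,timestamparoo\n"
    "200,Lucy,Smith,Female,Maths,3.8,1570863600000\n"
)

missing, extra = compare_header(csv_text, schema)
print("missing from CSV:", missing)
print("not in schema:", extra)
```

For the files as pasted here, both lists come back empty, so a header mismatch does not explain the null by itself. A header that still said timestamp would show up as timestamparoo missing and timestamp extra.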

table config:

{
  "tableName": "transcript",
  "segmentsConfig": {
    "timeColumnName": "timestamparoo",
    "timeType": "MILLISECONDS",
    "replication": "2",
    "schemaName": "transcript"
  },
  "tableIndexConfig": {
    "invertedIndexColumns": [],
    "loadMode": "MMAP"
  },
  "tenants": {
    "broker": "DefaultTenant",
    "server": "DefaultTenant"
  },
  "tableType": "OFFLINE",
  "metadata": {}
}

Mark Needham

04/05/2022, 4:05 PM

so you get that error with 0.10.0, but not with 0.9.3?

David Cyze

04/05/2022, 4:05 PM

Correct

Ken Krugler

04/05/2022, 5:27 PM

Wasn’t there some change in 0.10 for handling null values? I see "timestamparoo" : null in the record that was rejected.

Ken Krugler

04/05/2022, 5:28 PM

Or is that just what gets displayed with a timestamp field that can’t be parsed?
