David Cyze
04/05/2022, 4:01 PM
• The timestamp column in the schema needed to be renamed (I chose timestamparoo), because Presto queries interpreted timestamp as a casting function as opposed to a column
• The timeFieldSpec field in the table schema needed to change to dateTimeFieldSpec
After making these changes, I could ingest and query (mostly) fine.
The docs have since recommended changing to 0.10.0, which I have tried doing.
However, now when I run ./bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile ~/pinot-tutorial/transcript/batch-job-spec.yml, I get an exception related to the timestamp column:
Exception while collecting stats for column:timestamparoo in row:{
"fieldToValueMap" : {
"studentID" : 200,
"firstName" : "Lucy",
"lastName" : "Smith",
"score" : 3.8,
"gender" : "Female",
"subject" : "Maths",
"timestamparoo" : null
},
"nullValueFields" : [ ]
}
or.collect(LongColumnPreIndexStatsCollector.java:50) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
at org.apache.pinot.segment.local.segment.creator.impl.stats.SegmentPreIndexStatsCollectorImpl.collectRow(SegmentPreIndexStatsCollectorImpl.java:96) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
It seems Pinot isn’t parsing the values for this column from the CSV.
Why would that be?
(More supporting files in thread)
David Cyze
04/05/2022, 4:03 PM
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/home/vagrant/pinot-tutorial/transcript/rawData/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/home/vagrant/pinot-tutorial/transcript/segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    fileFormat: 'default'
tableSpec:
  tableName: 'transcript'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
csv:
studentID,firstName,lastName,gender,subject,score,timestamparoo
200,Lucy,Smith,Female,Maths,3.8,1570863600000
200,Lucy,Smith,Female,English,3.5,1571036400000
201,Bob,King,Male,Maths,3.2,1571900400000
202,Nick,Young,Male,Physics,3.6,1572418800000
Table schema:
{
"schemaName": "transcript",
"dimensionFieldSpecs": [
{
"name": "studentID",
"dataType": "INT"
},
{
"name": "firstName",
"dataType": "STRING"
},
{
"name": "lastName",
"dataType": "STRING"
},
{
"name": "gender",
"dataType": "STRING"
},
{
"name": "subject",
"dataType": "STRING"
}
],
"metricFieldSpecs": [
{
"name": "score",
"dataType": "FLOAT"
}
],
"dateTimeFieldSpecs": [
{
"name": "timestamparoo",
"dataType": "LONG",
"format": "1:MILLISECONDS:EPOCH",
"granularity": "1:MILLISECONDS"
}
]
}
table config:
{
"tableName": "transcript",
"segmentsConfig": {
"timeColumnName": "timestamparoo",
"timeType": "MILLISECONDS",
"replication": "2",
"schemaName": "transcript"
},
"tableIndexConfig": {
"invertedIndexColumns": [],
"loadMode": "MMAP"
},
"tenants": {
"broker": "DefaultTenant",
"server": "DefaultTenant"
},
"tableType": "OFFLINE",
"metadata": {}
}
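One way to narrow this down outside Pinot is to check the CSV against the schema directly. The sketch below is only a diagnostic aid, not anything Pinot runs: the helper names (schema_field_names, check_csv_against_schema) are made up for illustration, the sample data is abridged from the files above, and it assumes the time column should hold plain integer epoch values. It reports schema fields missing from the CSV header and rows whose time column does not parse as an integer. Printing the header with repr() can also expose invisible characters in a column name.

```python
import csv
import io
import json

# Sample data abridged from the thread; in practice, read the real files instead.
CSV_TEXT = """studentID,firstName,lastName,gender,subject,score,timestamparoo
200,Lucy,Smith,Female,Maths,3.8,1570863600000
200,Lucy,Smith,Female,English,3.5,1571036400000
"""

SCHEMA_JSON = """
{
  "schemaName": "transcript",
  "dimensionFieldSpecs": [
    {"name": "studentID", "dataType": "INT"},
    {"name": "firstName", "dataType": "STRING"},
    {"name": "lastName", "dataType": "STRING"},
    {"name": "gender", "dataType": "STRING"},
    {"name": "subject", "dataType": "STRING"}
  ],
  "metricFieldSpecs": [{"name": "score", "dataType": "FLOAT"}],
  "dateTimeFieldSpecs": [{"name": "timestamparoo", "dataType": "LONG"}]
}
"""


def schema_field_names(schema):
    """Collect every column name declared in the schema (hypothetical helper)."""
    names = []
    for key in ("dimensionFieldSpecs", "metricFieldSpecs", "dateTimeFieldSpecs"):
        for spec in schema.get(key, []):
            names.append(spec["name"])
    return names


def check_csv_against_schema(csv_text, schema):
    """Return (header, missing schema fields, rows with a non-integer time value)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [name for name in schema_field_names(schema) if name not in header]
    time_columns = [spec["name"] for spec in schema.get("dateTimeFieldSpecs", [])]
    bad_rows = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        for column in time_columns:
            value = row.get(column)
            try:
                int(value)
            except (TypeError, ValueError):
                # TypeError: column absent (value is None); ValueError: not an integer.
                bad_rows.append((line_no, column, value))
    return header, missing, bad_rows


header, missing, bad_rows = check_csv_against_schema(CSV_TEXT, json.loads(SCHEMA_JSON))
print("header:", [repr(name) for name in header])
print("schema fields missing from CSV header:", missing)
print("rows with unparseable time values:", bad_rows)
```

If both lists come back empty for the real files, the header and values match the schema, and the cause is more likely in how the job reads the file than in the data itself.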
Mark Needham
David Cyze
04/05/2022, 4:05 PM
Ken Krugler
04/05/2022, 5:27 PM
"timestamparoo" : null in the record that was rejected.
Ken Krugler
04/05/2022, 5:28 PM