
Kha

02/05/2021, 9:41 PM
Hi everyone, I'm currently trying to batch import some data into a Pinot offline table and running into some issues. My Pinot version is 0.7.0, running in a Docker container. I have successfully added an offline_table_config.json and a schema.json file to Pinot; however, creating a segment doesn't appear to be working: a SEGMENT-NAME.tar.gz file isn't being created. My current docker-job-spec.yml looks like this:
# docker-job-spec.yml

executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/tmp/pinot-manual-test/rawdata/100k'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/tmp/pinot-manual-test/segments/100k'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'rows_100k'
  schemaURI: 'http://pinot-controller-test:9000/tables/rows_100k/schema'
  tableConfigURI: 'http://pinot-controller-test:9000/tables/rows_100k'
pinotClusterSpecs:
  - controllerURI: 'http://pinot-controller-test:9000'
Some of the error messages I'm getting are:
Failed to generate Pinot segment for file - file:/tmp/pinot-manual-test/rawdata/100k/rows_100k.csv
Caught exception while gathering stats
java.lang.NumberFormatException: For input string: "5842432235322161941"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) ~[?:1.8.0_282]
        at java.lang.Integer.parseInt(Integer.java:583) ~[?:1.8.0_282]
Any leads on this would be appreciated. Thanks

Neha Pawar

02/05/2021, 10:08 PM
this looks like a mismatch in dataTypes between the Pinot schema and the actual data
can you share the Pinot schema and some sample rows?

Ken Krugler

02/05/2021, 10:12 PM
Isn’t 5842432235322161941 too big for an int type? I think your schema would need to use a long.

Xiang Fu

02/05/2021, 10:24 PM
yes, it's a long, not an int; glad it's still smaller than Long.MAX_VALUE
1
otherwise the only option might be double
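For reference, a minimal sketch of the schema change being suggested; the column name id_hash is a placeholder, since the thread doesn't name the offending field:

# schema.json (sketch; "id_hash" is a placeholder for the column holding values like 5842432235322161941)

"dimensionFieldSpecs": [{
        "name": "id_hash",
        "dataType": "LONG"
    }]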

Kha

02/05/2021, 10:27 PM
Yes, that was the issue. However, this brings up another issue with my DateTime. I have a date value in epoch milliseconds and Pinot doesn't seem to be able to read it. An example date in my CSV is:
1096429682806
Schema for date is:
# schema.json

"dateTimeFieldSpecs": [{
        "name": "date",
        "dataType": "LONG",
        "format" : "1:MILLISECONDS:EPOCH",
        "granularity": "1:MILLISECONDS"
    }]
Table config is:
"segmentsConfig": {
        "timeColumnName": "date",
        "timeType": "MILLISECONDS",
        "segmentPushType": "APPEND",
        "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
        "schemaName": "row1",
        "replication": "1"
    },
Error in the image attached:

Xiang Fu

02/05/2021, 10:31 PM
from the value, it's secondsSinceEpoch: 1612471718
1096429682806 <- this value is in the year 2004?
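A quick way to sanity-check epoch values like these from a shell (assuming GNU date; on macOS, date -r <seconds> is the equivalent):

# 1612471718 read as seconds since epoch lands in early Feb 2021
date -u -d @1612471718

# 1096429682806 is 13 digits, i.e. milliseconds; strip the last three digits for seconds -> Sep 2004
date -u -d @1096429682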

Kha

02/05/2021, 10:33 PM
yes, that value for 2004 is in my CSV

Xiang Fu

02/05/2021, 10:42 PM
I don't see any problem with this so far, but it seems this time column is being read as seconds instead of milliseconds. @Neha Pawar anything else to check?

Neha Pawar

02/05/2021, 10:43 PM
1096429682806 is the value for 2004, right? the error says Pinot found 1612471718, which (read as milliseconds) is in 1970
is that value expected? 1612471718

Kha

02/05/2021, 10:44 PM
from what I know, I don't have any values that match with 1612471718

Neha Pawar

02/05/2021, 10:45 PM
is it possible to share your input file with us? we can try to reproduce

Xiang Fu

02/05/2021, 10:45 PM
maybe 10 rows and your table config/schema

Kha

02/05/2021, 10:45 PM
Here is my direct CSV

Neha Pawar

02/05/2021, 10:47 PM
and your entire schema/table config too

Neha Pawar

02/05/2021, 11:07 PM
not able to reproduce with your table config and schema. I can generate the segment just fine.
only possible issue i see in your configs is this:
tableName: 'rows_100k'
  schemaURI: 'http://pinot-controller-test:9000/tables/rows_100k/schema'
  tableConfigURI: 'http://pinot-controller-test:9000/tables/rows_100k'
the schema is row1, but this says rows_100k
could it be referring to some old schema?

Kha

02/05/2021, 11:08 PM
should the rows_100k that's not tableName reference the schema?

Neha Pawar

02/05/2021, 11:09 PM
ah i see

Kha

02/05/2021, 11:09 PM
i'm not entirely sure, the documentation for the batch import example uses the same value for tableName and schemaName

Xiang Fu

02/05/2021, 11:09 PM
It's using:
schemaURI: 'http://pinot-controller-test:9000/tables/rows_100k/schema'
can you check what's the response for:
schemaURI: 'http://pinot-controller-test:9000/tables/rows_100k/schema'
tableConfigURI: 'http://pinot-controller-test:9000/tables/rows_100k'
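A sketch of how to check those responses, assuming the controller is reachable at that host from wherever curl runs:

# should return the schema JSON; check the schemaName it reports
curl http://pinot-controller-test:9000/tables/rows_100k/schema

# should return the table config JSON
curl http://pinot-controller-test:9000/tables/rows_100k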

Kha

02/05/2021, 11:13 PM
Receiving the same error message for the above.
@Neha Pawar I can confirm that changing rows_100k to row1 breaks it further

Xiang Fu

02/05/2021, 11:17 PM
@Neha Pawar can you share the schema and table config
and Kha can use it to create the table

Neha Pawar

02/05/2021, 11:18 PM
i just used what he shared, only thing diff is the batch-job-spec

Xiang Fu

02/05/2021, 11:20 PM
which pinot image are you using? is it apachepinot/pinot:latest

Kha

02/05/2021, 11:21 PM
Yes, I'm using the latest version of Pinot, 0.7.0

Xiang Fu

02/05/2021, 11:24 PM
@Neha Pawar I think we need to add timeType into table config?
{
  "tableName": "foo",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "timeColumnName": "date",
    "timeType": "MILLISECONDS",
    "replication": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "HEAP",
    "invertedIndexColumns": [
      "id",
      "hash_one"
    ]
  },
  "metadata": {
    "customConfigs": {}
  }
}
also do you have logs for batch ingestion job? @Kha?

Kha

02/05/2021, 11:27 PM
where would the logs be found in the docker instance?

Neha Pawar

02/05/2021, 11:30 PM
so strange, i’m also able to take your exact configs, including the yml, and upload
i’m on the latest master, that could be the only difference
x

Xiang Fu

02/05/2021, 11:33 PM
docker logs <docker-container-id>
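For example (use whatever container ID or name docker ps reports; nothing here is specific to Pinot):

docker ps                              # list running containers and their IDs/names
docker logs -f <docker-container-id>   # follow that container's stdout/stderr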

Kha

02/05/2021, 11:36 PM
To clarify Neha, did you take my files, run them, and are able to successfully upload it?
As of now, I'm going to try to remove the time column and replace it with another time format. I will come back to this on Monday. Thank you guys so much for your help @Neha Pawar @Xiang Fu

Xiang Fu

02/05/2021, 11:38 PM
sure, please let us know

Neha Pawar

02/05/2021, 11:38 PM
yes, took exactly your files. I'm not running on docker so i changed s/pinot-controller-test/localhost, and was able to upload
let's look at logs on Monday

Xiang Fu

02/06/2021, 12:09 AM
I've tried with the docker setup and there is no issue on my side. Here are my steps:
1. Start Pinot quickstart with docker
docker run \
    --network=pinot-demo \
    --name pinot-quickstart \
    -p 9000:9000 \
    -d apachepinot/pinot:latest QuickStart \
    -type batch
2. Create Table
docker run --rm -ti \
    --network=pinot-demo \
    -v /tmp/pinot-quick-start:/tmp/pinot-quick-start \
    --name pinot-batch-table-creation \
    apachepinot/pinot:latest AddTable \
    -schemaFile /tmp/pinot-quick-start/foo-schema.json \
    -tableConfigFile /tmp/pinot-quick-start/foo-table-offline.json \
    -controllerHost pinot-quickstart \
    -controllerPort 9000 -exec
3. Start Ingestion job
docker run --rm -ti \
    --network=pinot-demo \
    -v /tmp/pinot-quick-start:/tmp/pinot-quick-start \
    --name pinot-data-ingestion-job \
    apachepinot/pinot:latest LaunchDataIngestionJob \
    -jobSpecFile /tmp/pinot-quick-start/docker-job-spec-100k.yml
I put the table config and schema in my local directory and mounted them into docker:

Neha Pawar

02/06/2021, 12:11 AM
i suspect there’s some stray data in the input folder for you Kha
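One way to check for stray files, assuming the input dir from the job spec above:

# list everything in the input dir, including hidden files
ls -la /tmp/pinot-manual-test/rawdata/100k

# show exactly which files the glob:**/*.csv pattern would pick up
find /tmp/pinot-manual-test/rawdata/100k -name '*.csv'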

Xiang Fu

02/06/2021, 12:11 AM
and this is the updated docker-job-spec file:
➜ cat /tmp/pinot-quick-start/docker-job-spec-100k.yml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/tmp/pinot-quick-start/rawdata'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/tmp/pinot-manual-test/segments'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'foo'
  schemaURI: 'http://pinot-quickstart:9000/tables/foo/schema'
  tableConfigURI: 'http://pinot-quickstart:9000/tables/foo'
pinotClusterSpecs:
  - controllerURI: 'http://pinot-quickstart:9000'
Kha: I feel you can delete the table and corresponding schema and retry
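A sketch of that cleanup via the controller REST API; the table/schema name foo matches the quickstart example above:

# drop the offline table, then its schema, then re-run AddTable and the ingestion job
curl -X DELETE 'http://pinot-quickstart:9000/tables/foo?type=offline'
curl -X DELETE 'http://pinot-quickstart:9000/schemas/foo'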

Kha

02/08/2021, 7:56 PM
Just to update on this: I restarted my Pinot cluster on docker (tested the insert afterwards and it showed the same error as above), then changed my time format to seconds (that insert succeeded), then changed the time format back to milliseconds, and the insert was now successful. I then noticed that the query values in the Pinot UI didn't match my CSVs (it was referencing an old CSV that was deleted a while ago). I restarted docker, regenerated the CSVs, and it's now working. Thanks for your help @Neha Pawar @Xiang Fu. Please consider this issue closed.