# troubleshooting
Kha:
Hi everyone, I'm trying to batch import some data into a Pinot offline table and running into some issues. I'm on Pinot 0.7.0, running in a Docker container. I have successfully added an offline_table_config.json and a schema.json file to Pinot, however creating a segment doesn't appear to be working: a SEGMENT-NAME.tar.gz file isn't being created. My current docker-job-spec.yml looks like this:
# docker-job-spec.yml

executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/tmp/pinot-manual-test/rawdata/100k'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/tmp/pinot-manual-test/segments/100k'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'rows_100k'
  schemaURI: 'http://pinot-controller-test:9000/tables/rows_100k/schema'
  tableConfigURI: 'http://pinot-controller-test:9000/tables/rows_100k'
pinotClusterSpecs:
  - controllerURI: 'http://pinot-controller-test:9000'
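For reference, I'm launching the job with the standard LaunchDataIngestionJob entry point, roughly like this (paths trimmed):

# run the standalone batch ingestion job against the spec above
bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile docker-job-spec.yml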
Some of the error messages I'm getting are:
Failed to generate Pinot segment for file - file:/tmp/pinot-manual-test/rawdata/100k/rows_100k.csv
Caught exception while gathering stats
java.lang.NumberFormatException: For input string: "5842432235322161941"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) ~[?:1.8.0_282]
        at java.lang.Integer.parseInt(Integer.java:583) ~[?:1.8.0_282]
Any leads on this would be appreciated. Thanks
Neha Pawar:
this looks like a mismatch in dataTypes between the Pinot schema and the actual data
can you share the Pinot schema and some sample rows?
Kha:
Isn’t 5842432235322161941 too big for an int type? I think your schema would need to use a long.
Xiang Fu:
yes, it's a long, not an int; glad it's still smaller than Long.MAX_VALUE
otherwise only a double would fit
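for reference, Integer.MAX_VALUE is 2147483647 while Long.MAX_VALUE is 9223372036854775807, so 5842432235322161941 fits in a long. The schema fix is just the dataType on that column; a sketch, with the column name as a placeholder since we haven't seen your schema yet:

"dimensionFieldSpecs": [{
    "name": "some_big_id_column",
    "dataType": "LONG"
}]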
Kha:
Yes, that was the issue. However, this brings up another issue with my DateTime column. I have a date value in epoch milliseconds and Pinot seems unable to read it. An example date in my CSV is:
1096429682806
Schema for date is:
# schema.json

"dateTimeFieldSpecs": [{
        "name": "date",
        "dataType": "LONG",
        "format" : "1:MILLISECONDS:EPOCH",
        "granularity": "1:MILLISECONDS"
    }]
Table config is:
"segmentsConfig": {
        "timeColumnName": "date",
        "timeType": "MILLISECONDS",
        "segmentPushType": "APPEND",
        "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
        "schemaName": "row1",
        "replication": "1"
    },
Error in the image attached:
Xiang Fu:
from the value in the error, 1612471718, it looks like secondsSinceEpoch (screenshot attached: image.png)
1096429682806
<- this value is in the year 2004?
Kha:
yes, that value for 2004 is in my CSV
Xiang Fu:
I don't see any problem with this so far, but it seems that this time column is being read as seconds instead of milliseconds. @Neha Pawar anything else to check?
Neha Pawar:
1096429682806 is the value for 2004, right? the error says Pinot found 1612471718, which read as milliseconds falls in 1970
is that value 1612471718 expected?
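you can sanity-check epoch values quickly with GNU date (a sketch; BSD/macOS date uses -r instead of -d @):

# interpreted as seconds since epoch -> 2021-02-04
date -u -d @1612471718
# interpreted as milliseconds since epoch -> 1970-01-19
date -u -d @$((1612471718/1000))
# the CSV value interpreted as milliseconds -> 2004-09-29
date -u -d @$((1096429682806/1000))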
Kha:
from what I know, I don't have any values that match 1612471718
Neha Pawar:
is it possible to share your input file with us? we can try to reproduce
Kha:
Here is my direct CSV
Xiang Fu:
maybe 10 rows and your table config/schema
Neha Pawar:
and your entire schema/table config too
Kha:
docker-job-spec-100k.yml, rows_100k_offline_table_config.json, rows_100k_schema.json
Neha Pawar:
not able to reproduce with your table config and schema. I can generate the segment just fine.
the only possible issue I see in your configs is this
tableName: 'rows_100k'
  schemaURI: 'http://pinot-controller-test:9000/tables/rows_100k/schema'
  tableConfigURI: 'http://pinot-controller-test:9000/tables/rows_100k'
the schema is row1, but this says rows_100k
could it be referring to some old schema?
Kha:
should the rows_100k that's not the tableName reference the schema?
Neha Pawar:
ah, I see
Kha:
I'm not entirely sure; the documentation for the batch import example uses the same value for tableName and schemaName
Xiang Fu:
It's using
schemaURI: 'http://pinot-controller-test:9000/tables/rows_100k/schema'
can you check what's the response for
schemaURI: 'http://pinot-controller-test:9000/tables/rows_100k/schema'
tableConfigURI: 'http://pinot-controller-test:9000/tables/rows_100k'
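e.g. with curl (a sketch, assuming the controller is reachable at that host from wherever you run it):

# fetch the schema the job spec points at
curl http://pinot-controller-test:9000/tables/rows_100k/schema
# fetch the table config the job spec points at
curl http://pinot-controller-test:9000/tables/rows_100k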
Kha:
Receiving the same error message for the above
@Neha Pawar I can confirm that changing rows_100k to row1 breaks it further
Xiang Fu:
@Neha Pawar can you share the schema and table config
and Kha can use them to create the table
Neha Pawar:
I just used what he shared; the only thing that's different is the batch-job-spec
schema.json, table.json, batch-job-spec.yml
Xiang Fu:
which Pinot image are you using? is it apachepinot/pinot:latest?
Kha:
Yes, I'm using the latest version of Pinot, 0.7.0
Xiang Fu:
@Neha Pawar I think we need to add timeType into the table config?
{
    "tableName": "foo",
    "tableType": "OFFLINE",
    "segmentsConfig": {
        "timeColumnName": "date",
        "timeType": "MILLISECONDS",
        "replication": "1"
    },
    "tenants": {},
    "tableIndexConfig": {
        "loadMode": "HEAP",
        "invertedIndexColumns": [
            "id",
            "hash_one"
        ]
    },
    "metadata": {
        "customConfigs": {}
    }
}
also, do you have logs for the batch ingestion job, @Kha?
Kha:
where would the logs be found in the Docker instance?
Neha Pawar:
so strange, I'm also able to take your exact configs, including the yml, and upload
I'm on the latest master, that could be the only difference
Xiang Fu:
docker logs <docker-container-id>
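if you need the container id first, a quick sketch with the standard docker CLI:

# list running containers to find the Pinot container's id or name
docker ps
# then stream that container's stdout/stderr, which includes the Pinot logs
docker logs -f <docker-container-id>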
Kha:
To clarify, Neha, did you take my files, run them, and were able to successfully upload?
As of now, I'm going to try to remove the time column and replace it with another time format. I will come back to this on Monday. Thank you guys so much for your help @Neha Pawar @Xiang Fu
Xiang Fu:
sure, please let us know
Neha Pawar:
yes, took exactly your files. I'm not running on Docker, so I changed s/pinot-controller-test/localhost, and was able to upload
let's look at the logs on Monday
Xiang Fu:
I've tried with the Docker setup and there is no issue on my side. Here are my steps:
1. Start Pinot quickstart with Docker
docker run \
    --network=pinot-demo \
    --name pinot-quickstart \
    -p 9000:9000 \
    -d apachepinot/pinot:latest QuickStart \
    -type batch
2. Create Table
docker run --rm -ti \
    --network=pinot-demo \
    -v /tmp/pinot-quick-start:/tmp/pinot-quick-start \
    --name pinot-batch-table-creation \
    apachepinot/pinot:latest AddTable \
    -schemaFile /tmp/pinot-quick-start/foo-schema.json \
    -tableConfigFile /tmp/pinot-quick-start/foo-table-offline.json \
    -controllerHost pinot-quickstart \
    -controllerPort 9000 -exec
3. Start Ingestion job
docker run --rm -ti \
    --network=pinot-demo \
    -v /tmp/pinot-quick-start:/tmp/pinot-quick-start \
    --name pinot-data-ingestion-job \
    apachepinot/pinot:latest LaunchDataIngestionJob \
    -jobSpecFile /tmp/pinot-quick-start/docker-job-spec-100k.yml
I put the table config and schema in my local directory and mounted it into Docker.
Neha Pawar:
I suspect there's some stray data in the input folder for you, Kha
Xiang Fu:
and this is the updated docker-job-spec file:
➜ cat /tmp/pinot-quick-start/docker-job-spec-100k.yml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/tmp/pinot-quick-start/rawdata'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/tmp/pinot-manual-test/segments'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'foo'
  schemaURI: 'http://pinot-quickstart:9000/tables/foo/schema'
  tableConfigURI: 'http://pinot-quickstart:9000/tables/foo'
pinotClusterSpecs:
  - controllerURI: 'http://pinot-quickstart:9000'
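after the job finishes, you can sanity-check that the segment actually landed; a sketch against the controller's REST API (assuming port 9000 is mapped to localhost as in the quickstart):

# list the segments the controller knows about for table foo
curl http://localhost:9000/segments/foo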
Kha, I feel you can delete the table and corresponding schema and retry
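e.g. via the controller API (a sketch; note the registered schema name may be row1 rather than rows_100k, depending on which one actually got created):

# drop the table config
curl -X DELETE http://pinot-controller-test:9000/tables/rows_100k
# drop the schema
curl -X DELETE http://pinot-controller-test:9000/schemas/rows_100k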
Kha:
Just to update on this: I restarted my Pinot cluster on Docker (tested the insert after, and it showed the same error as above), then changed my time format to seconds (that insert succeeded), then changed the time format back to milliseconds, and the insert was now successful. I then noticed that the query values in the Pinot UI didn't match my CSVs; it was referencing an old CSV that was deleted a while ago. I restarted Docker, regenerated the CSVs, and it's now working. Thanks for your help @Neha Pawar @Xiang Fu. Please consider this issue closed.