# general
p
hello there, getting the following exception..
Caused by: java.lang.IllegalArgumentException: Parameter 'Bucket' must not be null
I am using 0.5.0.
…GenerationJobRunner,
segmentTarPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner,
segmentUriPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner}
includeFileNamePattern: glob:**/*.parquet
inputDirURI: s3://edp-pinot-data/nem13/
jobType: SegmentCreationAndUriPush
outputDirURI: s3://edp-pinot-segments/nem13/segments
overwriteOutput: true
pinotClusterSpecs:
  - {controllerURI: 'http://localhost:9000'}
pinotFSSpecs:
  - {className: org.apache.pinot.spi.filesystem.LocalPinotFS, configs: null, scheme: file}
  - className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs: {region: ap-southeast-2}
    scheme: s3
pushJobSpec: {pushAttempts: 1, pushParallelism: 1, pushRetryIntervalMillis: 1000, segmentUriPrefix: 's3://edp-pinot-segments', segmentUriSuffix: null}
recordReaderSpec: {className: org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader, configClassName: null, configs: null, dataFormat: parquet}
segmentNameGeneratorSpec: null
tableSpec: {schemaURI: 'http://localhost:9000/tables/nem13/schema', tableConfigURI: 'http://localhost:9000/tables/nem13', tableName: nem13}
Am I missing anything? Please help!!!
k
@Kartik Khare ^^
k
@Prakash Tirumalareddy Can you share the proper YAML config? I am seeing that recordReaderSpec contains '{}' and that configClassName and configs are set to null.
p
Please see the attached file. Sorry for the late reply. It was late at night :)
d
@Prakash Tirumalareddy can you also provide a complete stack trace rather than just the cause message? It usually helps greatly with the analysis.
p
sure
Just in case, I removed all comments from the jobSpec file to make it simpler to read.
@Daniel Lavoie any findings? Please suggest.
@Kartik Khare any hints pls?
n
looks like someone had reported this same issue: https://github.com/apache/incubator-pinot/issues/5835
and it has been fixed by @Kartik Khare on master: https://github.com/apache/incubator-pinot/pull/5836
this commit is not part of 0.5.0.
Could you try with the build from source @Prakash Tirumalareddy?
p
oh ok
Sure I will try from source. Thank you very much @Neha Pawar
k
Hi
p
hello
k
Actually, there is a small hack: you can just set the prefix to s3:// and it should work
In 0.5.0
p
you mean this :
pushJobSpec:
pushAttempts: 1
pushRetryIntervalMillis: 1000
segmentUriPrefix: "s3://"
segmentUriSuffix: ""
k
My bad. This "hack" was for some other issue. Right now, you'll have to build from master. You can simply clone the repo and run:
mvn clean package -DskipTests -Pbin-dist
p
ok sure. I will build
@Kartik Khare yes, that worked, but I got another issue (sorry for the inconvenience).
2020/10/07 00:44:54.420 INFO [PinotFSFactory] [main] Initializing PinotFS for scheme s3, classname org.apache.pinot.plugin.filesystem.S3PinotFS
2020/10/07 00:44:54.891 INFO [S3PinotFS] [main] mkdir <s3://edp-pinot-segments/nem13/segments>
2020/10/07 00:44:55.598 INFO [S3PinotFS] [main] Listed 1 files from URI: <s3://edp-pinot-data/nem13/>, is recursive: true
2020/10/07 00:44:56.043 INFO [S3PinotFS] [main] Copy <s3://edp-pinot-data/nem13/currentregisterreaddate=2002-06-14/active_ind=Y/part-00000-2c2c776c-12f8-45a0-96fa-e402b13fdb57.c000.snappy.parquet> to local /var/folders/xs/bknv88ln05g5z3dgzss7whw80000gn/T/pinot-956cf81e-458b-45f3-9669-c24019eeacd3/input/part-00000-2c2c776c-12f8-45a0-96fa-e402b13fdb57.c000.snappy.parquet
2020/10/07 00:44:56.176 WARN [SegmentIndexCreationDriverImpl] [main] Using class: org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader to read segment, ignoring configured file format: AVRO
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/Path
at org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader.init(ParquetRecordReader.java:46)
at org.apache.pinot.spi.data.readers.RecordReaderFactory.getRecordReaderByClass(RecordReaderFactory.java:133)
at org.apache.pinot.core.segment.creator.impl.SegmentIndexCreationDriverImpl.getRecordReader(SegmentIndexCreationDriverImpl.java:120)
at org.apache.pinot.core.segment.creator.impl.SegmentIndexCreationDriverImpl.init(SegmentIndexCreationDriverImpl.java:96)
at org.apache.pinot.plugin.ingestion.batch.common.SegmentGenerationTaskRunner.run(SegmentGenerationTaskRunner.java:104)
at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.run(SegmentGenerationJobRunner.java:190)
at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:142)
at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.runIngestionJob(IngestionJobLauncher.java:117)
at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.execute(LaunchDataIngestionJobCommand.java:123)
at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.main(LaunchDataIngestionJobCommand.java:65)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.Path
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:602)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
... 10 more
Any thoughts/pointers?
k
That's strange. Can you share the ingestion config again along with the java version?
p
apache-pinot-incubating-0.6.0-SNAPSHOT-bin prakash$ java -version
openjdk version "13.0.2" 2020-01-14
OpenJDK Runtime Environment (build 13.0.2+8)
OpenJDK 64-Bit Server VM (build 13.0.2+8, mixed mode, sharing)
k
Can you set
segmentUriPrefix: ""
and try again
d
You need to build Pinot with Java 11
I see JDK 13.
k
@Daniel Lavoie We do have checks for quickstart on JDK 13 as well as 14 in GitHub, but the build has to be on JDK 11?
d
Indeed, sorry, I read the stack a bit too quickly:
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/Path
p
Tried with the following, same error:
segmentUriPrefix: ""
segmentUriSuffix: ""
I see there is an open issue; is this related? https://github.com/apache/incubator-pinot/issues/5387
k
Yes, it is related. We will find a long-term fix for it. For now, can you try the solution mentioned in the issue?
p
yes trying now
Attached logs..
d
Caused by: java.lang.IllegalArgumentException: INT96 not yet implemented.
k
That seems like a avro specific error
d
Yes, I think your schema is more advanced than the ones supported by the org.apache.parquet.avro.AvroSchemaConverter that Pinot uses.
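For context on what the converter is choking on: Parquet's legacy INT96 timestamp packs 8 bytes of nanoseconds-of-day plus a 4-byte Julian day number, while Pinot ingests plain epoch-millisecond longs. A minimal stdlib-only Python sketch of the decoding (the function name is mine, purely for illustration):

```python
import struct
from datetime import datetime, timezone

JULIAN_EPOCH_DAY = 2_440_588  # Julian day number of 1970-01-01

def int96_to_epoch_millis(raw: bytes) -> int:
    """Decode a 12-byte Parquet INT96 timestamp (little-endian:
    8-byte nanos-of-day, then 4-byte Julian day) to epoch millis."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_EPOCH_DAY
    return days_since_epoch * 86_400_000 + nanos_of_day // 1_000_000

# Round-trip check: encode 2002-06-14T00:00:00Z and decode it back.
dt = datetime(2002, 6, 14, tzinfo=timezone.utc)
days = (dt - datetime(1970, 1, 1, tzinfo=timezone.utc)).days
raw = struct.pack("<qi", 0, days + JULIAN_EPOCH_DAY)
print(int96_to_epoch_millis(raw))  # → 1024012800000
```

An INT64 column already holds that final epoch-millis value directly, which is why the converter has nothing extra to do for it.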
p
See the attached schema. I converted it from an Athena DDL to a JSON schema.
k
This schema is fine. The issue is with the Avro schema, i.e. the .avsc file.
p
.avsc? Is this file generated by Pinot?
Please let me know if there is anything I am doing wrong.
k
My bad. What Parquet version are you using to write the data?
p
python-snappy==0.5.4 fastparquet==0.3.3
k
p
Sorry, I didn't get it. Do you mean a change in the schema file, or in the generation of the parquet file?
{ "name": "datetime", "dataType": "INT64" },
k
No, in the Python parquet writer. See the last config on the link I mentioned.
p
OK boss, that is a big change, because it may impact other things currently running. I need to discuss it with the team, and I will try this tomorrow morning. Time to go to bed, it's 1.40am now 🙂. Please let me know if there is anything I can do without changing the source data. Thanks kindly for all the support and help.
n
Hi @Prakash Tirumalareddy, it looks like INT96 is deprecated in future Parquet versions, so changing it to INT64 will be the only solution here. https://github.com/apache/parquet-mr/pull/579/files
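On the writer side, the change amounts to emitting timestamps as plain epoch-millisecond longs instead of INT96. A small stdlib-only sketch of the conversion (the helper name is mine; it assumes the source timestamps are UTC):

```python
from datetime import datetime, timezone

def to_epoch_millis(ts: str) -> int:
    """Parse an ISO-8601 timestamp string into epoch milliseconds,
    i.e. the plain INT64 value a Pinot time column can ingest."""
    dt = datetime.fromisoformat(ts)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive times are UTC
    return int(dt.timestamp() * 1000)

print(to_epoch_millis("2002-06-14T00:00:00"))  # → 1024012800000
```

Most Python Parquet writers can also do this natively; for example, fastparquet's `write()` accepts a `times` option and pyarrow's `write_table()` has a `use_deprecated_int96_timestamps` flag that should stay off. Check the docs for the versions you actually run.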
p
@Kartik Khare @Daniel Lavoie @Neha Pawar it worked. Thanks again for your kind help. Is there any performance guide for loading data in a faster way?
inputDirURI: "s3://edp-pinot-data/nem13/"
includeFileNamePattern: "glob:**/*.parquet"
outputDirURI: "s3://edp-pinot-segments/nem13/segments"
n
What's the raw data size right now? How many files? How long did it take? Also, please share the table config/schema.
Btw, can we move to the troubleshooting channel?
p
Sure, I will send all the details about the data.