# general
c
Hi team, I'm using HDFS as the deep store, and I'm trying to run batch ingestion with Spark on the deep-store HDFS cluster. I'm having difficulty using another HDFS cluster as the input in the batch job spec. Is such a deployment configuration possible?
k
Hi, no. Currently we support only a single config per filesystem; the work for multiple configs per filesystem is WIP.
c
@Kartik Khare By a single configuration, do you mean only one filesystem, or only one kind of filesystem?
Let me share my test case. I succeeded in ingesting into the deep-store HDFS in standalone mode by reading data from the other HDFS cluster, with the configuration below:
```yaml
executionFrameworkSpec:
  name: 'standalone'
jobType: SegmentCreationAndTarPush
inputDirURI: 'hdfs://another-hdfs-cluster/data/pinot_poc/airlineStats'
pinotFSSpecs:
  - scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
      hadoop.conf.path: '[deepstore hdfs hadoop config path]'
      hadoop.kerberos.principle: '[kerberos principal]'
      hadoop.kerberos.keytab: '[local filesystem keytab path]'
  - scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
      hadoop.conf.path: '[another cluster hdfs hadoop config path]'
      hadoop.kerberos.principle: '[kerberos principal]'
      hadoop.kerberos.keytab: '[local filesystem keytab path]'
```
I tried the same approach in spark mode, but it failed. I've tried various things beyond this configuration, and it fails every time:
```yaml
executionFrameworkSpec:
  name: 'spark'
  extraConfigs:
    stagingDir: 'hdfs://deepstore-hdfs-cluster/test/pinot-poc/staging/airlineStats'
inputDirURI: 'hdfs://another-hdfs-cluster/user/pinot-poc/airlineStats/rawdata'
outputDirURI: 'hdfs://deepstore-hdfs-cluster/test/pinot-poc/output/airlineStats/'
pinotFSSpecs:
  - scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
      hadoop.kerberos.principle: '[kerberos principal]'
  - scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
      hadoop.conf.path: 'hdfs://another-hdfs-cluster/test/pinot-poc/another-hdfs-cluster-hadoop-conf'
      hadoop.kerberos.principle: '[kerberos principal]'
```
k
Yep, this won't work currently, since we only use one of these configs for the 'hdfs' scheme. Support for multiple configs based on path + scheme is work in progress.
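To illustrate what "only one config per scheme" means in practice, here is a minimal Python sketch of a scheme-keyed filesystem registry. This is a hypothetical structure for illustration, not Pinot's actual `PinotFSFactory` code; the names and paths are made up.

```python
# Minimal sketch of scheme-keyed filesystem registration: when two
# pinotFSSpecs share the same scheme, the later registration simply
# replaces the earlier one in the scheme -> config map.
# (Hypothetical names; illustrative only, not Pinot's real code.)

fs_registry = {}

def register_fs(scheme, configs):
    # Last writer wins: a second 'hdfs' entry overwrites the first.
    fs_registry[scheme] = configs

register_fs("hdfs", {"hadoop.conf.path": "/etc/deepstore-hadoop/conf"})
register_fs("hdfs", {"hadoop.conf.path": "/etc/another-hadoop/conf"})

# Only one entry survives under the 'hdfs' key.
print(len(fs_registry))                            # 1
print(fs_registry["hdfs"]["hadoop.conf.path"])     # /etc/another-hadoop/conf
```

So with two specs both declaring `scheme: hdfs`, every `hdfs://` URI in the job spec resolves against whichever single config the framework kept.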
c
@Kartik Khare In my tests, the job succeeded in standalone mode using another cluster as the inputDirURI. Is that successful test a different case?
k
What HDFS cluster did your test Pinot use?
c
@Kartik Khare In the standalone mode test above, 'deepstore' and 'another cluster' are different HDFS clusters. The HDFS distribution is Cloudera.
k
moving this conversation to DM
n
Was this issue resolved? AFAIK it should work: you should be able to set any name in place of "hdfs". So you could have hdfs1 and hdfs2, and set the appropriate prefix in the paths.
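A sketch of what that multi-scheme spec could look like, following the pattern of the specs above. The scheme names `hdfs1`/`hdfs2` and all paths here are placeholders, not values from the original thread:

```yaml
# Sketch: one custom scheme per cluster, with URIs using the matching prefix.
# (Scheme names 'hdfs1'/'hdfs2' and all paths are placeholders.)
inputDirURI: 'hdfs1://another-hdfs-cluster/user/pinot-poc/airlineStats/rawdata'
outputDirURI: 'hdfs2://deepstore-hdfs-cluster/test/pinot-poc/output/airlineStats/'
pinotFSSpecs:
  - scheme: hdfs1
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
      hadoop.conf.path: '[another cluster hadoop config path]'
  - scheme: hdfs2
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
      hadoop.conf.path: '[deepstore hadoop config path]'
```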
c
@Neha Pawar I haven't been able to solve it yet.
@Neha Pawar @Kartik Khare This type of error occurs:
```
java.lang.RuntimeException: Caught exception during running - org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:148)
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.runIngestionJob(IngestionJobLauncher.java:117)
	at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.execute(LaunchDataIngestionJobCommand.java:121)
	at org.apache.pinot.tools.Command.call(Command.java:33)
	at org.apache.pinot.tools.Command.call(Command.java:29)
	at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
	at picocli.CommandLine.access$1300(CommandLine.java:145)
	at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
	at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
	at picocli.CommandLine.execute(CommandLine.java:2078)
	at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.main(LaunchDataIngestionJobCommand.java:153)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:665)
Caused by: java.lang.IllegalArgumentException: Wrong FS: hdfs1:/user/pinot-poc/airlineStats/rawdata, expected: hdfs://deepstore-hdfs-cluster
	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:770)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:252)
	at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1745)
	at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1742)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1757)
	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1723)
	at org.apache.pinot.plugin.filesystem.HadoopPinotFS.listFiles(HadoopPinotFS.java:145)
	at org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner.run(SparkSegmentGenerationJobRunner.java:163)
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:146)
	... 17 more
```
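One detail worth noticing in that "Wrong FS" message: the rejected path is `hdfs1:/user/...`, which has a custom scheme (`hdfs1`) and, with only a single slash, no `//authority` component at all, while the worker's `DistributedFileSystem` is bound to `hdfs://deepstore-hdfs-cluster`. Hadoop's `FileSystem.checkPath` compares scheme and authority, so the mismatch throws. This Python sketch only illustrates how the two URIs break down per standard URI rules; it is not Hadoop's actual parsing code.

```python
from urllib.parse import urlparse

# The path from the "Wrong FS" error vs. the filesystem the worker expected.
bad = urlparse("hdfs1:/user/pinot-poc/airlineStats/rawdata")
expected = urlparse("hdfs://deepstore-hdfs-cluster/test")

print(bad.scheme, repr(bad.netloc))      # hdfs1 '' -- custom scheme, no //authority
print(expected.scheme, expected.netloc)  # hdfs deepstore-hdfs-cluster
# Neither scheme nor authority matches, which is the condition under which
# Hadoop's FileSystem.checkPath raises IllegalArgumentException: Wrong FS.
```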