# general
c
Hi team, I'm using HDFS as the deep store, and I'm trying to run batch ingestion with Spark on the deep-store HDFS cluster. I'm having difficulty using another HDFS cluster as the input in the batch job spec. Is such a deployment configuration possible?
k
Hi, no. Currently we support only a single config per filesystem; the work for multiple configs per filesystem is WIP.
c
@Kartik Khare By a single configuration, do you mean only one filesystem, or only one kind of filesystem?
Let me share my test case. I succeeded in ingesting into the deep-store HDFS in standalone mode by reading data from the other HDFS cluster, with the configuration below:
```yaml
executionFrameworkSpec:
  name: 'standalone'
jobType: SegmentCreationAndTarPush
inputDirURI: 'hdfs://another-hdfs-cluster/data/pinot_poc/airlineStats'
pinotFSSpecs:
  - scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
      hadoop.conf.path: '[deepstore hdfs hadoop config path]'
      hadoop.kerberos.principle: '[kerberos principal]'
      hadoop.kerberos.keytab: '[local filesystem keytab path]'
  - scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
      hadoop.conf.path: '[another cluster hdfs hadoop config path]'
      hadoop.kerberos.principle: '[kerberos principal]'
      hadoop.kerberos.keytab: '[local filesystem keytab path]'
```
I tried the same approach in spark mode, but it failed. I've tried various things beyond this configuration, and it fails every time:
```yaml
executionFrameworkSpec:
  name: 'spark'
  extraConfigs:
    stagingDir: 'hdfs://deepstore-hdfs-cluster/test/pinot-poc/staging/airlineStats'
inputDirURI: 'hdfs://another-hdfs-cluster/user/pinot-poc/airlineStats/rawdata'
outputDirURI: 'hdfs://deepstore-hdfs-cluster/test/pinot-poc/output/airlineStats/'
pinotFSSpecs:
  - scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
      hadoop.kerberos.principle: '[kerberos principal]'
  - scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
      hadoop.conf.path: 'hdfs://another-hdfs-cluster/test/pinot-poc/another-hdfs-cluster-hadoop-conf'
      hadoop.kerberos.principle: '[kerberos principal]'
```
k
Yep, this won't work currently, since we only use one of these configs for the 'hdfs' scheme. Support for multiple configs based on path + scheme is work in progress.
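To illustrate what "only one config per scheme" means in practice, here is a minimal Python sketch of a scheme-keyed filesystem registry. This is a hypothetical structure for illustration, not Pinot's actual `PinotFSFactory` code; the names and paths are made up.

```python
# Minimal sketch of scheme-keyed filesystem registration: when two
# pinotFSSpecs share the same scheme, the later registration simply
# replaces the earlier one in the scheme -> config map.
# (Hypothetical names; illustrative only, not Pinot's real code.)

fs_registry = {}

def register_fs(scheme, configs):
    # Last writer wins: a second 'hdfs' entry overwrites the first.
    fs_registry[scheme] = configs

register_fs("hdfs", {"hadoop.conf.path": "/etc/deepstore-hadoop/conf"})
register_fs("hdfs", {"hadoop.conf.path": "/etc/another-hadoop/conf"})

# Only one entry survives under the 'hdfs' key.
print(len(fs_registry))                            # 1
print(fs_registry["hdfs"]["hadoop.conf.path"])     # /etc/another-hadoop/conf
```

So with two specs both declaring `scheme: hdfs`, every `hdfs://` URI in the job spec resolves against whichever single config the framework kept.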
c
@Kartik Khare In my tests, the job succeeded in standalone mode using another cluster as the inputDirURI. Is that successful test a different case?
k
What HDFS cluster did your test Pinot use?
c
@Kartik Khare In the standalone mode test above, 'deepstore' and 'another cluster' are different HDFS clusters. The HDFS distribution is Cloudera.
k
moving this conversation to DM
n
Was this issue resolved? AFAIK it should work: you should be able to set any name in place of "hdfs". So you could have hdfs1 and hdfs2, and set the appropriate prefix in the paths.
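A sketch of what that multi-scheme spec could look like, following the pattern of the specs above. The scheme names `hdfs1`/`hdfs2` and all paths here are placeholders, not values from the original thread:

```yaml
# Sketch: one custom scheme per cluster, with URIs using the matching prefix.
# (Scheme names 'hdfs1'/'hdfs2' and all paths are placeholders.)
inputDirURI: 'hdfs1://another-hdfs-cluster/user/pinot-poc/airlineStats/rawdata'
outputDirURI: 'hdfs2://deepstore-hdfs-cluster/test/pinot-poc/output/airlineStats/'
pinotFSSpecs:
  - scheme: hdfs1
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
      hadoop.conf.path: '[another cluster hadoop config path]'
  - scheme: hdfs2
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
      hadoop.conf.path: '[deepstore hadoop config path]'
```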
c
@Neha Pawar I haven't been able to solve it yet.
@Neha Pawar @Kartik Khare This type of error occurs:
```
java.lang.RuntimeException: Caught exception during running - org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:148)
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.runIngestionJob(IngestionJobLauncher.java:117)
	at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.execute(LaunchDataIngestionJobCommand.java:121)
	at org.apache.pinot.tools.Command.call(Command.java:33)
	at org.apache.pinot.tools.Command.call(Command.java:29)
	at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
	at picocli.CommandLine.access$1300(CommandLine.java:145)
	at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
	at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
	at picocli.CommandLine.execute(CommandLine.java:2078)
	at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.main(LaunchDataIngestionJobCommand.java:153)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:665)
Caused by: java.lang.IllegalArgumentException: Wrong FS: hdfs1:/user/pinot-poc/airlineStats/rawdata, expected: hdfs://deepstore-hdfs-cluster
	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:770)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:252)
	at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1745)
	at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1742)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1757)
	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1723)
	at org.apache.pinot.plugin.filesystem.HadoopPinotFS.listFiles(HadoopPinotFS.java:145)
	at org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner.run(SparkSegmentGenerationJobRunner.java:163)
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:146)
	... 17 more
```
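One detail worth noticing in that "Wrong FS" message: the rejected path is `hdfs1:/user/...`, which has a custom scheme (`hdfs1`) and, with only a single slash, no `//authority` component at all, while the worker's `DistributedFileSystem` is bound to `hdfs://deepstore-hdfs-cluster`. Hadoop's `FileSystem.checkPath` compares scheme and authority, so the mismatch throws. This Python sketch only illustrates how the two URIs break down per standard URI rules; it is not Hadoop's actual parsing code.

```python
from urllib.parse import urlparse

# The path from the "Wrong FS" error vs. the filesystem the worker expected.
bad = urlparse("hdfs1:/user/pinot-poc/airlineStats/rawdata")
expected = urlparse("hdfs://deepstore-hdfs-cluster/test")

print(bad.scheme, repr(bad.netloc))      # hdfs1 '' -- custom scheme, no //authority
print(expected.scheme, expected.netloc)  # hdfs deepstore-hdfs-cluster
# Neither scheme nor authority matches, which is the condition under which
# Hadoop's FileSystem.checkPath raises IllegalArgumentException: Wrong FS.
```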