# general
r
What is the process to use HDFS as Pinot deep storage?
c
@User ^^ looks like we don't have a good doc. Mind updating it?
t
ok.
r
Thanks @User @User
Have you read the above tutorial?
The HDFS setup is similar, except the storage is now HDFS instead of S3.
r
Ok, thanks @Ting Chen. Will read this and try it; will reach out again in case of any issue.
c
@User might be useful to copy that and modify it into a working example,
because I'm sure others will have similar questions
t
Sure, will do that. It's just that the S3 tutorial looks very close to the HDFS setup too.
r
@User @User I followed the tutorials you shared and created 3 files inside apache-pinot-version-bin/bin:
1. Controller.conf
2. Server.conf
3. IngestionJobSpec.yaml
I couldn't understand some of the properties, so I need your help to modify them. Kindly guide me. I am attaching the 3 files I created, in PDF format. Kindly review and help.
P.S. I have not done the Spark job and spark-submit steps, since as part of the PoC I am not creating any Spark job. I already have a Kafka topic integrated with Pinot, and now I want to use Hadoop (HDFS) as deep storage.
@User @User @User
k
@User - I think you want something more like this for the controller:
```
pinot.controller.storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
pinot.controller.segment.fetcher.protocols=file,http,hdfs
pinot.controller.segment.fetcher.hdfs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```
Note you're setting up the various `.hdfs=xxx` configurations, NOT the `.s3=xxx` ones from the tutorial, since you want to use HDFS, right?
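To also point the controller's data dir at HDFS, a minimal sketch could look like the following. The namenode host/port, paths, and the Hadoop conf dir here are placeholder assumptions, not values from this thread:
```
# Hypothetical values - replace the namenode host/port and paths with your own.
controller.data.dir=hdfs://namenode-host:8020/pinot/deep-storage
# Directory containing your Hadoop core-site.xml / hdfs-site.xml
pinot.controller.storage.factory.hdfs.hadoop.conf.path=/path/to/hadoop/conf
```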
And something like this for the server.conf:
```
pinot.server.storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
pinot.server.segment.fetcher.protocols=file,http,hdfs
pinot.server.segment.fetcher.hdfs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```
Also, your job spec isn't going to work for URI push: the resulting segments have to be sent to a shared file system (HDFS, in your case). So the output dir has to be `hdfs://some/path/to/dir`, which is what's meant by the comment `expected to have schema configured in PinotFS`. You need to ensure HDFS access is set up properly on the server where you're running your standalone batch job, so that it's able to push segments to HDFS.
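As a concrete sketch of what such a job spec might look like for HDFS (the paths, table name, and controller address below are placeholder assumptions, not taken from this thread):
```yaml
executionFrameworkSpec:
  name: 'standalone'
jobType: SegmentCreationAndUriPush
# Placeholder paths - both input and output live on HDFS here.
inputDirURI: 'hdfs://namenode-host:8020/pinot/raw-data'
outputDirURI: 'hdfs://namenode-host:8020/pinot/segments'
pinotFSSpecs:
  - scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
      # Directory containing core-site.xml / hdfs-site.xml
      hadoop.conf.path: '/path/to/hadoop/conf'
tableSpec:
  tableName: 'myTable'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
```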
m
@User thanks a lot. Are the Pinot docs missing these steps for HDFS (I saw Ali Cloud only)? Since you have first-hand experience, it would be super helpful if you could help add the docs (I can work with you on that).
r
@User but in the case of HDFS I need to pass the namenode details, right? How should I provide those details, and in which file? And are these 3 extra files (server.conf, controller.conf, and the job spec) required, or do I need to update existing files?
k
@User I’ve got this on my to-do list, but sadly it’s gotten pushed down by things like “get my sewer connection working” 🙂
m
Uh-oh. Feel free to ping me and I can work with you on adding the docs.
k
Assuming I do want to document this end-to-end, what’s the right location? Or is there an existing page I should just update?
m
Would appreciate your input as a user on whether this is where you would look for it
r
@User I followed this link and created these 3 files (attached in earlier comments), but it's for S3. I am trying to store on HDFS.
k
@User - that location seems reasonable
m
Thanks @User. Let's add it there once you get a chance. I saw several folks asking about it, and I was surprised we didn't have any instructions on HDFS as deep storage.
t
I am working on an example for an HDFS setup based on our installation and will share it shortly.
@User The Hadoop HDFS config files should be referenced from the server and controller conf. E.g., the following is our config (I will post a more detailed tutorial later).
```
controller.data.dir=root_dir_to_your_hdfs_dir
pinot.controller.segment.fetcher.protocols=file,http,hdfs,viewfs
pinot.controller.segment.fetcher.viewfs.hadoop.conf.path=/pathToYourHDFSConfigDir
pinot.controller.segment.fetcher.viewfs.class=YourVersionOfSegmentFetcher (check out its subclasses)
pinot.controller.segment.fetcher.viewfs.hadoop.kerberos.principle=XXXXX (if you need secure access)
pinot.controller.segment.fetcher.viewfs.hadoop.kerberos.keytab=XXXXX (if you need secure access)

pinot.controller.storage.factory.class.viewfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
pinot.controller.storage.factory.viewfs.llc.hdfs.config.dir=/pathToYourHDFSConfigDir
pinot.controller.storage.factory.viewfs.hadoop.kerberos.principle=XXXXX (if you need secure access)
pinot.controller.storage.factory.viewfs.hadoop.kerberos.keytab=XXXXX (if you need secure access)
```
And the corresponding server conf:
```
pinot.server.segment.fetcher.protocols=file,http,hdfs,viewfs
pinot.server.segment.fetcher.viewfs.hadoop.conf.path=/pathToYourHDFSConfigDir
pinot.server.segment.fetcher.viewfs.class=YourVersionOfSegmentFetcher (check out its subclasses)
pinot.server.segment.fetcher.viewfs.hadoop.kerberos.principle=XXXXX (if you need secure access)
pinot.server.segment.fetcher.viewfs.hadoop.kerberos.keytab=XXXXX (if you need secure access)
```
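Regarding where the namenode details go: Pinot does not take them directly; it picks them up from the standard Hadoop client config files in the directory referenced by `hadoop.conf.path` above. A minimal sketch of such a `core-site.xml` (the namenode host/port is a placeholder, not from this thread):
```xml
<!-- /pathToYourHDFSConfigDir/core-site.xml (placeholder namenode host/port) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:8020</value>
  </property>
</configuration>
```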
m
Thanks @User, that would be super useful.
r
@User thanks a lot. Where can I find the available versions of SegmentFetcher?
a
@User I have created a doc for the whole setup here: https://github.com/SleepyThread/pinot-docs/blob/master/basics/getting-started/hdfs-as-deepstorage.md Please validate; once you are OK with the docs, I will open a pull request against the main Pinot docs.