RK

06/02/2021, 11:26 AM
Hi all, I am trying to push HDFS data into a hybrid table. I have added an offline table in Pinot and am now trying to push the HDFS file. When I execute the final hadoop jar command, it says pinot-plugins.tar.gz doesn't exist: `Error: File file:/home/rah/hybrid/staging/pinot-plugin.tar.gz doesn't exist.` I am attaching my config file. Here /user/hdfs is my HDFS location and /home/rah is the local location. P.S. For staging and outputDir, if I give an HDFS location then it gives the error `Wrong FS: hdfs://location-of-inputdir/filename.txt, expected: file:///` @Ken Krugler @Elon @Alexander Pucher @Ting Chen @Neha Pawar @Xiang Fu @Mayank Kindly suggest.

Elon

06/02/2021, 5:10 PM
Hi @RK, are you using the GCS plugin? Or are you on S3?

Xiang Fu

06/02/2021, 5:12 PM
I will take a look at the plugin jar issue for HDFS.
This job is using HDFS, not S3, I think.

RK

06/02/2021, 5:15 PM
Hi @Elon, I'm using HDFS.

Ken Krugler

06/02/2021, 5:43 PM
A few issues with your job spec: 1. You need to use `hdfs://xxx` for your staging directory. 2. You need to use `hdfs://xxx` for your `outputDirURI`.
And you need to have a `configs:` section inside of the Pinot FS specs section, which has `hadoop.conf.path`. E.g. something like:
```
pinotFSSpecs:
  - # scheme: used to identify a PinotFS.
    # E.g. local, hdfs, dbfs, etc
    scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
        hadoop.conf.path: '/root/hadoop-ops/config/master/'
```
Also it would be good to include the stack trace with the error message.
I think if you don’t have the hadoop.conf.path set, then Pinot falls back to the default file system, which is why you get the errors about “wrong FS”
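(For reference, a minimal sketch of the relevant job spec sections Ken describes. The hdfs:// host and all paths here are placeholders, not RK's actual values, and the `stagingDir` under `extraConfigs` assumes the standard Hadoop ingestion job spec layout:)
```
executionFrameworkSpec:
  name: 'hadoop'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentGenerationJobRunner'
  extraConfigs:
    # staging dir must be on the same file system as the input/output dirs
    stagingDir: 'hdfs://namenode:8020/user/hdfs/staging'
jobType: SegmentCreationAndTarPush
inputDirURI: 'hdfs://namenode:8020/user/hdfs/input'
outputDirURI: 'hdfs://namenode:8020/user/hdfs/output'
pinotFSSpecs:
  - scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
      hadoop.conf.path: '/etc/hadoop/conf/'
```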

Xiang Fu

06/02/2021, 6:13 PM
@RK

RK

06/02/2021, 6:22 PM
@Xiang Fu @Ken Krugler these are the complete log details when I am using a local path as the staging and output dir.

Xiang Fu

06/02/2021, 6:28 PM
So I guess you start the job from your local machine; this means the Hadoop job tries to add this URI into the dist cache:
/home/rah/hybrid/staging/pinot-plugins.tar.gz
How do you submit the Hadoop job?

RK

06/02/2021, 6:30 PM
Ok @Xiang Fu
```
hadoop jar \
  ${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar \
  org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  -jobSpecFile ${PINOT_DISTRIBUTION_DIR}/examples/batch/airlineStats/hadoopIngestionJobSpechybrid.yaml
```

Xiang Fu

06/02/2021, 6:34 PM
This staging dir should be on HDFS as well, I think.

RK

06/02/2021, 6:37 PM
Ok @Xiang Fu, let me try.
@Ken Krugler @Xiang Fu I have set the output and staging dirs to HDFS directories, the same as I gave for the input directory; I just created new dirs in the same location and passed them in the config. I also added one extra property, hadoop.conf.path: '/etc/hadoop/conf/', which is where all my Hadoop configuration files are available, i.e. hadoop-site.xml, core-site.xml, etc. But it's still giving the same "Wrong FS" error.
This is how my file looks now. Kindly suggest @Ken Krugler @Xiang Fu

Ken Krugler

06/02/2021, 6:57 PM
Your `hadoop.conf.path` is in the wrong section. You have it as part of the `file` specification, but it needs to be part of the `hdfs` specification.
You should be able to remove the `file` scheme section from the `pinotFSSpecs` configuration.
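(In other words, a sketch of the corrected section, using the conf path RK mentioned above:)
```
pinotFSSpecs:
  # only the hdfs scheme is needed; the file scheme section can be dropped
  - scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
      hadoop.conf.path: '/etc/hadoop/conf/'
```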

RK

06/02/2021, 7:02 PM
Error logs
Ok, let me remove the file section and retry.
Thanks a lot @Xiang Fu @Ken Krugler @Elon. It worked! 👏

Xiang Fu

06/02/2021, 7:20 PM
@Ken Krugler huge thanks! We should document this in the FAQ.
Btw, what does your final config file look like? I want to compare it with the initial one
so I can update the documentation and make it clearer.

RK

06/03/2021, 6:17 AM
This is the final file, @Xiang Fu.

Xiang Fu

06/03/2021, 6:46 AM
Got it, so the staging dir is on HDFS, plus adding `hadoop.conf.path`.

RK

06/03/2021, 6:48 AM
Yes, staging/output/input are all on HDFS.
Hi @Xiang Fu, I have one more doubt; not sure whether it's expected behaviour of a hybrid table or not. I created one Pinot offline table and pushed data from an HDFS location. When I do select * from hybrid, it shows me the HDFS file data which I pushed. Now I am adding a realtime table for the same, and when I do select * from hybrid, it's overwriting the data: I am no longer able to see the data which I loaded from the HDFS file. Kindly suggest whether this is expected behaviour or not. I was expecting both realtime and batch data in the hybrid table.

Xiang Fu

06/03/2021, 8:41 AM
It's expected behavior, please see this: https://docs.pinot.apache.org/basics/components/table#hybrid-table
Pinot uses a time boundary to split a query between the realtime and offline tables.
Typically we use (earliest, offline max time - 1 day) from the offline table + (offline max time - 1 day, now) from the realtime table.
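(Roughly, the broker splits a hybrid query at the time boundary and merges the results; a pseudo-SQL sketch, where timeCol and timeBoundary are placeholders for the table's time column and the computed boundary:)
```
-- A query against the hybrid table:
SELECT * FROM hybrid;
-- is effectively executed as two queries and merged by the broker,
-- with timeBoundary = max offline time - 1 day:
SELECT * FROM hybrid_OFFLINE  WHERE timeCol <= timeBoundary;  -- offline side
SELECT * FROM hybrid_REALTIME WHERE timeCol >  timeBoundary;  -- realtime side
```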

RK

06/03/2021, 9:26 AM
Oh okay @Xiang Fu, so is there any way to test this hybrid table? Individually I have tested them and I am able to see both realtime as well as batch data. But if I want to explain to someone how the hybrid table stores both realtime and batch data, how should I prepare my Kafka and batch data to store both kinds of data in a Pinot hybrid table? Kindly suggest.

Xiang Fu

06/03/2021, 9:29 AM
Then you can prepare multiple days of data.
Put some older data into batch, say Monday to Friday.
And Thursday to Saturday's data in Kafka.
Then your query will give you data from Monday to Saturday.
You can also query tableName_OFFLINE or tableName_REALTIME to query them separately.
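(For example, with RK's table name; a quick way to demonstrate the split:)
```
-- only the batch (offline) segments:
SELECT * FROM hybrid_OFFLINE LIMIT 10;
-- only the Kafka (realtime) segments:
SELECT * FROM hybrid_REALTIME LIMIT 10;
-- both sides, merged at the time boundary by the broker:
SELECT * FROM hybrid LIMIT 10;
```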

RK

06/03/2021, 9:50 AM
Ok @Xiang Fu, so I am using a current_ts column as the timestamp in Pinot. For realtime data the current_ts value is today's date, and when I do select * from hybrid_REALTIME I am able to see that current_ts. From HDFS I am loading old data where current_ts is a date from last year, 2020, and I am able to see that in hybrid_OFFLINE. But when I do select * from hybrid, it only shows the Kafka topic's current_ts, the same as hybrid_REALTIME, and not last year's current_ts which is available in hybrid_OFFLINE.

Xiang Fu

06/03/2021, 9:52 AM
What's the max ts from the offline table?
Do you only have 1 day of data in offline?

RK

06/03/2021, 9:54 AM
Yes, in the offline table for testing I have loaded only 1 day of data, i.e. 30 Aug 2020 data.

Xiang Fu

06/03/2021, 9:55 AM
Ok.
Pinot queries the offline table until max ts - 1 day.
In your case, it queries the offline table until 29 Aug 2020.
And queries the realtime table from 30 Aug 2020 until now.
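(Applied to RK's case, a sketch of why the offline rows disappear; the dates are shown symbolically rather than as real Pinot time values:)
```
-- offline max time = 30 Aug 2020, so timeBoundary = 29 Aug 2020
SELECT * FROM hybrid_OFFLINE  WHERE current_ts <= '29 Aug 2020';  -- empty: offline only holds 30 Aug data
SELECT * FROM hybrid_REALTIME WHERE current_ts >  '29 Aug 2020';  -- returns today's Kafka data
```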

RK

06/03/2021, 9:56 AM
Got it, so if I load data for the 29th and 30th, then I should be able to see the 29th's data from batch and today's data from realtime.

Xiang Fu

06/03/2021, 9:56 AM
Yes.
The 30 Aug 2020 data won't be seen in the hybrid query.

RK

06/03/2021, 9:57 AM
Ok, thanks a lot @Xiang Fu, let me try; will update you if it works.

Xiang Fu

06/03/2021, 9:57 AM
Usually we expect data to overlap between the offline and realtime tables.

RK

06/03/2021, 9:58 AM
Ohh ok, yes, in a real scenario we will have overlapping data, but for now, for testing purposes, I am creating data manually.
@Xiang Fu Thanks. It worked.
@Xiang Fu @Ken Krugler is there any way to load a complete month of data in one go? I.e., when I am pushing data from HDFS to a Pinot table, the source directory has a month folder, and inside it there are day-wise folders. When I give the path down to a day folder, i.e. ...month=09/day=01, I am able to see the data in the Pinot table. But if I give the path only down to month=09, wanting to load the data for all the day-wise subfolders, the log shows segments pushed successfully but the data does not show in the table. Every time I have to pass the complete path down to the day for it to show in the table.
It shows this message in the logs when given the path down to month=09, but the data is not reflected in the table.
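(One thing worth checking in this kind of setup, as an illustrative guess rather than a confirmed diagnosis: the `includeFileNamePattern` in the job spec. A single-level glob only matches files directly under `inputDirURI`, while a recursive glob also descends into the day-wise subfolders. The host, path, and Parquet format below are assumptions for illustration:)
```
inputDirURI: 'hdfs://namenode:8020/data/month=09/'
# 'glob:*.parquet' matches only files directly under month=09/;
# 'glob:**/*.parquet' also matches files under day=01/, day=02/, ...
includeFileNamePattern: 'glob:**/*.parquet'
```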

Ken Krugler

06/03/2021, 2:33 PM
Hi @RK - please start a new thread per question, it helps everyone else find useful conversations, thanks!

RK

06/03/2021, 2:43 PM
Sure @Ken Krugler
@Ken Krugler started a new thread.