RK

06/02/2021, 11:26 AM
Hi all, I am trying to push HDFS data into a hybrid table. I have added an offline table in Pinot and am now trying to push the HDFS file. When I execute the final hadoop jar command, it says pinot-plugins.tar.gz doesn't exist: `Error: File file:/home/rah/hybrid/staging/pinot-plugin.tar.gz doesn't exist.` I am attaching my config file. Here /user/hdfs is my HDFS location and /home/rah is the local location. P.S. For staging and outputDir, if I give an HDFS location then it gives the error `Wrong FS: hdfs://location-of-inputdir/filename.txt, expected: file:///` @Ken Krugler @Elon @Alexander Pucher @Ting Chen @Neha Pawar @Xiang Fu @Mayank Kindly suggest.

Elon

06/02/2021, 5:10 PM
Hi @RK, are you using the GCS plugin? Or are you on S3?

Xiang Fu

06/02/2021, 5:12 PM
I will take a look at the plugin jar issue for HDFS.
This job is using HDFS, not S3, I think.

RK

06/02/2021, 5:15 PM
Hi @Elon, I'm using HDFS.

Ken Krugler

06/02/2021, 5:43 PM
A few issues with your job spec: 1. You need to use `hdfs://xxx` for your staging directory. 2. You need to use `hdfs://xxx` for your `outputDirURI`.
And you need to have a `configs:` section inside of the Pinot FS specs section, which has `hadoop.conf.path`. E.g. something like:
```
pinotFSSpecs:
  - # scheme: used to identify a PinotFS.
    # E.g. local, hdfs, dbfs, etc
    scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
        hadoop.conf.path: '/root/hadoop-ops/config/master/'
```
Also it would be good to include the stack trace with the error message.
I think if you don’t have the hadoop.conf.path set, then Pinot falls back to the default file system, which is why you get the errors about “wrong FS”
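(For reference, a minimal sketch of the relevant job spec sections Ken describes. The hdfs:// host and all paths here are placeholders, not RK's actual values, and the `stagingDir` under `extraConfigs` assumes the standard Hadoop ingestion job spec layout:)
```
executionFrameworkSpec:
  name: 'hadoop'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.hadoop.HadoopSegmentGenerationJobRunner'
  extraConfigs:
    # staging dir must be on the same file system as the input/output dirs
    stagingDir: 'hdfs://namenode:8020/user/hdfs/staging'
jobType: SegmentCreationAndTarPush
inputDirURI: 'hdfs://namenode:8020/user/hdfs/input'
outputDirURI: 'hdfs://namenode:8020/user/hdfs/output'
pinotFSSpecs:
  - scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
      hadoop.conf.path: '/etc/hadoop/conf/'
```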

Xiang Fu

06/02/2021, 6:13 PM
@RK

RK

06/02/2021, 6:22 PM
@Xiang Fu @Ken Krugler these are the complete log details when I am using a local path as the staging and output dir.

Xiang Fu

06/02/2021, 6:28 PM
So I guess you start the job from your local machine; this means the Hadoop job tries to add this URI into the dist cache:
/home/rah/hybrid/staging/pinot-plugins.tar.gz
How do you submit the Hadoop job?

RK

06/02/2021, 6:30 PM
Ok @Xiang Fu
```
hadoop jar \
  ${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar \
  org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  -jobSpecFile ${PINOT_DISTRIBUTION_DIR}/examples/batch/airlineStats/hadoopIngestionJobSpechybrid.yaml
```

Xiang Fu

06/02/2021, 6:34 PM
This staging dir should be on HDFS as well, I think.

RK

06/02/2021, 6:37 PM
Ok @Xiang Fu, let me try.
@Ken Krugler @Xiang Fu I have set the output and staging dirs to HDFS directories, the same as I gave for the input directory; I just created new dirs in the same location and passed them in the config. I also added one extra property, hadoop.conf.path: '/etc/hadoop/conf/', which is where all my Hadoop configuration files are available, i.e. hadoop-site.xml, core-site.xml, etc. But it's still giving the same "Wrong FS" error.
This is how my file looks now. Kindly suggest @Ken Krugler @Xiang Fu

Ken Krugler

06/02/2021, 6:57 PM
Your `hadoop.conf.path` is in the wrong section. You have it as part of the `file` specification, but it needs to be part of the `hdfs` specification.
You should be able to remove the `file` scheme section from the `pinotFSSpecs` configuration.
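(In other words, a sketch of the corrected section, using the conf path RK mentioned above:)
```
pinotFSSpecs:
  # only the hdfs scheme is needed; the file scheme section can be dropped
  - scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
    configs:
      hadoop.conf.path: '/etc/hadoop/conf/'
```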

RK

06/02/2021, 7:02 PM
Error logs
Ok, let me remove the file section and retry.
Thanks a lot @Xiang Fu @Ken Krugler @Elon. It worked! 👏

Xiang Fu

06/02/2021, 7:20 PM
@Ken Krugler huge thanks! We should document this in the FAQ.
Btw, what does your final config file look like? I want to compare it with the initial one
so I can update the documentation and make it clearer.

RK

06/03/2021, 6:17 AM
This is the final file, @Xiang Fu.

Xiang Fu

06/03/2021, 6:46 AM
Got it, so the staging dir is on HDFS, plus adding `hadoop.conf.path`.

RK

06/03/2021, 6:48 AM
Yes, staging/output/input are all on HDFS.
Hi @Xiang Fu, I have one more doubt; not sure whether it's expected behaviour of a hybrid table or not. I created one Pinot offline table and pushed data from an HDFS location. When I do select * from hybrid, it shows me the HDFS file data which I pushed. Now I am adding a realtime table for the same, and when I do select * from hybrid, it's overwriting the data: I am no longer able to see the data which I loaded from the HDFS file. Kindly suggest whether this is expected behaviour or not. I was expecting both realtime and batch data in the hybrid table.

Xiang Fu

06/03/2021, 8:41 AM
It's expected behavior, please see this: https://docs.pinot.apache.org/basics/components/table#hybrid-table
Pinot uses a time boundary to split a query between the realtime and offline tables.
Typically we use (earliest, offline max time - 1 day) from the offline table + (offline max time - 1 day, now) from the realtime table.
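(Roughly, the broker splits a hybrid query at the time boundary and merges the results; a pseudo-SQL sketch, where timeCol and timeBoundary are placeholders for the table's time column and the computed boundary:)
```
-- A query against the hybrid table:
SELECT * FROM hybrid;
-- is effectively executed as two queries and merged by the broker,
-- with timeBoundary = max offline time - 1 day:
SELECT * FROM hybrid_OFFLINE  WHERE timeCol <= timeBoundary;  -- offline side
SELECT * FROM hybrid_REALTIME WHERE timeCol >  timeBoundary;  -- realtime side
```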

RK

06/03/2021, 9:26 AM
Oh okay @Xiang Fu, so is there any way to test this hybrid table? Individually I have tested them and I am able to see both realtime as well as batch data. But if I want to explain to someone how the hybrid table stores both realtime and batch data, how should I prepare my Kafka and batch data to store both kinds of data in a Pinot hybrid table? Kindly suggest.

Xiang Fu

06/03/2021, 9:29 AM
Then you can prepare multiple days of data.
Put some older data into batch, say Monday to Friday.
And Thursday to Saturday's data in Kafka.
Then your query will give you data from Monday to Saturday.
You can also query tableName_OFFLINE or tableName_REALTIME to query them separately.
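(For example, with RK's table name; a quick way to demonstrate the split:)
```
-- only the batch (offline) segments:
SELECT * FROM hybrid_OFFLINE LIMIT 10;
-- only the Kafka (realtime) segments:
SELECT * FROM hybrid_REALTIME LIMIT 10;
-- both sides, merged at the time boundary by the broker:
SELECT * FROM hybrid LIMIT 10;
```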

RK

06/03/2021, 9:50 AM
Ok @Xiang Fu, so I am using a current_ts column as the timestamp in Pinot. For realtime data the current_ts value is today's date, and when I do select * from hybrid_REALTIME I am able to see that current_ts. From HDFS I am loading old data where current_ts is a date from last year, 2020, and I am able to see that in hybrid_OFFLINE. But when I do select * from hybrid, it only shows the Kafka topic's current_ts, the same as hybrid_REALTIME, and not last year's current_ts which is available in hybrid_OFFLINE.

Xiang Fu

06/03/2021, 9:52 AM
What's the max ts from the offline table?
Do you only have 1 day of data in offline?

RK

06/03/2021, 9:54 AM
Yes, in the offline table for testing I have loaded only 1 day of data, i.e. 30 Aug 2020 data.

Xiang Fu

06/03/2021, 9:55 AM
Ok.
Pinot queries the offline table until max ts - 1 day.
In your case, it queries the offline table until 29 Aug 2020.
And queries the realtime table from 30 Aug 2020 until now.
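(Applied to RK's case, a sketch of why the offline rows disappear; the dates are shown symbolically rather than as real Pinot time values:)
```
-- offline max time = 30 Aug 2020, so timeBoundary = 29 Aug 2020
SELECT * FROM hybrid_OFFLINE  WHERE current_ts <= '29 Aug 2020';  -- empty: offline only holds 30 Aug data
SELECT * FROM hybrid_REALTIME WHERE current_ts >  '29 Aug 2020';  -- returns today's Kafka data
```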

RK

06/03/2021, 9:56 AM
Got it, so if I load data for the 29th and 30th, then I should be able to see the 29th's data from batch and today's data from realtime.

Xiang Fu

06/03/2021, 9:56 AM
Yes.
The 30 Aug 2020 data won't be seen in the hybrid query.

RK

06/03/2021, 9:57 AM
Ok, thanks a lot @Xiang Fu, let me try; will update you if it works.

Xiang Fu

06/03/2021, 9:57 AM
Usually we expect data to overlap between the offline and realtime tables.

RK

06/03/2021, 9:58 AM
Ohh ok, yes, in a real scenario we will have overlapping data, but for now, for testing purposes, I am creating data manually.
@Xiang Fu Thanks. It worked.
@Xiang Fu @Ken Krugler is there any way to load a complete month of data in one go? I.e., when I am pushing data from HDFS to a Pinot table, the source directory has a month folder, and inside it there are day-wise folders. When I give the path down to a day folder, i.e. ...month=09/day=01, I am able to see the data in the Pinot table. But if I give the path only down to month=09, wanting to load the data for all the day-wise subfolders, the log shows segments pushed successfully but the data does not show in the table. Every time I have to pass the complete path down to the day for it to show in the table.
It shows this message in the logs when given the path down to month=09, but the data is not reflected in the table.
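(One thing worth checking in this kind of setup, as an illustrative guess rather than a confirmed diagnosis: the `includeFileNamePattern` in the job spec. A single-level glob only matches files directly under `inputDirURI`, while a recursive glob also descends into the day-wise subfolders. The host, path, and Parquet format below are assumptions for illustration:)
```
inputDirURI: 'hdfs://namenode:8020/data/month=09/'
# 'glob:*.parquet' matches only files directly under month=09/;
# 'glob:**/*.parquet' also matches files under day=01/, day=02/, ...
includeFileNamePattern: 'glob:**/*.parquet'
```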

Ken Krugler

06/03/2021, 2:33 PM
Hi @RK - please start a new thread per question, it helps everyone else find useful conversations, thanks!

RK

06/03/2021, 2:43 PM
Sure @Ken Krugler
@Ken Krugler started a new thread.