# contribute-code
w
Hi, just to avoid doing unnecessary work: is anyone interested in a Delta Lake implementation? Currently it is marked as "Planned", but it seems it is not scheduled at the moment. See here. So if there is interest, I could add an implementation based on the delta-sharing protocol (this will require a delta-sharing server and its REST API). ETA: end of March / early April. Let me know if I should contribute that.
👍 1
h
Hi Stefan - compared to other things on the feature request list this has a lot of upvotes. So I would say there is definitely interest.
l
@witty-dream-29576 would love a contribution here. There is a lot of interest
w
Hi, I could work on that. It would use a separate API gateway (a delta-sharing server) in front of the delta lake in order to extract the metadata. But setting up that API server is a good idea in production environments anyway. What we could extract: schemas, datatypes, metadata descriptions already in the delta lake, file format (e.g. parquet), and partition columns. We might even be able to get file sizes and some statistics (about that I am not quite sure). What we won't get is lineage between the tables. That needs to be picked up somewhere else (or it will be added to the protocol at a later date). TL;DR: the metadata is not extracted directly from the delta lake but from an API. It gets everything except lineage; stats would have to be investigated. So the question is: should I have a go even though it does not extract the information directly? Link to the protocol: https://github.com/delta-io/delta-sharing/blob/main/PROTOCOL.md
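Rough sketch of what I mean (untested, endpoint path as described in the PROTOCOL.md linked above; the server URL and token are just placeholders):

```python
import json
import requests

# Hypothetical placeholders for a delta-sharing server and its bearer token
ENDPOINT = "https://sharing-server.example.com/delta-sharing"
TOKEN = "<bearer-token>"


def get_table_metadata(share: str, schema: str, table: str) -> dict:
    """Fetch metadata for one shared table from the delta-sharing REST API.

    Per PROTOCOL.md the response is newline-delimited JSON: one line holds the
    protocol object, another the metaData object (schemaString, partitionColumns,
    format, optional description, ...).
    """
    url = f"{ENDPOINT}/shares/{share}/schemas/{schema}/tables/{table}/metadata"
    resp = requests.get(url, headers={"Authorization": f"Bearer {TOKEN}"})
    resp.raise_for_status()

    lines = [json.loads(line) for line in resp.text.splitlines() if line.strip()]
    meta = next(obj["metaData"] for obj in lines if "metaData" in obj)

    return {
        "schema": json.loads(meta["schemaString"]),        # Spark-style struct schema
        "partition_columns": meta.get("partitionColumns", []),
        "format": meta.get("format", {}).get("provider"),  # e.g. "parquet"
        "description": meta.get("description"),
    }
```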
👍 1
l
@dazzling-judge-80093 @helpful-optician-78938 ^
w
Small p.s.: just saw that data_lake uses the BytesTypeClass to map binaries. Not sure that is right either.
l
cc: @mammoth-bear-12532
b
@witty-dream-29576 I may be missing some context, but is the question about capturing schema metadata from a dataset on Delta Lake?
w
@big-carpet-38439 Hi John, yes. The question is what kind of DataHub datatype I should assign to the delta lake binary file type. Most other implementations (e.g. data lake) seem to assign ByteType. Should I do the same or should I create a new DataHub datatype? If so, how do I do it?
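For context, the mapping I have in mind looks roughly like this (a sketch only; the type classes are the ones I see in datahub.metadata.schema_classes, the keys are the Spark/Delta primitive type names):

```python
from datahub.metadata.schema_classes import (
    BooleanTypeClass,
    BytesTypeClass,
    DateTypeClass,
    NumberTypeClass,
    StringTypeClass,
    TimeTypeClass,
)

# Sketch of a Delta/Spark primitive type -> DataHub field type mapping.
DELTA_TO_DATAHUB_TYPE = {
    "boolean": BooleanTypeClass,
    "byte": NumberTypeClass,      # Spark "byte" is a 1-byte integer, not raw bytes
    "short": NumberTypeClass,
    "integer": NumberTypeClass,
    "long": NumberTypeClass,
    "float": NumberTypeClass,
    "double": NumberTypeClass,
    "decimal": NumberTypeClass,
    "string": StringTypeClass,
    "binary": BytesTypeClass,     # the open question above: keep this or add a new type?
    "date": DateTypeClass,
    "timestamp": TimeTypeClass,
}
```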
b
Oh got it!
Yes I think using ByteType makes sense here
w
Just a quick update: I have already created a working version of it. The last bits still missing are schema versioning, domain support and parsing nested datatypes. Apart from that it seems to be progressing nicely.
teamwork 1
👏 6
excited 1
b
You are awesome Stefan! I think a lot of folks will benefit from this 🙂
w
Thanks 🙂 But in the end, I benefit from the work done by others previously. Plus the architecture underneath makes it easy to contribute a new source... bowdown Anyway, the metadata version is also implemented now. I am currently working on parsing the nested types. But I am extremely busy next week, so there will be little to no progress until next Friday.
l
Hi @witty-dream-29576! I know it’s a busy week for you, so no rush, but! We’re super excited to hear how things are progressing. Once you have a PR staged & ready to share, we’d love to jump on a Zoom call with you & some folks from the team to have you walk through your work - that way we can expedite the review process. Can we shoot for something next week?
w
Hi @little-megabyte-1074 I am sorry, next week is not possible for me because I won't have the nested data types ready by then. How about Tuesday 12 April? Late afternoon for me (CET), morning for you guys?
Hi, I have implemented nested data types. Still have to test ingestion into a DataHub instance, but the unit tests are done. So the stuff still missing is integration tests, domains & documentation. Chances are good that I can finish everything by next Tuesday. 🤞
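For the curious, the nested-type handling boils down to walking the struct schema recursively and emitting dotted field paths, roughly like this (simplified sketch, not the actual PR code):

```python
def flatten_fields(struct: dict, prefix: str = "") -> list:
    """Recursively flatten a Spark-style struct schema (as parsed from
    schemaString) into (field_path, type_name, nullable) tuples.

    Simplified: only structs are recursed into; arrays, maps, decimals etc.
    are reported with their container type name.
    """
    out = []
    for field in struct.get("fields", []):
        path = f"{prefix}{field['name']}"
        ftype = field["type"]
        nullable = field.get("nullable", True)
        if isinstance(ftype, dict) and ftype.get("type") == "struct":
            out.append((path, "struct", nullable))
            out.extend(flatten_fields(ftype, prefix=f"{path}."))
        elif isinstance(ftype, dict):
            # array / map / decimal and friends come through as objects
            out.append((path, ftype.get("type"), nullable))
        else:
            # plain primitive, e.g. "string", "long", "binary"
            out.append((path, ftype, nullable))
    return out
```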
👏 2
l
Hi @witty-dream-29576! Sorry for the delayed response, just getting caught up after PTO. Going to DM you!
w
Hey, I have created a quick Draft Pull request for the call later. More details to follow. https://github.com/datahub-project/datahub/pull/4716/commits