# contribute-code
w
Hi, just to avoid doing unnecessary work: is anyone interested in a Delta Lake implementation? Currently it is marked as "Planned", but it seems it is not scheduled at the moment. See here. So if there is interest, I could add an implementation based on the delta-sharing protocol (this will require a delta-sharing server and its REST API). ETA: end of March / early April. Let me know if I should contribute that.
👍 1
h
Hi Stefan - compared to other things on the feature request list this has a lot of upvotes. So I would say there is definitely interest.
l
@witty-dream-29576 would love a contribution here. There is a lot of interest
w
Hi, I could work on that. It would use a separate API gateway (a delta-sharing server) in front of the delta lake in order to extract the metadata. But setting up that API server is a good idea in production environments anyway. What we could extract: schemas, datatypes, metadata descriptions already in the delta lake, file format (e.g. parquet), and partition columns. We might even be able to get file sizes and some statistics (about that I am not quite sure). What we won't get is lineage between the tables. That needs to be picked up somewhere else (or it will be added to the protocol at a later date). TL;DR: the metadata is not extracted directly from the delta lake but from an API. It gets everything except lineage; stats would have to be investigated. So the question is: should I have a go even though it does not extract the information directly? Link to the protocol: https://github.com/delta-io/delta-sharing/blob/main/PROTOCOL.md
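Rough sketch of what I mean (untested, endpoint path as described in the PROTOCOL.md linked above; the server URL and token are just placeholders):

```python
import json
import requests

# Hypothetical placeholders for a delta-sharing server and its bearer token
ENDPOINT = "https://sharing-server.example.com/delta-sharing"
TOKEN = "<bearer-token>"


def get_table_metadata(share: str, schema: str, table: str) -> dict:
    """Fetch metadata for one shared table from the delta-sharing REST API.

    Per PROTOCOL.md the response is newline-delimited JSON: one line holds the
    protocol object, another the metaData object (schemaString, partitionColumns,
    format, optional description, ...).
    """
    url = f"{ENDPOINT}/shares/{share}/schemas/{schema}/tables/{table}/metadata"
    resp = requests.get(url, headers={"Authorization": f"Bearer {TOKEN}"})
    resp.raise_for_status()

    lines = [json.loads(line) for line in resp.text.splitlines() if line.strip()]
    meta = next(obj["metaData"] for obj in lines if "metaData" in obj)

    return {
        "schema": json.loads(meta["schemaString"]),        # Spark-style struct schema
        "partition_columns": meta.get("partitionColumns", []),
        "format": meta.get("format", {}).get("provider"),  # e.g. "parquet"
        "description": meta.get("description"),
    }
```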
👍 1
l
@dazzling-judge-80093 @helpful-optician-78938 ^
w
Small p.s.: just saw that data_lake uses the BytesTypeClass to map binaries. Not sure that is right either.
l
cc: @mammoth-bear-12532
b
@witty-dream-29576 I may be missing some context, but is the question about capturing schema metadata from a dataset on Delta Lake?
w
@big-carpet-38439 Hi John, yes. The question is what kind of DataHub datatype I should assign to the delta lake binary file type. Most other implementations (e.g. data lake) seem to assign ByteType. Should I do the same or should I create a new DataHub datatype? If so, how do I do it?
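For context, the mapping I have in mind looks roughly like this (a sketch only; the type classes are the ones I see in datahub.metadata.schema_classes, the keys are the Spark/Delta primitive type names):

```python
from datahub.metadata.schema_classes import (
    BooleanTypeClass,
    BytesTypeClass,
    DateTypeClass,
    NumberTypeClass,
    StringTypeClass,
    TimeTypeClass,
)

# Sketch of a Delta/Spark primitive type -> DataHub field type mapping.
DELTA_TO_DATAHUB_TYPE = {
    "boolean": BooleanTypeClass,
    "byte": NumberTypeClass,      # Spark "byte" is a 1-byte integer, not raw bytes
    "short": NumberTypeClass,
    "integer": NumberTypeClass,
    "long": NumberTypeClass,
    "float": NumberTypeClass,
    "double": NumberTypeClass,
    "decimal": NumberTypeClass,
    "string": StringTypeClass,
    "binary": BytesTypeClass,     # the open question above: keep this or add a new type?
    "date": DateTypeClass,
    "timestamp": TimeTypeClass,
}
```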
b
Oh got it!
Yes I think using ByteType makes sense here
w
Just a quick update: I have already created a working version of it. The last bits still missing are schema versioning, domain support and parsing nested datatypes. Apart from that it seems to be progressing nicely.
teamwork 1
👏 6
excited 1
b
You are awesome Stefan! I think a lot of folks will benefit from this 🙂
w
Thanks 🙂 But in the end, I benefit from the work done by others previously. Plus the architecture underneath makes it easy to contribute a new source... bowdown Anyway, the metadata version is also implemented now. I am currently working on parsing the nested types. But I am extremely busy next week, so there will be little to no progress until next Friday.
l
Hi @witty-dream-29576! I know it’s a busy week for you, so no rush, but! We’re super excited to hear how things are progressing. Once you have a PR staged & ready to share, we’d love to jump on a Zoom call with you & some folks from the team to have you walk through your work - that way we can expedite the review process. Can we shoot for something next week?
w
Hi @little-megabyte-1074 I am sorry, next week is not possible for me because I won't have the nested data types ready by then. How about Tuesday 12 April? Late afternoon for me (CET), morning for you guys?
Hi, I have implemented nested data types. Still have to test ingestion into a DataHub instance, but the unit tests are done. So the stuff still missing is integration tests, domains & documentation. Chances are good that I can finish everything by next Tuesday. 🤞
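For the curious, the nested-type handling boils down to walking the struct schema recursively and emitting dotted field paths, roughly like this (simplified sketch, not the actual PR code):

```python
def flatten_fields(struct: dict, prefix: str = "") -> list:
    """Recursively flatten a Spark-style struct schema (as parsed from
    schemaString) into (field_path, type_name, nullable) tuples.

    Simplified: only structs are recursed into; arrays, maps, decimals etc.
    are reported with their container type name.
    """
    out = []
    for field in struct.get("fields", []):
        path = f"{prefix}{field['name']}"
        ftype = field["type"]
        nullable = field.get("nullable", True)
        if isinstance(ftype, dict) and ftype.get("type") == "struct":
            out.append((path, "struct", nullable))
            out.extend(flatten_fields(ftype, prefix=f"{path}."))
        elif isinstance(ftype, dict):
            # array / map / decimal and friends come through as objects
            out.append((path, ftype.get("type"), nullable))
        else:
            # plain primitive, e.g. "string", "long", "binary"
            out.append((path, ftype, nullable))
    return out
```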
👏 2
l
Hi @witty-dream-29576! Sorry for the delayed response, just getting caught up after PTO. Going to DM you!
w
Hey, I have created a quick Draft Pull request for the call later. More details to follow. https://github.com/datahub-project/datahub/pull/4716/commits