Is it possible to add AWS S3 to the list of dataPlatforms...
# getting-started
c
Is it possible to add AWS S3 to the list of dataPlatforms? Most of our datasets are in an AWS S3 lake.
b
Are these files that have certain semantics or does S3 qualify them as proper tables?
c
not really, S3 is just the storage layer. on the other hand, they are kind of cataloged in AWS Glue as datasets/tables, with schema and other metadata. so in this sense, does AWS Glue make more sense as a dataPlatform? 🤔
b
yep i think so!
someone was working on that IIRC
👍 1
have we chatted about AWS glue integration?
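For context on what treating Glue as a dataPlatform could look like, here is a minimal sketch of building a DataHub-style dataset URN for a Glue table. The helper names and the `database.table` naming convention are assumptions made for illustration, not something confirmed in this thread.

```python
def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN of the form urn:li:dataset:(<platform urn>,<name>,<env>)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def glue_table_urn(database: str, table: str, env: str = "PROD") -> str:
    """Hypothetical helper: name a Glue table as '<database>.<table>' (assumed convention)."""
    return make_dataset_urn("glue", f"{database}.{table}", env)


if __name__ == "__main__":
    # 'sales' and 'orders' are made-up example names.
    print(glue_table_urn("sales", "orders"))
```

With this approach S3 stays the storage layer, while the Glue catalog entry is what identifies the dataset.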
c
yeah, i have done some custom glue integration with datahub using datahub's gms api, in C# tho
c
I’m making a connector for Glue in Python, maybe we can sync if somebody is doing the same?
Would be interested to know which aspects you’re adding 🙂 Haven’t looked into this in detail yet. Using boto3 for doing the requests. (And the type mapping.)
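To make the boto3 and type-mapping idea concrete, here is a rough sketch (not the actual connector from this thread). It pages through a Glue database using the `get_tables` paginator and maps each column's Glue type to a coarse category. The category names, the `GLUE_TYPE_CATEGORIES` table, and the function names are all hypothetical.

```python
import re
from typing import Any, Dict, Iterator

# Hypothetical mapping from Glue (Hive-style) base types to coarse categories.
GLUE_TYPE_CATEGORIES = {
    "string": "string", "char": "string", "varchar": "string",
    "tinyint": "number", "smallint": "number", "int": "number",
    "integer": "number", "bigint": "number", "float": "number",
    "double": "number", "decimal": "number",
    "boolean": "boolean", "date": "date", "timestamp": "time",
    "binary": "bytes", "array": "array", "map": "map", "struct": "record",
}


def categorize_glue_type(glue_type: str) -> str:
    """Strip parameters such as '(10,2)' or '<string>' and look up the base type."""
    base = re.split(r"[<(]", glue_type.strip().lower(), maxsplit=1)[0].strip()
    return GLUE_TYPE_CATEGORIES.get(base, "unknown")


def iter_glue_tables(glue_client: Any, database: str) -> Iterator[Dict[str, Any]]:
    """Yield one plain dict per table, with its columns and mapped categories."""
    paginator = glue_client.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database):
        for table in page["TableList"]:
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            yield {
                "name": f"{database}.{table['Name']}",
                "fields": [
                    {
                        "name": col["Name"],
                        "native_type": col["Type"],
                        "category": categorize_glue_type(col["Type"]),
                    }
                    for col in columns
                ],
            }
```

In real use the client would come from `boto3.client("glue")`. Passing the client in as an argument keeps the function easy to unit test with a stub, without depending on a specific AWS environment.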
c
ah, good to know. yeah, we can sync up. I have added several aspects, such as DatasetProperties, UpstreamLineage, Ownership, SchemaMetadata.
👍 1
c
What timezone are you in? Available for a call or just a regular chat tomorrow?
c
us-east. maybe a call this afternoon? chat works here too
c
Mm, I’m GMT+1, would sometime before 12 work for you? Then it will be afternoon here and I can make myself available. I could do 1 or 2 PM EST as well, but that will be more difficult (today, for example, wouldn’t work).
Or you can just ping me whenever you’re available and we can try to sync like that 😄 might be easier.
c
ok, maybe chat here tomorrow morning then
c
Just FYI. I’ve managed to push successfully to DataHub. I’m currently not using the Kafka thing which you mentioned, so I think it’s not needed. About the types, we are quite lucky that we are strict in what we allow to land on S3 and thus also in what we log in the Glue Catalog. So I don’t really have many problems defining the type mappings.
👍 1
@gray-shoe-75895, I can make this publicly available but need a bit of time to clean it up, make it more generic and assess which properties are worthwhile to push. But for specific feedback, should I make a ticket + branch or?
g
That’s awesome! For feedback, the easiest thing would be to make a “draft” PR on GitHub
c
So, an update: I’ve got everything working and cleaned it up a bit. I’m now looking to move it to the metadata-ingestion lib. However, I have a few questions.
• I have quite a few files, as I split them into different modules. For example, a glue lib with the helpers for interacting with Glue, a Glue_model module for my data holders, a glue source for the actual source, etc. Do you want me to put these together or keep them split?
• I have quite a few integration tests (ITs), but they’re dependent on my env (for example, to test the effects), so I assume I can’t move them. Parts are covered by unit tests, but coverage is not 100%.
• I have not added everything yet. For example, I have not added the owner, nor have I exploded every property, like table properties or Serde information. We probably want to do this, but it’s yet to be developed. Do you want me to wait until it’s complete, or should I make a PR now and add the rest later? (The advantage is quicker feedback, of course.)
g
That’s awesome!
• It’s really up to you - if you keep it as multiple files, maybe just stick them all in a subdirectory
• Don’t worry about getting to 100% coverage - the overall ingestion framework is around 75% coverage. The unit tests are probably good enough
• Given that other people seem to want the glue integration as well, it makes sense to prioritize getting it out quickly and then iterating and adding functionality afterwards
also cc @able-jelly-81126 - let's try to avoid duplicating effort
c
I’ll wait for Amy to make a PR and we can have a look from there to see if anything is missing / add it if needed. 🙂
a
I’ve opened it as a draft here, this was done as part of a hackathon though so we need to clean it up 😅 would be good to see your code, too
c
Sure, posted it together with the unit tests in here: https://gist.github.com/adriaanslechten/829efd1a3bc1842fe283f64aa2a06a1e I’ll try to do a quick review this afternoon to see if there is anything missing or not.
I’ve added some comments, nice work 🙂