# ingestion
n
hi folks! I’m really interested in the data lake source: https://datahubproject.io/docs/metadata-ingestion/source_docs/data_lake Is there any way to run that kind of profiling over a Glue table? Or, more broadly, over any kind of dataset other than a file?
c
There is a glue source similar to data-lake: https://datahubproject.io/docs/metadata-ingestion/source_docs/glue Please check if this satisfies your requirement.
n
yes! but it does not support that kind of profiling, does it?
b
No - the glue source doesn't attempt to access the underlying data cataloged by Glue. I'm not sure how difficult this would be. @chilly-holiday-80781 may be most familiar - is there a reliable way to profile assets indexed inside the Glue catalog?
n
I guess the way to go would be through Athena
btw, does the Athena source do that kind of profiling?
c
This would be fairly difficult to do from the Glue API. Our best bet would be to run the data lake profiler on the same files that you are ingesting into Glue, and hopefully the proper lineage would be rendered between them.
The Athena source can be profiled with SQL profiling: https://datahubproject.io/docs/metadata-ingestion/source_docs/sql_profiles
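For illustration, here is a rough sketch of what an Athena recipe with profiling switched on might look like, written as a Python dict rather than YAML. The field names (`aws_region`, `s3_staging_dir`, `work_group`, `profiling.enabled`) are my assumptions based on the docs linked above, and the bucket, work group and server values are placeholders, so please check them against the source docs for your DataHub version.

```python
import json


def build_athena_profiling_recipe(aws_region, s3_staging_dir, work_group, datahub_server):
    """Return a DataHub-style recipe dict for the Athena source with profiling on.

    All config keys here are assumptions taken from the Athena and
    SQL profiling docs; verify them for the DataHub version in use.
    """
    return {
        "source": {
            "type": "athena",
            "config": {
                "aws_region": aws_region,
                "s3_staging_dir": s3_staging_dir,
                "work_group": work_group,
                # SQL profiling is off unless explicitly enabled.
                "profiling": {"enabled": True},
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": datahub_server},
        },
    }


# Placeholder values, not real resources.
recipe = build_athena_profiling_recipe(
    aws_region="us-east-1",
    s3_staging_dir="s3://example-bucket/athena-results/",
    work_group="primary",
    datahub_server="http://localhost:8080",
)
print(json.dumps(recipe, indent=2))
```

If the `acryl-datahub` package with the Athena plugin is installed, a dict like this can probably be handed to `Pipeline.create(recipe)` from `datahub.ingestion.run.pipeline` and run, or written out as YAML and passed to `datahub ingest -c`.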
n
that makes sense
thank you both
btw, what is the difference between the Athena and Glue sources? I mean, I get that the Glue source uses the boto3 Glue API to get the dbs and tables, and the Athena source connects to Athena in a SQL-like mode to get those dbs and tables
but
is there any difference in the information that those two retrieve from AWS?
since every Glue db/table is accessible through Athena at the end of the day
c
Not everyone uses the Athena integration as far as I know, and Glue also supports jobs and workflows that are not captured by Athena
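To make the difference a bit more concrete, here is a minimal sketch of the kind of boto3 Glue calls involved. It is not the actual DataHub Glue source, just an illustration: tables come from the `get_databases` and `get_tables` paginators, while jobs come from `get_jobs`, which has no Athena equivalent. The `FakeGlueClient` class is a hypothetical in-memory stand-in so the snippet runs without AWS credentials; in real use you would pass `boto3.client("glue")` instead.

```python
def list_glue_tables(glue_client):
    """Yield (database, table) name pairs from the Glue Data Catalog."""
    for db_page in glue_client.get_paginator("get_databases").paginate():
        for db in db_page["DatabaseList"]:
            table_pages = glue_client.get_paginator("get_tables").paginate(
                DatabaseName=db["Name"]
            )
            for table_page in table_pages:
                for table in table_page["TableList"]:
                    yield db["Name"], table["Name"]


def list_glue_jobs(glue_client):
    """Yield Glue job names; these are not visible through Athena."""
    for page in glue_client.get_paginator("get_jobs").paginate():
        for job in page["Jobs"]:
            yield job["Name"]


class _FakePaginator:
    """Hypothetical stand-in that mimics the shape of boto3 paginator pages."""

    def __init__(self, client, operation):
        self._client = client
        self._operation = operation

    def paginate(self, **kwargs):
        if self._operation == "get_databases":
            yield {"DatabaseList": [{"Name": n} for n in self._client.tables_by_db]}
        elif self._operation == "get_tables":
            names = self._client.tables_by_db[kwargs["DatabaseName"]]
            yield {"TableList": [{"Name": n} for n in names]}
        elif self._operation == "get_jobs":
            yield {"Jobs": [{"Name": n} for n in self._client.jobs]}
        else:
            raise ValueError(f"unsupported operation: {self._operation}")


class FakeGlueClient:
    """Hypothetical in-memory client; replace with boto3.client("glue")."""

    def __init__(self, tables_by_db, jobs):
        self.tables_by_db = tables_by_db
        self.jobs = jobs

    def get_paginator(self, operation):
        return _FakePaginator(self, operation)


# Made-up catalog contents, for demonstration only.
client = FakeGlueClient({"sales": ["orders", "customers"]}, ["nightly_etl"])
tables = list(list_glue_tables(client))
jobs = list(list_glue_jobs(client))
print(tables)
print(jobs)
```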
n
ok, thank you!!