# ingestion
w
I’m noticing an odd inconsistency between the redshift metadata ingestion code and the hive metadata ingestion code. For some reason the hive metadata ingestion doesn’t include the catalog name (i.e. the database) in the dataset name, but the redshift ingestion does include the database name. Any ideas as to why, or how to fix that? If the hive ingestion doesn’t include it, that could lead to collisions between catalogs that have the same schema and table names.
It might be due to the dialect that’s being used, since the name comes from SQLAlchemy engine reflection: https://github.com/linkedin/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/sql_common.py#L211
g
I think we just need to add an identifier hook that adds the database info for hive. We’ve already done this for SQL Server and a couple of others (https://github.com/linkedin/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/mssql.py#L40-L44), so it should be pretty easy to add for hive
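To illustrate the pattern (this is a standalone sketch, not the actual DataHub classes — `SQLAlchemySourceBase` and `HiveSource` here are stand-ins for the real source classes, mirroring what the linked mssql hook does):

```python
from typing import Optional


class SQLAlchemySourceBase:
    """Stand-in for the generic SQL source, which builds schema.table names."""

    def get_identifier(self, schema: str, table: str) -> str:
        return f"{schema}.{table}"


class HiveSource(SQLAlchemySourceBase):
    """Hive-specific source that also prepends the database (catalog) name,
    so identically-named schema.table pairs in different catalogs don't collide."""

    def __init__(self, database: Optional[str] = None):
        self.database = database

    def get_identifier(self, schema: str, table: str) -> str:
        regular = super().get_identifier(schema, table)
        if self.database:
            return f"{self.database}.{regular}"
        return regular
```

With a configured database, `HiveSource("mydb").get_identifier("sales", "orders")` yields `mydb.sales.orders` instead of the bare `sales.orders`.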
w
ah ok
I was looking into trying to use a transformer to do the work but the API for doing so combined with the Pegasus schemas makes the code kind of impenetrable, or at least not readily accessible
perhaps this is easier to deal with once I upgrade to the No Code modeling?
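Conceptually, the "simple mutation" being described is just a map over the event stream. A hypothetical sketch (the `MetadataEvent` class and `add_catalog_prefix` function here are illustrative, not the real Transformer/MCE API):

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class MetadataEvent:
    """Toy stand-in for a metadata change event carrying a dataset name."""
    dataset_name: str


def add_catalog_prefix(
    events: Iterable[MetadataEvent], catalog: str
) -> Iterator[MetadataEvent]:
    """Rewrite each event's dataset name to include the catalog prefix."""
    for event in events:
        yield MetadataEvent(dataset_name=f"{catalog}.{event.dataset_name}")
```

The friction being described is that expressing even this much in the real framework requires wiring up the Transformer class, its config classes, and the Pegasus-generated event schemas rather than writing a plain function like this.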
g
the nocode initiative introduced a “key aspect” that will simplify usage of the
urn
field, although the python framework hasn’t transitioned to using it yet
once we do, it should be slightly easier to modify portions of the names of entities
in the meantime, what do you think would help most in making the process of writing a transformer easier?
w
hard to say, I tabled that idea given that you showed me the identifier mechanism. I’ll have a better idea if I can find another time to try developing with it. I can say that wanting to just run a simple mutation on the metadata you’re pulling probably shouldn’t require learning multiple APIs (e.g. the Transformer class, the config classes, the MCE config, etc.).
Not saying it’s impossible, just a bit more of a barrier to entry
here’s the PR for adding the identifier: https://github.com/linkedin/datahub/pull/2667
would love a review on it @gray-shoe-75895 if you have a second
I’m hoping it builds; I did notice that an airflow-related import was failing due to SQLAlchemy dependencies when I ran the tests locally
I’m seeing issues with the build system and mypy. Things are failing that I didn’t modify which I think is a result of the new mypy packages that were released today. Should I fix those or do we want to do that in a separate PR?
g
That should be fixed by this PR: https://github.com/linkedin/datahub/pull/2666 - I think you just need to merge/rebase
other than that, the PR looks good!
w
cool let me give that a shot