# ingestion
w
I’m noticing an odd inconsistency between the redshift metadata ingestion code and the hive metadata ingestion code. For some reason the hive metadata ingestion doesn’t include the catalog name (i.e. the database) in the dataset name, but the redshift ingestion does include the database name. Any ideas as to why, or how to fix that? If the hive ingestion doesn’t include it, that could lead to collisions between catalogs that have the same schema and table names.
It might be due to the dialect that’s being used, since the name comes from SQLAlchemy engine reflection: https://github.com/linkedin/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/sql_common.py#L211
g
I think we just need to add an identifier hook that adds the database info for hive. We’ve already done this for SQL Server and a couple of others (https://github.com/linkedin/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/mssql.py#L40-L44), so it should be pretty easy to add for hive
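To illustrate the pattern (this is a standalone sketch, not the actual DataHub classes — `SQLAlchemySourceBase` and `HiveSource` here are stand-ins for the real source classes, mirroring what the linked mssql hook does):

```python
from typing import Optional


class SQLAlchemySourceBase:
    """Stand-in for the generic SQL source, which builds schema.table names."""

    def get_identifier(self, schema: str, table: str) -> str:
        return f"{schema}.{table}"


class HiveSource(SQLAlchemySourceBase):
    """Hive-specific source that also prepends the database (catalog) name,
    so identically-named schema.table pairs in different catalogs don't collide."""

    def __init__(self, database: Optional[str] = None):
        self.database = database

    def get_identifier(self, schema: str, table: str) -> str:
        regular = super().get_identifier(schema, table)
        if self.database:
            return f"{self.database}.{regular}"
        return regular
```

With a configured database, `HiveSource("mydb").get_identifier("sales", "orders")` yields `mydb.sales.orders` instead of the bare `sales.orders`.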
w
ah ok
I was looking into trying to use a transformer to do the work but the API for doing so combined with the Pegasus schemas makes the code kind of impenetrable, or at least not readily accessible
perhaps this is easier to deal with once I upgrade to the No Code modeling?
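Conceptually, the "simple mutation" being described is just a map over the event stream. A hypothetical sketch (the `MetadataEvent` class and `add_catalog_prefix` function here are illustrative, not the real Transformer/MCE API):

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class MetadataEvent:
    """Toy stand-in for a metadata change event carrying a dataset name."""
    dataset_name: str


def add_catalog_prefix(
    events: Iterable[MetadataEvent], catalog: str
) -> Iterator[MetadataEvent]:
    """Rewrite each event's dataset name to include the catalog prefix."""
    for event in events:
        yield MetadataEvent(dataset_name=f"{catalog}.{event.dataset_name}")
```

The friction being described is that expressing even this much in the real framework requires wiring up the Transformer class, its config classes, and the Pegasus-generated event schemas rather than writing a plain function like this.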
g
the nocode initiative introduced a “key aspect” that will simplify usage of the
urn
field, although the python framework hasn’t transitioned to using it yet
once we do, it should be slightly easier to modify portions of the names of entities
in the meantime, what do you think would help most in making the process of writing a transformer easier?
w
hard to say, I tabled that idea given that you showed me the identifier mechanism. I’ll have a better idea if I can find another time to try developing with it. I can say that wanting to just run a simple mutation on the metadata you’re pulling probably shouldn’t require learning multiple APIs (e.g. the Transformer class, the config classes, the MCE config, etc.).
Not saying it’s impossible, just a bit more of a barrier to entry
here’s the PR for adding the identifier: https://github.com/linkedin/datahub/pull/2667
would love a review on it @gray-shoe-75895 if you have a second
I’m hoping it builds; I did notice that an airflow-related import was failing due to SQLAlchemy dependencies when I ran the tests locally
I’m seeing issues with the build system and mypy. Things are failing that I didn’t modify which I think is a result of the new mypy packages that were released today. Should I fix those or do we want to do that in a separate PR?
g
That should be fixed by this PR: https://github.com/linkedin/datahub/pull/2666 - I think you just need to merge/rebase
other than that, the PR looks good!
w
cool let me give that a shot