# ingestion
h
Hi all! I am pretty new to DataHub and maybe anyone of you could help me 🙂 I am trying to ingest from dbt source but the following error pops up:
dbtNode.columns = get_columns(catalog[dbtNode.dbt_name])
KeyError: 'seed.redshift_dbt.xxxxxxxxxx'
As I understand it, this object does not exist in the catalog file but is present in the manifest, so I would like to exclude it from the ingestion process. I saw that in the config (recipe) it is possible to add
node_type_pattern:
  deny:
but it only applies to the whole seed type, like
"^seed.*"
, otherwise it doesn’t work. Is it possible to exclude this particular node (seed.redshift_dbt.xxxxxxxxxx) from ingestion?
l
have you tried specifying that exact node?
h
yes, I tried but it didn’t help. It only helps when I specify “seed” or the regex “^seed.*”, but then it excludes all nodes that start with it.
l
@mammoth-bear-12532 ^
m
@happy-magazine-52755: have you tried
"^seed.redshift_dbt.xxxxxxxxxx"
h
yes, I have tried but it did not help
Alright, I did some investigation by reading the source code (https://github.com/linkedin/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/dbt.py). As I understand it, the
node_type_pattern
filtering for every node in the manifest file is done by checking the
resource_type
field, which can be one of the following:
analysis, model, seed, snapshot or test
. Is there a way to filter nodes by name and not by the type?
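To illustrate why the exact-node regex has no effect, here is a minimal Python sketch. It assumes the deny patterns are applied with `re.match` to the node's `resource_type` value rather than to its full name; the helper `is_denied` is hypothetical and is not DataHub's actual code.

```python
import re
from typing import Sequence


def is_denied(value: str, deny_patterns: Sequence[str]) -> bool:
    """Return True if any deny regex matches at the start of value."""
    return any(re.match(pattern, value) for pattern in deny_patterns)


# Assumption: the filter only ever sees the resource type, e.g. "seed".
resource_type = "seed"
node_name = "seed.redshift_dbt.xxxxxxxxxx"

# The broad pattern matches the resource type, so every seed is dropped.
broad = is_denied(resource_type, ["^seed.*"])

# The exact-node pattern cannot match the short string "seed".
narrow = is_denied(resource_type, ["^seed.redshift_dbt.xxxxxxxxxx"])

# It would only match if the filter were applied to the node name instead.
by_name = is_denied(node_name, ["^seed.redshift_dbt.xxxxxxxxxx"])
```

Under that assumption, `broad` is true, `narrow` is false, and `by_name` is true, which matches the behaviour described above.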
w
@happy-magazine-52755 that error message tells me you’re on an old version of the dbt.py file. If your object doesn’t exist in the catalog file but exists in the manifest, we handle it like this in the latest version:
if catalog_node is None:
    report.report_warning(
        key,
        f"Entity {dbtNode.dbt_name} is in manifest but missing from catalog",
    )
else:
    dbtNode.columns = get_columns(catalog_node, manifest_node)
As for the pattern, yes, we currently only support filtering based on resource_type.
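Since name-based filtering is not supported, one possible workaround (not a DataHub feature) is to remove the unwanted node from a copy of the manifest before ingestion. The sketch below assumes the manifest JSON keeps its entries under a top-level `nodes` mapping keyed by node id; the helpers `drop_nodes` and `write_filtered_manifest` and the node `model.redshift_dbt.some_model` are hypothetical.

```python
import json
from pathlib import Path


def drop_nodes(manifest, unwanted_ids):
    """Return a shallow copy of the manifest dict without the given node ids."""
    kept = {
        node_id: node
        for node_id, node in manifest.get("nodes", {}).items()
        if node_id not in unwanted_ids
    }
    return {**manifest, "nodes": kept}


def write_filtered_manifest(src, dst, unwanted_ids):
    """Read manifest JSON from src, drop the nodes, and write the result to dst."""
    manifest = json.loads(Path(src).read_text())
    Path(dst).write_text(json.dumps(drop_nodes(manifest, unwanted_ids)))


# Tiny in-memory example; a real manifest has many more fields.
example = {
    "nodes": {
        "seed.redshift_dbt.xxxxxxxxxx": {"resource_type": "seed"},
        "model.redshift_dbt.some_model": {"resource_type": "model"},
    }
}
filtered = drop_nodes(example, {"seed.redshift_dbt.xxxxxxxxxx"})
```

The recipe would then point at the filtered copy instead of the original manifest. This sketch does not touch other nodes that may reference the dropped one, so upgrading to a dbt.py version with the missing-catalog check is likely the simpler fix.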
h
thanks!