# ingestion
c
hello everyone!! first let me say this tool looks amazing, so kudos and props 🥳 I’m trying to use the BigQuery plugin to load lineage data, but the lineage isn’t showing up in the UI. In particular, I ran the ingestion process and I’m seeing this:
```
[2022-03-25 15:09:59,922] INFO     {datahub.cli.ingest_cli:91} - Starting metadata ingestion
[2022-03-25 15:09:59,922] INFO     {datahub.ingestion.source.sql.bigquery:276} - Populating lineage info via GCP audit logs
[2022-03-25 15:09:59,928] INFO     {datahub.ingestion.source.sql.bigquery:369} - Start loading log entries from BigQuery start_time=2022-03-23T23:45:00Z and end_time=2022-03-26T00:15:00Z
[2022-03-25 15:19:32,800] INFO     {datahub.ingestion.source.sql.bigquery:380} - Finished loading 12047 log entries from BigQuery so far
[2022-03-25 15:19:32,800] INFO     {datahub.ingestion.source.sql.bigquery:462} - Parsing BigQuery log entries: number of log entries successfully parsed=12047
[2022-03-25 15:19:32,800] INFO     {datahub.ingestion.source.sql.bigquery:513} - Creating lineage map: total number of entries=12047, number skipped=1.
[2022-03-25 15:19:32,800] INFO     {datahub.ingestion.source.sql.bigquery:270} - Built lineage map containing 12015 entries.
```
and a bit later I’m seeing
```
Sink (datahub-rest) report:
{'downstream_end_time': None,
 'downstream_start_time': None,
 'downstream_total_latency_in_seconds': None,
 'failures': [],
 'gms_version': 'v0.8.31',
 'records_written': 0,
 'warnings': []}
```
but in the UI there’s no lineage showing up for the dataset that I pointed to…
wondering if this is a PEBKAC config thing… a cache thing… or what
s
There is also a source report, which has much more detail. The sink report doesn’t have that much detail.
c
apologies for my ignorance here, where can I find that?
s
If you scroll up in the logs, it should be just above
`Sink (datahub-rest) report:`
Depending on the amount of data, that can be very large.
c
yeah… hmmm … not seeing any errors
I see
`'workunits_produced': 10`
is there something that I should be looking for in particular?
The keys mentioned in this file
c
```
'lineage_metadata_entries': 12015,
 'log_entry_end_time': '2022-03-26T00:15:00Z',
 'log_entry_start_time': '2022-03-23T23:45:00Z',
 'num_parsed_audit_entires': None,
 'num_parsed_log_entires': 12047,
 'num_total_audit_entries': None,
 'num_total_log_entries': 12047,
 'os_details': 'Linux-5.4.0-1067-gcp-x86_64-with-Ubuntu-18.04-bionic',
 'py_exec_path': '/home/andreslowrie/venv/bin/python3',
 'py_version': '3.6.9 (default, Dec  8 2021, 21:08:43) \n[GCC 8.4.0]',
 'query_combiner': {'combined_queries_issued': 4,
                    'queries_combined': 73,
                    'query_exceptions': 0,
                    'total_queries': 115,
                    'uncombined_queries_issued': 42},
 'soft_deleted_stale_entities': [],
 'start_time': datetime.datetime(2022, 3, 24, 0, 0, tzinfo=datetime.timezone.utc),
 'tables_scanned': 1,
 'use_exported_bigquery_audit_metadata': False,
 'use_v2_audit_metadata': False,
```
s
c
yeah that came back with
10
s
If it says `workunits_produced = 10`, then there should be at least 10 `workunit_ids` too.
c
gotcha gotcha…
one sec
s
You can look at https://datahubproject.io/docs/metadata-ingestion/source_docs/bigquery/ and move the `start_time` earlier, which will make it look at more log entries.
```
'log_entry_end_time': '2022-03-26T00:15:00Z',
 'log_entry_start_time': '2022-03-23T23:45:00Z',
```
This tells us which start time and end time it looked at.
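For context, a quick sketch of how wide that window is (timestamps copied from the report above; this is just standard-library datetime arithmetic, nothing DataHub-specific):

```python
from datetime import datetime, timezone

# Timestamps copied from the source report above.
start = datetime(2022, 3, 23, 23, 45, tzinfo=timezone.utc)
end = datetime(2022, 3, 26, 0, 15, tzinfo=timezone.utc)

# The audit-log window the source scanned: a bit over two days. Queries
# that ran before this window cannot contribute lineage edges.
window = end - start
print(window)  # 2 days, 0:30:00
```

Moving `start_time` earlier simply widens this window, at the cost of scanning more log entries.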
c
I see the IDs for the workunits… can I use those to query some store somewhere, then?
I’m a bit confused about opening up the window… so it’s possible for it to scan logs and parse them but not create things that are visible in the UI?
the ids look like this
```
'container-platforminstance-our-bq-project-id-urn:li:container:10f9ca61ed89e0b95f4fb82690bacfc1',
                  'container-subtypes-our-bq-project-id-urn:li:container:10f9ca61ed89e0b95f4fb82690bacfc1',
                  'container-info-the-name-of-the-dataset-urn:li:container:2665ba5c1eca78c8a1ce78d3f402c224',
                  'container-platforminstance-the-name-of-the-dataset-urn:li:container:2665ba5c1eca78c8a1ce78d3f402c224',
                  'container-subtypes-the-name-of-the-dataset-urn:li:container:2665ba5c1eca78c8a1ce78d3f402c224',
                  'container-parent-container-the-name-of-the-dataset-urn:li:container:2665ba5c1eca78c8a1ce78d3f402c224-urn:li:container:10f9ca61ed89e0b95f4fb82690bacfc1',
                  'container-urn:li:container:2665ba5c1eca78c8a1ce78d3f402c224-to-urn:li:dataset:(urn:li:dataPlatform:bigquery,our-bq-project-id.the-name-of-the-dataset.the-name-of-the-dataset_name_of_the_table,PROD)',
                  'our-bq-project-id.the-name-of-the-dataset.the-name-of-the-dataset_name_of_the_table',
                  'profile-our-bq-project-id.the-name-of-the-dataset.the-name-of-the-dataset_name_of_the_table'],
```
(I had to redact some names, but that’s the gist)
again, noob here, so bear with me. I guess my question is: what should I be expecting in the UI when using the lineage feature of BigQuery?
s
workunit IDs are mainly a CLI-only thing
the container feature represents logical containers like schemas, databases, etc.
can you please share the recipe in text format (instead of screenshots) after masking the secrets?
This has details about the UI tabs https://datahubproject.io/docs/how/ui-tabs-guide/
c
the recipe looks like this
```yaml
---
source:
  type: "bigquery"
  config:
    project_id: our-project-id
    include_tables: true
    include_views: true
    include_table_lineage: true
    table_pattern:
      allow:
        - '.*name_of_the_table.*'
    schema_pattern:
      allow:
        - '.*the-name-of-the-dataset.*'
    profiling:
      enabled: true

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
```
s
I think the table and schema patterns might be filtering things out. Try changing `schema_pattern` as below:
```yaml
source:
  type: "bigquery"
  config:
    project_id: gcp-project-name
    schema_pattern:
      allow:
        - the-name-of-the-dataset
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
```
Note to self: we need more examples in the BigQuery docs for allow/deny patterns for schemas and tables.
Also, the source report contains a key called `filtered` that will show what got filtered out.
c
adding on to the point about the docs, the regex pattern in the example has `.*.*.*`, which Python regex doesn’t allow
yeah, there’s a huge list of filtered tables. I’m trying to avoid loading/scanning everything… it’s a lot of tables
s
Yes, we need better examples; I’m sorry I don’t have those handy. But if you search in this Slack itself you should be able to find examples.
c
kinda goes back to the question: does the lineage feature of this plugin check all tables for queries against the matched tables?
I guess I’m wondering what it means for a dataset to have up/down lineage from the perspective of this plugin… does it mean other tables have queries that access them?
or does it mean something else?
oh man, no need to apologize, I get it… I’m documenting my journey here for my team and I’ll be sending up pull requests
s
```yaml
# `schema_pattern` for BQ datasets
    schema_pattern:
      allow:
        - finance_bq_dataset
    table_pattern:
      deny:
        # The exact name of the table is revenue_table_name.
        # The reason we have this `.*` at the beginning is that the current
        # implementation of table_pattern tests against
        # project_id.dataset_name.table_name
        # We will improve this in the future.
        - .*revenue_table_name
```
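To make the comment about the leading `.*` concrete, here's a minimal sketch of an anchored allow/deny check using plain `re` (an illustration only: this is not DataHub's actual `AllowDenyPattern` code, and the names below are made up):

```python
import re

def is_allowed(name, allow, deny):
    """Sketch of an allow/deny regex check: deny wins, and patterns are
    anchored at the start of the string (re.match semantics)."""
    if any(re.match(p, name) for p in deny):
        return False
    return any(bool(re.match(p, name)) for p in allow)

# table_pattern is tested against the fully qualified name:
full_name = "my-project.finance_bq_dataset.revenue_table_name"

# Without the leading `.*`, the deny pattern never matches, because the
# string starts with the project id, not the table name:
print(is_allowed(full_name, [".*"], ["revenue_table_name"]))    # True  (not denied)
print(is_allowed(full_name, [".*"], [".*revenue_table_name"]))  # False (denied)
```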
c
I’m not sure what to do with this information sorry
s
You were not seeing the information, and I did not have examples for using patterns on Friday, so I’m just sharing examples above.
c
oh gotcha, thank you much