# ingestion
c
hello everyone!! first let me say this tool looks amazing, so kudos and props 🥳 I’m trying to use the BigQuery plugin to load lineage data, but the lineage isn’t showing up in the UI. In particular, I ran the ingestion process and I’m seeing this:
```
[2022-03-25 15:09:59,922] INFO     {datahub.cli.ingest_cli:91} - Starting metadata ingestion
[2022-03-25 15:09:59,922] INFO     {datahub.ingestion.source.sql.bigquery:276} - Populating lineage info via GCP audit logs
[2022-03-25 15:09:59,928] INFO     {datahub.ingestion.source.sql.bigquery:369} - Start loading log entries from BigQuery start_time=2022-03-23T23:45:00Z and end_time=2022-03-26T00:15:00Z
[2022-03-25 15:19:32,800] INFO     {datahub.ingestion.source.sql.bigquery:380} - Finished loading 12047 log entries from BigQuery so far
[2022-03-25 15:19:32,800] INFO     {datahub.ingestion.source.sql.bigquery:462} - Parsing BigQuery log entries: number of log entries successfully parsed=12047
[2022-03-25 15:19:32,800] INFO     {datahub.ingestion.source.sql.bigquery:513} - Creating lineage map: total number of entries=12047, number skipped=1.
[2022-03-25 15:19:32,800] INFO     {datahub.ingestion.source.sql.bigquery:270} - Built lineage map containing 12015 entries.
```
and a bit later I’m seeing
```
Sink (datahub-rest) report:
{'downstream_end_time': None,
 'downstream_start_time': None,
 'downstream_total_latency_in_seconds': None,
 'failures': [],
 'gms_version': 'v0.8.31',
 'records_written': 0,
 'warnings': []}
```
but in the UI there’s no lineage showing up for the dataset that I pointed to…
wondering if this is a PEBKAC config thing… a cache thing… or what
s
There is also a source report, which has much more detail. The sink report doesn’t have that much detail.
c
apologies for my ignorance here, where can I find that?
s
If you scroll up in the logs, it should be just above
`Sink (datahub-rest) report:`
Depending on the amount of data, that can be very large.
c
yeah… hmmm … not seeing any errors
I see
`'workunits_produced': 10`
is there something that I should be looking for in particular?
The keys mentioned in this file
c
```
'lineage_metadata_entries': 12015,
 'log_entry_end_time': '2022-03-26T00:15:00Z',
 'log_entry_start_time': '2022-03-23T23:45:00Z',
 'num_parsed_audit_entires': None,
 'num_parsed_log_entires': 12047,
 'num_total_audit_entries': None,
 'num_total_log_entries': 12047,
 'os_details': 'Linux-5.4.0-1067-gcp-x86_64-with-Ubuntu-18.04-bionic',
 'py_exec_path': '/home/andreslowrie/venv/bin/python3',
 'py_version': '3.6.9 (default, Dec  8 2021, 21:08:43) \n[GCC 8.4.0]',
 'query_combiner': {'combined_queries_issued': 4,
                    'queries_combined': 73,
                    'query_exceptions': 0,
                    'total_queries': 115,
                    'uncombined_queries_issued': 42},
 'soft_deleted_stale_entities': [],
 'start_time': datetime.datetime(2022, 3, 24, 0, 0, tzinfo=datetime.timezone.utc),
 'tables_scanned': 1,
 'use_exported_bigquery_audit_metadata': False,
 'use_v2_audit_metadata': False,
```
s
c
yeah that came back with
10
s
If it says `workunits_produced = 10`, then there should be at least 10 `workunit_ids` too.
c
gotcha gotcha…
one sec
s
You can look at https://datahubproject.io/docs/metadata-ingestion/source_docs/bigquery/ and move the `start_time` earlier, which will make it look at more log entries.
```
'log_entry_end_time': '2022-03-26T00:15:00Z',
 'log_entry_start_time': '2022-03-23T23:45:00Z',
```
This tells us which start time and end time it looked at.
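For context, a quick sketch of how wide that window is (timestamps copied from the report above; this is just standard-library datetime arithmetic, nothing DataHub-specific):

```python
from datetime import datetime, timezone

# Timestamps copied from the source report above.
start = datetime(2022, 3, 23, 23, 45, tzinfo=timezone.utc)
end = datetime(2022, 3, 26, 0, 15, tzinfo=timezone.utc)

# The audit-log window the source scanned: a bit over two days. Queries
# that ran before this window cannot contribute lineage edges.
window = end - start
print(window)  # 2 days, 0:30:00
```

Moving `start_time` earlier simply widens this window, at the cost of scanning more log entries.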
c
I see the IDs for the workunits… can I use those to query some store somewhere, then?
I’m a bit confused about opening up the window… so it’s possible for it to scan logs and parse them but not create things that are visible in the UI?
the ids look like this
```
'container-platforminstance-our-bq-project-id-urn:li:container:10f9ca61ed89e0b95f4fb82690bacfc1',
                  'container-subtypes-our-bq-project-id-urn:li:container:10f9ca61ed89e0b95f4fb82690bacfc1',
                  'container-info-the-name-of-the-dataset-urn:li:container:2665ba5c1eca78c8a1ce78d3f402c224',
                  'container-platforminstance-the-name-of-the-dataset-urn:li:container:2665ba5c1eca78c8a1ce78d3f402c224',
                  'container-subtypes-the-name-of-the-dataset-urn:li:container:2665ba5c1eca78c8a1ce78d3f402c224',
                  'container-parent-container-the-name-of-the-dataset-urn:li:container:2665ba5c1eca78c8a1ce78d3f402c224-urn:li:container:10f9ca61ed89e0b95f4fb82690bacfc1',
                  'container-urn:li:container:2665ba5c1eca78c8a1ce78d3f402c224-to-urn:li:dataset:(urn:li:dataPlatform:bigquery,our-bq-project-id.the-name-of-the-dataset.the-name-of-the-dataset_name_of_the_table,PROD)',
                  'our-bq-project-id.the-name-of-the-dataset.the-name-of-the-dataset_name_of_the_table',
                  'profile-our-bq-project-id.the-name-of-the-dataset.the-name-of-the-dataset_name_of_the_table'],
```
(I had to redact some names, but that’s the gist)
again, noob here, so bear with me. I guess my question is: what should I be expecting in the UI when using the lineage feature of BigQuery?
s
workunit IDs are mainly a CLI-only thing
the container feature represents logical containers like schemas, databases, etc.
can you please share the recipe in text format (instead of screenshots) after masking the secrets?
This has details about the UI tabs https://datahubproject.io/docs/how/ui-tabs-guide/
c
the recipe looks like this
```yaml
---
source:
  type: "bigquery"
  config:
    project_id: our-project-id
    include_tables: true
    include_views: true
    include_table_lineage: true
    table_pattern:
      allow:
        - '.*name_of_the_table.*'
    schema_pattern:
      allow:
        - '.*the-name-of-the-dataset.*'
    profiling:
      enabled: true

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
```
s
I think the table and schema patterns might be filtering things out. Try changing `schema_pattern` as below:
```yaml
source:
  type: "bigquery"
  config:
    project_id: gcp-project-name
    schema_pattern:
      allow:
        - the-name-of-the-dataset
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
```
Note to self: we need more examples in the BigQuery docs for allow/deny patterns for schemas and tables.
Also, the source report contains a key called `filtered` that will show what got filtered out.
c
adding on to the point about the docs, the regex pattern in the example has `.*.*.*`, which Python regex doesn’t allow
yeah, there’s a huge list of filtered tables. I’m trying to avoid loading/scanning everything… it’s a lot of tables
s
Yes, we need better examples; I’m sorry I don’t have those handy. But if you search in this Slack itself you should be able to find examples.
c
kinda goes back to the question: does the lineage feature of this plugin check all tables for queries against the matched tables?
I guess I’m wondering what it means for a dataset to have up/down lineage from the perspective of this plugin… does it mean other tables have queries that access them?
or does it mean something else?
oh man, no need to apologize, I get it… I’m documenting my journey here for my team and I’ll be sending up pull requests
s
```yaml
# `schema_pattern` for BQ datasets
    schema_pattern:
      allow:
        - finance_bq_dataset
    table_pattern:
      deny:
        # The exact name of the table is revenue_table_name.
        # The reason we have this `.*` at the beginning is that the current
        # implementation of table_pattern tests against
        # project_id.dataset_name.table_name
        # We will improve this in the future.
        - .*revenue_table_name
```
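To make the comment about the leading `.*` concrete, here's a minimal sketch of an anchored allow/deny check using plain `re` (an illustration only: this is not DataHub's actual `AllowDenyPattern` code, and the names below are made up):

```python
import re

def is_allowed(name, allow, deny):
    """Sketch of an allow/deny regex check: deny wins, and patterns are
    anchored at the start of the string (re.match semantics)."""
    if any(re.match(p, name) for p in deny):
        return False
    return any(bool(re.match(p, name)) for p in allow)

# table_pattern is tested against the fully qualified name:
full_name = "my-project.finance_bq_dataset.revenue_table_name"

# Without the leading `.*`, the deny pattern never matches, because the
# string starts with the project id, not the table name:
print(is_allowed(full_name, [".*"], ["revenue_table_name"]))    # True  (not denied)
print(is_allowed(full_name, [".*"], [".*revenue_table_name"]))  # False (denied)
```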
c
I’m not sure what to do with this information sorry
s
You were not seeing the information, and I did not have examples for using patterns on Friday, so I’m just sharing examples above.
c
oh gotcha, thank you much