# ingestion
  • microscopic-mechanic-13766 (10/25/2022, 1:35 PM)
    Hello, quick question. My Hive ingestion takes too long to execute (more than 5 hours; the profiling stays in progress and the table profiles are never obtained). What should I do to make the ingestion run in less time, or at least to obtain the profiling? (Note: profiling only part of the tables instead of all of them is not an option, as I only have 6 tables with a maximum of 200 rows each.) I am currently using v0.8.45 and this is my recipe for Hive:
    Copy code
    sink:
        type: datahub-rest
        config:
            server: 'http://datahub-gms:8080'
    source:
        type: hive
        config:
            database: <db>
            profiling:
                enabled: true
            host_port: 'hive-server:10000'
            options:
                connect_args:
                    auth: KERBEROS
                    kerberos_service_name: hive-server
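    For reference, a minimal sketch of the kind of knobs that usually shorten profiling runs, expressed with the programmatic Pipeline API instead of a YAML recipe. The profiling fields shown (profile_table_level_only, max_workers) are the standard SQL-source profiling options; treat them as assumptions and verify them against the Hive source docs for your CLI version.
    ```python
    # Hedged sketch: same recipe as above, run programmatically, with profiling
    # options that typically reduce runtime. Verify field names for your version.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "hive",
                "config": {
                    "host_port": "hive-server:10000",
                    "database": "<db>",
                    "options": {
                        "connect_args": {
                            "auth": "KERBEROS",
                            "kerberos_service_name": "hive-server",
                        }
                    },
                    "profiling": {
                        "enabled": True,
                        # If runs never finish, start with table-level stats only:
                        # "profile_table_level_only": True,
                        # and/or reduce the number of parallel profiling queries:
                        "max_workers": 5,
                    },
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://datahub-gms:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()
    ```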
  • bumpy-pharmacist-66525 (10/25/2022, 1:48 PM)
    Hello, I have a question about enabling the `stateful_ingestion` feature on sources which support it. I am in the process of updating the `Iceberg` source to support `stateful_ingestion`, but I am running into a weird issue. From my testing, when the feature is enabled and I try to re-create and then re-ingest a table (after it has been soft-deleted by a previous ingestion run), it does not appear again in the UI. Is this the expected behavior? I would have imagined that when the ingestion recipe is executed, the dataset (which was remade) would re-appear in the UI even if it was soft-deleted in an earlier ingestion run.
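    For context, a rough sketch of how stateful ingestion is normally wired up on sources that already support it; whether the soft-deleted dataset reappears will depend on the in-progress Iceberg changes, so treat this purely as the usual recipe shape (pipeline_name and remove_stale_metadata are the standard fields).
    ```python
    # Sketch of a typical stateful-ingestion recipe, expressed as the dict you'd
    # pass to the CLI / Pipeline API. Connection details are placeholders.
    recipe = {
        # State is stored and looked up per pipeline_name, so keep it stable across runs.
        "pipeline_name": "iceberg-dev-ingestion",
        "source": {
            "type": "iceberg",
            "config": {
                # ... usual Iceberg connection settings ...
                "stateful_ingestion": {
                    "enabled": True,
                    # Soft-deletes entities that are missing from a later run.
                    "remove_stale_metadata": True,
                },
            },
        },
        "sink": {"type": "datahub-rest", "config": {"server": "http://datahub-gms:8080"}},
    }
    ```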
  • salmon-angle-92685 (10/25/2022, 4:39 PM)
    Hello guys, if I launch a stateful ingestion via the command line, and then decide to relaunch it via the UI (keeping the same recipe) because I want to set a schedule, will the second ingestion keep the stateful ingestion state from the one executed via the command line? I am asking this because in the UI I see that the new ingestion didn't replace the old one. Thanks!
  • wide-spring-1569 (10/25/2022, 6:02 PM)
    Hello, I'm having trouble with the LookML ingestion: when I run it I get lots of errors due to it being unable to resolve our include statements, e.g.
    cannot resolve include "//customer/views/customer_account/customer_responses_all.view.lkml"
    This is almost certainly due to the fact that we are using multiple repos to define our LookML, with a `manifest.lkml` used to define references to files in other projects, e.g.
    Copy code
    local_dependency: {
      project: "customer"
    }
    Is this something that people have been able to work with, or is our multi-repo setup beyond what the `acryl-datahub[lookml]` library is set up to handle? Basically I have about 10 repos and they all reference each other in the include statements, which makes the parsing of LookML much harder.
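    For reference, a hedged sketch of how a multi-repo checkout could be described to the source: all projects checked out side by side and mapped by project name. base_folder and api are standard LookML-source fields; the project_dependencies mapping is assumed from newer source docs and may not exist on older CLI versions, so check your version before relying on it.
    ```python
    # Hypothetical source config for a multi-repo LookML setup. Paths, project
    # names, and the project_dependencies key itself are assumptions.
    lookml_source = {
        "type": "lookml",
        "config": {
            "base_folder": "/repos/main-lookml-project",  # the project being ingested
            "project_dependencies": {
                # lets include: "//customer/..." resolve to a local checkout
                "customer": "/repos/customer",
            },
            "api": {
                "base_url": "https://yourcompany.looker.com",
                "client_id": "<client-id>",
                "client_secret": "<client-secret>",
            },
        },
    }
    ```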
  • refined-energy-76018 (10/26/2022, 12:25 AM)
    Hi, does the DataHub Airflow plugin support emitting the `UP_FOR_RETRY` status? If not, are there any plans to?
  • little-breakfast-38102 (10/26/2022, 5:22 AM)
    Hello @dazzling-judge-80093 / @mammoth-bear-12532, I am adding an ODBC driver and other dependencies to the datahub-actions image v0.0.8. I am able to add the same dependencies on top of the native ingestion image v0.8.45 and successfully execute the CRON job. However, when I use it against the actions image I run into a CrashLoopBackOff error. I have the following two lines at the end of my custom image's Dockerfile:
    USER datahub
    ENTRYPOINT ["datahub"]
    Appreciate any help on this.
  • best-umbrella-88325 (10/26/2022, 10:58 AM)
    Hi all. It seems like the custom S3 ingestion isn't picking up the default CLI version on the UI. When the version is provided as 0.9.0 under the 'Advanced' section on the UI, it still picks up 0.8.43. For other sources like AWS Glue, the version gets picked up correctly. Are we missing something here? Any help appreciated. AWS S3 output:
    Copy code
    RUN_INGEST - {'errors': [],
     'exec_id': '484fcd84-7b63-41f2-9ace-faac265203ec',
     'infos': ['2022-10-26 10:47:10.581114 [exec_id=484fcd84-7b63-41f2-9ace-faac265203ec] INFO: Starting execution for task with name=RUN_INGEST',
               '2022-10-26 10:47:18.713045 [exec_id=484fcd84-7b63-41f2-9ace-faac265203ec] INFO: stdout=venv setup time = 0\n'
               'This version of datahub supports report-to functionality\n'
               'datahub  ingest run -c /tmp/datahub/ingest/484fcd84-7b63-41f2-9ace-faac265203ec/recipe.yml --report-to '
               '/tmp/datahub/ingest/484fcd84-7b63-41f2-9ace-faac265203ec/ingestion_report.json\n'
               '[2022-10-26 10:47:13,128] INFO     {datahub.cli.ingest_cli:182} - DataHub CLI version: 0.9.0\n'
               '[2022-10-26 10:47:13,158] INFO     {datahub.ingestion.run.pipeline:175} - Sink configured successfully. DataHubRestEmitter: configured '
    AWS Glue Output:
    Copy code
    RUN_INGEST - {'errors': [],
     'exec_id': '82df4b0b-8d97-461b-bc4c-f4922ebe5d04',
     'infos': ['2022-10-26 10:53:31.768404 [exec_id=82df4b0b-8d97-461b-bc4c-f4922ebe5d04] INFO: Starting execution for task with name=RUN_INGEST',
               '2022-10-26 10:53:37.857276 [exec_id=82df4b0b-8d97-461b-bc4c-f4922ebe5d04] INFO: stdout=venv setup time = 0\n'
               'This version of datahub supports report-to functionality\n'
               'datahub  ingest run -c /tmp/datahub/ingest/82df4b0b-8d97-461b-bc4c-f4922ebe5d04/recipe.yml --report-to '
               '/tmp/datahub/ingest/82df4b0b-8d97-461b-bc4c-f4922ebe5d04/ingestion_report.json\n'
               '[2022-10-26 10:53:34,404] INFO     {datahub.cli.ingest_cli:177} - DataHub CLI version: 0.8.43.5\n'
               '[2022-10-26 10:53:34,478] INFO     {datahub.ingestion.run.pipeline:163} - Sink configured successfully. DataHubRestEmitter: configured '
  • dazzling-caravan-26726 (10/26/2022, 11:06 AM)
    Hi everyone, I created my glossary via the CLI and the display names of the nodes are changed automatically to include the name of the parent glossary node, which makes it confusing with a large number of subnodes. If I create the glossary manually, everything looks fine. Any ideas how to fix this? (DataHub version is v0.8.45)
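    For context, a sketch of the CLI route being described, using the datahub-business-glossary file source; the glossary YAML layout in the comment is abbreviated, assumes the documented format, and uses placeholder names only.
    ```python
    # Sketch of a glossary-file ingestion recipe as a dict; the YAML structure in
    # the comment is abbreviated from the documented business-glossary format.
    recipe = {
        "source": {
            "type": "datahub-business-glossary",
            "config": {"file": "./business_glossary.yml"},
        },
        "sink": {"type": "datahub-rest", "config": {"server": "http://datahub-gms:8080"}},
    }

    # business_glossary.yml (abbreviated):
    #   version: 1
    #   source: DataHub
    #   owners:
    #     users:
    #       - datahub
    #   nodes:
    #     - name: Classification
    #       description: Parent node
    #       terms:
    #         - name: Sensitive
    #           description: Child term whose display name is at issue here
    ```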
  • dazzling-caravan-26726 (10/26/2022, 11:06 AM)
    image.png
  • dazzling-caravan-26726 (10/26/2022, 11:08 AM)
    image.png
  • colossal-hairdresser-6799 (10/26/2022, 12:11 PM)
    Hi team. Is there any functionality for adding domains with the Python SDK? I'm talking about creating the domain entity itself, not adding a domain to a dataset, for example. I tried to look at the documentation but couldn't find anything. Am I possibly overlooking something?
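    For reference, a minimal sketch of creating the domain entity itself by emitting a domainProperties aspect with the Python SDK; the domain name, description, and GMS address are placeholders.
    ```python
    # Create a domain entity (not a dataset-to-domain association) via the SDK.
    from datahub.emitter.mce_builder import make_domain_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, DomainPropertiesClass

    domain_urn = make_domain_urn("marketing")
    domain_properties = DomainPropertiesClass(
        name="Marketing",
        description="Entities owned by the marketing department",
    )

    event = MetadataChangeProposalWrapper(
        entityType="domain",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=domain_urn,
        aspectName="domainProperties",
        aspect=domain_properties,
    )

    emitter = DatahubRestEmitter("http://datahub-gms:8080")
    emitter.emit(event)
    ```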
  • prehistoric-helicopter-42228 (10/26/2022, 1:51 PM)
    Hi team, I am currently using the DataHub BigQuery ingestion (bigquery module) for lineage and I would like to clear some things up:
    1. I am currently using the GCP logging method. Am I right that the GCP logging method doesn't require enabling anything in BigQuery, since it works from logs that BigQuery generates automatically?
    2. How much more information would I get using the audit logs? We have a few audit log sinks set up, but it would be hard to configure audit logs for all datasets and I would not start it if it isn't worth it.
    3. I get some lineage info for a few tables and I would like to understand why the rest is not showing up:
       a. What is happening under the hood for the GCP-logging-based lineage calculation?
       b. I checked the code and I saw the usage of information_schema.views/tables and jobStats. Where exactly does the upstream_lineage field of the Ingestion Run Details come from?
       c. How can I get the lineage manually for a given table in the GCP CLI? I need this so I can verify what is missing for the tables that didn't have lineage previously.
    Thank you a lot for the answers, Barnabas
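    For reference, a hedged sketch of the lineage-related switches on the bigquery source that decide whether lineage is computed from the GCP Logging API (the default, no extra BigQuery setup) or from audit logs exported into BigQuery tables. Field names are taken from the source docs and should be checked against the installed CLI version; project and dataset names are placeholders.
    ```python
    # bigquery source config fragment focused on lineage options (sketch only).
    bigquery_source = {
        "type": "bigquery",
        "config": {
            "project_id": "<project>",
            "include_table_lineage": True,
            # Default: read query events through the GCP Logging API.
            "use_exported_bigquery_audit_metadata": False,
            # Alternative: point at datasets that hold exported audit logs.
            # "use_exported_bigquery_audit_metadata": True,
            # "bigquery_audit_metadata_datasets": ["<project>.<audit_log_dataset>"],
        },
    }
    ```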
  • square-solstice-69079 (10/26/2022, 2:18 PM)
    Hey, regarding the Airflow push ingestion, is there any easy way to find and delete jobs/tasks when they are deleted or renamed in Airflow?
  • happy-twilight-44865 (10/26/2022, 2:19 PM)
    I am ingesting metadata using the S3 ingestion source with the partition_key approach (snapshot of my sample recipe file attached). The problem is that when I use partition_key instead of the regex path commented out in the snapshot, we do not get the s3_object_tags in DataHub, but it works when I use the regex path from the snapshot. Kindly suggest.
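    For reference, a hedged sketch of the two path styles being compared, with S3 object tags enabled; the bucket, prefix, and placeholders are illustrative only and not taken from the attached recipe.
    ```python
    # s3 source config fragment (sketch): partition_key placeholders vs. a plain
    # wildcard path, with object-tag extraction turned on.
    s3_source = {
        "type": "s3",
        "config": {
            "aws_config": {"aws_region": "us-east-1"},
            "use_s3_object_tags": True,  # the tags that fail to appear with the partition_key form
            "path_specs": [
                {
                    # partition_key approach:
                    "include": "s3://my-bucket/data/{table}/{partition_key[0]}={partition[0]}/*.parquet",
                    # wildcard-style approach that reportedly does pick up the tags:
                    # "include": "s3://my-bucket/data/*/*/*.parquet",
                }
            ],
        },
    }
    ```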
  • happy-baker-8735 (10/26/2022, 2:39 PM)
    Hi everyone, I would like to know how to handle special characters in file ingestion. We are trying to enrich our DataHub with JSON file ingestion, but special characters (é, è, ô...) are not displayed correctly. We tried this:
    Copy code
    {
        "entityType":"domain",
        "entityUrn": "urn:li:domain:referentiel",
        "changeType":"UPSERT",
        "aspectName":"domainProperties",
        "aspect":{
          "value":"{\"name\": \"Référentiel\"}",
          "contentType":"application/json"
        }
    }
    Copy code
    {
        "entityType":"domain",
        "entityUrn": "urn:li:domain:referentiel",
        "changeType":"UPSERT",
        "aspectName":"domainProperties",
        "aspect":{
          "value":"{\"name\": \"R\u00e9f\u00e9rentiel\"}",
          "contentType":"application/json"
        }
    }
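    One way to rule out an encoding problem on the producing side is to write the file as UTF-8 with non-ASCII characters left unescaped; a minimal sketch (the file name and domain are placeholders, and the aspect value stays a JSON-encoded string as above):
    ```python
    import json

    aspect_value = {"name": "Référentiel"}
    mcp = {
        "entityType": "domain",
        "entityUrn": "urn:li:domain:referentiel",
        "changeType": "UPSERT",
        "aspectName": "domainProperties",
        "aspect": {
            # keep accented characters unescaped inside the serialized aspect
            "value": json.dumps(aspect_value, ensure_ascii=False),
            "contentType": "application/json",
        },
    }

    # write the ingestion file explicitly as UTF-8
    with open("domains.json", "w", encoding="utf-8") as f:
        json.dump([mcp], f, ensure_ascii=False, indent=2)
    ```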
  • hallowed-lizard-92381 (10/26/2022, 8:01 PM)
    Crossposting to the Ingestion channel https://datahubspace.slack.com/archives/CV2KB471C/p1666814408596799
  • witty-motorcycle-52108 (10/26/2022, 9:05 PM)
    Hey all, kind of a weird question: I'm wondering whether people think it would be appropriate to augment the Glue ingestion source with the ability to profile/sample column data using Athena. I know that probably sounds weird, but hear me out. Right now, the Glue source ignores columns that aren't profiled; however, as far as I can tell, column profiling the way DataHub refers to it isn't really built into Glue. The current implementation appears to rely on people running a custom profiler, unless I'm missing something while reading over the implementation and the Glue docs. We'd like to use the Glue ingestion over the Athena ingestion because the behavior/output is better, but there's no sampling (what we really care about) built into it, which is unfortunate. The Athena source doesn't handle large partitioned tables well, because it can't restrict profiling to the latest partition, so our ingest times are gigantic (several hours in beta). Athena also doesn't support lineage, while Glue does. Since Glue and Athena are so tightly coupled as AWS products, I was curious whether profiling/sampling Glue with Athena would be a PR that would be accepted, or if that's too weird, basically.
  • silly-oil-35180 (10/27/2022, 1:07 AM)
    Hi team, I have a problem with datasetProfile. I added datasetProfile to a dataset entity. I can find the datasetProfile data using the API below, and Elasticsearch also has the datasetProfile.
    Copy code
    http://<gms-url>:8080/aspects/<encoded urn>/aspects?action=getTimeseriesAspectValues
    However, the Stats tab is not activated in the web UI. I checked the GraphQL query used to fetch the dataset (getDataset); GraphQL didn't fetch any datasetProfiles. Why does this problem happen? How can I fix it?
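    For reference, a sketch of the GMS call that confirms the timeseries data is really stored, assuming the documented getTimeseriesAspectValues action; the URN is a placeholder. If this returns values but the Stats tab stays empty, the gap is on the GraphQL/UI side rather than in storage.
    ```python
    import json
    import requests

    gms = "http://datahub-gms:8080"
    body = {
        "urn": "urn:li:dataset:(urn:li:dataPlatform:hive,db.my_table,PROD)",
        "entity": "dataset",
        "aspect": "datasetProfile",
    }
    resp = requests.post(
        f"{gms}/aspects?action=getTimeseriesAspectValues",
        data=json.dumps(body),
        headers={"Content-Type": "application/json"},
    )
    print(resp.status_code, resp.json())
    ```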
  • gifted-knife-16120 (10/27/2022, 3:07 AM)
    What is the permission needed to enable profiling for Postgres? I have already granted the SELECT permission, but it still shows a permission error.
  • lemon-cat-72045 (10/27/2022, 6:28 AM)
    Hi all, I have set up the Airflow integration with DataHub. I am using the Kafka sink as the connection, but I am seeing this error. Does anyone know what the problem is here? Thanks a lot.
  • famous-florist-7218 (10/27/2022, 8:22 AM)
    Hi there, I'm playing around with the new `bigquery` ingestion. It seems the executor couldn't run properly: the lineage map function was unable to retrieve the audit log. Any help? Here is the log from datahub-actions:
    Copy code
    [2022-10-27 07:26:25,746] INFO     {datahub.ingestion.source.bigquery_v2.lineage:154} - Populating lineage info via GCP audit logs for my-dev-95adf
    [2022-10-27 07:26:25,783] INFO     {datahub.ingestion.source.bigquery_v2.lineage:161} - Log Entries loaded
    [2022-10-27 07:26:25,783] INFO     {datahub.ingestion.source.bigquery_v2.lineage:371} - Entering create lineage map function
    [2022-10-27 07:26:25,783] INFO     {datahub.ingestion.source.bigquery_v2.lineage:218} - Start loading log entries from BigQuery for my-dev-95adf with start_time=2022-10-25T23:26:40Z and end_time=2022-10-27T07:59:41Z
    [2022-10-27 07:26:25,783] INFO     {datahub.ingestion.source.bigquery_v2.lineage:234} - Start iterating over log entries from BigQuery for my-dev-95adf
    unable to retrieve container logs for containerd://68b4741cd185c5ac09e560ee58932bb7166861d580c955ace267eed4be25f8d9
    unable to retrieve container logs for containerd://68b4741cd185c5ac09e560ee58932bb7166861d580c955ace267eed4be25f8d9
    unable to retrieve container logs for containerd://68b4741cd185c5ac09e560ee58932bb7166861d580c955ace267eed4be25f8d9
    unable to retrieve container logs for containerd://68b4741cd185c5ac09e560ee58932bb7166861d580c955ace267eed4be25f8d9
    ...
  • late-yak-71835 (10/27/2022, 9:22 AM)
    Hi all. In our organization, all credentials are stored and managed within HashiCorp Vault. Can we configure DataHub to retrieve these credentials from Vault and use them to perform its functions? (i.e. retrieve Postgres DB credentials and use them to pull and ingest its metadata)
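    For reference, a sketch of the pattern that usually covers this: have Vault (for example via the Vault Agent injector or an external-secrets operator) expose the credentials as environment variables, then reference them from the recipe. YAML recipes support ${VAR} substitution; the equivalent is shown here as a Python dict, with variable names as placeholders.
    ```python
    import os

    # Credentials are injected into the environment by Vault, never written into the recipe.
    postgres_source = {
        "type": "postgres",
        "config": {
            "host_port": os.environ["POSTGRES_HOST_PORT"],
            "database": os.environ["POSTGRES_DB"],
            "username": os.environ["POSTGRES_USER"],
            "password": os.environ["POSTGRES_PASSWORD"],
        },
    }
    ```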
  • microscopic-mechanic-13766 (10/27/2022, 11:31 AM)
    Hi, so I have just run profiling over some tables from Hive and obtained the following (see picture). I have also checked the ES index, just in case it was a problem of the frontend not showing all the information, but nothing was profiled at all during the ingestion. I know that the ingestion on Hive might not be the best, but I just wanted to let you know in case you didn't know, or to get help in case it is just my particular case. Thanks in advance!!
  • bitter-byte-67818 (10/27/2022, 2:25 PM)
    Hi all! I have a kind of dumb question: I'm trying to ingest a simple MariaDB database with uppercase letters, but the ingestion seems to convert everything to lowercase whatever parameter I try (database_pattern.ignoreCase, etc.). Would you enlighten me? Thanks!
  • melodic-tomato-17544 (10/27/2022, 4:35 PM)
    Hey y'all. I'm new to using DataHub, so apologies if this has been covered before. I'm pointing our DataHub instance at my team's API openapi.json and it fails on any endpoint that takes parameters. I've updated the endpoint annotations to provide example parameters, but the DataHub ingester doesn't pick them up. I've sanitized the example using the colors API theme from the docs. Imagine two endpoints: `GET /color-of-the-day` and `GET /color-of-the-day/{date}`, where the second takes a specific date. The OpenAPI spec details the parameters, format, etc., but the DataHub ingester is still making requests to `/color-of-the-day/{date}` (literally), not picking up the example parameter, e.g. requesting `GET /color-of-the-day/2021-10-24`. Any suggestions? I've tried googling around but the answers are pretty sparse. I did figure out that I can use `forced_examples` in the recipe source config, but I'd prefer not to if possible, favoring the API documentation instead.
    openapi.json
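    For reference, the shape of the forced_examples workaround mentioned above, sketched as a source-config dict; the endpoint and date come from the sanitized example, the API name and URL are placeholders, and the exact semantics should be checked against the OpenAPI source docs for your CLI version.
    ```python
    # openapi source config fragment (sketch) showing forced_examples.
    openapi_source = {
        "type": "openapi",
        "config": {
            "name": "color_api",
            "url": "https://api.example.com/",
            "swagger_file": "openapi.json",
            "forced_examples": {
                # substitute this value for {date} when the ingester probes the endpoint
                "/color-of-the-day/{date}": ["2021-10-24"],
            },
        },
    }
    ```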
  • some-car-9623 (10/27/2022, 6:35 PM)
    Hello everyone, one of our requirements is to ingest metadata from SAP BO, and I don't find any details about this in the docs or in the demo. Is there any way I can use an existing ingestion source to achieve this? Thanks, Geetha
  • refined-energy-76018 (10/28/2022, 12:29 AM)
    Any thoughts about some features that are more DAG-centric when it comes to Airflow ingestion? Features such as:
    • Overall DAG run status and run status history
    • Labeling which DAG a task is from in the lineage UI
    • "Collapsed" lineage showing DAG-to-DAG dependencies; this would be helpful to get a high-level overview of dependencies for what would otherwise be very complicated task (job run) dependencies
  • worried-branch-76677 (10/28/2022, 3:59 AM)
    Hi, for the PowerBI connector (https://datahubproject.io/docs/generated/ingestion/sources/powerbi/#config-details): can I check whether we can model a PowerBI report as a DataHub dashboard entity? When I dig into the codebase, it looks like it is not ingesting PowerBI reports or `InputFieldClass`. Any guidance would be nice.
  • lemon-cat-72045 (10/28/2022, 5:52 AM)
    Hi all, I have set up a Looker source in the UI, but it fails with an error saying "Looker Not Found". I have tested that the Looker connection is successful.
  • gifted-knife-16120 (10/28/2022, 6:20 AM)
    Hi all, I've got one issue when enabling ingest profiling for Postgres:
    Copy code
    {"public.tablename": ["Profiling exception year -1 is out of range"]},
    Above is the error. How can I fix this?