# ingestion
  • a

    alert-fall-82501

    01/09/2023, 7:04 AM
    check the config file in thread
  • c

    curved-planet-99787

    01/09/2023, 7:37 AM
    Hi, is there a reason why s3_staging_dir is used as a parameter name in the Athena source recipe? For me it is rather unintuitive, and I would expect something like query_result_location, since that is also the term AWS uses. I'm also interested in what others in the community think about this. I could try to come up with a PR to change this if you agree with my suggestion. (A minimal recipe sketch follows this message.)
    👀 1
    a
    d
    • 3
    • 7
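    For context, a minimal sketch of an Athena recipe using the current parameter name. Only s3_staging_dir is taken from the message above; the other keys and values are illustrative assumptions and should be checked against the Athena source docs.
    source:
        type: athena
        config:
            aws_region: us-east-1                      # assumed example value
            work_group: primary                        # assumed example value
            s3_staging_dir: "s3://my-athena-results/"  # current name; query_result_location is the proposed alternative
    sink:
        type: datahub-rest
        config:
            server: "http://localhost:8080"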
  • a

    aloof-energy-17918

    01/09/2023, 9:31 AM
    Hi all, I'm having a connection problem to pypi.org when trying to add a new ingestion source. I think this is because I'm behind a proxy. I saw a solution posted previously here: https://datahubspace.slack.com/archives/C029A3M079U/p1647507748598729?thread_ts=1646928810.396579&cid=C029A3M079U However, that solution is for a Docker deployment, while I'm using K8s. I'm not really familiar with K8s; I know it has something to do with ConfigMaps, volumes, and volumeMounts, but I'm not exactly sure how the syntax works. Could anyone give me some guidance on this? (A hedged Helm values sketch follows the log below.)
    Copy code
    "'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fdad48c4d90>, 'Connection to <http://pypi.org|pypi.org> timed out. "
               "(connect timeout=15)')': /simple/wheel/\n"
               'WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by '
               "'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fdad48c4f10>, 'Connection to <http://pypi.org|pypi.org> timed out. "
               "(connect timeout=15)')': /simple/wheel/\n"
               'WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by '
               "'ConnectTimeoutError(<pip
    ✅ 2
    • 1
    • 3
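    A minimal sketch of what the Kubernetes equivalent of that Docker workaround might look like, assuming the Helm chart exposes extraEnvs for the actions/ingestion pod. The key names and proxy address below are assumptions, so verify them against your chart version.
    # values.yaml (sketch): route outbound pip traffic through the corporate proxy
    acryl-datahub-actions:
        extraEnvs:
            - name: HTTP_PROXY
              value: "http://proxy.example.internal:3128"   # assumed proxy address
            - name: HTTPS_PROXY
              value: "http://proxy.example.internal:3128"
            - name: NO_PROXY
              value: "datahub-datahub-gms,localhost,127.0.0.1"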
  • b

    best-umbrella-88325

    01/09/2023, 9:56 AM
    Hi Community! I've been trying to schedule ingestion via the ingestion-cron pod, but the pod never gets scheduled. Is anything going wrong here? Any help is appreciated. Section from the values.yaml file:
    Copy code
    datahub-ingestion-cron:
      enabled: true
      crons:
        s3:
          schedule: "* * * * *" # Every Minute
          recipe:
            configmapName: s3-ingestion
            fileName: s3-ingestion.yaml
      image:
        repository: acryldata/datahub-ingestion
        tag: "v0.9.5"
    Config Map:
    Copy code
    Data
    ====
    s3-ingestion.yaml:
    ----
    source:
        type: "s3"
        config:
            path_spec:
                include: 's3://*****/datafiles/*.*'
            platform: s3
            aws_config:
                aws_access_key_id: ******
                aws_region: us-west-1
                aws_secret_access_key: *****
    sink:
        type: "datahub-rest"
        config:
            server: 'http://datahub-datahub-gms:8080'
    
    
    
    BinaryData
    ====
    
    Events:  <none>
    s3-ingestion.yaml
    Copy code
    source:
        type: "s3"
        config:
            path_spec:
                include: 's3://****/datafiles/*.*'
            platform: s3
            aws_config:
                aws_access_key_id: *****
                aws_region: us-west-1
                aws_secret_access_key: *****
    sink:
        type: "datahub-rest"
        config:
            server: 'http://datahub-datahub-gms:8080'
    Command used to create config map
    Copy code
    kubectl create configmap s3-ingestion --from-file=s3-ingestion.yaml
    👀 1
    ✅ 1
    i
    • 2
    • 16
  • r

    refined-tent-35319

    01/09/2023, 9:32 AM
    I am trying to ingest Redshift private-cluster data into DataHub (deployed on Amazon EKS) and am getting an error. Can anybody help me understand why this is happening? Attaching the log.
    exec-urn_li_dataHubExecutionRequest_9fd5b939-ecf6-4b41-9881-c84c5ac65285.log
    ✅ 1
    m
    • 2
    • 8
  • s

    salmon-motorcycle-36881

    01/09/2023, 11:36 AM
    Hi, my organisation has begun using DataHub; we ingest data from Snowflake.
  • s

    salmon-motorcycle-36881

    01/09/2023, 11:43 AM
    Hi, my organisation has begun using DataHub; we ingest data from Snowflake. We have enabled data profiling for our Snowflake tables, but we have noticed that some of these profiling queries take a large amount of time (and therefore processing cost), which also means that ingestion recipes do not complete. One thing I have noticed concerns the calculation of the median. DataHub uses the following query: SELECT <<Column>> FROM <<Table>> WHERE <<Column>> IS NOT NULL ORDER BY <<Column>> LIMIT 2 OFFSET <<rows in table / 2>>. For an example table in our database with 500 million rows, this query takes about 8 minutes, while the MEDIAN function on the same table takes about 10-15 seconds. Is there anything we can do about this? (A hedged profiling-config sketch follows this message.)
    👍 1
    ✅ 1
    a
    g
    • 3
    • 2
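    A hedged sketch of how the expensive median computation can be skipped, assuming the GE-based profiling flags include_field_median_value and turn_off_expensive_profiling_metrics are available in the installed CLI version; treat it as a starting point rather than a confirmed fix.
    source:
        type: snowflake
        config:
            # ...connection details omitted...
            profiling:
                enabled: true
                include_field_median_value: false           # skip the ORDER BY ... LIMIT 2 OFFSET n/2 query
                turn_off_expensive_profiling_metrics: true  # also disable other costly per-column metrics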
  • g

    gorgeous-memory-27579

    01/09/2023, 4:45 PM
    Just out of curiosity, has anyone worked on implementing a source for OneDrive/SharePoint files from a shared drive or site? Is implementing a custom source the way to go here, or do folks recommend using the REST emitter or file-based sources as an alternative? The metadata will be pretty simple.
    ✅ 1
    a
    • 2
    • 1
  • a

    ambitious-room-6707

    01/09/2023, 4:41 PM
    Hi, I ingested a data source with the push-based method, setting the data owner to a non-existent group in the recipe file. An entity page was created for the group once the ingestion completed, but it's not found in the overall list of groups, and editing the group gives the error 'Group does not exist'. Has anyone seen a similar outcome, and were you able to resolve it? Much thanks :-)
    g
    h
    b
    • 4
    • 8
  • a

    adorable-summer-43339

    01/10/2023, 1:37 AM
    Hello, I'm planning to introduce DataHub. I want to extract glossary terms from our existing metadata systems as CSV and load them into DataHub. Is there a place where I can see sample data for such a CSV file? The official DataHub documents are not very clear on this. (A hedged glossary-file sketch follows this message.)
    h
    • 2
    • 1
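    If a CSV sample is hard to find, one option is the YAML-based business glossary source. A minimal hedged sketch of the glossary file and the recipe referencing it is below; the structure and the source type name (datahub-business-glossary) are recalled from the docs rather than from this thread, so verify them before use.
    # business_glossary.yml (sketch)
    version: 1
    source: DataHub
    owners:
        users:
            - datahub
    nodes:
        - name: Classification
          description: Terms migrated from the legacy metadata system
          terms:
              - name: Sensitive
                description: Data that requires restricted access
    # recipe referencing the file above
    source:
        type: datahub-business-glossary
        config:
            file: ./business_glossary.yml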
  • f

    fresh-processor-63024

    01/10/2023, 5:34 AM
    Hi, I'm using DB profiling with the limit option to reduce DB overhead:
    Copy code
    profiling:
        enabled: true
        limit: 5000
    But if I use the limit option, the reported row count is set to the limit value. Is this normal?
    ✅ 1
    h
    • 2
    • 3
  • m

    microscopic-machine-90437

    01/10/2023, 6:26 AM
    Hi Team, please help me with dbt ingestion. I'm trying to ingest dbt metadata using a DataHub recipe. I have deployed DataHub using docker-compose and placed the dbt artifacts in the container, so I have given the Docker container paths in the recipe as below:
    source:
        type: dbt
        config:
            manifest_path: /etc/datahub/dbt_metadata/manifest.json
            test_results_path: /etc/datahub/dbt_metadata/run_results.json
            sources_path: /etc/datahub/dbt_metadata/sources_file.json
            target_platform: snowflake
            catalog_path: /etc/datahub/dbt_metadata/catalog_file.json
    sink:
        type: datahub-rest
        config:
            server: 'http://us01vlprdedphub:9002/'
    But I'm not able to run it successfully. Attached is the error log.
    exec-urn_li_dataHubExecutionRequest_9334df14-ceb7-4db4-8f13-87b99e336f6b.log
    h
    a
    • 3
    • 8
  • p

    polite-actor-701

    01/10/2023, 7:55 AM
    Hi everyone, I have a question. I ingested metadata from Oracle, and all the data was stored in MySQL. Then I ingested data from Tableau while the Oracle data was still being indexed in ES. When the Tableau data had all been saved to MySQL, ES was still indexing the Oracle data. Contrary to my expectations, after all the Oracle data was indexed, the Tableau data was not indexed. So I ingested the Tableau data again, but it still wasn't indexed. Then I performed ingestion from another Oracle DB and confirmed that that data was indexed in ES. Is this a bug? Is there a way to manually trigger indexing in ES?
    h
    • 2
    • 5
  • f

    fresh-processor-63024

    01/10/2023, 8:10 AM
    Hi team, when I use Oracle ingestion, upper-case table names are converted to lower case; the real table name is RV120.
    ✅ 1
    h
    • 2
    • 7
  • c

    cool-tiger-42613

    01/10/2023, 10:55 AM
    Hi, other than the CLI, what is the recommended option for a hard delete of all the contents (flows, datasets, tasks, etc.)?
    👀 1
    ✅ 1
    g
    a
    • 3
    • 6
  • r

    rich-policeman-92383

    01/10/2023, 10:27 AM
    Hello, please suggest why the Hive ingestion run is trying to connect to api.github.com:443. We run ingestion from air-gapped Linux servers, and because of this the job remains stuck. DataHub version: v0.9.5, CLI version: v0.9.1.
    h
    • 2
    • 16
  • l

    lively-engine-55407

    01/10/2023, 12:49 PM
    Hi guys! I'm trying to run a Power BI ingestion; the execution is successful, but I don't get the assets.
    h
    • 2
    • 2
  • a

    alert-fall-82501

    01/10/2023, 1:47 PM
    Hi Team - I am working on importing Airflow DAG jobs into DataHub. I have gone through the procedure mentioned in the documentation, but I'm getting a bug. Can anybody help me with this?
    a
    d
    • 3
    • 4
  • h

    hallowed-lizard-92381

    01/10/2023, 4:55 PM
    Hey y'all, curious if anyone has advice on nicely coded programmatic pipelines… Ideally, we don't want individual functions for every pipeline, like so…
    Copy code
    USERNAME = ***
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {
                    "username": "user",
                    "password": "pass",
                    "database": "db_name",
                    "host_port": "localhost:3306",
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "<http://localhost:8080>"},
            },
        }
    )
    
    # Run the pipeline and report the results.
    pipeline.run()
    pipeline.pretty_print_summary()
  • h

    hallowed-lizard-92381

    01/10/2023, 4:56 PM
    Hey y'all, curious if anyone has advice on nicely coded programmatic pipelines… Ideally, we don't want individual functions for every pipeline, like so…
    Copy code
    USERNAME = ***
    PASS = ***
    DB_NAME = ***
    HOST = ***
    
    def pipeline1():
        pipeline = Pipeline.create(
            {
                "source": {
                    "type": "mysql",
                    "config": {
                        "username": USERNAME,
                        "password": PASS,
                        "database": DB_NAME,
                        "host_port": HOST,
                    },
                },
                "sink": {
                    "type": "datahub-rest",
                    "config": {"server": "http://localhost:8080"},
                },
            }
        )
    
        # Run the pipeline and report the results.
        pipeline.run()
        pipeline.pretty_print_summary()
  • h

    hallowed-lizard-92381

    01/10/2023, 5:00 PM
    Hey y'all, curious if anyone has advice on nicely coded programmatic pipelines… Ideally, we don't want individual functions for every pipeline, like so…
    Copy code
    USERNAME = ***
    PASS = ***
    DB_NAME = ***
    HOST = ***
    
    def pipeline1():
       pipeline = Pipeline.create(
           {
            "source": {
                "type": "mysql",
                "config": {
                    "username": USERNAME,
                    "password": PASS,
                    "database": DB_NAME,
                    "host_port": HOST,
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "<http://localhost:8080>"},
              },
          }
        )
    
       # Run the pipeline and report the results.
       pipeline.run()
       pipeline.pretty_print_summary()
    instead it would be nice to have
    Copy code
    def run_pipeline(pipeline_str, **kwargs):
        # Fill credentials/host into the recipe template, then build and run it.
        pipeline = Pipeline.create(json.loads(pipeline_str.format(**kwargs)))
        pipeline.run()
        pipeline.pretty_print_summary()
    
    Invoked by...
    run_pipeline('''{
            "source": {
                "type": "mysql",
                "config": {
                    "username": USERNAME,
                    "password": PASS,
                    "database": DB_NAME,
                    "host_port": HOST,
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "<http://localhost:8080>"},
              },
          }''')
    Anybody doing this?
    👀 1
    ✅ 1
    a
    g
    • 3
    • 4
  • b

    bland-lighter-26751

    01/10/2023, 5:11 PM
    Hello, I'm noticing that DataHub is not removing tables that were deleted in BigQuery. For example, I deleted a table 6 days ago, and the UI shows that the table was last synchronized 6 days ago, but it won't be removed. Am I missing something in my YAML? (A hedged note on pipeline_name follows the recipe.)
    Copy code
    source:
        type: bigquery
        config:
            include_table_lineage: true
            include_usage_statistics: true
            include_tables: true
            include_views: true
            profiling:
                enabled: true
                profile_table_level_only: false
            stateful_ingestion:
                enabled: true
                remove_stale_metadata: true
            credential:
                project_id: study-342717
    👀 1
    ✅ 1
    a
    d
    g
    • 4
    • 11
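    One hedged thing to check, rather than a confirmed diagnosis: when a recipe is run from the CLI, stateful ingestion matches state between runs using a stable pipeline_name at the top level of the recipe (UI-managed ingestion sets one automatically). A sketch:
    pipeline_name: bigquery_prod_ingestion   # any stable identifier, kept constant across runs
    source:
        type: bigquery
        config:
            stateful_ingestion:
                enabled: true
                remove_stale_metadata: true
            # ...rest of the config as in the recipe above...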
  • m

    microscopic-carpet-71950

    01/10/2023, 5:22 PM
    @bulky-soccer-26729 I was reviewing https://www.youtube.com/watch?v=FjkNySWkghY&t=2475s and wondering whether you are now able to automatically parse SQL to extract column-level lineage information. We use Presto/Athena for our data store, and it wasn't clear whether we'd have to write a parser/API ourselves or could use some open-source pre-built options... or is all of that only available in the hosted solution?
    b
    • 2
    • 1
  • p

    plain-cricket-83456

    01/11/2023, 2:38 AM
    Hello. If I deprecate a glossary term set, it's not supposed to be applied when I ingest it again, but deprecating the glossary term set doesn't work; I want to know why. I'm using version 0.8.41, by the way.
    ✅ 1
    👀 1
    a
    • 2
    • 2
  • r

    rich-policeman-92383

    01/11/2023, 7:37 AM
    Hello, domains created by the simple_add_dataset_domain transformer are not visible in the Govern > Domains view. DataHub version: v0.9.5, CLI: v0.9.5.
    Copy code
    transformers:
          - type: "simple_add_dataset_domain"
            config:
              replace_existing: true  # false is default behaviour
              domains:
                - "urn:li:domain:engineering"
                - "urn:li:domain:hr"
    h
    • 2
    • 1
  • a

    astonishing-cartoon-6079

    01/11/2023, 8:52 AM
    #ingestion Hi team, I am integrating DataHub with Hive, and we ingest the Hive metadata every day. I found that the datasetProperties aspect has 20 versions, and these daily-updated Hive tables generate a new datasetProperties version once a day. How can I clean up those old aspect versions? I can't find any related docs. Should I use some tool, or run DELETE SQL directly to clean up old aspect versions?
    ✅ 1
    👀 1
    m
    a
    • 3
    • 5
  • b

    better-orange-49102

    01/11/2023, 9:16 AM
    Using v0.8.45, I find that platform instance information isn't visible in the UI for containers, only for datasets. Just to confirm, I only need to specify the instance inside containerProperties, right?
    Copy code
    {
      "auditHeader": null,
      "entityType": "container",
      "entityUrn": "urn:li:container:19c4d1f6538241d930dba76ede90e9a9",
      "entityKeyAspect": null,
      "changeType": "UPSERT",
      "aspectName": "containerProperties",
      "aspect": {
        "value": "{\"customProperties\": {\"platform\": \"mysql\", \"instance\": \"mycustomMySQL\", \"database\": \"datahub\"}, \"name\": \"datahub\"}",
        "contentType": "application/json"
      },
      "systemMetadata": {
        "lastObserved": 1673423105823,
        "runId": "mysql-2023_01_11-15_45_03",
        "registryName": null,
        "registryVersion": null,
        "properties": null
      }
    }
    ✅ 1
    h
    • 2
    • 10
  • r

    refined-hamburger-93459

    01/11/2023, 9:57 AM
    Hi all, I want to ingest from MongoDB. The configuration succeeded, but when I check the dataset there is no data (just the schema). Can anyone help me, please? Thanks!!
    ✅ 1
    f
    • 2
    • 2
  • p

    plain-cricket-83456

    01/11/2023, 10:10 AM
    Hello, is there a way to clear tags and glossary terms before and after data ingestion?
    👀 1
    ✅ 1
    a
    • 2
    • 3
  • m

    magnificent-lock-58916

    01/11/2023, 11:03 AM
    Hello! In the comments of the feature request about Tableau ingestion, someone stated that there is a problem with Tableau ingestion if the source contains folders with the same name (e.g. sub-folders called “Drafts” in each project folder). Has this issue been solved in a DataHub version that includes Tableau stateful ingestion? I'm wondering because we're currently running on an older version without it.
    ✅ 1
    👀 1
    a
    • 2
    • 1