# ingestion
  • a

    alert-fall-82501

    01/09/2023, 7:04 AM
    check the config file in thread
  • c

    curved-planet-99787

    01/09/2023, 7:37 AM
    Hi, is there a reason why s3_staging_dir is used as a parameter name in the Athena source recipe? For me it is rather unintuitive, and I would expect something like query_result_location, since that is also the term AWS uses. I'm also interested in what others in the community think about this. I could try to come up with a PR to change this if you agree with my suggestion. (A minimal recipe sketch follows this message.)
    👀 1
    a
    d
    • 3
    • 7
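    For context, a minimal sketch of an Athena recipe using the current parameter name. Only s3_staging_dir is taken from the message above; the other keys and values are illustrative assumptions and should be checked against the Athena source docs.
    source:
        type: athena
        config:
            aws_region: us-east-1                      # assumed example value
            work_group: primary                        # assumed example value
            s3_staging_dir: "s3://my-athena-results/"  # current name; query_result_location is the proposed alternative
    sink:
        type: datahub-rest
        config:
            server: "http://localhost:8080"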
  • a

    aloof-energy-17918

    01/09/2023, 9:31 AM
    Hi all, I'm having a connection problem to pypi.org when trying to add a new ingestion source. I think this is because I'm behind a proxy. I saw a solution posted previously here: https://datahubspace.slack.com/archives/C029A3M079U/p1647507748598729?thread_ts=1646928810.396579&cid=C029A3M079U However, that solution is for a Docker deployment, while I'm using K8s. I'm not really familiar with K8s; I know it has something to do with ConfigMaps, volumes, and volumeMounts, but I'm not exactly sure how the syntax works. Could anyone give me some guidance on this? (A hedged Helm values sketch follows the log below.)
    Copy code
    "'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fdad48c4d90>, 'Connection to <http://pypi.org|pypi.org> timed out. "
               "(connect timeout=15)')': /simple/wheel/\n"
               'WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by '
               "'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fdad48c4f10>, 'Connection to <http://pypi.org|pypi.org> timed out. "
               "(connect timeout=15)')': /simple/wheel/\n"
               'WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by '
               "'ConnectTimeoutError(<pip
    ✅ 2
    • 1
    • 3
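    A minimal sketch of what the Kubernetes equivalent of that Docker workaround might look like, assuming the Helm chart exposes extraEnvs for the actions/ingestion pod. The key names and proxy address below are assumptions, so verify them against your chart version.
    # values.yaml (sketch): route outbound pip traffic through the corporate proxy
    acryl-datahub-actions:
        extraEnvs:
            - name: HTTP_PROXY
              value: "http://proxy.example.internal:3128"   # assumed proxy address
            - name: HTTPS_PROXY
              value: "http://proxy.example.internal:3128"
            - name: NO_PROXY
              value: "datahub-datahub-gms,localhost,127.0.0.1"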
  • b

    best-umbrella-88325

    01/09/2023, 9:56 AM
    Hi Community! I've been trying to schedule ingestion via the ingestion-cron pod, but the pod never gets scheduled. Is anything going wrong here? Any help is appreciated. Section from the values.yaml file:
    Copy code
    datahub-ingestion-cron:
      enabled: true
      crons:
        s3:
          schedule: "* * * * *" # Every Minute
          recipe:
            configmapName: s3-ingestion
            fileName: s3-ingestion.yaml
      image:
        repository: acryldata/datahub-ingestion
        tag: "v0.9.5"
    Config Map:
    Copy code
    Data
    ====
    s3-ingestion.yaml:
    ----
    source:
        type: "s3"
        config:
            path_spec:
                include: 's3://*****/datafiles/*.*'
            platform: s3
            aws_config:
                aws_access_key_id: ******
                aws_region: us-west-1
                aws_secret_access_key: *****
    sink:
        type: "datahub-rest"
        config:
            server: 'http://datahub-datahub-gms:8080'
    
    
    
    BinaryData
    ====
    
    Events:  <none>
    s3-ingestion.yaml
    Copy code
    source:
        type: "s3"
        config:
            path_spec:
                include: 's3://****/datafiles/*.*'
            platform: s3
            aws_config:
                aws_access_key_id: *****
                aws_region: us-west-1
                aws_secret_access_key: *****
    sink:
        type: "datahub-rest"
        config:
            server: 'http://datahub-datahub-gms:8080'
    Command used to create config map
    Copy code
    kubectl create configmap s3-ingestion --from-file=s3-ingestion.yaml
    👀 1
    ✅ 1
    i
    • 2
    • 16
  • r

    refined-tent-35319

    01/09/2023, 9:32 AM
    I am trying to ingest Redshift private-cluster data into DataHub (deployed on Amazon EKS) and am getting an error. Can anybody help me understand why this is happening? Attaching the log.
    exec-urn_li_dataHubExecutionRequest_9fd5b939-ecf6-4b41-9881-c84c5ac65285.log
    ✅ 1
    m
    • 2
    • 8
  • s

    salmon-motorcycle-36881

    01/09/2023, 11:36 AM
    Hi, my organisation has begun using DataHub; we ingest data from Snowflake.
  • s

    salmon-motorcycle-36881

    01/09/2023, 11:43 AM
    Hi, my organisation has begun using DataHub; we ingest data from Snowflake. We have enabled data profiling for our Snowflake tables, but we have noticed that some of these profiling queries take a large amount of time (and therefore processing cost), which also means that ingestion recipes do not complete. One thing I have noticed concerns the calculation of the median. DataHub uses the following query: SELECT <<Column>> FROM <<Table>> WHERE <<Column>> IS NOT NULL ORDER BY <<Column>> LIMIT 2 OFFSET <<rows in table / 2>>. For an example table in our database with 500 million rows, this query takes about 8 minutes, while the MEDIAN function on the same table takes about 10-15 seconds. Is there anything we can do about this? (A hedged profiling-config sketch follows this message.)
    👍 1
    ✅ 1
    a
    g
    • 3
    • 2
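    A hedged sketch of how the expensive median computation can be skipped, assuming the GE-based profiling flags include_field_median_value and turn_off_expensive_profiling_metrics are available in the installed CLI version; treat it as a starting point rather than a confirmed fix.
    source:
        type: snowflake
        config:
            # ...connection details omitted...
            profiling:
                enabled: true
                include_field_median_value: false           # skip the ORDER BY ... LIMIT 2 OFFSET n/2 query
                turn_off_expensive_profiling_metrics: true  # also disable other costly per-column metrics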
  • g

    gorgeous-memory-27579

    01/09/2023, 4:45 PM
    Just out of curiosity, has anyone worked on implementing a source for OneDrive/SharePoint files from a shared drive or site? Is implementing a custom source the way to go here, or do folks recommend using the REST emitter or file-based sources as an alternative? The metadata will be pretty simple.
    ✅ 1
    a
    • 2
    • 1
  • a

    ambitious-room-6707

    01/09/2023, 4:41 PM
    Hi, I ingested a data source with the push-based method, setting the data owner to a non-existent group in the recipe file. An entity page was created for the group once the ingestion completed, but it's not found in the overall list of groups, and editing the group gives the error 'Group does not exist'. Has anyone seen a similar outcome, and were you able to resolve it? Much thanks :-)
    g
    h
    b
    • 4
    • 8
  • a

    adorable-summer-43339

    01/10/2023, 1:37 AM
    Hello, I'm planning to introduce DataHub. I want to extract glossary terms from our existing metadata systems as CSV and load them into DataHub. Is there a place where I can see sample data for such a CSV file? The official DataHub documents are not very clear on this. (A hedged glossary-file sketch follows this message.)
    h
    • 2
    • 1
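    If a CSV sample is hard to find, one option is the YAML-based business glossary source. A minimal hedged sketch of the glossary file and the recipe referencing it is below; the structure and the source type name (datahub-business-glossary) are recalled from the docs rather than from this thread, so verify them before use.
    # business_glossary.yml (sketch)
    version: 1
    source: DataHub
    owners:
        users:
            - datahub
    nodes:
        - name: Classification
          description: Terms migrated from the legacy metadata system
          terms:
              - name: Sensitive
                description: Data that requires restricted access
    # recipe referencing the file above
    source:
        type: datahub-business-glossary
        config:
            file: ./business_glossary.yml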
  • f

    fresh-processor-63024

    01/10/2023, 5:34 AM
    Hi, I'm using DB profiling with the limit option to reduce DB overhead:
    Copy code
    profiling:
        enabled: true
        limit: 5000
    But if I use the limit option, the reported row count is set to the limit value. Is this normal?
    ✅ 1
    h
    • 2
    • 3
  • m

    microscopic-machine-90437

    01/10/2023, 6:26 AM
    Hi Team, please help me with dbt ingestion. I'm trying to ingest dbt metadata using a DataHub recipe. I have deployed DataHub using docker-compose and placed the dbt artifacts in the container, so I have given the Docker container paths in the recipe as below:
    source:
        type: dbt
        config:
            manifest_path: /etc/datahub/dbt_metadata/manifest.json
            test_results_path: /etc/datahub/dbt_metadata/run_results.json
            sources_path: /etc/datahub/dbt_metadata/sources_file.json
            target_platform: snowflake
            catalog_path: /etc/datahub/dbt_metadata/catalog_file.json
    sink:
        type: datahub-rest
        config:
            server: 'http://us01vlprdedphub:9002/'
    But I'm not able to run it successfully. Attached is the error log.
    exec-urn_li_dataHubExecutionRequest_9334df14-ceb7-4db4-8f13-87b99e336f6b.log
    h
    a
    • 3
    • 8
  • p

    polite-actor-701

    01/10/2023, 7:55 AM
    Hi everyone, I have a question. I ingested metadata from Oracle, and all the data was stored in MySQL. Then I ingested data from Tableau while the Oracle data was still being indexed in ES. When the Tableau data had all been saved to MySQL, ES was still indexing the Oracle data. Contrary to my expectations, after all the Oracle data was indexed, the Tableau data was not indexed. So I ingested the Tableau data again, but it still wasn't indexed. Then I performed ingestion from another Oracle DB and confirmed that that data was indexed in ES. Is this a bug? Is there a way to manually trigger indexing in ES?
    h
    • 2
    • 5
  • f

    fresh-processor-63024

    01/10/2023, 8:10 AM
    Hi team, when I use Oracle ingestion, upper-case table names are converted to lower case; the real table name is RV120.
    ✅ 1
    h
    • 2
    • 7
  • c

    cool-tiger-42613

    01/10/2023, 10:55 AM
    Hi, other than the CLI, what is the recommended option for a hard delete of all the contents (flows, datasets, tasks, etc.)?
    👀 1
    ✅ 1
    g
    a
    • 3
    • 6
  • r

    rich-policeman-92383

    01/10/2023, 10:27 AM
    Hello, please suggest why the Hive ingestion run is trying to connect to api.github.com:443. We run ingestion from air-gapped Linux servers, and because of this the job remains stuck. DataHub version: v0.9.5, CLI version: v0.9.1.
    h
    • 2
    • 16
  • l

    lively-engine-55407

    01/10/2023, 12:49 PM
    Hi guys! I'm trying to run a Power BI ingestion; the execution is successful, but I don't get the assets.
    h
    • 2
    • 2
  • a

    alert-fall-82501

    01/10/2023, 1:47 PM
    Hi Team - I am working on importing Airflow DAG jobs into DataHub. I have gone through the procedure mentioned in the documentation, but I'm getting a bug. Can anybody help me with this?
    a
    d
    • 3
    • 4
  • h

    hallowed-lizard-92381

    01/10/2023, 4:55 PM
    Hey y'all, curious if anyone has advice on nicely coded programmatic pipelines… Ideally, we don't want individual functions for every pipeline, like so…
    Copy code
    USERNAME = ***
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {
                    "username": "user",
                    "password": "pass",
                    "database": "db_name",
                    "host_port": "localhost:3306",
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "<http://localhost:8080>"},
            },
        }
    )
    
    # Run the pipeline and report the results.
    pipeline.run()
    pipeline.pretty_print_summary()
  • h

    hallowed-lizard-92381

    01/10/2023, 4:56 PM
    Hey y'all, curious if anyone has advice on nicely coded programmatic pipelines… Ideally, we don't want individual functions for every pipeline, like so…
    Copy code
    USERNAME = ***
    PASS = ***
    DB_NAME = ***
    HOST = ***
    
    def pipeline1():
        pipeline = Pipeline.create(
            {
                "source": {
                    "type": "mysql",
                    "config": {
                        "username": USERNAME,
                        "password": PASS,
                        "database": DB_NAME,
                        "host_port": HOST,
                    },
                },
                "sink": {
                    "type": "datahub-rest",
                    "config": {"server": "http://localhost:8080"},
                },
            }
        )
    
        # Run the pipeline and report the results.
        pipeline.run()
        pipeline.pretty_print_summary()
  • h

    hallowed-lizard-92381

    01/10/2023, 5:00 PM
    Hey y'all, curious if anyone has advice on nicely coded programmatic pipelines… Ideally, we don't want individual functions for every pipeline, like so…
    Copy code
    USERNAME = ***
    PASS = ***
    DB_NAME = ***
    HOST = ***
    
    def pipeline1():
       pipeline = Pipeline.create(
           {
            "source": {
                "type": "mysql",
                "config": {
                    "username": USERNAME,
                    "password": PASS,
                    "database": DB_NAME,
                    "host_port": HOST,
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "<http://localhost:8080>"},
              },
          }
        )
    
       # Run the pipeline and report the results.
       pipeline.run()
       pipeline.pretty_print_summary()
    instead it would be nice to have
    Copy code
    def run_pipeline(pipeline_str, **kwargs):
        # Fill credentials/host into the recipe template, then build and run it.
        pipeline = Pipeline.create(json.loads(pipeline_str.format(**kwargs)))
        pipeline.run()
        pipeline.pretty_print_summary()
    
    Invoked by...
    run_pipeline('''{
            "source": {
                "type": "mysql",
                "config": {
                    "username": USERNAME,
                    "password": PASS,
                    "database": DB_NAME,
                    "host_port": HOST,
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "<http://localhost:8080>"},
              },
          }''')
    Anybody doing this?
    👀 1
    ✅ 1
    a
    g
    • 3
    • 4
  • b

    bland-lighter-26751

    01/10/2023, 5:11 PM
    Hello, I'm noticing that DataHub is not removing tables that were deleted in BigQuery. For example, I deleted a table 6 days ago, and the UI shows that the table was last synchronized 6 days ago, but it won't be removed. Am I missing something in my YAML? (A hedged note on pipeline_name follows the recipe.)
    Copy code
    source:
        type: bigquery
        config:
            include_table_lineage: true
            include_usage_statistics: true
            include_tables: true
            include_views: true
            profiling:
                enabled: true
                profile_table_level_only: false
            stateful_ingestion:
                enabled: true
                remove_stale_metadata: true
            credential:
                project_id: study-342717
    👀 1
    ✅ 1
    a
    d
    g
    • 4
    • 11
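    One hedged thing to check, rather than a confirmed diagnosis: when a recipe is run from the CLI, stateful ingestion matches state between runs using a stable pipeline_name at the top level of the recipe (UI-managed ingestion sets one automatically). A sketch:
    pipeline_name: bigquery_prod_ingestion   # any stable identifier, kept constant across runs
    source:
        type: bigquery
        config:
            stateful_ingestion:
                enabled: true
                remove_stale_metadata: true
            # ...rest of the config as in the recipe above...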
  • m

    microscopic-carpet-71950

    01/10/2023, 5:22 PM
    @bulky-soccer-26729 I was reviewing https://www.youtube.com/watch?v=FjkNySWkghY&t=2475s and wondering whether you are now able to automatically parse SQL to extract column-level lineage information. We use Presto/Athena for our data store, and it wasn't clear whether we'd have to write a parser/API ourselves or could use some open-source pre-built options... or is all of that only available in the hosted solution?
    b
    • 2
    • 1
  • p

    plain-cricket-83456

    01/11/2023, 2:38 AM
    Hello. If I deprecate a glossary term set, it's not supposed to be applied when I ingest it again, but deprecating the glossary term set doesn't work; I want to know why. I'm using version 0.8.41, by the way.
    ✅ 1
    👀 1
    a
    • 2
    • 2
  • r

    rich-policeman-92383

    01/11/2023, 7:37 AM
    Hello, domains created by the simple_add_dataset_domain transformer are not visible in the Govern > Domains view. DataHub version: v0.9.5, CLI: v0.9.5.
    Copy code
    transformers:
          - type: "simple_add_dataset_domain"
            config:
              replace_existing: true  # false is default behaviour
              domains:
                - "urn:li:domain:engineering"
                - "urn:li:domain:hr"
    h
    • 2
    • 1
  • a

    astonishing-cartoon-6079

    01/11/2023, 8:52 AM
    #ingestion Hi team, I am integrating DataHub with Hive, and we ingest the Hive metadata every day. I found that the datasetProperties aspect has 20 versions, and these daily-updated Hive tables generate a new datasetProperties version once a day. How can I clean up those old aspect versions? I can't find any related docs. Should I use some tool, or run DELETE SQL directly to clean up old aspect versions?
    ✅ 1
    👀 1
    m
    a
    • 3
    • 5
  • b

    better-orange-49102

    01/11/2023, 9:16 AM
    Using v0.8.45, I find that platform instance information isn't visible in the UI for containers, only for datasets. Just to confirm, I only need to specify the instance inside containerProperties, right?
    Copy code
    {
      "auditHeader": null,
      "entityType": "container",
      "entityUrn": "urn:li:container:19c4d1f6538241d930dba76ede90e9a9",
      "entityKeyAspect": null,
      "changeType": "UPSERT",
      "aspectName": "containerProperties",
      "aspect": {
        "value": "{\"customProperties\": {\"platform\": \"mysql\", \"instance\": \"mycustomMySQL\", \"database\": \"datahub\"}, \"name\": \"datahub\"}",
        "contentType": "application/json"
      },
      "systemMetadata": {
        "lastObserved": 1673423105823,
        "runId": "mysql-2023_01_11-15_45_03",
        "registryName": null,
        "registryVersion": null,
        "properties": null
      }
    }
    ✅ 1
    h
    • 2
    • 10
  • r

    refined-hamburger-93459

    01/11/2023, 9:57 AM
    Hi all, I want to ingest from MongoDB. The configuration succeeded, but when I check the dataset there is no data (just the schema). Can anyone help me, please? Thanks!!
    ✅ 1
    f
    • 2
    • 2
  • p

    plain-cricket-83456

    01/11/2023, 10:10 AM
    Hello, is there a way to clear tags and glossary terms before and after data ingestion?
    👀 1
    ✅ 1
    a
    • 2
    • 3
  • m

    magnificent-lock-58916

    01/11/2023, 11:03 AM
    Hello! In the comments of the feature request about Tableau ingestion, someone stated that there is a problem with Tableau ingestion if the source contains folders with the same name (e.g. sub-folders called “Drafts” in each project folder). Has this issue been solved in a DataHub version that includes Tableau stateful ingestion? I'm wondering because we're currently running on an older version without it.
    ✅ 1
    👀 1
    a
    • 2
    • 1