# ingestion

    average-rocket-98592

    03/23/2023, 3:50 PM
    Hi, question about Oracle ingestion. Can someone tell me exactly which permissions the Oracle user used for ingestion needs?

    strong-hospital-52301

    03/23/2023, 6:53 PM
    Hello! I'm trying to synchronize the data that I have on a local instance of MySQL with the docker container DataHub, but it stays frozen in this state. The MySQL instance runs on localhost:3306 and it's MySQL version 8.0.32 (I attach a picture). My ingestion file is the following:

    numerous-byte-87938

    03/23/2023, 10:37 PM
    Hi friends, we were trying to upgrade our fork from v0.8.35 to v0.8.45, but were not able to make a successful ES update due to a document_missing_exception error (logs in 🧵). There’s a good chance that we missed something during the upgrade, and it would be super helpful if you could offer some insights. Here are some extra notes:
    • We had made sure our branch was on v0.8.45 (i.e. including #5827, per this thread and the v0.8.44 release notes), and our standalone MXE consumers looked functional according to their logs.
    • The errors were coming from our GMS pods, right after bootstrap.
    • Monitoring the ES indices, we could not see docs.count increase for any index.
    • MCPs were able to land in MySQL without issue.
    • The ES cluster we used for testing was restored from a prod snapshot without any options.
    ✅ 1

    full-football-39857

    03/24/2023, 1:47 AM
    Hi everyone, I am new to DataHub. We have an issue ingesting an Oracle DB with table type: Normal. Specifically, DataHub still gets metadata for the other table types from the Oracle DB, but only table type Normal fails to get metadata. Could you please help us find the reason for this issue? Thanks.

    happy-chef-34162

    03/24/2023, 5:12 AM
    Hi, everyone. I want to create database table ERD information using DataHub from JPA (Java). Is it possible to extract ERD information from JPA? Thanks!
    ✅ 1

    agreeable-cricket-61480

    03/24/2023, 11:14 AM
    Can someone share DAG code that shows lineage like this? More importantly, it has to navigate to the dataset details page when we click on 'Dataset details', and it has to show columns.

    wide-optician-47025

    03/24/2023, 8:06 PM
    hello, for MariaDB, the only way I have it working is by using the sqlalchemy_uri parameter. I tried to use a variable for the password embedded within the sqlalchemy_uri, but it was unable to parse it: sqlalchemy_uri: 'mysql+pymysql://rds_user:xxxx@host:3306/dbname'. This is an issue as it exposes the password. Any suggestions or future enhancements?
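    One workaround that may help: DataHub recipes support `${VAR}` environment-variable expansion at load time, so the secret never has to appear in the file. A sketch (the variable name MARIADB_PASSWORD is my own; use whatever you export):

```yaml
source:
  type: mariadb
  config:
    # ${MARIADB_PASSWORD} is read from the environment when the recipe is
    # loaded, so the recipe file itself never contains the secret
    sqlalchemy_uri: "mysql+pymysql://rds_user:${MARIADB_PASSWORD}@host:3306/dbname"
```

    Export the variable in the shell (or ingestion executor environment) before running `datahub ingest`.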

    best-planet-6756

    03/25/2023, 5:47 AM
    Hey all, I’m facing an issue: when I ingest an Oracle DB with profiling enabled, it does not drop the temp tables created for profiling. The ingestion is successful with no issues. Any ideas?

    ambitious-apple-49350

    03/27/2023, 6:21 AM
    Hi! Fairly new to datahub, but very excited about it! I would like to record complete lineage of some of our assets. That would include some OpenData Downloads (mostly CSV Files) which are freely downloadable. Is there a basic recipe/template for downloadable assets?
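    I'm not aware of a dedicated "download" source; one hedged approach is to download the files first and point the s3 / local-files data lake source at them. A sketch (the local path and `platform` value are illustrative; check the s3 source docs for your version):

```yaml
source:
  type: s3
  config:
    path_specs:
      # local directory holding the downloaded OpenData CSVs;
      # s3:// URIs work here as well
      - include: "/data/opendata/downloads/*.csv"
    platform: file
```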
    ✅ 1

    orange-room-20920

    03/27/2023, 7:37 AM
    Hello team! I am currently integrating DataHub and Airflow. I want to display Airflow’s lineage in DataHub, but it’s not working. Configured environment:
    1. Airflow and DataHub are in the same VPC (curl requests work between them).
    2. I am using the Kafka sink.
    3. The DataHub plugin installation is complete in Airflow (acryl-datahub-airflow-plugin==0.10.0.6). It is visible in the plugins tab.
    4. The Airflow version is v2.5.1.
    5. I have set lazy_load_plugins = False in airflow.cfg [core].
    6. DataHub docker check has no issues (everything is healthy).
    7. "Emitting DataHub" logs appear in Airflow without any problem (I left the log in the thread comments).
    Please help! 🥲

    microscopic-room-90690

    03/27/2023, 8:38 AM
    Hi team, I'm trying to use datahub.metadata.schema_classes to set a table schema. It seems the native datatype is not required, and I'm wondering how to make the target datatype adapt automatically to the native datatype, because it is not easy to configure manually during automated execution. Reference: https://docs-website-ej1aml8mp-acryldata.vercel.app/docs/python-sdk/models#datahub.metadata.schema_classes.TimeTypeClass https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/dataset_schema.py
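    As far as I know nothing adapts the type class to the native datatype automatically; a common workaround is a small lookup from the native type string. A minimal pure-Python sketch (the class names mirror datahub.metadata.schema_classes, but the matching rules here are my own illustration):

```python
import re

# Illustrative rules mapping native SQL type strings to DataHub type-class names.
# In a real script the values would be the actual classes from
# datahub.metadata.schema_classes (NumberTypeClass, StringTypeClass, ...).
_NATIVE_TYPE_RULES = [
    (re.compile(r"^(int|bigint|smallint|tinyint|decimal|numeric|float|double)", re.I),
     "NumberTypeClass"),
    (re.compile(r"^(varchar|char|text|string)", re.I), "StringTypeClass"),
    (re.compile(r"^(date|time|timestamp|datetime)", re.I), "TimeTypeClass"),
    (re.compile(r"^(bool|boolean)", re.I), "BooleanTypeClass"),
]

def type_class_for(native_type: str) -> str:
    """Pick a type-class name from a native type string like 'varchar(255)'."""
    for pattern, type_class in _NATIVE_TYPE_RULES:
        if pattern.match(native_type.strip()):
            return type_class
    return "NullTypeClass"  # fallback when nothing matches

print(type_class_for("varchar(255)"))  # StringTypeClass
print(type_class_for("timestamp(6)"))  # TimeTypeClass
```

    With schema_classes available, the returned name would select the class to wrap in SchemaFieldDataTypeClass, while nativeDataType keeps the raw string.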

    average-rocket-98592

    03/27/2023, 9:46 AM
    Hi! I’m trying to ingest data from Kafka, but I receive the following error message: ModuleNotFoundError: No module named ‘value_types_pb2’. Does anyone have any idea?

    fresh-balloon-59613

    03/27/2023, 11:52 AM
    Can someone please provide the code for the DAG below.

    astonishing-dusk-99990

    03/27/2023, 12:19 PM
    Hi everyone! Currently I’m trying to ingest Trino, but I still get an error, which looks like this. Can anyone help me? Here’s my recipe:
    source:
        type: trino
        config:
            host_port: '{host}:{port}'
            database: {db_name}
            username: {username}
            password: {password}
    Note:
    • I’m deploying with the Helm chart, version 0.10.0
    • Trino runs on top of a Dataproc cluster
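    One likely culprit, independent of Trino itself: in YAML, an unquoted {db_name} parses as a flow mapping rather than a string, so the recipe may fail before it ever reaches the source. Quoting the placeholders the same way host_port already is might help (same recipe, just quoted; the values are still your placeholders):

```yaml
source:
    type: trino
    config:
        host_port: '{host}:{port}'
        database: '{db_name}'
        username: '{username}'
        password: '{password}'
```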
    ✅ 1

    modern-france-82371

    03/28/2023, 3:39 AM
    Hello team, I have a problem with some roles: Reader users are able to trigger ingestion jobs, generate tokens, edit Domains, Tags… which is very different from the documentation. My expectation is that Reader users have view permissions only, as the documentation says. Can anyone give me advice?

    great-monkey-52307

    03/28/2023, 5:38 AM
    Hi Team, I'm trying to ingest from Snowflake for the first time and would like to use the authentication type "KEY_PAIR_AUTHENTICATOR". I have a private key file. Can anyone let me know the steps I need to follow to set private_key_path pointing to the .p8 file? Thank you
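    For reference, a recipe sketch for key-pair auth (field names as I understand the Snowflake source config; all values are placeholders, so double-check against the docs for your version):

```yaml
source:
  type: snowflake
  config:
    account_id: "<account_id>"
    username: "<user>"
    authentication_type: KEY_PAIR_AUTHENTICATOR
    private_key_path: "/path/to/rsa_key.p8"
    # only needed if the key is encrypted:
    private_key_password: "<key_passphrase>"
```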
    ✅ 1

    witty-butcher-82399

    03/28/2023, 7:33 AM
    Hi! Is there any reason why other entities such as charts, dashboards, and containers are missing from this mapping? Could these entities be added safely? https://github.com/datahub-project/datahub/blob/c7d35ffd6609d0ae79a2b1151a2221086e[…]ingestion/src/datahub/ingestion/transformer/base_transformer.py
    self.entity_type_mappings: Dict[str, Type] = {
        "dataset": DatasetSnapshotClass,
        "dataFlow": DataFlowSnapshotClass,
        "dataJob": DataJobSnapshotClass,
    }
    We have a custom transformer to enrich ownership information for datasets, charts, dashboards and containers. However, our transform fails because of the assert here. Additionally, what's the point of the assert if there is the return False fallback a couple of lines below? Thanks!
    ✅ 1

    bitter-evening-61050

    03/28/2023, 7:35 AM
    Hi Team, I am trying to connect to Azure AD and I am getting the below error:
    [2023-03-28 12:52:05,662] ERROR {datahub.ingestion.source.identity.azure_ad:489} - Response status code: 401. Response content: b'{"error":{"code":"Authorization_IdentityNotFound","message":"The identity of the calling application could not be established.","innerError":{"date":"2023-03-28T07:20:39","request-id":"xxxx","client-request-id":"xxx"}}}'
    My yaml:
    source:
      type: "azure-ad"
      config:
        client_id: "xxxx"
        tenant_id: "xxxx"
        client_secret: "xxxx"
        redirect: "https://login.microsoftonline.com/common/oauth2/nativeclient"
        authority: "https://login.microsoftonline.com/xxx"
        token_url: "https://login.microsoftonline.com/xxx"
        graph_url: "https://graph.microsoft.com/v1.0"
        ingest_users: True
        ingest_groups: True
        groups_pattern:
          allow:
            - ".*"
        users_pattern:
          allow:
            - ".*"
    sink:
      type: datahub-rest
      config:
        server: http://xxx
        token: xxx
    Can anyone please help in resolving this issue?

    cool-tiger-42613

    03/28/2023, 9:41 AM
    Hi, we have a custom source and want to set up stateful ingestion for our instance. Would this be the best documentation? Additionally, are there some examples in GitHub, just like we have for metadata ingestion?
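    For what it's worth, the recipe side of stateful ingestion usually looks like this (a sketch; `my-custom-source` stands in for your source type, and as I understand it a stable `pipeline_name` is required so state persists across runs):

```yaml
pipeline_name: my_custom_pipeline  # stable id; ingestion state is keyed on this
source:
  type: my-custom-source
  config:
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true  # soft-delete entities that disappear between runs
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```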
    👍 1

    brainy-parrot-75918

    03/28/2023, 9:44 AM
    Hi Team, I'm currently testing the profiling options for DataHub in BQ, but am concerned about the cost and the queries used. Where can I find the exact queries used for calculating the profiling metrics? Thank you.
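    While waiting for pointers to the exact queries (the profiler generates them via Great Expectations under the hood, as far as I know), these config knobs can bound the cost. A sketch of the profiling fragment inside the source config (flag behavior should be verified against the docs for your version):

```yaml
profiling:
  enabled: true
  profile_table_level_only: true       # row/column counts only; cheapest option
  include_field_distinct_count: false  # skip per-column SELECT COUNT(DISTINCT ...)
  include_field_null_count: false      # skip per-column null counts
```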

    lively-raincoat-33818

    03/28/2023, 10:42 AM
    Hello everyone, I am ingesting 'dbt source freshness' with the 'sources.json' file in DataHub. I have uploaded the file to S3 and modified the ingestion in DataHub, but it is not showing the data correctly: 'Last observed' says it was 15 hours ago, which coincides with the time of the last DataHub ingestion. Now that I'm loading 'sources.json', it should show me the freshness data. Has anyone had this problem and can help me? I have v0.10.0. Thanks

    purple-printer-15193

    03/28/2023, 4:23 PM
    Hello, is it possible to exclude all Charts from being ingested in Looker?
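    If the Looker source's `chart_pattern` works like the other allow/deny patterns, something like this might do it (a sketch; I'm assuming `chart_pattern` is supported in your version, so please verify against the Looker source docs):

```yaml
source:
  type: looker
  config:
    base_url: "https://<instance>.looker.com"
    client_id: "<id>"
    client_secret: "<secret>"
    chart_pattern:
      deny:
        - ".*"   # drop all charts; dashboards are still ingested
```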

    quaint-football-54639

    03/28/2023, 9:00 PM
    Hi, when I do profiling I set include_field_distinct_count: false, but when the ingestion starts I still see SELECT count(distinct queries run on all columns.

    damp-lighter-99739

    03/29/2023, 9:09 AM
    Hi Team, we recently deployed DataHub on EKS and wanted to move Kafka from kube to the Confluent Cloud managed service. While setting it up, I noticed that the kafka setup job by default uses a single partition for all topics. Does this mean we need to make some sort of change on the producer/consumer side for a performance upgrade (I'm guessing there is a single MAE/MCE consumer)? New to Kafka, so any help is appreciated.
    ✅ 1

    wide-ghost-47822

    03/29/2023, 12:16 PM
    Hi, I got an error when trying to ingest some data from MariaDB. DataHub is installed in my local environment. I have a file like this:
    source:
      type: mariadb
      config:
        # Coordinates
        host_port: <host:port>
        database: <db-name>
        include_tables: true
    
        profiling:
          enabled: true
          profile_table_level_only: true
        
        stateful_ingestion:
          enabled: true
    
        table_pattern:
          allow:
            - <table>
    
        # Credentials
        username: <user>
        password: <pass>
    
    # sink configs
    Then I executed the following command:
    datahub ingest -c datahub/recipes/myfile.dhub.yaml
    And got this error:
    Failed to connect to DataHub
    and
    requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /config (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10c1343d0>: Failed to establish a new connection: [Errno 61] Connection refused'))
    . I know that datahub-gms is running on port 8080, so I’ve checked the localhost:8080/config endpoint with curl and it responds with HTTP status code 200. Here it is:
    āÆ curl localhost:8080/config -v
    *   Trying 127.0.0.1:8080...
    * Connected to localhost (127.0.0.1) port 8080 (#0)
    > GET /config HTTP/1.1
    > Host: localhost:8080
    > User-Agent: curl/7.86.0
    > Accept: */*
    >
    * Mark bundle as not supporting multiuse
    < HTTP/1.1 200 OK
    < Date: Wed, 29 Mar 2023 12:14:47 GMT
    < Content-Type: application/json
    < Transfer-Encoding: chunked
    < Server: Jetty(9.4.46.v20220331)
    <
    {
      "models" : { },
      "patchCapable" : true,
      "versions" : {
        "linkedin/datahub" : {
          "version" : "v0.10.1",
          "commit" : "d1bab5616cbf19ce22223288feb2b9852ec1fa23"
        }
      },
      "managedIngestion" : {
        "defaultCliVersion" : "0.10.1",
        "enabled" : true
      },
      "statefulIngestionCapable" : true,
      "supportsImpactAnalysis" : true,
      "timeZone" : "GMT",
      "telemetry" : {
        "enabledCli" : true,
        "enabledIngestion" : false
      },
      "datasetUrnNameCasing" : false,
      "retention" : "true",
      "datahub" : {
        "serverType" : "quickstart"
      },
      "noCode" : "true"
    }
    * Connection #0 to host localhost left intact
    I couldn’t figure out the problem yet. Any comments on this? Here is the full output of the error:
    [2023-03-29 15:06:05,702] INFO     {datahub.cli.ingest_cli:173} - DataHub CLI version: 0.10.0.7
    [2023-03-29 15:06:18,138] ERROR    {datahub.entrypoints:188} - Command failed: Failed to set up framework context: Failed to connect to DataHub
    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 174, in _new_conn
        conn = connection.create_connection(
      File "/usr/local/lib/python3.9/site-packages/urllib3/util/connection.py", line 95, in create_connection
        raise err
      File "/usr/local/lib/python3.9/site-packages/urllib3/util/connection.py", line 85, in create_connection
        sock.connect(sa)
    ConnectionRefusedError: [Errno 61] Connection refused
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 703, in urlopen
        httplib_response = self._make_request(
      File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 398, in _make_request
        conn.request(method, url, **httplib_request_kw)
      File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 244, in request
        super(HTTPConnection, self).request(method, url, body=body, headers=headers)
      File "/usr/local/Cellar/python@3.9/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/http/client.py", line 1285, in request
        self._send_request(method, url, body, headers, encode_chunked)
      File "/usr/local/Cellar/python@3.9/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/http/client.py", line 1331, in _send_request
        self.endheaders(body, encode_chunked=encode_chunked)
      File "/usr/local/Cellar/python@3.9/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/http/client.py", line 1280, in endheaders
        self._send_output(message_body, encode_chunked=encode_chunked)
      File "/usr/local/Cellar/python@3.9/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/http/client.py", line 1040, in _send_output
        self.send(msg)
      File "/usr/local/Cellar/python@3.9/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/http/client.py", line 980, in send
        self.connect()
      File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 205, in connect
        conn = self._new_conn()
      File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 186, in _new_conn
        raise NewConnectionError(
    urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x10c1343d0>: Failed to establish a new connection: [Errno 61] Connection refused
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/site-packages/requests/adapters.py", line 439, in send
        resp = conn.urlopen(
      File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 815, in urlopen
        return self.urlopen(
      File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 815, in urlopen
        return self.urlopen(
      File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 815, in urlopen
        return self.urlopen(
      File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 787, in urlopen
        retries = retries.increment(
      File "/usr/local/lib/python3.9/site-packages/urllib3/util/retry.py", line 592, in increment
        raise MaxRetryError(_pool, url, error or ResponseError(cause))
    urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /config (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10c1343d0>: Failed to establish a new connection: [Errno 61] Connection refused'))
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/api/common.py", line 61, in __init__
        self.graph = DataHubGraph(datahub_api) if datahub_api is not None else None
      File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/graph/client.py", line 71, in __init__
        self.test_connection()
      File "/usr/local/lib/python3.9/site-packages/datahub/emitter/rest_emitter.py", line 146, in test_connection
        response = self._session.get(f"{self._gms_server}/config")
      File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 555, in get
        return self.request('GET', url, **kwargs)
      File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 542, in request
        resp = self.send(prep, **send_kwargs)
      File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 655, in send
        r = adapter.send(request, **kwargs)
      File "/usr/local/lib/python3.9/site-packages/requests/adapters.py", line 516, in send
        raise ConnectionError(e, request=request)
    requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /config (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10c1343d0>: Failed to establish a new connection: [Errno 61] Connection refused'))
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 115, in _add_init_error_context
        yield
      File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 167, in __init__
        self.ctx = PipelineContext(
      File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/api/common.py", line 63, in __init__
        raise Exception("Failed to connect to DataHub") from e
    Exception: Failed to connect to DataHub
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/site-packages/datahub/entrypoints.py", line 175, in main
        sys.exit(datahub(standalone_mode=False, **kwargs))
      File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
        return self.main(*args, **kwargs)
      File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1055, in main
        rv = self.invoke(ctx)
      File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/usr/local/lib/python3.9/site-packages/click/core.py", line 760, in invoke
        return __callback(*args, **kwargs)
      File "/usr/local/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
        return f(get_current_context(), *args, **kwargs)
      File "/usr/local/lib/python3.9/site-packages/datahub/telemetry/telemetry.py", line 379, in wrapper
        raise e
      File "/usr/local/lib/python3.9/site-packages/datahub/telemetry/telemetry.py", line 334, in wrapper
        res = func(*args, **kwargs)
      File "/usr/local/lib/python3.9/site-packages/datahub/utilities/memory_leak_detector.py", line 95, in wrapper
        return func(ctx, *args, **kwargs)
      File "/usr/local/lib/python3.9/site-packages/datahub/cli/ingest_cli.py", line 187, in run
        pipeline = Pipeline.create(
      File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 308, in create
        return cls(
      File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 167, in __init__
        self.ctx = PipelineContext(
      File "/usr/local/Cellar/python@3.9/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/contextlib.py", line 137, in __exit__
        self.gen.throw(typ, value, traceback)
      File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 117, in _add_init_error_context
        raise PipelineInitError(f"Failed to {step}: {e}") from e
    datahub.ingestion.run.pipeline.PipelineInitError: Failed to set up framework context: Failed to connect to DataHub
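    One thing that stands out in the recipe above: the sink section is just a comment, so the CLI falls back to its default server. Spelling the sink out removes one variable when debugging connection errors like this (a sketch; adjust the host if GMS runs elsewhere, e.g. inside a container network):

```yaml
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```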
    ✅ 1

    quaint-football-54639

    03/29/2023, 1:31 PM
    Hi team, when I turn on profiling I have include_field_null_count set to false, but I still see the ingestion run null_count. This highly impacts performance; is there a way to turn it off?

    lively-dusk-19162

    03/29/2023, 3:47 PM
    Hello all, when deploying DataHub after making changes to create a new entity, I used the following command to deploy datahub-gms alone: docker-compose -p datahub -f docker-compose.yml -f docker-compose.override.yml -f docker-compose-without-neo4j.m1.yml -f docker-compose.dev.yml up datahub-gms. Attached above is the error.

    high-night-94979

    03/29/2023, 6:48 PM
    Hi! Is there a recommended way (via API) to check whether an entity exists?
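    One lightweight option is to query the entity by URN through the GraphQL API and treat a null result as "does not exist". A sketch (assumes GMS at localhost:8080; the URN below is the quickstart sample dataset):

```python
import json

# Build a GraphQL request that fetches a dataset by URN; if the entity does
# not exist, the `dataset` field in the response comes back null.
GMS_GRAPHQL = "http://localhost:8080/api/graphql"
urn = "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)"

payload = {
    "query": "query ($urn: String!) { dataset(urn: $urn) { urn } }",
    "variables": {"urn": urn},
}

# With the `requests` package and a reachable GMS:
#   resp = requests.post(GMS_GRAPHQL, json=payload)
#   exists = resp.json()["data"]["dataset"] is not None
print(json.dumps(payload, indent=2))
```

    If you're already using the Python SDK, recent versions also expose a DataHubGraph.exists(urn) helper, which may be simpler.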

    calm-dinner-63735

    03/29/2023, 8:28 PM
    When I try to run this code I get the below error:

    calm-dinner-63735

    03/29/2023, 8:28 PM
    https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/lineage_emitter_rest.py