# ingestion
  • hundreds-airline-29192
    04/21/2023, 10:10 AM
    please
  • hundreds-airline-29192
    04/21/2023, 10:10 AM
    please help me
  • powerful-cat-68806
    04/21/2023, 11:04 AM
    Hi team, on which pod, and where, is the secrets service used by the UI managed?
  • orange-gpu-90973
    04/21/2023, 11:21 AM
    In DataHub, when I try to ingest Superset data, the dashboards are ingested and appear on the main page, but the UI still reports that no assets were ingested. Why is that?
  • wonderful-quill-11255
    04/21/2023, 1:20 PM
    Good afternoon, people. I'm trying out authorization on the metadata service and running ingestions with personal access tokens. I'm a bit unclear on one thing: if a user only has the Reader role and generates a token, can he use that token to run ingestion recipes and create new metadata? My tests seem to indicate that he can. /Best regards
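    For context, a minimal sketch of how a personal access token is supplied to the REST sink (placeholder values; whether the write is then accepted depends on the token owner's roles and policies, and on metadata-service authentication being enabled):
    Copy code
    sink:
      type: datahub-rest
      config:
        server: "http://datahub-gms:8080"    # placeholder GMS address
        token: "<personal-access-token>"     # PAT generated from the UI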
  • rapid-airport-61849
    04/21/2023, 1:41 PM
    After today's update I ran into an issue with Redshift: datahub.ingestion.run.pipeline.PipelineInitError: Failed to find a registered source for type redshift: 'str' object is not callable
  • cuddly-dinner-641
    04/21/2023, 2:35 PM
    Has anyone else run into concerns with Top Queries being ingested from source platforms containing PHI or other sensitive info (e.g. in the WHERE clause)? Any strategies or recommendations for dealing with this?
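    One possible mitigation sketch, assuming the source's usage config exposes include_top_n_queries (it appears on the Snowflake/BigQuery usage configs; verify it exists for your source and version):
    Copy code
    source:
      type: snowflake
      config:
        # assumption: disabling top-query collection keeps raw SQL text (and any PHI in it) out of usage aspects
        include_top_n_queries: false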
  • green-honey-91903
    04/21/2023, 7:45 PM
    Hey y'all! Regarding the Airflow <-> DataHub integration: following the Airflow integration guide, I've identified a possible issue/bug in how the DataHub Airflow plugin respects connection parameters. Airflow connections have a port parameter that doesn't seem to be respected by DataHub. I discovered this because I had the following configs for my connection:
    Copy code
    host: datahub-datahub-gms.datahub.svc.cluster.local
    port: 8080
    The DataHub plugin couldn't emit metadata successfully, failing with this error:
    Copy code
    urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='datahub-datahub-gms.datahub.svc.cluster.local', port=80): Max retries exceeded with url: /aspects?action=ingestProposal
    Upon updating my connection configs like so:
    Copy code
    host: datahub-datahub-gms.datahub.svc.cluster.local:8080
    port: None
    The datahub plugin worked successfully.
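    A minimal sketch of the workaround described above, as an Airflow connection definition (conn_id and conn_type names are assumptions based on the DataHub Airflow plugin docs; the empty port assumes the plugin only reads the host field):
    Copy code
    conn_id: datahub_rest_default
    conn_type: datahub_rest
    host: datahub-datahub-gms.datahub.svc.cluster.local:8080   # port folded into the host value
    port:                                                      # left unset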
  • hallowed-petabyte-25444
    04/23/2023, 6:57 AM
    Hi team, I am ingesting data from a PostgreSQL database using the DataHub UI (without using any APIs), but lineage is not working; I want lineage to be detected automatically. The documentation for PostgreSQL ingestion mentions that a table-level lineage feature exists and has to be configured, but I read the documentation and was unable to find out how to configure it. Can you please suggest a solution? Thanks in advance.
  • adorable-magazine-49274
    04/24/2023, 3:55 AM
    Hello everyone! Currently, when I run Redshift ingestion, the following error occurs. I'm attaching the pod status and log. Can you help me?
  • quiet-rain-16785
    04/24/2023, 4:16 AM
    Hi guys, I want to know how I can access the log details of the pipelines I ingested in DataHub. Is there a GraphQL API for this?
  • clever-magician-79463
    04/24/2023, 4:39 AM
    Hi folks, has anyone been able to create column lineage for Redshift data in DataHub? I have version 0.10.1. I see a column-lineage button in the UI, but nothing really happens when I click it; I am only able to set table lineage. I'd be glad to hear how, or in what other ways, people are achieving this.
  • brief-cat-57352
    04/24/2023, 9:59 AM
    Hi team, we are currently getting the error below when calling Pipeline.create for Postgres. This had been working for a long time. Any ideas? Thank you.
    Copy code
    File "/usr/local/lib/python3.7/site-packages/great_expectations/experimental/datasources/dynamic_pandas.py", line 68, in <module>
        f"{Version(pd.__version__).major}.{Version(pd.__version__).minor}"
    AttributeError: 'Version' object has no attribute 'major'
    ...
    datahub.ingestion.run.pipeline.PipelineInitError: Failed to create source
  • ancient-policeman-73437
    04/24/2023, 10:30 AM
    Dear support, I haven't found this topic discussed yet. I am trying to ingest Snowflake + dbt. dbt runs fine, but Snowflake loads quite a lot (2000 objects) and then fails, saying "Pipeline finished with at least 2866 failures; produced 17798 events in 1 hour, 3 minutes and 53.65 seconds." The errors I see in the log look like this:
    Copy code
    'Error when batch flush on sql: update metadata_aspect_v2 set metadata=?, createdOn=?, createdBy=?, '
                                          'createdFor=?, systemmetadata=? where urn=? and aspect=? and version=?\n'
                                          '\tat com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:42)\n'
                                          '\tat com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:50)',
                            'message': 'javax.persistence.PersistenceException: Error when batch flush on sql: update metadata_aspect_v2 set metadata=?, '
                                       'createdOn=?, createdBy=?, createdFor=?, systemmetadata=? where urn=? and aspect=? and v',
                            'status': 500,
                            'id': 'urn:li:dataset:(urn:li:dataPlatform:snowflake,{DB and table Name},PROD)'}},
                  {'error': 'Unable to emit metadata to DataHub GMS',
                   'info': {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException',
                            'stackTrace': 'com.linkedin.restli.server.RestLiServiceException'
    What am I doing wrong? The test connection to Snowflake was successful.
  • little-spring-72943
    04/24/2023, 10:53 AM
    Is there any way to exclude certain reports when ingesting metadata for Power BI workspaces? There seem to be options for the on-premise "powerbi-report-server" source but none for the cloud-based "powerbi" source. Version v0.10.2.
  • adorable-magazine-49274
    04/24/2023, 12:43 PM
    Hello everyone! Currently the Redshift ingestion run itself seems to work fine, but no data is being collected. Through debugging I even confirmed that the collection query is executed in Redshift. Can you please help me? As of today (2023-04-24) we use the versions registered in the DataHub repo, and the CLI is 0.10.2.1. Thanks!
  • rhythmic-horse-99276
    04/24/2023, 12:52 PM
    Help me please! I'm trying to ingest Power BI into DataHub. I created an app in Azure, created a security group in Azure and added the app to the group, allowed service principals to use the Power BI APIs, and added the group to the workspaces. But when I try to connect to Power BI I get a 401 error. What am I doing wrong?
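    For reference, the Azure AD details usually land in the recipe roughly like this (a sketch with placeholder values; a 401 typically points at the tenant/app credentials or API permissions rather than the recipe shape):
    Copy code
    source:
      type: powerbi
      config:
        tenant_id: "<azure-tenant-id>"         # placeholder
        client_id: "<app-registration-id>"     # placeholder
        client_secret: "<app-client-secret>"   # placeholder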
  • witty-butcher-82399
    04/24/2023, 12:54 PM
    Quick question about authorization in the ingestion pipelines: we have some connectors using the datahub-kafka sink, and we are planning to enable authorization on the GMS API for some other specific use cases. Given we have many connectors, Kafka-based ingestion is a critical requirement for us, so we are OK trusting the connectors and keeping them unauthenticated. Does enabling authentication on the GMS API prevent us from working this way, with some unauthenticated traffic from connectors and authentication applied exclusively to the GMS API? Thanks!
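    For reference, a Kafka-based sink that bypasses the GMS REST endpoint looks roughly like this (a sketch with placeholder addresses; broker and schema-registry security settings would also go under connection):
    Copy code
    sink:
      type: datahub-kafka
      config:
        connection:
          bootstrap: "broker:9092"                             # placeholder broker address
          schema_registry_url: "http://schema-registry:8081"   # placeholder registry address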
  • orange-intern-2172
    04/24/2023, 3:56 PM
    Has anyone here set up the Kafka ingester for Aiven? The default service user does not seem to have describeConfigs / describeTopics permissions set. Has anyone else had this problem?
  • bland-orange-13353
    04/24/2023, 4:06 PM
    This message was deleted.
  • salmon-angle-92685
    04/24/2023, 8:22 PM
    Hello guys, I've deleted nearly all Glossary Terms and Nodes from DataHub, but when I check the tables they are still tagged with the Glossary Terms. How can I fix this? In the pictures you can see the glossary page is empty, but the tables still keep the terms. Thanks for your help!
  • rich-salesmen-77587
    04/24/2023, 9:58 PM
    Hi team, I have tried to establish a connection from my DataHub to Unity Catalog with the recipe below:
    Copy code
    source:
      type: unity-catalog
      config:
        workspace_url: 'https://####.azuredatabricks.net'
        include_table_lineage: true
        include_column_lineage: true
        stateful_ingestion:
          enabled: true
        token: '###################################'
        env: 'PROD'
    pipeline_name: 'unity_prod'
    sink:
      type: datahub-rest
      config:
        server: 'http://datahub-host-url:8080'
    But I get this error:
    Copy code
    HTTPError: 403 Client Error: Forbidden for url: https://############.azuredatabricks.net/api/2.1/unity-catalog/metastores
    Response from server:
    { 'details': [ { '@type': 'type.googleapis.com/google.rpc.RequestInfo',
                     'request_id': 'af76d1a6-682d-41a8-82dd-d4bc4c986c1e',
                     'serving_data': ''}],
      'error_code': 'PERMISSION_DENIED',
      'message': 'Only account admin can list metastores.'}
  • wide-florist-83539
    04/25/2023, 4:13 AM
    Hi team, this seems to be a recurring issue with ingesting BigQuery via the console / YAML. When testing the connection with my service account key I get the following error:
    Copy code
    ('Failed to load service account credentials from /tmp/tmpmuzf8mak', ValueError('Could not deserialize key data. The data may be in an incorrect format, it may be encrypted with an unsupported algorithm, or it may be an unsupported key type (e.g. EC curves with explicit parameters).', [<OpenSSLError(code=503841036, lib=60, reason=524556, reason_text=unsupported)>]))
    I'm on v0.10.0 and gave the service account the appropriate permissions. It seems like a parsing issue with the key? Should I be removing the
    -----BEGIN PRIVATE KEY-----
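    For comparison, the BigQuery source's credential block generally expects the key exactly as it appears in the service-account JSON, header and footer included, with newlines preserved as \n (a sketch with placeholder values):
    Copy code
    source:
      type: bigquery
      config:
        credential:
          project_id: "my-project"                      # placeholder
          private_key_id: "<key-id>"                    # placeholder
          private_key: "-----BEGIN PRIVATE KEY-----\n<key body>\n-----END PRIVATE KEY-----\n"
          client_email: "ingest@my-project.iam.gserviceaccount.com"   # placeholder
          client_id: "<client-id>"                      # placeholder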
  • freezing-orange-76204
    04/25/2023, 5:00 AM
    Hi all. I have a question about Kafka ingestion: do you support importing Kafka streams created by ksqlDB, or is it only topics and Kafka Connect connectors?
  • gifted-bear-4760
    04/25/2023, 7:21 AM
    Hi everyone! I have deployed DataHub on GKE. I'm trying to ingest the sample data from the prerequisite MySQL instance that is deployed as part of DataHub, using the following recipe: https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/recipes/mysql_to_datahub.dhub.yaml But I'm getting the following error:
    Copy code
    [2023-04-25 06:56:40,643] INFO  {datahub.cli.ingest_cli:173} - DataHub CLI version: 0.10.2.1
    [2023-04-25 06:56:52,940] ERROR {datahub.entrypoints:195} - Command failed: Failed to set up framework context: Failed to connect to DataHub
    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/dist-packages/urllib3/connection.py", line 174, in _new_conn
        conn = connection.create_connection(
      File "/usr/local/lib/python3.9/dist-packages/urllib3/util/connection.py", line 95, in create_connection
        raise err
      File "/usr/local/lib/python3.9/dist-packages/urllib3/util/connection.py", line 85, in create_connection
        sock.connect(sa)
    ConnectionRefusedError: [Errno 111] Connection refused

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/dist-packages/urllib3/connectionpool.py", line 703, in urlopen
        httplib_response = self._make_request(
      File "/usr/local/lib/python3.9/dist-packages/urllib3/connectionpool.py", line 398, in _make_request
        conn.request(method, url, **httplib_request_kw)
      File "/usr/local/lib/python3.9/dist-packages/urllib3/connection.py", line 244, in request
        super(HTTPConnection, self).request(method, url, body=body, headers=headers)
      File "/usr/lib/python3.9/http/client.py", line 1255, in request
        self._send_request(method, url, body, headers, encode_chunked)
      File "/usr/lib/python3.9/http/client.py", line 1301, in _send_request
        self.endheaders(body, encode_chunked=encode_chunked)
      File "/usr/lib/python3.9/http/client.py", line 1250, in endheaders
        self._send_output(message_body, encode_chunked=encode_chunked)
      File "/usr/lib/python3.9/http/client.py", line 1010, in _send_output
        self.send(msg)
      File "/usr/lib/python3.9/http/client.py", line 950, in send
        self.connect()
      File "/usr/local/lib/python3.9/dist-packages/urllib3/connection.py", line 205, in connect
        conn = self._new_conn()
      File "/usr/local/lib/python3.9/dist-packages/urllib3/connection.py", line 186, in _new_conn
        raise NewConnectionError(
    urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fad40b240d0>: Failed to establish a new connection: [Errno 111] Connection refused

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/dist-packages/requests/adapters.py", line 489, in send
        resp = conn.urlopen(
      File "/usr/local/lib/python3.9/dist-packages/urllib3/connectionpool.py", line 815, in urlopen
        return self.urlopen(
      File "/usr/local/lib/python3.9/dist-packages/urllib3/connectionpool.py", line 815, in urlopen
        return self.urlopen(
      File "/usr/local/lib/python3.9/dist-packages/urllib3/connectionpool.py", line 815, in urlopen
        return self.urlopen(
      File "/usr/local/lib/python3.9/dist-packages/urllib3/connectionpool.py", line 787, in urlopen
        retries = retries.increment(
      File "/usr/local/lib/python3.9/dist-packages/urllib3/util/retry.py", line 592, in increment
        raise MaxRetryError(_pool, url, error or ResponseError(cause))
    urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /config (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fad40b240d0>: Failed to establish a new connection: [Errno 111] Connection refused'))

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/home/naman_gulati/.local/lib/python3.9/site-packages/datahub/ingestion/api/common.py", line 62, in __init__
        self.graph = DataHubGraph(datahub_api) if datahub_api is not None else None
      File "/home/naman_gulati/.local/lib/python3.9/site-packages/datahub/ingestion/graph/client.py", line 72, in __init__
        self.test_connection()
      File "/home/naman_gulati/.local/lib/python3.9/site-packages/datahub/emitter/rest_emitter.py", line 146, in test_connection
        response = self._session.get(f"{self._gms_server}/config")
      File "/usr/local/lib/python3.9/dist-packages/requests/sessions.py", line 600, in get
        return self.request("GET", url, **kwargs)
      File "/usr/local/lib/python3.9/dist-packages/requests/sessions.py", line 587, in request
        resp = self.send(prep, **send_kwargs)
      File "/usr/local/lib/python3.9/dist-packages/requests/sessions.py", line 701, in send
        r = adapter.send(request, **kwargs)
      File "/usr/local/lib/python3.9/dist-packages/requests/adapters.py", line 565, in send
        raise ConnectionError(e, request=request)
    requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /config (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fad40b240d0>: Failed to establish a new connection: [Errno 111] Connection refused'))

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "/home/naman_gulati/.local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 119, in _add_init_error_context
        yield
      File "/home/naman_gulati/.local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 187, in __init__
        self.ctx = PipelineContext(
      File "/home/naman_gulati/.local/lib/python3.9/site-packages/datahub/ingestion/api/common.py", line 64, in __init__
        raise Exception("Failed to connect to DataHub") from e
    Exception: Failed to connect to DataHub

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "/home/naman_gulati/.local/lib/python3.9/site-packages/datahub/entrypoints.py", line 182, in main
        sys.exit(datahub(standalone_mode=False, **kwargs))
      File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1130, in __call__
        return self.main(*args, **kwargs)
      File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1055, in main
        rv = self.invoke(ctx)
      File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1657, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1657, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1404, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 760, in invoke
        return __callback(*args, **kwargs)
      File "/usr/local/lib/python3.9/dist-packages/click/decorators.py", line 26, in new_func
        return f(get_current_context(), *args, **kwargs)
      File "/home/naman_gulati/.local/lib/python3.9/site-packages/datahub/telemetry/telemetry.py", line 379, in wrapper
        raise e
      File "/home/naman_gulati/.local/lib/python3.9/site-packages/datahub/telemetry/telemetry.py", line 334, in wrapper
        res = func(*args, **kwargs)
      File "/home/naman_gulati/.local/lib/python3.9/site-packages/datahub/utilities/memory_leak_detector.py", line 95, in wrapper
        return func(ctx, *args, **kwargs)
      File "/home/naman_gulati/.local/lib/python3.9/site-packages/datahub/cli/ingest_cli.py", line 187, in run
        pipeline = Pipeline.create(
      File "/home/naman_gulati/.local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 328, in create
        return cls(
      File "/home/naman_gulati/.local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 187, in __init__
        self.ctx = PipelineContext(
      File "/usr/lib/python3.9/contextlib.py", line 135, in __exit__
        self.gen.throw(type, value, traceback)
      File "/home/naman_gulati/.local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 121, in _add_init_error_context
        raise PipelineInitError(f"Failed to {step}: {e}") from e
    datahub.ingestion.run.pipeline.PipelineInitError: Failed to set up framework context: Failed to connect to DataHub
    I have also tried replacing the hostname with the Cluster IP of the prerequisite MySQL service, and then with the serving pod's endpoint, but I get an identical error. Can someone please help me with this? It's a little urgent.
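    The failing call above is the CLI probing the GMS /config endpoint on localhost:8080, so on GKE the recipe's sink usually needs to point at the in-cluster GMS service instead (a sketch; the service name below assumes the default Helm chart naming seen earlier in this channel):
    Copy code
    sink:
      type: datahub-rest
      config:
        server: "http://datahub-datahub-gms.datahub.svc.cluster.local:8080"   # assumed in-cluster GMS service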
  • billowy-flag-4217
    04/25/2023, 4:41 PM
    Does anyone know whether we should expect ingestion of Looks from the Looker ingestion library? It seems only visualisations from dashboards are being ingested, but not individual Looks that are not present on dashboards.
  • important-tailor-54083
    04/25/2023, 6:34 PM
    Hi, I'm using DataHub CLI version 0.10.2.1 and Metabase 0.45.2.1. I got this error multiple times when ingesting Metabase. The card (id 2594) is accessible by the account I'm using, and it uses a custom query. I've been dealing with this issue for several days and have no idea how to solve it. Is this a bug in DataHub? Kindly help.
    Copy code
    'metabase-table-card__2594': ['Unable to retrieve source table. Reason: 404 Client Error: Not Found for url: <https://xxxx.com/api/table/card__2594>'],
  • important-tailor-54083
    04/25/2023, 7:01 PM
    Another question: why, after I soft-delete a Metabase dashboard and chart and then re-ingest, does the asset not show up in the UI? I've already tried re-ingesting many times with the same connection settings, but it still doesn't appear. Kindly help. Thanks.
  • numerous-byte-87938
    04/25/2023, 10:26 PM
    Hi folks, it's wonderful to see that a retention mechanism is available for the KV store, and I'm interested to hear what the current strategy is for keeping Elasticsearch size in check. I found one related thread but no conclusion was drawn there. In our setup there are mainly four types of index in ES:
    1. graph_service_v1 - used by GraphService, 217M docs, 29.3gb (primary).
    2. system_metadata_service_v1 - used by SystemMetadataService, 401M docs, 44.5gb.
    3. timeseries indices - used by TimeseriesAspectService; the largest, dataprocessinstance_dataprocessinstanceruneventaspect_v1, has 133M docs, 25.3gb.
    4. versioned indices - used by EntitySearchService; the largest, dataprocessinstanceindex_v2, has 73M docs, 100.1gb.
    We are concerned that leaving these indices to grow will gradually degrade query performance and hinder future data migrations. For example, in our recent upgrade it took us 50h to reindex dataprocessinstanceindex_v2 due to a mapping change. Since we have to pause ingestion traffic during the upgrade, this enlarges our maintenance window.