# ingestion
  • r

    rich-state-73859

    02/06/2023, 10:04 PM
    Hi all, I got
    com.google.protobuf.InvalidProtocolBufferException$InvalidWireTypeException: Protocol message tag had invalid wire type.
    when importing
    scalapb
    in my protobuf. I added the following file option.
    Copy code
    option (scalapb.options) = {
      import: "path.to.package._"
    };
  • g

    green-lion-58215

    02/06/2023, 11:30 PM
    Hello team, I have a quick question regarding Redshift lineage tracking. If a table is created through a temp table as its upstream, would we still capture lineage? For example: table A is created from table_b, which is a temp table, which in turn is created from table C. Would we still get lineage showing that table A's upstream is table C?
  • r

    refined-energy-76018

    02/06/2023, 11:44 PM
    I'm using Datahub Actions to action on dataProcessInstanceProperties aspects that are emitted via the Datahub Airflow Plugin. However, one thing I'm trying to figure out is the best way to get the dag id and task id while executing the action. Any ideas how this can be done or is this something that will require adding properties to customProperties emitted by the Datahub Airflow Plugin?
  • b

    bitter-evening-61050

    02/07/2023, 5:58 AM
    Hi, can anyone help me with how to schedule an ingestion through the DataHub CLI?
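    For reference, a minimal sketch of one common approach, assuming you already have a working recipe file (the paths and schedule below are illustrative): the open-source CLI runs a recipe once per invocation, so recurring runs are usually driven by an external scheduler such as cron, or by UI/managed ingestion.
    # Run the recipe once (datahub ingest -c is the standard CLI invocation):
    #   datahub ingest -c /home/datahub/recipes/my_recipe.yml
    # Hypothetical crontab entry running the same command every day at 02:00:
    0 2 * * * /home/datahub/venv/bin/datahub ingest -c /home/datahub/recipes/my_recipe.yml >> /var/log/datahub_ingest.log 2>&1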
  • b

    bitter-evening-61050

    02/07/2023, 6:51 AM
    Hi, I am getting the error below. Can anyone help me, please? raise KeyError(f"key already in use - {key}") KeyError: 'key already in use - datahub'
  • h

    high-toothbrush-90528

    02/07/2023, 9:11 AM
    Hi everybody! I am implementing a Python registration/deletion script. What is the best way to delete the entities in this case? I have used the CLI for deletion inside the script, but I was thinking of using
    ChangeTypeClass.DELETE
    instead (but I saw it's not supported yet).
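    For reference, a hedged sketch of one programmatic alternative to shelling out to the CLI: soft-deleting an entity by emitting a Status aspect with removed=True. The GMS address and URN below are illustrative.
    # Minimal sketch: soft-delete an entity by marking its Status aspect as removed.
    # Assumes the acryl-datahub Python package; server and URN are illustrative.
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import StatusClass

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
    dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,example.table,PROD)"

    # An UPSERT of Status(removed=True) soft-deletes the entity (hidden in the UI,
    # metadata retained); hard deletes still go through `datahub delete --hard`.
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=dataset_urn,
            aspect=StatusClass(removed=True),
        )
    )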
  • c

    chilly-potato-57465

    02/07/2023, 12:40 PM
    Hello Everyone! I am wondering about the following. Ingestion happens from different sources and their data types do not match one-to-one. What is the DataHub approach to matching the data types from the different ingestion sources? Thank you!
  • l

    lively-dusk-19162

    02/07/2023, 2:46 PM
    Hello team, quick question! I am trying to create a new entity. Is there a plugin for creating a new entity, or is forking DataHub the only way?
  • k

    kind-kite-29761

    02/07/2023, 4:00 PM
    Hello Team, I am trying to ingest my S3 data into DataHub. I have written my recipe in .yaml. It runs successfully, but it isn't sending any data to DataHub. What is the issue here? Below is the output I am getting:
    Copy code
    >> datahub ingest -c S3.yml
    [2023-02-07 13:12:16,262] INFO     {datahub.cli.ingest_cli:165} - DataHub CLI version: 0.9.6.4
    [2023-02-07 13:12:16,317] INFO     {datahub.ingestion.run.pipeline:179} - Sink configured successfully. DataHubRestEmitter: configured to talk to <http://a770b2e6a6b9d4997bc43a67978e8c9f-1534401114.us-east-1.elb.amazonaws.com:9002/api/gms> with token: eyJh**********Tdmk
    [2023-02-07 13:12:16,697] ERROR    {logger:26} - Please set env variable SPARK_VERSION
    [2023-02-07 13:12:16,697] INFO     {logger:27} - Using deequ: com.amazon.deequ:deequ:1.2.2-spark-3.0
    /home/ec2-user/environment/datahub/lib64/python3.7/site-packages/datahub/ingestion/source/s3/source.py:317: ConfigurationWarning: env is deprecated and will be removed in a future release. Please use platform_instance instead.
      config = DataLakeSourceConfig.parse_obj(config_dict)
    [2023-02-07 13:12:17,144] INFO     {datahub.ingestion.run.pipeline:196} - Source configured successfully.
    [2023-02-07 13:12:17,146] INFO     {datahub.cli.ingest_cli:120} - Starting metadata ingestion
    -[2023-02-07 13:12:17,243] INFO     {botocore.credentials:1253} - Found credentials in shared credentials file: ~/.aws/credentials
    |[2023-02-07 13:13:26,208] INFO     {datahub.cli.ingest_cli:133} - Finished metadata ingestion
    
    Cli report:
    {'cli_entry_location': '/home/ec2-user/environment/datahub/lib64/python3.7/site-packages/datahub/__init__.py',
     'cli_version': '0.9.6.4',
     'mem_info': '186.32 MB',
     'os_details': 'Linux-4.14.301-224.520.amzn2.x86_64-x86_64-with-glibc2.2.5',
     'py_exec_path': '/home/ec2-user/environment/datahub/bin/python3',
     'py_version': '3.7.16 (default, Dec 15 2022, 23:24:54) \n[GCC 7.3.1 20180712 (Red Hat 7.3.1-15)]'}
    Source (s3) report:
    {'aspects': {},
     'entities': {},
     'events_produced': 0,
     'events_produced_per_sec': 0,
     'failures': {},
     'filtered': [],
     'running_time': '1 minute and 9.41 seconds',
     'start_time': '2023-02-07 13:12:16.962307 (1 minute and 9.41 seconds ago)',
     'warnings': {}}
    Sink (datahub-rest) report:
    {'current_time': '2023-02-07 13:13:26.378011 (now)',
     'failures': [],
     'gms_version': 'v0.8.45',
     'pending_requests': 0,
     'records_written_per_second': 0,
     'start_time': '2023-02-07 13:12:16.302696 (1 minute and 10.08 seconds ago)',
     'total_duration_in_seconds': 70.08,
     'total_records_written': 0,
     'warnings': []}
    
     Pipeline finished successfully; produced 0 events in 1 minute and 9.41 seconds.
    ❗Client-Server Incompatible❗ Your client version 0.9.6.4 is older than your server version 0.8.45. Upgrading the cli to 0.8.45 is recommended.
    Any idea where I am going wrong? Why am I not able to push from S3?
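    One hedged sketch for comparison, since a run that "finishes successfully" with 0 events often means the path spec matched no objects: a minimal s3 recipe (bucket, path pattern, server, and token are illustrative).
    source:
      type: s3
      config:
        path_specs:
          # Must match real objects; a pattern that matches nothing produces 0 events.
          - include: "s3://my-bucket/data/*/*.parquet"
        aws_config:
          aws_region: us-east-1
        profiling:
          enabled: false
    sink:
      type: datahub-rest
      config:
        server: "http://<gms-host>:8080"
        token: "<token>"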
  • a

    acceptable-account-83031

    02/07/2023, 5:42 PM
    Hi Team, in DataHub version 0.9.6.4 I am having issues with S3 lineage showing in the UI, even after using the config
    emit_s3_lineage: True
    with the Glue ingestion source.
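    For comparison, a hedged sketch of the relevant part of a Glue recipe (the region is illustrative; one thing worth checking is whether the referenced S3 datasets themselves exist in DataHub):
    source:
      type: glue
      config:
        aws_region: us-east-1
        extract_transforms: true
        emit_s3_lineage: true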
  • q

    quiet-jelly-11365

    02/07/2023, 5:51 PM
    Can we ingest Confluent Schema Registry protobuf schemas into DataHub, or are only Avro schemas supported?
  • a

    alert-fall-82501

    02/07/2023, 6:15 PM
    Hi Team - can anybody advise on this error log? I am working on importing Airflow DAG jobs into DataHub using a Kafka server connection instead of DataHub REST.
  • b

    bland-barista-59197

    02/07/2023, 8:58 PM
    Hi Team, I have a question: I want to ingest the schema and profile tables that start with "^attributes*". Do I need to specify the regex in both locations, i.e. schema_pattern and profile_pattern?
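    A hedged sketch of specifying the regex in both places (the source type and exact pattern are illustrative; depending on the source, profile_pattern may be matched against the fully qualified schema.table name rather than the schema alone):
    source:
      type: <your-source-type>
      config:
        schema_pattern:
          allow:
            - "^attributes.*"
        profile_pattern:
          allow:
            - "^attributes.*"
        profiling:
          enabled: true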
  • n

    nice-advantage-52080

    02/07/2023, 9:50 PM
    Hi there! I'm trying to create a simple lineage graph with some input datasets going into a job and then an output dataset. I'm running the lineage_dataset_job_dataset.py example on a clean DataHub, but I don't see any data lineage at all. I see 1 Airflow item under Platforms; clicking that link brings me to a page that shows a Data Task, but I cannot click on it. I've tried many different ways, creating the datasets first, then creating the data flow/job, and then attaching the datasets to the datajob_input_output aspect, but nothing seems to work. This should be a fairly simple task, so why is it not working for me? Thanks in advance for any help!
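    For reference, a hedged sketch of the general shape of dataset -> job -> dataset lineage with the Python emitter (names and the GMS address are illustrative; this mirrors the idea of the lineage_dataset_job_dataset.py example rather than reproducing it):
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import DataJobInputOutputClass

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    input_urn = builder.make_dataset_urn(platform="hive", name="example.upstream", env="PROD")
    output_urn = builder.make_dataset_urn(platform="hive", name="example.downstream", env="PROD")
    datajob_urn = builder.make_data_job_urn(orchestrator="airflow", flow_id="example_dag", job_id="example_task")

    # The data job carries the lineage: its inputs and outputs are dataset URNs,
    # so the graph renders as upstream dataset -> task -> downstream dataset.
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=datajob_urn,
            aspect=DataJobInputOutputClass(
                inputDatasets=[input_urn],
                outputDatasets=[output_urn],
            ),
        )
    )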
  • b

    breezy-controller-54597

    02/08/2023, 2:40 AM
    Hi, what are the best practices for granting permissions to users for metadata ingestion from SQL-based data sources such as Vertica? Vertica has the following roles: DBADMIN, PSEUDOSUPERUSER, DBUSER, SYSMONITOR, PUBLIC.
  • f

    flat-painter-78331

    02/08/2023, 7:46 AM
    Hi guys, good day! I'm trying to run an ingestion and the status says "pending"; it has been this way for the last few hours. This ingestion ran fine previously. I tried deleting it and adding it again, but that didn't work. Does anybody know why this might be happening? (I've deployed DataHub on Kubernetes.)
  • i

    incalculable-manchester-41314

    02/08/2023, 9:39 AM
    Hi, does anyone know whether we can use the Kafka sink for all kinds of sources, or is it limited? For example, if the source is MySQL we can set Kafka as the sink, but if it is Power BI we can't?
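    For reference, a hedged sketch of a datahub-kafka sink block (broker and schema registry addresses are illustrative); the sink section of a recipe is generally independent of the source type:
    sink:
      type: datahub-kafka
      config:
        connection:
          bootstrap: "broker:9092"
          schema_registry_url: "http://schema-registry:8081"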
  • r

    ripe-eye-60209

    02/08/2023, 1:20 PM
    Hello Team, sqlalchemy (Teradata) profiling seems to have an error.
    Copy code
    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/site-packages/datahub/entrypoints.py", line 171, in main
        sys.exit(datahub(standalone_mode=False, **kwargs))
      File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
        return self.main(*args, **kwargs)
      File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1055, in main
        rv = self.invoke(ctx)
      File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/usr/local/lib/python3.9/site-packages/click/core.py", line 760, in invoke
        return __callback(*args, **kwargs)
      File "/usr/local/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
        return f(get_current_context(), *args, **kwargs)
      File "/usr/local/lib/python3.9/site-packages/datahub/telemetry/telemetry.py", line 350, in wrapper
        raise e
      File "/usr/local/lib/python3.9/site-packages/datahub/telemetry/telemetry.py", line 302, in wrapper
        res = func(*args, **kwargs)
      File "/usr/local/lib/python3.9/site-packages/datahub/utilities/memory_leak_detector.py", line 95, in wrapper
        return func(ctx, *args, **kwargs)
      File "/usr/local/lib/python3.9/site-packages/datahub/cli/ingest_cli.py", line 190, in run
        loop.run_until_complete(run_func_check_upgrade(pipeline))
      File "/usr/local/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
        return future.result()
      File "/usr/local/lib/python3.9/site-packages/datahub/cli/ingest_cli.py", line 149, in run_func_check_upgrade
        ret = await the_one_future
      File "/usr/local/lib/python3.9/site-packages/datahub/cli/ingest_cli.py", line 140, in run_pipeline_async
        return await loop.run_in_executor(
      File "/usr/local/lib/python3.9/concurrent/futures/thread.py", line 58, in run
        result = self.fn(*self.args, **self.kwargs)
      File "/usr/local/lib/python3.9/site-packages/datahub/cli/ingest_cli.py", line 131, in run_pipeline_to_completion
        raise e
      File "/usr/local/lib/python3.9/site-packages/datahub/cli/ingest_cli.py", line 123, in run_pipeline_to_completion
        pipeline.run()
      File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 334, in run
        for wu in itertools.islice(
      File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/source/sql/sql_common.py", line 638, in get_workunits
        for inspector in self.get_inspectors():
      File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/source/sql/sql_common.py", line 519, in get_inspectors
        engine = create_engine(url, **self.config.options)
      File "<string>", line 2, in create_engine
      File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/deprecations.py", line 375, in warned
        return fn(*args, **kwargs)
      File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/create.py", line 636, in create_engine
        raise TypeError(
    TypeError: Invalid argument(s) 'max_overflow' sent to create_engine(), using configuration TeradataDialect/SingletonThreadPool/Engine. Please check that the keyword arguments are appropriate for this combination of components.
  • l

    lemon-scooter-69730

    02/08/2023, 1:41 PM
    Is there a way to configure a connection to Airflow if I do not have access to the command line for my Airflow installation?
  • e

    elegant-salesmen-99143

    02/08/2023, 6:06 PM
    A question: I have a transformer that adds glossary terms based on a field name pattern. Can I make the transformer in the ingestion recipe work only for a certain schema (container) in a data source, or omit certain containers? I've tried adding
    schema_pattern:
    allow:
    to the transformer, but it didn't work; the ingestion failed with an error in that part of the recipe. Or maybe I got the nesting depth wrong; where should it go?
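    For context, a hedged sketch of a field-name-pattern glossary-term transformer block (the shape follows the documented pattern_add_dataset_schema_terms transformer; the patterns and term URNs are illustrative, and as far as the documented config goes the transformer does not take a schema_pattern of its own):
    transformers:
      - type: "pattern_add_dataset_schema_terms"
        config:
          term_pattern:
            rules:
              ".*email.*": ["urn:li:glossaryTerm:Email"]
              ".*ssn.*": ["urn:li:glossaryTerm:SSN"]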
  • e

    elegant-salesmen-99143

    02/08/2023, 6:10 PM
    And in addition to my question above: if I can't specify a schema within a transformer, what happens if I run two ingestions for the same data source, one for the full database with no transformers, and one for only a single schema within the database with a transformer that adds terms by field pattern? Will that solve my problem, or will there be errors due to the two conflicting ingestions?
  • c

    clean-tomato-22549

    02/09/2023, 3:09 AM
    Hi team, about UI ingestion (https://datahubproject.io/docs/ui-ingestion): can I use it to ingest a business glossary or the CSV enricher, which need the path of the ingested file to be specified? If yes, could you help provide some examples for them? We prefer to use the UI to manage our ingestion rather than the CLI.
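    For reference, a hedged sketch of a business-glossary recipe (the file path is illustrative; for UI ingestion the path must be readable by whatever actually executes the run, e.g. the actions/executor container):
    source:
      type: datahub-business-glossary
      config:
        file: /path/to/business_glossary.yml
        enable_auto_id: true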
  • c

    cold-airport-17919

    02/09/2023, 4:10 AM
    Hi, I am evaluating the csv-enricher module for ingesting a CSV file. Can someone share a good sample CSV file that I can use to learn? Thank you
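    A hedged sketch of what such a file can look like, assuming the column layout described in the csv-enricher docs (the URNs and values are illustrative, and array-valued cells use the bracketed list syntax):
    resource,subresource,glossary_terms,tags,owners,ownership_type,description,domain
    "urn:li:dataset:(urn:li:dataPlatform:hive,example.table,PROD)",,[urn:li:glossaryTerm:Email],[urn:li:tag:PII],[urn:li:corpuser:jdoe],TECHNICAL_OWNER,Example table description,urn:li:domain:marketing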
  • c

    clean-tomato-22549

    02/09/2023, 6:04 AM
    Hi Team, do we support bulk ingesting links for datasets in the documentation? I thought the CSV enricher could support it; however, it seems not, according to the code https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/csv_enricher.py. I want to make use of the link capability to jump to our query portal. Do we have this kind of practice to share?
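    For reference, a hedged sketch of attaching a documentation link to a dataset programmatically; links on a dataset live in the institutionalMemory aspect (the URN, URL, and actor below are illustrative):
    import time

    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        AuditStampClass,
        InstitutionalMemoryClass,
        InstitutionalMemoryMetadataClass,
    )

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
    dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,example.table,PROD)"

    link = InstitutionalMemoryMetadataClass(
        url="https://query-portal.example.com/table/example.table",
        description="Open in query portal",
        createStamp=AuditStampClass(time=int(time.time() * 1000), actor="urn:li:corpuser:ingestion"),
    )

    # Note: this UPSERT replaces any existing links on the dataset with the list below.
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=dataset_urn,
            aspect=InstitutionalMemoryClass(elements=[link]),
        )
    )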
  • t

    thousands-bird-50049

    02/09/2023, 6:48 AM
    Is there any way to know which managed ingestion source brought in a certain entity?
  • l

    late-bear-87552

    02/09/2023, 6:59 AM
    I am getting the below error while trying to run BigQuery metadata ingestion. Can anyone help with this?
    Copy code
    Traceback (most recent call last):
      File "/home/airflow/.local/lib/python3.8/site-packages/datahub/ingestion/run/pipeline.py", line 114, in _add_init_error_context
        yield
      File "/home/airflow/.local/lib/python3.8/site-packages/datahub/ingestion/run/pipeline.py", line 192, in __init__
        self.source = source_class.create(
      File "/home/airflow/.local/lib/python3.8/site-packages/datahub/ingestion/source/bigquery_v2/bigquery.py", line 262, in create
        return cls(ctx, config)
      File "/home/airflow/.local/lib/python3.8/site-packages/datahub/ingestion/source/bigquery_v2/bigquery.py", line 199, in __init__
        super(BigqueryV2Source, self).__init__(config, ctx)
      File "/home/airflow/.local/lib/python3.8/site-packages/datahub/ingestion/source/state/stateful_ingestion_base.py", line 180, in __init__
        self._initialize_checkpointing_state_provider()
      File "/home/airflow/.local/lib/python3.8/site-packages/datahub/ingestion/source/state/stateful_ingestion_base.py", line 223, in _initialize_checkpointing_state_provider
        checkpointing_state_provider_class.create(
      File "/home/airflow/.local/lib/python3.8/site-packages/datahub/ingestion/source/state_provider/datahub_ingestion_checkpointing_provider.py", line 50, in create
        graph = DataHubGraph(provider_config.datahub_api)
      File "/home/airflow/.local/lib/python3.8/site-packages/datahub/ingestion/graph/client.py", line 72, in __init__
        self.test_connection()
      File "/home/airflow/.local/lib/python3.8/site-packages/datahub/emitter/rest_emitter.py", line 146, in test_connection
        response = self._session.get(f"{self._gms_server}/config")
      File "/home/airflow/.local/lib/python3.8/site-packages/requests/sessions.py", line 600, in get
        return self.request("GET", url, **kwargs)
      File "/home/airflow/.local/lib/python3.8/site-packages/requests/sessions.py", line 587, in request
        resp = self.send(prep, **send_kwargs)
      File "/home/airflow/.local/lib/python3.8/site-packages/requests/sessions.py", line 701, in send
        r = adapter.send(request, **kwargs)
      File "/home/airflow/.local/lib/python3.8/site-packages/requests/adapters.py", line 489, in send
        resp = conn.urlopen(
      File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
        httplib_response = self._make_request(
      File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 398, in _make_request
        conn.request(method, url, **httplib_request_kw)
      File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connection.py", line 239, in request
        super(HTTPConnection, self).request(method, url, body=body, headers=headers)
      File "/usr/local/lib/python3.8/http/client.py", line 1256, in request
        self._send_request(method, url, body, headers, encode_chunked)
      File "/usr/local/lib/python3.8/http/client.py", line 1297, in _send_request
        self.putheader(hdr, value)
      File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connection.py", line 224, in putheader
        _HTTPConnection.putheader(self, header, *values)
      File "/usr/local/lib/python3.8/http/client.py", line 1234, in putheader
        raise ValueError('Invalid header value %r' % (values[i],))
    ValueError: Invalid header value b'Bearer ***********\n'
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "/home/airflow/.local/lib/python3.8/site-packages/airflow/operators/python.py", line 175, in execute
        return_value = self.execute_callable()
      File "/home/airflow/.local/lib/python3.8/site-packages/airflow/operators/python.py", line 192, in execute_callable
        return self.python_callable(*self.op_args, **self.op_kwargs)
      File "/opt/airflow/dags/repo/org/groww/dataplatform/datahub/bigquery/DAG_DATAHUB_BIGQUERY_META_WITH_DENY_DATASETS.py", line 85, in ingest_bigquery_metadata
        pipeline = Pipeline.create(complete_json)
      File "/home/airflow/.local/lib/python3.8/site-packages/datahub/ingestion/run/pipeline.py", line 303, in create
        return cls(
      File "/home/airflow/.local/lib/python3.8/site-packages/datahub/ingestion/run/pipeline.py", line 196, in __init__
        logger.info("Source configured successfully.")
      File "/usr/local/lib/python3.8/contextlib.py", line 131, in __exit__
        self.gen.throw(type, value, traceback)
      File "/home/airflow/.local/lib/python3.8/site-packages/datahub/ingestion/run/pipeline.py", line 116, in _add_init_error_context
        raise PipelineInitError(f"Failed to {step}: {e}") from e
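    A hedged sketch of the likely cause and fix, based on the trailing "\n" visible in the rejected header value: the access token has picked up a newline (for example from a file or secret), so stripping whitespace before it goes into the recipe's datahub_api/sink config is the usual fix for this particular ValueError (the variable and environment names below are illustrative):
    import os

    # Strip the newline that made the Authorization header invalid.
    token = os.environ.get("DATAHUB_GMS_TOKEN", "").strip()

    datahub_api_config = {
        "server": "http://datahub-gms:8080",
        "token": token,
    }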
  • l

    limited-forest-73733

    02/09/2023, 7:27 AM
    Hey team, I am facing an issue: I am not able to see the deployed version in my UI; it's showing null. I deployed the 0.9.6 images and released with tag 3.1.0, so it should reflect 3.1.0. Can anyone please let me know how the frontend uses this version so it is reflected in the UI? Thanks.
  • w

    white-napkin-20729

    02/09/2023, 9:23 AM
    Hi! I've just got my first test ingestion from MS SQL Server working - I was just wondering if there is a way to import the Extended Properties for MSSQL database objects? (i.e. the ones added using sp_addextendedproperty). Ideally I'd like to use these to automatically populate the tags (or something similar)...thanks!
  • e

    elegant-salesmen-99143

    02/09/2023, 9:39 AM
    Sorry if that's a silly question, but what's the difference between Presto-on-Hive and Presto as DataHub sources? When is it better to use one or the other?
  • p

    purple-printer-15193

    02/09/2023, 9:54 AM
    Hi all, do UI ingestion and DataHub Actions now work with AWS Glue as a schema registry?