# ingestion
  • r

    refined-ability-35859

    11/02/2022, 5:08 PM
    Hello all, our team is developing a Vertica connector, and right now DataHub ingests objects one after the other, which increases the ingestion time. Can someone suggest a way to ingest all the metadata at once (see the sketch below)? The goal is to make the connector performant and reduce the ingestion time.
    👍 1
    g
    m
    • 3
    • 4
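    A minimal sketch of one way to approach this: instead of issuing one metadata query per object in sequence, fetch several objects concurrently (or batch them into a few bulk queries against Vertica's v_catalog system tables). The helper names below are hypothetical placeholders, not part of the actual connector.
    Copy code
    # Hypothetical sketch: parallelise per-table metadata queries with a thread pool.
    # fetch_table_metadata() stands in for whatever the connector already does for
    # a single object; it is not a real DataHub or Vertica API.
    from concurrent.futures import ThreadPoolExecutor


    def fetch_table_metadata(table_name: str) -> dict:
        # Placeholder: run the per-table metadata query here.
        return {"name": table_name}


    def fetch_all_metadata(table_names, max_workers: int = 8):
        # Submit all per-table queries up front and yield results in order while
        # the queries run concurrently, instead of strictly one after another.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            yield from pool.map(fetch_table_metadata, table_names)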
  • l

    lively-dusk-19162

    11/02/2022, 8:09 PM
    Hi all, is it possible to ingest data into DataHub through a YAML file (see the note below)?
    g
    h
    • 3
    • 6
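    Yes — the usual path is a YAML recipe run with datahub ingest -c recipe.yml (several examples appear further down in this channel). If you want to drive the same recipe from Python, it can be loaded and handed to the Pipeline API. A minimal sketch, assuming a local recipe.yml containing a source and a datahub-rest sink:
    Copy code
    # Minimal sketch: run a YAML recipe programmatically. The recipe.yml path and
    # its contents are assumptions for illustration.
    import yaml

    from datahub.ingestion.run.pipeline import Pipeline

    with open("recipe.yml") as f:
        recipe = yaml.safe_load(f)  # same dict shape as the recipe file

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.pretty_print_summary()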
  • f

    full-chef-85630

    11/03/2022, 6:14 AM
    When executing ingestion, we first run a job with a properties transformer and then run a job without a transformer. Why is the property data empty in the final result, @dazzling-judge-80093? Does the second run overwrite it? Because a single job takes a long time to execute, we split it into two, and this happens. How can we prevent this situation?
    h
    • 2
    • 11
  • l

    lemon-cat-72045

    11/03/2022, 7:27 AM
    Hi all, does stateful ingestion support the Kafka sink? I have a recipe ingesting BigQuery metadata to a Kafka sink, and it fails with the following error:
    Copy code
    datahub.ingestion.run.pipeline.PipelineInitError: Failed to configure source (bigquery)
    [2022-11-03 07:23:42,552] ERROR    {datahub.entrypoints:195} - Command failed:
        Failed to configure source (bigquery) due to
            'Missing provider configuration.'.
        Run with --debug to get full stacktrace.
        e.g. 'datahub --debug ingest run -c /tmp/datahub/ingest/bb9624b9-d4aa-4af4-b861-cd287691400c/recipe.yml --report-to
    Do I need to configure a stateful ingestion state provider for the Kafka sink (see the sketch below)? Thanks!
    h
    • 2
    • 2
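    For context, "Missing provider configuration" generally means the stateful-ingestion checkpoint provider has no DataHub endpoint to use. With a datahub-rest sink it can reuse the sink's connection; with a Kafka sink the state provider appears to need an explicit GMS address. A rough sketch of that shape as a Python pipeline config — the nesting and server URL are assumptions to verify against the stateful ingestion docs:
    Copy code
    # Rough sketch (verify the field names against the stateful ingestion docs):
    # give the checkpoint state provider its own DataHub endpoint when the sink
    # is Kafka rather than datahub-rest.
    recipe = {
        "pipeline_name": "bigquery_to_kafka",  # required for stateful ingestion
        "source": {
            "type": "bigquery",
            "config": {
                "stateful_ingestion": {
                    "enabled": True,
                    "state_provider": {
                        "type": "datahub",
                        "config": {"datahub_api": {"server": "http://localhost:8080"}},
                    },
                },
            },
        },
        "sink": {
            "type": "datahub-kafka",
            "config": {"connection": {"bootstrap": "localhost:9092"}},
        },
    }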
  • m

    mammoth-gigabyte-6392

    11/03/2022, 7:40 AM
    Hello! I am trying to ingest data from S3 to the datahub-rest sink. When I execute this script, it runs successfully, but no data is uploaded. What am I missing? (See the note after the report below.)
    Copy code
    from datahub.ingestion.run.pipeline import Pipeline
    
    
    def get_pipeline():
        pipeline = Pipeline.create(
            {
                "source": {
                    "type": "s3",
                    "config": {
                        "path_specs": [{
                            "include": "<s3://path/to/my/json>"}],
                        "aws_config": {
                            "aws_access_key_id": "**************",
                            "aws_secret_access_key": "***************",
                            "aws_region": "*********"
                        },
                        "env": "prod",
                        "profiling": {"enabled": False},
                    },
                },
                "sink": {
                    "type": "datahub-rest",
                    "config": {
                        "server": "server-link",
                        "token": "*******"
                    }
                },
            }
        )
        return pipeline
    
    
    def main():
        pipeline = get_pipeline()
        pipeline.run()
        pipeline.pretty_print_summary()
    
    
    if __name__ == '__main__':
        main()
    Copy code
    Cli report:
    {'cli_version': '0.9.1',
     'cli_entry_location': '/usr/local/lib/python3.8/dist-packages/datahub/__init__.py',
     'py_version': '3.8.10 (default, Mar 15 2022, 12:22:08) \n[GCC 9.4.0]',
     'py_exec_path': '/usr/bin/python3',
     'os_details': 'Linux-5.4.172-90.336.amzn2.x86_64-x86_64-with-glibc2.29',
     'mem_info': '232.53 MB'}
    Source (s3) report:
    {'events_produced': '0',
     'events_produced_per_sec': '0',
     'event_ids': [],
     'warnings': {},
     'failures': {},
     'filtered': [],
     'start_time': '2022-11-03 07:29:28.481404 (now).',
     'running_time': '0.5 seconds'}
    Sink (datahub-rest) report:
    {'total_records_written': '0',
     'records_written_per_second': '0',
     'warnings': [],
     'failures': [],
     'start_time': '2022-11-03 07:29:28.471589 (now).',
     'current_time': '2022-11-03 07:29:28.982979 (now).',
     'total_duration_in_seconds': '0.51',
     'gms_version': 'v0.8.45',
     'pending_requests': '0'}
    
     Pipeline finished successfully; produced 0 events in 0.5 seconds.
    d
    • 2
    • 30
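    For what it's worth, "produced 0 events" with a successful run usually means the s3 source's path_spec matched no files at all, rather than a sink problem. The include pattern generally has to point at actual files (wildcards are allowed); a hedged example with a made-up bucket layout:
    Copy code
    # Hypothetical path_spec: point "include" at real files (wildcards allowed),
    # not just a folder prefix. The bucket and prefix names are placeholders.
    path_specs = [
        {"include": "s3://my-bucket/data/*.json"},
    ]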
  • m

    microscopic-mechanic-13766

    11/03/2022, 9:26 AM
    Good morning, I have done a Hive ingestion on v0.9.0 with CLI version 0.9.0.4. I am aware that ingestion from this source (especially its profiling) is not the best example, but is this behaviour normal? It hasn't been able to obtain the min, max, mean, and median values for the first 4 numeric fields.
    d
    • 2
    • 5
  • s

    steep-family-13549

    11/03/2022, 9:52 AM
    Hi team, I am integrating Great Expectations. The CLI shows that all tests pass, but the UI only lists the assertions; it does not show whether each assertion passed or failed. I have attached some screenshots; please let me know if anyone has ideas.
    h
    • 2
    • 1
  • s

    steep-family-13549

    11/03/2022, 9:55 AM
    (screenshots attached: image.png, image.png)
    • 1
    • 1
  • d

    dazzling-park-96517

    11/03/2022, 11:15 AM
    Hi, I'm new to DataHub and I'm trying to ingest a Superset source. I've read the documentation and wrote the recipe:
    Copy code
    sink:
      type: datahub-rest
      config:
        server: http://datahub-Datahub-gms:8080
    source:
      type: superset
      config:
        connect_uri: <myhost:port>
        username: myuser
        password: mypassword
    But I always get the error below:
    Copy code
    self.access_token = login_response.json()["access_token"]
    KeyError: 'access_token'
    Access to my Superset instance is implemented with Keycloak. Any suggestions to solve this problem? Thanks in advance.
    a
    • 2
    • 3
  • r

    rapid-army-98062

    11/03/2022, 11:25 AM
    Hi all, we are new to using DataHub. We are trying to use SQLAlchemy with CrateDB, but we get the following error:
    Copy code
        entrypoint = u._get_entrypoint()
      File "/tmp/datahub/ingest/venv-928a9961-8859-44e5-aaab-dfe230122564/lib/python3.9/site-packages/sqlalchemy/engine/url.py", line 172, in _get_entrypoint
        cls = registry.load(name)
      File "/tmp/datahub/ingest/venv-928a9961-8859-44e5-aaab-dfe230122564/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 277, in load
        raise exc.NoSuchModuleError(

    NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:crate
    We have installed the following packages in the acryl-datahub-actions Docker image:
    Copy code
    RUN pip install crate acryl-datahub[sqlalchemy] crate[sqlalchemy]
    However, when the ingestion job runs, the crate[sqlalchemy] package is not present. Any idea how we can get it loaded for ingestion runs launched from the DataHub UI?
    m
    • 2
    • 11
  • d

    delightful-barista-90363

    11/03/2022, 4:32 PM
    Hey! We recently upgraded our DataHub version to the latest and now have S3 paths in our Spark lineage (love it). One issue is that the dataset names currently have s3a instead of s3 in the path. I think this is going to prevent linking S3 datasets ingested through the S3 source with those ingested through Spark lineage. Wondering if there's any plan to work on this!
    a
    • 2
    • 3
  • g

    green-lion-58215

    11/03/2022, 5:07 PM
    Does anyone know why I am receiving this error while ingesting glossary terms using a recipe?
    Copy code
    File "pydantic/main.py", line 521, in pydantic.main.BaseModel.parse_obj
    
    File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
    
    ValidationError: 1 validation error for BusinessGlossarySourceConfig
    enable_auto_id
      extra fields not permitted (type=value_error.extra)
    • 1
    • 2
  • b

    bumpy-pharmacist-66525

    11/03/2022, 6:02 PM
    Hi everyone, when it comes to stateful ingestion of a source, you need to specify a field called
    pipeline_name
    in the recipe (https://datahubproject.io/docs/metadata-ingestion/docs/dev_guides/stateful#sample-configuration). Is there a way to delete pipelines once they have been created?
    a
    g
    • 3
    • 4
  • n

    nutritious-salesclerk-57675

    11/03/2022, 6:14 PM
    Good day everyone. I am trying to integrate DataHub with my Cloud Composer instance (Airflow version 2.2.5). My REST emitter seems to fail with the following error (see the note below the traceback):
    Copy code
    [2022-11-04, 01:48:09 ] {logging_mixin.py:109} INFO - Exception: Traceback (most recent call last):
      File "/opt/python3.8/lib/python3.8/site-packages/datahub/emitter/rest_emitter.py", line 241, in _emit_generic
        response = self._session.post(url, data=payload)
      File "/opt/python3.8/lib/python3.8/site-packages/requests/sessions.py", line 577, in post
        return self.request('POST', url, data=data, json=json, **kwargs)
      File "/opt/python3.8/lib/python3.8/site-packages/requests/sessions.py", line 515, in request
        prep = self.prepare_request(req)
      File "/opt/python3.8/lib/python3.8/site-packages/requests/sessions.py", line 443, in prepare_request
        p.prepare(
      File "/opt/python3.8/lib/python3.8/site-packages/requests/models.py", line 318, in prepare
        self.prepare_url(url, params)
      File "/opt/python3.8/lib/python3.8/site-packages/requests/models.py", line 392, in prepare_url
        raise MissingSchema(error)
    requests.exceptions.MissingSchema: Invalid URL '/aspects?action=ingestProposal': No scheme supplied. Perhaps you meant http:///aspects?action=ingestProposal?
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "/opt/python3.8/lib/python3.8/site-packages/datahub_airflow_plugin/datahub_plugin.py", line 337, in custom_on_success_callback
        datahub_on_success_callback(context)
      File "/opt/python3.8/lib/python3.8/site-packages/datahub_airflow_plugin/datahub_plugin.py", line 204, in datahub_on_success_callback
        dataflow.emit(emitter)
      File "/opt/python3.8/lib/python3.8/site-packages/datahub/api/entities/datajob/dataflow.py", line 155, in emit
        rest_emitter.emit(mcp)
      File "/opt/python3.8/lib/python3.8/site-packages/datahub/emitter/rest_emitter.py", line 183, in emit
        self.emit_mcp(item)
      File "/opt/python3.8/lib/python3.8/site-packages/datahub/emitter/rest_emitter.py", line 218, in emit_mcp
        self._emit_generic(url, payload)
      File "/opt/python3.8/lib/python3.8/site-packages/datahub/emitter/rest_emitter.py", line 255, in _emit_generic
        raise OperationalError(
    datahub.configuration.common.OperationalError: ('Unable to emit metadata to DataHub GMS', {'message': "Invalid URL '/aspects?action=ingestProposal': No scheme supplied. Perhaps you meant http:///aspects?action=ingestProposal?"})
    [2022-11-04, 01:48:09 ] {logging_mixin.py:109} INFO - 
    [2022-11-04, 01:48:09 ] {local_task_job.py:264} INFO - 0 downstream tasks scheduled from follow-on schedule check
    I don't seem to get this error when I don't have a secret manager configured; it only occurs when I try to integrate DataHub with a Composer instance that has a secret manager configured. Does anyone have an idea as to what I am doing wrong here?
    g
    h
    • 3
    • 7
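    Reading the traceback, the emitter ended up with an empty GMS address, so requests only saw the "/aspects?action=ingestProposal" path with no scheme. That points at the datahub_rest Airflow connection's host resolving to an empty (or scheme-less) value through the secret manager, rather than at the plugin itself. A quick way to sanity-check the value outside Airflow, assuming a reachable GMS URL:
    Copy code
    # Sanity check (URL and token are placeholders): the emitter needs a full URL
    # with a scheme, e.g. "http://datahub-gms:8080", not a bare host or an empty string.
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    emitter = DatahubRestEmitter(gms_server="http://datahub-gms:8080", token="<token>")
    emitter.test_connection()  # raises if the server is unreachable or the URL is malformed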
  • l

    lively-dusk-19162

    11/03/2022, 7:10 PM
    What would be the best parser for extracting column-level lineage from SQL queries (see the sketch below)?
    g
    • 2
    • 1
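    One commonly used option is the sqllineage library; recent versions expose column-level lineage directly. A small sketch, assuming sqllineage >= 1.3 is installed (treat the exact API as something to verify for your version):
    Copy code
    # Sketch using the sqllineage package (assumed installed); get_column_lineage()
    # is available from sqllineage 1.3 onwards and returns source->target column paths.
    from sqllineage.runner import LineageRunner

    sql = """
    INSERT INTO analytics.orders_enriched
    SELECT o.order_id, c.customer_name
    FROM raw.orders o
    JOIN raw.customers c ON o.customer_id = c.customer_id
    """

    runner = LineageRunner(sql)
    for lineage_path in runner.get_column_lineage():
        print(" -> ".join(str(col) for col in lineage_path))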
  • e

    eager-lifeguard-22029

    11/03/2022, 11:45 PM
    Is there a way to delete metadata from DataHub via the Python SDK?
    g
    e
    • 3
    • 7
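    For reference, hard deletes are usually done with the CLI (datahub delete --urn ...), but a soft delete can be done from Python by emitting a Status aspect with removed set to true. A minimal sketch with a made-up dataset URN and server address:
    Copy code
    # Minimal soft-delete sketch: mark an entity as removed by emitting a Status
    # aspect. The URN and server below are placeholders.
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, StatusClass

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,db.my_table,PROD)"
    emitter.emit_mcp(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=dataset_urn,
            aspectName="status",
            aspect=StatusClass(removed=True),  # soft delete; use the CLI for hard deletes
        )
    )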
  • l

    lively-dusk-19162

    11/04/2022, 2:57 AM
    Is there any API to emit column level lineage to datahub?
    m
    g
    q
    • 4
    • 10
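    Column-level (fine-grained) lineage between datasets can be emitted as part of the upstreamLineage aspect; the shape below follows the fine-grained lineage sample in the DataHub repo, with placeholder URNs, field names, and server address:
    Copy code
    # Sketch of emitting column-level lineage via the upstreamLineage aspect.
    from datahub.emitter.mce_builder import make_dataset_urn, make_schema_field_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.com.linkedin.pegasus2avro.dataset import (
        DatasetLineageType,
        FineGrainedLineage,
        FineGrainedLineageDownstreamType,
        FineGrainedLineageUpstreamType,
        Upstream,
        UpstreamLineage,
    )
    from datahub.metadata.schema_classes import ChangeTypeClass

    upstream_urn = make_dataset_urn("postgres", "public.orders")
    downstream_urn = make_dataset_urn("postgres", "analytics.orders_summary")

    lineage = UpstreamLineage(
        upstreams=[Upstream(dataset=upstream_urn, type=DatasetLineageType.TRANSFORMED)],
        fineGrainedLineages=[
            FineGrainedLineage(
                upstreamType=FineGrainedLineageUpstreamType.FIELD_SET,
                upstreams=[make_schema_field_urn(upstream_urn, "order_id")],
                downstreamType=FineGrainedLineageDownstreamType.FIELD,
                downstreams=[make_schema_field_urn(downstream_urn, "order_id")],
            )
        ],
    )

    DatahubRestEmitter(gms_server="http://localhost:8080").emit_mcp(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=downstream_urn,
            aspectName="upstreamLineage",
            aspect=lineage,
        )
    )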
  • m

    microscopic-mechanic-13766

    11/04/2022, 8:44 AM
    Good Friday! Yesterday (thread) I did some testing related to both Hive and PostgreSQL profiling, since the min, max, ... values weren't being obtained. I tracked down why those values were not being computed (at least for PostgreSQL): it was due to the column's cardinality. If a column had no null values or duplicate values, the cardinality assigned to it didn't trigger the processes that calculate the min, max, ... values, which I can't yet understand. Could someone please explain why the cardinality of a column determines whether these values are calculated? In my opinion they should be calculated for all columns except those whose null count is so large that, for example, calculating the mean loses meaning (as it won't be a relevant value). Thanks in advance!
    d
    g
    • 3
    • 12
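    Not official profiler code, but roughly what is happening: each column is bucketed into a cardinality level based on its unique-value ratio, and the distribution stats (min/max/mean/median) are only computed for some of those buckets, so constant or fully unique columns can come back without them. A simplified, hypothetical illustration of that gating:
    Copy code
    # Simplified, hypothetical illustration of cardinality-gated profiling;
    # the real buckets and thresholds live in the GE-based profiler, not here.
    def cardinality_bucket(unique_count: int, row_count: int) -> str:
        if row_count == 0 or unique_count <= 1:
            return "ONE"
        ratio = unique_count / row_count
        if ratio >= 0.99:
            return "UNIQUE"  # e.g. key-like columns with no duplicates
        if unique_count < 20:
            return "FEW"
        return "MANY"


    def computes_distribution_stats(bucket: str) -> bool:
        # Only some buckets trigger min/max/mean/median; in the version observed
        # above, UNIQUE columns apparently fall outside this set.
        return bucket in {"MANY", "VERY_MANY"}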
  • l

    limited-forest-73733

    11/04/2022, 11:12 AM
    Hey team, I am not able to see Snowflake views. Can anyone please help me? This is the recipe I am using.
    h
    • 2
    • 3
  • f

    few-carpenter-93837

    11/04/2022, 12:17 PM
    Hey guys, if I'm using the CLI + a recipe to ingest data into DataHub:
    Copy code
    datahub ingest -c datahub-vertica-lineage-ingestion.dhub.yaml
    then how am I supposed to disable telemetry, as mentioned here? https://datahubproject.io/docs/cli/#user-guide
    b
    • 2
    • 1
  • f

    few-carpenter-93837

    11/04/2022, 12:19 PM
    Adding the two commands one after the other just gives an error.
  • f

    few-carpenter-93837

    11/04/2022, 1:00 PM
    The lineage aspect is a bit hard to understand. Will data sent in by lineage_emitter_dataset_finegrained overwrite data sent in by lineage_emitter_rest, and vice versa? What about the note in https://datahubproject.io/docs/lineage/sample_code that emitting any aspect associated with an entity completely overwrites the previous value — what does this mean? Let's say I have the following lineage: Table1 -> View1, Atr1 -> Atr1. If I now add another table to the View1 relation, do I need to send in info about both Table1 and Table2, since if I only send Table2's info the previous info about Table1 is overwritten? (See the sketch below.)
    a
    • 2
    • 1
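    On the overwrite question: yes, the upstreamLineage aspect is replaced wholesale on every emit, so sending only Table2 would drop Table1. The usual pattern is read-modify-write: fetch the current aspect, append the new upstream, and emit the merged value. A sketch of that pattern with placeholder URNs (the exact graph-client method name may differ between CLI versions):
    Copy code
    # Read-modify-write sketch for upstreamLineage. URNs and server are placeholders;
    # check your CLI version for the exact "get aspect" method name.
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        DatasetLineageTypeClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

    view_urn = "urn:li:dataset:(urn:li:dataPlatform:vertica,db.View1,PROD)"
    table2_urn = "urn:li:dataset:(urn:li:dataPlatform:vertica,db.Table2,PROD)"

    # Fetch whatever lineage is already there (may be None on the first write).
    existing = graph.get_aspect_v2(
        entity_urn=view_urn,
        aspect="upstreamLineage",
        aspect_type=UpstreamLineageClass,
    ) or UpstreamLineageClass(upstreams=[])

    # Append Table2 while keeping Table1 and anything else already recorded.
    existing.upstreams.append(
        UpstreamClass(dataset=table2_urn, type=DatasetLineageTypeClass.TRANSFORMED)
    )

    graph.emit_mcp(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=view_urn,
            aspectName="upstreamLineage",
            aspect=existing,
        )
    )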
  • l

    limited-forest-73733

    11/04/2022, 1:59 PM
    Hey team, I am able to enable Snowflake table profiling, but something is wrong: I am unable to see the null count. Do we need to specify any field for this?
    h
    • 2
    • 15
  • m

    most-monkey-10812

    11/04/2022, 2:03 PM
    Hi! I am trying to ingest column-level lineage info as the dataJobInputOutput aspect of the datajob entity, but I don't see anything in the UI. There is also the possibility of ingesting this info as the upstreamLineage aspect of a dataset. Do these two approaches complement each other, or are they mutually exclusive? Is column-level lineage information for a datajob (dataset -> datajob -> dataset) somehow reflected in the lineage visualisation UI or in the column-level impact analysis screen in versions 0.9.0 or 0.9.1?
    a
    b
    b
    • 4
    • 8
  • d

    dazzling-park-96517

    11/04/2022, 3:14 PM
    Hi all, I'm struggling with a Druid recipe ingestion. My Druid app uses HTTPS, and when I submit the recipe, the error says that the port is null. My host_port is:
    host_port: https://my-secured-Druid-app:443
    Can somebody share a recipe for a Druid connection? Maybe some extra sqlalchemy configuration is necessary? Thanks in advance.
    a
    g
    • 3
    • 5
  • r

    ripe-alarm-85320

    11/04/2022, 5:22 PM
    Has anyone built an ingestion source for Domo (a BI tool), or is there documentation I can turn to and an estimate of the complexity/effort of building one? (See the skeleton below.)
    a
    • 2
    • 2
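    There is no Domo source in the project as far as I know, so it would be a custom source. Most of the effort is in the Domo API calls; the DataHub side is fairly mechanical: a config class, a get_workunits() generator that turns Domo dashboards/cards into metadata change proposals, and a report. A rough skeleton with everything Domo-specific left as placeholders:
    Copy code
    # Rough skeleton of a custom ingestion source (all Domo specifics are placeholders).
    from dataclasses import dataclass, field
    from typing import Iterable

    from datahub.configuration.common import ConfigModel
    from datahub.ingestion.api.common import PipelineContext
    from datahub.ingestion.api.source import Source, SourceReport
    from datahub.ingestion.api.workunit import MetadataWorkUnit


    class DomoSourceConfig(ConfigModel):
        client_id: str
        client_secret: str


    @dataclass
    class DomoSource(Source):
        config: DomoSourceConfig
        report: SourceReport = field(default_factory=SourceReport)

        @classmethod
        def create(cls, config_dict: dict, ctx: PipelineContext) -> "DomoSource":
            return cls(ctx=ctx, config=DomoSourceConfig.parse_obj(config_dict))

        def get_workunits(self) -> Iterable[MetadataWorkUnit]:
            # Call the Domo API here and wrap each dashboard/card/dataset in a
            # MetadataChangeProposalWrapper inside a MetadataWorkUnit.
            return []

        def get_report(self) -> SourceReport:
            return self.report

        def close(self) -> None:
            pass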
  • q

    quiet-school-18370

    11/04/2022, 10:05 PM
    Hi team, I am integrating DataHub with LookML. The following is our recipe.dhub.yaml file:
    Copy code
    sink:
        type: datahub-rest
        config:
          server: 'https://datahub.dev.dap.XXXXXX.com:8080'
          token : "XXXXXX"
    source:
        type: lookml
        config:
           github_info:
              repo: 'XXXX'  # repo address where the deploy key is added
              #          deploy_key_file: <file_address>
           api:
                base_url: 'https://dev-looker.XXXXXXX.com'
                client_secret: 'XXXXXXXXX'
                client_id: XXXXXXXX
           base_folder: /
    pipeline_name: XXXXXXXXX
    but when I run the
    datahub ingest -c recipe.dhub.yaml
    command, I receive the following error:
    Copy code
    raise ConfigurationError(
    
    ConfigurationError: Failed to initialize Looker client. Please check your configuration.
    m
    a
    +2
    • 5
    • 10
  • q

    quiet-school-18370

    11/04/2022, 10:06 PM
    Can anyone help me resolve this error?
  • g

    gifted-rocket-7960

    11/07/2022, 5:41 AM
    Hi team, I have created a pull-model ingestion from Redshift to DataHub with the include_copy_lineage: true parameter. The lineage is created with all the part files in the folder instead of the folder name.
    h
    • 2
    • 2
  • g

    gifted-rocket-7960

    11/07/2022, 5:42 AM
    Is there a way we can display only the folder?