# ingestion
  • r

    refined-ability-35859

    11/02/2022, 5:08 PM
    Hello all, our team is developing a Vertica connector, and right now DataHub ingests objects one after the other, which increases the ingestion time. Can someone suggest a way to ingest all the metadata at once (see the sketch below)? The goal is to make the connector performant and reduce the ingestion time.
    👍 1
    g
    m
    • 3
    • 4
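    A minimal sketch of one way to approach this: instead of issuing one metadata query per object in sequence, fetch several objects concurrently (or batch them into a few bulk queries against Vertica's v_catalog system tables). The helper names below are hypothetical placeholders, not part of the actual connector.
    Copy code
    # Hypothetical sketch: parallelise per-table metadata queries with a thread pool.
    # fetch_table_metadata() stands in for whatever the connector already does for
    # a single object; it is not a real DataHub or Vertica API.
    from concurrent.futures import ThreadPoolExecutor


    def fetch_table_metadata(table_name: str) -> dict:
        # Placeholder: run the per-table metadata query here.
        return {"name": table_name}


    def fetch_all_metadata(table_names, max_workers: int = 8):
        # Submit all per-table queries up front and yield results in order while
        # the queries run concurrently, instead of strictly one after another.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            yield from pool.map(fetch_table_metadata, table_names)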
  • l

    lively-dusk-19162

    11/02/2022, 8:09 PM
    Hi all, is it possible to ingest data into DataHub through a YAML file (see the note below)?
    g
    h
    • 3
    • 6
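    Yes — the usual path is a YAML recipe run with datahub ingest -c recipe.yml (several examples appear further down in this channel). If you want to drive the same recipe from Python, it can be loaded and handed to the Pipeline API. A minimal sketch, assuming a local recipe.yml containing a source and a datahub-rest sink:
    Copy code
    # Minimal sketch: run a YAML recipe programmatically. The recipe.yml path and
    # its contents are assumptions for illustration.
    import yaml

    from datahub.ingestion.run.pipeline import Pipeline

    with open("recipe.yml") as f:
        recipe = yaml.safe_load(f)  # same dict shape as the recipe file

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.pretty_print_summary()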
  • f

    full-chef-85630

    11/03/2022, 6:14 AM
    When executing ingestion, we first run a job with a properties transformer and then run a job without a transformer. Why is the property data empty in the final result, @dazzling-judge-80093? Does the second run overwrite it? Because a single job takes a long time to execute, we split it into two, and this happens. How can we prevent this situation?
    h
    • 2
    • 11
  • l

    lemon-cat-72045

    11/03/2022, 7:27 AM
    Hi all, does stateful ingestion support the Kafka sink? I have a recipe ingesting BigQuery metadata to a Kafka sink, and it fails with the following error:
    Copy code
    datahub.ingestion.run.pipeline.PipelineInitError: Failed to configure source (bigquery)
    [2022-11-03 07:23:42,552] ERROR    {datahub.entrypoints:195} - Command failed:
        Failed to configure source (bigquery) due to
            'Missing provider configuration.'.
        Run with --debug to get full stacktrace.
        e.g. 'datahub --debug ingest run -c /tmp/datahub/ingest/bb9624b9-d4aa-4af4-b861-cd287691400c/recipe.yml --report-to
    Do I need to configure a stateful ingestion state provider for the Kafka sink (see the sketch below)? Thanks!
    h
    • 2
    • 2
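    For context, "Missing provider configuration" generally means the stateful-ingestion checkpoint provider has no DataHub endpoint to use. With a datahub-rest sink it can reuse the sink's connection; with a Kafka sink the state provider appears to need an explicit GMS address. A rough sketch of that shape as a Python pipeline config — the nesting and server URL are assumptions to verify against the stateful ingestion docs:
    Copy code
    # Rough sketch (verify the field names against the stateful ingestion docs):
    # give the checkpoint state provider its own DataHub endpoint when the sink
    # is Kafka rather than datahub-rest.
    recipe = {
        "pipeline_name": "bigquery_to_kafka",  # required for stateful ingestion
        "source": {
            "type": "bigquery",
            "config": {
                "stateful_ingestion": {
                    "enabled": True,
                    "state_provider": {
                        "type": "datahub",
                        "config": {"datahub_api": {"server": "http://localhost:8080"}},
                    },
                },
            },
        },
        "sink": {
            "type": "datahub-kafka",
            "config": {"connection": {"bootstrap": "localhost:9092"}},
        },
    }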
  • m

    mammoth-gigabyte-6392

    11/03/2022, 7:40 AM
    Hello! I am trying to ingest data from S3 to the datahub-rest sink. When I execute this script, it runs successfully, but no data is uploaded. What am I missing? (See the note after the report below.)
    Copy code
    from datahub.ingestion.run.pipeline import Pipeline
    
    
    def get_pipeline():
        pipeline = Pipeline.create(
            {
                "source": {
                    "type": "s3",
                    "config": {
                        "path_specs": [{
                            "include": "<s3://path/to/my/json>"}],
                        "aws_config": {
                            "aws_access_key_id": "**************",
                            "aws_secret_access_key": "***************",
                            "aws_region": "*********"
                        },
                        "env": "prod",
                        "profiling": {"enabled": False},
                    },
                },
                "sink": {
                    "type": "datahub-rest",
                    "config": {
                        "server": "server-link",
                        "token": "*******"
                    }
                },
            }
        )
        return pipeline
    
    
    def main():
        pipeline = get_pipeline()
        pipeline.run()
        pipeline.pretty_print_summary()
    
    
    if __name__ == '__main__':
        main()
    Copy code
    Cli report:
    {'cli_version': '0.9.1',
     'cli_entry_location': '/usr/local/lib/python3.8/dist-packages/datahub/__init__.py',
     'py_version': '3.8.10 (default, Mar 15 2022, 12:22:08) \n[GCC 9.4.0]',
     'py_exec_path': '/usr/bin/python3',
     'os_details': 'Linux-5.4.172-90.336.amzn2.x86_64-x86_64-with-glibc2.29',
     'mem_info': '232.53 MB'}
    Source (s3) report:
    {'events_produced': '0',
     'events_produced_per_sec': '0',
     'event_ids': [],
     'warnings': {},
     'failures': {},
     'filtered': [],
     'start_time': '2022-11-03 07:29:28.481404 (now).',
     'running_time': '0.5 seconds'}
    Sink (datahub-rest) report:
    {'total_records_written': '0',
     'records_written_per_second': '0',
     'warnings': [],
     'failures': [],
     'start_time': '2022-11-03 07:29:28.471589 (now).',
     'current_time': '2022-11-03 07:29:28.982979 (now).',
     'total_duration_in_seconds': '0.51',
     'gms_version': 'v0.8.45',
     'pending_requests': '0'}
    
     Pipeline finished successfully; produced 0 events in 0.5 seconds.
    d
    • 2
    • 30
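    For what it's worth, "produced 0 events" with a successful run usually means the s3 source's path_spec matched no files at all, rather than a sink problem. The include pattern generally has to point at actual files (wildcards are allowed); a hedged example with a made-up bucket layout:
    Copy code
    # Hypothetical path_spec: point "include" at real files (wildcards allowed),
    # not just a folder prefix. The bucket and prefix names are placeholders.
    path_specs = [
        {"include": "s3://my-bucket/data/*.json"},
    ]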
  • m

    microscopic-mechanic-13766

    11/03/2022, 9:26 AM
    Good morning, I have done a Hive ingestion on v0.9.0 with CLI version 0.9.0.4. I am aware that ingestion from this source (especially its profiling) is not the best example, but is this behaviour normal? It hasn't been able to obtain the min, max, mean, and median values for the first 4 numeric fields.
    d
    • 2
    • 5
  • s

    steep-family-13549

    11/03/2022, 9:52 AM
    Hi team, I am integrating Great Expectations. The CLI shows that all tests pass, but the UI only lists the assertions; it does not show whether each assertion passed or failed. I have attached some screenshots; please let me know if anyone has ideas.
    h
    • 2
    • 1
  • s

    steep-family-13549

    11/03/2022, 9:55 AM
    (screenshots attached: image.png, image.png)
    • 1
    • 1
  • d

    dazzling-park-96517

    11/03/2022, 11:15 AM
    Hi, I'm new to DataHub and I'm trying to ingest a Superset source. I've read the documentation and wrote the recipe:
    Copy code
    sink:
      type: datahub-rest
      config:
        server: http://datahub-Datahub-gms:8080
    source:
      type: superset
      config:
        connect_uri: <myhost:port>
        username: myuser
        password: mypassword
    But I always get the error below:
    Copy code
    self.access_token = login_response.json()["access_token"]
    KeyError: 'access_token'
    Access to my Superset instance is implemented with Keycloak. Any suggestions to solve this problem? Thanks in advance.
    a
    • 2
    • 3
  • r

    rapid-army-98062

    11/03/2022, 11:25 AM
    Hi all, we are new to using DataHub. We are trying to use SQLAlchemy with CrateDB, but we get the following error:
    Copy code
        entrypoint = u._get_entrypoint()
      File "/tmp/datahub/ingest/venv-928a9961-8859-44e5-aaab-dfe230122564/lib/python3.9/site-packages/sqlalchemy/engine/url.py", line 172, in _get_entrypoint
        cls = registry.load(name)
      File "/tmp/datahub/ingest/venv-928a9961-8859-44e5-aaab-dfe230122564/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 277, in load
        raise exc.NoSuchModuleError(

    NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:crate
    We have installed the following packages in the acryl-datahub-actions Docker image:
    Copy code
    RUN pip install crate acryl-datahub[sqlalchemy] crate[sqlalchemy]
    However, when the ingestion job runs, the crate[sqlalchemy] package is not present. Any idea how we can get it loaded for ingestion runs launched from the DataHub UI?
    m
    • 2
    • 11
  • d

    delightful-barista-90363

    11/03/2022, 4:32 PM
    Hey! We recently upgraded our DataHub version to the latest and now have S3 paths in our Spark lineage (love it). One issue is that the dataset names currently have s3a instead of s3 in the path. I think this is going to prevent linking S3 datasets ingested through the S3 source with those ingested through Spark lineage. Wondering if there's any plan to work on this!
    a
    • 2
    • 3
  • g

    green-lion-58215

    11/03/2022, 5:07 PM
    Does anyone know why I am receiving this error while ingesting glossary terms using a recipe?
    Copy code
    File "pydantic/main.py", line 521, in pydantic.main.BaseModel.parse_obj
    
    File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
    
    ValidationError: 1 validation error for BusinessGlossarySourceConfig
    enable_auto_id
      extra fields not permitted (type=value_error.extra)
    • 1
    • 2
  • b

    bumpy-pharmacist-66525

    11/03/2022, 6:02 PM
    Hi everyone, when it comes to stateful ingestion of a source, you need to specify a field called
    pipeline_name
    in the recipe (https://datahubproject.io/docs/metadata-ingestion/docs/dev_guides/stateful#sample-configuration). Is there a way to delete pipelines once they have been created?
    a
    g
    • 3
    • 4
  • n

    nutritious-salesclerk-57675

    11/03/2022, 6:14 PM
    Good day everyone. I am trying to integrate DataHub with my Cloud Composer instance (Airflow version 2.2.5). My REST emitter seems to fail with the following error (see the note below the traceback):
    Copy code
    [2022-11-04, 01:48:09 ] {logging_mixin.py:109} INFO - Exception: Traceback (most recent call last):
      File "/opt/python3.8/lib/python3.8/site-packages/datahub/emitter/rest_emitter.py", line 241, in _emit_generic
        response = self._session.post(url, data=payload)
      File "/opt/python3.8/lib/python3.8/site-packages/requests/sessions.py", line 577, in post
        return self.request('POST', url, data=data, json=json, **kwargs)
      File "/opt/python3.8/lib/python3.8/site-packages/requests/sessions.py", line 515, in request
        prep = self.prepare_request(req)
      File "/opt/python3.8/lib/python3.8/site-packages/requests/sessions.py", line 443, in prepare_request
        p.prepare(
      File "/opt/python3.8/lib/python3.8/site-packages/requests/models.py", line 318, in prepare
        self.prepare_url(url, params)
      File "/opt/python3.8/lib/python3.8/site-packages/requests/models.py", line 392, in prepare_url
        raise MissingSchema(error)
    requests.exceptions.MissingSchema: Invalid URL '/aspects?action=ingestProposal': No scheme supplied. Perhaps you meant http:///aspects?action=ingestProposal?
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "/opt/python3.8/lib/python3.8/site-packages/datahub_airflow_plugin/datahub_plugin.py", line 337, in custom_on_success_callback
        datahub_on_success_callback(context)
      File "/opt/python3.8/lib/python3.8/site-packages/datahub_airflow_plugin/datahub_plugin.py", line 204, in datahub_on_success_callback
        dataflow.emit(emitter)
      File "/opt/python3.8/lib/python3.8/site-packages/datahub/api/entities/datajob/dataflow.py", line 155, in emit
        rest_emitter.emit(mcp)
      File "/opt/python3.8/lib/python3.8/site-packages/datahub/emitter/rest_emitter.py", line 183, in emit
        self.emit_mcp(item)
      File "/opt/python3.8/lib/python3.8/site-packages/datahub/emitter/rest_emitter.py", line 218, in emit_mcp
        self._emit_generic(url, payload)
      File "/opt/python3.8/lib/python3.8/site-packages/datahub/emitter/rest_emitter.py", line 255, in _emit_generic
        raise OperationalError(
    datahub.configuration.common.OperationalError: ('Unable to emit metadata to DataHub GMS', {'message': "Invalid URL '/aspects?action=ingestProposal': No scheme supplied. Perhaps you meant http:///aspects?action=ingestProposal?"})
    [2022-11-04, 01:48:09 ] {logging_mixin.py:109} INFO - 
    [2022-11-04, 01:48:09 ] {local_task_job.py:264} INFO - 0 downstream tasks scheduled from follow-on schedule check
    I don't seem to get this error when I don't have a secret manager configured; it only occurs when I try to integrate DataHub with a Composer instance that has a secret manager configured. Does anyone have an idea as to what I am doing wrong here?
    g
    h
    • 3
    • 7
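    Reading the traceback, the emitter ended up with an empty GMS address, so requests only saw the "/aspects?action=ingestProposal" path with no scheme. That points at the datahub_rest Airflow connection's host resolving to an empty (or scheme-less) value through the secret manager, rather than at the plugin itself. A quick way to sanity-check the value outside Airflow, assuming a reachable GMS URL:
    Copy code
    # Sanity check (URL and token are placeholders): the emitter needs a full URL
    # with a scheme, e.g. "http://datahub-gms:8080", not a bare host or an empty string.
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    emitter = DatahubRestEmitter(gms_server="http://datahub-gms:8080", token="<token>")
    emitter.test_connection()  # raises if the server is unreachable or the URL is malformed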
  • l

    lively-dusk-19162

    11/03/2022, 7:10 PM
    What would be the best parser for extracting column-level lineage from SQL queries (see the sketch below)?
    g
    • 2
    • 1
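    One commonly used option is the sqllineage library; recent versions expose column-level lineage directly. A small sketch, assuming sqllineage >= 1.3 is installed (treat the exact API as something to verify for your version):
    Copy code
    # Sketch using the sqllineage package (assumed installed); get_column_lineage()
    # is available from sqllineage 1.3 onwards and returns source->target column paths.
    from sqllineage.runner import LineageRunner

    sql = """
    INSERT INTO analytics.orders_enriched
    SELECT o.order_id, c.customer_name
    FROM raw.orders o
    JOIN raw.customers c ON o.customer_id = c.customer_id
    """

    runner = LineageRunner(sql)
    for lineage_path in runner.get_column_lineage():
        print(" -> ".join(str(col) for col in lineage_path))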
  • e

    eager-lifeguard-22029

    11/03/2022, 11:45 PM
    Is there a way to delete metadata from DataHub via the Python SDK?
    g
    e
    • 3
    • 7
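    For reference, hard deletes are usually done with the CLI (datahub delete --urn ...), but a soft delete can be done from Python by emitting a Status aspect with removed set to true. A minimal sketch with a made-up dataset URN and server address:
    Copy code
    # Minimal soft-delete sketch: mark an entity as removed by emitting a Status
    # aspect. The URN and server below are placeholders.
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, StatusClass

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,db.my_table,PROD)"
    emitter.emit_mcp(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=dataset_urn,
            aspectName="status",
            aspect=StatusClass(removed=True),  # soft delete; use the CLI for hard deletes
        )
    )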
  • l

    lively-dusk-19162

    11/04/2022, 2:57 AM
    Is there any API to emit column level lineage to datahub?
    m
    g
    q
    • 4
    • 10
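    Column-level (fine-grained) lineage between datasets can be emitted as part of the upstreamLineage aspect; the shape below follows the fine-grained lineage sample in the DataHub repo, with placeholder URNs, field names, and server address:
    Copy code
    # Sketch of emitting column-level lineage via the upstreamLineage aspect.
    from datahub.emitter.mce_builder import make_dataset_urn, make_schema_field_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.com.linkedin.pegasus2avro.dataset import (
        DatasetLineageType,
        FineGrainedLineage,
        FineGrainedLineageDownstreamType,
        FineGrainedLineageUpstreamType,
        Upstream,
        UpstreamLineage,
    )
    from datahub.metadata.schema_classes import ChangeTypeClass

    upstream_urn = make_dataset_urn("postgres", "public.orders")
    downstream_urn = make_dataset_urn("postgres", "analytics.orders_summary")

    lineage = UpstreamLineage(
        upstreams=[Upstream(dataset=upstream_urn, type=DatasetLineageType.TRANSFORMED)],
        fineGrainedLineages=[
            FineGrainedLineage(
                upstreamType=FineGrainedLineageUpstreamType.FIELD_SET,
                upstreams=[make_schema_field_urn(upstream_urn, "order_id")],
                downstreamType=FineGrainedLineageDownstreamType.FIELD,
                downstreams=[make_schema_field_urn(downstream_urn, "order_id")],
            )
        ],
    )

    DatahubRestEmitter(gms_server="http://localhost:8080").emit_mcp(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=downstream_urn,
            aspectName="upstreamLineage",
            aspect=lineage,
        )
    )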
  • m

    microscopic-mechanic-13766

    11/04/2022, 8:44 AM
    Good Friday! Yesterday (thread) I did some testing related to both Hive and PostgreSQL profiling, since the min, max, ... values weren't being obtained. I tracked down why those values were not being computed (at least for PostgreSQL): it was due to the column's cardinality. If a column had no null values or duplicate values, the cardinality assigned to it didn't trigger the processes that calculate the min, max, ... values, which I can't yet understand. Could someone please explain why the cardinality of a column determines whether these values are calculated? In my opinion they should be calculated for all columns except those whose null count is so large that, for example, calculating the mean loses meaning (as it won't be a relevant value). Thanks in advance!
    d
    g
    • 3
    • 12
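    Not official profiler code, but roughly what is happening: each column is bucketed into a cardinality level based on its unique-value ratio, and the distribution stats (min/max/mean/median) are only computed for some of those buckets, so constant or fully unique columns can come back without them. A simplified, hypothetical illustration of that gating:
    Copy code
    # Simplified, hypothetical illustration of cardinality-gated profiling;
    # the real buckets and thresholds live in the GE-based profiler, not here.
    def cardinality_bucket(unique_count: int, row_count: int) -> str:
        if row_count == 0 or unique_count <= 1:
            return "ONE"
        ratio = unique_count / row_count
        if ratio >= 0.99:
            return "UNIQUE"  # e.g. key-like columns with no duplicates
        if unique_count < 20:
            return "FEW"
        return "MANY"


    def computes_distribution_stats(bucket: str) -> bool:
        # Only some buckets trigger min/max/mean/median; in the version observed
        # above, UNIQUE columns apparently fall outside this set.
        return bucket in {"MANY", "VERY_MANY"}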
  • l

    limited-forest-73733

    11/04/2022, 11:12 AM
    Hey team, I am not able to see Snowflake views. Can anyone please help me? This is the recipe I am using.
    h
    • 2
    • 3
  • f

    few-carpenter-93837

    11/04/2022, 12:17 PM
    Hey guys, if I'm using the CLI + a recipe to ingest data into DataHub:
    Copy code
    datahub ingest -c datahub-vertica-lineage-ingestion.dhub.yaml
    then how am I supposed to disable telemetry, as mentioned here? https://datahubproject.io/docs/cli/#user-guide
    b
    • 2
    • 1
  • f

    few-carpenter-93837

    11/04/2022, 12:19 PM
    Adding the two commands one after the other just gives an error.
  • f

    few-carpenter-93837

    11/04/2022, 1:00 PM
    The lineage aspect is a bit hard to understand. Will data sent in by lineage_emitter_dataset_finegrained overwrite data sent in by lineage_emitter_rest, and vice versa? What about the note in https://datahubproject.io/docs/lineage/sample_code that emitting any aspect associated with an entity completely overwrites the previous value — what does this mean? Let's say I have the following lineage: Table1 -> View1, Atr1 -> Atr1. If I now add another table to the View1 relation, do I need to send in info about both Table1 and Table2, since if I only send Table2's info the previous info about Table1 is overwritten? (See the sketch below.)
    a
    • 2
    • 1
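    On the overwrite question: yes, the upstreamLineage aspect is replaced wholesale on every emit, so sending only Table2 would drop Table1. The usual pattern is read-modify-write: fetch the current aspect, append the new upstream, and emit the merged value. A sketch of that pattern with placeholder URNs (the exact graph-client method name may differ between CLI versions):
    Copy code
    # Read-modify-write sketch for upstreamLineage. URNs and server are placeholders;
    # check your CLI version for the exact "get aspect" method name.
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        DatasetLineageTypeClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

    view_urn = "urn:li:dataset:(urn:li:dataPlatform:vertica,db.View1,PROD)"
    table2_urn = "urn:li:dataset:(urn:li:dataPlatform:vertica,db.Table2,PROD)"

    # Fetch whatever lineage is already there (may be None on the first write).
    existing = graph.get_aspect_v2(
        entity_urn=view_urn,
        aspect="upstreamLineage",
        aspect_type=UpstreamLineageClass,
    ) or UpstreamLineageClass(upstreams=[])

    # Append Table2 while keeping Table1 and anything else already recorded.
    existing.upstreams.append(
        UpstreamClass(dataset=table2_urn, type=DatasetLineageTypeClass.TRANSFORMED)
    )

    graph.emit_mcp(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=view_urn,
            aspectName="upstreamLineage",
            aspect=existing,
        )
    )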
  • l

    limited-forest-73733

    11/04/2022, 1:59 PM
    Hey team, I am able to enable Snowflake table profiling, but something is wrong: I am unable to see the null count. Do we need to specify any field for this?
    h
    • 2
    • 15
  • m

    most-monkey-10812

    11/04/2022, 2:03 PM
    Hi! I am trying to ingest column-level lineage info as the dataJobInputOutput aspect of the datajob entity, but I don't see anything in the UI. There is also the possibility of ingesting this info as the upstreamLineage aspect of a dataset. Do these two approaches complement each other, or are they mutually exclusive? Is column-level lineage information for a datajob (dataset -> datajob -> dataset) somehow reflected in the lineage visualisation UI or in the column-level impact analysis screen in versions 0.9.0 or 0.9.1?
    a
    b
    b
    • 4
    • 8
  • d

    dazzling-park-96517

    11/04/2022, 3:14 PM
    Hi all, I'm struggling with a Druid recipe ingestion. My Druid app uses HTTPS, and when I submit the recipe, the error says that the port is null. My host_port is:
    host_port: https://my-secured-Druid-app:443
    Can somebody share a recipe for a Druid connection? Maybe some extra sqlalchemy configuration is necessary? Thanks in advance.
    a
    g
    • 3
    • 5
  • r

    ripe-alarm-85320

    11/04/2022, 5:22 PM
    Has anyone built an ingestion source for Domo (a BI tool), or is there documentation I can turn to and an estimate of the complexity/effort of building one? (See the skeleton below.)
    a
    • 2
    • 2
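    There is no Domo source in the project as far as I know, so it would be a custom source. Most of the effort is in the Domo API calls; the DataHub side is fairly mechanical: a config class, a get_workunits() generator that turns Domo dashboards/cards into metadata change proposals, and a report. A rough skeleton with everything Domo-specific left as placeholders:
    Copy code
    # Rough skeleton of a custom ingestion source (all Domo specifics are placeholders).
    from dataclasses import dataclass, field
    from typing import Iterable

    from datahub.configuration.common import ConfigModel
    from datahub.ingestion.api.common import PipelineContext
    from datahub.ingestion.api.source import Source, SourceReport
    from datahub.ingestion.api.workunit import MetadataWorkUnit


    class DomoSourceConfig(ConfigModel):
        client_id: str
        client_secret: str


    @dataclass
    class DomoSource(Source):
        config: DomoSourceConfig
        report: SourceReport = field(default_factory=SourceReport)

        @classmethod
        def create(cls, config_dict: dict, ctx: PipelineContext) -> "DomoSource":
            return cls(ctx=ctx, config=DomoSourceConfig.parse_obj(config_dict))

        def get_workunits(self) -> Iterable[MetadataWorkUnit]:
            # Call the Domo API here and wrap each dashboard/card/dataset in a
            # MetadataChangeProposalWrapper inside a MetadataWorkUnit.
            return []

        def get_report(self) -> SourceReport:
            return self.report

        def close(self) -> None:
            pass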
  • q

    quiet-school-18370

    11/04/2022, 10:05 PM
    Hi team, I am integrating DataHub with LookML. The following is our recipe.dhub.yaml file:
    Copy code
    sink:
        type: datahub-rest
        config:
          server: 'https://datahub.dev.dap.XXXXXX.com:8080'
          token : "XXXXXX"
    source:
        type: lookml
        config:
           github_info:
              repo: 'XXXX'  # repo address where the deploy key is added
              #          deploy_key_file: <file_address>
           api:
                base_url: 'https://dev-looker.XXXXXXX.com'
                client_secret: 'XXXXXXXXX'
                client_id: XXXXXXXX
           base_folder: /
    pipeline_name: XXXXXXXXX
    but when I run the
    datahub ingest -c recipe.dhub.yaml
    command, I receive the following error:
    Copy code
    raise ConfigurationError(
    
    ConfigurationError: Failed to initialize Looker client. Please check your configuration.
    m
    a
    +2
    • 5
    • 10
  • q

    quiet-school-18370

    11/04/2022, 10:06 PM
    Can anyone help me resolve this error?
  • g

    gifted-rocket-7960

    11/07/2022, 5:41 AM
    Hi team, I have created a pull-model ingestion from Redshift to DataHub with the include_copy_lineage: true parameter. The lineage is created with all the part files in the folder instead of the folder name.
    h
    • 2
    • 2
  • g

    gifted-rocket-7960

    11/07/2022, 5:42 AM
    Is there a way we can display only the folder?