# ingestion
  • a

    alert-fall-82501

    11/28/2022, 10:23 AM
    [2022-11-28 15:48:50,975] INFO     {datahub.cli.ingest_cli:165} - DataHub CLI version: 0.9.2.4
    [2022-11-28 15:48:51,351] ERROR    {datahub.entrypoints:206} - Command failed: while scanning for the next token
    found character '\t' that cannot start any token
      in "<file>", line 13, column 32
    Traceback (most recent call last
        return loader.get_single_data()
      File "/usr/lib/python3/dist-packages/yaml/constructor.py", line 49, in get_single_data
        node = self.get_single_node()
      File "/usr/lib/python3/dist-packages/yaml/composer.py", line 36, in get_single_node
        document = self.compose_document()
      File "/usr/lib/python3/dist-packages/yaml/composer.py", line 55, in compose_document
        node = self.compose_node(None, None)
      File "/usr/lib/python3/dist-packages/yaml/composer.py", line 84, in compose_node
        node = self.compose_mapping_node(anchor)
      File "/usr/lib/python3/dist-packages/yaml/composer.py", line 133, in compose_mapping_node
        item_value = self.compose_node(node, item_key)
      File "/usr/lib/python3/dist-packages/yaml/composer.py", line 84, in compose_node
        node = self.compose_mapping_node(anchor)
      File "/usr/lib/python3/dist-packages/yaml/composer.py", line 133, in compose_mapping_node
        item_value = self.compose_node(node, item_key)
      File "/usr/lib/python3/dist-packages/yaml/composer.py", line 84, in compose_node
        node = self.compose_mapping_node(anchor)
      File "/usr/lib/python3/dist-packages/yaml/composer.py", line 127, in compose_mapping_node
        while not self.check_event(MappingEndEvent):
      File "/usr/lib/python3/dist-packages/yaml/parser.py", line 98, in check_event
        self.current_event = self.state()
      File "/usr/lib/python3/dist-packages/yaml/parser.py", line 428, in parse_block_mapping_key
        if self.check_token(KeyToken):
      File "/usr/lib/python3/dist-packages/yaml/scanner.py", line 116, in check_token
        self.fetch_more_tokens()
      File "/usr/lib/python3/dist-packages/yaml/scanner.py", line 258, in fetch_more_tokens
        raise ScannerError("while scanning for the next token", None,
    yaml.scanner.ScannerError: while scanning for the next token
    found character '\t' that cannot start any token
      in "<file>", line 13, column 32
  • a

    alert-fall-82501

    11/28/2022, 10:25 AM
    source:
      type: redshift
      config:
        # Coordinates
        host_port: xxxxxxxxxxxx
        database: xxx
        database_alias: xx
        # Credentials
        username: xxxx
        password: xxxxxxxx
        include_views: True # whether to include views, defaults to True
        include_tables: True # whether to include views, defaults to True
        include_table_lineage: True	
        schema_pattern:
          allow: ['rawdata']
    sink:
      type: "datahub-rest"
      config:
        server: "<http://localhost:8080>"
  • a

    alert-fall-82501

    11/28/2022, 11:45 AM
    Hi Team - I am ingesting metadata from BQ into DataHub, including table-level lineage, but I'm having some issues with validation. Can anybody help me with this?
  • a

    alert-fall-82501

    11/28/2022, 11:46 AM
    /usr/lib/python3/dist-packages/paramiko/transport.py:219: CryptographyDeprecationWarning: Blowfish has been deprecated
      "class": algorithms.Blowfish,
    [2022-11-28 17:12:43,199] INFO     {datahub.cli.ingest_cli:165} - DataHub CLI version: 0.9.2.4
    [2022-11-28 17:12:43,386] INFO     {datahub.ingestion.run.pipeline:174} - Sink configured successfully. DataHubRestEmitter: configured to talk to http://localhost:8080
    /home/kiranto@cybage.com/.local/lib/python3.8/site-packages/pkg_resources/__init__.py:123: PkgResourcesDeprecationWarning: 0.1.36ubuntu1 is an invalid version and will not be supported in a future release
      warnings.warn(
    /home/kiranto@cybage.com/.local/lib/python3.8/site-packages/pkg_resources/__init__.py:123: PkgResourcesDeprecationWarning: 0.23ubuntu1 is an invalid version and will not be supported in a future release
      warnings.warn(
    /home/kiranto@cybage.com/.local/lib/python3.8/site-packages/pkg_resources/__init__.py:123: PkgResourcesDeprecationWarning: 1.13.1-unknown is an invalid version and will not be supported in a future release
      warnings.warn(
    [2022-11-28 17:12:58,883] WARNING  {root:99} - project_id_pattern is not set but project_id is set, setting project_id as project_id_pattern. project_id will be deprecated, please use project_id_pattern instead.
    [2022-11-28 17:12:59,244] ERROR    {datahub.entrypoints:182} - Failed to configure source (bigquery): 1 validation error for BigQueryV2Config
    credential -> include_table_lineage
      extra fields not permitted (type=value_error.extra)
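    The validation error above ("credential -> include_table_lineage: extra fields not permitted") suggests include_table_lineage ended up indented under the credential block; it belongs directly under config. A hedged sketch of the relevant part of a BigQuery recipe (project and credential values are placeholders):
    source:
      type: bigquery
      config:
        project_id: my-project
        include_table_lineage: true   # top-level config option, not part of credential
        credential:
          project_id: my-project
          private_key_id: "xxxx"
          private_key: "xxxx"
          client_email: "sa@my-project.iam.gserviceaccount.com"
          client_id: "xxxx"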
  • b

    brave-pencil-21289

    11/28/2022, 12:16 PM
    When ingesting a schema using a platform instance, two folders are being created, one in uppercase and one in lowercase, like "DELTA" & "delta", and the datasets are getting distributed across the two folders. How can we avoid this so that all datasets end up in one folder? This is happening for me with Oracle ingestion, and I am using the database name as the platform instance.
  • l

    lively-dusk-19162

    11/28/2022, 5:57 PM
    Hello team, when we ingest fine-grained lineage into DataHub using the Python SDK, what should the entity URN in the MetadataChangeProposalWrapper be? Is it the downstream table, or all of the tables?
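    For reference, in the standard Python SDK examples the fine-grained lineage sits inside the downstream dataset's upstreamLineage aspect, so the wrapper is keyed by the downstream table's URN. A minimal sketch (platform, table, and column names are placeholders):
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.com.linkedin.pegasus2avro.dataset import (
        DatasetLineageType,
        FineGrainedLineage,
        FineGrainedLineageDownstreamType,
        FineGrainedLineageUpstreamType,
        Upstream,
        UpstreamLineage,
    )

    upstream_urn = builder.make_dataset_urn("postgres", "db.schema.src_table")
    downstream_urn = builder.make_dataset_urn("postgres", "db.schema.dst_table")

    lineage = UpstreamLineage(
        # table-level upstreams
        upstreams=[Upstream(dataset=upstream_urn, type=DatasetLineageType.TRANSFORMED)],
        # column-level (fine-grained) mappings
        fineGrainedLineages=[
            FineGrainedLineage(
                upstreamType=FineGrainedLineageUpstreamType.FIELD_SET,
                upstreams=[builder.make_schema_field_urn(upstream_urn, "src_col")],
                downstreamType=FineGrainedLineageDownstreamType.FIELD,
                downstreams=[builder.make_schema_field_urn(downstream_urn, "dst_col")],
            )
        ],
    )

    # entityUrn is the *downstream* dataset, since the aspect lives on it.
    mcp = MetadataChangeProposalWrapper(entityUrn=downstream_urn, aspect=lineage)
    DatahubRestEmitter("http://localhost:8080").emit_mcp(mcp)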
  • l

    lively-dusk-19162

    11/28/2022, 5:57 PM
    Can anyone help me on this?
  • b

    breezy-controller-54597

    11/29/2022, 2:25 AM
    Hello. I want to run DataHub ingestion from Airflow deployed on Kubernetes; how should I manage the recipe file? With Argo Workflows I define the recipe as a ConfigMap and reference it from the job container, but what is the best way to do this for Airflow? Thank you.
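    One option (a sketch only, not the only pattern) is to skip a recipe file altogether and define the recipe as a Python dict inside the DAG, running it with the ingestion Pipeline class; the hosts and values below are placeholders and could just as well come from Airflow Variables/Connections or a mounted ConfigMap, as with Argo:
    from datetime import datetime

    from airflow.decorators import dag, task
    from datahub.ingestion.run.pipeline import Pipeline


    @dag(start_date=datetime(2022, 11, 1), schedule_interval="@daily", catchup=False)
    def datahub_ingestion():
        @task
        def run_recipe():
            pipeline = Pipeline.create(
                {
                    "source": {
                        "type": "redshift",
                        "config": {"host_port": "host:5439", "database": "db"},
                    },
                    "sink": {
                        "type": "datahub-rest",
                        "config": {"server": "http://datahub-gms:8080"},
                    },
                }
            )
            pipeline.run()
            pipeline.raise_from_status()  # fail the task if ingestion reported errors

        run_recipe()


    datahub_ingestion()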
  • a

    average-baker-96343

    11/29/2022, 3:14 AM
    Hello. We want to connect Trino. With the same configuration the CLI succeeds, but the UI fails. The following is the error message from the UI run:
  • a

    average-baker-96343

    11/29/2022, 3:15 AM
    '(trino.exceptions.FailedToObtainAddedPrepareHeader) \\n[SQL: SELECT \\"table_name\\"\\nFROM '
                          '\\"information_schema\\".\\"views\\"\\nWHERE \\"table_schema\\" = ?]\\n[parameters: (\'trino_cd_test\',)]\\n(Background on '
                          'this error at: <http://sqlalche.me/e/13/dbapi>)"], "xxl_job_2.1.0": ["Tables error: '
                          '(trino.exceptions.FailedToObtainAddedPrepareHeader) \\n[SQL: SELECT \\"table_name\\"\\nFROM '
                          '\\"information_schema\\".\\"tables\\"\\nWHERE \\"table_schema\\" = ? and \\"table_type\\" != \'VIEW\']\\n[parameters: '
                          '(\'xxl_job_2.1.0\',)]\\n(Background on this error at: <http://sqlalche.me/e/13/dbapi>)", "Views error: '
                          '(trino.exceptions.FailedToObtainAddedPrepareHeader) \\n[SQL: SELECT \\"table_name\\"\\nFROM '
                          '\\"information_schema\\".\\"views\\"\\nWHERE \\"table_schema\\" = ?]\\n[parameters: (\'xxl_job_2.1.0\',)]\\n(Background on '
                          'this error at: <http://sqlalche.me/e/13/dbapi>)"]}, "tables_scanned": "0", "views_scanned": "0", "entities_profiled": "0", '
                          '"filtered": [], "soft_deleted_stale_entities": [], "start_time": "2022-11-29 03:10:56.524470", "running_time_in_seconds": '
                          '"1"}}, "sink": {"type": "datahub-rest", "report": {"total_records_written": "55", "records_written_per_second": "11", '
                          '"warnings": [], "failures": [], "start_time": "2022-11-29 03:10:53.114169", "current_time": "2022-11-29 03:10:57.986499", '
                          '"total_duration_in_seconds": "4.87", "gms_version": "v0.9.2", "pending_requests": "0"}}}'}
    Execution finished with errors.
  • l

    loud-journalist-47725

    11/29/2022, 6:12 AM
    Hello DataHub community, I'm struggling with re-ingesting glossary nodes/terms I've deleted. I ingested about 17,000 glossary nodes and terms into DataHub, then deleted them all, and I'm now in the process of re-ingesting them. The ingestion works great, except that I can't see them in the DataHub UI, and when I use the DataHub CLI to check for information, this appears (see thread). From my understanding the data exists, but the UI assumes it should not be shown because it's in a "removed" state. Has anyone else had the chance to change the status 'removed': 'true' to 'false'?
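    For what it's worth, one way to flip that soft-delete flag is to emit a Status aspect with removed=False for each affected URN; a rough sketch using the Python REST emitter (the term URN and server are placeholders, and you would loop over all re-ingested URNs):
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import StatusClass

    emitter = DatahubRestEmitter("http://localhost:8080")

    term_urn = "urn:li:glossaryTerm:example.term"  # placeholder
    emitter.emit_mcp(
        MetadataChangeProposalWrapper(entityUrn=term_urn, aspect=StatusClass(removed=False))
    )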
  • a

    ancient-policeman-73437

    11/29/2022, 10:22 AM
    Why don't I get any help?
  • a

    aloof-iron-76856

    11/29/2022, 6:37 PM
    Hello, community. I am ingesting metadata from Kafka, and what I get are topics, topic names, the schema for the value, and the schema for the key with field descriptions. But Documentation and Properties are empty for every topic. What could I be doing wrong?
  • f

    freezing-cat-19219

    11/30/2022, 1:27 AM
    Hi DataHub. I already have Athena data ingestion. I want to connect an additional MariaDB to the S3 entity in this lineage. Can you help me?
  • l

    late-ability-59580

    11/30/2022, 7:34 AM
    Hi all, does ingestion account for Snowflake shares? ❄️ We have two accounts; one shares some databases with the other. When ingesting metadata from the "other", databases from the "one" are overwritten by their shared counterparts. Is there a way to differentiate between Snowflake metadata from different accounts? Or to identify a shared database?
  • f

    future-iron-16086

    11/30/2022, 12:35 PM
    Hi all. I'm using OpenAPI to set tags on a table. I want to set two tags on a single table, but I haven't been able to get it to work; only one of them is being set.
    {
       "entity":{
          "value":{
             "com.linkedin.metadata.snapshot.DatasetSnapshot":{
                "urn":"urn:li:dataset:(urn:li:dataPlatform:bigquery,project.schema.table,QA)",
                "aspects":[
                   {
                      "com.linkedin.schema.EditableSchemaMetadata": { 
                         "editableSchemaFieldInfo":[
                            {
                               "fieldPath":"IND_STATUS",
                               "globalTags": {
                                  "tags":[
                                     {
                                         "tag":"urn:li:tag:Engineering_03",
                                         "tag":"urn:li:tag:Felipe"
                                     }
                                  ]
                               },
                               
                            }
                         ]
                      }
                   }
                ]
             }
          }
       }
    }
    Is it possible to do this?
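    In the GlobalTags aspect, "tags" is an array of tag associations, each with its own single "tag" key, so the two URNs most likely need to be separate array entries rather than two "tag" keys in one object. A sketch of just that fragment:
    "globalTags": {
       "tags": [
          { "tag": "urn:li:tag:Engineering_03" },
          { "tag": "urn:li:tag:Felipe" }
       ]
    }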
  • c

    calm-psychiatrist-98577

    11/30/2022, 9:24 PM
    Hi everyone, I don't know if this is the right channel, but let's try. I'm just curious whether DataHub is still using Kafka Streams as documented here: https://github.com/datahub-project/datahub/blob/55357783f330950408e4624b3f1421594c98e3bc/metadata-jobs/README.md I'm checking the code of the MAE and MCE jobs, and they look like normal Spring Kafka consumers: https://github.com/datahub-project/datahub/blob/master/metadata-jobs/mae-consumer/[…]va/com/linkedin/metadata/kafka/DataHubUsageEventsProcessor.java
  • s

    square-solstice-69079

    12/01/2022, 8:24 AM
    Hello, I was wondering if you had any updates about supporting Spark/Databricks for Great Expectations? It would be a game changer for us to visualize the tests in DataHub.
  • a

    ancient-jordan-41401

    12/01/2022, 10:03 AM
    Hello, I'm trying to ingest data through the CLI with secrets, but I'm getting the error below. The ingestion works if I use the secret values plainly.
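    For context, when running through the CLI a recipe normally pulls secrets from environment variables via ${...} substitution rather than from UI-managed Secrets; a sketch (the source type and variable names are placeholders):
    source:
      type: snowflake
      config:
        username: ${SNOWFLAKE_USER}      # export SNOWFLAKE_USER before running `datahub ingest`
        password: ${SNOWFLAKE_PASSWORD}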
  • a

    ancient-apartment-23316

    12/01/2022, 1:33 PM
    Hi, could anyone help me? I'm trying to run an ingestion using the CLI:
    datahub ingest -c myrecipe.dhub.yaml
    and I'm getting a lot of errors: Warning - Read timed out, Error - Failed to fetch the large result set
    -[2022-12-01 15:23:19,646] WARNING  {snowflake.connector.vendored.urllib3.connectionpool:780} - Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='myhostname.s3.amazonaws.com', port=443): Read timed out. (read timeout=7)")': /5bdk-s-v2st8093/results/01a8ad7b-0402-c842-0021-fd031e5452d2_0/main/data_0_4_10?x-amz-server-side-encryption-customer-algorithm=AES256&response-content-encoding=gzip&AWSAccessKeyId=qweqwe&Expires=1669922461&Signature=qweqwe
    /[2022-12-01 15:23:19,843] ERROR    {snowflake.connector.result_batch:342} - Failed to fetch the large result set batch data_0_4_7 for the 1 th time, backing off for 3s for the reason: 'HTTPSConnectionPool(host='myhostname.s3.amazonaws.com', port=443): Read timed out.'
    Traceback (most recent call last):
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/snowflake/connector/vendored/urllib3/contrib/pyopenssl.py", line 319, in recv_into
        return self.connection.recv_into(*args, **kwargs)
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/OpenSSL/SSL.py", line 1800, in recv_into
        self._raise_ssl_error(self._ssl, result)
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/OpenSSL/SSL.py", line 1607, in _raise_ssl_error
        raise WantReadError()
    OpenSSL.SSL.WantReadError
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/snowflake/connector/vendored/urllib3/contrib/pyopenssl.py", line 319, in recv_into
        return self.connection.recv_into(*args, **kwargs)
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/OpenSSL/SSL.py", line 1800, in recv_into
        self._raise_ssl_error(self._ssl, result)
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/OpenSSL/SSL.py", line 1607, in _raise_ssl_error
        raise WantReadError()
    OpenSSL.SSL.WantReadError
  • b

    bumpy-pharmacist-66525

    12/01/2022, 2:43 PM
    Hello, I was wondering if there is any way of supplying the superset source with an OAuth token for Superset? At the moment it seems like you can only supply a username and password for a Superset account, but being able to supply an OAuth token would be a great feature. Unless I am misunderstanding how it works, once you enable OAuth in Superset there is no longer a way to log in using a local username and password; you must go the OAuth route. This means that the superset source can no longer work as soon as you enable OAuth (on the Superset end).
  • q

    quiet-wolf-56299

    12/01/2022, 4:26 PM
    How would one ingest view lineage from an Oracle source, as an example? I assume this is dataset-to-dataset, but it's not a useful thing to have to manually create the lineage in a Python script and emit it. (This is how I understand what the lineage scripts do: you are responsible for manually mapping the lineage, and the ingestion just adds that manual lineage to the object in DataHub?)
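    For context on those lineage scripts: yes, the emitter route takes explicitly listed upstream/downstream URNs. A minimal sketch (dataset names and server are placeholders):
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    # Upstream view(s) -> downstream table, both identified by dataset URNs.
    lineage_mce = builder.make_lineage_mce(
        [builder.make_dataset_urn("oracle", "db.schema.source_view")],  # upstreams
        builder.make_dataset_urn("oracle", "db.schema.target_table"),   # downstream
    )
    DatahubRestEmitter("http://localhost:8080").emit_mce(lineage_mce)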
  • a

    ancient-apartment-23316

    12/01/2022, 8:03 PM
    Hello! What does this invalid-dataset-pattern warning mean? I don't see this data loaded into DataHub; I made a recipe for only 2 tables, but they didn't make it into DataHub.
    'warnings': {'invalid-dataset-pattern': ["Found ['MY_QQQ_PROD', 'WWW'] of type Schema", "Found ['MY_QQQ_PROD', 'WWW'] of type Schema", "Found ['MY_QQQ_PROD', 'WWW'] of type Schema", "Found ['MY_QQQ_PROD', 'WWW'] of type Schema", "Found ['MY_QQQ_DEV', 'JJ_KK_WW'] of type Schema", "Found ['MY_QQQ_PROD', 'WWW'] of type Schema", "Found ['MY_QQQ_DEV', 'JJ_KK_WW'] of type Schema", "Found ['MY_QQQ_DEV', 'QWE'] of type Schema", "Found ['MY_QQQ_DEV', 'KKK'] of type Schema", "Found ['MY_QQQ_DEV', 'QWE'] of type Schema", '... sampled of 12 total elements']},
    
    ...
    
    'total_records_written': '9',
    
    ...
    
     Pipeline finished with at least 12 warnings; produced 9 events in 4 minutes and 37.43 seconds.
  • f

    future-iron-16086

    12/01/2022, 8:14 PM
    Hello. How can I set an owner on a container through the API?
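    One way (a sketch, assuming the Python REST emitter; the GraphQL addOwner mutation is another route) is to emit an ownership aspect against the container URN; the URN, user, and server below are placeholders:
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import OwnerClass, OwnershipClass, OwnershipTypeClass

    container_urn = "urn:li:container:abc123"  # placeholder
    ownership = OwnershipClass(
        owners=[OwnerClass(owner="urn:li:corpuser:jdoe", type=OwnershipTypeClass.DATAOWNER)]
    )
    DatahubRestEmitter("http://localhost:8080").emit_mcp(
        MetadataChangeProposalWrapper(entityUrn=container_urn, aspect=ownership)
    )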
  • r

    rhythmic-stone-77840

    12/01/2022, 10:57 PM
    Hi! I'm currently trying to load 1 table from a BQ project, but it looks like it's also trying to load GCP logs, which I don't want it to do. How do I turn that off? Recipe setup in 🧵
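    If those log reads come from lineage/usage extraction (the BigQuery source derives both from GCP audit logs), one hedged guess is to disable them in the recipe; project and table names below are placeholders:
    source:
      type: bigquery
      config:
        project_id: my-project
        include_table_lineage: false      # lineage is built from audit logs
        include_usage_statistics: false   # usage is also built from audit logs
        table_pattern:
          allow:
            - "my-project.my_dataset.my_table"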
  • b

    blue-fall-10754

    12/01/2022, 11:04 PM
    Hey folks, (not sure if this is the right place for such a question, so please let me know if that's the case.) I'm trying out DataHub ingestion where I work. Essentially, my infra colleagues have set up the services on k8s and are relying on me to start ingestion. I have some metadata that I want to ingest programmatically (Python); however, when using the emitter datahub.emitter.rest_emitter.DatahubRestEmitter against the GMS endpoint, I am met with a 401 Unauthorized:
    ConfigurationError: Unable to connect to https://{MY_COMPANIES_DATAHUB_HOST}/api/gms/config with status_code: 401. Maybe you need to set up authentication? Please check your configuration and make sure you are talking to the DataHub GMS (usually <datahub-gms-host>:8080) or Frontend GMS API (usually <frontend>:9002/api/gms).
    I know the team has not opted in to authentication (I can also confirm this because the root user cannot create access tokens), so is it expected behavior to run into this issue when hitting GMS with the REST emitter API? What adds to the confusion is that when I hit the same GMS link in my browser, I can see the JSON config returned (so that rules out hitting the wrong URL for GMS). Also, a teammate who ingests smaller datasets using file ingestion via the UI does not hit this issue.
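    For reference, the REST emitter can also carry a personal access token and can be pointed straight at GMS rather than at the frontend's /api/gms proxy; a sketch with placeholder values (not a claim about what this particular deployment needs):
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    emitter = DatahubRestEmitter(
        gms_server="http://datahub-gms.example.com:8080",  # placeholder GMS address
        token="<personal-access-token>",                   # only needed if auth is enabled
    )
    emitter.test_connection()  # raises if the server is unreachable or rejects the call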
  • s

    square-solstice-69079

    12/02/2022, 8:39 AM
    Hello, any idea why the CLI version for a custom recipe in the UI is showing an older version? I think that is why the ingestion is not working (Delta Lake).
    '[2022-12-02 08:23:27,442] INFO     {datahub.cli.ingest_cli:177} - DataHub CLI version: 0.8.43.5\n'
               '[2022-12-02 08:23:27,467] INFO     {datahub.ingestion.run.pipeline:163} - Sink configured successfully. DataHubRestEmitter: configured '
    I'm on version 0.9.2, and I only use quickstart to upgrade, but I apply a docker-compose file after an upgrade to add some variables to the containers for OIDC.
  • l

    limited-forest-73733

    12/02/2022, 9:48 AM
    Hey team, I checked the new release datahub-0.2.116: all the images were upgraded to 0.9.3, but acryldata/datahub-ingestion:v0.9.3 doesn't exist on Docker Hub.
  • l

    lemon-cat-72045

    12/05/2022, 5:51 AM
    Hi all, how can I change the default checkpoint provider's commit policy? Thanks!
    'DatahubIngestionCheckpointingProvider. Commit policy = CommitPolicy.ON_NO_ERRORS, has_errors=True, has_warnings=False\n'
  • k

    kind-sunset-55628

    12/05/2022, 6:32 AM
    Hi all, after upgrading from version 0.8.43 to 0.9.2, Oracle ingestion is not working, giving this error:
    (cx_Oracle.DatabaseError) ORA-00942: table or view does not exist\n'
               '[SQL: SELECT username FROM dba_users ORDER BY username]