# ingestion
• nutritious-bird-77396 (04/16/2022, 5:49 PM)
@early-lamp-41924 Thanks for that hint. When debugging, I found that when trying to get the next version of the aspect here, the result from the database wasn't just the max(version) (which should have been 1) but all of the versions for that aspect, i.e. 0 and 1:
```
select aspect, max(version) from metadata_aspect_v2 where urn='urn:li:corpuser:<redacted>' and aspect='groupMembership' group by urn, aspect, version
```
So, picking the first returned value (0), it computes the next version as 1 again, causing the primary key error. The Postgres databases are created using this script. I suspect something in our Postgres setup, but it's not clear where the issue could be. Let me know if you have any thoughts.
• nutritious-bird-77396 (04/18/2022, 4:22 PM)
@early-lamp-41924 I added more debug logs to GMS here. What I see is:
• Depending on which version comes last from the DB, that one overwrites the `version` in `dbResults`.
• In my case version 1 comes first and then version 0, so the list ends up with `groupMembership` at version `0` as the latest.
• So when incrementing for the next version, it gets 1 again, causing the primary key error.
Here are debug logs from the environment to prove my finding:
```
16:01:26 [qtp1025799482-17] INFO  c.l.m.entity.ebean.EbeanAspectDao - From DB- Urn: urn:li:corpuser:<redacted>, Aspect: corpUserInfo, currVersion: 0
16:01:26 [qtp1025799482-17] INFO  c.l.m.entity.ebean.EbeanAspectDao - From DB- Urn: urn:li:corpuser:<redacted>, Aspect: groupMembership, currVersion: 1
16:01:26 [qtp1025799482-17] INFO  c.l.m.entity.ebean.EbeanAspectDao - From DB- Urn: urn:li:corpuser:<redacted>, Aspect: groupMembership, currVersion: 0
16:01:26 [qtp1025799482-17] INFO  c.l.m.entity.ebean.EbeanAspectDao - Contains ASpect - Urn: urn:li:corpuser:<redacted>, Aspect: corpUserInfo, nextVersion: 1
16:01:26 [qtp1025799482-17] INFO  c.l.m.entity.ebean.EbeanAspectDao - Added to result - Aspect: corpUserInfo, nextVersion: 1
16:01:26 [qtp1025799482-17] INFO  c.l.m.entity.ebean.EbeanAspectDao - Contains ASpect - Urn: urn:li:corpuser:<redacted>, Aspect: groupMembership, nextVersion: 1
16:01:26 [qtp1025799482-17] INFO  c.l.m.entity.ebean.EbeanAspectDao - Added to result - Aspect: groupMembership, nextVersion: 1
16:01:26 [qtp1025799482-17] INFO  c.l.m.e.ebean.EbeanEntityService - Urn - corpUserInfo, AspectName- 1, nextVersion- {}
16:01:26 [qtp1025799482-17] INFO  c.l.m.e.ebean.EbeanEntityService - Urn - groupMembership, AspectName- 1, nextVersion- {}
16:01:26 [qtp1025799482-17] INFO  c.l.m.filter.RestliLoggingFilter - POST /entities?action=ingest - ingest - 500 - 47ms
16:01:26 [qtp1025799482-17] ERROR c.l.m.filter.RestliLoggingFilter - Rest.li error:
com.linkedin.restli.server.RestLiServiceException: com.datahub.util.exception.RetryLimitReached: Failed to add after 3 retries
```
Findings: This looks like basic functionality in DataHub, so I doubt it is specific to Postgres. Any input on this would be helpful.
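For what it's worth, the order-dependence described above is easy to demonstrate in isolation. A minimal Python sketch (hypothetical; the real DAO is Java) of a reduction that makes row order irrelevant:
```python
def next_versions(rows):
    """rows: (aspect, version) pairs as returned by the mis-grouped query."""
    latest = {}
    for aspect, version in rows:
        # Wrong: latest[aspect] = version -- the result depends on row order.
        # Right: reduce with max() so row order does not matter.
        latest[aspect] = max(version, latest.get(aspect, -1))
    return {aspect: version + 1 for aspect, version in latest.items()}

# Version 1 arrives before version 0, as in the logs above:
assert next_versions([("groupMembership", 1), ("groupMembership", 0)]) == {"groupMembership": 2}
```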
• delightful-barista-90363 (04/18/2022, 4:42 PM)
Hello, I am digging into the DataHub source code to contribute to the S3 source. I want to add the option to ingest bucket and object tags as tags in DataHub. I am having some difficulty finding where I would add the tags in the file here. It seems like the objects being yielded are `MetadataChangeEvent` and `MetadataWorkUnit`, but I am having difficulty finding the definitions of these objects and whether they are related to tags. I was wondering if anyone could give me some advice. Much appreciated in advance! Edit: I think this example https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/dataset_add_tag.py is what I was looking for
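For later readers: the linked dataset_add_tag.py example boils down to emitting a GlobalTags aspect for the dataset's URN. A minimal sketch (server address, URN, and tag name are placeholders):
```python
from datahub.emitter.mce_builder import make_dataset_urn, make_tag_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    ChangeTypeClass,
    GlobalTagsClass,
    TagAssociationClass,
)

emitter = DatahubRestEmitter("http://localhost:8080")  # placeholder GMS address
dataset_urn = make_dataset_urn(platform="s3", name="my-bucket/table1", env="PROD")

# Attach a tag (e.g. one sourced from an S3 object tag) to the dataset.
tags = GlobalTagsClass(tags=[TagAssociationClass(tag=make_tag_urn("my-s3-tag"))])
emitter.emit(
    MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=dataset_urn,
        aspectName="globalTags",
        aspect=tags,
    )
)
```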
• clean-nightfall-92007 (04/19/2022, 6:54 AM)
Hi, I encountered this problem when using the `/aspects?action=ingestProposal` interface. The string should be OK, but I get: "is not a valid string representation of bytes".
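That error usually means the aspect's value was sent as nested JSON rather than a string of serialized JSON (the GenericAspect value is bytes). A hedged sketch of the expected payload shape (URN and aspect are placeholders):
```python
import json

import requests

proposal = {
    "proposal": {
        "entityType": "dataset",
        "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:hive,db.tbl,PROD)",
        "changeType": "UPSERT",
        "aspectName": "datasetProperties",
        "aspect": {
            "contentType": "application/json",
            # The value must be a *string* containing JSON, not a nested object.
            "value": json.dumps({"description": "hello"}),
        },
    }
}
requests.post(
    "http://localhost:8080/aspects?action=ingestProposal",
    json=proposal,
    headers={"X-RestLi-Protocol-Version": "2.0.0"},
)
```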
• mysterious-nail-70388 (04/19/2022, 8:38 AM)
Hi, I encountered this error when I replaced the ES container with a local Elasticsearch instance that has a username and password.
• fast-ability-23281 (04/21/2022, 8:33 PM)
Hi! My Glue ingestion (a table based on our S3 data) just completed successfully on DataHub, but I cannot find the metadata in the DataHub console.
• early-librarian-13786 (04/22/2022, 11:31 AM)
Hello again. I have the following issue: after metadata ingestion, I found that partitions of partitioned Postgres tables are treated as separate objects. Is there any way or setting to avoid this?
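One workaround, assuming the partitions follow a predictable naming convention, is a deny pattern on the Postgres source so the child partitions are skipped. A sketch (the regex, credentials, and database are placeholders):
```python
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create({
    "source": {
        "type": "postgres",
        "config": {
            "host_port": "localhost:5432",
            "database": "mydb",
            "username": "<user>",
            "password": "<pass>",
            # Hypothetical: child partitions named like parent_p2022_04.
            "table_pattern": {"deny": [".*_p\\d{4}_\\d{2}$"]},
        },
    },
    "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
})
pipeline.run()
pipeline.raise_from_status()
```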
• nutritious-bird-77396 (04/22/2022, 1:47 PM)
I see queries being formatted and trimmed, but I don't see any hashing applied to the query values. So is there a possibility of PII column values being exposed through the Queries tab?
• cuddly-arm-8412 (04/24/2022, 6:20 AM)
hi Team... I want to debug ingestion locally, and I am running
• brash-photographer-9183 (04/26/2022, 12:05 PM)
https://github.com/datahub-project/datahub/blob/37aedfc87c4e39015bd456bf34debe5a3a[…]feb3/metadata-ingestion/src/datahub/ingestion/source/tableau.py
• prehistoric-salesclerk-23462 (04/27/2022, 1:12 PM)
@cold-hydrogen-10513 I have the same error, how did you solve it?
```
{'workunits_produced': 0,
 'workunit_ids': [],
 'warnings': {},
 'failures': {'version': ['Error: (snowflake.connector.errors.DatabaseError) 250001 (08001): Failed to connect to DB: '
                          'mycompay.eu-central-1.snowflakecomputing.com:443. Incorrect username or password was specified.\n'
                          '(Background on this error at: http://sqlalche.me/e/13/4xp6)']},
```
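The 250001 error is the Snowflake driver rejecting the credentials before any ingestion happens. For re-checking the connection settings, a minimal recipe sketch (all values are placeholders; account_id is the host fragment before .snowflakecomputing.com, and some older releases used host_port instead):
```python
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create({
    "source": {
        "type": "snowflake",
        "config": {
            "account_id": "mycompay.eu-central-1",
            "username": "<user>",
            "password": "<pass>",
            "warehouse": "<warehouse>",
            "role": "<role>",
        },
    },
    "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
})
pipeline.run()
pipeline.raise_from_status()
```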
• steep-soccer-91284 (04/28/2022, 1:34 AM)
How can I ingest Airflow? I'm currently using the DataHub Helm chart.
• rapid-book-98432 (04/29/2022, 2:21 PM)
Now that I can ingest the Superset install, I wanted to check whether I can see the "changes" in the lineage if I change the dataset source. I mean, are we notified when a dataset is "changed", and can we see the impact in the lineage graph or somewhere else?
• alert-football-80212 (05/01/2022, 2:08 PM)
If I have a bucket that looks like this:
bucketName:
• table1
  ◦ parquet1
  ◦ parquet2
  ◦ parquet3
  ◦ ...
• table2
  ◦ parquet1
  ◦ parquet2
  ◦ parquet3
  ◦ ...
• table3
  ◦ parquet1
  ◦ parquet2
  ◦ parquet3
  ◦ ...
and I want to ingest all of these tables, is there a recipe for it?
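A sketch of one way to express this, assuming the s3 source's {table} placeholder groups all Parquet files under each top-level prefix into a single dataset (newer releases take a path_specs list; older ones a single path_spec):
```python
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create({
    "source": {
        "type": "s3",
        "config": {
            # {table} becomes the dataset name; the *.parquet files under
            # each prefix are treated as one table.
            "path_specs": [{"include": "s3://bucketName/{table}/*.parquet"}],
            "aws_config": {"aws_region": "us-east-1"},  # placeholder region
        },
    },
    "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
})
pipeline.run()
pipeline.raise_from_status()
```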
• best-umbrella-24804 (05/02/2022, 1:37 AM)
Hello, I am trying to ingest a Glue catalog from an AWS account that is different from the AWS account DataHub is hosted on. The docs say this can be done using the catalog_id config: https://datahubproject.io/docs/metadata-ingestion/source_docs/glue/ My recipe is of this form:
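For reference, a sketch of that shape (account ID, role, and region are placeholders; catalog_id selects the other account's catalog, and the role must be assumable from the account DataHub runs in):
```python
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create({
    "source": {
        "type": "glue",
        "config": {
            "aws_region": "us-east-1",
            "catalog_id": "111122223333",  # AWS account that owns the catalog
            "aws_role": "arn:aws:iam::111122223333:role/datahub-glue-read",
        },
    },
    "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
})
pipeline.run()
pipeline.raise_from_status()
```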
• wonderful-egg-79350 (05/02/2022, 8:14 AM)
I am trying to put only lineage information on an existing MSSQL table using file-based lineage, but a new dataset is being created instead. Is there any way to add only the lineage information?
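Lineage can also be attached to an existing dataset by emitting just an upstreamLineage aspect against its exact URN; if the URN differs from the existing one in platform, name casing, or env, DataHub will create a new dataset stub, which may be what's happening here. A sketch with placeholder names:
```python
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    ChangeTypeClass,
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

emitter = DatahubRestEmitter("http://localhost:8080")  # placeholder

# URNs must match the existing datasets exactly (platform, name, env).
upstream_urn = make_dataset_urn("mssql", "mydb.dbo.source_table", env="PROD")
downstream_urn = make_dataset_urn("mssql", "mydb.dbo.target_table", env="PROD")

lineage = UpstreamLineageClass(
    upstreams=[UpstreamClass(dataset=upstream_urn, type=DatasetLineageTypeClass.TRANSFORMED)]
)
emitter.emit(
    MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=downstream_urn,
        aspectName="upstreamLineage",
        aspect=lineage,
    )
)
```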
• dry-zoo-35797 (05/03/2022, 3:54 PM)
Hello,
• dry-zoo-35797 (05/03/2022, 3:54 PM)
Hello All,
• quaint-lighter-81058 (05/03/2022, 5:03 PM)
    "source": { "type": "mysql", "config": { "username": "usrname", "password": "pass", "database": "test_datahub", "host_port": "test.mysql.database.azure.com:3306", }, }, "sink": { "type": "datahub-rest", "config": {"server": "http://localhost:8080"}, }, }
• nutritious-bird-77396 (05/03/2022, 5:05 PM)
@big-carpet-38439 From what I can tell from the debug logs:
```
[2022-05-03 16:53:06,994] DEBUG    {datahub.cli.ingest_cli:94} - Using config: {'source': {'type': 'datahub-stream', 'config': {'auto_offset_reset': 'latest', 'connection': {'bootstrap': '<redacted>', 'schema_registry_url': '<redacted>', 'consumer_config': {'security.protocol': 'SASL_SSL'}}, 'actions': [{'type': 'executor', 'config': {'local_executor_enabled': True, 'remote_executor_enabled': 'False', 'remote_executor_type': 'acryl.executor.sqs.producer.sqs_producer.SqsRemoteExecutor', 'remote_executor_config': {'id': 'remote', 'aws_access_key_id': '""', 'aws_secret_access_key': '""', 'aws_session_token': '""', 'aws_command_queue_url': '""', 'aws_region': 'us-east-1'}}}], 'topic_routes': {'mae': 'MetadataAuditEvent_v4', 'mcl': 'MetadataChangeLog_Versioned_v1'}}}, 'sink': {'type': 'console'}, 'datahub_api': {'server': '<redacted>:8080', 'extra_headers': {'Authorization': 'Basic __datahub_system:JohnSnowKnowsNothing'}}}
```
the `sasl.mechanism` that is passed in is not set in the consumer configs....
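If so, a sketch of the missing properties (librdkafka option names; the values are placeholders) that would go under the connection's consumer_config:
```python
# librdkafka client properties for SASL_SSL; the right mechanism depends
# on the broker (PLAIN for Confluent Cloud API keys, SCRAM-* elsewhere).
consumer_config = {
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "PLAIN",
    "sasl.username": "<api-key>",
    "sasl.password": "<api-secret>",
}
```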
• worried-motherboard-80036 (05/05/2022, 4:59 PM)
Hi everyone, I'm trying to ingest data from an existing Elasticsearch cluster. My recipe:
```
source:
  type: "elasticsearch"
  config:
    # Coordinates
    host: 'https://internal_ip:9200'

    # Credentials
    username: the_user
    password: the_pass

sink:
  type: "console"
```
Because I am not passing any SSL config params, I am getting:
```
elasticsearch.exceptions.SSLError: ConnectionError([SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131))
```
I dug into the source code a bit, and I see that the connection to Elasticsearch is built by passing only the host and the http_auth (basically username and password):
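For reference, the underlying elasticsearch-py 7.x client does accept TLS parameters; whether a given DataHub release forwards them is version-dependent, so this is a sketch of the client call, not of the source's config:
```python
from elasticsearch import Elasticsearch

# Verify the server certificate against a custom CA bundle.
es = Elasticsearch(
    hosts=["https://internal_ip:9200"],
    http_auth=("the_user", "the_pass"),
    use_ssl=True,
    verify_certs=True,
    ca_certs="/path/to/ca.pem",  # placeholder path
)
print(es.info())
```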
• mysterious-nail-70388 (05/06/2022, 3:19 AM)
Hi, the following error occurred when I ingested the ClickHouse (CK) data source: a field is missing. Is it a bug, or is it the ClickHouse version?
• sticky-dawn-95000 (05/08/2022, 12:39 PM)
image.png
• rich-policeman-92383 (05/10/2022, 12:09 PM)
Hello. How do we define custom run IDs in the YAML while doing ingestion using the CLI?
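The pipeline config appears to accept a top-level run_id field, so setting it at the top level of the recipe (YAML or dict) should override the auto-generated run id; this is an assumption worth verifying against your CLI version. A sketch:
```python
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create({
    "run_id": "my-custom-run-2022-05-10",  # assumed top-level recipe key
    "source": {"type": "file", "config": {"filename": "./mces.json"}},  # placeholder
    "sink": {"type": "console"},
})
pipeline.run()
```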
• astonishing-dusk-99990 (05/11/2022, 12:32 PM)
Hi All, I'm having trouble parsing manifest.json when ingesting dbt data:
```
'JSONDecodeError: Invalid control character at: line 1594 column 4096 (char 65977)\n'
```
When I run locally I get the same error, but I was able to work around it by reading the file content into a variable first, following these solutions: https://stackoverflow.com/questions/9156417/valid-json-giving-jsondecodeerror-expecting-delimiter https://stackoverflow.com/questions/63107394/jsonload-jsondecodeerror-invalid-control-character-at However, when I execute it in the DataHub UI it always errors. Does anyone know which file I can edit to add `strict=False` in json.loads?
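For local debugging, the workaround amounts to the following (the file path is a placeholder; which file in the dbt source calls json.load is exactly what's being asked):
```python
import json

with open("manifest.json", encoding="utf-8") as f:
    raw = f.read()

# strict=False lets the parser accept unescaped control characters.
manifest = json.loads(raw, strict=False)
```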
• miniature-sandwich-75434 (05/16/2022, 7:24 AM)
Hello, for Redash ingestion, which Redash versions are supported? https://datahubproject.io/docs/generated/ingestion/sources/redash/
• millions-waiter-49836 (05/16/2022, 7:37 PM)
Hey team, a question about partitions. As I see from the GraphQL Reference, `partition` sits in `dataset.datasetProfiles.partitionSpec`. Does this mean we can only surface `partition` as part of `datasetProfiles`? IMHO, shouldn't a dataset's partition exist by itself, not depending on the dataset profiles?
• best-umbrella-24804 (05/17/2022, 4:11 AM)
Hello, I have a Great Expectations script pushing validations to DataHub. Everything works as expected, except the script refuses to terminate even with a sys.exit() at the end. I have to force it to terminate with CTRL+C. Any idea why this is happening?
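A script that ignores sys.exit() usually has a live non-daemon thread keeping the interpreter alive (for example, an emitter's background worker). A quick diagnostic sketch to run just before exiting:
```python
import sys
import threading

# Any non-daemon thread listed here besides MainThread prevents exit.
for t in threading.enumerate():
    print(f"{t.name}: daemon={t.daemon}, alive={t.is_alive()}")

sys.exit(0)  # the process only ends once non-daemon threads finish
```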
• chilly-gpu-46080 (05/18/2022, 7:38 AM)
Hi All
• brash-sundown-77702 (05/19/2022, 2:33 PM)
I am trying to do CLI ingestion from MySQL to datahub-rest.