# ingestion
  • l

    lively-dusk-19162

    02/23/2023, 5:50 PM
Hello team, I am getting the following error when building DataHub to reflect the changes after creating a new entity. I was running ./gradlew build. Could anyone please help me with this?
    b
    a
    • 3
    • 2
  • w

    white-horse-97256

    02/23/2023, 7:27 PM
Hi Team, I am trying to create data lineage from the Java SDK client using the DataJobInputOutput() class. I see that the function setInputDatajobs() is deprecated and there is no other function in that class for setting job values for the _inputDatajobsField variable. Where and how can I create dataset -> datajob -> dataset lineage using the Java SDK? I can't find any reference documentation either.
    h
    m
    • 3
    • 5
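For the question above, here is a minimal sketch of emitting a DataJobInputOutput aspect with the Python emitter, as an analogous approach to the Java SDK; the GMS address and every URN value below are placeholders.

```python
# Sketch only: dataset -> datajob -> dataset lineage via the Python emitter.
# The GMS address and all URNs below are placeholders.
from datahub.emitter.mce_builder import make_data_job_urn, make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DataJobInputOutputClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# The datajob reads one dataset and writes another; these two lists are what
# draw the dataset -> datajob -> dataset edges in the lineage graph.
io_aspect = DataJobInputOutputClass(
    inputDatasets=[make_dataset_urn("snowflake", "db.schema.source_table", "PROD")],
    outputDatasets=[make_dataset_urn("snowflake", "db.schema.target_table", "PROD")],
)

emitter.emit_mcp(
    MetadataChangeProposalWrapper(
        entityUrn=make_data_job_urn(
            orchestrator="airflow", flow_id="my_flow", job_id="my_job", cluster="PROD"
        ),
        aspect=io_aspect,
    )
)
```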
  • h

    handsome-flag-16272

    02/23/2023, 7:33 PM
1. Is datahub-mce-consumer responsible for consuming the events?
2. To improve throughput, we need to increase the number of Kafka partitions and scale out the datahub-mce-consumer instances. Are all messages sent to one topic, or can we configure different Kafka topics per domain? Does DataHub support scaling out consumers based on the number of pending messages in the Kafka topic?
3. If there are no standalone datahub-mce-consumer instances, which component does the job, datahub-gms or another service?
4. Where does datahub-mce-consumer send data to: datahub-gms, or directly to Elasticsearch for index building? It appears to be datahub-gms.
    h
    • 2
    • 2
  • r

    rich-daybreak-77194

    02/24/2023, 1:53 AM
Why don't column stats show for a table with too many rows? Another table's column stats are displayed even though it has 1M rows.
    ✅ 1
    h
    • 2
    • 1
  • m

    magnificent-lawyer-97772

    02/24/2023, 10:00 AM
Hi folks, we are writing a custom transformer. We noticed that in the documentation, developing transformers depends on datahub_actions. There's another Transformer in the datahub.ingestion package. In the past the documentation pointed towards the datahub.ingestion package, but nowadays it points to datahub_actions. Are transformers based on datahub_actions the new way forward, and will the datahub.ingestion package be deprecated, or will they continue to co-exist?
    ✅ 1
    h
    b
    • 3
    • 5
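As a reference point for the discussion above, here is a sketch of what an ingestion-side transformer built on the datahub.ingestion package can look like; the AddNoteConfig/AddNoteTransformer names and the "note" property are invented for illustration, and exact base-class signatures may vary between versions.

```python
# Illustrative sketch of a custom transformer on the datahub.ingestion package.
# AddNoteConfig / AddNoteTransformer / the "note" property are invented names.
from typing import List, Optional

from datahub.configuration.common import ConfigModel
from datahub.ingestion.api.common import PipelineContext
from datahub.ingestion.transformer.base_transformer import (
    BaseTransformer,
    SingleAspectTransformer,
)
from datahub.metadata.schema_classes import DatasetPropertiesClass


class AddNoteConfig(ConfigModel):
    note: str


class AddNoteTransformer(BaseTransformer, SingleAspectTransformer):
    """Adds a custom property to every dataset's datasetProperties aspect."""

    def __init__(self, config: AddNoteConfig, ctx: PipelineContext):
        super().__init__()
        self.config = config
        self.ctx = ctx

    @classmethod
    def create(cls, config_dict: dict, ctx: PipelineContext) -> "AddNoteTransformer":
        return cls(AddNoteConfig.parse_obj(config_dict), ctx)

    def entity_types(self) -> List[str]:
        return ["dataset"]

    def aspect_name(self) -> str:
        return "datasetProperties"

    def transform_aspect(
        self, entity_urn: str, aspect_name: str, aspect: Optional[DatasetPropertiesClass]
    ) -> Optional[DatasetPropertiesClass]:
        # Mutate (or create) the aspect and hand it back to the pipeline.
        props = aspect or DatasetPropertiesClass()
        props.customProperties["note"] = self.config.note
        return props
```

In a recipe, a custom transformer like this is referenced by its fully qualified class path under the transformers section.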
  • b

    boundless-nail-65912

    02/24/2023, 2:50 PM
Hi Team, I am experiencing a weird issue. I took a fresh Linux machine (RHEL 8.x) and installed DataHub. I checked the datahub version and it is 0.10.0.2, but when I went inside the datahub-actions docker container and checked the datahub version there, it says 0.9.6.2. Also, for Vertica: inside the container it has the old dialect (sqlalchemy_vertica), while outside the docker container it has the latest dialect (vertica_sqlalchemy_dialect). Can someone help me with this?
    a
    • 2
    • 3
  • d

    dazzling-microphone-98929

    02/24/2023, 5:57 PM
Hello All, I am getting an error with Power BI ingestion. Attaching the logs here and looking for some guidance. Thanks in advance!!!
    exec-urn_li_dataHubExecutionRequest_305aece4-b17c-46e9-bab2-80a368fe4322.log
  • g

    gray-ghost-82678

    02/24/2023, 7:58 PM
Hi all, I have a question about creating tables and jobs. Is it possible to create jobs manually, or is that only available by ingesting from Airflow? Also, can tables with their column names be created manually, either from the UI or the CLI? Thanks!
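Besides the UI and CLI, one programmatic route for the "tables with column names" part of the question above is the Python emitter: a minimal sketch that creates a dataset with a single column by emitting a schemaMetadata aspect. The platform, dataset name, field, and GMS address are placeholders.

```python
# Sketch: create a dataset with one column by emitting a schemaMetadata aspect.
# Platform, dataset name, field, and GMS address are placeholders.
from datahub.emitter.mce_builder import make_data_platform_urn, make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    OtherSchemaClass,
    SchemaFieldClass,
    SchemaFieldDataTypeClass,
    SchemaMetadataClass,
    StringTypeClass,
)

schema = SchemaMetadataClass(
    schemaName="customer",
    platform=make_data_platform_urn("hive"),
    version=0,
    hash="",
    platformSchema=OtherSchemaClass(rawSchema=""),
    lastModified=AuditStampClass(time=0, actor="urn:li:corpuser:ingestion"),
    fields=[
        SchemaFieldClass(
            fieldPath="customer_id",
            type=SchemaFieldDataTypeClass(type=StringTypeClass()),
            nativeDataType="VARCHAR(100)",
            description="Primary key",
        )
    ],
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=make_dataset_urn(platform="hive", name="mydb.customer", env="PROD"),
        aspect=schema,
    )
)
```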
  • w

    white-horse-97256

    02/24/2023, 8:43 PM
Hi Team, I am trying to create a datajob with the Java emitter and I get a 422 error:
MetadataChangeProposalWrapper mcpw = MetadataChangeProposalWrapper.builder()
        .entityType("datajob")
        .entityUrn(Utils.createDataJobUrn("Connectors", "source_connector", "cdc_digital_account_master_account_dbz_source_connector_test", "STG"))
        .upsert()
        .aspect(new DataJobInfo()
                .setName("cdc_digital_account_master_account_dbz_source_connector_test")
                .setFlowUrn(new DataFlowUrn("Connectors", "source_connector", "STG")))
        .build();
    • 1
    • 3
  • b

    breezy-controller-54597

    02/25/2023, 6:42 AM
Hi. 👋 I propose a method to create projects like dbt: define some datasets and lineages in YAML files and ingest them per project. This is useful for managing metadata as code when you cannot extract metadata directly from data sources for some reason. It would be something like a combination of the current csv-enricher and File-based Lineage, and it would be easy to manage if it worked even when split into multiple YAML files and directories, like dbt.
    m
    • 2
    • 2
  • r

    rich-daybreak-77194

    02/25/2023, 1:28 PM
Can we update dataset properties without ingesting data? I mean only updating the dataset properties, not refreshing the data in the dataset. Thank you.
    👍 1
    f
    a
    • 3
    • 4
  • b

    best-notebook-58252

    02/26/2023, 5:41 PM
Hi everybody, I'm running a LookML ingestion and noticed a weird behavior, but I don't know if this is expected because I'm not a Looker expert. I have some model lkml files where explores are not defined directly but are included:
    connection: "starburst"
    
    include: "/partners_dm/explores/*.explore.lkml"
Checking the source code, when scanning for reachable views, only explores defined in model files are considered, ignoring the ones included from separate files: https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/looker/lookml_source.py#L1632 So every view is marked as unreachable. Is this expected behavior?
    m
    • 2
    • 2
  • i

    important-afternoon-19755

    02/27/2023, 2:30 AM
Hi team. I want to populate the Stats tab for a Glue source. When I checked this link, it looks like I need to use a Glue crawler, but I don't have access to one. Is there any way to put profiling values in the data catalog without using a Glue crawler when loading the table from Spark to S3?
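One possible workaround, sketched below under stated assumptions, is to compute the stats in the Spark job itself and push a datasetProfile aspect directly with the Python emitter; the dataset URN, counts, and GMS address are placeholders, and whether this fits depends on having network access to GMS from the job.

```python
# Sketch: push profiling numbers computed in the Spark job as a datasetProfile
# aspect, without a Glue crawler. URN, counts, and GMS address are placeholders.
import time

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetProfileClass

profile = DatasetProfileClass(
    timestampMillis=int(time.time() * 1000),
    rowCount=1_250_000,  # e.g. df.count() from the Spark job
    columnCount=42,      # e.g. len(df.columns)
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=make_dataset_urn("glue", "mydb.mytable", "PROD"),
        aspect=profile,
    )
)
```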
  • c

    colossal-easter-99672

    02/27/2023, 8:52 AM
Hello, team. How do we solve the problem of upstream lineage being overwritten by different sources? Right now I query the current lineage and add to it incrementally, but in the long term I can have problems with stale lineage.
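A sketch of the read-merge-write pattern described above, using DataHubGraph to fetch the current upstreamLineage aspect, append a new edge, and write it back; the server address and URNs are placeholders.

```python
# Sketch of the read-merge-write pattern for upstream lineage
# (server address and URNs are placeholders).
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

downstream = make_dataset_urn("snowflake", "db.schema.target_table", "PROD")
new_upstream = make_dataset_urn("kafka", "events.topic", "PROD")

# Read whatever upstream lineage already exists for the downstream dataset.
existing = graph.get_aspect(entity_urn=downstream, aspect_type=UpstreamLineageClass)
upstreams = existing.upstreams if existing else []

# Add the new edge only if it is not already present, then write back.
if not any(u.dataset == new_upstream for u in upstreams):
    upstreams.append(
        UpstreamClass(dataset=new_upstream, type=DatasetLineageTypeClass.TRANSFORMED)
    )
graph.emit_mcp(
    MetadataChangeProposalWrapper(
        entityUrn=downstream, aspect=UpstreamLineageClass(upstreams=upstreams)
    )
)
```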
  • s

    salmon-vr-6357

    02/27/2023, 4:23 PM
Hello team, I'm trying to ingest BigQuery via the DataHub UI with keys set up in Secrets, but got this error while executing:
    google.auth.exceptions.DefaultCredentialsError: ('Failed to load service account credentials from /tmp/tmplmjtp2eo', ValueError('Could not deserialize key data. The data may be in an incorrect format, it may be encrypted with an unsupported algorithm, or it may be an unsupported key type (e.g. EC curves with explicit parameters).', [_OpenSSLErrorWithText(code=503841036, lib=60, reason=524556, reason_text=b'error:1E08010C:DECODER routines::unsupported')]))
Has anyone experienced this before, and what could be wrong? BTW, I'm on v0.9.6.1.
    ✅ 2
    a
    a
    • 3
    • 10
  • r

    ripe-tailor-61058

    02/27/2023, 6:10 PM
Is it possible via recipe-based ingestion to change the folder structure/containers created in DataHub? I am doing an S3-based ingestion, and what I see is that in DataHub it gets placed at <env>/s3/<bucket-name>/folder1/folder2/file.csv. What if I want it to be cataloged at <env>/s3/<bucket-name>/folder3/file.csv? Is there a transformer that does that? Thanks!
    a
    • 2
    • 3
  • c

    cold-dress-65039

    02/28/2023, 7:31 AM
Hi! I'm trying to ingest Mode, but I keep getting this ParserError: Unknown string format: None error. It doesn't look like this has been flagged before. Any idea on how to triage or fix this?
    g
    a
    • 3
    • 26
  • a

    agreeable-cricket-61480

    02/28/2023, 10:53 AM
Hi, I am calling a stored procedure in Snowflake from an Airflow DAG. I am able to see the lineage in the task, but I want to see the lineage with its upstreams from Snowflake. How can I do this?
    g
    • 2
    • 4
  • c

    crooked-rose-22807

    02/28/2023, 4:05 PM
Hi, the Business Glossary plugin is not behaving as expected. 1. Deleting glossary terms via entity_type does nothing. Instead I need to delete them by urn one by one, which is impractical, and even that method still has more issues:
    datahub delete --entity_type glossaryTerm # DOES NOT WORK
    datahub delete --urn "urn:li:glossaryTerm:green_transport_revenue" # assigned a custom ID, WORKS
    datahub delete --hard --urn "urn:li:glossaryTerm:<randomid>" # enable_auto_id=True WORKS
    datahub delete --hard --urn "urn:li:glossaryTerm:Metrics%203.My%20Revenue" # enable_auto_id=False, no custom ID assigned, the urn automatically takes glossary term `name`, DOES NOT WORK
2. Using the contains and/or inherits keys in the business glossary yaml works as expected. However, if a term is removed from these keys, the term is NOT actually removed from the UI. Interchangeably switching a term between the contains and inherits keys, surprisingly, works. All in all, this should be fixed ASAP, or if there is already an available solution, kindly let me know!
    a
    • 2
    • 6
  • b

    busy-train-56443

    02/28/2023, 7:18 PM
Hi everyone! My goal is to set Properties for a Dataset object. I didn't find a straightforward way to do it. Could someone please advise on how to add Properties to a Dataset entity through a Python script or a POST method?
    f
    g
    • 3
    • 5
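For the question above, a minimal sketch of setting Dataset properties from a Python script by emitting a datasetProperties aspect; the URN, property values, and GMS address are placeholders. Note that emitting the full aspect replaces any existing datasetProperties, so a read-modify-write against the current aspect may be needed to preserve existing values.

```python
# Sketch: attach custom properties to a dataset via the Python emitter
# (URN, property values, and GMS address are placeholders).
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

props = DatasetPropertiesClass(
    description="Customer master table",
    customProperties={"owner_team": "data-platform", "tier": "gold"},
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=make_dataset_urn("snowflake", "db.schema.customers", "PROD"),
        aspect=props,
    )
)
```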
  • r

    refined-energy-76018

    02/28/2023, 11:38 PM
Has there been any success with implementing the Airflow cluster policy for DAGs to emit dataProcessInstances at the DAG level? I'm playing around with the code for the DataHub Airflow plugin locally and can't seem to set the on_success_callback or on_failure_callback in dag_policy in a way that has any effect upon DAG completion. Would that explain why run_dataflow and complete_dataflow are implemented in airflow_generator.py but not used?
    ✅ 1
    a
    g
    a
    • 4
    • 5
  • a

    acceptable-nest-20465

    03/01/2023, 12:10 AM
Has anyone tried ingesting metadata from Neo4j into DataHub? I am able to get JSON with all relationships, labels, nodes, etc. by using call apoc.meta.schema, but I couldn't figure out how to ingest it into DataHub. To use either aspects or dataset snapshots, it seems the platform should be one of the existing supported data platforms. Otherwise, do we have to build the whole logic for processing metadata from Neo4j ourselves?
    ✅ 1
    b
    • 2
    • 4
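If no built-in source fits, one option is to map the apoc.meta.schema output to DataHub aspects yourself and emit them; a dataset URN can reference "neo4j" as the platform even though there is no dedicated connector. A minimal sketch, where the label, properties, subtype name, and GMS address are placeholders:

```python
# Sketch: one Neo4j label from apoc.meta.schema mapped to a DataHub dataset.
# Label, properties, subtype, and GMS address are placeholders.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass, SubTypesClass

label = "Person"
meta = {"count": "123", "relationships": "ACTED_IN, DIRECTED"}

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
dataset_urn = make_dataset_urn(platform="neo4j", name=label, env="PROD")

# Emit one MCP per aspect for the same dataset URN.
for aspect in [
    DatasetPropertiesClass(name=label, customProperties=meta),
    SubTypesClass(typeNames=["Neo4j Node"]),
]:
    emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=aspect))
```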
  • e

    elegant-salesmen-99143

    03/01/2023, 12:36 PM
Hi all. I just tried connecting Kafka to DataHub. I see the topics ingested alright, but I don't see the partitions in them. Is that the expected behavior? What info from Kafka is DataHub supposed to display?
    plus1 2
    l
    s
    +2
    • 5
    • 11
  • q

    quiet-jelly-11365

    03/01/2023, 1:37 PM
Hi all, has anyone managed to sync AWS-managed Kafka with the AWS Glue Schema Registry to DataHub?
  • l

    lively-dusk-19162

    03/01/2023, 3:31 PM
Hello everyone, does anyone have an idea: when we create a new entity, is it possible to write the GraphQL code changes in Python, given that the original code for the other entities is in Java?
    g
    g
    • 3
    • 13
  • h

    handsome-flag-16272

    03/01/2023, 8:12 PM
Hi Team, can anyone help answer the following questions about stateful ingestion (DataHub v0.10.0)? 1. Between ingestions of the same data, only one new table column was added. How many events should be sent from the actions node to the GMS node? I assume it should be only one, but I actually found that the full DB metadata was sent. Is there anything wrong in my config?
    source:
      type: snowflake
      config:
        platform_instance: DEV
    
        # Coordinates
        account_id: MY_ACCOUNT
        warehouse: MY_WH
    
        # Credentials
        username: MY_NAME
        password: MY_PASS
    
        # Options
        include_table_lineage: true
        include_view_lineage: true
        include_operational_stats: false
        include_usage_stats: false
    
        database_pattern:
          allow:
            - MY_DB
        schema_pattern:
          allow:
            - MY_SCHEMA
        stateful_ingestion:
          enabled: true
    
    datahub_api:
server: 'http://localhost:8080'
    
    sink:
      type: datahub-rest
      config:
server: 'http://localhost:8080'
    
    pipeline_name: 'urn:li:dataHubIngestionSource:dev_snowflake_db'
2. Before the 3rd run of stateful ingestion, I dropped the my_test table. This time I can see the summary in the CLI as below:
    Pipeline finished with at least 2 warnings; produced 168 events in 30.94 seconds.
In the 1st and 2nd runs, the message was "… produced 169 events …". The issues I've found:
• It also indicates this stateful ingestion is a full ingestion rather than a delta ingestion.
• When I log in to the UI, I can still see the my_test table. However, it is neither marked as soft deleted nor is the "Last synchronized" time updated correctly; the "Last synchronized" shown is the 2nd ingestion time.
    g
    • 2
    • 8
  • r

    ripe-tailor-61058

    03/01/2023, 9:13 PM
Hello, I am trying to ingest via a recipe and use a transformer to add dataset properties, per https://datahubproject.io/docs/metadata-ingestion/docs/transformer/dataset_transformer/#simple-add-dataset-datasetproperties. I am running into an error with the following transformer:
transformers:
  - type: "simple_add_dataset_properties"
    config:
      semantics: PATCH
      properties:
        bucket: djla-dev-tenant-jna
        dataset: dataset2
    [2023-03-01 15:59:22,624] DEBUG  {datahub.telemetry.telemetry:239} - Sending Telemetry
    [2023-03-01 15:59:22,689] DEBUG  {datahub.ingestion.run.pipeline:181} - Source type:s3,<class 'datahub.ingestion.source.s3.source.S3Source'> configured
    [2023-03-01 15:59:22,689] ERROR  {datahub.ingestion.run.pipeline:127} - 1 validation error for SimpleAddDatasetPropertiesConfig
    semantics
     extra fields not permitted (type=value_error.extra)
    Traceback (most recent call last):
     File "/home/jabplana/repos/dpl-scripts/datahub/.venv/lib64/python3.6/site-packages/datahub/ingestion/run/pipeline.py", line 197, in __init__
      self._configure_transforms()
     File "/home/jabplana/repos/dpl-scripts/datahub/.venv/lib64/python3.6/site-packages/datahub/ingestion/run/pipeline.py", line 212, in _configure_transforms
      transformer_class.create(transformer_config, self.ctx)
     File "/home/jabplana/repos/dpl-scripts/datahub/.venv/lib64/python3.6/site-packages/datahub/ingestion/transformer/add_dataset_properties.py", line 97, in create
      config = SimpleAddDatasetPropertiesConfig.parse_obj(config_dict)
     File "pydantic/main.py", line 521, in pydantic.main.BaseModel.parse_obj
     File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
    pydantic.error_wrappers.ValidationError: 1 validation error for SimpleAddDatasetPropertiesConfig
    semantics
     extra fields not permitted (type=value_error.extra)
    [2023-03-01 15:59:22,691] INFO   {datahub.cli.ingest_cli:119} - Starting metadata ingestion
    [2023-03-01 15:59:22,692] INFO   {datahub.cli.ingest_cli:137} - Finished metadata ingestion
    
    Failed to configure transformers due to 1 validation error for SimpleAddDatasetPropertiesConfig
    semantics
     extra fields not permitted (type=value_error.extra)
    [2023-03-01 15:59:22,703] DEBUG  {datahub.telemetry.telemetry:239} - Sending Telemetry
It works fine without the semantics: PATCH line, but I can't get it to work when including it before or after the properties.
    ✅ 1
    a
    • 2
    • 4
  • w

    white-horse-97256

    03/01/2023, 11:20 PM
Hi Team, is there a way to ingest datasets in bulk with the Python SDK?
    a
    f
    • 3
    • 4
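There is no single bulk call I can point to with certainty for the question above, but a common pattern is to reuse one REST emitter and loop over MetadataChangeProposals; a minimal sketch, where the dataset list, aspect values, and GMS address are placeholders.

```python
# Sketch: emit many datasets in one pass with a single REST emitter
# (the dataset list and GMS address are placeholders).
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

datasets = [
    {"name": "db.schema.orders", "description": "Order fact table"},
    {"name": "db.schema.customers", "description": "Customer dimension"},
]

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
for ds in datasets:
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=make_dataset_urn("snowflake", ds["name"], "PROD"),
            aspect=DatasetPropertiesClass(description=ds["description"]),
        )
    )
```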
  • a

    agreeable-cricket-61480

    03/02/2023, 7:21 AM
Can someone help me with these questions: 1. Does DataHub provide encryption for columns before sharing with downstream consumers? 2. How can I find how many tables use a particular column? 3. How can I check which users recently accessed a table? 4. How can I check whether a user has the proper permission to access a table?
    a
    • 2
    • 4