# ingestion
  • m

    mammoth-bear-12532

    06/08/2022, 4:52 PM
    It is raining good news today! If you've ever found yourself stumped by how to fill out the ingestion configuration correctly, we finally have a solution: YAML validators for VS Code and IntelliJ, so you can write those recipe files correctly the first time. Check out the docs here. Huge kudos to @dazzling-judge-80093 for building this for all ingestion enthusiasts! 🧵
    Screen Recording 2022-06-06 at 18.52.04.mov
    🤩 2
    👍 7
  • h

    handsome-football-66174

    06/08/2022, 5:34 PM
    Hi everyone, I'm trying to use https://datahubproject.io/docs/generated/metamodel/entities/dataset/#fine-grained-lineage and trying to understand the usage of upstream in the code below (we are already adding the necessary column lineages via fieldLineages). Also, are we able to view the fine-grained lineages?
    Copy code
    upstream = Upstream(dataset=datasetUrn("bar2"), type=DatasetLineageType.TRANSFORMED)
    
    fieldLineages = UpstreamLineage(
        upstreams=[upstream], fineGrainedLineages=fineGrainedLineages
    )
    
    lineageMcp = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=datasetUrn("bar"),
        aspectName="upstreamLineage",
        aspect=fieldLineages,
    )
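    For reference, a hedged, self-contained version of the snippet above (assumes acryl-datahub around 0.8.x; the platform, dataset, and column names are placeholders). The table-level upstream creates the lineage edge itself, while the fineGrainedLineages entries annotate that edge with column-level detail.
    Copy code
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.com.linkedin.pegasus2avro.dataset import (
        DatasetLineageType,
        FineGrainedLineage,
        FineGrainedLineageDownstreamType,
        FineGrainedLineageUpstreamType,
        Upstream,
        UpstreamLineage,
    )
    from datahub.metadata.schema_classes import ChangeTypeClass


    def datasetUrn(name: str) -> str:
        # Placeholder platform/env; use whatever your real datasets use.
        return builder.make_dataset_urn(platform="postgres", name=name, env="PROD")


    def fieldUrn(name: str, column: str) -> str:
        return builder.make_schema_field_urn(datasetUrn(name), column)


    # Column-level mapping: bar2.col_a feeds bar.col_a (placeholder columns).
    fineGrainedLineages = [
        FineGrainedLineage(
            upstreamType=FineGrainedLineageUpstreamType.FIELD_SET,
            upstreams=[fieldUrn("bar2", "col_a")],
            downstreamType=FineGrainedLineageDownstreamType.FIELD,
            downstreams=[fieldUrn("bar", "col_a")],
        )
    ]

    # The table-level upstream is what makes the bar2 -> bar edge exist;
    # the fine-grained entries only add column detail to that edge.
    fieldLineages = UpstreamLineage(
        upstreams=[Upstream(dataset=datasetUrn("bar2"), type=DatasetLineageType.TRANSFORMED)],
        fineGrainedLineages=fineGrainedLineages,
    )

    lineageMcp = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=datasetUrn("bar"),
        aspectName="upstreamLineage",
        aspect=fieldLineages,
    )
    DatahubRestEmitter("http://localhost:8080").emit_mcp(lineageMcp)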
  • b

    bright-cpu-56427

    06/09/2022, 7:32 AM
    Hi guys, I am creating a custom data platform (looking at this).
    Copy code
    pydantic.error_wrappers.ValidationError: 1 validation error for PipelineConfig
    source -> filename:
      extra fields not permitted (type=value_error.extra)
    This error comes up and I don't know which part is wrong. My Python code is:
    Copy code
    def add_quicksight_platform():
        pipeline = Pipeline.create(
            # This configuration is analogous to a recipe configuration.
            {
                "source": {
                    "type": "file",
                    "filename:":"/opt/airflow/dags/datahub/quicksight.json"
                },
                "sink": {
                    "type": "datahub-rest",
                    "config": {"server": "{datahub-gms-ip}:8080"},
                }
            }
        )
        pipeline.run()
        pipeline.raise_from_status()
    quicksight.json
    Copy code
    {
      "auditHeader": null,
      "proposedSnapshot": {
        "com.linkedin.pegasus2avro.metadata.snapshot.DataPlatformSnapshot": {
          "urn": "urn:li:dataPlatform:quicksight",
          "aspects": [
            {
              "com.linkedin.pegasus2avro.dataplatform.DataPlatformInfo": {
                "datasetNameDelimiter": "/",
                "name": "quicksight",
                "type": "OTHERS",
                "logoUrl": "<https://play-lh.googleusercontent.com/dbiOAXowepd9qC69dUnCJWEk8gg8dsQburLUyC1sux9ovnyoyH5MsoLf0OQcBqRZILB0=w240-h480-rw>"
              }
            }
          ]
        }
      },
      "proposedDelta": null
    }
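    For comparison, a hedged sketch of how this recipe is usually written: the "extra fields not permitted" error is consistent with the stray colon inside the "filename:" key, and the generic file source normally takes its path under "config" (the path and server below are placeholders).
    Copy code
    from datahub.ingestion.run.pipeline import Pipeline


    def add_quicksight_platform():
        pipeline = Pipeline.create(
            {
                "source": {
                    "type": "file",
                    # "filename" (no trailing colon), nested under "config"
                    "config": {"filename": "/opt/airflow/dags/datahub/quicksight.json"},
                },
                "sink": {
                    "type": "datahub-rest",
                    # the REST sink expects a full URL, including the scheme
                    "config": {"server": "http://datahub-gms-ip:8080"},
                },
            }
        )
        pipeline.run()
        pipeline.raise_from_status()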
  • b

    bumpy-activity-74405

    06/09/2022, 8:13 AM
    Hey, I just started using bigquery/dbt/airflow so forgive me if the questions sound stupid 🙂 I’ve somewhat successfully ingested metadata/usage/stats of bigquery tables but it also had lineage (from bigquery logs, I assume). I initially thought that lineage would have to be ingested from dbt/airflow sources. Is there any reason I would look into those sources? Any pros/cons of getting lineage from bq source vs dbt/airflow?
  • b

    brave-pencil-21289

    06/09/2022, 12:11 PM
    Did anyone succeed with an Azure Synapse recipe? If yes, can you please share the recipe template?
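    For reference, a hedged sketch of the kind of recipe people typically use here: Synapse SQL pools speak the MSSQL protocol, so the mssql source is the usual route (host, database, credentials, and driver below are placeholders).
    Copy code
    source:
        type: mssql
        config:
            host_port: "your-workspace.sql.azuresynapse.net:1433"
            database: your_database
            username: your_user
            password: "${SYNAPSE_PASSWORD}"
            use_odbc: true
            uri_args:
                driver: "ODBC Driver 17 for SQL Server"
                Encrypt: "yes"
    sink:
        type: datahub-rest
        config:
            server: "http://localhost:8080"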
  • c

    crooked-lunch-27985

    06/09/2022, 5:48 PM
    Hey there! Got a noob Q: is there example code (Python) that someone can point me to that does ingestion of both metadata and the actual data using DataHub APIs? E.g., from a Snowflake data source I want to identify all tables that have, say, a "member_id" column and then pull the actual raw data from those tables and dump it into a local/server-side DB/file, etc. Is this a typical use case, or is actually pulling raw data not supported? Thanks!
  • r

    rich-policeman-92383

    06/09/2022, 6:11 PM
    Hello. Do we support query usage for Hive and Oracle data sources? datahub_version: v0.8.35
  • s

    stocky-midnight-78204

    06/10/2022, 2:49 AM
    How do I import Hive table statistics? I have executed "analyze table xxx compute statistics" and then imported the table into DataHub, but I still can't see statistics in DataHub.
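    A possibly relevant detail (hedged): the Stats tab is populated by DataHub's own profiler rather than by Hive's ANALYZE statistics, so profiling has to be enabled in the recipe. A sketch with placeholder connection values:
    Copy code
    source:
        type: hive
        config:
            host_port: "hive-server:10000"
            username: your_user
            profiling:
                enabled: true
    sink:
        type: datahub-rest
        config:
            server: "http://localhost:8080"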
  • r

    rich-policeman-92383

    06/10/2022, 9:25 AM
    In stateful ingestion, how can I get the list of datasets that were marked as "soft-deleted"? Ingestion source: Hive. DataHub version: v0.8.35. Reason: on the DataHub UI the dataset count for Hive is 13K, while at the source it is 10K.
  • b

    brave-pencil-21289

    06/10/2022, 11:41 AM
    I am getting "logintime out expired" Error while working on azure synase ingestion. Can someone help on this.
  • f

    fresh-garage-83780

    06/10/2022, 12:59 PM
    I'm trying to bring my Confluent Cloud instance into DataHub. Schemas are protobuf in the Confluent Schema Registry. Most of our message keys are UUIDs; the schema for a topic key in the registry is:
    Copy code
    syntax = "proto3";
    package atech.proto.shared;
    
    message Uuid {
      string value = 1;
    }
    What I'm seeing is that it pulls in one topic, then errors on the second. I think it's telling me there are duplicated shared proto objects:
    Copy code
    TypeError: Couldn't build proto file into descriptor pool!
    Invalid proto descriptor for file "shared/uuid.proto":
      atech.proto.shared.Uuid.value: "atech.proto.shared.Uuid.value" is already defined in file "atech_app_journey_ape_completed-key.proto".
      atech.proto.shared.Uuid: "atech.proto.shared.Uuid" is already defined in file "atech_app_journey_ape_completed-key.proto".
    It seems I might have hit this limitation, but I can't think of a workaround, so I'm not quite sure where to go from here. Any clues?
  • s

    some-lighter-85578

    06/10/2022, 1:36 PM
    Hi, I would like to ask for some advice. I'm trying to ingest data from Snowflake (that works fine), but for reasons beyond my control we don't have descriptions of tables/columns directly in Snowflake; they live in another app. I can easily export those into JSON or CSV, but how can I merge them into the ingestion process? Is it possible to use transformers for that? Any kind of help/example would be really great.
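    One hedged alternative in case transformers turn out to be awkward: since the descriptions can already be exported to CSV/JSON, they can also be pushed as a separate step after ingestion using the Python emitter and the editable description aspect. A sketch (the CSV layout, platform, and GMS address are assumptions):
    Copy code
    import csv

    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        EditableDatasetPropertiesClass,
    )

    emitter = DatahubRestEmitter("http://localhost:8080")  # assumed GMS address

    # Assumed export format from the other app: table_name,description
    with open("descriptions.csv") as f:
        for row in csv.DictReader(f):
            urn = builder.make_dataset_urn("snowflake", row["table_name"], env="PROD")
            mcp = MetadataChangeProposalWrapper(
                entityType="dataset",
                changeType=ChangeTypeClass.UPSERT,
                entityUrn=urn,
                aspectName="editableDatasetProperties",
                aspect=EditableDatasetPropertiesClass(description=row["description"]),
            )
            emitter.emit_mcp(mcp)
    Column-level descriptions could be pushed the same way through the editableSchemaMetadata aspect.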
  • a

    adorable-guitar-54244

    06/10/2022, 3:22 PM
    While trying to ingest the schema registry I am getting this error.
  • b

    bland-orange-13353

    06/10/2022, 3:31 PM
    This message was deleted.
  • b

    brainy-shampoo-83265

    06/10/2022, 5:25 PM
    If I want to update the Validation tab on a regular schedule, should I make a new type of Source? If so, is there a better way of doing that than using the DatahubRestEmitter?
  • s

    stocky-midnight-78204

    06/13/2022, 1:52 AM
    What is the difference between Presto on Hive and Trino? How do I choose?
  • s

    stocky-midnight-78204

    06/13/2022, 3:42 AM
    Does anyone have a sample recipe for ingesting presto-on-hive?
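    A hedged sketch of what a presto-on-hive recipe tends to look like; it reads metadata from the Hive metastore database directly, and the host, database, credentials, and scheme below are placeholders.
    Copy code
    source:
        type: presto-on-hive
        config:
            host_port: "metastore-db-host:5432"
            database: metastore
            username: metastore_user
            password: "${METASTORE_PASSWORD}"
            scheme: "postgresql+psycopg2"   # or mysql+pymysql, depending on the metastore DB
    sink:
        type: datahub-rest
        config:
            server: "http://localhost:8080"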
  • l

    lemon-zoo-63387

    06/13/2022, 6:39 AM
    Could you please help me with a question? The Oracle client has been installed, but the run still ends up like this; it's proving really difficult.
  • l

    lemon-zoo-63387

    06/13/2022, 6:43 AM
    I have the same problem with MySQL; I really need your help.
  • b

    bright-cpu-56427

    06/13/2022, 9:48 AM
    Hi guys. When I make a request from the Airflow server to DataHub with the code below, the Airflow server is set to localhost; can I not change this value? Operator code:
    Copy code
    emit_lineage_task = DatahubEmitterOperator(
        task_id=f"emit_lineage_{table}",
        datahub_conn_id="datahub_rest_default",
        mces=[
            builder.make_lineage_mce(
                upstream_urns=upstreams,
                downstream_urn=builder.make_dataset_urn(downstream[0], downstream[1]),
            )
        ],
    )
    emit_lineage_task.execute(kwargs)
    log file
    Copy code
    [2022-06-13 09:27:43,912] {_lineage_core.py:67} INFO - Emitted from Lineage: 
        DataFlow(urn=<datahub.utilities.urns.data_flow_urn.DataFlowUrn object at 0x7f9bcf23eee0>, ... url='<http://localhost:8080/tree?dag_id=dag>', ...)
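    In case it helps, a hedged guess at where the localhost comes from: the lineage backend appears to build that link from Airflow's own webserver base_url rather than from anything on the DataHub side, so changing that setting (or the matching environment variable) should change the emitted URL.
    Copy code
    # airflow.cfg -- hedged guess: the emitted DataFlow/DataJob browse URL is
    # derived from this value; it can also be set via AIRFLOW__WEBSERVER__BASE_URL
    [webserver]
    base_url = https://airflow.mycompany.example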
  • r

    rich-rocket-77152

    06/13/2022, 10:25 AM
    @millions-waiter-49836 hello. Currently I am using Glue profiling (thank you for your contribution). Today I added statistics to my Glue table through update-column-statistics-for-table (https://docs.aws.amazon.com/cli/latest/reference/glue/update-column-statistics-for-table.html). I checked the statistics through get-column-statistics-for-table (https://docs.aws.amazon.com/cli/latest/reference/glue/get-column-statistics-for-table.html), and I ingested the Glue table, but nothing is showing up in the Stats tab. I looked at your code: you read statistics through get_table (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.get_table) and get_partitions (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.get_partition). When I called the get_table API via boto3, there was no Parameters field in the Table field, and the same for get_partitions. Why didn't you use get_column_statistics_for_partition (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.get_column_statistics_for_partition) and get_column_statistics_for_table (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.get_column_statistics_for_table)? Is what I checked wrong? Let me know if there is a right way. I'll attach a screenshot of the Stats tab from my DataHub and a screenshot of the get-column-statistics-for-table result.
  • f

    fresh-napkin-5247

    06/13/2022, 10:56 AM
    Hello, I came up with another question. I would like to change part of the URN of a given entity via a transformer. For example, I would like to replace the word "prod" with "dev" in the URN string (the reasons for this have to do with our internal architecture). Is this a possibility? There does not seem to be a guide for this in the docs. Thank you! Version: acryl-datahub, version 0.8.38
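    If the part being swapped is just the trailing PROD/DEV segment of the URN, a hedged alternative to rewriting URNs in a transformer is the env setting on the source, which controls that segment at ingestion time (the source type and connection values below are only examples).
    Copy code
    source:
        type: snowflake
        config:
            env: DEV   # sets the environment segment of the emitted dataset URNs
            host_port: "xxxx.snowflakecomputing.com"
            username: your_user
            password: "${SNOWFLAKE_PASSWORD}"
    sink:
        type: datahub-rest
        config:
            server: "http://localhost:8080"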
  • b

    bumpy-daybreak-85714

    06/13/2022, 12:26 PM
    Hello! I would like to ingest data from dbt into DataHub. I want to ingest the manifest and catalog that are put on an S3 bucket. Is there a way to use the instance-level metadata for the role (IAM roles)? https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html#id1 Right now I am getting the following error when I do not specify a role:
    Copy code
    INFO - ERROR:root:Error: Unable to locate credentials
    From the code perspective, I guess that for this to work this line needs to be modified: https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/aws/aws_common.py#L114 Am I right? Using version acryl-datahub[dbt] == 0.8.36
  • s

    salmon-angle-92685

    06/13/2022, 1:18 PM
    Hello, have you ever had this problem when trying to set the top_n_queries field in Snowflake? Would you know why this happens and how to solve it?
    1 validation error for SnowflakeConfig
    top_n_queries
    extra fields not permitted (type=value_error.extra)
    Thank you in advance !
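    The error reads as though top_n_queries ended up in the plain snowflake source config; a hedged guess is that it is an option of the snowflake-usage source instead (connection values below are placeholders).
    Copy code
    source:
        type: snowflake-usage
        config:
            host_port: "xxxx.snowflakecomputing.com"
            username: your_user
            password: "${SNOWFLAKE_PASSWORD}"
            top_n_queries: 10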
  • s

    stocky-energy-24880

    06/13/2022, 1:39 PM
    Hello, we are doing Redshift ingestion with the source below:
    Copy code
    source:
        type: "redshift"
        config:
            platform_instance: "dataland"
            env: "DEV"
            username: "${REDSHIFT_USER}"
            password: "${REDSHIFT_PASSWORD}"
            host_port: "dataland-internal.big.dev.scmspain.io:5439"
            database: "dwh_sch_sp_db"
            schema_pattern:
                deny:
                    - .*_mgmt$
            table_pattern:
                deny:
                    - .*dim_ad_normalization$
                    - .*dim_api_client$
            include_table_lineage: False
            include_views: False
            stateful_ingestion:
                enabled: True
                remove_stale_metadata: True
            profiling:
                enabled: true
                limit: 1000
                turn_off_expensive_profiling_metrics: True
            profile_pattern:
                allow:
                    - ^dwh_sch_sp_db.motos
                deny:
                    - .tmp.
                    - .temp.
            options:
                connect_args:
                    sslmode: prefer
    After the ingestion ran successfully, we can see in the logs that the tables matched by the deny patterns were soft deleted:
    Copy code
    'soft_deleted_stale_entities': ['urn:li:dataset:(urn:li:dataPlatform:redshift,dataland.dwh_sch_sp_db.pro_infojobs_es.dim_api_client,DEV)',
                                    'urn:li:dataset:(urn:li:dataPlatform:redshift,dataland.dwh_sch_sp_db.infojobs_es.dim_api_client,DEV)'],
    'query_combiner': {'total_queries': 676, 'uncombined_queries_issued': 379, 'combined_queries_issued': 82, 'queries_combined': 297, 'query_exceptions': 0},
    'saas_version': 'PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.38698',
    'upstream_lineage': {}}
    Sink (datahub-kafka) report:
    {'records_written': 8252, 'warnings': [], 'failures': [], 'downstream_start_time': None, 'downstream_end_time': None, 'downstream_total_latency_in_seconds': None}
    However, the tables are still visible in the UI, and the MySQL table has two entries, first with {"removed":true} and again with {"removed":false}. Can you please explain what is going wrong here?
  • b

    busy-airport-23391

    06/13/2022, 9:31 PM
    Good afternoon, I'm trying to set up a custom Java emitter to update a Glue dataset's stats. At the moment, all I'm trying to upsert is the timestampMillis. When I try to set the timestampMillis, I get the following error despite passing a long value:
    Copy code
    java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long
    	at com.linkedin.metadata.timeseries.transformer.TimeseriesAspectTransformer.getCommonDocument(TimeseriesAspectTransformer.java:80)
    	at com.linkedin.metadata.timeseries.transformer.TimeseriesAspectTransformer.transform(TimeseriesAspectTransformer.java:47)
    	at com.linkedin.metadata.kafka.hook.UpdateIndicesHook.updateTimeseriesFields(UpdateIndicesHook.java:217)
    	at com.linkedin.metadata.kafka.hook.UpdateIndicesHook.invoke(UpdateIndicesHook.java:115)
    	at com.linkedin.metadata.kafka.MetadataChangeLogProcessor.consume(MetadataChangeLogProcessor.java:77)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:498)
    	at org.springframework.messaging.handler.invocation.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:169)
    	at org.springframework.messaging.handler.invocation.InvocableHandlerMethod.invoke(InvocableHandlerMethod.java:119)
    	at org.springframework.kafka.listener.adapter.HandlerAdapter.invoke(HandlerAdapter.java:56)
    	at org.springframework.kafka.listener.adapter.MessagingMessageListenerAdapter.invokeHandler(MessagingMessageListenerAdapter.java:347)
    	at org.springframework.kafka.listener.adapter.RecordMessagingMessageListenerAdapter.onMessage(RecordMessagingMessageListenerAdapter.java:92)
    	at org.springframework.kafka.listener.adapter.RecordMessagingMessageListenerAdapter.onMessage(RecordMessagingMessageListenerAdapter.java:53)
    	at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.doInvokeOnMessage(KafkaMessageListenerContainer.java:2334)
    	at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.invokeOnMessage(KafkaMessageListenerContainer.java:2315)
    	at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.doInvokeRecordListener(KafkaMessageListenerContainer.java:2237)
    	at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.doInvokeWithRecords(KafkaMessageListenerContainer.java:2150)
    	at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.invokeRecordListener(KafkaMessageListenerContainer.java:2032)
    	at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.invokeListener(KafkaMessageListenerContainer.java:1705)
    	at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.invokeIfHaveRecords(KafkaMessageListenerContainer.java:1276)
    	at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.pollAndInvoke(KafkaMessageListenerContainer.java:1268)
    	at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.run(KafkaMessageListenerContainer.java:1163)
    	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)
    So I had two questions: 1. Is there some more documentation on how to update a dataset's profile aspect? 2. Is there a reason the long value is getting converted to an int? Or am I trying to update this aspect incorrectly? Here's my Java function:
    Copy code
    MetadataChangeProposalWrapper mcpw = MetadataChangeProposalWrapper.builder()
                    .entityType("dataset")
                    .entityUrn("urn:li:dataset:(<urn>,TEST)")
                    .upsert()
                    .aspect(new DatasetProfile()
                            //.setColumnCount(11)
                            .setTimestampMillis(0L)
                    )
                    .build();
    Thanks in advance for your help!
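    For comparison, a hedged, self-contained sketch using the datahub-client Java emitter (the URN and GMS address are placeholders). A realistic epoch-millis value such as System.currentTimeMillis() is large enough to come back as a Long on the server side, which may sidestep the Integer-to-Long cast seen above when a small value like 0 is deserialized as an int.
    Copy code
    import com.linkedin.dataset.DatasetProfile;

    import datahub.client.rest.RestEmitter;
    import datahub.event.MetadataChangeProposalWrapper;

    public class EmitDatasetProfile {
        public static void main(String[] args) throws Exception {
            // Build the datasetProfile timeseries aspect with a long epoch-millis
            // timestamp; the dataset URN below is a placeholder.
            MetadataChangeProposalWrapper mcpw = MetadataChangeProposalWrapper.builder()
                    .entityType("dataset")
                    .entityUrn("urn:li:dataset:(urn:li:dataPlatform:glue,my_db.my_table,PROD)")
                    .upsert()
                    .aspect(new DatasetProfile()
                            .setTimestampMillis(System.currentTimeMillis()))
                    .build();

            // Assumes GMS is reachable at http://localhost:8080
            RestEmitter emitter = RestEmitter.createWithDefaults();
            emitter.emit(mcpw, null).get();  // block until the write is acknowledged
        }
    }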
  • t

    tall-fall-45442

    06/13/2022, 9:59 PM
    I'm having some issues using the UI to ingest new data. I have created a secret called my-postgres-password through the UI. Then when I create an ingestion recipe for Postgres and set the password value to '${my-postgres-password}', it says that it can't authenticate with the database. However, when I put the password in plain text the ingestion succeeds. Is there something I am doing wrong, or must secrets follow a certain naming convention?
  • b

    best-umbrella-24804

    06/14/2022, 12:25 AM
    Hello, I have a recipe for snowflake usage that looks like this
    Copy code
    source:
        type: snowflake-usage
        config:
            env: DEV
            host_port: xxxx.snowflakecomputing.com
            warehouse: DEVELOPER_X_SMALL
            username: DATAHUB_DEV_USER
            password: '${SNOWFLAKE_DEV_PASSWORD}'
            role: DATAHUB_DEV_ACCESS
            top_n_queries: 10
    sink:
        type: datahub-rest
        config:
            server: 'http://xxxxxx'
    We are finding that there are 1600 instances of the following error in our logs
    Copy code
    '2022-06-13 23:52:41.644712 [exec_id=66448a5e-1f15-406f-8ee7-700ed32f427b] INFO: stdout=[2022-06-13 23:52:37,635] WARNING  '
    "{datahub.ingestion.source.usage.snowflake_usage:93} - usage => Failed to parse usage line {'query_start_time': datetime.datetime(2022, "
    '6, 12, 4, 33, 7, 869000, tzinfo=datetime.timezone.utc), \'query_text\': "INSERT INTO xxxxx( ..... )            '
    'validation error for SnowflakeJoinedAccessEvent\n'
    'email\n'
    '  none is not an allowed value (type=type_error.none.not_allowed)\n'
    Any idea what is going on here?
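    A hedged guess at the knob involved: the usage source expects each access event to carry a user email, and this validation error appears when the Snowflake user has none on file; if this version exposes the email_domain option, it can synthesize username@<domain> instead.
    Copy code
    source:
        type: snowflake-usage
        config:
            # hedged guess: used to build user@<domain> when the Snowflake
            # user record has no email, which is what the error above rejects
            email_domain: mycompany.com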
  • l

    lemon-zoo-63387

    06/14/2022, 4:05 AM
    Could you please help me with a question? I ran the ingestion successfully with a YAML recipe, but the UI-based run does not report anything. How should it be configured?
  • m

    many-morning-40345

    06/14/2022, 5:44 AM
    Hi, is it possible to add dataset/glossary term properties via the UI?