# ingestion
  • f

    few-grass-66826

    08/27/2022, 12:42 PM
Hi guys, I am ingesting metadata from Kafka and the only thing it gets is topic names. What else can it ingest, and how?
    h
    • 2
    • 2
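A minimal sketch for the question above, assuming a Confluent Schema Registry is reachable: with connection.schema_registry_url set, the Kafka source can also pull topic schemas, not just topic names. The broker, registry, and GMS hostnames below are placeholders.

```python
# Hedged sketch: run a Kafka ingestion recipe from Python.
# With a schema registry configured, topic schemas are ingested as well.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "kafka",
            "config": {
                "connection": {
                    "bootstrap": "broker:9092",  # placeholder
                    "schema_registry_url": "http://schema-registry:8081",  # enables schema ingestion
                },
                # "topic_patterns": {"allow": ["^prod\\..*"]},  # optional topic filtering
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://datahub-gms:8080"},  # placeholder
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```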
  • l

    lemon-engine-23512

    08/27/2022, 4:12 PM
Hi team, is it necessary to build a Python package of our project for adding custom sources? Also, where do I install this package if I am not using the CLI but Airflow to schedule the custom source?
    h
    • 2
    • 8
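A hedged sketch on the packaging question: a custom source is typically referenced by its fully qualified class path in the recipe, so the module only needs to be importable in the environment that actually executes the recipe (for Airflow, that means the scheduler/worker image, not your laptop). The class path and config key below are hypothetical.

```python
# Hedged sketch: referencing a custom source class directly in a recipe.
# "my_company.ingestion.custom_source.CustomSource" is a hypothetical class path;
# install (or mount) the containing package wherever this pipeline runs.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "my_company.ingestion.custom_source.CustomSource",  # hypothetical
            "config": {"some_option": "value"},  # whatever your source expects
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://datahub-gms:8080"},  # placeholder
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```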
  • j

    jolly-yacht-10587

    08/28/2022, 9:51 AM
Hi, I have questions about ingesting metadata via OpenAPI: 1. Can I specify a semantic version for an aspect when doing a POST request? I tried using this as the request body, but the version that I specified didn't seem to appear in the UI.
    Copy code
    {
      "aspect": {
        "__type": "SchemaMetadata",
        "schemaName": "mongodb",
        "platform": "urn:li:dataPlatform:mongodb",
        "platformSchema": {
          "__type": "MySqlDDL",
          "tableSchema": "schema"
        },
        "version": "3",
        "hash": "",
        "fields": [
          {
            "fieldPath": "hello",
            "jsonPath": "null",
            "nullable": true,
            "description": "test hello 18",
            "type": {
              "type": {
                "__type": "RecordType"
              }
            },
            "nativeDataType": "Record()",
            "recursive": false
          }
        ]
      },
      "entityType": "dataset",
      "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:mongodb,hello.hello3,PROD)"
    }
2. Is it possible to delete an aspect using a POST request instead of a DELETE request, by passing a body similar to the one above?
3. If I want to delete an aspect but still want it to be shown in the UI, just marked as "deleted" or something so users can view the version history of this dataset, is this possible?
    o
    • 2
    • 1
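On question 3, a hedged sketch using the Python emitter rather than OpenAPI: DataHub's soft delete writes a Status aspect with removed=True, which hides the entity from search/browse while keeping its stored aspect history. The GMS URL is a placeholder; the dataset URN is the one from the request body above.

```python
# Hedged sketch: "soft delete" a dataset by emitting Status(removed=True).
# The entity and its aspect versions remain in the backend; it is only hidden.
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ChangeTypeClass, StatusClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # placeholder

mcp = MetadataChangeProposalWrapper(
    entityType="dataset",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn="urn:li:dataset:(urn:li:dataPlatform:mongodb,hello.hello3,PROD)",
    aspectName="status",
    aspect=StatusClass(removed=True),
)
emitter.emit(mcp)
```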
  • f

    few-grass-66826

    08/28/2022, 11:47 AM
Another issue now with Airflow: I changed airflow.cfg in my Docker image and added the DataHub lineage backend, but Airflow is unable to find datahub and the whole Airflow deployment fails. I rebuilt every DataHub module but with no result.
    d
    • 2
    • 1
  • b

    better-actor-97450

    08/29/2022, 2:56 AM
I'm ingesting with Oracle, but the job status does not update to finished even though the job's log records it as finished. How can I fix it?
    f
    • 2
    • 4
  • s

    straight-agent-79732

    08/29/2022, 5:31 AM
    Does datahub support tablename.propertyName in the rules regex?
    b
    • 2
    • 4
  • a

    aloof-oil-31167

    08/29/2022, 7:57 AM
Hey, how can I get a DataHub base image with version-based tags instead of a SHA -
    Copy code
    FROM linkedin/datahub-ingestion:85a55ff
    the following one is not pulling anything -
    Copy code
    FROM linkedin/datahub-ingestion:0.8.43.3
    b
    m
    • 3
    • 9
  • f

    few-grass-66826

    08/29/2022, 8:48 AM
Hi guys, always the same issue, I've tried literally everything. Same with Airflow: pip3 install acryl-datahub[kafka-connect] fails with no matches found: acryl-datahub[kafka-connect]
    d
    • 2
    • 1
  • f

    flat-painter-78331

    08/29/2022, 9:52 AM
Hi guys. I've been working on integrating Airflow. I've installed the plugin and created the REST hook connection. After these two steps, the documentation says to define the inlets and outlets in the DAG. Can I please know how to define the inlets and outlets when doing an ingestion from MySQL to BigQuery?
    d
    j
    • 3
    • 17
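A hedged sketch of declaring inlets/outlets on a task for a MySQL-to-BigQuery job, assuming the acryl-datahub Airflow plugin's Dataset helper; the table names, project, and task command are placeholders.

```python
# Hedged sketch: manual lineage on an Airflow task via inlets/outlets.
# Dataset(platform, name) comes from the DataHub Airflow integration;
# table/project names below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from datahub_provider.entities import Dataset

with DAG(
    dag_id="mysql_to_bigquery",
    start_date=datetime(2022, 8, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    BashOperator(
        task_id="load_orders",
        bash_command="echo 'run the actual MySQL -> BigQuery job here'",
        # lineage declared manually: MySQL table in, BigQuery table out
        inlets=[Dataset("mysql", "shop_db.orders")],
        outlets=[Dataset("bigquery", "my-project.analytics.orders")],
    )
```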
  • s

    square-hair-99480

    08/29/2022, 10:15 AM
Hi friends, my doubt/problem is: I created a Snowflake ingestion. Initially I did not specify its platform_instance, so it appeared with the name datahub in the UI. After a few days of ingesting data I had to change it and add a platform_instance to this ingestion, since I would be ingesting data from two distinct Snowflake accounts. Later I was asked to change the platform_instance value another time. So now when I go in the UI to Datasets -> Prod -> Snowflake I see 3 names (datahub, name_01, name_02) for the same ingestion job. How can I delete the older data so I only see and access the data related to the last value of the ingestion's platform_instance? I have tried things like
    Copy code
    datahub delete --urn "urn:li:dataPlatformInstance:(urn:li:dataPlatform:snowflake,DATAHUB.datahub,PROD)" --soft
    but it did not work.
    • 1
    • 1
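A hedged sketch of the direction I would try: the stale entities are the datasets themselves, so the URNs to delete are dataset URNs that embed the old platform_instance, not the dataPlatformInstance URN. The snippet below just builds those URNs so they can be soft-deleted (for example with datahub delete --urn ... --soft); the instance and table names are placeholders.

```python
# Hedged sketch: construct dataset URNs that carry the old platform_instance,
# for use with a soft delete. Instance and table names are placeholders.
from datahub.emitter.mce_builder import make_dataset_urn_with_platform_instance

stale_tables = ["DB.SCHEMA.TABLE_A", "DB.SCHEMA.TABLE_B"]  # placeholders
for table in stale_tables:
    print(
        make_dataset_urn_with_platform_instance(
            platform="snowflake",
            name=table,
            platform_instance="name_01",  # old instance value to clean up
            env="PROD",
        )
    )
```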
  • a

    alert-fall-82501

    08/29/2022, 11:44 AM
Hi team - I am scheduling Hive metadata transfer jobs in an Apache Airflow DAG. In the Airflow Docker container it shows that Hive is disabled: No module found "pyhive". How do I enable this? Please help. Thanks in advance.
    d
    • 2
    • 19
  • a

    aloof-oil-31167

    08/29/2022, 1:22 PM
Hey, I'm trying to use the Spark lineage feature and getting the following error from the driver -
    Copy code
    Caused by: java.lang.ClassNotFoundException: datahub.spark.DatahubSparkListener
I added the following configs to the Spark session -
    Copy code
    "spark.jars.packages" = "io.acryl:datahub-spark-lineage:0.8.23",
    "spark.extraListeners" = "datahub.spark.DatahubSparkListener",
    "spark.datahub.rest.server" = ${?DATAHUB_URL},
    "spark.datahub.rest.token" = ${?DATAHUB_TOKEN}
    "spark.datahub.metadata.dataset.env" = "STG"
    does anyone have an idea?
    g
    c
    f
    • 4
    • 15
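A hedged PySpark sketch of the same setup: a ClassNotFoundException for DatahubSparkListener usually means the lineage jar was not resolvable on the driver classpath when the session started, so spark.jars.packages (or an explicit --packages/--jars on spark-submit) has to succeed before the listener is registered; a blocked route to the Maven repository ends in the same error. The server URL and token below are placeholders.

```python
# Hedged sketch (PySpark): register the DataHub Spark lineage listener.
# The io.acryl:datahub-spark-lineage package must be downloadable/resolvable
# at session start, otherwise the listener class cannot be found.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lineage-test")
    .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.23")
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
    .config("spark.datahub.rest.server", "https://datahub-gms.example.com")  # placeholder
    .config("spark.datahub.rest.token", "<token>")  # placeholder
    .config("spark.datahub.metadata.dataset.env", "STG")
    .getOrCreate()
)
```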
  • s

    stocky-minister-77341

    08/29/2022, 1:42 PM
Hi! I'm trying to set up a new MySQL ingestion. I'm getting an error that mysql is disabled, although I see that it is enabled when running datahub check plugins. I'll add the stack trace and the recipe file in the comments. Any ideas?
    g
    • 2
    • 5
  • b

    brave-businessperson-3969

    08/29/2022, 2:09 PM
Hi, I have some significant problems with the Trino ingestion when profiling is enabled (platform: DataHub 0.8.43 with the acryldata command line tool 0.8.43.5; the source is Starburst, a commercial variant of Trino):
• Various errors of the type trino/sqlalchemy/datatype.py:209 SAWarning: Did not recognize type 'str'. This error shows up for various data types (text, bool, float, float64, int32, etc.)
• config.table_pattern.allow changes the table names in DataHub when used
• TrinoUserError TABLE_NOT_FOUND errors. For some reason the ingestion source replaces _ in table names with $ and then, of course, does not find the table
• Schema not found exception: for some reason the schema name gets additional double quotes (e.g. trino.exceptions.TrinoUserError: [...] Schema '"dwh"' does not exist)
All these errors only show up if profiling is enabled and table_pattern.allow is used. I'm willing and able to debug Python code, but currently I lack an understanding of how the Trino connector works overall (e.g. where the SQL code is generated or where the check for pattern_allow is performed). Has anybody managed to ingest table statistics from Trino, and do you have an idea how to debug these issues?
    g
    g
    g
    • 4
    • 10
  • a

    alert-coat-46957

    08/29/2022, 3:11 PM
Hi team, does anyone know if we can integrate a Databricks 🧱 data source with DataHub? Do we have any documentation?
    m
    g
    q
    • 4
    • 9
  • s

    steep-finland-24780

    08/29/2022, 6:35 PM
Hi, our team has been using Metabase as our primary BI tool and I was wondering if anyone else is also ingesting it into DataHub? Do you do any transformations, or anything beyond the basic recipe, to ingest the Metabase collections and use them as containers in DataHub, so that user-created collections are easily searchable? Does anyone have any tips on this?
    g
    w
    • 3
    • 7
  • m

    miniature-plastic-43224

    08/29/2022, 8:40 PM
Team, I have a question about LDAP ingestion. I can see that all ingested users (CorpUserInfoClass) will always be set up as "active=True" (it is hardcoded in ldap.py). It means that if I need to filter out all "not active" users (which mostly means those who are no longer with the enterprise) I need to apply an LDAP filter on my own. This is fine. However, the model project has a note on the "CorpUserInfo" object: "Deprecated! Use CorpUserStatus instead. Whether the corpUser is active, ...". Yet I don't see CorpUserStatus during LDAP ingestion; the MCE doesn't have it. So, where should I get CorpUserStatus from?
    b
    • 2
    • 1
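A hedged sketch of one way to fill that gap: derive active/suspended state from your own LDAP filter and emit a corpUserStatus aspect yourself after (or alongside) the LDAP ingestion. The status string, username, and GMS URL below are illustrative assumptions.

```python
# Hedged sketch: emit a corpUserStatus aspect for a user, since the LDAP
# source only writes corpUserInfo. Status value and user name are illustrative.
import time

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    ChangeTypeClass,
    CorpUserStatusClass,
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # placeholder

now = AuditStampClass(time=int(time.time() * 1000), actor="urn:li:corpuser:datahub")
status = CorpUserStatusClass(status="SUSPENDED", lastModified=now)  # or "ACTIVE"

emitter.emit(
    MetadataChangeProposalWrapper(
        entityType="corpuser",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn="urn:li:corpuser:jdoe",  # placeholder
        aspectName="corpUserStatus",
        aspect=status,
    )
)
```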
  • c

    careful-insurance-60247

    08/29/2022, 9:49 PM
I have noticed a few character-case mismatches in URNs with MSSQL and Tableau when ingesting lineage. Will other database source ingestion processes support the convert_urns_to_lowercase option that Snowflake has?
    • 1
    • 1
  • c

    cool-translator-98249

    08/29/2022, 10:57 PM
    Hi, I just got the install done and am trying a first few ingestions. When I do a dry run of our first CLI ingestion, I'm getting an error on the sink of:
    Copy code
    [2022-08-29 22:53:33,805] ERROR    {datahub.entrypoints:195} - Command failed: 
    	Tree is empty.
    g
    • 2
    • 3
  • a

    alert-fall-82501

    08/30/2022, 5:33 AM
    Copy code
    note: This error originates from a subprocess, and is likely not a problem with pip.
    error: legacy-install-failure
    
    × Encountered error while trying to install package.
    ╰─> sasl
    d
    g
    • 3
    • 13
  • a

    alert-fall-82501

    08/30/2022, 5:35 AM
Can anybody advise on this issue? I am trying to ingest metadata from Hive to DataHub and am facing this error while running pip install 'acryl-datahub[hive]'.
  • f

    few-carpenter-93837

    08/30/2022, 6:18 AM
Hey guys, just to confirm: in the current state, does ingestion through the CLI overwrite all elements of a dataset (for example tags), unless we use custom transformer logic to first request the current state from the server?
    b
    g
    • 3
    • 4
  • m

    microscopic-mechanic-13766

    08/30/2022, 7:34 AM
Good morning team, I am trying to connect Spark on Jupyter notebooks to DataHub. I have created a notebook whose Spark session is the following:
val spark = SparkSession.builder()
  .appName("test-application")
  .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.43")
  .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
  .config("spark.datahub.rest.server", "http://datahub-gms:8080")
  .enableHiveSupport()
  .getOrCreate()
After that, the initial datasets (which are not ingested into DataHub, as they are .csv files) are modified. My "problem" is that after executing the whole notebook, nothing appears in DataHub. Do I need to install anything in Jupyter itself, or does it look for the jars in some repository like Maven? I would really appreciate some guidance on how this connection works! Thanks in advance 🙂
    g
    h
    • 3
    • 3
  • b

    brave-tomato-16287

    08/30/2022, 7:36 AM
Hello all. After increasing the server limit from 20000 to 100000 we are still facing the Tableau ingestion error:
    Copy code
{'message': 'Showing partial results. The request exceeded the 100000 node limit. Use pagination, additional filtering, or both in the query to adjust results.', 'extensions':
    Can anybody suggest something?
    h
    • 2
    • 7
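A hedged sketch of the knob I would try next, assuming the Tableau source's page_size option (which, as I understand it, controls how many metadata objects are fetched per metadata API query): lowering it keeps each query under the node limit instead of raising the server limit further. Hostnames, site, and credentials below are placeholders.

```python
# Hedged sketch: Tableau recipe with a smaller page_size so each metadata API
# query stays under the server's node limit. Connection details are placeholders.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "tableau",
            "config": {
                "connect_uri": "https://tableau.example.com",  # placeholder
                "site": "my_site",  # placeholder
                "username": "svc-datahub",  # placeholder
                "password": "${TABLEAU_PASSWORD}",
                "page_size": 5,  # assumption: smaller pages => fewer nodes per query
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://datahub-gms:8080"},  # placeholder
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```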
  • a

    alert-fall-82501

    08/30/2022, 7:45 AM
    Copy code
    sqlalchemy.exc.NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:databricks.pyhive
  • a

    alert-fall-82501

    08/30/2022, 7:46 AM
Can anyone advise on this? I am trying to ingest metadata from Hive to DataHub.
    g
    • 2
    • 1
  • b

    bumpy-journalist-41369

    08/30/2022, 9:11 AM
Hello. I have a problem ingesting data from S3 buckets. I have set up DataHub in a Kubernetes cluster and am using the UI, not the CLI. The ingestion source looks like this:
sink:
  type: datahub-rest
  config:
    server: 'http://datahub-datahub-gms:8080'
source:
  type: s3
  config:
    path_spec:
      include: 's3://<bucket_name>/<table_name>/{partition_key[0]}={partition[0]}/*.parquet'
    platform: s3
    aws_config:
      aws_access_key_id: '*****'
      aws_region: us-east-1
      aws_session_token: '*******'
      aws_secret_access_key: '******'
pipeline_name: 'urn:li:dataHubIngestionSource:7ba22ca7-6c50-4b71-a766-8e89fa8fac52'
The S3 bucket structure is the following:
Bucket_name/
  Table_name/
    Sh_date=2022-08-30/
      part-00000-7a70bb8c-48b0-4c9b-bea0-585c9146c8cf.c000.snappy.parquet
      part-00001-7a70bb8c-48b0-4c9b-bea0-585c9146c8cf.c000.snappy.parquet
      ...
The ingestion fails and the output is the following:
    exec-urn_li_dataHubExecutionRequest_28e3ac39-d148-4306-b0f7-08dd063c52b9.log
    d
    • 2
    • 8
  • b

    bumpy-journalist-41369

    08/30/2022, 9:11 AM
    Does anyone have any idea how to fix the issue?
  • c

    colossal-hairdresser-6799

    08/30/2022, 9:27 AM
    UPSERT
    Python Emitter
    Add or update aspect (tags, terms, owners)
Hi, when looking at the documentation for adding tags, terms, and owners to a dataset, all the examples include: 1. Get the current tags
    Copy code
    current_tags: Optional[GlobalTagsClass] = graph.get_aspect_v2(
        entity_urn=dataset_urn,
        aspect="globalTags",
        aspect_type=GlobalTagsClass,
    )
2. Check that the tag does not already exist
    Copy code
    if current_tags:
        if tag_to_add not in [x.tag for x in current_tags.tags]:
3. If it doesn't, append it to the list of tags
    Copy code
# tags exist, but this tag is not present in the current tags
current_tags.tags.append(TagAssociationClass(tag_to_add))  # <- new tag
    4. Then add the current_tags with an UPSERT.
    Copy code
    event: MetadataChangeProposalWrapper = MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=dataset_urn,
            aspectName="globalTags",
            aspect=current_tags,
    )
My understanding of an UPSERT is "if the aspect exists, update that aspect; if not, add it". So what I don't understand is why we would need to go through steps 1-3 if we're using UPSERT in the end anyway?
    b
    g
    • 3
    • 3
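For what it's worth, a hedged sketch of steps 1-4 stitched together. The reason the read is needed: UPSERT replaces the whole globalTags aspect value, so emitting only the new tag would drop the tags that are already there; the read-modify-write preserves them. The dataset URN, tag URN, and GMS URL below are placeholders.

```python
# Hedged sketch: append a tag without losing existing tags. UPSERT writes the
# entire aspect, so current tags are fetched first and the new tag appended.
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
from datahub.metadata.schema_classes import (
    ChangeTypeClass,
    GlobalTagsClass,
    TagAssociationClass,
)

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))  # placeholder

dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table,PROD)"  # placeholder
tag_to_add = "urn:li:tag:pii"  # placeholder

current_tags = graph.get_aspect_v2(
    entity_urn=dataset_urn,
    aspect="globalTags",
    aspect_type=GlobalTagsClass,
) or GlobalTagsClass(tags=[])

if tag_to_add not in [assoc.tag for assoc in current_tags.tags]:
    current_tags.tags.append(TagAssociationClass(tag=tag_to_add))
    graph.emit(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=dataset_urn,
            aspectName="globalTags",
            aspect=current_tags,
        )
    )
```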
  • c

    colossal-hairdresser-6799

    08/30/2022, 9:54 AM
    graph.emit
Information regarding a successful update versus a skipped write when the aspect already exists
Hi, when using graph.emit to update an aspect, is there any way to see whether it was updated or just skipped because it already existed? Right now I can only see a log saying
INFO - metadata ingestion - Owner urn:li:corpGroup:test already exists, omitting write
    b
    • 2
    • 1