# ingestion
  • p

    purple-terabyte-64712

    03/06/2023, 11:48 AM
    Hi, does the ingestion file follow some metadata standard?
  • a

    alert-football-80212

    03/06/2023, 3:33 PM
    Hi all 👋 Does someone have an example of a dbt ingestion recipe, and is anyone available for a few questions?
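Since the question asks for an example: a minimal sketch of a dbt ingestion recipe, run programmatically with the Python Pipeline API. The file paths, target platform and server address are placeholders; the same source/sink blocks work in a plain YAML recipe.
    # Minimal sketch of a dbt ingestion pipeline; paths and platform are placeholders.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "dbt",
                "config": {
                    # Artifacts produced by `dbt compile` / `dbt docs generate`.
                    "manifest_path": "./target/manifest.json",
                    "catalog_path": "./target/catalog.json",
                    # Platform the dbt models materialize into (e.g. snowflake, bigquery).
                    "target_platform": "snowflake",
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()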
  • r

    red-waitress-53338

    03/06/2023, 5:53 PM
    Hi Team, I am sourcing from BigQuery and have enabled column-level profiling, but I cannot see the column-level profiling stats in the UI; I can only see table-level stats.
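A hedged thing to check rather than a confirmed fix: column-level stats only appear when profiling is enabled and not restricted to table level. A sketch of the relevant profiling block (project and server values are placeholders; credentials and the rest of the connection settings are omitted):
    # Sketch of the profiling portion of a BigQuery recipe; all values are placeholders.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "bigquery",
                "config": {
                    "project_id": "my-gcp-project",  # placeholder; credentials omitted
                    "profiling": {
                        "enabled": True,
                        # If this is True, only table-level stats are emitted.
                        "profile_table_level_only": False,
                    },
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()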
  • l

    lively-dusk-19162

    03/06/2023, 8:18 PM
    Hello everyone, can anyone clarify whether resolver test code should be written in the graphql module, and what that resolver test code is for?
    ✅ 1
  • r

    red-waitress-53338

    03/06/2023, 10:10 PM
    Hi Team, is there a way to extract BigQuery tags as part of the ingestion? If yes, I am wondering where those tags would be stored in DataHub, because in BigQuery there are Tag Templates and they are key-value pairs - those key-value pairs cannot be mapped directly to tags in DataHub (which are more like hashtags). I think the right place would be the properties section in DataHub, where those Tag Templates could be mapped.
  • c

    clever-author-65853

    03/07/2023, 7:25 AM
    Hey community, is there a way for tables synced from Snowflake to inherit the dbt documentation that was ingested for them?
    ✅ 1
  • e

    elegant-salesmen-99143

    03/07/2023, 7:44 AM
    Is there a way to add the pattern_add_dataset_terms transformer not within a single ingestion recipe, but across all ingestions and all data sources at once?
    ✅ 1
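As far as I know transformers are declared per recipe, so applying this everywhere means repeating (or templating) the same block in each recipe. A sketch of that block, with a hypothetical regex and glossary term:
    # Sketch of the transformer block that would be added to each recipe's config dict
    # (the same structure goes under `transformers:` in a YAML recipe).
    transformers_block = [
        {
            "type": "pattern_add_dataset_terms",
            "config": {
                "term_pattern": {
                    "rules": {
                        # Hypothetical rule: datasets whose URN matches the regex get this term.
                        ".*customers.*": ["urn:li:glossaryTerm:Classification.PII"],
                    }
                }
            },
        }
    ]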
  • b

    brave-judge-32701

    03/07/2023, 7:45 AM
    I’m on Spark 3.2.3 and used spark-shell to run this SQL:
    create table test.testtable4 as select * from test.testtable3
    but testtable4's upstream shows up as
    sql at <console>:23
    instead of
    testtable3
    Is this a compatibility issue? Also, Spark runs on Hive, and a Hive table created by Spark does not show up in DataHub immediately; I need to run a batch ingestion task to ingest the Hive metastore data.
    ✅ 1
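For context, a sketch of the usual wiring of the DataHub Spark lineage agent on a Spark session; the package version, listener configuration keys and GMS address should be verified against the agent docs for the versions in use:
    # Sketch: typical DataHub Spark lineage agent configuration on a SparkSession.
    # Package version and server address are placeholders; check the exact config
    # keys against the datahub-spark-lineage documentation for your version.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("lineage-test")
        .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.10.0")
        .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
        .config("spark.datahub.rest.server", "http://localhost:8080")
        .enableHiveSupport()
        .getOrCreate()
    )

    # The CTAS from the question; the agent reports the job's lineage to DataHub.
    spark.sql("create table test.testtable4 as select * from test.testtable3")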
  • f

    fresh-balloon-59613

    03/07/2023, 9:44 AM
    Hi, I am trying to visualize the lineage of tables and stored procedures using inlets and outlets in Airflow. After running the DAG, when we check the lineage to get the dataset details, it creates a new dummy table/stored procedure. But I want it to point to the original dataset which we have already ingested into DataHub.
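A sketch of how inlets/outlets are typically declared with the DataHub Airflow plugin so they resolve to the dataset that was already ingested: the platform, name and env must match that dataset's URN exactly, otherwise a new placeholder entity is created. The import path and the example names below are assumptions for the plugin version in use:
    # Sketch, assuming the acryl-datahub Airflow plugin circa early 2023; the entity's
    # platform/name/env must exactly match the URN of the dataset already in DataHub,
    # otherwise a new (empty) dataset entity is created.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from datahub_provider.entities import Dataset  # import path may differ by version

    with DAG(
        "lineage_example",
        start_date=datetime(2023, 3, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        task = BashOperator(
            task_id="run_proc",
            bash_command="echo run",
            inlets=[Dataset(platform="mssql", name="mydb.dbo.source_table", env="PROD")],
            outlets=[Dataset(platform="mssql", name="mydb.dbo.target_table", env="PROD")],
        )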
  • g

    great-flag-53653

    03/07/2023, 3:48 PM
    Hi. I'm on the latest version, 0.10.0, and I've successfully ingested into my local environment from BigQuery, Looker and dbt - but now that I'm trying to ingest "LookML" I'm getting error messages like this. After searching here in Slack, the only previously suggested answer has been to upgrade GMS to a newer version, but I'm already on the latest version:
    Pipeline running successfully so far; produced 1 events in 2 minutes and 13.81 seconds.
    |[2023-03-07 16:45:46,896] ERROR    {datahub.ingestion.run.pipeline:62} -  failed to write record with workunit lookml-view-LookerViewId(project_name='lookml-bq-prd', model_name='ip-model', view_name='account') with ('Unable to emit metadata to DataHub GMS', {'message': 'HTTPConnectionPool(host=\'localhost\', port=8080): Max retries exceeded with url: /entities?action=ingest (Caused by ReadTimeoutError("HTTPConnectionPool(host=\'localhost\', port=8080): Read timed out. (read timeout=30)"))', 'id': 'urn:li:dataset:(urn:li:dataPlatform:looker,lookml-bq-prd.view.account,PROD)'}) and info {'message': 'HTTPConnectionPool(host=\'localhost\', port=8080): Max retries exceeded with url: /entities?action=ingest (Caused by ReadTimeoutError("HTTPConnectionPool(host=\'localhost\', port=8080): Read timed out. (read timeout=30)"))', 'id': 'urn:li:dataset:(urn:li:dataPlatform:looker,lookml-bq-prd.view.account,PROD)'}
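The error is a client-side read timeout against GMS rather than a LookML parsing problem, so one hedged option is raising the REST sink timeout. A sketch with placeholder Looker credentials; the timeout_sec option should be verified against the datahub-rest sink config for the installed CLI:
    # Sketch: same lookml source, larger datahub-rest sink timeout. All values are
    # placeholders; timeout_sec's default (30s) matches the read timeout in the error.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "lookml",
                "config": {
                    "base_folder": "./lookml-bq-prd",
                    "api": {
                        "base_url": "https://company.looker.com",
                        "client_id": "${LOOKER_CLIENT_ID}",
                        "client_secret": "${LOOKER_CLIENT_SECRET}",
                    },
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080", "timeout_sec": 120},
            },
        }
    )
    pipeline.run()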
  • n

    numerous-scientist-83156

    03/07/2023, 4:00 PM
    Hello, I am trying to build an ADLS Gen2 ingestor in Python. Based on the storage account and the container within that storage account, the ingestor should pull every folder (which in our case is used as the dataset name) and derive the schema from the files within. So the current config is:
    source:
      type: adlsg2.source.ADLSg2Source
      config:
        platform_instance: ${ADLS_ACCOUNT}
        env: ${ENVIRONMENT}
        adls:
          account_name: ${ADLS_ACCOUNT}
          container_name: ${ADLS_CONTAINER}
          base_path: /
          tenant_id: ${TENANT_ID}
          client_id: ${CLIENT_ID}
          client_secret: ${CLIENT_SECRET}
    Currently, the way I am building this, I am using the storage account's name as the platform instance: since the platform is adlsGen2, it makes sense for the instance to be ourInstanceTest. I want to make a DataHub container, based on the name of the ADLS container, that the datasets can live in. The idea is that it gives us a nicely structured URN and search flow in DataHub: _`urn:li:dataset:(urn:li:dataPlatform:<platform>,<platform_instance>.<adls_container>.<dataset_name>,<env>)`_. I use the gen_containers() function from the datahub.emitter.mcp_builder module to make the MetadataWorkUnits, and then use the add_dataset_to_container function to add my dataset to the container. But this is where I run into trouble. If I just do as described above, my dataset URN looks like this:
    urn:li:dataset:(urn:li:dataPlatform:adlsGen2,<platform_instance>.<dataset_name>,TEST)
    There is no container present in the URN, and there is also no container present in the breadcrumbs (see first image). If I then add the ADLS container's name as part of the dataset name, separated by a forward slash (<container_name>/<dataset_name>), I get the following URN:
    urn:li:dataset:(urn:li:dataPlatform:adlsGen2,<platform_instance>.<container_name>/<dataset_name>,TEST)
    This time the container is present in the dataset URN, and the breadcrumbs also work... sort of. One of the breadcrumbs does not separate the platform_instance and the container_name from each other, so the breadcrumb ends up looking like <platform_instance>.<container_name>, which is not what I want (see second image). The reason for this test is that, from what I understand, this is how it is done in the existing [s3 ingestor](https://github.com/datahub-project/datahub/blob/1d3339276129a7cb8385c07a958fcc93ac[…]ta-ingestion/src/datahub/ingestion/source/s3/data_lake_utils.py). Though I do not think the s3 ingestor uses the platform_instance; it just makes a container to represent that? It also seems a bit off, since it generates its datasets with / between container and dataset name, whereas mssql uses . ? But what I do not understand is that the [mssql](https://github.com/datahub-project/datahub/blob/1d3339276129a7cb8385c07a958fcc93ac[…]etadata-ingestion/src/datahub/ingestion/source/sql/sql_utils.py) ingestor also "just" calls the gen_containers function and uses the platform_instance, but their implementation works without weird breadcrumbs? Am I misunderstanding how the platform instance is supposed to work, or how containers are supposed to be configured?
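Not an answer from the thread, just a hedged sketch of the pattern described above: emit one DataHub container per ADLS container and attach each dataset to it with add_dataset_to_container, so the breadcrumb comes from containment rather than from the dataset name. The key class and its fields are assumptions and may need adjusting to the installed datahub version:
    # Hypothetical sketch (not the author's code). Class and field names are assumptions
    # about datahub.emitter.mcp_builder circa v0.10.x; newer releases rename the key
    # base class to ContainerKey, so verify against your installed version.
    from datahub.emitter.mce_builder import make_dataset_urn_with_platform_instance
    from datahub.emitter.mcp_builder import PlatformKey, add_dataset_to_container, gen_containers


    class AdlsContainerKey(PlatformKey):
        # Extra field that makes the container key unique per ADLS container (hypothetical).
        container: str


    PLATFORM = "adlsGen2"
    INSTANCE = "ourInstanceTest"  # storage account used as platform_instance
    ENV = "TEST"


    def dataset_workunits(adls_container: str, dataset_name: str):
        key = AdlsContainerKey(platform=PLATFORM, instance=INSTANCE, container=adls_container)

        # Emit the container entity representing the ADLS container.
        yield from gen_containers(container_key=key, name=adls_container, sub_types=["ADLS container"])

        # Dataset URN: platform instance plus a dotted name; no slash needed, because the
        # containment (and hence the breadcrumb) comes from add_dataset_to_container below.
        dataset_urn = make_dataset_urn_with_platform_instance(
            platform=PLATFORM,
            name=f"{adls_container}.{dataset_name}",
            platform_instance=INSTANCE,
            env=ENV,
        )
        yield from add_dataset_to_container(container_key=key, dataset_urn=dataset_urn)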
  • m

    magnificent-sunset-43871

    03/07/2023, 4:58 PM
    Hi folks! I'm trying to ingest a JSON schema via Meltano. The schema looks like this:
    schema.json
  • m

    magnificent-sunset-43871

    03/07/2023, 4:59 PM
    And the Meltano config is like this:
    meltano.yml
  • m

    magnificent-sunset-43871

    03/07/2023, 5:00 PM
    This is the recipe:
    recipe.yml
  • m

    magnificent-sunset-43871

    03/07/2023, 5:01 PM
    With these configs I can start the ingestion like this: meltano --cwd ./meltano invoke datahub ingest -c ingest-recipe.yml
  • m

    magnificent-sunset-43871

    03/07/2023, 5:01 PM
    But I get an error:
  • m

    magnificent-sunset-43871

    03/07/2023, 5:01 PM
    Failed to process file output/schemas/Administrative-Unit.json
    Traceback (most recent call last):
      File "/home/dev/repos/bedag/dat-mgmt-poc-di-dk/dat-mgmt-poc-di-dk/meltano/.meltano/utilities/datahub/venv/lib/python3.8/site-packages/datahub/ingestion/source/schema/json_schema.py", line 374, in get_workunits_internal
        yield from self._load_one_file(
      File "/home/dev/repos/bedag/dat-mgmt-poc-di-dk/dat-mgmt-poc-di-dk/meltano/.meltano/utilities/datahub/venv/lib/python3.8/site-packages/datahub/ingestion/source/schema/json_schema.py", line 279, in _load_one_file
        meta: models.SchemaMetadataClass = get_schema_metadata(
      File "/home/dev/repos/bedag/dat-mgmt-poc-di-dk/dat-mgmt-poc-di-dk/meltano/.meltano/utilities/datahub/venv/lib/python3.8/site-packages/datahub/ingestion/extractor/json_schema_util.py", line 546, in get_schema_metadata
        schema_fields = list(JsonSchemaTranslator.get_fields_from_schema(json_schema))
      File "/home/dev/repos/bedag/dat-mgmt-poc-di-dk/dat-mgmt-poc-di-dk/meltano/.meltano/utilities/datahub/venv/lib/python3.8/site-packages/datahub/ingestion/extractor/json_schema_util.py", line 510, in get_fields_from_schema
        yield from JsonSchemaTranslator.get_fields(
      File "/home/dev/repos/bedag/dat-mgmt-poc-di-dk/dat-mgmt-poc-di-dk/meltano/.meltano/utilities/datahub/venv/lib/python3.8/site-packages/datahub/ingestion/extractor/json_schema_util.py", line 475, in get_fields
        yield from generator.__get__(cls)(
      File "/home/dev/repos/bedag/dat-mgmt-poc-di-dk/dat-mgmt-poc-di-dk/meltano/.meltano/utilities/datahub/venv/lib/python3.8/site-packages/datahub/ingestion/extractor/json_schema_util.py", line 359, in _field_from_complex_type
        yield from JsonSchemaTranslator.get_fields(
      File "/home/dev/repos/bedag/dat-mgmt-poc-di-dk/dat-mgmt-poc-di-dk/meltano/.meltano/utilities/datahub/venv/lib/python3.8/site-packages/datahub/ingestion/extractor/json_schema_util.py", line 467, in get_fields
        datahub_type = JsonSchemaTranslator.get_type_mapping(json_type)
      File "/home/dev/repos/bedag/dat-mgmt-poc-di-dk/dat-mgmt-poc-di-dk/meltano/.meltano/utilities/datahub/venv/lib/python3.8/site-packages/datahub/ingestion/extractor/json_schema_util.py", line 440, in get_type_mapping
        return JsonSchemaTranslator.field_type_mapping.get(json_type, NullTypeClass)
  • m

    magnificent-sunset-43871

    03/07/2023, 5:02 PM
    It's because the JsonSchemaTranslator cannot translate the "type" when it's an array:
  • m

    magnificent-sunset-43871

    03/07/2023, 5:02 PM
    When I remove the array as the type and only write a string, then it works.
  • m

    magnificent-sunset-43871

    03/07/2023, 5:03 PM
    Do you maybe have some suggestions, or do you already know this bug? Maybe I also just configured it wrong.
  • m

    magnificent-sunset-43871

    03/07/2023, 5:03 PM
    Thanks in advance for your help.
  • m

    magnificent-sunset-43871

    03/07/2023, 5:05 PM
    I don't think this error is related to Meltano, because under the hood Meltano just starts the Python command for the ingestion. So the big question is: is it possible to ingest JSON schemas which have an array as a type? (like type: ["string", "null"])
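A hedged sketch of the workaround described above, applied as a preprocessing step instead of hand-editing schemas: rewrite union types such as ["string", "null"] to their non-null member before running the ingestion. Whether newer CLI versions handle union types natively is worth checking separately.
    # Sketch of a preprocessing step for JSON schemas whose "type" is an array such as
    # ["string", "null"]: keep the first non-null member, i.e. the manual workaround
    # described above. File paths are placeholders.
    import json
    from pathlib import Path


    def simplify_union_types(node):
        """Recursively rewrite {"type": ["string", "null"]} into {"type": "string"}."""
        if isinstance(node, dict):
            t = node.get("type")
            if isinstance(t, list):
                non_null = [x for x in t if x != "null"]
                node["type"] = non_null[0] if non_null else "null"
            for value in node.values():
                simplify_union_types(value)
        elif isinstance(node, list):
            for item in node:
                simplify_union_types(item)
        return node


    path = Path("output/schemas/Administrative-Unit.json")
    schema = json.loads(path.read_text())
    path.with_suffix(".simplified.json").write_text(json.dumps(simplify_union_types(schema), indent=2))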
  • g

    green-lock-62163

    03/07/2023, 5:48 PM
    Hello, I'm trying to launch
    lineage_emitter_dataset_finegrained.py
    from the metadata-ingestion/library. Where is the test data corresponding to the column lineage produced in that code section? I might be missing an obvious setup phase...
  • f

    few-branch-52297

    03/07/2023, 10:00 PM
    Hi, I have a use case where there is a producer Java application writing to Kafka and a consumer Java application reading from Kafka. I need to capture the lineage of these custom producer/consumer applications. • Should I use the Java emitter (REST/Kafka) for this? • Is there any sample code in Java to emit lineage?
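A sketch of emitting lineage for the Kafka case, written in Python for brevity; the Java emitter in the datahub-client package builds the same MetadataChangeProposal aspects. URNs and the server address are placeholders:
    # Sketch: declare the Kafka topic as an upstream of the table the consumer writes to.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        DatasetLineageTypeClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    topic_urn = make_dataset_urn(platform="kafka", name="orders-topic", env="PROD")
    sink_table_urn = make_dataset_urn(platform="mysql", name="shop.orders", env="PROD")

    lineage = UpstreamLineageClass(
        upstreams=[UpstreamClass(dataset=topic_urn, type=DatasetLineageTypeClass.TRANSFORMED)]
    )
    emitter.emit(MetadataChangeProposalWrapper(entityUrn=sink_table_urn, aspect=lineage))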
  • n

    nice-farmer-35903

    03/07/2023, 11:45 PM
    @astonishing-answer-96712 added a workflow to this channel: *Community Support Bot*.
  • i

    important-afternoon-19755

    03/08/2023, 1:39 AM
    Hi Team, I want to use the Python emitter for Glue metadata ingestion. After installing
    'acryl-datahub[datahub-rest]'
    in Python, I try to import
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    but I get an error. Does anybody know the solution? Below is the traceback of the error.
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    /tmp/ipykernel_6987/4112535850.py in <module>
          5 
          6 from datahub.emitter.mcp import MetadataChangeProposalWrapper
    ----> 7 from datahub.emitter.rest_emitter import DatahubRestEmitter
          8 from datahub.emitter.mce_builder import (
          9     get_sys_time,
    
    ~/.local/lib/python3.7/site-packages/datahub/emitter/rest_emitter.py in <module>
         11 from requests.exceptions import HTTPError, RequestException
         12 
    ---> 13 from datahub.cli.cli_utils import get_system_auth
         14 from datahub.configuration.common import ConfigurationError, OperationalError
         15 from datahub.emitter.mcp import MetadataChangeProposalWrapper
    
    ~/.local/lib/python3.7/site-packages/datahub/cli/cli_utils.py in <module>
         11 import requests
         12 import yaml
    ---> 13 from pydantic import BaseModel, ValidationError
         14 from requests.models import Response
         15 from requests.sessions import Session
    
    ~/.local/lib/python3.7/site-packages/pydantic/__init__.cpython-37m-x86_64-linux-gnu.so in init pydantic.__init__()
    
    ~/.local/lib/python3.7/site-packages/pydantic/dataclasses.cpython-37m-x86_64-linux-gnu.so in init pydantic.dataclasses()
    
    ~/.local/lib/python3.7/site-packages/pydantic/main.cpython-37m-x86_64-linux-gnu.so in init pydantic.main()
    
    TypeError: dataclass_transform() got an unexpected keyword argument 'field_specifiers'
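For reference, what the emitter usage would look like once the import succeeds; the traceback above fails while importing pydantic, which usually points at a pydantic / typing-extensions mismatch in that Python 3.7 environment rather than at the emitter itself. A minimal sketch with placeholder URN and server:
    # Minimal sketch of emitting one aspect for a Glue table with the Python REST emitter;
    # the dataset name, description and GMS address are placeholders.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import DatasetPropertiesClass

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    mcp = MetadataChangeProposalWrapper(
        entityUrn=make_dataset_urn(platform="glue", name="my_database.my_table", env="PROD"),
        aspect=DatasetPropertiesClass(description="Table registered in AWS Glue"),
    )
    emitter.emit(mcp)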
  • a

    adorable-computer-92026

    03/08/2023, 10:59 AM
    Hello, I want to know: when I run the command 'datahub docker ingest-sample-data', where is the data ingested from, and where is it stored (in which database)? Is it in MySQL? Thank you!
    ✅ 1
  • f

    fresh-zoo-34934

    03/08/2023, 12:40 PM
    [Bulk Emit MetadataChangeProposal] Hi Team, I tried to add assertion data to DataHub using Python; is there a way to send this data in bulk, or perhaps using YAML file ingestion?
    ✅ 1
    plus1 1
  • w

    white-hydrogen-24531

    03/08/2023, 6:28 PM
    Hi Team, we created a set of glossary terms through a
    business_glossary.yml
    file, and they do show up in the UI after successfully executing the YAML file. But when we attach the terms to Hive datasets, DataHub randomly throws exceptions about duplicate terms. DataHub version: 0.8.45
    Caused by: java.lang.IllegalStateException: Duplicate key EntityAspectIdentifier(urn=urn:li:glossaryTerm:ABC>TEST, aspect=glossaryTermKey, version=0) (attempted merging values com.linkedin.metadata.entity.EntityAspect@4c802884 and com.linkedin.metadata.entity.EntityAspect@4c802884)
    ✅ 1
  • b

    bitter-school-38051

    03/08/2023, 9:22 PM
    Hi everybody! I have added a new custom ingestion source (according to https://datahubproject.io/docs/metadata-ingestion/adding-source/), built both the frontend Docker container and the GMS Docker container (using ./gradlew datahub-frontend:build, ./gradlew datahub-frontend:dist, ./gradlew datahub-frontend:docker and ./gradlew metadata-service:war:docker with build), and deployed both containers. I can see the new source in the UI, but it fails, and in the logs there is the error: Failed to find a registered source for type my_source_name: ‘Did not find a registered class for my_source_name’. Maybe I forgot to do something else? I am running all DataHub containers: datahub-actions, datahub-gms, datahub-frontend.
    ✅ 1
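A hedged note on the registration error above: a short source type name only resolves if the package exposing the source is installed where ingestion actually runs (for UI-based ingestion that is typically the datahub-actions container) and registers it as a plugin entry point; alternatively, the fully qualified class path can be used as the recipe's type. A sketch of the entry-point wiring, with package, module and class names as placeholders and the group name to be verified against the adding-a-source docs:
    # Sketch of registering a custom source as a plugin so `type: my_source_name` resolves;
    # package, module and class names are placeholders, and the entry-point group name
    # should be checked against the "adding a source" docs for your version.
    from setuptools import find_packages, setup

    setup(
        name="my-datahub-sources",
        version="0.1.0",
        packages=find_packages(),
        install_requires=["acryl-datahub"],
        entry_points={
            "datahub.ingestion.source.plugins": [
                "my_source_name = my_sources.my_source:MySource",
            ],
        },
    )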