# ingestion
  • few-air-56117
    01/04/2022, 12:13 PM
    Hi guys, I have a question. I put DataHub on Google Kubernetes (GKE), but I am not sure how I can run the ingestion scripts. On my local machine I used a recipe.yaml and ran datahub ingest, but I am not sure how to do that in k8s. Thx 😄
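    A minimal hedged sketch of one way to do this: run the same recipe programmatically with the acryl-datahub Python API, package the script into a container image, and let a Kubernetes CronJob run it on a schedule. The source/sink values and the in-cluster GMS address below are placeholders, not taken from this thread:

    ```python
    # Hedged sketch: run a recipe with the acryl-datahub Python API instead of the CLI.
    # Host/credentials and the GMS service name are placeholders; adjust to your cluster.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {
                    "host_port": "mysql:3306",
                    "database": "mydb",
                    "username": "user",
                    "password": "pass",
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://datahub-gms:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()
    ```

    The same effect can be achieved by baking the recipe.yaml into an image that simply runs `datahub ingest -c recipe.yaml` as the CronJob command.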
  • gentle-nest-904
    01/04/2022, 12:26 PM
    Hi guys, I have a question as well. I'm not a technical expert anymore, but I'm an ex-developer (15 years ago), so hopefully you guys can guide/help me.
  • gentle-nest-904
    01/04/2022, 12:27 PM
    I've got APIs that can be called from an application called AFAS. Those APIs return metadata specifications, i.e. they describe the data elements returned by the APIs. Is it possible to load such a thing automatically into DataHub?
  • nice-country-99675
    01/04/2022, 2:22 PM
    👋 Hi Team! I was taking a look at the new redshift-usage source ingestor... and I found this scenario. In the regular redshift metadata ingestor I have a database alias, and some schemas and tables filtered out. Is there a way to apply the same settings to the redshift-usage recipe?
  • few-air-56117
    01/04/2022, 5:49 PM
    Hi, is it possible to start an ingestion from Python, and also to interact with a DataHub running on k8s from Python (e.g. deleting a dataset)?
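    For the second half, a minimal hedged sketch of talking to GMS from Python over plain REST (for example after a `kubectl port-forward svc/datahub-gms 8080:8080`); the URN and server address are made-up examples, and deletion is also exposed through the `datahub delete` CLI command:

    ```python
    # Hedged sketch: read an entity back from GMS over REST from Python.
    # The GMS address and URN are placeholders.
    import urllib.parse
    import requests

    GMS = "http://localhost:8080"
    urn = "urn:li:dataset:(urn:li:dataPlatform:mysql,mydb.mytable,PROD)"

    resp = requests.get(
        f"{GMS}/entities/{urllib.parse.quote(urn, safe='')}",
        headers={"X-RestLi-Protocol-Version": "2.0.0"},
    )
    resp.raise_for_status()
    print(resp.json())
    ```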
  • some-crayon-90964
    01/04/2022, 9:52 PM
    Two quick questions: 1. Is there a way to have ingestion run frequently (such as BigQuery -> DataHub every 12 hours)? 2. How do we configure the yml file so that it lists all projects, without us hardcoding every project we have? Thanks in advance!
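    On the scheduling half, one common approach (a hedged sketch, not a recommendation from this thread) is to drive `datahub ingest` from an orchestrator such as Airflow on a 12-hour schedule; the DAG id, recipe path and start date below are placeholders:

    ```python
    # Hypothetical Airflow DAG that shells out to the DataHub CLI every 12 hours.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="bigquery_to_datahub",
        start_date=datetime(2022, 1, 1),
        schedule_interval="0 */12 * * *",  # every 12 hours
        catchup=False,
    ) as dag:
        ingest = BashOperator(
            task_id="datahub_ingest",
            bash_command="datahub ingest -c /opt/recipes/bigquery.yml",
        )
    ```

    A plain cron entry running the same command works just as well if you do not already run an orchestrator.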
  • nice-planet-17111
    01/05/2022, 1:15 AM
    Hi team, 🙂 I was trying to ingest dbt on BigQuery... until I realized BigQuery ingestion now supports lineage information. So I was testing using lineage information only from BigQuery, after hard deleting all dbt entities. However, when I do it, dbt entities (=nodes) still appear in the lineage graphs like haunting ghosts. How can I REALLY delete them? 🙂
  • loud-holiday-22352
    01/05/2022, 7:10 AM
    Hello, when I run 'datahub ingest -c ./recipe.yml' I get an error 【ValueError: This version of acryl-datahub requires GMS v0.8.0 or higher】. Is there something wrong with the installation? Thank you.
  • gentle-sundown-2310
    01/05/2022, 6:38 PM
    Hey Team, I am ingesting a MySQL database and this is my recipe:
    source:
      type: mysql
      config:
        # Coordinates
        host_port:
        database: tableau_advocate_lsnr
        # Credentials
        username:
        password:
        schema_pattern:
          allow:
            - "tableau_advocate_lsnr"
        profiling:
          enabled: true
        profile_pattern:
          allow:
            - ".*standard_module_daily"
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"
  • gentle-sundown-2310
    01/05/2022, 6:39 PM
    I am getting this error:
    2 validation errors for MySQLConfig
    profile_pattern
    extra fields not permitted (type=value_error.extra)
    profiling
    extra fields not permitted (type=value_error.extra)
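    For reference, a hedged restatement of the intended nesting as a programmatic pipeline (credentials and host are placeholders): `profiling` and `profile_pattern` sit under source.config, and one common cause of the "extra fields not permitted" error is an acryl-datahub build that predates SQL profiling support, in which case pydantic rejects the unknown keys:

    ```python
    # Hedged sketch: same recipe expressed as a programmatic pipeline, to make the nesting explicit.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {
                    "host_port": "localhost:3306",  # placeholder
                    "database": "tableau_advocate_lsnr",
                    "username": "user",             # placeholder
                    "password": "pass",             # placeholder
                    "schema_pattern": {"allow": ["tableau_advocate_lsnr"]},
                    "profiling": {"enabled": True},
                    "profile_pattern": {"allow": [".*standard_module_daily"]},
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    ```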
  • wide-helicopter-97009
    01/05/2022, 9:42 PM
    Hi Team, for the metadata-ingestion in this document, do you have a demo from any previous town hall meeting for this ingestion method? https://datahubproject.io/docs/metadata-ingestion
  • damp-ambulance-34232
    01/06/2022, 4:10 AM
    Hi, is there any DataHub API service to delete a URN?
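    As a hedged illustration (not necessarily the only mechanism): the `datahub delete --urn ...` CLI command covers this, and a soft delete can also be done from Python by emitting a Status aspect with removed=True. The URN and GMS address below are placeholders:

    ```python
    # Hedged sketch: soft-delete a dataset by emitting a Status aspect with removed=True.
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, StatusClass

    emitter = DatahubRestEmitter("http://localhost:8080")  # placeholder GMS address

    mcp = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=builder.make_dataset_urn("mysql", "mydb.mytable", "PROD"),  # example URN
        aspectName="status",
        aspect=StatusClass(removed=True),
    )
    emitter.emit_mcp(mcp)
    ```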
  • gentle-florist-49869
    01/06/2022, 8:33 PM
    Hello, I'm working on creating a custom Kafka emitter (Python). Could someone here please provide or share an example where I can construct a MySQL object (columns, schema, etc.)?
    # Construct a MetadataChangeProposalWrapper object.
    metadata_event = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=builder.make_dataset_urn("mysql", "emmiter_project.source.user-compras", "DEV"),
        aspectName="datasetProperties",
        aspect=dataset_properties,
    )
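    A hedged, self-contained sketch of what a complete MCP emitted through the Kafka emitter might look like, here attaching a DatasetProperties aspect. The bootstrap/schema-registry addresses, description and custom properties are placeholders, and the KafkaEmitterConfig fields and emit method name are my assumptions; double-check them against your acryl-datahub version:

    ```python
    # Hedged sketch: emit a DatasetProperties aspect for a MySQL dataset via the Kafka emitter.
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.kafka_emitter import DatahubKafkaEmitter, KafkaEmitterConfig
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.metadata.schema_classes import ChangeTypeClass, DatasetPropertiesClass

    emitter = DatahubKafkaEmitter(
        KafkaEmitterConfig.parse_obj(
            {
                "connection": {
                    "bootstrap": "localhost:9092",
                    "schema_registry_url": "http://localhost:8081",
                }
            }
        )
    )

    dataset_properties = DatasetPropertiesClass(
        description="Compras table ingested from MySQL",      # placeholder
        customProperties={"team": "data-eng"},                 # placeholder
    )

    mcp = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=builder.make_dataset_urn("mysql", "emmiter_project.source.user-compras", "DEV"),
        aspectName="datasetProperties",
        aspect=dataset_properties,
    )

    # Method name may differ by version (e.g. emit_mcp); callback receives (error, message).
    emitter.emit(mcp, callback=lambda err, msg: print(err or "delivered"))
    emitter.flush()
    ```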
  • bumpy-translator-90745
    01/06/2022, 8:40 PM
    Hi - I am trying to run the pip install for BigQuery: pip install 'acryl-datahub[bigquery]'. I am getting the following error and I am not sure how to resolve it. Thanks!
  • gray-wall-52477
    01/06/2022, 11:59 PM
    Hey, this is the first extraction that I'm doing. I have a simple recipe:
    source:
      type: mysql
      config:
        host_port: XXX
        database: XXX
        username: XXX
        password: XXX
        env: Production
        include_views: False
    
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"
    It runs fine and I get this as output:
    {'entities_profiled': 0,
     'failures': {},
     'filtered': [],
     'query_combiner': None,
     'soft_deleted_stale_entities': [],
     'tables_scanned': 215,
     'views_scanned': 0,
     'warnings': {'DB.AAA': ['unable to map type BIT(length=1) to metadata schema'],
                  'DB.BBB': ['unable to map type BIT(length=1) to metadata schema'],
                  'DB.CCC': ['unable to map type BIT(length=1) to metadata schema'],
                  'DB.DDD': ['unable to map type BIT(length=1) to metadata schema']},
     'workunit_ids': [LIST OF TABLES HERE],
     'workunits_produced': 215}
    Sink (datahub-rest) report:
    {'downstream_end_time': None,
     'downstream_start_time': None,
     'downstream_total_latency_in_seconds': None,
     'failures': [],
     'records_written': 0,
     'warnings': []}
    As you can see, it says 215 tables are scanned and there are 4 warnings, but records_written is 0 🤔
  • damp-ambulance-34232
    01/07/2022, 2:38 AM
    Hey, does DataHub support the Hive/Trino struct type? I got some errors when ingesting a table with a struct field type:
    [2022-01-07 09:36:48,326] ERROR    {datahub.ingestion.run.pipeline:86} - failed to write record with workunit hive.ghtk.table_with_struct_type with ('Unable to emit metadata to DataHub GMS', {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:422]: com.linkedin.metadata.entity.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/2/com.linkedin.schema.SchemaMetadata/fields/9/jsonProps :: unrecognized field found but not allowed\nERROR :: /value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/2/com.linkedin.schema.SchemaMetadata/fields/10/jsonProps :: unrecognized field found but not allowed\nERROR ::
  • red-piano-51229
    01/07/2022, 8:13 AM
    Hi, I have files on Amazon S3, but mainly consist of unstructured data such as images, video, audio and text files. How can I make use of DataHub to ingest metadata from them so that my users can search through the catalogue? Additionally, would my users be able to pinpoint the exact location of the file once they discovered it on DataHub?
  • quaint-branch-37931
    01/07/2022, 3:17 PM
    Hey! I have been having some trouble with user accounts. What I'd like to do is pull users from Azure AD, and allow the users to login to their ingested account using OIDC. The ingestion and login work fine, but when logging in a new user is created, separate from the existing one. I think this happens because the ingester uses the text before the "@" in the mail address as the user id. but the OIDC authentication seems to use the full email address. When I configure the ingester to use the full email, the accounts still mismatch due to different casing. Is there a good way to avoid these kinds of issues? I have noticed that for example the metabase plugin also creates users, so I guess this problem is a bit broader than just these two plugins. If there currently is no way to do this, I would be happy to contribute a solution!
  • gentle-florist-49869
    01/07/2022, 3:19 PM
    Hi team, I have a simple question: what's the difference between the MCE/MAE consumer job running inside the GMS versus standalone, please? And is it possible, with the consumers inside GMS (env MCE/MAE = true), to get endpoint metrics like this: http://localhost:9091/actuator/metrics? Thank you.
  • quaint-branch-37931
    01/10/2022, 12:56 PM
    Hi! I'm trying to set up lineage for a system involving AWS Glue, AWS Athena and metabase. A series of spark jobs produce data in the glue catalog, metabase then reads it using athena as a backend. When ingesting this into datahub, the metabase source creates a new Athena datasource for all source tables, which already exist in the glue platform. Is there a good way to solve this?
  • clever-australia-61035
    01/10/2022, 2:32 PM
    Hi.. I’m facing an issue while ingesting Oracle views into the DataHub metadata repository, as follows: DatabaseError: (cx_Oracle.DatabaseError) DPI-1037: column at array position 0 fetched with error 1406. Could anyone advise on this please?
  • wide-helicopter-97009
    01/10/2022, 7:55 PM
    Hi Team, will the ingestion solution be compatible with custom onboarded entities? If not, what do we need to do to ingest customized entity metadata? https://datahubproject.io/docs/metadata-ingestion
  • shy-parrot-64120
    01/11/2022, 12:25 AM
    Hello folks!!! Can you please guide me: do I really need Java installed to use the kafka-connect extra for ingestion? I'm receiving the following error:
    File "/home/airflow/.local/lib/python3.9/site-packages/jpype/_jvmfinder.py", line 212, in get_jvm_path
        raise JVMNotFoundException("No JVM shared library file ({0}) "
    jpype._jvmfinder.JVMNotFoundException: No JVM shared library file (libjvm.so) found. Try setting up the JAVA_HOME environment variable properly.
  • salmon-rose-54694
    01/11/2022, 2:11 AM
    Hi experts, how can I ingest an updated table? For example: I originally ingested with the payload below, then I updated "description": null to "description": "a new description" and set "version": 1 to "version": 2, but it's not working. The description is still empty. Is there anything I am missing? Thanks for your help.
    [
    {
        "auditHeader": null,
        "proposedSnapshot": {
            "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot": {
                "urn": "urn:li:dataset:(urn:li:dataPlatform:mysql,abtest.abtest.abtestv3_allocation,PROD)",
                "aspects": [
                    {
                        "com.linkedin.pegasus2avro.schema.SchemaMetadata": {
                            "schemaName": "abtest.abtest.abtestv3_allocation",
                            "platform": "urn:li:dataPlatform:mysql",
                            "version": 1,
                            "created": {
                                "time": 0,
                                "actor": "urn:li:corpuser:unknown",
                                "impersonator": null
                            },
                            "lastModified": {
                                "time": 1641798176000,
                                "actor": "urn:li:corpuser:unknown",
                                "impersonator": null
                            },
                            "deleted": null,
                            "dataset": null,
                            "cluster": null,
                            "hash": "",
                            "platformSchema": {
                                "com.linkedin.pegasus2avro.schema.MySqlDDL": {
                                    "tableSchema": ""
                                }
                            },
                            "fields": [
                                {
                                    "fieldPath": "id",
                                    "jsonPath": null,
                                    "nullable": false,
                                    "description": null,
                                    "type": {
                                        "type": {
                                            "com.linkedin.pegasus2avro.schema.NumberType": {}
                                        }
                                    },
                                    "nativeDataType": "INTEGER(display_width=11)",
                                    "recursive": false,
                                    "globalTags": null,
                                    "glossaryTerms": null
                                }
                            ],
                            "primaryKeys": null,
                            "foreignKeysSpecs": null
                        }
                    }
                ]
            }
        },
        "proposedDelta": null,
        "systemMetadata": {
            "lastObserved": 1629696884482,
            "runId": "d2584674-03d3-11ec-8de4-9ae590158f91",
            "properties": null
        }
    }
    ]
  • melodic-helmet-78607
    01/11/2022, 3:54 AM
    Hi team, is it possible to ingest glossary terms with a percent symbol (%) or other symbols in the glossary name? Or is it possible to use a different urn and display name? I'm thinking of a separate field for abbreviation/full name/fqn.
  • thankful-businessperson-69424
    01/11/2022, 12:16 PM
    Does DataHub support pulling metadata from an Elasticsearch source?
  • some-crayon-90964
    01/11/2022, 4:14 PM
    Hi team, I am currently trying to implement my own transformers, but when I try to run the recipe it keeps giving me an error saying the module cannot be found, even though I have already put the Python file in the same directory as the recipe. Can I get some suggestions on how to fix this? Thanks.
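    For what it's worth, a hedged sketch of how a custom transformer is usually referenced: the transformer `type` is a fully-qualified, importable Python path, so the module generally has to be pip-installed or on PYTHONPATH rather than just sitting next to the recipe. The module/class names and config key below are made up:

    ```python
    # Hedged sketch: reference a custom transformer by its importable dotted path.
    # "my_transformers.custom_transform.AddCustomOwnership" and "owners_json" are placeholders;
    # the module must be importable (pip-installed or on PYTHONPATH).
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {"host_port": "localhost:3306", "database": "mydb",
                           "username": "user", "password": "pass"},
            },
            "transformers": [
                {
                    "type": "my_transformers.custom_transform.AddCustomOwnership",
                    "config": {"owners_json": "/tmp/owners.json"},
                }
            ],
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    ```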
  • gentle-florist-49869
    01/11/2022, 6:15 PM
    Hello folks!!! Can you please guide me through creating a Kafka emitter in my lab? I'm trying to construct a DataHub object with SchemaMetadata, but I received some errors. Could you help me fix the code? My Kafka emitter Python script has:
    # Construct the dataset properties object
    schema_metadata = SchemaMetadata(
        schemaName="dataset_name",
        platform="sql" version="0",
        hash="",
        platformSchema=MySqlDDL(tableSchema=""),
        fields="idproduto",
    )
    # Construct the MetadataChangeProposalWrapper object
    metadata_event = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=builder.make_dataset_urn("mysql", "fabio-mysql.fabio-dataset.user-table"),
        aspectName="schemaMetadata",
        aspect=schema_metadata,
    )
    but it shows this error:
    File "/home/fabiocastro/datahub/metadata-ingestion/src/datahub/emitter/fabiocastro_kafka_emitter.py", line 54
        hash="",
        ^
    SyntaxError: invalid syntax
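    For reference, a hedged, corrected version of that construction: there is a missing comma after platform="sql", and beyond that the platform is normally a platform URN, the version an integer, and fields a list of SchemaFieldClass objects. The field names and URNs below are examples, not a definitive implementation:

    ```python
    # Hedged sketch: a corrected SchemaMetadata construction (names/URNs are placeholders).
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        MySqlDDLClass,
        NumberTypeClass,
        SchemaFieldClass,
        SchemaFieldDataTypeClass,
        SchemaMetadataClass,
    )

    schema_metadata = SchemaMetadataClass(
        schemaName="fabio-dataset.user-table",
        platform=builder.make_data_platform_urn("mysql"),
        version=0,
        hash="",
        platformSchema=MySqlDDLClass(tableSchema=""),
        fields=[
            SchemaFieldClass(
                fieldPath="idproduto",
                type=SchemaFieldDataTypeClass(type=NumberTypeClass()),
                nativeDataType="INT",
            )
        ],
    )

    metadata_event = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=builder.make_dataset_urn("mysql", "fabio-mysql.fabio-dataset.user-table", "DEV"),
        aspectName="schemaMetadata",
        aspect=schema_metadata,
    )
    ```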
  • wide-helicopter-97009
    01/11/2022, 8:46 PM
    Hi Team, I am trying out your custom_transform_example transformer module from this document https://datahubproject.io/docs/metadata-ingestion/transformers/, but I got a "module not found" error. Do you have any solution for this one? Thanks.
  • red-pizza-28006
    01/12/2022, 1:06 PM
    Hi team - after upgrading to 0.8.22, I started getting this exception in the snowflake-usage ingestion:
    [2022-01-12 14:03:58,671] ERROR    {datahub.ingestion.run.pipeline:85} - failed to write record with workunit operation-aspect-SUMUP_DWH_PROD.ACCESS_MANAGER.ACCESS_MANAGER_REVOKE_LIST-2022-01-11T23:34:10.293000+00:00 with ('Unable to emit metadata to DataHub GMS', {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:500]: java.lang.RuntimeException: Unknown aspect operation for entity dataset\n\tat com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:42)\n\tat com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:50)\n\tat com.linkedin.metadata.resources.entity.AspectResource.ingestProposal(AspectResource.java:132)\n\tat sun.reflect.GeneratedMethodAccessor245.invoke(Unknown Source)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat com.linkedin.restli.internal.server.RestLiMethodInvoker.doInvoke(RestLiMethodInvoker.java:172)\n\tat com.linkedin.restli.internal.server.RestLiMethodInvoker.invoke(RestLiMethodInvoker.java:326)\n\tat com.linkedin.restli.internal.server.filter.FilterChainDispatcherImpl.onRequestSuccess(FilterChainDispatcherImpl.java:47)\n\tat com.linkedin.restli.internal.server.filter.RestLiFilterChainIterator.onRequest(RestLiFilterChainIterator.java:86)\n\tat com.linkedin.restli.internal.server.filter.RestLiFilterChainIterator.lambda$onRequest$0(RestLiFilterChainIterator.java:73)\n\tat java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)\n\tat java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:683)\n\tat java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2010)\n\tat com.linkedin.restli.internal.server.filter.RestLiFilterChainIterator.onRequest(RestLiFilterChainIterator.java:72)\n\tat com.linkedin.restli.internal.server.filter.RestLiFilterChain.onRequest(RestLiFilterChain.java:55)\n\tat com.linkedin.restli.server.BaseRestLiServer.handleResourceRequest(BaseRestLiServer.java:218)\n\tat com.linkedin.restli.server.RestRestLiServer.handleResourceRequestWithRestLiResponse(RestRestLiServer.java:242)\n\tat com.linkedin.restli.server.RestRestLiServer.handleResourceRequest(RestRestLiServer.java:211)\n\tat com.linkedin.restli.server.RestRestLiServer.handleResourceRequest(RestRestLiServer.java:181)\n\tat com.linkedin.restli.server.RestRestLiServer.doHandleRequest(RestRestLiServer.java:164)\n\tat com.linkedin.restli.server.RestRestLiServer.handleRequest(RestRestLiServer.java:120)\n\tat com.linkedin.restli.server.RestLiServer.handleRequest(RestLiServer.java:132)\n\tat com.linkedin.restli.server.DelegatingTransportDispatcher.handleRestRequest(DelegatingTransportDispatcher.java:70)\n\tat com.linkedin.r2.filter.transport.DispatcherRequestFilter.onRestRequest(DispatcherRequestFilter.java:70)\n\tat com.linkedin.r2.filter.TimedRestFilter.onRestRequest(TimedRestFilter.java:72)\n\tat com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:146)\n\tat com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:132)\n\tat com.linkedin.r2.filter.FilterChainIterator.onRequest(FilterChainIterator.java:62)\n\tat com.linkedin.r2.filter.TimedNextFilter.onRequest(TimedNextFilter.java:55)\n\tat com.linkedin.r2.filter.transport.ServerQueryTunnelFilter.onRestRequest(ServerQueryTunnelFilter.java:58)\n\tat 
com.linkedin.r2.filter.TimedRestFilter.onRestRequest(TimedRestFilter.java:72)\n\tat com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:146)\n\tat com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:132)\n\tat com.linkedin.r2.filter.FilterChainIterator.onRequest(FilterChainIterator.java:62)\n\tat com.linkedin.r2.filter.TimedNextFilter.onRequest(TimedNextFilter.java:55)\n\tat com.linkedin.r2.filter.message.rest.RestFilter.onRestRequest(RestFilter.java:50)\n\tat com.linkedin.r2.filter.TimedRestFilter.onRestRequest(TimedRestFilter.java:72)\n\tat com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:146)\n\tat com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:132)\n\tat com.linkedin.r2.filter.FilterChainIterator.onRequest(FilterChainIterator.java:62)\n\tat com.linkedin.r2.filter.FilterChainImpl.onRestRequest(FilterChainImpl.java:96)\n\tat com.linkedin.r2.filter.transport.FilterChainDispatcher.handleRestRequest(FilterChainDispatcher.java:75)\n\tat com.linkedin.r2.util.finalizer.RequestFinalizerDispatcher.handleRestRequest(RequestFinalizerDispatcher.java:61)\n\tat com.linkedin.r2.transport.http.server.HttpDispatcher.handleRequest(HttpDispatcher.java:101)\n\tat <http://com.linkedin.r2.transport.ht|com.linkedin.r2.transport.ht>
    Any ideas?