# ingestion
  • few-air-56117
    01/04/2022, 12:13 PM
    Hi guys, I have a question. I put DataHub on Google Kubernetes (GKE), but I am not sure how I can run the ingestion scripts. On my local machine I used a recipe.yaml and ran datahub ingest, but I am not sure how to do that in k8s. Thx 😄
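    A minimal hedged sketch of one way to do this: run the same recipe programmatically with the acryl-datahub Python API, package the script into a container image, and let a Kubernetes CronJob run it on a schedule. The source/sink values and the in-cluster GMS address below are placeholders, not taken from this thread:

    ```python
    # Hedged sketch: run a recipe with the acryl-datahub Python API instead of the CLI.
    # Host/credentials and the GMS service name are placeholders; adjust to your cluster.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {
                    "host_port": "mysql:3306",
                    "database": "mydb",
                    "username": "user",
                    "password": "pass",
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://datahub-gms:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()
    ```

    The same effect can be achieved by baking the recipe.yaml into an image that simply runs `datahub ingest -c recipe.yaml` as the CronJob command.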
  • gentle-nest-904
    01/04/2022, 12:26 PM
    Hi guys, I have a question as well. I'm not a technical expert anymore, but I'm an ex-developer (15 years ago), so hopefully you guys can guide/help me.
  • gentle-nest-904
    01/04/2022, 12:27 PM
    I've got APIs that can be called from an application called AFAS. Those APIs return metadata specifications, i.e. they describe the data elements returned by the APIs. Is it possible to load such a thing automatically into DataHub?
  • nice-country-99675
    01/04/2022, 2:22 PM
    👋 Hi Team! I was taking a look at the new redshift-usage source ingestor... and I found this scenario. In the regular redshift metadata ingestor I have a database alias, and some schemas and tables filtered out. Is there a way to apply the same settings to the redshift-usage recipe?
  • few-air-56117
    01/04/2022, 5:49 PM
    Hi, is it possible to start an ingestion from Python, and also to interact with a DataHub running on k8s from Python (e.g. deleting a dataset)?
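    For the second half, a minimal hedged sketch of talking to GMS from Python over plain REST (for example after a `kubectl port-forward svc/datahub-gms 8080:8080`); the URN and server address are made-up examples, and deletion is also exposed through the `datahub delete` CLI command:

    ```python
    # Hedged sketch: read an entity back from GMS over REST from Python.
    # The GMS address and URN are placeholders.
    import urllib.parse
    import requests

    GMS = "http://localhost:8080"
    urn = "urn:li:dataset:(urn:li:dataPlatform:mysql,mydb.mytable,PROD)"

    resp = requests.get(
        f"{GMS}/entities/{urllib.parse.quote(urn, safe='')}",
        headers={"X-RestLi-Protocol-Version": "2.0.0"},
    )
    resp.raise_for_status()
    print(resp.json())
    ```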
  • some-crayon-90964
    01/04/2022, 9:52 PM
    Two quick questions: 1. Is there a way to have ingestion run frequently (such as BigQuery -> DataHub every 12 hours)? 2. How do we configure the yml file so that it lists all projects, without us hardcoding every project we have? Thanks in advance!
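    On the scheduling half, one common approach (a hedged sketch, not a recommendation from this thread) is to drive `datahub ingest` from an orchestrator such as Airflow on a 12-hour schedule; the DAG id, recipe path and start date below are placeholders:

    ```python
    # Hypothetical Airflow DAG that shells out to the DataHub CLI every 12 hours.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="bigquery_to_datahub",
        start_date=datetime(2022, 1, 1),
        schedule_interval="0 */12 * * *",  # every 12 hours
        catchup=False,
    ) as dag:
        ingest = BashOperator(
            task_id="datahub_ingest",
            bash_command="datahub ingest -c /opt/recipes/bigquery.yml",
        )
    ```

    A plain cron entry running the same command works just as well if you do not already run an orchestrator.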
  • nice-planet-17111
    01/05/2022, 1:15 AM
    Hi team, 🙂 I was trying to ingest dbt on BigQuery... until I realized BigQuery ingestion now supports lineage information. So I was testing using lineage information only from BigQuery, after hard deleting all dbt entities. However, when I do it, dbt entities (=nodes) still appear in the lineage graphs like haunting ghosts. How can I REALLY delete them? 🙂
  • loud-holiday-22352
    01/05/2022, 7:10 AM
    Hello, when I run 'datahub ingest -c ./recipe.yml' I get an error 【ValueError: This version of acryl-datahub requires GMS v0.8.0 or higher】. Is there something wrong with the installation? Thank you.
  • gentle-sundown-2310
    01/05/2022, 6:38 PM
    Hey Team, I am ingesting a MySQL database and this is my recipe:
    source:
      type: mysql
      config:
        # Coordinates
        host_port:
        database: tableau_advocate_lsnr
        # Credentials
        username:
        password:
        schema_pattern:
          allow:
            - "tableau_advocate_lsnr"
        profiling:
          enabled: true
        profile_pattern:
          allow:
            - ".*standard_module_daily"
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"
  • gentle-sundown-2310
    01/05/2022, 6:39 PM
    I am getting this error:
    2 validation errors for MySQLConfig
    profile_pattern
    extra fields not permitted (type=value_error.extra)
    profiling
    extra fields not permitted (type=value_error.extra)
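    For reference, a hedged restatement of the intended nesting as a programmatic pipeline (credentials and host are placeholders): `profiling` and `profile_pattern` sit under source.config, and one common cause of the "extra fields not permitted" error is an acryl-datahub build that predates SQL profiling support, in which case pydantic rejects the unknown keys:

    ```python
    # Hedged sketch: same recipe expressed as a programmatic pipeline, to make the nesting explicit.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {
                    "host_port": "localhost:3306",  # placeholder
                    "database": "tableau_advocate_lsnr",
                    "username": "user",             # placeholder
                    "password": "pass",             # placeholder
                    "schema_pattern": {"allow": ["tableau_advocate_lsnr"]},
                    "profiling": {"enabled": True},
                    "profile_pattern": {"allow": [".*standard_module_daily"]},
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    ```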
  • wide-helicopter-97009
    01/05/2022, 9:42 PM
    Hi Team, for the metadata-ingestion in this document, do you have a demo from any previous town hall meeting for this ingestion method? https://datahubproject.io/docs/metadata-ingestion
  • damp-ambulance-34232
    01/06/2022, 4:10 AM
    Hi, is there any DataHub API service to delete a URN?
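    As a hedged illustration (not necessarily the only mechanism): the `datahub delete --urn ...` CLI command covers this, and a soft delete can also be done from Python by emitting a Status aspect with removed=True. The URN and GMS address below are placeholders:

    ```python
    # Hedged sketch: soft-delete a dataset by emitting a Status aspect with removed=True.
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, StatusClass

    emitter = DatahubRestEmitter("http://localhost:8080")  # placeholder GMS address

    mcp = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=builder.make_dataset_urn("mysql", "mydb.mytable", "PROD"),  # example URN
        aspectName="status",
        aspect=StatusClass(removed=True),
    )
    emitter.emit_mcp(mcp)
    ```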
  • gentle-florist-49869
    01/06/2022, 8:33 PM
    Hello, I'm working on creating a custom Kafka emitter (Python). Could someone here please provide or share an example where I can construct a MySQL object (columns, schema, etc.)?
    # Construct a MetadataChangeProposalWrapper object.
    metadata_event = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=builder.make_dataset_urn("mysql", "emmiter_project.source.user-compras", "DEV"),
        aspectName="datasetProperties",
        aspect=dataset_properties,
    )
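    A hedged, self-contained sketch of what a complete MCP emitted through the Kafka emitter might look like, here attaching a DatasetProperties aspect. The bootstrap/schema-registry addresses, description and custom properties are placeholders, and the KafkaEmitterConfig fields and emit method name are my assumptions; double-check them against your acryl-datahub version:

    ```python
    # Hedged sketch: emit a DatasetProperties aspect for a MySQL dataset via the Kafka emitter.
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.kafka_emitter import DatahubKafkaEmitter, KafkaEmitterConfig
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.metadata.schema_classes import ChangeTypeClass, DatasetPropertiesClass

    emitter = DatahubKafkaEmitter(
        KafkaEmitterConfig.parse_obj(
            {
                "connection": {
                    "bootstrap": "localhost:9092",
                    "schema_registry_url": "http://localhost:8081",
                }
            }
        )
    )

    dataset_properties = DatasetPropertiesClass(
        description="Compras table ingested from MySQL",      # placeholder
        customProperties={"team": "data-eng"},                 # placeholder
    )

    mcp = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=builder.make_dataset_urn("mysql", "emmiter_project.source.user-compras", "DEV"),
        aspectName="datasetProperties",
        aspect=dataset_properties,
    )

    # Method name may differ by version (e.g. emit_mcp); callback receives (error, message).
    emitter.emit(mcp, callback=lambda err, msg: print(err or "delivered"))
    emitter.flush()
    ```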
  • bumpy-translator-90745
    01/06/2022, 8:40 PM
    Hi - I am trying to run the pip install for BigQuery: pip install 'acryl-datahub[bigquery]'. I am getting the following error and I am not sure how to resolve it. Thanks!
  • gray-wall-52477
    01/06/2022, 11:59 PM
    Hey, this is the first extraction that I'm doing. I have a simple recipe:
    source:
      type: mysql
      config:
        host_port: XXX
        database: XXX
        username: XXX
        password: XXX
        env: Production
        include_views: False
    
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"
    It runs fine and I get this as output:
    {'entities_profiled': 0,
     'failures': {},
     'filtered': [],
     'query_combiner': None,
     'soft_deleted_stale_entities': [],
     'tables_scanned': 215,
     'views_scanned': 0,
     'warnings': {'DB.AAA': ['unable to map type BIT(length=1) to metadata schema'],
                  'DB.BBB': ['unable to map type BIT(length=1) to metadata schema'],
                  'DB.CCC': ['unable to map type BIT(length=1) to metadata schema'],
                  'DB.DDD': ['unable to map type BIT(length=1) to metadata schema']},
     'workunit_ids': [LIST OF TABLES HERE],
     'workunits_produced': 215}
    Sink (datahub-rest) report:
    {'downstream_end_time': None,
     'downstream_start_time': None,
     'downstream_total_latency_in_seconds': None,
     'failures': [],
     'records_written': 0,
     'warnings': []}
    As you can see, it says 215 tables are scanned and there are 4 warnings, but records_written is 0 🤔
  • damp-ambulance-34232
    01/07/2022, 2:38 AM
    Hey, does DataHub support the Hive/Trino struct type? I got some errors when ingesting a table with a struct field type:
    [2022-01-07 09:36:48,326] ERROR    {datahub.ingestion.run.pipeline:86} - failed to write record with workunit hive.ghtk.table_with_struct_type with ('Unable to emit metadata to DataHub GMS', {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:422]: com.linkedin.metadata.entity.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/2/com.linkedin.schema.SchemaMetadata/fields/9/jsonProps :: unrecognized field found but not allowed\nERROR :: /value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/2/com.linkedin.schema.SchemaMetadata/fields/10/jsonProps :: unrecognized field found but not allowed\nERROR ::
  • red-piano-51229
    01/07/2022, 8:13 AM
    Hi, I have files on Amazon S3, but mainly consist of unstructured data such as images, video, audio and text files. How can I make use of DataHub to ingest metadata from them so that my users can search through the catalogue? Additionally, would my users be able to pinpoint the exact location of the file once they discovered it on DataHub?
  • quaint-branch-37931
    01/07/2022, 3:17 PM
    Hey! I have been having some trouble with user accounts. What I'd like to do is pull users from Azure AD, and allow the users to login to their ingested account using OIDC. The ingestion and login work fine, but when logging in a new user is created, separate from the existing one. I think this happens because the ingester uses the text before the "@" in the mail address as the user id. but the OIDC authentication seems to use the full email address. When I configure the ingester to use the full email, the accounts still mismatch due to different casing. Is there a good way to avoid these kinds of issues? I have noticed that for example the metabase plugin also creates users, so I guess this problem is a bit broader than just these two plugins. If there currently is no way to do this, I would be happy to contribute a solution!
  • gentle-florist-49869
    01/07/2022, 3:19 PM
    Hi team, I have a simple question: what's the difference between the MCE/MAE consumer job running inside the GMS versus standalone, please? And is it possible, with the consumers inside GMS (env MCE/MAE = true), to get endpoint metrics like this: http://localhost:9091/actuator/metrics? Thank you.
  • quaint-branch-37931
    01/10/2022, 12:56 PM
    Hi! I'm trying to set up lineage for a system involving AWS Glue, AWS Athena and metabase. A series of spark jobs produce data in the glue catalog, metabase then reads it using athena as a backend. When ingesting this into datahub, the metabase source creates a new Athena datasource for all source tables, which already exist in the glue platform. Is there a good way to solve this?
  • clever-australia-61035
    01/10/2022, 2:32 PM
    Hi.. I’m facing an issue while ingesting Oracle views into the DataHub metadata repository, as follows: DatabaseError: (cx_Oracle.DatabaseError) DPI-1037: column at array position 0 fetched with error 1406. Could anyone advise on this please?
  • wide-helicopter-97009
    01/10/2022, 7:55 PM
    Hi Team, will the ingestion solution be compatible with custom onboarded entities? If not, what do we need to do to ingest customized entity metadata? https://datahubproject.io/docs/metadata-ingestion
  • shy-parrot-64120
    01/11/2022, 12:25 AM
    Hello folks!!! Can you please guide me: do I really need Java installed to use the kafka-connect extra for ingestion? I'm receiving the following error:
    File "/home/airflow/.local/lib/python3.9/site-packages/jpype/_jvmfinder.py", line 212, in get_jvm_path
        raise JVMNotFoundException("No JVM shared library file ({0}) "
    jpype._jvmfinder.JVMNotFoundException: No JVM shared library file (libjvm.so) found. Try setting up the JAVA_HOME environment variable properly.
  • salmon-rose-54694
    01/11/2022, 2:11 AM
    Hi experts, how can I ingest an updated table? For example: I originally ingested with the payload below, then I updated "description": null to "description": "a new description" and set "version": 1 to "version": 2, but it's not working. The description is still empty. Is there anything I am missing? Thanks for your help.
    [
    {
        "auditHeader": null,
        "proposedSnapshot": {
            "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot": {
                "urn": "urn:li:dataset:(urn:li:dataPlatform:mysql,abtest.abtest.abtestv3_allocation,PROD)",
                "aspects": [
                    {
                        "com.linkedin.pegasus2avro.schema.SchemaMetadata": {
                            "schemaName": "abtest.abtest.abtestv3_allocation",
                            "platform": "urn:li:dataPlatform:mysql",
                            "version": 1,
                            "created": {
                                "time": 0,
                                "actor": "urn:li:corpuser:unknown",
                                "impersonator": null
                            },
                            "lastModified": {
                                "time": 1641798176000,
                                "actor": "urn:li:corpuser:unknown",
                                "impersonator": null
                            },
                            "deleted": null,
                            "dataset": null,
                            "cluster": null,
                            "hash": "",
                            "platformSchema": {
                                "com.linkedin.pegasus2avro.schema.MySqlDDL": {
                                    "tableSchema": ""
                                }
                            },
                            "fields": [
                                {
                                    "fieldPath": "id",
                                    "jsonPath": null,
                                    "nullable": false,
                                    "description": null,
                                    "type": {
                                        "type": {
                                            "com.linkedin.pegasus2avro.schema.NumberType": {}
                                        }
                                    },
                                    "nativeDataType": "INTEGER(display_width=11)",
                                    "recursive": false,
                                    "globalTags": null,
                                    "glossaryTerms": null
                                }
                            ],
                            "primaryKeys": null,
                            "foreignKeysSpecs": null
                        }
                    }
                ]
            }
        },
        "proposedDelta": null,
        "systemMetadata": {
            "lastObserved": 1629696884482,
            "runId": "d2584674-03d3-11ec-8de4-9ae590158f91",
            "properties": null
        }
    }
    ]
  • melodic-helmet-78607
    01/11/2022, 3:54 AM
    Hi team, is it possible to ingest glossary terms with a percent symbol (%) or other symbols in the glossary name? Or is it possible to use a different urn and display name? I'm thinking of a separate field for abbreviation/full name/fqn.
  • thankful-businessperson-69424
    01/11/2022, 12:16 PM
    Does DataHub support pulling metadata from an Elasticsearch source?
  • some-crayon-90964
    01/11/2022, 4:14 PM
    Hi team, I am currently trying to implement my own transformers, but when I try to run the recipe it keeps giving me an error saying the module cannot be found, even though I have already put the Python file in the same directory as the recipe. Can I get some suggestions on how to fix this? Thanks.
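    For what it's worth, a hedged sketch of how a custom transformer is usually referenced: the transformer `type` is a fully-qualified, importable Python path, so the module generally has to be pip-installed or on PYTHONPATH rather than just sitting next to the recipe. The module/class names and config key below are made up:

    ```python
    # Hedged sketch: reference a custom transformer by its importable dotted path.
    # "my_transformers.custom_transform.AddCustomOwnership" and "owners_json" are placeholders;
    # the module must be importable (pip-installed or on PYTHONPATH).
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {"host_port": "localhost:3306", "database": "mydb",
                           "username": "user", "password": "pass"},
            },
            "transformers": [
                {
                    "type": "my_transformers.custom_transform.AddCustomOwnership",
                    "config": {"owners_json": "/tmp/owners.json"},
                }
            ],
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    ```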
  • gentle-florist-49869
    01/11/2022, 6:15 PM
    Hello folks!!! Can you please guide me through creating a Kafka emitter in my lab? I'm trying to construct a DataHub object with SchemaMetadata, but I received some errors. Could you help me fix the code? My Kafka emitter Python script has:
    # Construct the dataset properties object
    schema_metadata = SchemaMetadata(
        schemaName="dataset_name",
        platform="sql" version="0",
        hash="",
        platformSchema=MySqlDDL(tableSchema=""),
        fields="idproduto",
    )
    # Construct the MetadataChangeProposalWrapper object
    metadata_event = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=builder.make_dataset_urn("mysql", "fabio-mysql.fabio-dataset.user-table"),
        aspectName="schemaMetadata",
        aspect=schema_metadata,
    )
    but it shows this error:
    File "/home/fabiocastro/datahub/metadata-ingestion/src/datahub/emitter/fabiocastro_kafka_emitter.py", line 54
        hash="",
        ^
    SyntaxError: invalid syntax
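    For reference, a hedged, corrected version of that construction: there is a missing comma after platform="sql", and beyond that the platform is normally a platform URN, the version an integer, and fields a list of SchemaFieldClass objects. The field names and URNs below are examples, not a definitive implementation:

    ```python
    # Hedged sketch: a corrected SchemaMetadata construction (names/URNs are placeholders).
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        MySqlDDLClass,
        NumberTypeClass,
        SchemaFieldClass,
        SchemaFieldDataTypeClass,
        SchemaMetadataClass,
    )

    schema_metadata = SchemaMetadataClass(
        schemaName="fabio-dataset.user-table",
        platform=builder.make_data_platform_urn("mysql"),
        version=0,
        hash="",
        platformSchema=MySqlDDLClass(tableSchema=""),
        fields=[
            SchemaFieldClass(
                fieldPath="idproduto",
                type=SchemaFieldDataTypeClass(type=NumberTypeClass()),
                nativeDataType="INT",
            )
        ],
    )

    metadata_event = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=builder.make_dataset_urn("mysql", "fabio-mysql.fabio-dataset.user-table", "DEV"),
        aspectName="schemaMetadata",
        aspect=schema_metadata,
    )
    ```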
  • wide-helicopter-97009
    01/11/2022, 8:46 PM
    Hi Team, I am trying out your custom_transform_example transformer module from this document https://datahubproject.io/docs/metadata-ingestion/transformers/, but I got a "module not found" error. Do you have any solution for this one? Thanks.
  • red-pizza-28006
    01/12/2022, 1:06 PM
    Hi team - after upgrading to 0.8.22, I started getting this exception in the snowflake-usage ingestion:
    [2022-01-12 14:03:58,671] ERROR    {datahub.ingestion.run.pipeline:85} - failed to write record with workunit operation-aspect-SUMUP_DWH_PROD.ACCESS_MANAGER.ACCESS_MANAGER_REVOKE_LIST-2022-01-11T23:34:10.293000+00:00 with ('Unable to emit metadata to DataHub GMS', {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:500]: java.lang.RuntimeException: Unknown aspect operation for entity dataset\n\tat com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:42)\n\tat com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:50)\n\tat com.linkedin.metadata.resources.entity.AspectResource.ingestProposal(AspectResource.java:132)\n\tat sun.reflect.GeneratedMethodAccessor245.invoke(Unknown Source)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat com.linkedin.restli.internal.server.RestLiMethodInvoker.doInvoke(RestLiMethodInvoker.java:172)\n\tat com.linkedin.restli.internal.server.RestLiMethodInvoker.invoke(RestLiMethodInvoker.java:326)\n\tat com.linkedin.restli.internal.server.filter.FilterChainDispatcherImpl.onRequestSuccess(FilterChainDispatcherImpl.java:47)\n\tat com.linkedin.restli.internal.server.filter.RestLiFilterChainIterator.onRequest(RestLiFilterChainIterator.java:86)\n\tat com.linkedin.restli.internal.server.filter.RestLiFilterChainIterator.lambda$onRequest$0(RestLiFilterChainIterator.java:73)\n\tat java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)\n\tat java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:683)\n\tat java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2010)\n\tat com.linkedin.restli.internal.server.filter.RestLiFilterChainIterator.onRequest(RestLiFilterChainIterator.java:72)\n\tat com.linkedin.restli.internal.server.filter.RestLiFilterChain.onRequest(RestLiFilterChain.java:55)\n\tat com.linkedin.restli.server.BaseRestLiServer.handleResourceRequest(BaseRestLiServer.java:218)\n\tat com.linkedin.restli.server.RestRestLiServer.handleResourceRequestWithRestLiResponse(RestRestLiServer.java:242)\n\tat com.linkedin.restli.server.RestRestLiServer.handleResourceRequest(RestRestLiServer.java:211)\n\tat com.linkedin.restli.server.RestRestLiServer.handleResourceRequest(RestRestLiServer.java:181)\n\tat com.linkedin.restli.server.RestRestLiServer.doHandleRequest(RestRestLiServer.java:164)\n\tat com.linkedin.restli.server.RestRestLiServer.handleRequest(RestRestLiServer.java:120)\n\tat com.linkedin.restli.server.RestLiServer.handleRequest(RestLiServer.java:132)\n\tat com.linkedin.restli.server.DelegatingTransportDispatcher.handleRestRequest(DelegatingTransportDispatcher.java:70)\n\tat com.linkedin.r2.filter.transport.DispatcherRequestFilter.onRestRequest(DispatcherRequestFilter.java:70)\n\tat com.linkedin.r2.filter.TimedRestFilter.onRestRequest(TimedRestFilter.java:72)\n\tat com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:146)\n\tat com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:132)\n\tat com.linkedin.r2.filter.FilterChainIterator.onRequest(FilterChainIterator.java:62)\n\tat com.linkedin.r2.filter.TimedNextFilter.onRequest(TimedNextFilter.java:55)\n\tat com.linkedin.r2.filter.transport.ServerQueryTunnelFilter.onRestRequest(ServerQueryTunnelFilter.java:58)\n\tat 
com.linkedin.r2.filter.TimedRestFilter.onRestRequest(TimedRestFilter.java:72)\n\tat com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:146)\n\tat com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:132)\n\tat com.linkedin.r2.filter.FilterChainIterator.onRequest(FilterChainIterator.java:62)\n\tat com.linkedin.r2.filter.TimedNextFilter.onRequest(TimedNextFilter.java:55)\n\tat com.linkedin.r2.filter.message.rest.RestFilter.onRestRequest(RestFilter.java:50)\n\tat com.linkedin.r2.filter.TimedRestFilter.onRestRequest(TimedRestFilter.java:72)\n\tat com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:146)\n\tat com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:132)\n\tat com.linkedin.r2.filter.FilterChainIterator.onRequest(FilterChainIterator.java:62)\n\tat com.linkedin.r2.filter.FilterChainImpl.onRestRequest(FilterChainImpl.java:96)\n\tat com.linkedin.r2.filter.transport.FilterChainDispatcher.handleRestRequest(FilterChainDispatcher.java:75)\n\tat com.linkedin.r2.util.finalizer.RequestFinalizerDispatcher.handleRestRequest(RequestFinalizerDispatcher.java:61)\n\tat com.linkedin.r2.transport.http.server.HttpDispatcher.handleRequest(HttpDispatcher.java:101)\n\tat <http://com.linkedin.r2.transport.ht|com.linkedin.r2.transport.ht>
    Any ideas?