# ingestion
g
Hi! I am working on data profiling for our MySQL datasource. I can run it locally with profiling.enabled: True in my recipe. I can see the aspect name datasetProfile in the output, along with the values produced in that aspect, such as rowCount, columnCount, fieldProfiles, etc. However, when I ingest to the UI with the sink type set to datahub-kafka, I don't see the data. The Stats tab remains disabled. Any idea why this might be happening?
h
That's odd. Are you sure profiling is enabled for all datasets, and that you are checking the same datasets for which the datasetProfile aspects are produced? I made that mistake once 🙂
g
I have profiling enabled for only some datasets because I am testing it out. The database has a lot of tables, so profiling all of them would be time-consuming, especially for preliminary testing and implementation. Also, the datasets I am looking up on DataHub are the ones for which these datasetProfile aspects are being produced. Below is the recipe I am using.
source:
  type: "mysql"
  config:
    host_port: "<host port>"
    database: "<database name>"
    username: "<username>"
    password: "<password>"
    schema_pattern:
      allow:
        - "<regex pattern allowed>"
    table_pattern:
      allow:
        - "<regex pattern allowed>"
    profiling:
      enabled: True
      # profile_table_level_only: true
      query_combiner_enabled: False
      turn_off_expensive_profiling_metrics: True
      max_number_of_fields_to_profile: 10
    profile_pattern:
      allow:
        - "<regex pattern allowed>"


sink:
  type: "datahub-kafka"
  config:
    connection:
      bootstrap: "<some url>"
      schema_registry_url: "<some url>"
    topic_routes:
      MetadataChangeEvent: <topic route>
      MetadataChangeProposal: <topic route>
m
Hi @gifted-queen-80042: are you filling out the topic_routes?
e
Can you check whether the topic name for MCP is there?
g
Yes @mammoth-bear-12532 and @early-lamp-41924, the topic_routes are filled in the recipe here.
topic_routes:
  MetadataChangeEvent: ...
  MetadataChangeProposal: ...
I have also set profiling.query_combiner_enabled: False. Otherwise, it shows issues. What's strange is that I can see datasetProfile aspects when I ingest locally to a file, but the Stats tab is not enabled for those datasets when I ingest to the DataHub UI.
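One way to double-check which datasets actually received a datasetProfile aspect in the local file output is a small script along these lines. This is only a rough sketch: it assumes the file written by the file sink is a JSON array of records, and that proposal-style records carry top-level entityUrn and aspectName keys as in the MCP message shown in the logs. The function names and the file path are made up for illustration.

```python
import json
from collections import defaultdict


def aspects_by_urn(records):
    """Group aspect names by entity URN for proposal-style records."""
    found = defaultdict(set)
    for rec in records:
        # Records without top-level entityUrn/aspectName (e.g. snapshot-style
        # events) are skipped; only proposal-style records are counted.
        urn = rec.get("entityUrn")
        aspect = rec.get("aspectName")
        if urn and aspect:
            found[urn].add(aspect)
    return found


def urns_with_profile(path):
    """Return the sorted URNs that have a datasetProfile aspect in the file.

    Assumption: `path` points at a JSON array of records.
    """
    with open(path) as f:
        records = json.load(f)
    grouped = aspects_by_urn(records)
    return sorted(urn for urn, names in grouped.items() if "datasetProfile" in names)
```

Calling urns_with_profile("output.json") (hypothetical path) gives the list of URNs to look up in the UI, which makes it easier to rule out checking the wrong datasets.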
For context, my local DataHub version is 0.8.27 and the one we have hosted is v0.8.14.
Also, here are some snippets from the GMS logs:
16:47:10.102 [qtp1933493643-23495] INFO  c.l.m.r.entity.AspectResource:126 - INGEST PROPOSAL proposal: {aspectName=container, systemMetadata={lastObserved=...., runId=mysql-....}, entityUrn=urn:li:dataset:(urn:li:dataPlatform:mysql,...,PROD), entityType=dataset, aspect={contentType=application/json, value=ByteString(length=66,bytes=7b22636f...6532227d)}, changeType=UPSERT}
16:47:10.116 [qtp1933493643-23495] INFO  c.l.m.filter.RestliLoggingFilter:56 - POST /aspects?action=ingestProposal - ingestProposal - 500 - 14ms
16:47:10.117 [qtp1933493643-23495] ERROR c.l.m.filter.RestliLoggingFilter:38 - java.lang.RuntimeException: Unknown aspect container for entity dataset
16:47:10.118 [pool-11-thread-1] ERROR c.l.common.callback.CallbackAdapter:90 - Failed to convert callback error, original exception follows:
com.linkedin.r2.message.rest.RestException: Received error 500 from server for URI <http://localhost:8080/aspects>
	at com.linkedin.r2.transport.http.common.HttpBridge$1.onResponse(HttpBridge.java:76)
	at com.linkedin.r2.transport.http.client.rest.ExecutionCallback.lambda$onResponse$0(ExecutionCallback.java:64)
	at io.opentelemetry.javaagent.instrumentation.api.concurrent.RunnableWrapper.run(RunnableWrapper.java:28)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
16:47:10.119 [generic-mce-consumer-job-client-0-C-1] ERROR c.l.m.k.MetadataChangeProposalsProcessor:68 - MCP Processor Error
com.linkedin.r2.RemoteInvocationException: com.linkedin.data.template.RequiredFieldNotPresentException: Field "value" is required but it is not present
...
16:47:09.759 [generic-mce-consumer-job-client-0-C-1] ERROR c.l.m.k.MetadataChangeProposalsProcessor:69 - Message: {"auditHeader": null, "entityType": "dataset", "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:mysql,.....,PROD)", "entityKeyAspect": null, "changeType": "UPSERT", "aspectName": "container", "aspect": {"value": "{\"container\": \"urn:li:container:.....\"}", "contentType": "application/json"}, "systemMetadata": {"lastObserved": ...., "runId": "mysql-....", "registryName": null, "registryVersion": null, "properties": null}}
16:47:09.760 [generic-mce-consumer-job-client-0-C-1] INFO  c.l.m.k.MetadataChangeProposalsProcessor:79 - Error while processing FMCP: FailedMetadataChangeProposal - {error=com.linkedin.r2.RemoteInvocationException: com.linkedin.data.template.RequiredFieldNotPresentException: Field "value" is required but it is not present
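Note that the error in this excerpt is about the container aspect, not datasetProfile: the server rejects an aspect name it does not recognise. To see every aspect the server is rejecting this way, the logs can be scanned with something like the sketch below. The regular expression is taken from the single "Unknown aspect ... for entity ..." line above, so it is an assumption that other occurrences are worded the same way; the function name is hypothetical.

```python
import re

# Pattern copied from the one error line in the excerpt; other server
# versions may word this message differently.
UNKNOWN_ASPECT = re.compile(r"Unknown aspect (\S+) for entity (\S+)")


def unknown_aspects(log_lines):
    """Collect (aspect, entity) pairs that the server reported as unknown."""
    hits = set()
    for line in log_lines:
        match = UNKNOWN_ASPECT.search(line)
        if match:
            hits.add((match.group(1), match.group(2)))
    return hits
```

If aspects emitted by the CLI show up in this set, that points at the server not knowing about them yet, rather than at a problem in the recipe.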
s
You mean the local CLI is v0.8.27 and the server is v0.8.14? There have been many changes between those versions. You can see the releases at https://github.com/linkedin/datahub/releases?page=2. I strongly recommend updating the server; with that many changes, the two almost certainly will not work together.
g
I see. Is profiling supported in v0.8.14?
s
Profiling had only recently been added at that point, and there have been many fixes and improvements since then. The docs you are looking at describe the current CLI; we do not currently host docs for older versions. So some of the options you see in the docs might not be present in the older CLI and server.
g
Hi @square-activity-64562. I misread the version. The server version of DataHub is v0.8.20.
s
The same thing applies. You would want to use a CLI version that is the same as the server version. We test the current CLI with the current server version only. You can look at the releases page https://github.com/linkedin/datahub/releases and search for "profil" to see profiling-related changes. If you test with quickstart (latest master, including unreleased changes) https://datahubproject.io/docs/quickstart/ and find a reproducible issue, please file a bug report at https://github.com/linkedin/datahub/issues
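A quick sanity check of the CLI/server pairing before running an ingestion could look roughly like this. It is a minimal sketch: the function names are hypothetical, how the two version strings are obtained is left out, and it assumes purely numeric dotted versions with an optional leading "v", as in the versions mentioned in this thread.

```python
def normalize(version):
    """Turn 'v0.8.14' or '0.8.14' into a tuple of ints like (0, 8, 14).

    Assumption: every dot-separated part is numeric; a suffix such as
    'rc1' would raise ValueError here.
    """
    return tuple(int(part) for part in version.strip().lstrip("vV").split("."))


def same_release(cli_version, server_version, parts=3):
    """Compare the first `parts` components of the two versions."""
    return normalize(cli_version)[:parts] == normalize(server_version)[:parts]
```

With the versions from this thread, same_release("0.8.27", "v0.8.20") is false, which is the situation to avoid.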
g
Thanks @square-activity-64562, this issue is now fixed. Appreciate it.