# getting-started
w
Hi datahubers. I do have questions about aspect versions. When running the metadata-ingestion bootstrap multiple times, I do see that the SchemaMetadata aspect exists in multiple versions but other aspects have only one version. My question is: When does the system decide to create a new version? A related question is about the ingestion agents (metadata producers): Is the idea that they are stateless, e.g. should they run periodically and push metadata into datahub independent of whether metadata has changed or not? And the final question: Assume that a table gets removed from a relational database. How and when should this table be removed from datahub and by whom? Below is the content of the MySQL table where SchemaMetadata has multiple versions although the content is always the same.
> select urn, aspect,version from metadata_aspect where urn='urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD)';
+--------------------------------------------------------------------+-----------------------------------------+---------+
| urn                                                                | aspect                                  | version |
+--------------------------------------------------------------------+-----------------------------------------+---------+
| urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD) | com.linkedin.common.InstitutionalMemory |       0 |
| urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD) | com.linkedin.common.Ownership           |       0 |
| urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD) | com.linkedin.common.PlatformLocation    |       0 |
| urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD) | com.linkedin.dataset.DatasetProperties  |       0 |
| urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD) | com.linkedin.dataset.UpstreamLineage    |       0 |
| urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD) | com.linkedin.schema.SchemaMetadata      |       0 |
| urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD) | com.linkedin.schema.SchemaMetadata      |       1 |
| urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD) | com.linkedin.schema.SchemaMetadata      |       2 |
| urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD) | com.linkedin.schema.SchemaMetadata      |       3 |
+--------------------------------------------------------------------+-----------------------------------------+---------+
b
I can see that there are 3 questions here 1. When is a new version created? 2. Is it okay to keep sending the same MCE? 3. How to remove entities?
1. Is mostly answered in https://github.com/linkedin/datahub/blob/master/docs/what/aspect.md#what-is-a-metadata-aspect, but the tl;dr is that a new version is added when the value is updated (v0 being always the latest version). The reason why you have multiple versions of `SchemaMetadata` is likely due to the always changing `AuditStamp` in the aspect.
2. Because of 1, it is okay for the metadata producer to keep sending the same metadata aspect without keeping track of actual "changes".
3. There is actually no hard-deletion. Removal is simply tracked by another aspect `Status`: https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/Status.pdsc. The responsibility lies with the metadata provider, which if it's truly stateless, will need to do a diff against GMS.
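To make the three answers concrete, here is a rough sketch in Python. It is only an in-memory model of the behaviour described above, not DataHub's actual DAO code: `AspectStore`, `ingest`, and `soft_delete_missing` are hypothetical names, and the `{"removed": True}` payload assumes the `Status` aspect carries a boolean `removed` field as in the linked `Status.pdsc`.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical in-memory stand-in for the metadata_aspect table.
# Key: (urn, aspect name). Value: aspect values indexed by version,
# where index 0 always holds the latest value.
AspectKey = Tuple[str, str]


class AspectStore:
    def __init__(self) -> None:
        self._rows: Dict[AspectKey, List[dict]] = {}

    def ingest(self, urn: str, aspect: str, value: dict) -> bool:
        """Store the value; return True only if something was written."""
        key = (urn, aspect)
        versions = self._rows.get(key)
        if versions is None:
            self._rows[key] = [value]  # first write becomes version 0
            return True
        if versions[0] == value:
            return False  # unchanged value: no new version
        versions.append(versions[0])  # old latest moves to the next free version
        versions[0] = value           # version 0 is overwritten with the new value
        return True

    def versions(self, urn: str, aspect: str) -> List[int]:
        """Version numbers currently stored for one aspect of one URN."""
        return list(range(len(self._rows.get((urn, aspect), []))))

    def urns(self) -> Set[str]:
        """All URNs that have at least one aspect stored."""
        return {urn for urn, _ in self._rows}


def soft_delete_missing(store: AspectStore, source_urns: Set[str]) -> Set[str]:
    """Diff the store against the source and flag vanished URNs as removed.

    Assumption: removal is expressed by writing a Status aspect with
    removed=True; nothing is ever hard-deleted.
    """
    missing = store.urns() - source_urns
    for urn in missing:
        store.ingest(urn, "com.linkedin.common.Status", {"removed": True})
    return missing
```

With this model, a stateless producer can call `ingest` with the same `SchemaMetadata` on every run and only a real change in the value (including an `AuditStamp` that differs per run) adds a version. After each run it would call `soft_delete_missing` with the set of URNs it just saw in the source database.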
w
Thanks for your explanation, very helpful. Since the content of `SchemaMetadata` always uses the same `AuditStamp` and all the rest of the data is exactly the same, I believe that there’s a bug here. The content in the MySQL database is the same except for the version. I’ve opened an issue with some details (#1663). I believe the problem is caused by the “version” field being represented as Integer by the RestLi server but as Long when read from the database.
b
Thanks. We'll look into the issue. Also FYI, `SchemaMetadata` is a legacy model and hence includes many unnecessary fields, such as `version`. We may change that to something more succinct in the near future.