# advice-metadata-modeling
s
Hello, I'm Jongmin Kim from the LG Electronics Data Platform team. Recently, our team has become very interested in DataHub and is analyzing it. During this analysis, a question came up: in DataHub, all data (dataset, corpuser, corpGroup, etc.) is stored in one table, named 'metadata_aspect_v2'. I wonder whether storing all data in a single table causes any performance problems. As far as I know, most systems store data in multiple tables for performance reasons. So my question is: does DataHub have database search performance problems because all data is stored in only one table?
b
The database table is not used for search. It is primarily used to look up data by id, which is part of the primary key. Text search is provided by Elasticsearch, which does separate entities into separate indices. These indices perform very well without additional sharding up to around 2 million entities (i.e. datasets). Beyond that, increasing the shard count to roughly 1 shard per million entities can keep search latency at around 1 second. This is approximate because the amount and type of metadata can vary from one instance to another.
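To make the sizing rule of thumb above concrete, here is a minimal sketch of the arithmetic. The function name `estimate_shard_count` and its parameters are hypothetical and not part of DataHub or Elasticsearch, and the baseline of one shard below the threshold is an assumption. The thresholds are the approximate figures from the reply, so treat the result as a starting point rather than a guarantee.

```python
import math


def estimate_shard_count(
    entity_count: int,
    no_resharding_limit: int = 2_000_000,  # "up to around 2 million entities"
    entities_per_shard: int = 1_000_000,   # "roughly 1 shard per million entities"
    baseline_shards: int = 1,              # assumption: index starts with one shard
) -> int:
    """Rough shard count for one entity index, following the rule of thumb."""
    if entity_count < 0:
        raise ValueError("entity_count must be non-negative")
    if entity_count <= no_resharding_limit:
        # Below the limit, the reply suggests no additional sharding is needed.
        return baseline_shards
    # Above the limit, aim for about one shard per million entities.
    return max(baseline_shards, math.ceil(entity_count / entities_per_shard))


if __name__ == "__main__":
    for count in (500_000, 2_000_000, 5_000_000):
        print(count, estimate_shard_count(count))
```

Because the amount and type of metadata differ between instances, an estimate like this would need to be checked against measured search latency.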
s
Hello, David Leifker. Thank you for your kind explanation. It has been very helpful in understanding DataHub's search and lookup process. I think the page below describes your explanation in detail: https://datahubproject.io/docs/architecture/metadata-serving/ (DataHub Serving Architecture). I have two follow-up questions about your explanation and the contents of that page. Question 1: From your description, the data in DataHub is stored in MySQL for lookup and in Elasticsearch for search. All the data is stored in MySQL for lookup, and the searchable data is stored a second time in Elasticsearch. Is it correct that some data is stored redundantly in both MySQL and Elasticsearch? Question 2: According to the page above (DataHub Serving Architecture), primary-key based reads and secondary-index based reads seem to be routed separately. Can you explain more about this routing? For example, where does the routing logic live, and how is it managed?
b
Question 1: Most data is stored redundantly in MySQL and Elasticsearch. A few things are stored redundantly in Kafka topics and Elasticsearch. For backup and restore, it is possible to back up MySQL and re-create the Elasticsearch indices from it. Question 2: The MySQL database is the primary data store. It contains all versions (historical values). You can think of ES as holding a copy of only the latest/current version of the data. When historical values are needed, or when ES contains only part of the data to be returned, the GMS service tier must fetch from both ES and MySQL.
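The split described above can be sketched with two in-memory stand-ins. This is an illustrative toy model, not DataHub's GMS code: the classes `PrimaryStore` and `SearchIndex`, the function names, and the example URNs are all hypothetical. It also treats the highest version number as the current one, which is a simplification and not necessarily how DataHub numbers versions.

```python
from typing import Dict, List, Optional, Tuple

AspectKey = Tuple[str, str, int]  # (urn, aspect name, version)


class PrimaryStore:
    """Stand-in for the MySQL table: keeps every version, read by key."""

    def __init__(self) -> None:
        self._rows: Dict[AspectKey, dict] = {}

    def put(self, urn: str, aspect: str, version: int, metadata: dict) -> None:
        self._rows[(urn, aspect, version)] = metadata

    def get(self, urn: str, aspect: str, version: int) -> Optional[dict]:
        return self._rows.get((urn, aspect, version))

    def latest_rows(self) -> Dict[Tuple[str, str], dict]:
        """Return only the highest-numbered version of each (urn, aspect)."""
        newest: Dict[Tuple[str, str], Tuple[int, dict]] = {}
        for (urn, aspect, version), metadata in self._rows.items():
            if (urn, aspect) not in newest or version > newest[(urn, aspect)][0]:
                newest[(urn, aspect)] = (version, metadata)
        return {key: metadata for key, (_, metadata) in newest.items()}


class SearchIndex:
    """Stand-in for Elasticsearch: searchable text for the current version only."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}

    def index(self, urn: str, text: str) -> None:
        self._docs[urn] = text.lower()

    def search(self, term: str) -> List[str]:
        return sorted(u for u, text in self._docs.items() if term.lower() in text)


def rebuild_index(store: PrimaryStore) -> SearchIndex:
    """Re-create the search index from the primary store (the restore path)."""
    texts: Dict[str, List[str]] = {}
    for (urn, _aspect), metadata in store.latest_rows().items():
        texts.setdefault(urn, []).extend(str(v) for v in metadata.values())
    index = SearchIndex()
    for urn, parts in texts.items():
        index.index(urn, " ".join(parts))
    return index


def search_then_fetch(store: PrimaryStore, index: SearchIndex, term: str,
                      aspect: str, version: int) -> Dict[str, Optional[dict]]:
    """Find matches in the index, then read a specific version by key."""
    return {urn: store.get(urn, aspect, version) for urn in index.search(term)}
```

In this model a search for text that appears only in an older version finds nothing, because the index holds the current version only, while `search_then_fetch` can still return a historical value by combining the index hit with a keyed read from the primary store.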
s
Hello, David Leifker. Thank you again for your kind explanation. Your answer is very helpful for our analysis of DataHub. I hope we can continue to cooperate here.