Thread
#getting-started

    shy-lizard-17779

    1 year ago
    Hi, I'm new to DataHub and I'm trying to understand the DB model and the source of truth. I saw that this file contains a table: https://github.com/linkedin/datahub/blob/b8e18b0b5d56b4fa69b4bc35e8055176f9577dee/docker/postgres/init.sql However, I don't fully understand whether all of the data (even if there are 10B+ records) will be contained there, whether you have found this approach to scale, and whether other tables are generated on the fly. Is it also an append-only table?

    bumpy-keyboard-50565

    1 year ago
    This is the single table that will hold all the metadata. No schema evolution is needed. That said, MySQL (or other traditional RDBMS) isn't really designed to scale beyond 100s of millions of rows. We're actively working on a NoSQL-based implementation which should scale beyond 100s of billions of rows.
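    For reference, the table in that init.sql looks roughly like this (a simplified sketch; see the linked file for the exact column types):
    ```
    CREATE TABLE metadata_aspect (
      urn       VARCHAR(500) NOT NULL,  -- entity identifier
      aspect    VARCHAR(200) NOT NULL,  -- which piece of metadata this row carries
      version   BIGINT       NOT NULL,  -- version 0 holds the latest value
      metadata  TEXT         NOT NULL,  -- serialized blob of the aspect value
      createdon TIMESTAMP    NOT NULL,
      createdby VARCHAR(255) NOT NULL,
      PRIMARY KEY (urn, aspect, version)
    );
    ```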

    mammoth-bear-12532

    1 year ago
    @shy-lizard-17779 : do you need those 1B+ records to be in the DB, or do you just need a metadata log that can contain all this data and get shipped off to a big data lake (with an interesting subset of that metadata stored in the DB)?

    shy-lizard-17779

    1 year ago
    Basically there are 2-3 large metadata tables that we are collapsing into a system similar to DataHub. The data is internal, for data scientists. I was implementing a DataHub-like system myself (based on Postgres + Elasticsearch) before coming across this. We have about 7B rows, expected to grow, in a 3TB Postgres cluster. To scale the DB we use the Citus plugin for Postgres, which makes it very easy to partition the data among multiple machines with a "multi-master" approach that we found very convenient.
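    For context, distributing an existing table with Citus is a one-liner (illustrative sketch; the table and shard-key names here are made up):
    ```
    -- Shard a table across the Citus cluster by a key column
    SELECT create_distributed_table('entity_metadata', 'entity_id');
    ```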

    mammoth-bear-12532

    1 year ago
    Got it. Do you think you cannot do this with the way Datahub uses MySQL?

    shy-lizard-17779

    1 year ago
    I thought it was DB agnostic (supporting both). We are trying to move away from MySQL, since we believe Citus has made Postgres quite superior in terms of cluster manageability and performance.

    mammoth-bear-12532

    1 year ago
    Yes, it is agnostic, though at LinkedIn we run it on MySQL. I'm of course very curious about the 7B things you have in there 😃 are they entities, relationships, or events?

    shy-lizard-17779

    1 year ago
    Just entities and relationships that our data scientists would search on

    mammoth-bear-12532

    1 year ago
    Got it
    So do you see anything in the DDL that makes you think Citus wouldn't be able to scale this?

    shy-lizard-17779

    1 year ago
    Our first internal prototype combining Postgres and Elastic works very well for what we have to do (in terms of how fast the queries are and how easily we can index). However, schema evolvability is not really easy, and we are trying to figure out how to avoid "not invented here".

    mammoth-bear-12532

    1 year ago
    Hah right... which is why we use blobby storage in the RDBMS for the things we don't need indexes on, and delegate that to Elastic

    shy-lizard-17779

    1 year ago
    I don't fully know what the query model is in datahub

    mammoth-bear-12532

    1 year ago
    I don’t know exactly how your prototype is set up, but I assume you tell Postgres everything about your model

    shy-lizard-17779

    1 year ago
    we split the table in two: one for the aspects, the other for the relationships, so relationships can go both ways
    Yeah, we don't have the best model management yet

    mammoth-bear-12532

    1 year ago
    We use a graph DB (neo4j) for relationship queries,
    MySQL/Postgres for primary-key lookups,
    and Elastic for search and secondary-index lookups.
    There is work in progress to support strongly consistent secondary indexes in the RDBMS in a way that doesn't create agility challenges with schema evolution.
    I would also ask: how fast is your metadata changing, and are you getting a constant stream of changes?
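    To give a rough idea of what a strongly consistent secondary index in the RDBMS could look like (an illustrative sketch only, not the actual work-in-progress design):
    ```
    -- Field values extracted from the serialized blob live beside the aspect
    -- table, so exact-match lookups can stay in the RDBMS
    CREATE TABLE metadata_index (
      id        BIGINT PRIMARY KEY,
      urn       VARCHAR(500) NOT NULL,  -- points back to the aspect table
      aspect    VARCHAR(200) NOT NULL,
      path      VARCHAR(200) NOT NULL,  -- field within the aspect, e.g. '/name'
      stringVal VARCHAR(500)            -- extracted value
    );
    CREATE INDEX idx_value_lookup ON metadata_index (aspect, path, stringVal);
    ```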

    shy-lizard-17779

    1 year ago
    We usually don't need strong consistency for queries. Given the numbers, the updates are fairly frequent, mostly in batches.

    bumpy-keyboard-50565

    1 year ago
    This is actually quite an impressive size you have there. Do you mind sharing your company name?

    mammoth-bear-12532

    1 year ago
    Cool... as you must have noticed, the "metadata" column is the blobby piece, and (urn, aspect, version) make up the primary key
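    So reading the latest value of an aspect is a pure primary-key lookup, roughly like this (the urn and aspect name below are placeholders):
    ```
    SELECT metadata
    FROM metadata_aspect
    WHERE urn = 'urn:li:dataset:(...)'  -- placeholder urn
      AND aspect = 'com.linkedin.dataset.DatasetProperties'  -- example aspect
      AND version = 0;  -- version 0 holds the latest value by convention
    ```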

    bumpy-keyboard-50565

    1 year ago
    I'm assuming it's mostly operational metadata or jobs-related metadata?
    DataHub supports MySQL, Postgres, MarinaDB out of the box. The DB driver should support MS SQL & Oracle in theory but we have not tested those.

    mammoth-bear-12532

    1 year ago
    @bumpy-keyboard-50565 I think you meant MariaDB

    bumpy-keyboard-50565

    1 year ago
    lol yes sorry typo.
    I'm not familiar with Citus, but I assume it's similar to Vitess (https://github.com/vitessio/vitess)?

    mammoth-bear-12532

    1 year ago
    @shy-lizard-17779 : we support customizable retention of versions, so you can keep the last N versions of each metadata entry and the table doesn't grow without bound
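    In SQL terms the idea is roughly the following (illustrative only; this Postgres-flavored query is just a sketch of the effect, not how retention is actually enforced):
    ```
    -- Keep version 0 (the latest) plus the 10 most recent historical
    -- versions of each (urn, aspect) pair
    DELETE FROM metadata_aspect a
    WHERE a.version <> 0
      AND a.version <= (SELECT MAX(b.version) - 10
                        FROM metadata_aspect b
                        WHERE b.urn = a.urn AND b.aspect = a.aspect);
    ```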

    shy-lizard-17779

    1 year ago
    A few things:
    • Versioning is great. In our DB we added an end date (of when the data is replaced), so we can easily look up by date with a simple query.
    • In our Elastic we index not only the entity but also all of the entities it is related to (so we can offer more powerful queries). Is that possible in DataHub? I guess those queries would all have to go to neo4j, given that in the metadata DB everything is serialized.
    Yeah, it sounds similar to Vitess. Actually Citus is now owned by MSFT and also comes with Azure 😃
    Sorry, can't disclose yet.
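    For illustration, the end-date approach makes point-in-time reads a simple range predicate (the table and column names here are hypothetical):
    ```
    -- Each row is valid from created_at until replaced_at (NULL while current)
    SELECT metadata
    FROM entity_history
    WHERE urn = 'urn:example:123'
      AND created_at <= TIMESTAMP '2020-01-01'
      AND (replaced_at IS NULL OR replaced_at > TIMESTAMP '2020-01-01');
    ```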

    bumpy-keyboard-50565

    1 year ago
    More high-level information on retention/versioning: https://github.com/linkedin/datahub/blob/master/docs/what/aspect.md While it's possible to combine metadata from multiple entities into one search index, we generally encourage using a graph DB for those kinds of query patterns.

    mammoth-bear-12532

    1 year ago
    @shy-lizard-17779 : yes, better to let the graph DB handle all the relationships

    shy-lizard-17779

    1 year ago
    Do you feel neo4j will scale? We worry that neo4j might just crash under that load, e.g. a query that returns 100k entities, joining on 4 tables, each with a few conditions. Queries in the future will also involve some text (descriptions). Elasticsearch just scales seamlessly. No matter how much load we added to it, it just worked, and with the SQL extension it is wonderful to use.

    mammoth-bear-12532

    1 year ago
    Does elastic support joins now? I was under the impression that it is a single table abstraction
    So you would need to denormalize

    shy-lizard-17779

    1 year ago
    Some joins, but we don't use them. As I was saying before, we denormalize one level of relationships into nested fields.

    mammoth-bear-12532

    1 year ago
    There is nothing preventing you from doing the same denormalization in datahub as well

    shy-lizard-17779

    1 year ago
    That way you can run queries per document, e.g. SELECT entity FROM index WHERE nested(nicknames, nickname.name="love jeff")

    mammoth-bear-12532

    1 year ago
    Depending on how your metadata change event looks, you will have to tap into the indexing pipeline and "materialize" the denormalization into the index, if that makes sense.

    shy-lizard-17779

    1 year ago
    I see. So to tap into the indexing pipeline, is there already any code saying "when this entity updates, signal the entities that are connected"?
    You can customize it to your heart's content

    microscopic-waitress-95820

    1 year ago
    Specifically, your index builder can "subscribe" to changes in several entities by updating the list of snapshots here: https://github.com/linkedin/datahub/blob/master/metadata-builders/src/main/java/com/linkedin/metadata/builders/search/DatasetIndexBuilder.java#L26

    bumpy-keyboard-50565

    1 year ago
    neo4j 4.x now claims to be horizontally scalable. Internally we have a proprietary in-memory graph DB that scales to LinkedIn's economic graph (billions of nodes & edges) and maintains millisecond query times. Graph performance is purely a function of how much you can hold in memory.

    shy-lizard-17779

    1 year ago
    @bumpy-keyboard-50565 interesting! Are you using, or planning to use, neo4j internally for DataHub, or a custom DB? Do you mind sharing whether you did any scale testing?

    bumpy-keyboard-50565

    1 year ago
    We're currently using neo4j and plan to transition to our custom DB in the next few quarters. The DAO is designed in such a way that it's mostly implementation agnostic. The metadata graph we have currently is still small, O(10M) nodes, so no scaling issues so far.