Thread
#getting-started

    shy-lizard-17779

    1 year ago
    Hi, I'm new to DataHub and I'm trying to understand the DB model and the source of truth. I saw that this file contains a table: https://github.com/linkedin/datahub/blob/b8e18b0b5d56b4fa69b4bc35e8055176f9577dee/docker/postgres/init.sql However, I don't fully understand whether all of the data (even if there are 10B+ records) will be contained there, whether you have found this approach to scale, and whether other tables are generated on the fly. Is it also an append-only table?

    bumpy-keyboard-50565

    1 year ago
    This is the single table that will hold all the metadata. No schema evolution is needed. That said, MySQL (or other traditional RDBMS) isn't really designed to scale beyond 100s of millions of rows. We're actively working on a NoSQL-based implementation which should scale beyond 100s of billions of rows.
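    For reference, the table in that init.sql looks roughly like this (a simplified sketch; see the linked file for the exact column types):
    ```
    CREATE TABLE metadata_aspect (
      urn       VARCHAR(500) NOT NULL,  -- entity identifier
      aspect    VARCHAR(200) NOT NULL,  -- which piece of metadata this row carries
      version   BIGINT       NOT NULL,  -- version 0 holds the latest value
      metadata  TEXT         NOT NULL,  -- serialized blob of the aspect value
      createdon TIMESTAMP    NOT NULL,
      createdby VARCHAR(255) NOT NULL,
      PRIMARY KEY (urn, aspect, version)
    );
    ```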

    mammoth-bear-12532

    1 year ago
    @shy-lizard-17779 : do you need those 1B+ records to be in the DB, or do you just need a metadata log that can contain all this data and get shipped off to a big data lake (with an interesting subset of that metadata stored in the DB)?

    shy-lizard-17779

    1 year ago
    Basically there are 2-3 large metadata tables that we are collapsing into a system similar to DataHub. The data is internal, for data scientists. I was implementing a DataHub-like system myself (based on Postgres + Elasticsearch) before coming across this. We have about 7B rows, expected to grow, in a 3TB Postgres cluster. To scale the DB we use the Citus plugin for Postgres, which makes it very easy to partition the data among multiple machines with a "multi-master" approach that we found very convenient.
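    For context, distributing an existing table with Citus is a one-liner (illustrative sketch; the table and shard-key names here are made up):
    ```
    -- Shard a table across the Citus cluster by a key column
    SELECT create_distributed_table('entity_metadata', 'entity_id');
    ```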

    mammoth-bear-12532

    1 year ago
    Got it. Do you think you cannot do this with the way Datahub uses MySQL?

    shy-lizard-17779

    1 year ago
    I thought it was DB agnostic (supporting both). We are trying to move away from MySQL, since we believe Citus has made Postgres quite superior in terms of cluster manageability and performance.

    mammoth-bear-12532

    1 year ago
    Yes, it is agnostic, though at LinkedIn we run it on MySQL. I'm of course very curious about the 7B things you have in there 😃 are they entities, relationships, or events?

    shy-lizard-17779

    1 year ago
    Just entities and relationships that our data scientists would search on

    mammoth-bear-12532

    1 year ago
    Got it
    So do you see anything in the DDL that makes you think Citus wouldn't be able to scale this?

    shy-lizard-17779

    1 year ago
    Our first internal prototype combining Postgres and Elastic works very well for what we have to do (in terms of how fast the queries are and how easily we can index). However, schema evolvability is not really easy, and we are trying to figure out how to avoid "not invented here".

    mammoth-bear-12532

    1 year ago
    Hah right... which is why we use blobby storage in the RDBMS for the things we don't need indexes on, and delegate that to Elastic

    shy-lizard-17779

    1 year ago
    I don't fully know what the query model is in datahub

    mammoth-bear-12532

    1 year ago
    I don’t know exactly how your prototype is set up, but I assume you tell Postgres everything about your model

    shy-lizard-17779

    1 year ago
    we split the table in two: one for the aspects, the other for the relationships, so relationships can go both ways
    Yeah, we don't have the best model management yet

    mammoth-bear-12532

    1 year ago
    We use a graph DB (neo4j) for relationship queries,
    MySQL/Postgres for primary-key lookups,
    and Elastic for search and secondary-index lookups.
    There is work in progress to support strongly consistent secondary indexes in the RDBMS in a way that doesn't create agility challenges with schema evolution.
    I would also ask: how fast is your metadata changing, and are you getting a constant stream of changes?
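    To give a rough idea of what a strongly consistent secondary index in the RDBMS could look like (an illustrative sketch only, not the actual work-in-progress design):
    ```
    -- Field values extracted from the serialized blob live beside the aspect
    -- table, so exact-match lookups can stay in the RDBMS
    CREATE TABLE metadata_index (
      id        BIGINT PRIMARY KEY,
      urn       VARCHAR(500) NOT NULL,  -- points back to the aspect table
      aspect    VARCHAR(200) NOT NULL,
      path      VARCHAR(200) NOT NULL,  -- field within the aspect, e.g. '/name'
      stringVal VARCHAR(500)            -- extracted value
    );
    CREATE INDEX idx_value_lookup ON metadata_index (aspect, path, stringVal);
    ```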

    shy-lizard-17779

    1 year ago
    We usually don't need strong consistency for queries. Given the numbers, the updates are fairly frequent, mostly in batches.

    bumpy-keyboard-50565

    1 year ago
    This is actually quite an impressive size you have there. Do you mind sharing your company name?

    mammoth-bear-12532

    1 year ago
    Cool... as you must have noticed, the "metadata" column is the blobby piece, and (urn, aspect, version) make up the primary key
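    So reading the latest value of an aspect is a pure primary-key lookup, roughly like this (the urn and aspect name below are placeholders):
    ```
    SELECT metadata
    FROM metadata_aspect
    WHERE urn = 'urn:li:dataset:(...)'  -- placeholder urn
      AND aspect = 'com.linkedin.dataset.DatasetProperties'  -- example aspect
      AND version = 0;  -- version 0 holds the latest value by convention
    ```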

    bumpy-keyboard-50565

    1 year ago
    I'm assuming it's mostly operational metadata or jobs-related metadata?
    DataHub supports MySQL, Postgres, MarinaDB out of the box. The DB driver should support MS SQL & Oracle in theory but we have not tested those.

    mammoth-bear-12532

    1 year ago
    @bumpy-keyboard-50565 I think you meant MariaDB

    bumpy-keyboard-50565

    1 year ago
    lol yes sorry typo.
    I'm not familiar with Citus, but I assume it's similar to Vitess (https://github.com/vitessio/vitess)?

    mammoth-bear-12532

    1 year ago
    @shy-lizard-17779 : we support customizable retention of versions, so you can keep the last N versions of each metadata entry and the table doesn't grow without bound
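    In SQL terms the idea is roughly the following (illustrative only; this Postgres-flavored query is just a sketch of the effect, not how retention is actually enforced):
    ```
    -- Keep version 0 (the latest) plus the 10 most recent historical
    -- versions of each (urn, aspect) pair
    DELETE FROM metadata_aspect a
    WHERE a.version <> 0
      AND a.version <= (SELECT MAX(b.version) - 10
                        FROM metadata_aspect b
                        WHERE b.urn = a.urn AND b.aspect = a.aspect);
    ```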

    shy-lizard-17779

    1 year ago
    A few things:
    • Versioning is great. In our DB we added an end date (of when the data is replaced), so we can easily look up by date with a simple query.
    • In our Elastic we index not only the entity but also all of the entities it is related to (so we can offer more powerful queries). Is that possible in DataHub? I guess those queries would all have to go to neo4j, given that in the metadata DB everything is serialized.
    Yeah, it sounds similar to Vitess. Actually Citus is now owned by MSFT and also comes with Azure 😃
    Sorry, can't disclose yet.
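    For illustration, the end-date approach makes point-in-time reads a simple range predicate (the table and column names here are hypothetical):
    ```
    -- Each row is valid from created_at until replaced_at (NULL while current)
    SELECT metadata
    FROM entity_history
    WHERE urn = 'urn:example:123'
      AND created_at <= TIMESTAMP '2020-01-01'
      AND (replaced_at IS NULL OR replaced_at > TIMESTAMP '2020-01-01');
    ```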

    bumpy-keyboard-50565

    1 year ago
    More high-level information on retention/versioning: https://github.com/linkedin/datahub/blob/master/docs/what/aspect.md While it's possible to combine metadata from multiple entities into one search index, we generally encourage using a graph DB for those kinds of query patterns.

    mammoth-bear-12532

    1 year ago
    @shy-lizard-17779 : yes, better to let the graph DB handle all the relationships

    shy-lizard-17779

    1 year ago
    Do you feel neo4j will scale? We worry that neo4j might just crash under that load, e.g. a query that returns 100k entities, joining on 4 tables, each with a few conditions. Queries in the future will also involve some text (descriptions). Elasticsearch just scales seamlessly. No matter how much load we added to it, it just worked, and with the SQL extension it is wonderful to use.

    mammoth-bear-12532

    1 year ago
    Does elastic support joins now? I was under the impression that it is a single table abstraction
    So you would need to denormalize

    shy-lizard-17779

    1 year ago
    Some joins, but we don't use them. As I was saying before, we denormalize one level of relationships into nested fields.

    mammoth-bear-12532

    1 year ago
    There is nothing preventing you from doing the same denormalization in datahub as well

    shy-lizard-17779

    1 year ago
    That way you can run queries per document, e.g. SELECT entity FROM index WHERE nested(nicknames, nickname.name="love jeff")

    mammoth-bear-12532

    1 year ago
    Depending on how your metadata change event looks, you will have to tap into the indexing pipeline and "materialize" the denormalization into the index, if that makes sense.

    shy-lizard-17779

    1 year ago
    I see. So to tap into the indexing pipeline, is there already any code saying "when this entity updates, signal the entities that are connected"?
    You can customize it to your heart's content

    microscopic-waitress-95820

    1 year ago
    Specifically, your index builder can "subscribe" to changes in several entities by updating the list of snapshots here: https://github.com/linkedin/datahub/blob/master/metadata-builders/src/main/java/com/linkedin/metadata/builders/search/DatasetIndexBuilder.java#L26

    bumpy-keyboard-50565

    1 year ago
    neo4j 4.x now claims to be horizontally scalable. Internally we have a proprietary in-memory graph DB that scales to LinkedIn's economic graph (billions of nodes & edges) and maintains millisecond query times. Graph performance is purely a function of how much you can hold in memory.

    shy-lizard-17779

    1 year ago
    @bumpy-keyboard-50565 interesting! Are you using, or planning to use, neo4j internally for DataHub, or a custom DB? Do you mind sharing whether you did any scale testing?

    bumpy-keyboard-50565

    1 year ago
    We're currently using neo4j and plan to transition to our custom DB in the next few quarters. The DAO is designed in such a way that it's mostly implementation agnostic. The metadata graph we have currently is still small, O(10M) nodes, so no scaling issues so far.