Just throwing this out there for anyone who wants ...
# contribute-code
m
Just throwing this out there for anyone who wants to run with it. At my previous company I had built a basic catalog that automatically pulled in database column info, and allowed end users to add descriptions to each database/column/table. Pretty standard right there. But the other thing I did was allow users to link columns together (identifying, for example, that the
id
column on the
product
table was the same as the
product_id
on the
purchase
table, and that the
customer_id
column there was the same as
personid
on the
customer
table, etc etc. What that then allowed for was someone could say "I'm trying to join
product
and `customer`" and the app would then determine the shortest path of JOIN clauses to do that, and give them starter SQL. In this example, that would be:
Copy code
SELECT *
FROM product P
   JOIN purchase P2 ON P2.product_id = P.id
   JOIN customer C ON C.personid = P2.customer_id
Sometimes it found inappropriate shorter paths (such as joining source systems that had no real relationship to each other, but happened to contain a column from a common place), but for the most part it was really helpful, especially when it found the complex paths that required many JOINs to get the job done (like, I don't know, joining a user action to the salesperson responsible for their access; that's a weird example and I don't think anyone would want that, but the system could do it). It's kind of like lineage, except it's lateral and not hierarchical. Anyway, it was fun to do but writing code for DataHub is outside of my wheelhouse at the moment, but if someone out there is looking for a cool little project, this might be interesting.
b
Chris - this sounds like a super useful capability. We've definitely heard some requests for identifying foreign key relationships more clearly (along with visualizing them) inside DataHub. I really like your approach. Did you auto-extract this information from the data stores along with allowing users to define the relationships? Did you have any issues related the manual effort required to add these relationships? Or them becoming stale over time?
m
The only auto-extraction was the table/column structure information, no looking for foreign key definitions. I didn't even think of that, actually. But in my particular case it probably would not have helped. This was a Redshift-based data lake that pulled in data from many different source systems that did not talk with one another; it was only when they all came together that anyone could try to trace anything (such as which user bought which product and what did they do inside it; the source activity database would have a user ID passed to it, but it had no sense that it actually came from some other user database). What we had were replicas of data from unconnected systems; there's very close to a 0% chance any foreign key identification across source systems was done. But it's possible that was done within a source system, so that would be worth looking into. Manual effort was limited only by the willingness of people to do the crowdsource work. That company had no kind of data governance in place (even casually; I was pretty much it) so there were no actual owners of the data. The people who used my app were the data analysts who would populate fields they were familiar with or used frequently. That left a huge percentage of columns and tables with no information about them, but we had a good little application for the tables that got used a lot. As for staleness: the auto-extract was done every three hours (I also used this system to alert of changes since there was very poor communication about any database work being done), so the structure was always up to date. For user-supplied info: definitions/comments are just as volatile as anywhere else (but it would be cool, come to think of it, to flag any changed columns; like if the data type changed from int to varchar, someone should be alerted to confirm the definition is still good). Any links between columns could likewise go stale, but the nature of that seems a little less of a risk. Once you identify "these columns are all the product ID" it's pretty rare for any of those to change (famous last words, I know). Adding new columns to the list of "product ID" columns is not a problem. I used a table that mapped an ID/descriptor (such as "product ID from System X") to a list of column IDs that were bracketed, something like "[1][5][12][23]" to show that those four columns are joinable. A bit messy, I know, but I'm pretty sure I had a reason for doing that as opposed to having them in separate key-multiple-value pairs. (It may not have been a good reason, but I probably had one. If someone were to look at my code it would either say "I did it this way for [this very good reason]" or "I wanted to do it [this other way] but couldn't because of [limiting circumstance] so if that limitation gets removed, you can rewrite this section.")