# random
Question(s) of the Week: What are the biggest hiccups you’ve faced in rolling out DataHub to your wider team/organization? What resources, frameworks, or features would help drive DataHub adoption? @big-carpet-38439, @echoing-airport-49548 and I are excited for your feedback! Please leave your comments in the 🧵 and we’ll pick a random DataHub Swag winner next Monday, August 29th
One of the biggest things we have encountered is how to make sure a deployment of DataHub to GKE is secured. You have done an amazing job documenting the deployment steps and various usage tips, but I have found very little information on best practices for securing and hardening a Kubernetes deployment of DataHub. We really love all the features DataHub offers, but we also want to be absolutely sure that whatever data we put in cannot be improperly accessed, especially if we store any of our customers' data. I imagine other people have similar concerns, so any guidance you can provide here would help with adoption. A few things we have found so far:
1. By default, the root account is `datahub/datahub`, and there was no documentation on how to override this; I only learned how by asking on Slack.
2. It is impossible to change the `datahub` username to something else because the urn is hardcoded in many different places. For example, when I changed it to a different username, I found that I wasn't able to invite users.
3. By default, the GMS has an external load balancer and therefore an external IP address, and it doesn't have any auth! Per the discussion with the team at office hours earlier, the LB should be off by default.
As shared with John and Maggie previously, we're trying to tackle the problem of letting third parties with their own databases push their metadata into DataHub. The issue is that the DataHub CLI is *too powerful*: people can send delete commands (and possibly policies?) to the REST endpoint and delete stuff they have no business touching, and the REST endpoint does not check your privileges. As an admin, that is a problem. One alternative is the Kafka topic, which does not accept token authorisation. The last alternative is UI ingestion, which is also troublesome because it needs network access between databases of different projects, and that makes our network folks uncomfortable since we generally keep the backends of separate teams separated for IT security reasons. Currently, I am toying with the idea of designating networked folders that people can drop their files (generated from `datahub ingest`) into; I would pick them up and scan them for legitimate content before ingesting into DataHub via REST. (The workflow remains to be worked out.) Or a separate API to receive these files. However, I will lose stateful ingestion and so will need to create my own version. I am curious whether anyone else has similar concerns, or is able to tolerate them.
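The "scan for legit content" step could be a small gatekeeper that walks the drop folder and only forwards records that are plain upserts against URNs the submitting team owns. A rough sketch, assuming the dropped files contain JSON arrays of metadata change proposals with `entityUrn`, `changeType`, and `aspectName` fields (field names and the URN allowlist are illustrative assumptions, not a confirmed DataHub file format — verify against your `datahub ingest` file output):

```python
import json
from pathlib import Path

# Hypothetical allowlist: URN prefixes a given team is permitted to write to.
ALLOWED_URN_PREFIXES = ("urn:li:dataset:(urn:li:dataPlatform:postgres",)


def is_legit(record: dict) -> bool:
    """Accept only plain upserts against URNs in the allowlist.

    Rejects anything that is not an UPSERT, anything that looks like a
    soft delete (assumed here to travel as a "status" aspect), and any
    record targeting a URN outside the allowlist.
    """
    if record.get("changeType", "UPSERT") != "UPSERT":
        return False
    if record.get("aspectName") == "status":
        return False
    urn = record.get("entityUrn", "")
    return urn.startswith(ALLOWED_URN_PREFIXES)


def scan_drop_folder(folder: Path) -> list:
    """Collect records from dropped JSON files that pass validation."""
    accepted = []
    for path in sorted(folder.glob("*.json")):
        for record in json.loads(path.read_text()):
            if is_legit(record):
                accepted.append(record)
    return accepted
```

Whatever passes the filter would then be forwarded to the REST sink under the gatekeeper's own credentials, so the third parties never need direct access to GMS. Stateful ingestion would still have to be reimplemented on top, e.g. by tracking which URNs each team has previously submitted.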