hello I ve got a few questions on functionality that I can DataHub #getting-started


# getting-started

brave-forest-5974

11/05/2021, 8:22 AM

Hello, I've got a few questions on functionality that I can't work out from the docs or the demo site. 🧵

brave-forest-5974

11/05/2021, 8:28 AM

Search, specifically ranking: It looks like ranking can only be affected by giving a boost (extra weighting) to particular fields in an Aspect. Now that product analytics are being ingested as well (I think?), are we able to use these to improve ranking? For example:

• entities that are viewed more often should come higher in the ranking

• entities viewed by other members of my team should be higher in my ranking

Also, for entities that have usage statistics, say BigQuery tables, can that information be used to improve ranking? For example:

• a table queried 10,000 times in the past week should rank higher than a table queried 10 times.

plus1 2

brave-forest-5974

11/05/2021, 8:30 AM

Capturing changes made in the UI: If a user makes changes to any metadata using the UI, I would like to capture that event and stream the change to our own data warehouse. How could I do that?

plus1 1

brave-forest-5974

11/05/2021, 8:33 AM

Lineage scaling: For a given entity, I see that the lineage view initially loads the immediate upstreams and downstreams. At what number of immediate upstreams/downstreams does the UI start to suffer?

brave-forest-5974

11/05/2021, 8:37 AM

GraphQL Endpoint security: The documentation mentions that improving the security of the API is "on the horizon". What's the timescale on this?

plus1 1

brave-forest-5974

11/05/2021, 8:41 AM

Upgrading: It's always possible that an upgrade will fail. Is there documentation on options for taking and restoring backups and reverting without losing data, or for cloning into a new environment for a "blue-green" release, in case an upgrade fails?

brave-forest-5974

11/05/2021, 8:42 AM

General scaling: What are the most common places you see scaling bottlenecks? What rules of thumb do you use when sizing the infrastructure to serve 10,000, 100,000, or 1,000,000 entities?

mammoth-bear-12532

11/08/2021, 5:59 PM

Hi @brave-forest-5974, great questions. Answers inline:

1. Ranking: We will have basic ranking functionality in the OSS project, but the more advanced version, which uses a continuously trained model that takes into account various signals (usage, graph connectivity, data quality, etc.), is only available in the Acryl Data managed version of DataHub. We have made the interfaces pluggable so that open source adopters can plug in their own implementations if necessary.

2. ETL: This is very easy to do. Just subscribe to the two Kafka topics, MetadataAuditEvent_v4 and MetadataChangeLog_Versioned_v1, and ETL them to your lake or data warehouse.

3. Lineage: We currently cap the lineage viz at 100 upstreams and downstreams for a single node, so the UI shouldn't suffer too much. We plan to have the UI and API support a filtering and search experience within the lineage graph, to allow navigating really dense graphs.

4. GraphQL API security: We are targeting a release of access token support in the next couple of months. cc @big-carpet-38439, who can add more context here.

5. Upgrades: For primary data, we recommend using managed services (like AWS RDS or similar) and their backup-restore functionality. For restoring indexes, we have a helpful guide here (https://datahubproject.io/docs/how/restore-indices/). In our managed offering, we follow a similar approach for our customer deployments.

6. Scaling: The DataHub services are stateless, so they are theoretically infinitely scalable horizontally. The bottlenecks we are currently aware of are in the metadata service, when batch-ingesting tons of metadata (e.g. ingestion of 2K+ Looker entities takes about 15 seconds with parallel REST calls; there is quite a bit of low-hanging fruit here). For up to a million entities with an average metadata footprint of 1MB per entity, we have seen customers comfortably served by the large instance versions of the specific technologies (RDS, Elastic, Kafka), with a minimum of 3 hosts for the distributed systems. UI usage is typically not where you will see bottlenecks; performance only becomes a concern when you start using DataHub for programmatic use cases. At that point the workload matters a lot (e.g. whether these are predominantly primary-key-based queries or search queries). [All primary-key-based reads go to MySQL / Cassandra. Only graph queries and search queries hit Elastic.]
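A rough sketch of the ETL idea in point 2, in case it helps. It assumes the third-party kafka-python client, a broker at localhost:9092, and a consumer group named warehouse-etl; none of those come from this thread. Only the two topic names do. The event payloads are treated as opaque bytes here, since deserializing them depends on how the deployment's schemas are set up, and `forward_changes`, `Record`, and `run_against_broker` are hypothetical names for illustration.

```python
import json
from collections import namedtuple

# Topic names as given in the reply above.
TOPICS = ("MetadataAuditEvent_v4", "MetadataChangeLog_Versioned_v1")

# Stand-in for a consumer record. kafka-python's ConsumerRecord exposes
# the same three attributes (among others), so either type works below.
Record = namedtuple("Record", ["topic", "offset", "value"])


def forward_changes(records, sink):
    """Pass every record from the DataHub topics to `sink`; return the count.

    `records` is any iterable of objects with .topic, .offset and .value
    (bytes). `sink` is a callable that loads one row into the warehouse.
    """
    forwarded = 0
    for record in records:
        if record.topic not in TOPICS:
            continue  # ignore anything from unrelated topics
        sink({
            "topic": record.topic,
            "offset": record.offset,
            # Payload kept as hex so the row is JSON-safe; decoding the
            # event body is deliberately left out of this sketch.
            "payload_hex": record.value.hex(),
        })
        forwarded += 1
    return forwarded


def run_against_broker():
    """Assumed wiring to a real broker; adjust servers and group id."""
    from kafka import KafkaConsumer  # third-party: pip install kafka-python

    consumer = KafkaConsumer(
        *TOPICS,
        bootstrap_servers="localhost:9092",  # assumption
        group_id="warehouse-etl",            # assumption
        auto_offset_reset="earliest",
    )
    # Printing JSON lines stands in for a real warehouse loader.
    forward_changes(consumer, lambda row: print(json.dumps(row)))
```

Calling `run_against_broker()` blocks and streams events as they arrive. Keeping `forward_changes` independent of the client makes it easy to swap in a different Kafka library or a real warehouse writer.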

big-carpet-38439

11/08/2021, 6:03 PM

4. The ability to generate an access token for use against the GraphQL API via an Authorization header will be coming late December or early January. I'd love to know how you would like to use the API 🙂
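For anyone planning ahead, here is a minimal sketch of what sending a token in an Authorization header could look like once that lands. The Bearer scheme, the /api/graphql path, and the helper name `build_graphql_request` are assumptions for illustration, not confirmed details of the upcoming feature. The query is a schema-agnostic placeholder.

```python
import json
import urllib.request


def build_graphql_request(base_url, token, query):
    """Build (but do not send) a POST request carrying an access token."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/graphql",  # assumed endpoint path
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed token scheme
        },
        method="POST",
    )


# Placeholder host and token; "{ __typename }" is valid against any schema.
request = build_graphql_request("http://localhost:9002", "my-token", "{ __typename }")
```

Passing `request` to `urllib.request.urlopen` would actually send it; that step is left out so the sketch runs without a server.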

brave-forest-5974

11/09/2021, 1:34 PM

❤️ thank you both very much
