  • little-megabyte-1074 (2 years ago)
    Hi folks! My team is going to be building out a POC for DataHub in our upcoming sprint and I’m curious if anyone has suggestions/thoughts about the best way to collect usage analytics? I want to be able to quantify how folks are using the tool, what features have the highest adoption, etc. Our company uses Segment, but I’m not sure we want to invest too much time integrating with Segment for a POC.
    5 replies
  • billowy-eye-48149 (2 years ago)
    Hello team, why is the mysql container setup missing in the new docker-compose file?
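    For context, the kind of service definition being asked about looks roughly like this in docker-compose syntax (a minimal sketch; the image tag, credentials, and volume name are placeholders, not DataHub's actual settings):
      mysql:
        image: mysql:5.7
        environment:
          - MYSQL_DATABASE=datahub        # database used by the metadata service
          - MYSQL_USER=datahub
          - MYSQL_PASSWORD=datahub
          - MYSQL_ROOT_PASSWORD=datahub
        ports:
          - "3306:3306"
        volumes:
          - mysqldata:/var/lib/mysql      # assumes a top-level "mysqldata" volume is declared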
    12 replies
  • calm-minister-22324 (2 years ago)
    Hi everyone o/ I've been reading the documentation and want to make sure DataHub supports my use case and that I've understood it correctly.
    Use case: expose an API that returns the date of the latest available data in external tables on Redshift.
    Proposed solution: create a SQL ingestion that pushes the latest samples from those tables as datasets, then query those datasets to extract the latest available date.
    Would this be correct?
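    As an illustration of the proposed model (a hypothetical shape only, not an existing DataHub schema), each external table would be pushed as a dataset whose properties carry the latest available date for the API to read back:
      dataset:
        urn: "urn:li:dataset:(urn:li:dataPlatform:redshift,spectrum.events,PROD)"   # hypothetical external table
        properties:
          latestAvailableDate: "2021-03-04"   # the value the API would expose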
    6 replies
  • fast-exabyte-18411 (2 years ago)
    Hi all, working on a POC of DataHub. Running into some issues connecting to our external Kafka provider, as it uses SSL, and Basic Auth for the schema registry. I was hoping there would be a way to configure the consumers to use SSL via environment variables, but it looks like the Spring Kafka library doesn't support these specific configs. There's some Stack Overflow discussion implying it at least isn't supported for SSL: https://stackoverflow.com/questions/51316017/spring-boot-spring-kafka-ssl-configuration-by-environment-variables-impossible So it seems this configuration has to be done here: https://github.com/linkedin/datahub/blob/master/metadata-jobs/mae-consumer-job/src/main/resources/application.properties Has anybody else run into this? I'm thinking of testing out a fix locally and opening a PR; would love to know if I'm missing something.
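    For reference, the kind of client configuration involved looks roughly like this when expressed as Spring Boot pass-through properties (shown here in application.yml form as a sketch; the paths and credentials are placeholders, and the consumer job's actual application.properties may expose different keys):
      spring:
        kafka:
          properties:
            security.protocol: SSL                                   # SSL to the Kafka brokers
            ssl.truststore.location: /mnt/secrets/kafka.truststore.jks
            ssl.truststore.password: changeit
            basic.auth.credentials.source: USER_INFO                 # schema registry basic auth
            basic.auth.user.info: "registry-user:registry-password"
            schema.registry.url: https://schema-registry.internal:8081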
    42 replies
  • silly-apple-97303 (2 years ago)
    I'm trying to configure access to schema registry with basic auth enabled for the GMS. I was able to configure schema registry access for the MAE/MCE services with the following env variables:
    - name: SPRING_KAFKA_PROPERTIES_BASIC_AUTH_CREDENTIALS_SOURCE
      value: USER_INFO
    - name: SPRING_KAFKA_PROPERTIES_BASIC_AUTH_USER_INFO
      valueFrom:
        secretKeyRef:
          name: "kafka-schema-registry-credentials"
          key: "user-info"
    And the logs from both MAE/MCE look like:
    16:21:26.721 [main] INFO  i.c.k.s.KafkaAvroDeserializerConfig - KafkaAvroDeserializerConfig values: 
    	schema.registry.url = [redacted]
    	basic.auth.user.info = [hidden]
    	auto.register.schemas = true
    	max.schemas.per.subject = 1000
    	basic.auth.credentials.source = USER_INFO
    	schema.registry.basic.auth.user.info = [hidden]
    	specific.avro.reader = false
    	value.subject.name.strategy = class io.confluent.kafka.serializers.subject.TopicNameStrategy
    	key.subject.name.strategy = class io.confluent.kafka.serializers.subject.TopicNameStrategy
    
    16:21:26.857 [main] INFO  o.a.kafka.common.utils.AppInfoParser - Kafka version: 2.2.1-cp1
    However, doing the same for the GMS is not working. Specifically I get these warn log messages on startup and the configs are not attached to the serializer:
    16:20:23.481 [main] INFO  i.c.k.s.KafkaAvroSerializerConfig - KafkaAvroSerializerConfig values: 
    	schema.registry.url = [redacted]
    	max.schemas.per.subject = 1000
    
    16:20:24.213 [main] WARN  o.a.k.c.producer.ProducerConfig - The configuration 'basic.auth.user.info' was supplied but isn't a known config.
    16:20:24.215 [main] WARN  o.a.k.c.producer.ProducerConfig - The configuration 'basic.auth.credentials.source' was supplied but isn't a known config.
    
    16:20:24.217 [main] INFO  o.a.kafka.common.utils.AppInfoParser - Kafka version: 2.3.0
    When digging into this I noticed the MAE/MCE are using Kafka 2.2.1-cp1 (the Confluent Platform build) while the GMS is using 2.3.0 (the non-Confluent build). I'm thinking regular non-Confluent clients might not support the same set of schema registry configurations.
    13 replies
  • swift-account-97627 (2 years ago)
    I've noticed various references to a "top consumers" feature in the front-end code, but nowhere else. I'm guessing this is an internal feature? I don't see it in the roadmap. Is it anticipated to be open sourced at some point or is it internal forever?
    2 replies
  • swift-account-97627 (2 years ago)
    I have a number of related use cases for what I'd describe as "Data Profile" and/or "Data Quality" attributes, most of which are per-field. For example: completeness (~non-null percentage), distinct values, or histograms of distinct values in enumeration-like fields. I've been doing some quick-and-dirty prototyping by adding these attributes to the SchemaField model, but that feels wrong. It seems like "Data Profile" is really a separate aspect to "Schema", but they both contain per-field information, so I'm not sure how best to model this. I could add "DataProfile" to the set of dataset aspects, but then I'd have two aspects containing field-level information (SchemaMetadata.SchemaFields and something like DataProfile.FieldProfiles). If this is the correct model, what would be a good way to associate each particular FieldProfile with a particular SchemaField? Or is there a different model that would be better? More generally, it seems like there's a tension between two models for field-level aspects:
    1. Dataset has many Aspects, some of which have metadata for many Fields
    2. Dataset has many Fields, some of which have multiple Aspects
    I don't have an opinion on which of these models is more "correct", but the current implementation only seems to really support one aspect per field, and pushes any extensions to favour model (1) above. Is this a conscious design decision, or has this question just not come up yet?
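    One way to picture the association (purely illustrative; DataProfile is not an existing DataHub aspect): each FieldProfile repeats the fieldPath that SchemaMetadata uses for its SchemaFields, and that shared key is the join between the two aspects:
      dataProfile:
        fieldProfiles:
          - fieldPath: "user.email"          # same fieldPath as the corresponding SchemaField
            nonNullProportion: 0.97
            distinctValueCount: 10432
          - fieldPath: "user.country"
            nonNullProportion: 1.0
            valueFrequencies:                # histogram for an enumeration-like field
              US: 5100
              DE: 1200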
    9 replies
  • high-hospital-85984 (2 years ago)
    Hi all! Complete newbie question. After running quickstart.sh and ingestion/ingestion.sh I see items in the Upstream/Downstream tables in the "Relationship" tab for all dummy datasets. However, the lineage data is empty, as is the graph visualisation. Am I missing something?
    6 replies
  • able-garden-99963 (2 years ago)
    Hi DataHub team! A couple of quick questions:
    1. Is it possible to add constraints to entity fields (say, my CorpUser entity has an "email" field and I want it to be unique)?
    2. Is it possible to add custom queries based on entity fields (say, my CorpUser entity has an "email" field and I want to query CorpUsers by email)?
    Thanks!
    4 replies
  • high-hospital-85984 (2 years ago)
    Another question, from a newb trying to understand how DataHub works. In datahub/metadata-models/src/main/pegasus/com/linkedin/metadata/snapshot/ we define e.g. ChartSnapshot and MLModelSnapshot. However, in Snapshot.pdl we only list MLModelSnapshot, and not ChartSnapshot, in the union. Why is that? Similarly, in datahub/metadata-models/src/main/pegasus/com/linkedin/metadata/entity/ we define a ChartEntity but not an MLModelEntity, and the ChartEntity is not listed in the union in Entity.pdl. Why is that?
    12 replies