https://datahubproject.io
# ingestion
  • a

    agreeable-hamburger-38305

    10/05/2021, 11:11 PM
    Hi, I’ve been ingesting from BigQuery with profiling enabled. Does anyone know why it’s successful for some of the tables? The run finishes with no failures or warnings, but only some of the tables in the UI have column-level stats, others just show “no data”
    l
    • 2
    • 3
  • a

    agreeable-hamburger-38305

    10/06/2021, 5:39 AM
    Is it possible to skip columns with a certain name while doing SQL profiling?
    b
    r
    • 3
    • 3
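Regarding the profiling question above: SQL-based sources of this era expose a `profile_pattern` allow/deny list that controls which tables get profiled; a column-level deny is not clearly available in this version, so the sketch below is table-level only (field names assumed from the BigQuery source config, regex values illustrative):

```yaml
source:
  type: bigquery
  config:
    profiling:
      enabled: true
    # Regexes controlling which tables are profiled at all;
    # skipping individual columns by name may require newer versions or code changes.
    profile_pattern:
      deny:
        - ".*\\.tmp_.*"
```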
  • b

    boundless-room-44377

    10/06/2021, 8:45 PM
    Hi, is it possible to create lineage MCEs between ML models and Datasets/FeatureTables? Looking at this code in the Python lib:
    Copy code
    from typing import List

    from datahub.metadata.schema_classes import (
        DatasetLineageTypeClass,
        DatasetSnapshotClass,
        MetadataChangeEventClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    def make_lineage_mce(
        upstream_urns: List[str],
        downstream_urn: str,
        lineage_type: str = DatasetLineageTypeClass.TRANSFORMED,
    ) -> MetadataChangeEventClass:
        mce = MetadataChangeEventClass(
            proposedSnapshot=DatasetSnapshotClass(
                urn=downstream_urn,
                aspects=[
                    UpstreamLineageClass(
                        upstreams=[
                            UpstreamClass(
                                dataset=upstream_urn,
                                type=lineage_type,
                            )
                            for upstream_urn in upstream_urns
                        ]
                    )
                ],
            )
        )
        return mce
    it looks like a
    DatasetSnapshotClass
    is required when defining the downstream URN, while I would like to be able to set that as an ML Model or even a FeatureTable
    g
    • 2
    • 8
  • r

    rough-eye-60206

    10/06/2021, 9:16 PM
    Hi, can I ingest metadata with just descriptions for tables, kept in a separate file (the tables' columns/datatypes have already been ingested from Hive)? Can I achieve this using APIs/emitters? Could someone please provide an example?
    g
    m
    • 3
    • 3
  • b

    better-orange-49102

    10/07/2021, 7:20 AM
    I noticed that the browse path for datasets is in the format
    /prod/source/dataset_name
    ; I'm just wondering if there are any implications of renaming it so that it becomes
    /prod/source/x
    , where x is a short string that is not necessarily unique. I don't see any issues navigating to the datasets if I use the same value of x for all datasets.
    g
    b
    • 3
    • 4
  • a

    agreeable-hamburger-38305

    10/07/2021, 5:43 PM
    Hi all! I’m having trouble ingesting sample queries from BigQuery. Monthly queries and Top Users are successfully shown in the UI, but “No Sample Queries” in the “Queries” tab. Does anyone know what the reason might be? I made sure the table was queried in the past day, and I have access to the Project History Query in the BigQuery GCP console. Here’s the yaml I used:
    Copy code
    source:
      type: bigquery-usage
      config:
        env: "DEV"
        projects:
          - <project-id>
        top_n_queries: 10
    
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"
    m
    m
    • 3
    • 12
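For reference on the sample-queries question above: the bigquery-usage source of this era also accepts an explicit usage time window, which can matter when queries fall just outside the default lookback. Exact field names may vary by version, so treat this as a sketch:

```yaml
source:
  type: bigquery-usage
  config:
    env: "DEV"
    projects:
      - <project-id>
    top_n_queries: 10
    # Explicit usage window (defaults cover roughly the last day).
    start_time: "2021-10-06T00:00:00Z"
    end_time: "2021-10-07T00:00:00Z"
```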
  • g

    gentle-father-80172

    10/07/2021, 10:40 PM
    Hi all - question - I see that
    Looker
    metadata ingestion has a nice set of permissions so that I can query the API. However,
    LookML
    ingestion via the API requires Admin-level access. Is there a specific reason for this? I'm trying to get Admin access from my boss, so any extra info would be useful! Thanks! 😄
    m
    • 2
    • 3
  • c

    calm-sunset-28996

    10/08/2021, 12:35 PM
    Hey, at the moment we have some Qlik dashboards ingested and we would like to add lineage for them. However we don't have charts yet. I saw that you can't add lineage to dashboards, only to charts. Would it be a big refactor to add this? I can have a look at this if it's limited in scope.
    plus1 2
    l
    m
    +2
    • 5
    • 7
  • c

    clean-piano-28976

    10/08/2021, 2:23 PM
    Hi all 👋🏾– I wonder if anyone can help as I’m having difficulties ingesting data from Looker (both
    Looker
    and
    LookML
    )?
    ✅ 1
    m
    • 2
    • 7
  • n

    numerous-yak-58823

    10/08/2021, 7:43 PM
    Any plans for integrating Spline in DataHub? It would unlock really powerful features like Spark automatic lineage even at column level.
    👍 1
    l
    s
    • 3
    • 8
  • m

    mammoth-lawyer-49919

    10/11/2021, 3:00 AM
    Hi all, I am ingesting metadata for LookML using version 0.8.14. It looks like there is an issue with the db_name value for the Athena platform. I am getting the URN urn:li:dataPlatform:athena,.db_name.table_name,PROD (note the . before db_name), when it should ideally come out as urn:li:dataPlatform:athena,db_name.table_name,PROD. I think the line below should be changed to include Athena as well: https://github.com/linkedin/datahub/blob/ef2182f1021c02a72733dfaa058add10f14de0c7/metadata-ingestion/src/datahub/ingestion/source/lookml.py#L735
    m
    • 2
    • 3
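The malformed URN above suggests an empty db_name being joined to the table name with an unconditional dot. A hypothetical helper (not the actual lookml.py code) showing the intended join:

```python
def make_dataset_urn(platform: str, db_name: str, table_name: str, env: str = "PROD") -> str:
    """Build a dataset URN, skipping the separator dot when db_name is empty."""
    name = f"{db_name}.{table_name}" if db_name else table_name
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"
```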
  • b

    bland-orange-13353

    10/11/2021, 7:56 AM
    This message was deleted.
    s
    w
    g
    • 4
    • 3
  • w

    witty-keyboard-20400

    10/11/2021, 12:57 PM
    While trying to capture collections metadata from Mongodb, I faced this error:
    Copy code
    OperationFailure: BSONObj size: 17365481 (0x108F9E9) is invalid. Size must be between 0 and 16793600(16MB)
    Is there any configuration that lets DataHub skip documents with larger field values?
    m
    • 2
    • 3
  • a

    agreeable-hamburger-38305

    10/11/2021, 8:20 PM
    A question I got from a coworker: how easy/hard would it be to add a new custom/unsupported data source?
    m
    b
    +3
    • 6
    • 6
  • f

    fresh-fish-73471

    10/11/2021, 9:09 PM
    Query: How do I change the 8080 port for datahub-gms in a Docker-based installation?
    Tried approaches:
    1. Changed the port for datahub-gms in docker-compose.yml before container creation. Container creation succeeded, but the error below occurred when trying to access the datahub-gms REST backend.
    2. Installed through quickstart.sh after making changes to the supposedly required files. Container creation succeeded, but the same error occurred again.
    3. Tried alternate open ports, with the same issue replicated.
    Error:
    ConnectionError: HTTPConnectionPool(host='X.Y.Z.A', port=8082): Max retries exceeded with url: /config (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8537d00f50>: Failed to establish a new connection: [Errno 111] Connection refused'))
    l
    • 2
    • 1
  • r

    rough-eye-60206

    10/11/2021, 10:15 PM
    Hello team, can anyone guide me on how to ingest glossary terms, or provide sample code showing how to generate/ingest glossary terms into the UI? Please help me.
    g
    • 2
    • 2
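On the glossary question above: one route is the file-based business glossary source, which reads term definitions from a YAML file and ingests them like any other recipe. A sketch of both files, with the layout assumed from the datahub-business-glossary source of this era:

```yaml
# business_glossary.yml (term definitions; names here are illustrative)
version: 1
source: DataHub
owners:
  users:
    - datahub
nodes:
  - name: Classification
    description: Data classification terms
    terms:
      - name: Confidential
        description: Restricted-access data

# recipe.yml (ingests the file above)
source:
  type: datahub-business-glossary
  config:
    file: ./business_glossary.yml
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```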
  • w

    witty-keyboard-20400

    10/12/2021, 5:33 AM
    When we ingest metadata from different source systems in an organization, it's possible that at the origin a field is called "score", while a downstream system names the same field "marks". How do we handle this in DataHub to convey to a user that both are the same field? Is there any way to capture semantics in DataHub?
    l
    b
    • 3
    • 4
  • n

    nice-planet-17111

    10/12/2021, 6:21 AM
    Hi team, is it possible to ingest metadata from
    CloudSQL
    (on GCP) ->
    datahub
    ? If it is, is there any quickstart guide or docs? Thanks in advance 🙂
    b
    s
    • 3
    • 4
  • b

    bumpy-activity-74405

    10/12/2021, 11:41 AM
    hey, is there a way to keep only
    n
    last versions of aspects? MySQL is growing rapidly in size and I don’t really have any use for old aspects
    b
    m
    • 3
    • 12
  • n

    nice-planet-17111

    10/12/2021, 12:35 PM
    Hi team 🙂 is there a way, or has anyone tried, to detect metadata changes in a data source in real time and ingest them automatically into DataHub (if possible, only the changed part)? I'm looking at the MCE builder and emitters, but I think I'm completely lost. 😢
    👀 1
    m
    b
    k
    • 4
    • 4
  • b

    boundless-room-44377

    10/12/2021, 6:16 PM
    hi, I have an ingestion troubleshooting question. I tried to execute code similar to this example: https://github.com/linkedin/datahub/blob/master/metadata-ingestion/examples/library/lineage_emitter_kafka.py and I don't get any errors, but I also don't see the lineage connection show up in the UI. If I run the same code but change the sink to the REST connector, it works. My question is how I might go about debugging this. I looked in the containers at the Kafka topics/messages and the GMS logs and didn't see anything that indicated an error or failure, but maybe I'm not looking in the right place. Note: I am running the "quickstart" DataHub on an EC2 instance, and I've been able to run ingestion recipes that use the Kafka sink to pull datasets in.
    l
    • 2
    • 8
  • a

    agreeable-hamburger-38305

    10/12/2021, 8:21 PM
    I’m running into this weird issue where, when I ingest table metadata from BigQuery with profiling enabled, only one or two of the numeric columns actually get
    Min
    ,
    Max
    ,
    Mean
    ,
    Median
    and
    Standard Deviation
    , while all the others just show “unknown”.
    Null
    and
    Distinct
    stats and
    sample values
    are all working fine. Anyone know what might be causing this? The columns with missing stats have a high % of null (99.9x%), but there are still some valid values in there.
    m
    l
    h
    • 4
    • 7
  • r

    rapid-piano-43271

    10/13/2021, 2:59 AM
    @here I have a question regarding pull-based metadata ingestion. Do we pull the entire schema of the data source? If so, how is it applied to the DataHub store: is the old schema completely replaced, or does it perform upsert (update or insert) operations?
    l
    m
    • 3
    • 9
  • c

    cuddly-family-62352

    10/13/2021, 8:12 AM
    I have a question: for MySQL, for example, are MCE events generated by monitoring logs or by polling on a schedule?
    b
    m
    • 3
    • 13
  • m

    melodic-helmet-78607

    10/13/2021, 8:55 AM
    Hi team, does the business glossary have versioning? As a user, I want to see previous definitions and what has been changed.
    m
    • 2
    • 1
  • m

    melodic-helmet-78607

    10/13/2021, 8:59 AM
    Also, any tips on differentiating entities/glossary terms that have been approved from those that haven't? My use case is that every ingested entity/glossary term can be approved/rejected by a DataHub user via the UI.
    b
    l
    • 3
    • 3
  • r

    rough-eye-60206

    10/13/2021, 4:48 PM
    Hello team, I want to know how to associate different glossary terms with different datasets. Currently I was able to achieve it in the UI by manually assigning them, but I would like to know whether there is an automated option (emitters or APIs) for achieving that. Can anyone help me or provide an example for reference? Thank you.
    m
    b
    • 3
    • 6
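On the automation question above: glossary terms can be attached to a dataset programmatically by emitting a GlossaryTerms aspect on the dataset's snapshot (in Python, via datahub's emitter and schema classes rather than raw dicts). The helper below only sketches the aspect's JSON shape; the function name is hypothetical and the fully qualified aspect key is an assumption:

```python
from typing import Dict, List

def glossary_terms_aspect(term_urns: List[str],
                          actor: str = "urn:li:corpuser:ingestion") -> Dict:
    """Illustrative shape of a GlossaryTerms aspect attaching terms to a dataset
    (hypothetical helper; the aspect key is an assumption)."""
    return {
        "com.linkedin.common.GlossaryTerms": {
            "terms": [{"urn": u} for u in term_urns],
            "auditStamp": {"time": 0, "actor": actor},
        }
    }
```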
  • w

    witty-keyboard-20400

    10/14/2021, 6:38 AM
    MongoDB ingestion failure:
    Copy code
    OperationFailure: not authorized on db_kg to execute command { aggregate: "system.views", pipeline: [ { $sample: { size: 100 } } ], allowDiskUse: true, cursor: {}, lsid: { id: UUID("305f9d4f-fd8b-4fbd-8cf6-9257c4399403") }, $clusterTime: { clusterTime: Timestamp(1634193294, 4), signature: { hash: BinData(0, 9B140107B447AC1BFBE704B411400CF7EEF4E04D), keyId: 7012684930027094017 } }, $db: "db_kg", $readPreference: { mode: "primaryPreferred" } }, full error: {'operationTime': Timestamp(1634193295, 1), 'ok': 0.0, 'errmsg': 'not authorized on db_kg to execute command { aggregate: "system.views", pipeline: [ { $sample: { size: 100 } } ], allowDiskUse: true, cursor: {}, lsid: { id: UUID("305f9d4f-fd8b-4fbd-8cf6-9257c4399403") }, $clusterTime: { clusterTime: Timestamp(1634193294, 4), signature: { hash: BinData(0, 9B140107B447AC1BFBE704B411400CF7EEF4E04D), keyId: 7012684930027094017 } }, $db: "db_kg", $readPreference: { mode: "primaryPreferred" } }', 'code': 13, 'codeName': 'Unauthorized', '$clusterTime': {'clusterTime': Timestamp(1634193295, 1), 'signature': {'hash': b'\x8f\x98\x0b\x97l\xbd\xab\x96\xcc\x91\x14QQ7\xc8)d\xd7W"', 'keyId': 7012684930027094017}}}
    sample size is just 100:
    schemaSamplingSize: 100
    Why does the ingestion's schema inference need to execute
    aggregate: "system.views"
    ?
    b
    • 2
    • 4
  • w

    witty-keyboard-20400

    10/14/2021, 11:50 AM
    For ingesting metadata from MongoDB, I'd like to see some examples of how
    collection_pattern.deny
    is specified in the config section, along with sample values.
    m
    • 2
    • 1
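A sketch answering the collection_pattern.deny question above (regex values are illustrative; patterns are matched against fully qualified db.collection names in the MongoDB source of this era, though exact matching semantics may vary by version):

```yaml
source:
  type: mongodb
  config:
    collection_pattern:
      deny:
        # skip system collections and anything in a staging database
        - ".*\\.system\\..*"
        - "staging_db\\..*"
```

A pattern like the first one would also sidestep the system.views authorization failure shown in the earlier message.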
  • c

    crooked-wolf-53758

    10/14/2021, 4:26 PM
    I have the JSON schema for a set of JSON documents. What would be the best way to ingest this metadata? @mammoth-bear-12532 I saw your reply to a similar question a few months ago, but the links no longer work.
    h
    m
    • 3
    • 5