# integrate-dagster-datahub
  • dazzling-judge-80093 (04/12/2022, 7:06 PM)
    I've been working recently on building an easier abstraction capturing DataJob (Op)/DataFlow (Pipeline), and there is a new entity, DataProcessInstance, to capture the actual pipeline run/op run.
  • dazzling-judge-80093 (04/12/2022, 7:06 PM)
    https://github.com/datahub-project/datahub/pull/4615
  • delightful-barista-90363 (04/12/2022, 7:07 PM)
    I haven't worked with software-defined assets, but that seems like something that would integrate better with DataHub than some older Dagster features.
  • dazzling-judge-80093 (04/12/2022, 7:08 PM)
    Yeah, I need to check it out; so far I was thinking of AssetMaterialization and capturing it from an IOManager.
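A minimal sketch of that IOManager idea, assuming the Dagster pattern of the time where handle_output can yield AssetMaterialization events; the class name and S3 path here are hypothetical:

```python
from dagster import AssetMaterialization, IOManager, io_manager


class S3IOManager(IOManager):
    """Hypothetical IOManager that persists op outputs to S3 and
    reports each write as an AssetMaterialization."""

    def handle_output(self, context, obj):
        path = f"s3://my-bucket/{context.step_key}/{context.name}"
        # ... write `obj` to `path` here ...
        # Yielding the event from handle_output surfaces the
        # materialization in Dagit at run time.
        yield AssetMaterialization(
            asset_key=context.name,
            description=f"Wrote output to {path}",
        )

    def load_input(self, context):
        # ... read the upstream output back from S3 ...
        raise NotImplementedError


@io_manager
def s3_io_manager(_):
    return S3IOManager()
```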
  • dazzling-judge-80093 (04/12/2022, 7:10 PM)
    It seems like this is the difference between the two:
    > Software-defined assets vs. Asset Materializations
    > When working with software-defined assets, the assets and their dependencies must be known at definition time. When you look at software-defined assets in Dagit, you can see exactly what assets are going to be materialized before any code runs.
    >
    > Asset Materializations, on the other hand, are logged at run time. When you run an op, you find out which assets were materialized while the op is running. This allows for some flexibility, like if you wanted to determine which assets should be materialized based on the output of a previous op.
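To make the quoted distinction concrete, here is a small, hypothetical sketch of both styles (the asset and op names are made up):

```python
from dagster import AssetMaterialization, Output, asset, op


# Software-defined assets: the assets and their dependency
# (raw_orders -> cleaned_orders) are known at definition time,
# before any code runs.
@asset
def raw_orders():
    return [{"id": 1, "valid": True}, {"id": 2, "valid": False}]


@asset
def cleaned_orders(raw_orders):
    return [order for order in raw_orders if order["valid"]]


# Asset Materialization: logged at run time, so which assets were
# produced can depend on what the op actually computed.
@op
def export_tables(context, table_names):
    for name in table_names:
        # ... write the table somewhere ...
        context.log_event(AssetMaterialization(asset_key=name))
    yield Output(table_names)
```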
  • dazzling-judge-80093 (04/12/2022, 7:11 PM)
    which suggests a software-defined asset is the expectation and an Asset Materialization is the reality
  • dazzling-judge-80093 (04/12/2022, 7:12 PM)
    Actually, we can support both with the new model, because you can define dependencies with every run; this way you will be able to see the graph/inputs/outputs per run.
  • delightful-barista-90363 (04/12/2022, 7:13 PM)
    That's another question I had with my team: how granular should lineage get with respect to graphs, jobs, and ops?
  • delightful-barista-90363 (04/12/2022, 7:15 PM)
    There's also the question of how best to integrate external resources, as there's usually a lot of custom code (at least in my use case) surrounding an S3 or Postgres resource, for example.
  • dazzling-judge-80093 (04/12/2022, 7:17 PM)
    Do you expose asset materializations?
  • dazzling-judge-80093 (04/12/2022, 7:17 PM)
    Then I think you don't have to touch any custom code/resources.
  • delightful-barista-90363 (04/12/2022, 7:18 PM)
    We do.
  • delightful-barista-90363 (04/12/2022, 7:19 PM)
    I was referring more to inputs being a separate S3 bucket or database. We generally have a raw data bucket that gets run through transformations into separate S3 buckets.
  • delightful-barista-90363 (04/12/2022, 7:24 PM)
    I do agree assets could get by without custom code, and as a user I would probably want to track lineage only for assets that are materialized.
  • dazzling-judge-80093 (04/12/2022, 7:40 PM)
    Are you saying you are doing transformations in your resources, or is the problem more about how to capture data that is generated not in Dagster but upstream of it?
  • delightful-barista-90363 (04/12/2022, 7:45 PM)
    Capturing data not generated in Dagster.
  • delightful-barista-90363 (04/12/2022, 7:46 PM)
    For example: download files from an S3 bucket, then run transformations that yield assets to a separate bucket. Pretty much creating the connection from the source bucket -> Dagster job -> target bucket.
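A hedged sketch of what recording that source bucket -> Dagster job -> target bucket connection could look like with DataHub's Python emitter; the server URL, bucket names, and flow/job ids are hypothetical placeholders:

```python
from datahub.emitter.mce_builder import make_data_job_urn, make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    ChangeTypeClass,
    DataJobInputOutputClass,
)

# Hypothetical URNs for the raw source bucket, the Dagster op, and
# the transformed target bucket.
source = make_dataset_urn(platform="s3", name="raw-data-bucket", env="PROD")
target = make_dataset_urn(platform="s3", name="transformed-bucket", env="PROD")
job = make_data_job_urn(
    orchestrator="dagster", flow_id="my_pipeline", job_id="transform_op"
)

# Declare the buckets as the job's inputs and outputs, which gives
# DataHub the source bucket -> job -> target bucket lineage.
mcp = MetadataChangeProposalWrapper(
    entityType="dataJob",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn=job,
    aspectName="dataJobInputOutput",
    aspect=DataJobInputOutputClass(
        inputDatasets=[source],
        outputDatasets=[target],
    ),
)

DatahubRestEmitter(gms_server="http://localhost:8080").emit_mcp(mcp)
```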
  • delightful-barista-90363 (04/27/2022, 6:10 PM)
    Hey, I've got some room to discuss and work on a Dagster integration now.
  • delightful-barista-90363 (04/28/2022, 5:57 PM)
    Some initial ideas I had were:
    • DataHub Dagster integration
    ◦ Dagster controlling DataHub ingestion jobs (similar to https://datahubproject.io/docs/metadata-ingestion/schedule_docs/airflow; see the sketch after this list)
    ◦ A DataHub @resource for Dagster (for pushing MCEs)
    ◦ DataHub pulling lineage from Dagster
    ◦ DataHub pulling metadata from Dagster
    ◦ DataHub Dagster (Task) entities linking to the Dagster job
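For the first idea, a rough sketch of a Dagster op that kicks off a DataHub ingestion run programmatically; the source/sink configuration below is a hypothetical placeholder for a real ingestion recipe:

```python
from dagster import job, op
from datahub.ingestion.run.pipeline import Pipeline


@op
def run_datahub_ingestion():
    # In-code equivalent of a recipe YAML; the connection values
    # here are hypothetical placeholders.
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "postgres",
                "config": {
                    "host_port": "localhost:5432",
                    "database": "mydb",
                    "username": "user",
                    "password": "pass",
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    # Fail the op if the ingestion reported errors.
    pipeline.raise_from_status()


@job
def datahub_ingest_job():
    run_datahub_ingestion()
```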
  • dazzling-judge-80093 (04/28/2022, 5:59 PM)
    Hey, if you have time, we can catch up about this tomorrow.
  • delightful-barista-90363 (04/28/2022, 6:00 PM)
    Sounds good to me.
  • delightful-barista-90363 (04/29/2022, 2:32 PM)
    Gonna bump this and see if you want to meet today.
  • dazzling-judge-80093 (04/29/2022, 2:33 PM)
    When are you available?
  • dazzling-judge-80093 (04/29/2022, 2:34 PM)
    I have a meeting one hour from now, but we can have some discussion before or after that.
  • delightful-barista-90363 (04/29/2022, 2:35 PM)
    I'm free all day.
  • rapid-king-93225 (05/04/2022, 10:38 AM)
    Don't forget their richer concept around assets.
  • delightful-barista-90363 (07/06/2022, 5:36 PM)
    For a small integration of Dagster + DataHub, I made the REST and Kafka emitters available as Dagster resources. This isn't the full-blown "metadata exchange" between the two services; that will require a lot more work and time, lol. Here is the PR for said work: https://github.com/dagster-io/dagster/pull/8764
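The PR above defines the actual resource API; as a rough illustration of the idea (not the PR's exact code), wrapping DataHub's REST emitter in a Dagster resource might look like this:

```python
from dagster import Field, StringSource, op, resource
from datahub.emitter.rest_emitter import DatahubRestEmitter


@resource(config_schema={"connection": Field(StringSource)})
def datahub_rest_emitter(init_context):
    # One emitter per run, configured from the resource config.
    return DatahubRestEmitter(
        gms_server=init_context.resource_config["connection"]
    )


# Ops can then request the emitter as a required resource and push
# metadata (MCEs/MCPs) to DataHub mid-run.
@op(required_resource_keys={"datahub"})
def emit_metadata(context):
    emitter = context.resources.datahub
    # Build a MetadataChangeProposalWrapper here and push it, e.g.:
    # emitter.emit_mcp(mcp)
    context.log.info(f"DataHub emitter ready: {type(emitter).__name__}")
```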
  • alert-alligator-12918 (11/30/2022, 11:17 PM)
    Hey! I was curious if anyone can point towards an example of this integration being used 🙂
  • stocky-monkey-23735 (08/08/2023, 1:24 AM)
    I saw this and got curious. Has there been any progress on this? I'm in an organization where we also have to implement this; we're using DataHub and Dagster along with dbt.
  • bulky-shoe-65107 (10/16/2023, 12:37 AM)
    has renamed the channel from "integration-dagster-datahub" to "integrate-dagster-datahub"