# hamilton-help
e
Hey! So yes, this is doable, although it requires power-user mode 🙂 First, creating multiple drivers is not actually bad IMO, it's just a little cumbersome. Assuming there are two time periods (`foo` and `bar`), you would use the `resolve` decorator, along with `subdag`. What the `resolve` decorator does is wait until a config is available to create the DAG, rather than creating it from hardcoded items:
```python
# imports added for completeness; the original snippet assumed them
import pandas as pd
from hamilton.function_modifiers import ResolveAt, resolve, subdag, value

import module_defining_features  # your module of feature functions


@resolve(
    when=ResolveAt.CONFIG_AVAILABLE,
    # lambda params (bar_start, bar_end) are pulled from the config
    decorate_with=lambda bar_start, bar_end: subdag(
        module_defining_features,
        inputs={
            "start": value(bar_start),
            "end": value(bar_end),
        },
    ),
)
def bar(feature_1: pd.Series, feature_2: pd.Series, feature_3: pd.Series) -> pd.Series:
    ...
```
Then repeat the same for `foo`. Note that this has two hardcoded assumptions (the feature names), so it isn't quite what you want. However, you can make it more flexible using `@parameterized_subdag` — this enables you to run multiple subdags. The only challenge is dynamically specifying the outputs, which we don't yet have available, so you'd likely want to either (a) do one subdag per foo/bar combination, or (b) do something a little smarter (output a dataframe and select just the columns you need). I think to make this more ergonomic, we'd need to add a way to parameterize the outputs — I'll dig a bit to ensure there isn't an easier way to do that specifically. That said, I think there's another solution to your issue that's pretty clean, and I'd love to get your take. Writing it down now…
You could either:
1. Compute everything on the fly, then have the last step in your DAG select/break into columns (this could even use the `resolve` piece, or a custom results builder).
2. Pass each node its set of time ranges so you only compute those, then use a node/custom results builder to join everything together at the end and break the columns into two.
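A rough sketch of option 1, using plain dicts as stand-ins for computed Hamilton nodes (all names here are hypothetical, not from the gist):

```python
def select_outputs(all_features: dict, requested: list) -> dict:
    """Final DAG step: keep only the columns the caller asked for."""
    return {name: all_features[name] for name in requested}


# toy "computed features" standing in for upstream nodes
all_features = {
    "feature_1": [1, 2, 3],
    "feature_2": [4, 5, 6],
    "feature_3": [7, 8, 9],
}

result = select_outputs(all_features, requested=["feature_1", "feature_3"])
print(sorted(result))  # ['feature_1', 'feature_3']
```

The upstream nodes compute everything; only this last step knows which columns the caller wants, so the selection logic stays in one place.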
Doing a quick mockup to demonstrate these
OK, you’ve nerd-sniped me. Had some fun. See three possible implementations here: https://gist.github.com/elijahbenizzy/52dacb97f4c3513090ad6acaceda0ddc
We’re also looking at something we’re calling “driver chaining”, which might be a nice way to do what you’re already doing more efficiently — stay tuned for that. That said, I think you have options. IMO the extra complexity from dynamically managing the graph isn’t horrible, but you want to make sure it stays in one place, e.g. away from the feature definitions.
c
wow, thank you, this is extremely helpful. Love the different options and tradeoffs. I'm leaning towards "power user" for a few reasons:
• I neglected to mention this, but time isn't actually a column in our data; we're just returning aggregates over time periods, and the aggregation is happening in SQL. We could move that aggregation over to the Hamilton side and do something like the "filter" method, but we might have to redesign some stuff.
• The first layer of our DAG is just running SQL queries in AWS Athena and waiting for the results, so they can all be run in parallel. That's one of the biggest motivations for wanting everything to be in one DAG (so that we can async them rather than waiting on each feature group one at a time).
`@resolve` and `@inject` look like exactly what I want, though; I'm going to mess around with what you sent and see if I can get it working 😃
e
Great! Yeah, I think we can make the power-user mode a little cleaner as well. Using `inject` in that case is only because `@subdag` doesn’t allow you to dynamically specify what you want from the subdag (it comes from the function name), so we use `@inject` to group/join. When you get your final code, I’d love to see what you have so we can tune/add shortcuts in the API 🙂 Feel free to reach out with any more Qs — all the code I gave you works; you can just run the file with `python` and see the outputs (it’ll run each case).
Another approach (if you want) is to do something like this: https://rajaswalavalkar.medium.com/aws-athena-how-to-execute-parallel-queries-how-to-control-the-flow-of-the-code-till-completion-c5628f976b88. E.g., you can have one function just kick off every query, then have downstream ones process the results. That said, delegating parallelism to the DAG is probably a lot cleaner.
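The “kick everything off, then wait” pattern from that article can be sketched without async, using a thread pool as a stand-in for Athena’s start/poll calls (`run_query` and the query strings are hypothetical placeholders):

```python
from concurrent.futures import ThreadPoolExecutor


def run_query(sql: str) -> str:
    # placeholder: in real code this would call Athena's
    # start_query_execution and poll until results are ready
    return f"results for: {sql}"


queries = ["SELECT 1", "SELECT 2", "SELECT 3"]

with ThreadPoolExecutor() as pool:
    # submit all queries up front; they run concurrently while we wait,
    # and map() returns results in submission order
    results = list(pool.map(run_query, queries))

print(results[0])  # results for: SELECT 1
```

The whole fan-out/fan-in lives in one spot, which matches the “keep the dynamic bits in one place” advice above.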
c
i do like the link you posted, since we'd be able to avoid using async at all. Ultimately we aren't getting a lot out of using async; we would get the exact same benefits by kicking off all the queries we need to run and waiting for them all to finish in a for loop. We actually do exactly this in a few of our other projects, where we know ahead of time what queries we want to run. But the thing that tripped me up is that in our feature store we don't know which queries we'll need to run until the user requests features, and running all of them every time would be a bit too slow/expensive. Basically, what I was missing was a way to get a list of only the necessary queries at run time (based on the requested features), so I resorted to making every query an async function and using the AsyncDriver. Now that you've introduced me to @inject/@resolve, I feel like there should be a way to do that and avoid using async.
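One way to get “only the necessary queries at run time” without async is a plain feature-to-query mapping consulted when the request comes in (all names here are hypothetical, just to illustrate the shape):

```python
# hypothetical mapping from each feature to the query that produces it
FEATURE_TO_QUERY = {
    "clicks_7d": "clicks.sql",
    "spend_7d": "spend.sql",
    "visits_30d": "visits.sql",
}


def queries_for(requested_features: list) -> list:
    """Return the de-duplicated list of queries needed for this request."""
    return sorted({FEATURE_TO_QUERY[f] for f in requested_features})


needed = queries_for(["clicks_7d", "spend_7d"])
print(needed)  # ['clicks.sql', 'spend.sql']
```

The resulting list could then feed the parallel kickoff step, so only the requested feature groups ever hit Athena.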
e
Yeah! I think you’ve got plenty of options. Let me know if you need anything else — otherwise, quite happy you like `@inject`/`@resolve`. The only thing I’d say is to try not to get too trigger-happy with them — they can be handy in a few spots, but if relied on too heavily they can make a codebase messy.