James Marvin - 06/30/2022, 11:06 AM
James Marvin - 06/30/2022, 11:11 AM
James Marvin - 07/04/2022, 8:48 AM
Simon Helmig - 08/02/2022, 5:59 PM
Brian Ritz - 08/04/2022, 8:01 PM
• Each Subcategory has a corresponding subcategory_tag (think a primary key for subcategories).
• Each Category has a corresponding category_tag (a primary key for categories).
My Ask: At “query time,” I’d like to be able to specify a subcategory_tag or a category_tag, and get back a category.
One way to do this is as follows, but it feels “against the framework”. Is there a better way?
# node_definitions.py
def data() -> list:
    return [
        {"category": "Peanut Butter", "category_tag": "PB", "subcategory": "Natural Peanut Butter", "subcategory_tag": "NPB"},
        {"category": "Peanut Butter", "category_tag": "PB", "subcategory": "Conventional Peanut Butter", "subcategory_tag": "CPB"},
    ]

def category_tag() -> str:
    return None

def subcategory_tag() -> str:
    return None

def subcategory(subcategory_tag: str, data: list) -> str:
    if subcategory_tag is None:
        return None
    d = {d['subcategory_tag']: d['subcategory'] for d in data}
    return d[subcategory_tag]

def category(category_tag: str, subcategory: str, data: list) -> str:
    if category_tag is None and subcategory is None:
        raise ValueError
    if category_tag is not None:
        d = {d['category_tag']: d['category'] for d in data}
    else:
        d = {d['subcategory']: d['category'] for d in data}
    val = category_tag if category_tag is not None else subcategory
    return d[val]
# test.py
from hamilton import driver
from hamilton.base import SimplePythonGraphAdapter, DictResult
import node_definitions
dr = driver.Driver({}, node_definitions, adapter=SimplePythonGraphAdapter(DictResult))
result_1 = dr.execute(['category'], overrides={"category_tag": "PB"})
print(result_1)
result_2 = dr.execute(['category'], overrides={"subcategory_tag": "NPB"})
print(result_2)
Running it:
root@ae148f8887a5:/project/src/hamilton_tests# python test.py
{'category': 'Peanut Butter'}
{'category': 'Peanut Butter'}
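For comparison, here is the same tag-to-category resolution with the framework stripped away. This is only a restatement of the lookup logic in the two functions above (the name resolve_category is mine), not a claim about the idiomatic Hamilton way to do it:

```python
# Same records as data() above.
data = [
    {"category": "Peanut Butter", "category_tag": "PB",
     "subcategory": "Natural Peanut Butter", "subcategory_tag": "NPB"},
    {"category": "Peanut Butter", "category_tag": "PB",
     "subcategory": "Conventional Peanut Butter", "subcategory_tag": "CPB"},
]

def resolve_category(data: list, category_tag: str = None, subcategory_tag: str = None) -> str:
    # Prefer the direct category_tag mapping; fall back to subcategory_tag.
    if category_tag is not None:
        return {d["category_tag"]: d["category"] for d in data}[category_tag]
    if subcategory_tag is not None:
        return {d["subcategory_tag"]: d["category"] for d in data}[subcategory_tag]
    raise ValueError("provide category_tag or subcategory_tag")

print(resolve_category(data, category_tag="PB"))      # Peanut Butter
print(resolve_category(data, subcategory_tag="NPB"))  # Peanut Butter
```

Seen this way, the question is really how to express an either/or input in the DAG, which is why the override-based version above feels awkward.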
Ben - 09/02/2022, 5:19 PM
Is it possible to combine @parameterize[-sources] with @extract_columns (i.e. to get multiple columns back from a parameterized function)? I can't wrap my head around how it would work, if it is.
Ben - 09/08/2022, 2:40 PM
With @does I can pass in a list of kwargs to the replacing function, but I still need to define the actual function arguments in the originating function. E.g., is there a way to rewrite this in hamilton, where parameterized_cols would be passed in as kwargs or similar?
df = pd.DataFrame(
    [[1, 1, 3, np.nan], [np.nan, 2, 3, 4], [np.nan, 0, 5, 6]],
    index=['2022-01', '2022-02', '2022-03'],
    columns=['v202009', 'v202010', 'v202011', 'v202012'],
)
parameterized_cols = ['v202009', 'v202011']
last_value_series = df.loc[:, parameterized_cols].iloc[-1]
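The computation being parameterized is small; as plain pandas, a single function taking the column list as an argument (names taken from the snippet above, with its missing bracket and variable-name typo fixed) would look like:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 3, np.nan], [np.nan, 2, 3, 4], [np.nan, 0, 5, 6]],
    index=["2022-01", "2022-02", "2022-03"],
    columns=["v202009", "v202010", "v202011", "v202012"],
)

def last_value_series(df: pd.DataFrame, parameterized_cols: list) -> pd.Series:
    # Last row of just the selected columns.
    return df.loc[:, parameterized_cols].iloc[-1]

print(last_value_series(df, ["v202009", "v202011"]))
```

The open question in the message above is how to get a decorator to inject parameterized_cols, rather than how to compute the series itself.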
Olaide Joseph - 09/28/2022, 3:08 PM
John Herr - 10/05/2022, 11:15 PM
If I provide column A in the input dataframe, but Hamilton has a function to calculate column A based on other fields found in the dataframe, will Hamilton recalculate and overwrite the column, or skip the recalculation, having realized the column was already provided?
Avnish Pal - 10/07/2022, 3:20 PM
Zouhair Mahboubi - 10/25/2022, 8:24 PM
James Marvin - 10/25/2022, 8:31 PM
Zouhair Mahboubi - 10/26/2022, 11:32 PM
Zouhair Mahboubi - 10/27/2022, 4:00 PM
James Marvin - 10/31/2022, 5:13 PM
Zouhair Mahboubi - 11/02/2022, 7:00 PM
Filip Piasevoli - 11/16/2022, 9:46 PM
Zouhair Mahboubi - 11/17/2022, 7:59 PM
Seth Stokes - 11/18/2022, 9:17 PM
I'm unsure how to use hamilton when many of the features in the source data need to be filled via mapping/merging. Is my assumption correct that these need to be handled before passing the DataFrame to the hamilton driver? Do hamilton functions allow for merges between DataFrames? If the data was whole beforehand, I could simply define functions for each column and execute df.pipe() (here I can see the benefit of hamilton composing these functions instead). My issue is the merges that are done in between to fill missing values for each subsequent fn. Any direction would be welcome.
Filip Piasevoli - 11/18/2022, 9:44 PM
Gregory Jeffrey - 11/23/2022, 5:28 PM
@parameterize(
    feature_lag_1={"lag": value(1)},
    feature_lag_2={"lag": value(2)},
)
def lag_series(feature: pd.Series, lag: int = 1) -> pd.Series:
    return feature.shift(lag)
The best I've been able to come up with is to enumerate all possible parameterizations, and do the random selection when building the list of output nodes, i.e.:
lag_parameterization = {f"feature_lag_{lag}": {"lag": value(lag)} for lag in range(1, 100)}

@parameterize(**lag_parameterization)
def lag_series(feature: pd.Series, lag: int = 1) -> pd.Series:
    return feature.shift(lag)
This would still give a very long list of available features, however, which is less desirable.
Conor Digan - 12/01/2022, 12:30 PM
def some_function(a, b, c):
    return a + c, b + c

d, e = some_function(a, b, c)
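Since a dataflow node typically produces one named output, a common workaround for a multi-value function like the one above is to return named fields from a single function and split them downstream (Hamilton's @extract_fields decorator does that splitting; the sketch below is plain Python, and all names are illustrative):

```python
def some_function(a: int, b: int, c: int) -> dict:
    # Return named fields instead of a positional tuple, so each
    # output has a name that downstream nodes can depend on.
    return {"d": a + c, "e": b + c}

result = some_function(1, 2, 3)
d, e = result["d"], result["e"]
print(d, e)  # 4 5
```

The dict keys become the names the rest of the pipeline refers to, which is what makes the split unambiguous.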
Conor Digan - 12/01/2022, 1:48 PM
pip install sf-hamilton[visualization]:
Filip Piasevoli - 12/01/2022, 7:59 PM
ARG2S = {
    ('FOO', 'US Election 2016 Dummy'): 'a',
    ('BAR', 'Doc string for this thing'): 'b',
}

@parameterize_sources(
    FOO=dict(
        arg1='hello'
    ),
    BAR=dict(
        arg1='world'
    )
)
@parameterize_values(
    parameter='arg2', assigned_output=ARG2S
)
def fn(
    arg1: str,
    arg2: str,
) -> Object:
    <rest of function here>
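Whether those two decorators stack this way is exactly the open question. As a plain-Python illustration only, the hoped-for result would be a cross product of the two variant sets. The names and values below come from the snippet above, but treating the pinned sources as literal strings is a simplification, and the cross-product behavior itself is an assumption, not confirmed Hamilton behavior:

```python
from itertools import product

def fn(arg1: str, arg2: str) -> str:
    # Stand-in body for the decorated function above.
    return f"{arg1}/{arg2}"

arg1_variants = {"FOO": "hello", "BAR": "world"}  # what @parameterize_sources pins
arg2_variants = {"FOO": "a", "BAR": "b"}          # what @parameterize_values pins via ARG2S

# If stacking produced a cross product (assumed), 2 x 2 concrete
# outputs would result, one per (arg1, arg2) combination:
stacked = {
    (name1, name2): fn(v1, v2)
    for (name1, v1), (name2, v2) in product(arg1_variants.items(), arg2_variants.items())
}
print(len(stacked))  # 4
```

If stacking is not supported, the fallback is to enumerate the product yourself in a single parameterization dict.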
Peter Robinson - 12/05/2022, 2:47 PM
Is it possible to parameterize a single node over an n x m set of inputs?
My specific use case is the following:
I have three series of data, and I need to find maximums for given time scales (e.g. maximum averaged over 5s, max averaged over 10s, etc.) for each. I realise it's possible to do one node for each series, with a parameterised input of each maximum "time window", but it feels like it might be possible to do with a single node and an n x m set of parameterized inputs.
Seth Terrell - 12/05/2022, 4:46 PM
C:\Stuff\Source\hamilton\examples\dbt>dbt run
16:41:59 Running with dbt=1.3.0
16:42:00 Found 2 models, 0 tests, 0 snapshots, 0 analyses, 267 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
16:42:00
16:42:02 Concurrency: 4 threads (target='dev')
16:42:02
16:42:02 1 of 2 START sql table model HAMILTON.raw_passengers ........................... [RUN]
16:42:04 1 of 2 OK created sql table model HAMILTON.raw_passengers ...................... [SUCCESS 1 in 2.11s]
16:42:04 2 of 2 START python table model HAMILTON.train_and_infer ....................... [RUN]
16:42:04 2 of 2 ERROR creating python table model HAMILTON.train_and_infer .............. [ERROR in 0.00s]
16:42:04
16:42:04 Finished running 2 table models in 0 hours 0 minutes and 4.51 seconds (4.51s).
16:42:04
16:42:04 Completed with 1 error and 0 warnings:
16:42:04
16:42:04 Compilation Error in model train_and_infer (models\train_and_infer.py)
16:42:04 'py_script_postfix' is undefined. This can happen when calling a macro that does not exist. Check for typos and/or install package dependencies with "dbt deps".
16:42:04
16:42:04 Done. PASS=1 WARN=0 ERROR=1 SKIP=0 TOTAL=2
Thanks for any ideas!
Baldo Faieta - 12/06/2022, 10:29 PM
Baldo Faieta - 12/13/2022, 11:04 PM
James Marvin - 01/10/2023, 3:25 PM
ValueError: array length 618 does not match index length 1759
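No code accompanies the error, but pandas raises this message when an array of one length is combined with an index of another length. A minimal, hypothetical reproduction (the original code isn't shown, so the construction here is an assumption):

```python
import numpy as np
import pandas as pd

# 618 values cannot be conformed to an explicit 1759-row index.
try:
    pd.DataFrame({"col": np.arange(618)}, index=range(1759))
    error = None
except ValueError as e:
    error = e
print(error)
```

The usual fix is to align the array to the frame's index (or build it from the frame) before assignment, rather than constructing it at a different length.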
James Marvin - 01/10/2023, 4:21 PM