Hi everyone I try to use `parameterize extract columns` and Hamilton Open Source #hamilton-help

Roy Kid

04/29/2024, 12:45 PM

Hi, everyone! I'm trying to use `parameterize_extract_columns`, and I got an error that I think is about result building. My code looks like this:

@resolve(
    when=ResolveAt.CONFIG_AVAILABLE,
    decorate_with=lambda: parameterize_extract_columns(
        *[ParameterizedExtract(tuple(exp.name), {"exp": value(exp)}) for exp in self]
    ),
)
def mapper(exp: Experiment) -> dict:
    os.chdir(exp["run_dir"])
    dr = driver.Builder().with_modules(*modules).build()
    result = dr.materialize(*materializers, inputs=exp.param)
    # result = dr.execute(inputs=exp.param, final_vars=["load_sin"])
    return result

and the driver is like this:

dr = (
    driver.Builder()
    .with_modules(tracker)
    .enable_dynamic_execution(allow_experimental_mode=True)
    .with_execution_manager(execution_manager)
    .with_adapter(base.SimplePythonGraphAdapter(base.DictResult()))
    .with_config({settings.ENABLE_POWER_USER_MODE: True})
    .build()
)
os.chdir(root)
results = dr.execute(
    final_vars=[name for name in parameters],
)

I don't know how to register types with the result builder. Or does `parameterize_extract_columns` only support `pd.DataFrame`?
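
A side note on the first snippet above: if `exp.name` is a plain string (an assumption, since the `Experiment` class is not shown in the thread), then `tuple(exp.name)` yields one element per character rather than a one-element tuple of output names. A quick check with a hypothetical name:

```python
# Hypothetical experiment name; assumes `exp.name` is a plain string.
name = "exp1"

# tuple() iterates over the string, producing one element per character.
split_into_chars = tuple(name)

# A one-element tuple needs a trailing comma.
single_output = (name,)

print(split_into_chars)
print(single_output)
```

If `exp.name` is already a tuple of names, `tuple(exp.name)` is fine as written.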

Elijah Ben Izzy

04/29/2024, 2:16 PM

Hey — so I’m not sure what the error is (would need to run locally, not at my computer now), but it should work. That said, the result building should just put it into a dict. What’s failing? Also, the use of resolve is a little odd (but possibly quite clever) — it’s not taking in any parameters, but you’re using class state to feed the data in, right?

Roy Kid

04/29/2024, 2:19 PM

Sorry! I forgot to paste the error message:

FAILED tests/test_proj.py::TestProject::test_map_reduce - NotImplementedError: Cannot get column type for [<class 'dict'>]. Registered types are {'pandas': {'dataframe_type': <class 'pandas.core.frame.DataFrame'>, 'column_type': <class 'pandas.core.series.Series'>}}

Roy Kid

04/29/2024, 2:21 PM

And I can't understand the use case for `parameterize_extract_columns`. If I want to parameterize over a list of `Experiment` objects with different names, how should I use this decorator? (I have no clue at all; the "four columns, two for each parameterization" example is too difficult to understand.....)

Elijah Ben Izzy

04/29/2024, 2:52 PM

Woah, ok, you’re using Hamilton within Hamilton. Fun! Also crazy stuff.

`parameterize_extract_columns` basically does both `parameterize` and `extract_columns`:

1. parameterize over inputs
2. extract columns for each of those inputs

It does this with a dataclass for each one. Note this is… quite complex. However, the core problem is that you're returning a `dict` from the function and it has to be a dataframe (whatever type). Specifically, it has to be a dataframe because we run one node that returns that dataframe for each parameterization, and then extract from each of these. So the function gets repeated once for each parameterization. Does this make sense? TBH there are likely better ways of doing this (it's probably a little more complex than necessary), but I don't have full context into what you're working on…
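
The two steps described above can be modeled in plain Python. This is only a conceptual sketch, not Hamilton's implementation or API: `run_experiment` and `expand_and_extract` are hypothetical names, a dict of lists stands in for a dataframe so that the example needs only the standard library, and the explicit column-to-output-name mapping is a simplification.

```python
from typing import Any, Callable, Dict, List, Tuple

# Stand-in for a dataframe: column name -> list of values.
Frame = Dict[str, List[float]]


def run_experiment(scale: float) -> Frame:
    """Hypothetical function being parameterized; returns two 'columns'."""
    base = [1.0, 2.0, 3.0]
    return {
        "energy": [scale * x for x in base],
        "force": [scale + x for x in base],
    }


def expand_and_extract(
    fn: Callable[..., Frame],
    parameterizations: List[Tuple[Dict[str, str], Dict[str, Any]]],
) -> Dict[str, List[float]]:
    """Call fn once per parameterization, then expose each column as a named output."""
    outputs: Dict[str, List[float]] = {}
    for column_to_output_name, kwargs in parameterizations:
        frame = fn(**kwargs)  # step 1: one call per parameterization
        for column, output_name in column_to_output_name.items():
            outputs[output_name] = frame[column]  # step 2: one output per column
    return outputs


# Two parameterizations with two columns each give four named outputs.
nodes = expand_and_extract(
    run_experiment,
    [
        ({"energy": "energy_a", "force": "force_a"}, {"scale": 1.0}),
        ({"energy": "energy_b", "force": "force_b"}, {"scale": 2.0}),
    ],
)
print(sorted(nodes))
```

In this model the extraction step depends on the return value being column-addressable, which matches the point above that the function must return a dataframe rather than an arbitrary `dict`.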

Roy Kid

04/29/2024, 3:05 PM

Thanks. Stefan provided the protocol, I'm not that smart🥲. So no matter what type I use as the function's return type, the final result must always be a dataframe (I think so)? I am trying to write a custom `parameterize_extract_columns` that deals with dicts instead of pandas (I really don't like pandas). Also, what's the difference between `parameterize` and `Collectable[]` for parallel execution? If I want to run a DAG several times with different arguments, which one is better? You can find the source code here: https://github.com/MolCrafts/molexp/blob/master/src/molexp/tracker.py. After parallel execution I want to reduce the results from the different experiments.

Roy Kid

04/29/2024, 3:20 PM

Besides, even though I use `MultiThreadingExecutor`, the `parameterize` part still runs one by one, which confuses me... https://github.com/MolCrafts/molexp/blob/bbd126c53fcb402e04e3ca8f4e8bf153a43a4a1c/src/molexp/project.py#L57

Elijah Ben Izzy

04/29/2024, 4:23 PM

Ahh, ok, so you really want something like `@parameterize_extract_fields`, which we don't have (but open up an issue?). I'm curious: do you need one node for each field, or one node for each dict? If it's one node for each dict, you'll want to use `@parameterize`.

Elijah Ben Izzy

04/29/2024, 4:25 PM

The difference between that and `Collect[]`/`Parallelizable[]` is that:

1. `@parameterize` is static/fixed
2. `Collect[…]`/`Parallelizable[…]` is dynamic, decided at runtime. This uses a runtime-assigned (dynamic) output for a

The executor isn't smart enough to go over `@parameterize`; it can just repeat blocks between the `Parallelizable[…]` and `Collect[…]` nodes. So, what I think you want is the parallelizable construct: have one node list out your combinations and declare `Parallelizable`, and have the other do `Collect`.
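
The fan-out/fan-in shape suggested above can be illustrated with the standard library alone. This is a plain-Python analogy, not Hamilton code: `experiment_params`, `run_one`, and `collect` are hypothetical names, the temperatures and the energy formula are made up, and a `ThreadPoolExecutor` stands in for whatever executor the driver is configured with.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterator, List


def experiment_params() -> Iterator[Dict[str, float]]:
    """Fan-out step: lists the combinations at runtime (the Parallelizable role)."""
    for temperature in (100.0, 200.0, 300.0):
        yield {"temperature": temperature}


def run_one(params: Dict[str, float]) -> Dict[str, float]:
    """The block that is repeated once per listed item."""
    return {
        "temperature": params["temperature"],
        "energy": params["temperature"] * 0.5,
    }


def collect(results: List[Dict[str, float]]) -> float:
    """Fan-in step: reduces the per-item results (the Collect role)."""
    return sum(r["energy"] for r in results)


# Each item from the fan-out step is handed to a worker thread;
# pool.map returns results in the order the items were listed.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_one, experiment_params()))

total_energy = collect(results)
print(total_energy)
```

The number of items is only known once `experiment_params` runs, which is the "dynamic, decided at runtime" property; a static parameterization would instead fix the set of runs when the graph is defined.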

Roy Kid

04/29/2024, 4:47 PM

Thanks! I think I still need the static one to track my inputs and outputs. I will try to implement extract_field; if that doesn't work, I'll open an issue to ask for help!

Elijah Ben Izzy

04/29/2024, 4:53 PM

Sounds good!
