# hamilton-help
r
Hi, everyone! I'm trying to use `parameterize_extract_columns`, and I got an error that I think comes from result building. My code looks like this:
        @resolve(
            when=ResolveAt.CONFIG_AVAILABLE,
            decorate_with=lambda: parameterize_extract_columns(
                *[ParameterizedExtract(tuple(exp.name), {"exp": value(exp)}) for exp in self]
            ),
        )
        def mapper(exp: Experiment) -> dict:
            os.chdir(exp["run_dir"])
            dr = driver.Builder().with_modules(*modules).build()
            result = dr.materialize(*materializers, inputs=exp.param)
            # result = dr.execute(inputs=exp.param, final_vars=["load_sin"])
            return result
And the driver looks like this:
        dr = (
            driver.Builder()
            .with_modules(tracker)
            .enable_dynamic_execution(allow_experimental_mode=True)
            .with_execution_manager(execution_manager)
            .with_adapter(base.SimplePythonGraphAdapter(base.DictResult()))
            .with_config({settings.ENABLE_POWER_USER_MODE: True})
            .build()
        )
        os.chdir(root)
        results = dr.execute(
            final_vars=[name for name in parameters],
        )
I don't know how to register types with the result builder. Or does `parameterize_extract_columns` only support `pd.DataFrame`?
e
Hey — so I’m not sure what the error is (would need to run locally, not at my computer now), but it should work. That said, the result building should just put it into a dict. What’s failing? Also, the use of resolve is a little odd (but possibly quite clever) — it’s not taking in any parameters, but you’re using class state to feed the data in, right?
r
Sorry! I forgot to paste the error message:
FAILED tests/test_proj.py::TestProject::test_map_reduce - NotImplementedError: Cannot get column type for [<class 'dict'>]. Registered types are {'pandas': {'dataframe_type': <class 'pandas.core.frame.DataFrame'>, 'column_type': <class 'pandas.core.series.Series'>}}
And I can't understand the use case of `parameterize_extract_columns`. If I want to parameterize over a list of `Experiment` objects with different names, how should I use this decorator? (I have no clue at all; the "four columns, two for each parameterization" example is too difficult to understand.)
e
Woah, ok, you’re using Hamilton within Hamilton. Fun! Also crazy stuff.
`parameterize_extract_columns` basically does both `parameterize` and `extract_columns`: 1. parameterize over inputs, 2. extract columns for each of those inputs. It does this with a dataclass for each one. Note this is… quite complex. However, the core problem is that you’re returning a `dict` from the function and it has to be a dataframe (whatever type). Specifically, it has to be a dataframe because we run one node that returns that dataframe for each parameterization, and then extract from each of these. So the function gets repeated `n` times. Does this make sense? TBH there are likely better ways of doing this (it’s probably a little more complex than necessary), but I don’t have full context into what you’re working on…
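To make the "repeat the function `n` times, then extract from each result" idea concrete, here is a rough plain-Python sketch. It does not use Hamilton at all: `expand`, `mapper`, and the output names are made up for illustration, and a dict stands in for the dataframe, so treat it as a picture of the expansion rather than how the decorator is implemented.

```python
from typing import Any, Callable, Dict, List, Tuple

# One parameterization = (names for the extracted outputs, inputs bound to the function).
Parameterization = Tuple[Tuple[str, ...], Dict[str, Any]]


def expand(fn: Callable[..., Dict[str, Any]],
           parameterizations: List[Parameterization]) -> Dict[str, Any]:
    """Hypothetical helper: run fn once per parameterization, then split each result."""
    nodes: Dict[str, Any] = {}
    for output_names, bound_inputs in parameterizations:
        result = fn(**bound_inputs)  # one intermediate "node" per parameterization
        # Pair the requested output names with the result's fields, in order.
        for output_name, field in zip(output_names, result):
            nodes[output_name] = result[field]
    return nodes


def mapper(exp: Dict[str, int]) -> Dict[str, Any]:
    # Stand-in for the real experiment run; returns two fields per experiment.
    return {"energy": exp["scale"] * 2.0, "steps": exp["scale"] * 10}


# Two parameterizations with two outputs each: four final nodes.
nodes = expand(
    mapper,
    [
        (("exp_a_energy", "exp_a_steps"), {"exp": {"scale": 1}}),
        (("exp_b_energy", "exp_b_steps"), {"exp": {"scale": 3}}),
    ],
)
print(nodes)
```

This is the same shape as the "four columns, two for each parameterization" example: `mapper` runs twice, and each run contributes two named outputs.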
r
Thanks. Stefan provided the protocol, I'm not that smart🥲. Does that mean that no matter what type my function returns, the final result always has to be a dataframe (I think so)? I am trying to write a custom `parameterize_extract_columns` that deals with dicts instead of pandas (I really don't like pandas). Also, what's the difference between `parameterize` and `Parallelizable[]`/`Collect[]` for parallel execution? If I want to run a DAG several times with different arguments, which one is better? You can find the source code here: https://github.com/MolCrafts/molexp/blob/master/src/molexp/tracker.py. After parallel execution I want to reduce the results from the different experiments.
Besides, even though I use `MultiThreadingExecutor`, the `parameterize` part still runs one by one, which confuses me... https://github.com/MolCrafts/molexp/blob/bbd126c53fcb402e04e3ca8f4e8bf153a43a4a1c/src/molexp/project.py#L57
e
Ahh, ok, so you really want something like `@parameterize_extract_fields`, which we don’t have (but open up an issue?). I’m curious: do you need one node for each field, or one node for each dict? If it’s one node for each dict, you’ll want to use `@parameterize`.
The difference between that and `Parallelizable[…]`/`Collect[…]` is that:
1. `@parameterize` is static/fixed.
2. `Parallelizable[…]`/`Collect[…]` is dynamic, decided at runtime; it is driven by a runtime-assigned (dynamic) output.
The executor isn’t smart enough to go over `@parameterize`: it can only repeat the blocks between the `Parallelizable[…]` and `Collect[…]` nodes. So what I think you want is the parallelizable construct: have one node list out your combinations and declare `Parallelizable`, and have another node do the `Collect`.
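As a loose analogy for that "list out, run the block per item, collect" shape, here is a plain-Python sketch using only the standard library. It is not Hamilton's executor or API: `experiments`, `run_one`, and `reduce_results` are hypothetical names standing in for the `Parallelizable` node, the repeated block, and the `Collect` node respectively.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterator, List


def experiments() -> Iterator[Dict[str, int]]:
    # Plays the role of the Parallelizable node: the items are only
    # known when this generator actually runs.
    for scale in (1, 2, 3):
        yield {"name": f"exp_{scale}", "scale": scale}


def run_one(exp: Dict[str, int]) -> Dict[str, int]:
    # Plays the role of the block that gets repeated once per item.
    return {"name": exp["name"], "value": exp["scale"] ** 2}


def reduce_results(results: List[Dict[str, int]]) -> Dict[str, int]:
    # Plays the role of the Collect node: sees every per-item result at once.
    return {r["name"]: r["value"] for r in results}


with ThreadPoolExecutor(max_workers=3) as pool:
    # map() fans the items out to worker threads and returns results in input order.
    collected = list(pool.map(run_one, experiments()))

summary = reduce_results(collected)
print(summary)
```

By contrast, a static parameterization fixes the set of node names before anything runs, which is what makes the inputs and outputs easy to track but leaves nothing for a dynamic executor to fan out.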
r
Thanks! I think I still need the static one to track my inputs and outputs. I will try to implement `extract_fields`; if I don't succeed, I'll open an issue to ask for help!
e
Sounds good!