Issue with `overrides`. I am trying to use `overr...
# hamilton-help
s
Issue with `overrides`. I am trying to use `overrides`, but the driver says I am missing input values for nodes higher up the DAG. The missing input has a default value of `None` in the function definition.
def open_positions_charged_delivery_fees_w_trades(
    positions_charged_delivery_fees: pd.DataFrame, 
    filtered_open_deliverable_positions: pd.DataFrame
) -> pd.DataFrame:
    return filtered_open_deliverable_positions.merge(
        positions_charged_delivery_fees, 
        left_on=["Exch Contract"], 
        right_on=["KEY"], 
        how="left"
    )


def open_positions_w_prorata_fee_by_trade(
    open_positions_charged_delivery_fees_w_trades: pd.DataFrame,
    cob_date: datetime = None,
) -> pd.DataFrame:
    # Split each KEY's delivery fee across its trades in proportion to quantity.
    tbl = (
        open_positions_charged_delivery_fees_w_trades
        .assign(total_qty_by_key=lambda _df: _df.groupby(["KEY"])["EXTENDED_QTY"].transform("sum"))
        .assign(factor=lambda _df: _df["EXTENDED_QTY"] / _df["total_qty_by_key"])
        .assign(prorata_fee=lambda _df: _df["delivery_fee"] * _df["factor"] * -1)
    )

    if cob_date is not None:
        tbl.insert(0, "cob_date", cob_date)

    return tbl
driver
dr = (
    driver.Builder()
    # .with_config({
        # "files_to_process": derived_files_to_load(),
        # "report_filepath": ...
        # })
    # .enable_dynamic_execution(allow_experimental_mode=True)
    .with_modules(dataflow_delivery_fee_allocation)
    .with_adapter(base.PandasDataFrameResult())
    # .with_remote_executor(SynchronousLocalTaskExecutor())
    .build()
    )

cached_results = pd.read_excel(...)
df = dr.execute(["delivery_fee_pivot"], overrides={"open_positions_charged_delivery_fees_w_trades": cached_results})
# Stacktrace Error
Traceback (most recent call last):
  File "C:\codebase\rec-delivery-fee\run.py", line 99, in <module>
    df = dr.execute(["delivery_fee_pivot"], overrides={"open_positions_charged_delivery_fees_w_trades": cached_results})
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\codebase\rec-delivery-fee\dfee-venv\Lib\site-packages\hamilton\driver.py", line 552, in execute
    raise e
  File "C:\codebase\rec-delivery-fee\dfee-venv\Lib\site-packages\hamilton\driver.py", line 542, in execute
    outputs = self.raw_execute(_final_vars, overrides, display_graph, inputs=inputs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\codebase\rec-delivery-fee\dfee-venv\Lib\site-packages\hamilton\driver.py", line 632, in raw_execute
    Driver.validate_inputs(
  File "C:\codebase\rec-delivery-fee\dfee-venv\Lib\site-packages\hamilton\driver.py", line 513, in validate_inputs
    raise ValueError(error_str)
ValueError: 1 errors encountered:
  Error: Required input file_name not provided for nodes: ['cob_date', 'table_1', 'table_2', 'table_3'].
e
So it's looking like there are three nodes, `table_1`, `table_2`, `table_3`, and another one, `cob_date`, that all need the `file_name` input. If you do `visualize_execution` instead of `execute`, will it show that there are missing upstream dependencies?
s
Correct. But I have a cached file that is equivalent to the overrides step, and the subsequent steps do not need `file_name` or `cob_date`.
e
Is this in the middle of a `Parallelizable` block?
s
hmm could be one sec
I have config when to exclude those nodes for when I don't pass `is_parallel` to the config
e
So just to be clear, `table_{1,2,3}` and `cob_date` should be made redundant by the override? There’s no way they’re in the path?
s
The override is a non-dependent subsequent step
def open_positions_w_prorata_fee_by_trade(
    open_positions_charged_delivery_fees_w_trades: pd.DataFrame,
    cob_date: datetime = None,
) -> pd.DataFrame:
`cob_date` is derived from `file_name`, which is not passed to the config. Could `cob_date: datetime = None` make the DAG think that those nodes are needed?
e
Oh, the problem is that it's optional, maybe? If you take away the `= None`, does it work?
Or rather just take that away entirely
And hardcode `cob_date`
s
well, `file_name` -> `cob_date` -> `open_positions_w_prorata_fee_by_trade` (not dependent on either node, but set to `None` because the `cob_date` column is already in my table from the overrides)
e
So this `open_positions_w_prorata_fee_by_trade` depends on `cob_date`
s
Despite setting `cob_date` to `None` in the function to avoid using it when using `overrides` on a precomputed historical file, this still indicated to the DAG that it was an upstream dependency. The fix: pass `cob_date` to the `overrides` as well.
e
Thanks @Seth Stokes! To add: this is because `cob_date` was defined as a function. The DAG did not treat `cob_date` itself as required, but since it exists as a node it was still pulled in, so its dependencies (which weren’t passed in) were required.
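A minimal sketch of that behaviour, assuming a toy graph rather than Hamilton's actual internals. The `GRAPH` dict and the `missing_inputs` helper are hypothetical, and the graph is simplified to a single `table_1` node; only the node names come from the thread. It models an optional parameter that is still traversed when a function with that name exists, and an override that cuts traversal at the overridden node.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy graph: node -> [(dependency, is_optional)].
# Anything that is not a key here is treated as an external input.
GRAPH: Dict[str, List[Tuple[str, bool]]] = {
    "table_1": [("file_name", False)],
    "cob_date": [("file_name", False)],
    "open_positions_charged_delivery_fees_w_trades": [("table_1", False)],
    "open_positions_w_prorata_fee_by_trade": [
        ("open_positions_charged_delivery_fees_w_trades", False),
        ("cob_date", True),  # models `cob_date: datetime = None`
    ],
}


def missing_inputs(target: str, overrides: Set[str], inputs: Set[str]) -> Dict[str, List[str]]:
    """Return {missing external input: [nodes that need it]} for `target`."""
    missing: Dict[str, List[str]] = {}
    seen: Set[str] = set()
    stack = [target]
    while stack:
        name = stack.pop()
        if name in seen or name in overrides:
            continue  # an overridden node is never computed, so stop here
        seen.add(name)
        for dep, optional in GRAPH.get(name, []):
            if dep in overrides:
                continue
            if dep in GRAPH:
                # Function-defined dependency: walked even when optional.
                stack.append(dep)
            elif dep not in inputs and not optional:
                missing.setdefault(dep, []).append(name)
    return {k: sorted(v) for k, v in missing.items()}


target = "open_positions_w_prorata_fee_by_trade"
only_table = missing_inputs(target, {"open_positions_charged_delivery_fees_w_trades"}, set())
both = missing_inputs(target, {"open_positions_charged_delivery_fees_w_trades", "cob_date"}, set())
print(only_table)  # `file_name` is still demanded via `cob_date`
print(both)        # nothing missing once `cob_date` is overridden too
```

In this model, overriding only the cached table still leaves `file_name` required through `cob_date`, which mirrors the fix of adding `cob_date` to the overrides.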
s
Morning, should `overrides` work when asking for a node in a `subdag`? I'm erroring out on an `input` when trying to use `overrides` on a downstream node.
e
Mind an example of what you’re doing? My guess is no, or rather, you’ll have to be smart about it. Nodes inside a subdag are namespaced as `subdag_name.node_name`, meaning that the standard overrides won’t work (if I understand what you’re asking)
s
# transform_cme_raw_fee_schedule.py
def raw_cme_fee_schedule(data_location: str) -> pd.DataFrame:
    return pd.read_excel(data_location, header=[1, 2, 3])

def execution_types() -> pd.DataFrame: ...


# fee_schedules.py
import pandas as pd
from hamilton.function_modifiers import source, subdag

import cme
@subdag(
    cme,
    inputs={"data_location": source("data_location")}
)
def cme_fee_schedule(execution_types: pd.DataFrame) -> pd.DataFrame:
    ...

# run.py
import pandas as pd
from hamilton import driver, base

import fee_schedules

dr = (
    driver.Builder()
    .with_modules(fee_schedules)
    .with_adapter(base.PandasDataFrameResult())
    .build()
)

raw_cme_fee_schedule_ = pd.DataFrame()
dr.execute(["cme_fee_schedule"], overrides={"raw_cme_fee_schedule": raw_cme_fee_schedule_})
e
So yeah, the issue is: 1. It’s in the subdag, so the name will be `subdag_name.node_name` = `cme_fee_schedule.raw_cme_fee_schedule`. 2. It’s thus ambiguous as to which node the plain name refers to. So, try it with the new name, `cme_fee_schedule.raw_cme_fee_schedule`, as the key?
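A rough sketch of the naming issue, assuming only the `subdag_name.node_name` convention described above. The helpers `subdag_node_names` and `split_overrides` are hypothetical and are not part of Hamilton; they just model which override keys would line up with a node name in the final graph.

```python
from typing import Dict, Iterable, List, Tuple


def subdag_node_names(subdag_name: str, inner_nodes: Iterable[str]) -> List[str]:
    """Model: inner nodes get a `subdag_name.` prefix; the output node keeps the plain name."""
    return [f"{subdag_name}.{n}" for n in inner_nodes] + [subdag_name]


def split_overrides(graph_nodes: Iterable[str], overrides: Dict[str, object]) -> Tuple[List[str], List[str]]:
    """Return (keys that match a node name, keys that match nothing)."""
    known = set(graph_nodes)
    matched = sorted(k for k in overrides if k in known)
    unmatched = sorted(k for k in overrides if k not in known)
    return matched, unmatched


nodes = subdag_node_names("cme_fee_schedule", ["raw_cme_fee_schedule", "execution_types"])

# Plain name: no node in the graph carries it, so the override has nothing to replace.
plain = split_overrides(nodes, {"raw_cme_fee_schedule": "cached frame"})

# Namespaced name: lines up with the node inside the subdag.
namespaced = split_overrides(nodes, {"cme_fee_schedule.raw_cme_fee_schedule": "cached frame"})

print(plain)
print(namespaced)
```

Under this model the only change needed in `run.py` would be the override key, which is the suggestion above; whether the driver then skips the subdag's `data_location` input is not confirmed in the thread.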