# hamilton-help
s
thanks for the question @Walber Moreira! What do you want `df` to refer to exactly? The output of `all_data_lazyframe`? If that's the case, something like this should work:
```python
import glob

import polars as pl
from hamilton.function_modifiers import extract_columns

# RAW_NAME_MAPPING, BASE_OUTPUT_COLS, and TurbineParams are assumed to be
# defined elsewhere in this module.

def all_data_lazyframe(glob_path: str) -> pl.DataFrame:
    pl.enable_string_cache()
    files = glob.glob(glob_path)
    dataframes = []
    for file in files:
        tname = "T" + file.split("_")[3]
        print(f"Turbine: {tname}")
        dataframes.append(
            pl.scan_csv(
                file,
                skip_rows=9,
                null_values="NaN",
                infer_schema_length=50_000,
            ).with_columns(pl.lit(tname, dtype=pl.Categorical).alias("turbine"))
        )
    return (
        pl.concat(dataframes, how="vertical_relaxed", parallel=True)
        .rename(RAW_NAME_MAPPING)
        .collect()
    )

def transformed_timestamp(timestamp: pl.Series) -> pl.Series:
    return timestamp.str.to_datetime(format="%Y-%m-%d %H:%M:%S")

@extract_columns(*BASE_OUTPUT_COLS)
def remove_rows_when_power_below_cutin(all_data_lazyframe: pl.DataFrame) -> pl.DataFrame:
    """Remove instances where turbine power is zero or less, but wind speed is above cut-in speed."""
    c1 = ~(
        (pl.col("power") <= 0)
        & (pl.col("wind_speed") >= TurbineParams.cutin_wind_speed.value)
    )
    c2 = pl.col("power").is_not_null()
    return all_data_lazyframe.filter(c1).filter(c2)
```
In Hamilton, after loading data, it's best practice to do any global dataframe transforms (e.g. filters, sorts) before exposing columns for downstream feature engineering; that's why I moved the `@extract_columns` down.
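For context, here's a minimal sketch of how you'd run that DAG with Hamilton's driver; the module name `turbine_pipeline`, the requested output columns, and the glob path are all hypothetical stand-ins:
```python
from hamilton import driver

import turbine_pipeline  # hypothetical module containing the functions above

# The DAG order is load -> filter -> extract columns.
dr = driver.Driver({}, turbine_pipeline)
result = dr.execute(
    ["power", "wind_speed"],             # columns exposed by @extract_columns (assumed names)
    inputs={"glob_path": "data/*.csv"},  # hypothetical path
)
```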
w
In that case, that would work, because the order doesn't matter and the created DAG would be: load -> filter -> transform on column. But if I do want to run the filter after a column transform, what would be the option? (A simple use case would be a transform that makes the filter work.)
s
yep sure, let me sketch some code.
So it depends what you want to achieve. `@extract_columns` is a convenience decorator for pulling column references out of a dataframe. The assumption is that computations are immutable, so if you transform a column and then want a dataframe again, you need to build it manually:
```python
@extract_columns(...)
def all_data_lazyframe(...) -> pl.DataFrame:
    return df

def col_tr_A(...) -> pl.Series:
    return series

def col_tr_B(...) -> pl.Series:
    return series

def data_set(col_tr_A: pl.Series, col_tr_B: pl.Series, original_col: pl.Series) -> pl.DataFrame:
    # The arguments here specify what we want in the data set -- it could
    # also reference all_data_lazyframe too!
    return pl.DataFrame(...)
```
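To make that concrete, here's a runnable version of the sketch; the column names and the transform logic are hypothetical stand-ins, not from your pipeline:
```python
import polars as pl
from hamilton.function_modifiers import extract_columns

@extract_columns("wind_speed", "power")  # hypothetical column names
def all_data_lazyframe() -> pl.DataFrame:
    return pl.DataFrame({"wind_speed": [3.0, 7.5, 12.0], "power": [0.0, 150.0, 900.0]})

def wind_speed_smoothed(wind_speed: pl.Series) -> pl.Series:
    # Hypothetical transform: a 2-point rolling mean.
    return wind_speed.rolling_mean(window_size=2).alias("wind_speed_smoothed")

def power_kw(power: pl.Series) -> pl.Series:
    # Hypothetical transform: unit conversion.
    return (power / 1000).alias("power_kw")

def data_set(wind_speed_smoothed: pl.Series, power_kw: pl.Series) -> pl.DataFrame:
    # Rebuild a dataframe from the transformed columns; the filter now runs
    # *after* the column transforms, which was the original question.
    df = pl.DataFrame([wind_speed_smoothed, power_kw])
    return df.filter(pl.col("power_kw") > 0)
```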
You could instead just use dataframes all the way through, but you lose some visibility:
```python
def all_data_lazyframe(...) -> pl.DataFrame:
    return df

def data_set(all_data_lazyframe: pl.DataFrame) -> pl.DataFrame:
    col_tr_A = all_data_lazyframe[...] * ...  # some column transform
    col_tr_B = all_data_lazyframe[...] * ...  # another column transform
    return pl.concat(
        [col_tr_A.to_frame(), col_tr_B.to_frame(), all_data_lazyframe],
        how="horizontal",
    )
```
If you want to gain some visibility with the latter approach, you could use `@pipe`. Which one you choose depends on what's going on downstream, in my experience. We do have a `@with_columns` construct that works for pyspark, which we could also extend to polars/pandas quite easily; that could simplify the ergonomics here a bit.
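For reference, a minimal sketch of the `@pipe` approach, assuming `@pipe` chains its steps over the decorated function's first argument; the step functions and filter conditions here are hypothetical:
```python
import polars as pl
from hamilton.function_modifiers import pipe, step

def _remove_nulls(df: pl.DataFrame) -> pl.DataFrame:
    # Hypothetical step: drop rows with null power readings.
    return df.filter(pl.col("power").is_not_null())

def _remove_low_power(df: pl.DataFrame) -> pl.DataFrame:
    # Hypothetical step: a filter that runs after the column transforms.
    return df.filter(pl.col("power") > 0)

@pipe(
    step(_remove_nulls),
    step(_remove_low_power),
)
def data_set(transformed_data: pl.DataFrame) -> pl.DataFrame:
    # The steps above are applied to `transformed_data` before this body runs,
    # and each one shows up as its own node in the DAG.
    return transformed_data
```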
happy to jump on a quick call if that's easier to explain things, ask questions, or open an issue to make an improvement.
w
Thank you for the explanation! I think that's enough to figure it out now.
s
@Walber Moreira sounds good. We’re always open to feedback if you find something to improve upon :)