Slackbot
02/10/2024, 10:27 PM

Stefan Krawczyk
02/10/2024, 10:33 PM
What do you mean for df to refer to exactly? The output of all_data_lazyframe?
If that’s the case, something like this should work:
import glob

import polars as pl


def all_data_lazyframe(glob_path: str) -> pl.DataFrame:
    pl.enable_string_cache()
    files = glob.glob(glob_path)
    dataframes = []
    for file in files:
        tname = "T" + file.split("_")[3]
        print(f"Turbine: {tname}")
        dataframes.append(
            pl.scan_csv(
                file,
                skip_rows=9,
                null_values="NaN",
                infer_schema_length=50_000,
            ).with_columns(pl.lit(tname, dtype=pl.Categorical).alias("turbine"))
        )
    return (
        pl.concat(dataframes, how="vertical_relaxed", parallel=True)
        .rename(RAW_NAME_MAPPING)
        .collect()
    )
def transformed_timestamp(timestamp: pl.Series) -> pl.Series:
    return timestamp.str.to_datetime(format="%Y-%m-%d %H:%M:%S")


@extract_columns(*BASE_OUTPUT_COLS)
def remove_rows_when_power_below_cutin(all_data_lazyframe: pl.DataFrame) -> pl.DataFrame:
    """Remove instances where turbine power is zero or less, but wind speed is above cut-in speed."""
    c1 = ~(
        (pl.col("power") <= 0)
        & (pl.col("wind_speed") >= TurbineParams.cutin_wind_speed.value)
    )
    c2 = pl.col("power").is_not_null()
    return all_data_lazyframe.filter(c1).filter(c2)
In Hamilton, after loading data, it's best practice to do any global dataframe transforms (e.g. filters, sorts) before exposing columns for downstream feature engineering; that's why I moved the @extract_columns down.

Walber Moreira
02/10/2024, 10:36 PM

Stefan Krawczyk
02/10/2024, 10:38 PM

Stefan Krawczyk
02/10/2024, 10:56 PM
@extract_columns(...)
def all_data_lazyframe(...) -> pl.DataFrame:
    return df

def col_tr_A(...) -> pl.Series:
    return series

def col_tr_B(...) -> pl.Series:
    return series

def data_set(col_tr_A: pl.Series, col_tr_B: pl.Series, original_col: pl.Series) -> pl.DataFrame:
    # the arguments here specify what we want for the data set -- it could also reference all_data_lazyframe too!
    return pl.DataFrame(....)
You could instead just use dataframes all the way through -- but you lose some visibility:
def all_data_lazyframe(...) -> pl.DataFrame:
    return df

def data_set(all_data_lazyframe: pl.DataFrame) -> pl.DataFrame:
    col_tr_A = all_data_lazyframe[...] * ...
    col_tr_B = all_data_lazyframe[...] * ...
    return pl.concat([col_tr_A, col_tr_B, all_data_lazyframe])
If you want to gain some visibility with the latter approach, you could use @pipe.
Which you choose depends on what's going on downstream, in my experience. We do have a @with_columns construct that works for pyspark, which we could also extend to polars/pandas quite easily; that could simplify the ergonomics here a bit.

Stefan Krawczyk
02/10/2024, 10:58 PM

Walber Moreira
02/11/2024, 11:26 AM

Stefan Krawczyk
02/11/2024, 4:18 PM