Slackbot
02/10/2024, 10:27 PM

Stefan Krawczyk
02/10/2024, 10:33 PM
What do you mean for df to refer to exactly? The output of all_data_lazyframe?
If that’s the case, something like this should work:
import glob

import polars as pl


def all_data_lazyframe(glob_path: str) -> pl.DataFrame:
    pl.enable_string_cache()
    files = glob.glob(glob_path)
    dataframes = []
    for file in files:
        tname = "T" + file.split("_")[3]
        print(f"Turbine: {tname}")
        dataframes.append(
            pl.scan_csv(
                file,
                skip_rows=9,
                null_values="NaN",
                infer_schema_length=50_000,
            ).with_columns(pl.lit(tname, dtype=pl.Categorical).alias("turbine"))
        )
    return (
        pl.concat(dataframes, how="vertical_relaxed", parallel=True)
        .rename(RAW_NAME_MAPPING)
        .collect()
    )
def transformed_timestamp(timestamp: pl.Series) -> pl.Series:
    return timestamp.str.to_datetime(format="%Y-%m-%d %H:%M:%S")


@extract_columns(*BASE_OUTPUT_COLS)
def remove_rows_when_power_below_cutin(all_data_lazyframe: pl.DataFrame) -> pl.DataFrame:
    """Remove instances where turbine power is zero or less, but wind speed is above cut-in speed."""
    c1 = ~(
        (pl.col("power") <= 0)
        & (pl.col("wind_speed") >= TurbineParams.cutin_wind_speed.value)
    )
    c2 = pl.col("power").is_not_null()
    return all_data_lazyframe.filter(c1).filter(c2)
In Hamilton, after loading data, it's best practice to do any global dataframe transforms (e.g. filters, sorts) before exposing columns for downstream feature engineering; that's why I moved the @extract_columns down.

Walber Moreira
02/10/2024, 10:36 PM

Stefan Krawczyk
02/10/2024, 10:38 PM

Stefan Krawczyk
02/10/2024, 10:56 PM
@extract_columns(...)
def all_data_lazyframe(...) -> pl.DataFrame:
    return df

def col_tr_A(...) -> pl.Series:
    return series

def col_tr_B(...) -> pl.Series:
    return series

def data_set(col_tr_A: pl.Series, col_tr_B: pl.Series, original_col: pl.Series) -> pl.DataFrame:
    # the arguments here specify what we want for the data set -- it could also reference all_data_lazyframe too!
    return pl.DataFrame(....)
You could instead just use dataframes all the way through -- but you lose some visibility:
def all_data_lazyframe(...) -> pl.DataFrame:
    return df

def data_set(all_data_lazyframe: pl.DataFrame) -> pl.DataFrame:
    col_tr_A = all_data_lazyframe[...] * ...
    col_tr_B = all_data_lazyframe[...] * ...
    return pl.concat([col_tr_A, col_tr_B, all_data_lazyframe])
If you want to gain some visibility with the latter approach, you could use @pipe.
Which you choose depends on what's going on downstream, in my experience. We do have a @with_columns construct that works for pyspark, which we could also extend to polars/pandas quite easily; that could simplify the ergonomics here a bit.

Stefan Krawczyk
02/10/2024, 10:58 PM

Walber Moreira
02/11/2024, 11:26 AM

Stefan Krawczyk
02/11/2024, 4:18 PM