# hamilton-help
t
Next random question. We're evaluating Polars outside of Hamilton to run some large-volume data stuff without the complexity of Spark etc. Polars has decent support for lazy evaluation, predicate pushdown, etc. I'm assuming that to use Polars in Hamilton, though, you'd need to have Polars read all the data in prior to running your processing, since the code would then be used in a Hamilton DAG?
s
Nope. That shouldn’t be required.
PySpark is also lazily evaluated
t
interesting, I could have sworn I tried it the other day and it blew up. I shall assume I'm wrong and go back and check again.
s
So Hamilton helps you manage the code — and it won’t trigger computation unless you put in something that will trigger it.
Polars only works if the data fits into memory, vs say PySpark.
So that would be the thing to double check: can I use Polars for the data size I have in question 🙂
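To illustrate the laziness point, here is a minimal sketch (module, function, and path names are hypothetical; it assumes Hamilton's Builder API, whose execute returns a dict of results by default) of lazy Polars frames flowing through a Hamilton DAG, with nothing read from disk until .collect() is called:

# transforms.py (hypothetical module)
import polars as pl

def transactions(path: str) -> pl.LazyFrame:
    # scan_parquet builds a lazy query plan; nothing is read from disk yet
    return pl.scan_parquet(path)

def credits(transactions: pl.LazyFrame) -> pl.LazyFrame:
    # still lazy: this filter can be pushed down into the scan
    return transactions.filter(pl.col("credit_or_debit") == "credit")

# run.py
from hamilton import driver
import transforms

dr = driver.Builder().with_modules(transforms).build()
# building and executing the DAG just hands back the LazyFrame...
lazy = dr.execute(["credits"], inputs={"path": "data.parquet"})["credits"]
# ...computation is only triggered when you collect
df = lazy.collect()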
Otherwise a few other options (that require some cluster, though it could be local):
• dask dataframes
• modin
t
oh it's only a tiny test file
s
if you send a gist — happy to take a look
t
whereas when I used polars.read_parquet on the same file, the thing runs
s
ah — it’s a lazyframe
were you using a decorator with it?
t
import polars as pl
from hamilton.function_modifiers import extract_columns


@extract_columns('transaction_id', 'originator.account_number', 'beneficiary.account_number',
                 'transaction_sub_type', 'credit_or_debit')
def read_transaction_input() -> pl.DataFrame:
    return pl.read_parquet("/tmp/hive/data/hive/warehouse/consilient.db/new_table_name/part-00000-d9898992-0d9d-418c-ba9c-ef6923e81976-c000.snappy.parquet")
s
yeah that’s our type checking in the framework. Extract columns is just syntactic sugar for creating a function that takes in the result of that function and pulls out a column — so there’s a quick workaround.
yeah so if you want to work with lazyframes, then I don’t think you can break it out into columns, since the result of a computation on top of a lazyframe is a lazyframe AFAIK.
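For the eager pl.DataFrame case, that workaround is just writing the column-pulling functions yourself; a rough sketch (column and function names follow the snippet above, the path is a placeholder):

import polars as pl

def read_transaction_input() -> pl.DataFrame:
    return pl.read_parquet("transactions.parquet")  # placeholder path

# hand-written versions of the nodes @extract_columns would generate:
def transaction_id(read_transaction_input: pl.DataFrame) -> pl.Series:
    return read_transaction_input["transaction_id"]

def credit_or_debit(read_transaction_input: pl.DataFrame) -> pl.Series:
    return read_transaction_input["credit_or_debit"]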
To make the ergonomics of that better, we would need to implement a similar decorator to the one we have for pyspark called @with_columns.
Does that make sense? i.e. you can’t use @extract_columns. But you can still use Hamilton!
import polars as pl

def transaction_df() -> pl.LazyFrame:
    return pl.scan_parquet("/tmp/hive/data/hive/warehouse/consilient.db/new_table_name/part-00000-d9898992-0d9d-418c-ba9c-ef6923e81976-c000.snappy.parquet")
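Downstream functions can then depend on transaction_df and stay lazy until something collects; a sketch under that assumption (the transform names and filter values are made up):

def debits(transaction_df: pl.LazyFrame) -> pl.LazyFrame:
    # still a lazy query plan; the parquet file has not been read yet
    return transaction_df.filter(pl.col("credit_or_debit") == "debit")

def debits_by_account(debits: pl.LazyFrame) -> pl.DataFrame:
    # .collect() is the point where Polars actually executes the plan
    return (
        debits.group_by("originator.account_number")
        .agg(pl.len().alias("n_transactions"))
        .collect()
    )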
t
yeah, that mostly backs up my thought process from the other day, thanks!
s
pipe might be worth a look to help in the meantime.
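Assuming that refers to Hamilton's @pipe decorator (with step, both in hamilton.function_modifiers), here is a rough sketch of chaining transforms onto a LazyFrame input; the helper names and filter values are made up:

import polars as pl
from hamilton.function_modifiers import pipe, step

def _only_credits(df: pl.LazyFrame) -> pl.LazyFrame:
    return df.filter(pl.col("credit_or_debit") == "credit")

def _drop_internal(df: pl.LazyFrame) -> pl.LazyFrame:
    return df.filter(pl.col("transaction_sub_type") != "internal")

@pipe(
    step(_only_credits),
    step(_drop_internal),
)
def clean_transactions(transaction_df: pl.LazyFrame) -> pl.LazyFrame:
    # the steps above are applied to transaction_df before this body runs
    return transaction_df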
t
Thanks @Stefan Krawczyk