# hamilton-help
t
Next random question. We're evaluating Polars outside of Hamilton to run some large-volume data stuff without the complexity of Spark etc. Polars has decent support for lazy evaluation, predicate pushdown, etc. I'm assuming that to use Polars in Hamilton, though, you'd need to have Polars read all the data in prior to running your processing, since the code would then be used in a Hamilton DAG?
s
Nope. That shouldn’t be required.
PySpark is also lazily evaluated
t
interesting, I could have sworn I tried it the other day and it blew up. I shall assume I'm wrong and go back and check again.
s
So Hamilton helps you manage the code — and it won’t trigger computation unless you put in something that will trigger it.
Polars only works if the data fits into memory, vs say PySpark.
So that would be the thing to double check: can I use Polars for the data size I have in question 🙂
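To illustrate the laziness point, here is a minimal sketch (module, function, and path names are hypothetical; it assumes Hamilton's Builder API, whose execute returns a dict of results by default) of lazy Polars frames flowing through a Hamilton DAG, with nothing read from disk until .collect() is called:

# transforms.py (hypothetical module)
import polars as pl

def transactions(path: str) -> pl.LazyFrame:
    # scan_parquet builds a lazy query plan; nothing is read from disk yet
    return pl.scan_parquet(path)

def credits(transactions: pl.LazyFrame) -> pl.LazyFrame:
    # still lazy: this filter can be pushed down into the scan
    return transactions.filter(pl.col("credit_or_debit") == "credit")

# run.py
from hamilton import driver
import transforms

dr = driver.Builder().with_modules(transforms).build()
# building and executing the DAG just hands back the LazyFrame...
lazy = dr.execute(["credits"], inputs={"path": "data.parquet"})["credits"]
# ...computation is only triggered when you collect
df = lazy.collect()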
Otherwise a few other options (that require some cluster, though it could be local):
• dask dataframes
• modin
t
oh it's only a tiny test file
s
if you send a gist — happy to take a look
t
whereas when I used polars.read_parquet on the same file, the thing runs
s
ah — it’s a lazyframe
were you using a decorator with it?
t
import polars as pl
from hamilton.function_modifiers import extract_columns


@extract_columns('transaction_id', 'originator.account_number', 'beneficiary.account_number',
                 'transaction_sub_type', 'credit_or_debit')
def read_transaction_input() -> pl.DataFrame:
    return pl.read_parquet("/tmp/hive/data/hive/warehouse/consilient.db/new_table_name/part-00000-d9898992-0d9d-418c-ba9c-ef6923e81976-c000.snappy.parquet")
s
yeah that’s our type checking in the framework. Extract columns is just syntactic sugar for creating a function that takes in the result of that function and pulls out a column — so there’s a quick workaround.
yeah so if you want to work with lazyframes, then I don’t think you can break it out into columns, since the result of a computation on top of a lazyframe is a lazyframe AFAIK.
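For the eager pl.DataFrame case, that workaround is just writing the column-pulling functions yourself; a rough sketch (column and function names follow the snippet above, the path is a placeholder):

import polars as pl

def read_transaction_input() -> pl.DataFrame:
    return pl.read_parquet("transactions.parquet")  # placeholder path

# hand-written versions of the nodes @extract_columns would generate:
def transaction_id(read_transaction_input: pl.DataFrame) -> pl.Series:
    return read_transaction_input["transaction_id"]

def credit_or_debit(read_transaction_input: pl.DataFrame) -> pl.Series:
    return read_transaction_input["credit_or_debit"]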
To make the ergonomics of that better, we would need to implement a similar decorator to the one we have for pyspark called @with_columns.
Does that make sense? i.e. you can’t use @extract_columns. But you can still use Hamilton!
import polars as pl

def transaction_df() -> pl.LazyFrame:
    return pl.scan_parquet("/tmp/hive/data/hive/warehouse/consilient.db/new_table_name/part-00000-d9898992-0d9d-418c-ba9c-ef6923e81976-c000.snappy.parquet")
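Downstream functions can then depend on transaction_df and stay lazy until something collects; a sketch under that assumption (the transform names and filter values are made up):

def debits(transaction_df: pl.LazyFrame) -> pl.LazyFrame:
    # still a lazy query plan; the parquet file has not been read yet
    return transaction_df.filter(pl.col("credit_or_debit") == "debit")

def debits_by_account(debits: pl.LazyFrame) -> pl.DataFrame:
    # .collect() is the point where Polars actually executes the plan
    return (
        debits.group_by("originator.account_number")
        .agg(pl.len().alias("n_transactions"))
        .collect()
    )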
t
yeah, that mostly backs up my thought process from the other day, thanks!
s
pipe might be worth a look to help in the meantime.
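Assuming that refers to Hamilton's @pipe decorator (with step, both in hamilton.function_modifiers), here is a rough sketch of chaining transforms onto a LazyFrame input; the helper names and filter values are made up:

import polars as pl
from hamilton.function_modifiers import pipe, step

def _only_credits(df: pl.LazyFrame) -> pl.LazyFrame:
    return df.filter(pl.col("credit_or_debit") == "credit")

def _drop_internal(df: pl.LazyFrame) -> pl.LazyFrame:
    return df.filter(pl.col("transaction_sub_type") != "internal")

@pipe(
    step(_only_credits),
    step(_drop_internal),
)
def clean_transactions(transaction_df: pl.LazyFrame) -> pl.LazyFrame:
    # the steps above are applied to transaction_df before this body runs
    return transaction_df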
t
Thanks @Stefan Krawczyk