# general
j
I read through this pandera integration post. Anybody have a trip report on Polars? I'm debating whether to rely on it for a data pipeline project that deals with serialized embeddings.
I'd love to hear if there are any nits or gotchas with Polars.
t
IMO, there are few reasons to use pandas except its popularity and familiarity. If you have a team willing to learn Polars, go for it! The only caveat I can think of is that it's being actively developed. That's good because there are bug fixes and new functionality all the time, but they're also not afraid of renaming arguments or refactoring methods, so pin your versions! Their release page / changelog should give you a good idea: https://github.com/pola-rs/polars/releases/tag/py-1.10.0
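For example, an exact pin in requirements.txt (1.10.0 here only because it's the release linked above):
```
# requirements.txt: pin an exact Polars version until the API settles
polars==1.10.0
```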
🙌 1
j
I do love the ability to give fields more specific schemas than "list[obj]"
m
I remember trying Polars for one of my projects, and it went great until I found out that it can't handle element-wise operations on lists (i.e., the type of the column is object and the values are lists), which is something I needed for this particular application. In Polars it seemed I'd end up with a much more complex approach, whereas pandas made it simple. So that's one gotcha I found.
👍 1
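(For reference, current Polars does expose a `list` namespace for element-wise work on list-typed columns; a sketch below, with no claim it covers the case above.)
```python
import polars as pl

# toy list column; `list.eval` maps an expression over each list's elements
df = pl.DataFrame({"vals": [[1, 2], [3, 4, 5]]})
df.with_columns(
    doubled=pl.col("vals").list.eval(pl.element() * 2),  # element-wise
    total=pl.col("vals").list.sum(),                     # per-row reduction
)
```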
j
Yeah, the list support is great, but in this case it makes sense to require a vector of some sort, since a lot of the performance improvements can come from knowing the "shape" of the tensors one is working with.
👍 1
That alone is worth the price of admission for learning the quirks of its pl.col("blah") semantics.
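A sketch of what that buys you (toy data, 3-wide instead of 300; `pl.Array` fixes the vector width in the schema, unlike a plain `pl.List`):
```python
import polars as pl

df = pl.DataFrame(
    {"id": [1, 2], "embedding": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]},
    schema={"id": pl.Int64, "embedding": pl.Array(pl.Float32, 3)},
)
# pl.col expressions get an `arr` namespace for fixed-width arrays
df.select(pl.col("embedding").arr.sum())
```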
m
yeah, that sounds pretty nice
j
This plus the JupyterLab integration is "chef's kiss"
t
Yup, nested types (`Array` and `Mapping` in Arrow) are very exciting and an area of active development. For instance, ClickHouse added support for "automatically" handling columns of JSON objects. I believe Snowflake does something similar. I expect this area to keep moving quickly with the increasing use of unstructured and multimodal data
🙌 1
j
Love it.
Has anyone cooked up some shortcuts for quickly specifying DAGs? This would include doing production-unfriendly things like using reflection to detect and load Hamilton modules updated on a working path.
The process can become pretty interactive: I enjoy writing some code and then explaining the output in the notebook, cell by cell.
t
Curious to hear more! What does this iteration phase look like for you?
• Do you start with a DAG structure in mind?
• Do you write regular script/notebook code and then convert it to Hamilton?
• Do you iterate with notebooks, interactive sessions, scripts?
For me, I now use the notebook extension + caching a lot. I can add nodes incrementally, rebuild and visualize the dataflow, and execute it very effectively. I started using the Cursor IDE and it's pretty darn good at autocompleting Hamilton code in this context:
• Working with the notebook means all the context is in a single file that's easy for the LLM to read.
• I write small functions / Hamilton nodes with a good name + type hints.
• It typically suggests a good docstring just from the function signature. Then I tweak the docstring, and the completed code is a good start.
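Roughly, that loop looks like this (module and node names made up; the magic is the same one used later in this thread):
```python
# cell 1: load Hamilton's Jupyter magic
%load_ext hamilton.plugins.jupyter_magic

# cell 2: turn the cell body into a module named `features` and render the DAG
%%cell_to_module features --display
import polars as pl

def centered(raw: pl.Series) -> pl.Series:
    """Raw series shifted to zero mean."""
    return raw - raw.mean()

# cell 3: build a driver with caching so re-runs only recompute what changed
from hamilton import driver

dr = driver.Builder().with_modules(features).with_cache().build()
dr.execute(["centered"], inputs={"raw": pl.Series([1.0, 2.0, 3.0])})
```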
j
I'm figuring it out as I go. But on this last modeling project I wrote everything in the notebook using your Jupyter cell magic. Actually, I didn't even write most of it: I would write prompts and explanations in the cell above, and then generate it using Claude or my local Llama model.
I already had a DAG structure in mind, and there are generally some good first principles that make structuring the flow fairly obvious (break out embarrassingly parallel steps into their own nodes, etc.). I'm also using LLMs to set up training flows based on the data types and modeling needs. It's interesting to watch an LLM try to solve that: it comes up with techniques I hadn't considered from time to time, but it also often gets some detail wrong and winds up wasting time. So I'm trying to break the problem into chunks it can digest more easily.
But yeah, I start with a function signature like you do (to ensure it fits Hamilton's requirements) and have it generate a docstring. Then I copy the docstring back into my instructions and tweak it with more information (tensor shape if necessary). Then I have the LLM generate the function. If it does something I don't like, I add a "reminder" to the original prompt.
Generate a method called `train`, and include necessary imports. Put this at the top of the output so it can run in a jupyter cell: `%%cell_to_module train_model_category --display --write_to_file <file location>`

At this point there's a polars dataframe with 3 fields, which are each 300-dimensional sentence embeddings expressed as polars series:
* redacted: pl.Series
* redacted: pl.Series
* redacted: pl.Series

There's also a target field:
* expanded_category: pl.Series['str']

The target field is `expanded_category`, a simple text field that gives the category label. This is a method "train" that accepts each of the embedding series as an argument, along with the expanded_category series argument.
* It should train a simple pytorch model (SimpleModel) on the embeddings and category label.
* The model should include a constructor that specifies:
  * input_dimension: specified by the input_tensor shape
  * target_dimension: specified by the target_tensor shape
* It should concatenate the embeddings together into the input_dimension needed for the model.
* It should be simple single-label classification.
* It should return None, and include this in the type signature as a type hint.

Don't forget:
* It should remember to convert category_embedding into something appropriate for pytorch training.
* It should import polars as pl.
* It should convert polars series to tensors with the existing to_torch() method.
* It should convert the category to a vector using a multi-label binarizer.
* Calculate loss every epoch, and also calculate precision and recall on a holdout dataset.
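A hedged sketch of what a successful generation might look like (SimpleModel's internals, the layer sizes, and the placeholder argument names standing in for the redacted fields are all illustrative):
```python
import polars as pl
import torch
import torch.nn as nn
from sklearn.preprocessing import MultiLabelBinarizer


class SimpleModel(nn.Module):
    def __init__(self, input_dimension: int, target_dimension: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dimension, 128),
            nn.ReLU(),
            nn.Linear(128, target_dimension),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def train(
    embedding_a: pl.Series,  # placeholder names for the redacted fields
    embedding_b: pl.Series,
    embedding_c: pl.Series,
    expanded_category: pl.Series,
) -> None:
    """Train SimpleModel on the concatenated embeddings."""
    # polars -> torch via Series.to_torch(); each series of 300-dim
    # vectors becomes an (n, 300) tensor, concatenated to (n, 900)
    x = torch.cat(
        [s.to_torch() for s in (embedding_a, embedding_b, embedding_c)], dim=1
    ).float()
    # binarize the text labels, per the prompt's instructions
    mlb = MultiLabelBinarizer()
    y = torch.tensor(mlb.fit_transform([[c] for c in expanded_category]))
    model = SimpleModel(input_dimension=x.shape[1], target_dimension=y.shape[1])
    optimizer = torch.optim.Adam(model.parameters())
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(20):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y.argmax(dim=1))
        loss.backward()
        optimizer.step()
        # the real prompt also asks for precision/recall on a holdout
        # set each epoch; omitted here for brevity
        print(f"epoch {epoch}: loss={loss.item():.4f}")
```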
I have it in mind to write something about this. Do you want to do a blog post? I thought I could break down a Titanic dataset modeling flow with it.
t
A guest post would be great! It could be about your workflow, your tools, and how you approach creating Hamilton projects. I think declarative approaches and libraries that avoid too many abstractions will outperform alternatives as LLMs improve. It's a sentiment echoed by our friends at dlt
j
Oh neato, thanks for this.
I'll throw together a draft. I've pulled apart that Titanic data countless times.
Also finding another big gotcha here switching to Polars: they completely get rid of the concept of the "index"
That's huge; index use is all over the place in pandas. I can see how stepping away from that enables the more robust schema functionality in Polars, but yeah, big difference in approach there.
e
Yeah, Polars killing indices is very intentional. I like indices, but there are so many performance gotchas and other difficulties that it makes sense. Spark has the same setup. It's really just a join key, and you can maintain it better without the higher-level syntactic sugar if you prefer control.
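Something like this, if you want the old behavior explicitly (a sketch):
```python
import polars as pl

# no implicit index: carry an explicit row id and join on it like any key
df = pl.DataFrame({"value": [10, 20, 30]}).with_row_index("row_id")
labels = pl.DataFrame(
    {"row_id": [0, 2], "label": ["a", "b"]},
    schema_overrides={"row_id": pl.UInt32},  # match with_row_index's dtype
)
df.join(labels, on="row_id", how="left")
```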
👍 1
j
Ran into another gotcha. I used to collect model training metrics in a dataframe and then plot them with something like df.plot() to see how metrics changed over epochs. Super easy to memorize.
Polars lets you use Altair as a default plotting interface, but Altair doesn't like wide-form data: https://altair-viz.github.io/user_guide/data.html#long-form-vs-wide-form-data
Here's what Polars apparently wants you to do in this case (needs `polars.selectors` imported):
```python
import polars.selectors as cs

# `res` is the wide metrics dataframe: an "epoch" column plus one column
# per metric. Unpivot to long form (one row per epoch/metric pair), which
# is what Altair expects.
res.unpivot(index="epoch", on=~cs.by_name("epoch")).plot.line(
    x="epoch", y="value", color="variable"
)
```
(attached image: the resulting line plot)