# general
j
I read through this pandera integration post. Anybody have a trip report on Polars? I'm debating whether to rely on it for a data pipeline project that deals with serialized embeddings.
I'd love to hear if there are any nits or gotchas with Polars.
t
IMO, there are few reasons to use pandas except its popularity and familiarity. If you have a team willing to learn Polars, go for it! The only caveat I can think of is that it's being actively developed. That's good because there are bug fixes and new functionality all the time, but they're also not afraid of renaming arguments or refactoring methods, so pin your versions! Their release page / changelog should give you a good idea: https://github.com/pola-rs/polars/releases/tag/py-1.10.0
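For example, an exact pin in requirements.txt (1.10.0 here only because it's the release linked above):
```
# requirements.txt: pin an exact Polars version until the API settles
polars==1.10.0
```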
🙌 1
j
I do love the ability to give fields more specific schemas than "list[obj]"
m
I remember trying Polars for one of my projects, and it went great until I found out that it can't handle element-wise operations on lists (i.e., the type of the column is object and the values are lists), which is something I needed for this particular application. In Polars it seemed I'd end up with a much more complex approach, whereas pandas made it simple. So that's one gotcha I found.
👍 1
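(For reference, current Polars does expose a `list` namespace for element-wise work on list-typed columns; a sketch below, with no claim it covers the case above.)
```python
import polars as pl

# toy list column; `list.eval` maps an expression over each list's elements
df = pl.DataFrame({"vals": [[1, 2], [3, 4, 5]]})
df.with_columns(
    doubled=pl.col("vals").list.eval(pl.element() * 2),  # element-wise
    total=pl.col("vals").list.sum(),                     # per-row reduction
)
```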
j
Yeah, the list support is great, but in this case it makes sense to require a vector of some sort, since a lot of the performance improvements can come from knowing the "shape" of the tensors one is working with.
👍 1
That alone is worth the price of admission for learning the quirks of its pl.col("blah") semantics.
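A sketch of what that buys you (toy data, 3-wide instead of 300; `pl.Array` fixes the vector width in the schema, unlike a plain `pl.List`):
```python
import polars as pl

df = pl.DataFrame(
    {"id": [1, 2], "embedding": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]},
    schema={"id": pl.Int64, "embedding": pl.Array(pl.Float32, 3)},
)
# pl.col expressions get an `arr` namespace for fixed-width arrays
df.select(pl.col("embedding").arr.sum())
```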
m
yeah, that sounds pretty nice
j
This plus the JupyterLab integration is "chef's kiss"
t
Yup, nested types (`Array` and `Mapping` in Arrow) are very exciting and an area of active development. For instance, ClickHouse added support for "automatically" handling columns of JSON objects. I believe Snowflake does something similar. I expect this area to keep moving quickly with the increasing use of unstructured and multimodal data
🙌 1
j
Love it.
Has anyone cooked up some shortcuts for quickly specifying DAGs? This would include doing production-unfriendly things like using reflection to detect and load Hamilton modules updated on a working path.
The process can become pretty interactive: I enjoy writing some code and then explaining the output in the notebook, cell by cell.
t
Curious to hear more! What does this iteration phase look like for you?
• Do you start with a DAG structure in mind?
• Do you write regular script/notebook code and then convert it to Hamilton?
• Do you iterate with notebooks, interactive sessions, scripts?
For me, I now use the notebook extension + caching a lot. I can add nodes incrementally, rebuild and visualize the dataflow, and execute it very effectively. I started using the Cursor IDE and it's pretty darn good at autocompleting Hamilton code in this context:
• Working with the notebook means all the context is in a single file that's easy for the LLM to read.
• I write small functions / Hamilton nodes with a good name + type hints.
• It typically suggests a good docstring just from the function signature. Then I tweak the docstring, and the completed code is a good start.
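Roughly, that loop looks like this (module and node names made up; the magic is the same one used later in this thread):
```python
# cell 1: load Hamilton's Jupyter magic
%load_ext hamilton.plugins.jupyter_magic

# cell 2: turn the cell body into a module named `features` and render the DAG
%%cell_to_module features --display
import polars as pl

def centered(raw: pl.Series) -> pl.Series:
    """Raw series shifted to zero mean."""
    return raw - raw.mean()

# cell 3: build a driver with caching so re-runs only recompute what changed
from hamilton import driver

dr = driver.Builder().with_modules(features).with_cache().build()
dr.execute(["centered"], inputs={"raw": pl.Series([1.0, 2.0, 3.0])})
```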
j
I'm figuring it out as I go. But on this last modeling project I wrote everything in the notebook using your Jupyter cell magic. Actually, I didn't even write most of it: I would write prompts and explanations in the cell above, and then generate it using Claude or my local Llama model.
I already had a DAG structure in mind, and there are generally some good first principles that make structuring the flow fairly obvious (break out embarrassingly parallel steps into their own nodes, etc.). I'm also using LLMs to set up training flows based on the data types and modeling needs. It's interesting to watch an LLM try to solve that: it comes up with techniques I hadn't considered from time to time, but it also often gets some detail wrong and winds up wasting time. So I'm trying to break the problem into chunks it can digest more easily.
But yeah, I start with a function signature like you do (to ensure it fits Hamilton's requirements) and have it generate a docstring. Then I copy the docstring back into my instructions and tweak it with more information (tensor shape if necessary). Then I have the LLM generate the function. If it does something I don't like, I add a "reminder" to the original prompt.
Generate a method called `train`, and include necessary imports. Put this at the top of the output so it can run in a jupyter cell: `%%cell_to_module train_model_category --display --write_to_file <file location>`

At this point there's a polars dataframe with 3 fields, which are each 300-dimensional sentence embeddings expressed as polars series:
* redacted: pl.Series
* redacted: pl.Series
* redacted: pl.Series

There's also a target field:
* expanded_category: pl.Series['str']

The target field is `expanded_category`, a simple text field that gives the category label. This is a method "train" that accepts each of the embedding series as an argument, along with the expanded_category series argument.
* It should train a simple pytorch model (SimpleModel) on the embeddings and category label.
* The model should include a constructor that specifies:
  * input_dimension: specified by the input_tensor shape
  * target_dimension: specified by the target_tensor shape
* It should concatenate the embeddings together into the input_dimension needed for the model.
* It should be simple single-label classification.
* It should return None, and include this in the type signature as a type hint.

Don't forget:
* It should remember to convert category_embedding into something appropriate for pytorch training.
* It should import polars as pl.
* It should convert polars series to tensors with the existing to_torch() method.
* It should convert the category to a vector using a multi-label binarizer.
* Calculate loss every epoch, and also calculate precision and recall on a holdout dataset.
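A hedged sketch of what a successful generation might look like (SimpleModel's internals, the layer sizes, and the placeholder argument names standing in for the redacted fields are all illustrative):
```python
import polars as pl
import torch
import torch.nn as nn
from sklearn.preprocessing import MultiLabelBinarizer


class SimpleModel(nn.Module):
    def __init__(self, input_dimension: int, target_dimension: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dimension, 128),
            nn.ReLU(),
            nn.Linear(128, target_dimension),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def train(
    embedding_a: pl.Series,  # placeholder names for the redacted fields
    embedding_b: pl.Series,
    embedding_c: pl.Series,
    expanded_category: pl.Series,
) -> None:
    """Train SimpleModel on the concatenated embeddings."""
    # polars -> torch via Series.to_torch(); each series of 300-dim
    # vectors becomes an (n, 300) tensor, concatenated to (n, 900)
    x = torch.cat(
        [s.to_torch() for s in (embedding_a, embedding_b, embedding_c)], dim=1
    ).float()
    # binarize the text labels, per the prompt's instructions
    mlb = MultiLabelBinarizer()
    y = torch.tensor(mlb.fit_transform([[c] for c in expanded_category]))
    model = SimpleModel(input_dimension=x.shape[1], target_dimension=y.shape[1])
    optimizer = torch.optim.Adam(model.parameters())
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(20):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y.argmax(dim=1))
        loss.backward()
        optimizer.step()
        # the real prompt also asks for precision/recall on a holdout
        # set each epoch; omitted here for brevity
        print(f"epoch {epoch}: loss={loss.item():.4f}")
```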
I have it in mind to write something about this. Do you want to do a blog post? I thought I could break down a Titanic dataset modeling flow with it.
t
A guest post would be great! It could be about your workflow, your tools, and how you approach creating Hamilton projects. I think declarative approaches and libraries that avoid too many abstractions will outperform alternatives as LLMs improve. It's a sentiment echoed by our friends at dlt
j
Oh neato, thanks for this.
I'll throw together a draft. I've pulled apart that Titanic data countless times.
Also finding another big gotcha here switching to Polars: they completely get rid of the concept of the "index"
That's huge; index use is all over the place in pandas. I can see how stepping away from that enables the more robust schema functionality in Polars, but yeah, big difference in approach there.
e
Yeah, Polars killing indices is very intentional. I like indices, but there are so many performance gotchas and other difficulties that it makes sense. Spark has the same setup. It's really just a join key, and you can maintain it better without the higher-level syntactic sugar if you prefer control.
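Something like this, if you want the old behavior explicitly (a sketch):
```python
import polars as pl

# no implicit index: carry an explicit row id and join on it like any key
df = pl.DataFrame({"value": [10, 20, 30]}).with_row_index("row_id")
labels = pl.DataFrame(
    {"row_id": [0, 2], "label": ["a", "b"]},
    schema_overrides={"row_id": pl.UInt32},  # match with_row_index's dtype
)
df.join(labels, on="row_id", how="left")
```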
👍 1
j
Ran into another gotcha. I used to collect model training metrics in a dataframe and then plot them with something like df.plot() to see how metrics changed over epochs. Super easy to memorize.
Polars lets you use Altair as a default plotting interface, but Altair doesn't like wide-form data: https://altair-viz.github.io/user_guide/data.html#long-form-vs-wide-form-data
Here's what Polars apparently wants you to do in this case (needs `polars.selectors` imported):
```python
import polars.selectors as cs

# `res` is the wide metrics dataframe: an "epoch" column plus one column
# per metric. Unpivot to long form (one row per epoch/metric pair), which
# is what Altair expects.
res.unpivot(index="epoch", on=~cs.by_name("epoch")).plot.line(
    x="epoch", y="value", color="variable"
)
```
(attached image: the resulting line plot)