Justin Donaldson
10/21/2024, 5:51 PMJustin Donaldson
10/21/2024, 5:52 PMThierry Jean
10/21/2024, 5:55 PMJustin Donaldson
10/21/2024, 6:14 PMMichal Siedlaczek
10/22/2024, 12:54 PMJustin Donaldson
10/23/2024, 6:14 PMJustin Donaldson
10/23/2024, 6:18 PMMichal Siedlaczek
10/23/2024, 6:19 PMJustin Donaldson
10/23/2024, 6:20 PMThierry Jean
10/23/2024, 6:21 PM
`Array` and `Mapping` in Arrow) are very exciting and an area of active development. For instance, ClickHouse added support for "automatically" handling columns of JSON objects. I believe Snowflake does something similar.
I expect this area to be actively developed with the increasing use of unstructured and multimodal data.
Justin Donaldson
10/23/2024, 6:27 PMJustin Donaldson
10/23/2024, 6:28 PMJustin Donaldson
10/23/2024, 6:33 PMThierry Jean
10/23/2024, 6:49 PMJustin Donaldson
10/24/2024, 3:04 PMJustin Donaldson
10/24/2024, 3:21 PMJustin Donaldson
10/24/2024, 3:24 PMJustin Donaldson
10/24/2024, 3:24 PM
`train`, and include necessary imports. Put this at the top of the output so it can run in a Jupyter cell:
`
%%cell_to_module train_model_category --display --write_to_file <file location>
At this point there's a polars dataframe with 3 fields which are each 300 dimensional sentence embeddings expressed as polars series.
* redacted : pl.Series
* redacted: pl.Series
* redacted: pl.Series
There's also a target field
* expanded_category: pl.Series['str']
The target field is expanded_category, which is a simple text field that gives the category label.
This is a method "train" that accepts each of the embedding series as an argument, along with the expanded_category series argument.
* It should train a simple pytorch model (SimpleModel) on the embeddings and category label.
* The model should include a constructor that specifies:
  * input_dimension : specified by the input_tensor shape
  * target_dimension : specified by the target_tensor shape
* It should concatenate the embeddings together into the input_dimension needed for the model
* It should be simple single-label classification.
* It should return None, and include this in the type signature as a type hint.
Don't forget:
* It should remember to convert category_embedding into something appropriate for pytorch training.
* It should import polars as pl
* It should convert polars series to tensor with the existing to_torch() method
* It should convert the category to a vector using a multi-label binarizer.
* calculate loss every epoch, and also calculate precision and recall on a holdout dataset.
`
Justin Donaldson
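The label-handling and holdout-metric steps from the prompt above can be sketched without any framework. This is only a rough illustration, not the code the prompt produced: it swaps the multi-label binarizer for plain integer class indices, which is one common target format for single-label classification, and the helper names (`encode_labels`, `holdout_split`, `macro_precision_recall`) and the example category strings are made up for the example.

```python
from collections import Counter


def encode_labels(labels):
    """Map each distinct category string to an integer class index."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return [index[name] for name in labels], classes


def holdout_split(n_rows, holdout_fraction=0.2):
    """Return (train_idx, holdout_idx); the last rows are held out.

    Assumes the rows were shuffled beforehand.
    """
    n_holdout = max(1, int(n_rows * holdout_fraction))
    cut = n_rows - n_holdout
    return list(range(cut)), list(range(cut, n_rows))


def macro_precision_recall(y_true, y_pred, n_classes):
    """Average per-class precision and recall over all classes."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gets a false positive
            fn[truth] += 1  # true class gets a false negative
    precisions, recalls = [], []
    for c in range(n_classes):
        predicted_c = tp[c] + fp[c]
        actual_c = tp[c] + fn[c]
        precisions.append(tp[c] / predicted_c if predicted_c else 0.0)
        recalls.append(tp[c] / actual_c if actual_c else 0.0)
    return sum(precisions) / n_classes, sum(recalls) / n_classes


# Hypothetical category labels standing in for expanded_category.
labels = ["books", "toys", "books", "games", "toys", "books"]
targets, classes = encode_labels(labels)
train_idx, holdout_idx = holdout_split(len(targets))
```

In a training loop, `macro_precision_recall` would be called once per epoch on the holdout rows, alongside the loss, with `y_pred` taken as the highest-scoring class for each row.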
10/24/2024, 3:26 PMThierry Jean
10/25/2024, 3:30 PMJustin Donaldson
10/25/2024, 8:14 PMJustin Donaldson
10/25/2024, 8:15 PMJustin Donaldson
10/25/2024, 8:16 PMJustin Donaldson
10/25/2024, 8:16 PMJustin Donaldson
10/25/2024, 8:17 PMElijah Ben Izzy
10/25/2024, 8:36 PMJustin Donaldson
10/25/2024, 10:37 PMJustin Donaldson
10/25/2024, 10:40 PMJustin Donaldson
10/25/2024, 10:43 PM
res.unpivot(index="epoch", on=~cs.by_name("epoch")).plot.line(x="epoch", y="value", color="variable")
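To make the reshape in that one-liner concrete, here is a plain-Python sketch of what the unpivot step does to a wide per-epoch table before plotting. It assumes `res` has an `epoch` column plus one column per metric; the metric names (`loss`, `precision`, `recall`) are guessed from the prompt earlier in the thread, and the numbers are placeholders, not results from the thread. The `unpivot` function here is a stand-in written for illustration, not the polars implementation.

```python
def unpivot(rows, index, on):
    """Reshape wide rows into long (index, variable, value) rows, one column at a time."""
    long_rows = []
    for column in on:
        for row in rows:
            long_rows.append(
                {index: row[index], "variable": column, "value": row[column]}
            )
    return long_rows


# Placeholder metrics, one row per epoch (wide layout).
res = [
    {"epoch": 0, "loss": 1.20, "precision": 0.40, "recall": 0.35},
    {"epoch": 1, "loss": 0.90, "precision": 0.55, "recall": 0.50},
]

# Every column except "epoch", mirroring ~cs.by_name("epoch").
metric_columns = [name for name in res[0] if name != "epoch"]
long_res = unpivot(res, index="epoch", on=metric_columns)

for row in long_res:
    print(row)
```

Each wide row becomes one long row per metric, so a single line plot can use `value` for the y-axis and `variable` for the color, which is what the `plot.line` call relies on.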
Justin Donaldson
10/25/2024, 10:43 PM