# general
w
Is there a way to serialize a profile so i can store it somewhere like a db? i’d like to recall this profile at a later point and merge/compare it with another profile.
b
Yep
```python
import whylogs as why

row = {"a": 1, "b": 2, "c": 3}
result_set = why.log(row)

print(result_set.view().serialize())
```
That gives you `bytes` that you can store and load back up later into a `DatasetProfileView`. The biggest caveat is that you're going to be using a `DatasetProfileView` and not a `DatasetProfile` at that point, which means you can only merge it with other things; you can't add individual datapoints directly.
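A minimal sketch of that store-and-merge round trip, assuming the `serialize()`/`deserialize()` calls shown above plus the `merge()` method on `DatasetProfileView` (exact output depends on your whylogs version):

```python
import whylogs as why

# profile two batches separately (stand-ins for "training" and "new" data)
view_a = why.log({"a": 1, "b": 2, "c": 3}).view()
view_b = why.log({"a": 5, "b": 7, "c": 9}).view()

# round-trip through bytes, as if they had been stored in and read back from a db
blob_a = view_a.serialize()
blob_b = view_b.serialize()
restored_a = why.DatasetProfileView.deserialize(blob_a)
restored_b = why.DatasetProfileView.deserialize(blob_b)

# views can still be merged with each other, just not appended to row-by-row
merged = restored_a.merge(restored_b)
print(merged.to_pandas())  # needs pandas installed for the summary table
```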
w
thanks for the quick response! I’m still learning the intricacies of your api so apologies for the dumb questions. can you also compare DatasetProfileViews?
b
Np, no dumb questions. You can do a lot of the same stuff; you'll basically be manually comparing things that you can find on the profile view, like `._columns`. It doesn't have a convenient `.equals()` function, sadly.
You can load it back up with
```python
import whylogs as why

row = {"a": 1, "b": 2, "c": 3}
result_set = why.log(row)
ser = result_set.view().serialize()

view = why.DatasetProfileView.deserialize(ser)
print(view._columns)
```
You'll probably be most interested in these things
```python
view._columns
view._dataset_timestamp
view._creation_timestamp
view._metrics
view._metadata  # Dict[str, str]
```
Depends on why you're doing the manual comparison, I guess. You probably want to browse around the `_metrics` though. Are you just going to do things like comparing `column_a`'s avg to another profile's version of `column_a`'s avg?
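For that kind of spot check, something like the sketch below could work. It assumes `get_column()` and `to_summary_dict()` on the view, and that the flattened summary keys look like `distribution/mean` (key names can vary by whylogs version):

```python
import whylogs as why

# two profiles standing in for a reference batch and a newer batch
reference = why.log({"column_a": 10.0}).view()
current = why.log({"column_a": 12.5}).view()

# to_summary_dict() flattens a column's metrics into "metric/component" keys
ref_mean = reference.get_column("column_a").to_summary_dict()["distribution/mean"]
cur_mean = current.get_column("column_a").to_summary_dict()["distribution/mean"]

print(f"mean shift for column_a: {cur_mean - ref_mean}")
```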
w
at a high-level, the idea is to profile our training dataset, then store it. then, on a rolling basis, profile new feature data and compare it to the training dataset’s profile. each time we compare, we capture metrics and store them in timeseries to see drift over time. hopefully with this process we’re able to determine when we need to retrain our models
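One way that rolling storage could look, purely as a sketch: the sqlite choice and table layout here are made up for illustration, and the whylogs calls are just the `serialize()`/`deserialize()` pair from above.

```python
import sqlite3
import time

import whylogs as why

# hypothetical table: one row per profiled batch, keyed by capture time
conn = sqlite3.connect("profiles.db")
conn.execute("CREATE TABLE IF NOT EXISTS profiles (captured_at INTEGER, profile BLOB)")

# profile the latest batch of feature data and store the serialized view
view = why.log({"feature_1": 0.42, "feature_2": 7}).view()
conn.execute(
    "INSERT INTO profiles (captured_at, profile) VALUES (?, ?)",
    (int(time.time()), view.serialize()),
)
conn.commit()

# later: pull the most recent stored profile back out and rehydrate it
blob = conn.execute(
    "SELECT profile FROM profiles ORDER BY captured_at DESC LIMIT 1"
).fetchone()[0]
restored = why.DatasetProfileView.deserialize(blob)
```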
w
as far as what we’re comparing, i am no data scientist and was just hoping for automagic goodness 😄
ok, so i see some functions in the NotebookProfileVisualizer to compare 2 profile views, so that gets me closer. is there a way to also serialize the “profile summary” into a set of k/v pairs? we can dot notate stuff to flatten it if it has a more complex structure than simple dict[str, str]
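If it helps, a rough sketch of flattening a single profile view into dot-notated key/value pairs; it assumes `get_columns()` and `to_summary_dict()` are available on the view and its columns (the `_columns` attribute mentioned above would work too):

```python
import whylogs as why

view = why.log({"a": 1, "b": "x"}).view()

# hypothetical flattener: "column.metric/component" -> stringified value
flat = {}
for name, column in view.get_columns().items():
    for key, value in column.to_summary_dict().items():
        flat[f"{name}.{key}"] = str(value)

print(flat)  # e.g. {"a.distribution/mean": "1.0", ...}
```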
m
The profile summary is effectively contained in the profile; the summary is like a string representation of the more compact stats/distributions in the profile.
But if you wanted to build an index or separate table for some value in the profile you can do that
w
how’s that work? don’t you have to compare a target and reference profile to calculate that summary? i’m probably missing something
probably need one of my data scientist brethren in here to translate to/from software engineer speak for me 😞
m
Oh sorry, you mean the summary/drift report in the `NotebookProfileVisualizer`. That's right, those require two profiles; sorry, I thought you were referencing a "summary" method we have for displaying specific column profile views.
w
apologies. ultimately, after comparing the target and reference profile, i’d like some kind of meaningful data structure in return that i can also persist and possibly visualize as timeseries.
m
gotcha, so what people often do is store a timeseries of the profiles, and then calculate drift on each batch of data (against a reference profile they trained on or a trailing window). You can detect drift with the `NotebookProfileVisualizer`, or you can call `calculate_drift_scores` directly: https://github.com/whylabs/whylogs/blob/mainline/python/examples/advanced/Drift_Algorithm_Configuration.ipynb
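A sketch of calling it directly; the import path and keyword names below follow the linked notebook, but they may shift between whylogs versions, so treat them as assumptions:

```python
import pandas as pd
import whylogs as why
from whylogs.viz.drift.column_drift_algorithms import calculate_drift_scores

# small stand-ins for a reference (training) batch and a newer batch
reference_view = why.log(pd.DataFrame({"a": [1.0, 1.2, 0.9, 1.1], "b": [10, 11, 9, 10]})).view()
target_view = why.log(pd.DataFrame({"a": [3.4, 3.6, 3.5, 3.3], "b": [12, 13, 12, 14]})).view()

# returns a dict keyed by column name with the algorithm used, the score/p-value,
# and (with thresholds enabled) a drift category you can alert on
scores = calculate_drift_scores(
    target_view=target_view,
    reference_view=reference_view,
    with_thresholds=True,
)
print(scores)
```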
w
that’s a good idea. i don’t know if there’s value in seeing miniscule changes in a drift metric over time. i just thought it would be a cool thing to visualize. perhaps the metrics don’t matter until they matter and it’s better to just capture/alert on them when they matter
m
Then, based on interesting points, they might load the relevant profiles and run the `NotebookProfileVisualizer` reports from those stored profiles.
That being said, you can store the reports as well, though that's less common since it's pretty quick to regenerate them as needed.
w
i think `calculate_drift_scores` gets me that data structure i was looking for, so i can at least test out my idea from here. thanks for some alternative ideas on how to look at the problem. we're looking to automate all of this for any model we train and do it in a uniform way, with a giant list of assumptions that must be true for it to be repeatable across models. we'll be testing those assumptions too
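If you do persist those scores, a tiny hypothetical helper could turn whatever per-column dict `calculate_drift_scores` returns into rows for a timeseries table; it makes no assumptions about the exact keys inside each column's result:

```python
import time

def drift_rows(scores: dict, captured_at: int):
    """Flatten a {column: {metric: value}} drift result into (ts, column, metric, value) rows."""
    rows = []
    for column, result in scores.items():
        for key, value in (result or {}).items():
            if isinstance(value, (int, float, str)):
                rows.append((captured_at, column, key, value))
    return rows

# e.g. rows = drift_rows(calculate_drift_scores(...), int(time.time()))
```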
m
Great! sounds like a good scenario for whylogs
w
I think so too
m
Maybe you already saw this, but you might set up conditions if you have a known set of assumptions about your data, in addition to looking for drift: https://github.com/whylabs/whylogs/blob/mainline/python/examples/advanced/Condition_Validators.ipynb
That's a bit more complex to set up, since conditions require someone to define them, but we do have people using those in combination with drift detection in a scenario like you described.
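For reference, a rough sketch modeled on the linked notebook; the import paths, the `Predicate().greater_than(...)` relation, the action signature, and the `DatasetSchema(validators=...)` hookup are all things to double-check against your whylogs version:

```python
from typing import Any

import pandas as pd
import whylogs as why
from whylogs.core.metrics.condition_count_metric import Condition
from whylogs.core.relations import Predicate
from whylogs.core.schema import DatasetSchema
from whylogs.core.validators import ConditionValidator

# action fired whenever a condition fails for an incoming value
# (*args absorbs any extra positional arguments some versions may pass)
def on_failure(validator_name: str, condition_name: str, value: Any, *args):
    print(f"{validator_name}: condition {condition_name} failed for value {value}")

feature_1_positive = ConditionValidator(
    name="feature_1_positive",
    conditions={"is_positive": Condition(Predicate().greater_than(0))},
    actions=[on_failure],
)

df = pd.DataFrame({"feature_1": [0.1, 2.3, -5.0]})
schema = DatasetSchema(validators={"feature_1": [feature_1_positive]})
result = why.log(df, schema=schema)  # on_failure fires for the -5.0 row
```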
w
well, all you have to do is tell me that Predicates and other goodies from that `relations` package are also serializable in some form and i think we could make some use out of this
needing that entire package to be serializable is probably a naive take. after thinking on it some more, i could see a central package of well curated validators that can be referred to by name at runtime.
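That curated-package idea could be as simple as a name-to-factory registry (purely a sketch, nothing whylogs-specific): the runtime config only stores validator names, and the code that needs them looks the factories up and instantiates them.

```python
from typing import Callable, Dict, List

# hypothetical central registry: validator names -> factories that build them
VALIDATOR_FACTORIES: Dict[str, Callable] = {}

def register(name: str):
    def wrap(factory: Callable):
        VALIDATOR_FACTORIES[name] = factory
        return factory
    return wrap

@register("feature_1_positive")
def feature_1_positive():
    # build and return the ConditionValidator here (see the sketch above)
    ...

def build_validators(names: List[str]):
    """Resolve names stored in a db/config into freshly instantiated validators."""
    return [VALIDATOR_FACTORIES[name]() for name in names]
```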
m
Right, we are considering building more of a "suite" definition for the conditions that could be loaded, but so far this has been an exercise left to the reader to load values and instantiate the predicates. It would be useful to have a design partner in this area. Would you be willing to test out some ideas, or share your scenario a bit more with one of the engineers working on the data quality scenarios on our side?