# general
w
Is there a way to serialize a profile so i can store it somewhere like a db? i’d like to recall this profile at a later point and merge/compare it with another profile.
b
Yep
```python
import whylogs as why

row = {"a": 1, "b": 2, "c": 3}
result_set = why.log(row)

print(result_set.view().serialize())
```
That gives you `bytes` that you can store and load back up later into a `DatasetProfileView`. The biggest caveat is that you're going to be using a `DatasetProfileView` and not a `DatasetProfile` at that point, which means you can only merge it with other things; you can't add individual datapoints directly.
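A minimal sketch of that store-and-merge round trip, assuming the `serialize()`/`deserialize()` calls shown above plus the `merge()` method on `DatasetProfileView` (exact output depends on your whylogs version):

```python
import whylogs as why

# profile two batches separately (stand-ins for "training" and "new" data)
view_a = why.log({"a": 1, "b": 2, "c": 3}).view()
view_b = why.log({"a": 5, "b": 7, "c": 9}).view()

# round-trip through bytes, as if they had been stored in and read back from a db
blob_a = view_a.serialize()
blob_b = view_b.serialize()
restored_a = why.DatasetProfileView.deserialize(blob_a)
restored_b = why.DatasetProfileView.deserialize(blob_b)

# views can still be merged with each other, just not appended to row-by-row
merged = restored_a.merge(restored_b)
print(merged.to_pandas())  # needs pandas installed for the summary table
```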
w
thanks for the quick response! I’m still learning the intricacies of your api so apologies for the dumb questions. can you also compare DatasetProfileViews?
b
Np, no dumb questions. You can do a lot of the same stuff; you'll basically be manually comparing things that you can find on the profile view, like `._columns`. It doesn't have a convenient `.equals()` function, sadly.
You can load it back up with
```python
import whylogs as why

row = {"a": 1, "b": 2, "c": 3}
result_set = why.log(row)
ser = result_set.view().serialize()

view = why.DatasetProfileView.deserialize(ser)
print(view._columns)
```
You'll probably be most interested in these things
```python
view._columns
view._dataset_timestamp
view._creation_timestamp
view._metrics
view._metadata  # Dict[str, str]
```
Depends on why you're doing the manual comparison, I guess. You probably want to browse around the `_metrics` though. Are you just going to do things like comparing `column_a`'s avg to another profile's version of `column_a`'s avg?
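For that kind of spot check, something like the sketch below could work. It assumes `get_column()` and `to_summary_dict()` on the view, and that the flattened summary keys look like `distribution/mean` (key names can vary by whylogs version):

```python
import whylogs as why

# two profiles standing in for a reference batch and a newer batch
reference = why.log({"column_a": 10.0}).view()
current = why.log({"column_a": 12.5}).view()

# to_summary_dict() flattens a column's metrics into "metric/component" keys
ref_mean = reference.get_column("column_a").to_summary_dict()["distribution/mean"]
cur_mean = current.get_column("column_a").to_summary_dict()["distribution/mean"]

print(f"mean shift for column_a: {cur_mean - ref_mean}")
```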
w
at a high-level, the idea is to profile our training dataset, then store it. then, on a rolling basis, profile new feature data and compare it to the training dataset’s profile. each time we compare, we capture metrics and store them in timeseries to see drift over time. hopefully with this process we’re able to determine when we need to retrain our models
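One way that rolling storage could look, purely as a sketch: the sqlite choice and table layout here are made up for illustration, and the whylogs calls are just the `serialize()`/`deserialize()` pair from above.

```python
import sqlite3
import time

import whylogs as why

# hypothetical table: one row per profiled batch, keyed by capture time
conn = sqlite3.connect("profiles.db")
conn.execute("CREATE TABLE IF NOT EXISTS profiles (captured_at INTEGER, profile BLOB)")

# profile the latest batch of feature data and store the serialized view
view = why.log({"feature_1": 0.42, "feature_2": 7}).view()
conn.execute(
    "INSERT INTO profiles (captured_at, profile) VALUES (?, ?)",
    (int(time.time()), view.serialize()),
)
conn.commit()

# later: pull the most recent stored profile back out and rehydrate it
blob = conn.execute(
    "SELECT profile FROM profiles ORDER BY captured_at DESC LIMIT 1"
).fetchone()[0]
restored = why.DatasetProfileView.deserialize(blob)
```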
w
as far as what we’re comparing, i am no data scientist and was just hoping for automagic goodness 😄
ok, so i see some functions in the NotebookProfileVisualizer to compare 2 profile views, so that gets me closer. is there a way to also serialize the “profile summary” into a set of k/v pairs? we can dot notate stuff to flatten it if it has a more complex structure than simple dict[str, str]
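If it helps, a rough sketch of flattening a single profile view into dot-notated key/value pairs; it assumes `get_columns()` and `to_summary_dict()` are available on the view and its columns (the `_columns` attribute mentioned above would work too):

```python
import whylogs as why

view = why.log({"a": 1, "b": "x"}).view()

# hypothetical flattener: "column.metric/component" -> stringified value
flat = {}
for name, column in view.get_columns().items():
    for key, value in column.to_summary_dict().items():
        flat[f"{name}.{key}"] = str(value)

print(flat)  # e.g. {"a.distribution/mean": "1.0", ...}
```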
m
The profile summary is effectively contained in the profile; the summary is like a string representation of the more compact stats/distributions in the profile.
But if you wanted to build an index or separate table for some value in the profile you can do that
w
how’s that work? don’t you have to compare a target and reference profile to calculate that summary? i’m probably missing something
probably need one of my data scientist brethren in here to translate to/from software engineer speak for me 😞
m
Oh sorry, you mean the summary/drift report in the `NotebookProfileVisualizer`. That's right, those require two profiles; sorry, I thought you were referencing a "summary" method we have for displaying specific column profile views.
w
apologies. ultimately, after comparing the target and reference profile, i’d like some kind of meaningful data structure in return that i can also persist and possibly visualize as timeseries.
m
gotcha, so what people often do is store a timeseries of the profiles, and then calculate drift on each batch of data (against a reference profile they trained on or a trailing window). You can detect drift with the `NotebookProfileVisualizer`, or you can call `calculate_drift_scores` directly: https://github.com/whylabs/whylogs/blob/mainline/python/examples/advanced/Drift_Algorithm_Configuration.ipynb
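A sketch of calling it directly; the import path and keyword names below follow the linked notebook, but they may shift between whylogs versions, so treat them as assumptions:

```python
import pandas as pd
import whylogs as why
from whylogs.viz.drift.column_drift_algorithms import calculate_drift_scores

# small stand-ins for a reference (training) batch and a newer batch
reference_view = why.log(pd.DataFrame({"a": [1.0, 1.2, 0.9, 1.1], "b": [10, 11, 9, 10]})).view()
target_view = why.log(pd.DataFrame({"a": [3.4, 3.6, 3.5, 3.3], "b": [12, 13, 12, 14]})).view()

# returns a dict keyed by column name with the algorithm used, the score/p-value,
# and (with thresholds enabled) a drift category you can alert on
scores = calculate_drift_scores(
    target_view=target_view,
    reference_view=reference_view,
    with_thresholds=True,
)
print(scores)
```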
w
that’s a good idea. i don’t know if there’s value in seeing miniscule changes in a drift metric over time. i just thought it would be a cool thing to visualize. perhaps the metrics don’t matter until they matter and it’s better to just capture/alert on them when they matter
m
Then, based on interesting points, they might load the relevant profiles and run the `NotebookProfileVisualizer` reports from those stored profiles.
That being said, you can store the reports as well, though that's less common since it's pretty quick to regenerate them as needed.
w
i think `calculate_drift_scores` gets me that data structure i was looking for, so i can at least test out my idea from here. thanks for some alternative ideas on how to look at the problem. we're looking to automate all of this for any model we train and do it in a uniform way, with a giant list of assumptions that must be true for it to be repeatable across models. we'll be testing those assumptions too
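If you do persist those scores, a tiny hypothetical helper could turn whatever per-column dict `calculate_drift_scores` returns into rows for a timeseries table; it makes no assumptions about the exact keys inside each column's result:

```python
import time

def drift_rows(scores: dict, captured_at: int):
    """Flatten a {column: {metric: value}} drift result into (ts, column, metric, value) rows."""
    rows = []
    for column, result in scores.items():
        for key, value in (result or {}).items():
            if isinstance(value, (int, float, str)):
                rows.append((captured_at, column, key, value))
    return rows

# e.g. rows = drift_rows(calculate_drift_scores(...), int(time.time()))
```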
m
Great! sounds like a good scenario for whylogs
w
I think so too
m
Maybe you already saw this, but you might set up conditions if you have a known set of assumptions about your data, in addition to looking for drift: https://github.com/whylabs/whylogs/blob/mainline/python/examples/advanced/Condition_Validators.ipynb
That's a bit more complex to set up, since conditions require someone to define them, but we do have people using those in combination with drift detection in a scenario like you described.
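For reference, a rough sketch modeled on the linked notebook; the import paths, the `Predicate().greater_than(...)` relation, the action signature, and the `DatasetSchema(validators=...)` hookup are all things to double-check against your whylogs version:

```python
from typing import Any

import pandas as pd
import whylogs as why
from whylogs.core.metrics.condition_count_metric import Condition
from whylogs.core.relations import Predicate
from whylogs.core.schema import DatasetSchema
from whylogs.core.validators import ConditionValidator

# action fired whenever a condition fails for an incoming value
# (*args absorbs any extra positional arguments some versions may pass)
def on_failure(validator_name: str, condition_name: str, value: Any, *args):
    print(f"{validator_name}: condition {condition_name} failed for value {value}")

feature_1_positive = ConditionValidator(
    name="feature_1_positive",
    conditions={"is_positive": Condition(Predicate().greater_than(0))},
    actions=[on_failure],
)

df = pd.DataFrame({"feature_1": [0.1, 2.3, -5.0]})
schema = DatasetSchema(validators={"feature_1": [feature_1_positive]})
result = why.log(df, schema=schema)  # on_failure fires for the -5.0 row
```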
w
well, all you have to do is tell me that Predicates and other goodies from that `relations` package are also serializable in some form and i think we could make some use out of this
needing that entire package to be serializable is probably a naive take. after thinking on it some more, i could see a central package of well curated validators that can be referred to by name at runtime.
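That curated-package idea could be as simple as a name-to-factory registry (purely a sketch, nothing whylogs-specific): the runtime config only stores validator names, and the code that needs them looks the factories up and instantiates them.

```python
from typing import Callable, Dict, List

# hypothetical central registry: validator names -> factories that build them
VALIDATOR_FACTORIES: Dict[str, Callable] = {}

def register(name: str):
    def wrap(factory: Callable):
        VALIDATOR_FACTORIES[name] = factory
        return factory
    return wrap

@register("feature_1_positive")
def feature_1_positive():
    # build and return the ConditionValidator here (see the sketch above)
    ...

def build_validators(names: List[str]):
    """Resolve names stored in a db/config into freshly instantiated validators."""
    return [VALIDATOR_FACTORIES[name]() for name in names]
```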
m
Right, we are considering building more of a "suite" definition for the conditions that could be loaded, but so far this has been an exercise left to the reader to load values and instantiate the predicates. It would be useful to have a design partner in this area. Would you be willing to test out some ideas, or share your scenario a bit more with one of the engineers working on the data quality scenarios on our side?