powerful-potato-83513
12/07/2023, 6:52 AM
powerful-potato-83513
12/07/2023, 6:54 AM
acoustic-painter-98305
12/07/2023, 5:11 PM
cuddly-france-22384
01/12/2024, 7:37 PM
How do I call why.init() so that the metrics log onto my dashboard? I am using from langkit.config import check_or_prompt_for_api_keys to enter my WhyLabs keys and dataset ID, and I also tried why.init(session_type='whylabs').
When I type in my credentials, I have tried them both with and without quotes.
Here's what I get:
WARNING:whylogs.api.whylabs.session.session_manager:No api key found in session or configuration, will not be able to send data to whylabs.
WARNING:whylogs.api.whylabs.session.session_manager:No org id found in session or configuration, will not be able to send data to whylabs.
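A minimal sketch of one way to clear those warnings (assuming the standard WhyLabs environment variables shown in the reply below): export the API key, org ID, and dataset ID before calling why.init().
import os
import whylogs as why

# Placeholder values -- replace with your real WhyLabs credentials.
os.environ["WHYLABS_API_KEY"] = "YOUR-API-KEY"
os.environ["WHYLABS_DEFAULT_ORG_ID"] = "YOUR-ORG-ID"
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = "model-1"

# With the variables set before initialization, the session should pick up
# the credentials instead of warning that no api key / org id was found.
why.init(session_type="whylabs")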
cuddly-france-22384
01/12/2024, 7:38 PM
cuddly-france-22384
01/12/2024, 7:49 PM
### First, install whylogs with the whylabs extra
### pip install -q 'whylogs[whylabs]'
import pandas as pd
import os
import whylogs as why
os.environ["WHYLABS_API_KEY"] = "YOUR-API-KEY"
os.environ["WHYLABS_DEFAULT_ORG_ID"] = "YOUR-ORG-ID"
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = "model-1" # Note: the 'model-id' is provided when setting-up a model in WhyLabs
# Point to your local CSV if you have your own data
df = pd.read_csv("<https://whylabs-public.s3.us-west-2.amazonaws.com/datasets/tour/current.csv>")
# Run whylogs on current data and upload to the WhyLabs Platform
results = why.log(df)
results.writer("whylabs").write()
I get the message:
Skipping uploading profile to WhyLabs because no name was given with name=
mysterious-solstice-25388
01/13/2024, 1:36 AM
silly-cricket-55450
01/30/2024, 11:15 AM
lively-apartment-74947
02/01/2024, 6:26 AM
Hi team, I am exploring whylogs with Fugue, but I keep getting this error even after increasing spark.driver.maxResultSize to 50.0 GiB:
24/01/31 15:25:10 ERROR TaskSetManager: Total size of serialized results of 60 tasks (54.1 GiB) is bigger than spark.driver.maxResultSize (50.0 GiB)
24/01/31 15:25:10 ERROR TaskSetManager: Total size of serialized results of 61 tasks (55.0 GiB) is bigger than spark.driver.maxResultSize (50.0 GiB)
24/01/31 15:25:10 ERROR TaskSetManager: Total size of serialized results of 62 tasks (55.9 GiB) is bigger than spark.driver.maxResultSize (50.0 GiB)
24/01/31 15:25:10 ERROR TaskSetManager: Total size of serialized results of 63 tasks (56.8 GiB) is bigger than spark.driver.maxResultSize (50.0 GiB)
24/01/31 15:25:10 ERROR TaskSetManager: Total size of serialized results of 64 tasks (57.9 GiB) is bigger than spark.driver.maxResultSize (50.0 GiB)
24/01/31 15:25:10 INFO DAGScheduler: ResultStage 1 (_collect_as_arrow at /env/lib/python3.9/site-packages/fugue_spark/_utils/convert.py:206) failed in 257.895 s due to Job aborted due to stage failure: Total size of serialized results of 56 tasks (50.5 GiB) is bigger than spark.driver.maxResultSize (50.0 GiB)
This is my code:
import base64

from whylogs.api.fugue import fugue_profile


def profile_dataframe(transaction_id: str, featureset_id: str, messaging, spark_df) -> None:
    """
    Profile the Spark DataFrame and update the profile via API.

    Parameters:
    - transaction_id (str): The transaction ID for the API request.
    - featureset_id (str): The featureset ID for the API request.
    - spark_df: The input Spark DataFrame.
    - messaging: custom request service.

    Raises:
    - ValueError: If there is an error during profiling.
    """
    try:
        # Profile the Spark DataFrame using Fugue.
        # Since our input is already a Spark DataFrame, we don't need
        # to specify engine=spark; it is automatically inferred.
        dataset_profile_view = fugue_profile(spark_df)
        serialized_profile = dataset_profile_view.serialize()
        # Encode the serialized profile to base64
        output = base64.b64encode(serialized_profile).decode()
        # Update the profile via API
        update_profile_via_api(
            messaging,
            transaction_id=transaction_id,
            featureset_id=featureset_id,
            serialized_profile=output,
        )
    except ValueError as e:
        raise ValueError(f"An error occurred during profiling: {e}")
Can you please suggest a better approach to achieve this, if you know of one? Specifically, it is failing for datasets like this: the first column is actually coming up as varchar (I am aware it should be int or long, but that is the case with a few datasets; refer to the image attached). The data is fairly small and definitely not bigger than 50 GB, although it has 3210035928 records.
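As a general Spark-side workaround (not a whylogs-specific fix), the driver result-size cap named in the stack trace can be raised further, or disabled, when the SparkSession is built; the values below are illustrative assumptions. Profiling per logical partition so that less data is collected on the driver may also help, depending on the fugue_profile options available in your whylogs version.
from pyspark.sql import SparkSession

# Sketch: lift the driver-side cap that the TaskSetManager errors above hit.
# "0" means unlimited -- use with care, since the driver can then run out of memory.
spark = (
    SparkSession.builder
    .appName("whylogs-fugue-profiling")
    .config("spark.driver.maxResultSize", "0")
    .getOrCreate()
)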
thousands-match-39457
02/06/2024, 5:51 AM
results = why.log_classification_metrics(
    df,
    target_column="output_discount",
    prediction_column="output_prediction",
    score_column="output_score",
    log_full_data=True,
)
Here is the example notebook: https://github.com/whylabs/whylogs/blob/mainline/python/examples/integrations/writ[…]ion_Performance_Metrics_to_WhyLabs.ipynb?ref=content.whylabs.ai
lively-apartment-74947
02/08/2024, 6:58 AM
curved-coat-90458
02/13/2024, 7:32 PM
silly-cricket-55450
02/16/2024, 7:02 AM
import pandas as pd
import whylogs as why
# Simple JSON input data
json_data = [
{"deviceId": 373088, "uin": "CV620GVHEG0000007", "deviceType": "AC"},
{"deviceId": 373089, "uin": "CV620GVHEG0000008", "deviceType": "AC"},
{"deviceId": 373090, "uin": "CV620GVHEG0000009", "deviceType": "AC"}
]
# Convert JSON strings to a DataFrame
retail_daily = pd.DataFrame(json_data, columns=['json_string'])
# Log the data frame
results = why.log(pandas=retail_daily)
# Get the Results
profile = results.profile()
# Display the profile
profile.view().to_pandas()
Output received
| Column | cardinality/est | cardinality/lower_1 | cardinality/upper_1 | counts/inf | counts/n | counts/nan | counts/null | distribution/max | distribution/mean | distribution/median | ... | distribution/q_95 | distribution/q_99 | distribution/stddev | type | types/boolean | types/fractional | types/integral | types/object | types/string | types/tensor |
|-------------------|-----------------|----------------------|----------------------|------------|-----------|-------------|--------------|------------------|-------------------|---------------------|-----|-------------------|-------------------|----------------------|-----------------|----------------|------------------|----------------|--------------|---------------|--------------|
| json_string | 0.0 | 0.0 | 0.0 | 0 | 3 | 3 | 3 | NaN | 0.0 | None | ... | None | None | 0.0 | SummaryType.COLUMN | 0 | 0 | 0 | 0 | 0 | 0 |
Expected Profile Output:
• A detailed profile reflecting the characteristics of the JSON data such as data types, cardinality, counts, and distributions.
I would appreciate any insights or guidance on whether whylogs supports JSON input directly for generating profiles. If not, I'd love to know any workarounds or best practices to achieve this. Additionally, if there are any mistakes in my approach or if further clarification is needed, please feel free to let me know.
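One possible workaround, assuming the goal is per-field statistics rather than profiling raw JSON strings: build the DataFrame directly from the parsed records (for example with pd.json_normalize) so that deviceId, uin, and deviceType become real columns before logging.
import pandas as pd
import whylogs as why

json_data = [
    {"deviceId": 373088, "uin": "CV620GVHEG0000007", "deviceType": "AC"},
    {"deviceId": 373089, "uin": "CV620GVHEG0000008", "deviceType": "AC"},
    {"deviceId": 373090, "uin": "CV620GVHEG0000009", "deviceType": "AC"},
]

# Let pandas expand each record into columns instead of forcing a single
# 'json_string' column (which ends up all-NaN in the snippet above).
retail_daily = pd.json_normalize(json_data)

results = why.log(pandas=retail_daily)
results.profile().view().to_pandas()  # per-column counts, types, cardinality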
purple-airplane-15031
02/20/2024, 6:05 AM
happy-hamburger-71923
02/21/2024, 3:53 PM
happy-hamburger-71923
02/21/2024, 3:54 PM
happy-hamburger-71923
02/21/2024, 6:51 PM
mysterious-solstice-25388
02/21/2024, 8:32 PM
happy-hamburger-71923
02/22/2024, 9:26 AM
boundless-easter-47982
03/05/2024, 11:05 AM
astonishing-kangaroo-33717
03/05/2024, 12:22 PM
incalculable-motorcycle-82302
04/03/2024, 5:05 AM
Is there a way to apply a no_missing_values constraint over two (or more) columns simultaneously? The check should fail only for rows where both columns have missing values.
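As far as I can tell, the built-in constraint factories operate on a single profiled column, so one hedged workaround (a sketch, assuming the whylogs 1.x constraints API with ConstraintsBuilder and the no_missing_values factory) is to derive a helper column that is non-null whenever at least one of the two source columns has a value, and attach no_missing_values to it; the constraint then fails exactly when some row has both columns missing. The column names below are hypothetical.
import pandas as pd
import whylogs as why
from whylogs.core.constraints import ConstraintsBuilder
from whylogs.core.constraints.factories import no_missing_values

# Hypothetical data: the last row has both 'a' and 'b' missing.
df = pd.DataFrame({"a": [1, None, None], "b": [None, 2, None]})

# Helper column: null only where BOTH 'a' and 'b' are missing.
df["a_or_b"] = df["a"].combine_first(df["b"])

profile_view = why.log(df).profile().view()
builder = ConstraintsBuilder(dataset_profile_view=profile_view)
builder.add_constraint(no_missing_values(column_name="a_or_b"))
constraints = builder.build()
print(constraints.generate_constraints_report())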
incalculable-motorcycle-82302
04/08/2024, 6:52 AM
incalculable-motorcycle-82302
04/09/2024, 4:14 AM
acoustic-painter-98305
04/09/2024, 2:19 PM
echoing-orange-31613
06/18/2024, 3:23 PM
billions-easter-40437
06/20/2024, 9:44 AM
high-electrician-66573
06/26/2024, 7:31 AM
high-electrician-66573
06/27/2024, 10:47 AM
@classmethod
def zero(cls, config: Optional[MetricConfig] = None) -> "CustomMetric":
    return CustomMetric(
        n_row=IntegralComponent(0),
        missing_rate=FractionalComponent(0.0),
        duplicated_rate=FractionalComponent(0.0),
        n_unique=IntegralComponent(0),
        itype=StringComponent(""),
    )
I need itype to be a string; do we have any component for str?
high-electrician-66573
06/28/2024, 7:10 AM
whylogs.core.errors.UnsupportedError: Unsupported metric: dx_base_metric
I traced it and found that _METRIC_DESERIALIZER_REGISTRY does not contain the new custom metric.