powerful-potato-83513
12/07/2023, 6:52 AM
powerful-potato-83513
12/07/2023, 6:54 AM
acoustic-painter-98305
12/07/2023, 5:11 PM
cuddly-france-22384
01/12/2024, 7:37 PM
How do I call why.init() so that the metrics log onto my dashboard? I am using from langkit.config import check_or_prompt_for_api_keys to enter my WhyLabs keys and dataset ID, and I also tried why.init(session_type='whylabs').
When I type in my credentials, I have tried them both with and without quotes.
Here's what I get:
WARNING:whylogs.api.whylabs.session.session_manager:No api key found in session or configuration, will not be able to send data to whylabs.
WARNING:whylogs.api.whylabs.session.session_manager:No org id found in session or configuration, will not be able to send data to whylabs.
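A minimal sketch of one way to clear those warnings (assuming the standard WhyLabs environment variables shown in the reply below): export the API key, org ID, and dataset ID before calling why.init().
import os
import whylogs as why

# Placeholder values -- replace with your real WhyLabs credentials.
os.environ["WHYLABS_API_KEY"] = "YOUR-API-KEY"
os.environ["WHYLABS_DEFAULT_ORG_ID"] = "YOUR-ORG-ID"
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = "model-1"

# With the variables set before initialization, the session should pick up
# the credentials instead of warning that no api key / org id was found.
why.init(session_type="whylabs")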
cuddly-france-22384
01/12/2024, 7:38 PM
cuddly-france-22384
01/12/2024, 7:49 PM
### First, install whylogs with the whylabs extra
### pip install -q 'whylogs[whylabs]'
import pandas as pd
import os
import whylogs as why
os.environ["WHYLABS_API_KEY"] = "YOUR-API-KEY"
os.environ["WHYLABS_DEFAULT_ORG_ID"] = "YOUR-ORG-ID"
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = "model-1" # Note: the 'model-id' is provided when setting-up a model in WhyLabs
# Point to your local CSV if you have your own data
df = pd.read_csv("<https://whylabs-public.s3.us-west-2.amazonaws.com/datasets/tour/current.csv>")
# Run whylogs on current data and upload to the WhyLabs Platform
results = why.log(df)
results.writer("whylabs").write()
I get the message:
Skipping uploading profile to WhyLabs because no name was given with name=
mysterious-solstice-25388
01/13/2024, 1:36 AM
silly-cricket-55450
01/30/2024, 11:15 AM
lively-apartment-74947
02/01/2024, 6:26 AM
Hi team, I am exploring whylogs with Fugue, but I keep getting this error even after increasing spark.driver.maxResultSize to 50.0 GiB:
24/01/31 15:25:10 ERROR TaskSetManager: Total size of serialized results of 60 tasks (54.1 GiB) is bigger than spark.driver.maxResultSize (50.0 GiB)
24/01/31 15:25:10 ERROR TaskSetManager: Total size of serialized results of 61 tasks (55.0 GiB) is bigger than spark.driver.maxResultSize (50.0 GiB)
24/01/31 15:25:10 ERROR TaskSetManager: Total size of serialized results of 62 tasks (55.9 GiB) is bigger than spark.driver.maxResultSize (50.0 GiB)
24/01/31 15:25:10 ERROR TaskSetManager: Total size of serialized results of 63 tasks (56.8 GiB) is bigger than spark.driver.maxResultSize (50.0 GiB)
24/01/31 15:25:10 ERROR TaskSetManager: Total size of serialized results of 64 tasks (57.9 GiB) is bigger than spark.driver.maxResultSize (50.0 GiB)
24/01/31 15:25:10 INFO DAGScheduler: ResultStage 1 (_collect_as_arrow at /env/lib/python3.9/site-packages/fugue_spark/_utils/convert.py:206) failed in 257.895 s due to Job aborted due to stage failure: Total size of serialized results of 56 tasks (50.5 GiB) is bigger than spark.driver.maxResultSize (50.0 GiB)
This is my code:
import base64

from whylogs.api.fugue import fugue_profile


def profile_dataframe(transaction_id: str, featureset_id: str, messaging, spark_df) -> None:
    """
    Profile the Spark DataFrame and update the profile via API.

    Parameters:
    - transaction_id (str): The transaction ID for the API request.
    - featureset_id (str): The featureset ID for the API request.
    - spark_df: The input Spark DataFrame.
    - messaging: custom request service.

    Raises:
    - ValueError: If there is an error during profiling.
    """
    try:
        # Profile the Spark DataFrame using Fugue.
        # Since our input is already a Spark DataFrame, we don't need
        # to specify engine=spark; it is automatically inferred.
        dataset_profile_view = fugue_profile(spark_df)
        serialized_profile = dataset_profile_view.serialize()
        # Encode the serialized profile to base64
        output = base64.b64encode(serialized_profile).decode()
        # Update the profile via API
        update_profile_via_api(
            messaging,
            transaction_id=transaction_id,
            featureset_id=featureset_id,
            serialized_profile=output,
        )
    except ValueError as e:
        raise ValueError(f"An error occurred during profiling: {e}")
Can you please suggest a better approach to achieve this, if you know of one? Specifically, it is failing for datasets like this: the first column is actually coming up as varchar (I am aware it should be int or long, but that is the case with a few datasets; refer to the image attached). The data is fairly small and definitely not bigger than 50 GB, although it has 3210035928 records.
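As a general Spark-side workaround (not a whylogs-specific fix), the driver result-size cap named in the stack trace can be raised further, or disabled, when the SparkSession is built; the values below are illustrative assumptions. Profiling per logical partition so that less data is collected on the driver may also help, depending on the fugue_profile options available in your whylogs version.
from pyspark.sql import SparkSession

# Sketch: lift the driver-side cap that the TaskSetManager errors above hit.
# "0" means unlimited -- use with care, since the driver can then run out of memory.
spark = (
    SparkSession.builder
    .appName("whylogs-fugue-profiling")
    .config("spark.driver.maxResultSize", "0")
    .getOrCreate()
)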
thousands-match-39457
02/06/2024, 5:51 AM
results = why.log_classification_metrics(
    df,
    target_column="output_discount",
    prediction_column="output_prediction",
    score_column="output_score",
    log_full_data=True,
)
Here is the example notebook: https://github.com/whylabs/whylogs/blob/mainline/python/examples/integrations/writ[…]ion_Performance_Metrics_to_WhyLabs.ipynb?ref=content.whylabs.ai
lively-apartment-74947
02/08/2024, 6:58 AM
curved-coat-90458
02/13/2024, 7:32 PM
silly-cricket-55450
02/16/2024, 7:02 AM
import pandas as pd
import whylogs as why
# Simple JSON input data
json_data = [
{"deviceId": 373088, "uin": "CV620GVHEG0000007", "deviceType": "AC"},
{"deviceId": 373089, "uin": "CV620GVHEG0000008", "deviceType": "AC"},
{"deviceId": 373090, "uin": "CV620GVHEG0000009", "deviceType": "AC"}
]
# Convert JSON strings to a DataFrame
retail_daily = pd.DataFrame(json_data, columns=['json_string'])
# Log the data frame
results = why.log(pandas=retail_daily)
# Get the Results
profile = results.profile()
# Display the profile
profile.view().to_pandas()
Output received
| Column | cardinality/est | cardinality/lower_1 | cardinality/upper_1 | counts/inf | counts/n | counts/nan | counts/null | distribution/max | distribution/mean | distribution/median | ... | distribution/q_95 | distribution/q_99 | distribution/stddev | type | types/boolean | types/fractional | types/integral | types/object | types/string | types/tensor |
|-------------------|-----------------|----------------------|----------------------|------------|-----------|-------------|--------------|------------------|-------------------|---------------------|-----|-------------------|-------------------|----------------------|-----------------|----------------|------------------|----------------|--------------|---------------|--------------|
| json_string | 0.0 | 0.0 | 0.0 | 0 | 3 | 3 | 3 | NaN | 0.0 | None | ... | None | None | 0.0 | SummaryType.COLUMN | 0 | 0 | 0 | 0 | 0 | 0 |
Expected Profile Output:
• A detailed profile reflecting the characteristics of the JSON data such as data types, cardinality, counts, and distributions.
I would appreciate any insights or guidance on whether whylogs supports JSON input directly for generating profiles. If not, I'd love to know any workarounds or best practices to achieve this. Additionally, if there are any mistakes in my approach or if further clarification is needed, please feel free to let me know.
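One possible workaround, assuming the goal is per-field statistics rather than profiling raw JSON strings: build the DataFrame directly from the parsed records (for example with pd.json_normalize) so that deviceId, uin, and deviceType become real columns before logging.
import pandas as pd
import whylogs as why

json_data = [
    {"deviceId": 373088, "uin": "CV620GVHEG0000007", "deviceType": "AC"},
    {"deviceId": 373089, "uin": "CV620GVHEG0000008", "deviceType": "AC"},
    {"deviceId": 373090, "uin": "CV620GVHEG0000009", "deviceType": "AC"},
]

# Let pandas expand each record into columns instead of forcing a single
# 'json_string' column (which ends up all-NaN in the snippet above).
retail_daily = pd.json_normalize(json_data)

results = why.log(pandas=retail_daily)
results.profile().view().to_pandas()  # per-column counts, types, cardinality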
purple-airplane-15031
02/20/2024, 6:05 AM
happy-hamburger-71923
02/21/2024, 3:53 PM
happy-hamburger-71923
02/21/2024, 3:54 PM
happy-hamburger-71923
02/21/2024, 6:51 PM
mysterious-solstice-25388
02/21/2024, 8:32 PM
happy-hamburger-71923
02/22/2024, 9:26 AM
boundless-easter-47982
03/05/2024, 11:05 AM
astonishing-kangaroo-33717
03/05/2024, 12:22 PM
incalculable-motorcycle-82302
04/03/2024, 5:05 AM
Is there a way to apply a no_missing_values constraint over two (or more) columns simultaneously? The check should fail only for rows where both columns have missing values.
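As far as I can tell, the built-in constraint factories operate on a single profiled column, so one hedged workaround (a sketch, assuming the whylogs 1.x constraints API with ConstraintsBuilder and the no_missing_values factory) is to derive a helper column that is non-null whenever at least one of the two source columns has a value, and attach no_missing_values to it; the constraint then fails exactly when some row has both columns missing. The column names below are hypothetical.
import pandas as pd
import whylogs as why
from whylogs.core.constraints import ConstraintsBuilder
from whylogs.core.constraints.factories import no_missing_values

# Hypothetical data: the last row has both 'a' and 'b' missing.
df = pd.DataFrame({"a": [1, None, None], "b": [None, 2, None]})

# Helper column: null only where BOTH 'a' and 'b' are missing.
df["a_or_b"] = df["a"].combine_first(df["b"])

profile_view = why.log(df).profile().view()
builder = ConstraintsBuilder(dataset_profile_view=profile_view)
builder.add_constraint(no_missing_values(column_name="a_or_b"))
constraints = builder.build()
print(constraints.generate_constraints_report())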
incalculable-motorcycle-82302
04/08/2024, 6:52 AM
incalculable-motorcycle-82302
04/09/2024, 4:14 AM
acoustic-painter-98305
04/09/2024, 2:19 PM
echoing-orange-31613
06/18/2024, 3:23 PM
billions-easter-40437
06/20/2024, 9:44 AM
high-electrician-66573
06/26/2024, 7:31 AM
high-electrician-66573
06/27/2024, 10:47 AM
@classmethod
def zero(cls, config: Optional[MetricConfig] = None) -> "CustomMetric":
    return CustomMetric(
        n_row=IntegralComponent(0),
        missing_rate=FractionalComponent(0.0),
        duplicated_rate=FractionalComponent(0.0),
        n_unique=IntegralComponent(0),
        itype=StringComponent(""),
    )
I need itype to be a string; do we have any component for str?
high-electrician-66573
06/28/2024, 7:10 AM
whylogs.core.errors.UnsupportedError: Unsupported metric: dx_base_metric
I traced it and found that _METRIC_DESERIALIZER_REGISTRY does not contain the new custom metric.