hi
@glamorous-hospital-27667 - These are output metrics generated by whylogs. The results are statistical profiles of the dataset you are summarizing. Since you are running this on only one prompt and response record, every distribution value is the same (0.311745822429657). If you ran this across a larger dataset, you would see different values for each of the distribution metrics (max, min, mean, stddev, etc.). For a single record you can grab any of them, say mean or max, to extract your score, since they are all the same (0.311745822429657). The response.relevance_to_prompt computed column contains a similarity score between the prompt and the response. The higher the score, the more relevant the response is to the prompt. The score is the cosine similarity between embeddings generated from the prompt and the response. The embeddings are generated using the Hugging Face model sentence-transformers/all-MiniLM-L6-v2.
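If it helps to see the cosine-similarity step spelled out, here is a minimal sketch of the idea. It is not the actual whylogs/LangKit code; the names cosine_similarity, relevance_to_prompt and toy_encode are hypothetical, and toy_encode is only a letter-count stand-in for a real embedding model so the snippet runs without downloading anything.

```python
import math
from typing import Callable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return float(dot / (norm_a * norm_b))


def relevance_to_prompt(
    prompt: str,
    response: str,
    encode: Callable[[str], Sequence[float]],
) -> float:
    """Embed both texts with `encode` and compare the two embeddings."""
    return cosine_similarity(encode(prompt), encode(response))


def toy_encode(text: str) -> list:
    """Hypothetical stand-in encoder: counts of the letters a-z in the text."""
    lowered = text.lower()
    return [float(lowered.count(ch)) for ch in "abcdefghijklmnopqrstuvwxyz"]


if __name__ == "__main__":
    score = relevance_to_prompt("What is whylogs?", "whylogs is a logging library", toy_encode)
    print(score)
```

To get closer to the real metric, and assuming the sentence-transformers package is installed, you could pass SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2").encode as the encode argument in place of toy_encode. The numbers may still differ slightly from what whylogs reports.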