promptfoo #❓｜questions

Sending multiple responses to LLM judge

storied

10/20/2025, 8:26 PM

Is there a way to take multiple LLM responses and send them to another LLM which can then do some action using all of the responses? For example, if I run a prompt through three LLMs so I now have 3 responses. Can I send those 3 responses to another LLM and ask it to combine the three or some other action using all three. I see there is a "select best" metric but I don't want to choose one, I want to combine 3 responses in some fashion. Thank you for your help.

run `promptfoo eval` with NO models to register and evaluate manually?

pelopo

10/20/2025, 9:50 PM

Hi. I would like to only register my different prompts with customs labels for my combination of agents/model so that I manually can annotate them and assign the pass or no pass status. Without the actual run against any models API. I only want to make sure that the prompt is in the database and gives me the option to evaluate/mark/rate them, but the actual evaluation is done somewhere else, like in** claude code** or in codex. I launched

promptfoo eval

And immediately cancelled it. get the prompt in the database but there are no options to manually annotate them or rank them. Basically I need the evaluation metrics even if the problem didn't run - #evals https://cdn.discordapp.com/attachments/1429949929895493723/1429949931329683586/image.png?ex=68f80026&is=68f6aea6&hm=baba382cc34013385292eccf32ab9408222c2a47ff0715eba7e58f97ba05d31d& https://cdn.discordapp.com/attachments/1429949929895493723/1429949932080599221/image.png?ex=68f80026&is=68f6aea6&hm=ed959dc3f0dc443702e9aac5887b6df4b4eaccfbd2947aee4ff61e69fa5edff6&

Question regarding data retention and privacy for api.promptfoo.app

gonkm

10/21/2025, 4:57 AM

Hello promptfoo team, We are considering using promptfoo's remote Grading and Attack generation features. When using these features, we understand that a POST request is sent to the https://api.promptfoo.app/api/v1/task endpoint. The request body of this call may contain sensitive and confidential information from our product under development, such as prompts, context, and test cases. From a security and confidentiality standpoint, we need to clarify your data handling policies. Could you please let us know: Is the content of the request body (prompts, variables, etc.) sent to this endpoint persistently stored on promptfoo's servers (e.g., in databases or log files)? If it is stored, for how long is this data retained? Is our understanding correct that the data is deleted immediately after the task (grading/attack generation) is executed? We need to ensure our sensitive data is handled appropriately. We would greatly appreciate it if you could provide details on your data retention policy. Thank you.

Error: "self-signed certificate in certificate chain" with tls config

dulax

10/21/2025, 9:50 PM

Hi, when I setup TLS in my config, even with

rejectUnauthorized: false

I keep getting the above error. Is there anything more I should be doing?

Copy code

- id: https
  config:
    url: https://<myhost>:8080/chat
    method: POST
    tls:
      certPath: 'client.cert'
      keyPath: 'client.key'
      caPAth: 'ca.cert'
      rejectUnauthorized: false

Error

Copy code

Request to https://myhost:8080/v1/run failed, retrying: TypeError: fetch failed (Cause: Error: self-signed certificate in certificate chain)

Prompt name in Prompt tab table

pelopo

10/22/2025, 11:04 AM

How to add "prompt name" or some label to the table of prompts in the prompt tab in the UI. By default I only see the ID (some UUID) and prompt text, but no way to see which prompt is it, or which variation. I tried to add labels to the prompt in "promptfooconfig" but it didn't help. Thanks https://cdn.discordapp.com/attachments/1430512074143830086/1430512074370580490/prompt_tab.jpg?ex=68fa0bb0&is=68f8ba30&hm=c07c53b81e984a3f9bcc2afb25603056ea71a2d96b0bc11c275f3e33cc265feb&

Add SharePoint Dataset Support

Teti

10/22/2025, 1:30 PM

Hey everyone! We're thinking about contributing a new feature to Promptfoo — adding support for pulling datasets directly from SharePoint. In our setup, SharePoint is the single source of truth for evaluation data. It’d be super helpful if Promptfoo could read datasets from a SharePoint file URL (CSV or Excel), similar to how it currently works with Google Sheets. The idea is to: Let users reference a SharePoint dataset link in the tests: field of their config. Support private file access via Microsoft Graph API with authentication. Before diving in, I just wanted to check with the maintainers/community: Would you be open to a PR adding this feature?

Automatic trace injection to promptfoo from coding assistants and not API?

pelopo

10/23/2025, 1:48 PM

Aye, I was wondering if there is any automagical way to send to promptfoo traces or conversations that I have with the likes of Codex, Claude Code, Amp or some other coding assistants that use subscription model instead of and API. Something that would listen real time and make it go to promptfoo for me to evaluate on each turn basis? Maybe some third party tool? Thanks

Evaled or not tick in the results table.

pelopo

10/23/2025, 2:15 PM

In the Results table in the UI, I don't see any obvious way to tell which runs were evaluated and which are not. I have to click each line to see whether I gave it a thumbs up, thumbs down, or left a comment, etc. It would be nice to have a column with a tick if any evaluation criteria were used, making it easy to see what needs work and what doesn't. For example, in the attached screenshot there are 4 runs and I have touched all of them, but only the one with the red percentage clearly shows I evaluated it negatively. The other 3 with 100% in green were also evaluated, but it's unclear whether they were, because I left comments and ratings. https://cdn.discordapp.com/attachments/1430922506558247012/1430922507095113989/image.png?ex=68fb89ee&is=68fa386e&hm=cdd74783b89bd32fbf2f26a723402fd311f7a870ccd5dd2a33bba39a87c858ce&

How to test dynamic multi-turn conversations in Promptfoo?

Đức Duy

10/23/2025, 3:28 PM

Hi everyone 👋 I’m testing a medical chatbot agent that starts from a symptom (e.g. “I have stomach pain”), then asks several related questions, and finally recommends a suitable clinic. The problem is that each test run may have different question wording or order, so I can’t predefine all user inputs in advance. I’d like to dynamically provide user replies based on the agent’s last question — for example, if the agent asks about pain_location, I return the predefined answer for that property. Is there any recommended way in Promptfoo to handle this kind of dynamic multi-turn input-output flow? Thanks!

Environment variable substitution only working some places

crizm

10/23/2025, 11:57 PM

I'm trying to use environment variables defined in a .env file to specify a default provider for llm-rubric: defaultTest: options: provider: id: "azure:chat:{{ env.MY_DEPLOYMENT}}" config: apiVersion: "{{ env.API_VERSION }}" apiHost: "{{ env.AZURE_ENDPOINT }}" For some reason, only env.MY_DEPLOYMENT gets replaced. "{{ env.AZURE_ENDPOINT }}" does not (nor does API_VERSION, and there doesn't appear to be a way to affect that through preset environment variables) and results in a "Failed to sanitize URL" error. Any idea what's wrong here?

OpenRouter - API error: 401 // message: No auth credentials found

haveles

10/24/2025, 2:39 PM

Hi all, I'm encountering persistent 401 Unauthorized errors when trying to use OpenRouter providers in my self evaluation and model comparison configs, despite having a working API key and successful direct API calls. Error Details: [ERROR] API error: 401 Unauthorized {"error":{"message":"No auth credentials found","code":401}} What's Working: OpenRouter API key works perfectly with direct curl calls Successfully configured and ran deterministic A/B testing for 3 LLMs using OpenRouter Environment variable OPENROUTER_API_KEY is properly set Current Configuration (that works for A/B testing): providers: - id: openrouter:anthropic/claude-3.5-sonnet config: temperature: 0.0 max_tokens: 2000 apiKey: ${OPENROUTER_API_KEY} What's Failing: Self-grading config with identical provider setup Model comparison config with identical provider setup All attempts result in 401 errors Attempted Fixes: Variable syntax variations: ${OPENROUTER_API_KEY}, "{{ env.OPENROUTER_API_KEY }}" Provider ID variations: different model names and versions Configuration approaches: Direct OpenRouter, OpenAI with custom base URL, Anthropic with custom base URL Environment handling: shell variables, --var flag, --env-file flag Removed llm-rubric assertions in attempt to fix authentication issues System Info: Promptfoo version: 0.118.17 OS: macOS Any insights on what might be causing this inconsistent behavior would be greatly appreciated!

Clarification regarding Red Team configuration

tanktg

10/28/2025, 12:10 PM

Hi all, I am working for a cybersecurity service provider and we would like to use Promptfoo to test LLM applications of our customers. Data privacy is of major importance to us, and we therefore don't want to send any data or requests of any sort to PromptFoo's cloud services. In practice, this means that adversarial input generation, response evaluation and grading of attacks should all happen in our systems, and that all telemetry should be disabled. Looking at the documentation (https://www.promptfoo.dev/docs/red-team/configuration/#how-attacks-are-generated), we have several questions regarding the correct configuration to use: Will setting the PROMPTFOO_DISABLE_REDTEAM_REMOTE_GENERATION env var to true prevent adversarial input generation requests to be sent to promptFoo's API, while allowing us to use our own remote LLM deployed in our cloud environment? Or should we specify our own attacker model provider in the config file, while leaving PROMPTFOO_DISABLE_REDTEAM_REMOTE_GENERATION to its default value, false? Additionally, I understand that it is possible to override the default grader by specifying a custom one in the config file: https://www.promptfoo.dev/docs/red-team/troubleshooting/grading-results/#overriding-the-grader. Will making those two configuration changes (specifying a custom attacker model provider, and a custom grader) be enough to ensure that no data (including telemetry of usage data) is ever sent to promptFoo's services? If not, what additional configuration is needed to achieve this? Thanks

How to hook context in YAML?

Alex1990

10/28/2025, 3:45 PM

Hi, everyone. I spent around 3-4 hours to understand how dynamic context works, but whatever I did, every time I got an error. I connected to my own RAG using custom call_api

Copy code

def call_api(prompt, options=None, context=None):
.................. some logic......
    data = response.json()
    contexts = [source.get('content', '')
                for source in data.get('sources', [])]

    return {
        "output": data.get('content', ''),
        "context": context_text
    }

and part of YAML for this metric

Copy code

assert:
      - type: context-relevance
        contextTransform: context
        value: ''

But when I tried to catch this context field from the RAG response, I got an error below Whatever I did, I tried to use a string or array, just context or output.context, every time I had an error

Copy code

Error: Failed to transform context using expression 'context': Invariant failed: contextTransform must return a string or array of strings. Got object. Check your transform expression: context

Error: Failed to transform context using expression 'context': Invariant failed: contextTransform must return a string or array of strings. Got object. Check your transform expression: context
    at resolveContext (/Users/aleksandrmeskov/.npm/_npx/81bbc6515d992ace/node_modules/promptfoo/dist/src/assertions/contextUtils.js:60:19)
    at async handleContextRelevance (/Users/aleksandrmeskov/.npm/_npx/81bbc6515d992ace/node_modules/promptfoo/dist/src/assertions/contextRelevance.js:23:21)
    at async runAssertion (/Users/aleksandrmeskov/.npm/_npx/81bbc6515d992ace/node_modules/promptfoo/dist/src/assertions/index.js:353:24)
    at async /Users/aleksandrmeskov/.npm/_npx/81bbc6515d992ace/node_modules/promptfoo/dist/src/assertions/index.js:400:24

In documentation, it looks pretty simple, but look like it doesn't work correctly https://www.promptfoo.dev/docs/configuration/expected-outputs/model-graded/context-relevance/ Any suggestions, how I can handle that? https://cdn.discordapp.com/attachments/1432757147405651968/1432757147648786584/image.png?ex=69023693&is=6900e513&hm=e3561b5fac664cff41e9131fc0c4327ce0fa1634c74a9240f06dff1d91c6ffb1&

_conversation / previous messages for Simulated User and Assistant

Elias_M2M

10/29/2025, 9:38 AM

Hello, I would like to test a multi-turn conversation between an assistant and a simulated user. The prescribed conversation flow of the assistant is very long and for my current test cases I just need to test the end of the conversation. For these tests, the previous messages are very important, so the simulated user and the assistant need to know what "they" said before. I saw in the docs, that there is an option of adding a variable "messages" or "_conversation", but I don't know how this is behaving with the simulated user provider. Is it possible the define the previous messages for both the assistant and the simulated user, so they know where to continue the conversation? And how can I do this?

prompts generation only

b00l_

10/29/2025, 2:36 PM

hello, I have a redteam.yaml file with a bunch of plugins enabled, is it possible to just generate prompts, and save them in a file based on all plugins enabled? can I do it local only and even with openai key?

How to add dynamic prompt with multiple placeholders inside promptfooconfig?

curious_battle

11/04/2025, 4:50 AM

My prompt looks like {"role":"system", "content": < {company}, {company_description}, ... , {previous_context},} then again repeated at user level with a few placeholders How can I use this reliably inside promptfooconfig with variables separately from another file such that the prompt gets build up completely and then we can test against user_input, currently the prompt part allows prompt with placeholders but no support for passing placeholders variables values, and input csv only allows a single column input ??

Retrying tests

thomas_romas

11/04/2025, 1:04 PM

I am running a basic redteaming evaluation with some plugins enabled. Nothing crazy, I am just trying to see how it works. I pointed promptfoo at my Azure OpenAI model. The evaluation doesn't finish for multiple hours and is stuck at the following output

Copy code

...
Chunk 72 partial success: 24/8 test cases, retrying missing ones individually
...

The chunk number increases. I am not sure what it means. Is there anything I can do to omit the retrying or at least see partial results of the evaluation?

Any advice for really long-running models like GPT-5-pro?

CasetextJake

11/04/2025, 9:30 PM

I'd like to run some evals with GPT-5-pro, and I'd say usually 50% of them error out. I get a variety of errors: API call error: Error: Request failed after 4 retries: TypeError: fetch failed (Cause: Error: getaddrinfo ENOTFOUND api.openai.com) API call error: Error: Request failed after 4 retries: Error: Request timed out after 300000 ms API call error: Error: Error parsing response from https://api.openai.com/v1/responses: Unexpected token '<', " <h"... is not valid JSON. Received text: 502 Bad Gateway 502 Bad Gateway cloudflare API call error: Error: Error parsing response from https://api.openai.com/v1/responses: Unexpected token 'u', "upstream c"... is not valid JSON. Received text: upstream connect error or disconnect/reset before headers. reset reason: connection termination Presumably I can resolve one of these by increasing the amount of time per completion, but the other ones... Curious if there are tips for working with models like these. Thanks!

OSS version limits

Man

11/06/2025, 8:07 PM

What are the adversarial prompt generation limits in the open source version?

Mitigation

Jan!

11/09/2025, 10:32 AM

Is there a way to enable the mitigation option on the open source version? I'd be very happy to know how to fix the issues my application has haha.

Does anyone else has a Python provider problem

Monini

11/13/2025, 9:00 AM

I'm using promptfoo version: 0.119.6. In my yaml I have configured provider like that: providers: - id: 'file://retrieve_answer.py' I get an error:

Copy code

[logger.js:324] Python worker stderr: ERROR handling call: [Errno 2] No such file or directory: 'C'
Traceback (most recent call last):
  File "C:\Users\x\AppData\Roaming\npm\node_modules\promptfoo\dist\src\python\persistent_wrapper.py", line 191, in handle_call
    with open(request_file, "r", encoding="utf-8") as f:
         ~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'C'


[logger.js:324] Python worker stderr: ERROR: Failed to write error response: [Errno 22] Invalid argument: '\\Users\\x\\AppData\\Local\\Temp\\promptfoo-worker-req-1763023104223-7f0714e90c0d6.json:C:\\Users\\x\\AppData\\Local\\Temp\\promptfoo-worker-resp-1763023104223-8d908bcfbd3be.json'                                                                                                                                                              

[logger.js:324] Shutdown complete
[logger.js:324] Failed to read response file after 16 attempts (~18s). Expected: promptfoo-worker-resp-1763023104223-8d908bcfbd3be.json, Found in C:\Users\x\AppData\Local\Temp: promptfoo-worker-req-1762959388946-f0ac9b4f1dd7e.json, promptfoo-worker-req-1762959433064-8d405bf3e7d72.json, promptfoo-worker-req-1763023104223-7f0714e90c0d6.json

I didn't have this problem some time ago, so I think it started after the update to Promptfoo.

Multiple prompts with each mapped to separate set of images & test cases/assertions

nkhatwani

11/13/2025, 10:56 AM

Can someone please look into https://github.com/promptfoo/promptfoo/issues/6206 and reply accordingly.

Is it possible to set `response_format` per test, rather than per prompt?

m0ltz

11/16/2025, 5:19 PM

I have a single prompt that I want to use, but the JSON Schema for the response is different for each test. I see from the docs that it's possible to set the schema on the provider, or a prompt, but not for a test. Are there any known hacks of workarounds to make that work?

Can we integrate and fetch langfuse datasets directly from promptfoo config file like prompts?

curious_battle

11/17/2025, 4:45 AM

For example, The way We do langfuse://{prompt_name}

Streamable HTTP MCP Server Testing

Anupam Patil

11/19/2025, 12:44 PM

Hi Team, @User I am very new to promptfoo. I need to test couple of MCP Servers. I have one login end point and need to use the token from that API for further requests. I have search and fetch tools which would require that token as an authorization. How to achieve it using promptfoo?

Cost Metric - Bedrock

ellebarto

11/19/2025, 3:38 PM

Hi - is there any plans to have the cost metric work for other providers? AWS Bedrock?

Custom provider context for model graded asserts

dracesw

11/19/2025, 5:42 PM

tl;dr: providers don't seem to be sent test context when used for asserts Hi, I'm trying to use promptfoo to evaluate some agentic workflows. I have a custom python provider that does some environment setup. I need to be able to pass information from the provider completing the prompt to the provider when it is evaluating an llm-rubric assert that doesn't belong in the prompt response. The context seems to always be empty when the provider is used for asserts. Is this working as intended and if so, is there an intended way to pass this information to the assert provider?

How to Show Markdown Instead of JSON + How to Expose OpenAI Response IDs?

IdoRozin

11/19/2025, 7:50 PM

Hey all — two related questions: 1) Prompt display in Promptfoo When using messages: in promptfoo.yaml, the Promptfoo results page shows the full prompt as an ugly JSON array, like: [ { "role": "system", "content": ".... long markdown ...." }, { "role": "user", "content": "...." } ] Is there a way to make Promptfoo show the actual Markdown inside the content fields, instead of the raw JSON structure? Ideally I'd like to see the formatted prompt (headings, lists, etc.) the same way a user would see it — not the full message object. 2) Getting OpenAI Response IDs in Promptfoo Is there a way to extract the OpenAI response id from each run so that I can click/open that response inside the OpenAI API logs? I don’t see the response ID in the result JSON, even when using the OpenAI provider with logprobs or raw: true. Is there a config option or hook for surfacing the model’s id (e.g., resp.id like chatcmpl-abc123) in the Promptfoo results?

Bedrock Provider

ellebarto

11/20/2025, 3:01 PM

I am reaching out to check if the Bedrock provider response includes the input/output token count of all the models on bedrock.

Set reasoning effort for open router models

CYH

11/24/2025, 7:41 PM

Does the open router config support setting reasoning effort? something like this

Copy code

config:
  reasoning:
    effort: minimal