# pact-rust
a
Hi, we are facing an issue with intermittent failures when running multiple contract tests. For context, we are continuing with the WebSockets plugin (the consumer side is all working now 🎉). We have some integration tests that cover 4 scenarios: an unexpected request, no request, a request that partially matches, and a matching request. All of the cases work most of the time; however, occasionally we get a failure or two. The tests that fail are not consistent, and the error we get is always "transport error", which I believe is coming from Tonic. This error is preceded by a few errors that look like this:
```
connection error: hyper::Error(Io, Custom { kind: BrokenPipe, error: "connection closed because of a broken pipe" })
```
If I run the tests with `--test-threads=1` it never fails. I have a feeling that it's something to do with the plugin process already running and Pact using it for all the concurrent tests, and it then being shut down before one of the tests has finished, but I am not sure if that is the case and I am having a tough time debugging it. Any help would be appreciated. Not sure if this is better in #C047TCR7B6W or #CA2S7E6KC, I guess that depends on what the problem turns out to be 🤔
When I am on a Teams call my CPU is slowed and there seems to be no issue. But when my laptop runs unimpeded I get the intermittent errors.
u
> When I am on a Teams call my CPU is slowed and there seems to be no issue
Huh! That's the first time I have ever heard of Teams being useful 🤣
💯 1
😆 2
This smells like a classic race condition. Those are going to be hard to debug. For a start, can you provide trace-level logs? (They will be very verbose, so maybe attach them as a separate file.)
Can you provide the project with the tests, so I can try to replicate the issue?
a
Sure, sorry I should have pre-empted that question. I just need to work out a way of getting the repo to you.
Logs are here
The Teams thing might just be a coincidence. I would be surprised if it is actually useful, even in this case 😉
The repo is here: https://github.com/adamdama/pact-plugin-websockets I have added your GH users as collaborators to it
Feel free to fork it if needed but please do not make it public. We have some compliance conversations to have about that
u
Yes, thanks, received the invite and we promise not to do anything until you say so :-D
I haven't been able to replicate this (no surprise there, as it seems to be load dependent). I tried on both Linux and macOS, but the tests passed with no issues. Reviewing the logs, it looks like everything is working correctly until the first test finishes; then the underlying HTTP/2 client sets its state to `Closed` and the remaining tests can't interact with the plugin process. This seems to be all on the client side, as if there is a single HTTP/2 client being shared.
The driver currently creates a new Channel per request. We could try a shared channel instead, so that hopefully whatever is being dropped at the end of the first test would only be dropped after all the tests have finished, but I am unable to verify whether that helps.
This seems to be the race condition:
```
2023-03-30T09:02:47.517919Z TRACE Connection{peer=Client}:poll:FramedWrite::buffer{frame=GoAway { error_code: NO_ERROR, last_stream_id: StreamId(0) }}: h2::frame::go_away: encoding GO_AWAY; code=NO_ERROR
2023-03-30T09:02:47.517960Z TRACE Connection{peer=Client}:poll:FramedWrite::buffer{frame=GoAway { error_code: NO_ERROR, last_stream_id: StreamId(0) }}: h2::codec::framed_write: encoded go_away rem=17
2023-03-30T09:02:47.517970Z TRACE tonic::transport::service::reconnect: poll_ready; idle
2023-03-30T09:02:47.518011Z TRACE tonic::transport::service::reconnect: poll_ready; connecting
2023-03-30T09:02:47.518008Z DEBUG Connection{peer=Client}:poll: h2::proto::connection: Connection::poll; connection error error=GoAway(b"", NO_ERROR, Library)
2023-03-30T09:02:47.518035Z TRACE hyper::client::connect::http: Http::connect; scheme=Some("http"), host=Some("127.0.0.1"), port=Some(Port(51856))
2023-03-30T09:02:47.518048Z TRACE Connection{peer=Client}:poll: h2::proto::connection:     -> already going away
2023-03-30T09:02:47.518096Z DEBUG hyper::client::connect::http: connecting to 127.0.0.1:51856
2023-03-30T09:02:47.518090Z TRACE Connection{peer=Client}:poll: h2::proto::connection: connection.state=Closing(NO_ERROR, Library)
2023-03-30T09:02:47.518126Z TRACE Connection{peer=Client}:poll: h2::proto::connection: connection closing after flush
```
It's trying to open a connection (in the other test) while the HTTP/2 client is in a going-away state (`h2::proto::connection: -> already going away`).
If the machine is a bit slower, it makes the connection before that state is entered; on faster machines, it makes the connection after that state is cleaned up.
One thing to try is a `cargo update` to make sure you have all the latest crates.
Another option is to add a retry with backoff; that way, the odds are that the retry happens after the first test has fully finished.
Yet another option is to look at the HTTP/2 keep-alive settings.
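For the keep-alive option, something along these lines on the plugin's server side might work (a sketch only, assuming a tonic-era `Server` builder; check the exact method names against the tonic version in use). HTTP/2 PING frames stop the shared connection from being treated as idle and torn down between tests:

```rust
// Sketch: enable HTTP/2 keep-alive pings on the plugin's gRPC server.
// The intervals are illustrative, not recommendations.
use std::time::Duration;
use tonic::transport::Server;

fn build_server() -> Server {
    Server::builder()
        // Send a PING every 30s so the connection never looks idle.
        .http2_keepalive_interval(Some(Duration::from_secs(30)))
        // Drop the connection if a PING goes unanswered for 10s.
        .http2_keepalive_timeout(Some(Duration::from_secs(10)))
}
```

On the client side, tonic's `Endpoint` has corresponding knobs (e.g. `keep_alive_while_idle`), but those would be a plugin-driver change rather than a WS-plugin one.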
a
OK, so that sounds like a few options to me. Let me try to summarise, and let me know if I am getting any of it wrong:
• Shared channel for requests - requires changes to the plugin driver
• Cargo update - can be done on the WS plugin side
• Retry with backoff - requires changes to the plugin driver
• HTTP/2 keep-alive - can be done on the WS plugin side
I'll take a look at the keep-alive and cargo update and report back, as those are in my control
The lack of reproducibility of this issue is very frustrating. I go from periods of 20-30 runs without issue to a swathe of consistent failures
Thanks for looking into this with me 🙂
The cargo update didn't seem to have any effect. I have enabled keep-alive on the plugin server now. Not sure if it has remedied the situation or not, but I am not getting any more test failures. Not sure if I trust it though, as I might just be going through a "success" period 😵‍💫
I asked the whole team to test it and it worked for everyone who had not worked on the plugin. One team member had issues and after a reboot of his Mac the issue could no longer be reproduced.
u
Looks like I'm getting the same issue with our FFI-based tests, but only with the Protobuf plugin. The CSV plugin works fine, so it must be something on the plugin side that is causing this.
Disregard me
@Adam Cox I have managed to get a similar issue with the gRPC examples when I introduced the shared channel change. I was seeing the second test failing with a transport error (but sometimes all passed). Turns out this was due to each test getting a separate Tokio runtime: the race condition occurred when the second test tried to access the plugin just as the first test had finished, because the connection was associated with the first test's runtime, which was shutting down. If the request is made before the first test finishes, it is all good because that runtime can still execute the tasks; and if it is made after the first test's runtime has completed its shutdown, it also seems to be ok because it uses a connection associated with its own runtime.
a
OK, I think that makes sense to me. Still learning the whole Tokio / async Rust thing. So is the solution then to use `Handle::current().spawn()` for each test, or something like that? Is there a way this could impact a user running multiple contract tests in their unit test suite? I would think so. So is it possible for us to fix this on the plugin/driver side to prevent users from having to write their tests a certain way?
u
No, nothing like that. It's more about globally configured objects being shared between the tests. Each test will have to create all the things it needs.