# pact-rust
a
Hi, we are facing an issue with intermittent failures when running multiple contract tests. For context, we are continuing with the WebSockets plugin (the consumer side is all working now 🎉). We have some integration tests that cover 4 scenarios: an unexpected request, no request, a request that partially matches, and a matching request. All of the cases work most of the time; however, occasionally we get a failure or two. The tests that fail are not consistent, and the error we get is always "transport error", which I believe is coming from Tonic. This error is preceded by a few errors that look like this:
```
connection error: hyper::Error(Io, Custom { kind: BrokenPipe, error: "connection closed because of a broken pipe" })
```
If I run the tests with `--test-threads=1` it never fails. I have a feeling that it's something to do with the plugin process already running and Pact using it for all the concurrent tests, and it then being shut down before one of the tests has finished, but I am not sure if that is the case and I am having a tough time debugging it. Any help would be appreciated. Not sure if this is better in #C047TCR7B6W or #CA2S7E6KC, I guess that depends on what the problem turns out to be 🤔
When I am on a Teams call my CPU is slowed and there seems to be no issue. But when my laptop runs unimpeded I get the intermittent errors.
u
> When I am on a Teams call my CPU is slowed and there seems to be no issue
Huh! That's the first time I have ever heard of Teams being useful 🤣
💯 1
😆 2
This smells like a classic race condition. Those are going to be hard to debug. For a start, can you provide trace-level logs? (They will be very verbose, so maybe attach them as a separate file.)
Can you provide the project with the tests, so I can try to replicate the issue?
a
Sure, sorry I should have pre-empted that question. I just need to work out a way of getting the repo to you.
Logs are here
The Teams thing might just be a coincidence. I would be surprised if it is actually useful, even in this case 😉
The repo is here: https://github.com/adamdama/pact-plugin-websockets I have added your GH users as collaborators to it
Feel free to fork it if needed but please do not make it public. We have some compliance conversations to have about that
u
Yes, thanks, received the invite and we promise not to do anything until you say so :-D
I haven't been able to replicate this (no surprise there, as it seems to be load dependent). I tried on both Linux and macOS, but the tests passed with no issues. Reviewing the logs, it looks like everything is working correctly until the first test finishes; then the underlying HTTP/2 client sets its state to `Closed` and the remaining tests can't interact with the plugin process. This seems to be all on the client side, as if there is a single HTTP/2 client being shared.
The driver currently creates a new Channel per request. We could try a shared channel instead, so that hopefully whatever is being dropped at the end of the first test would only be dropped after all the tests have finished, but I am unable to verify whether that helps.
This seems to be the race condition:
```
2023-03-30T09:02:47.517919Z TRACE Connection{peer=Client}:poll:FramedWrite::buffer{frame=GoAway { error_code: NO_ERROR, last_stream_id: StreamId(0) }}: h2::frame::go_away: encoding GO_AWAY; code=NO_ERROR
2023-03-30T09:02:47.517960Z TRACE Connection{peer=Client}:poll:FramedWrite::buffer{frame=GoAway { error_code: NO_ERROR, last_stream_id: StreamId(0) }}: h2::codec::framed_write: encoded go_away rem=17
2023-03-30T09:02:47.517970Z TRACE tonic::transport::service::reconnect: poll_ready; idle
2023-03-30T09:02:47.518011Z TRACE tonic::transport::service::reconnect: poll_ready; connecting
2023-03-30T09:02:47.518008Z DEBUG Connection{peer=Client}:poll: h2::proto::connection: Connection::poll; connection error error=GoAway(b"", NO_ERROR, Library)
2023-03-30T09:02:47.518035Z TRACE hyper::client::connect::http: Http::connect; scheme=Some("http"), host=Some("127.0.0.1"), port=Some(Port(51856))
2023-03-30T09:02:47.518048Z TRACE Connection{peer=Client}:poll: h2::proto::connection:     -> already going away
2023-03-30T09:02:47.518096Z DEBUG hyper::client::connect::http: connecting to 127.0.0.1:51856
2023-03-30T09:02:47.518090Z TRACE Connection{peer=Client}:poll: h2::proto::connection: connection.state=Closing(NO_ERROR, Library)
2023-03-30T09:02:47.518126Z TRACE Connection{peer=Client}:poll: h2::proto::connection: connection closing after flush
```
It's trying to open a connection (in the other test) while the HTTP/2 client is in a going-away state (`h2::proto::connection: -> already going away`).
If the machine is a bit slower, it makes the connection before that state is entered; on faster machines, it makes the connection after that state is cleaned up.
One thing to try is a `cargo update` to make sure you have all the latest crates.
Another option is to add a retry with backoff; that way, the odds are that the retry happens after the first test has fully finished.
Yet another option is to look at the HTTP/2 keep-alive settings.
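For the keep-alive option, something along these lines on the plugin's server side might work (a sketch only, assuming a tonic-era `Server` builder; check the exact method names against the tonic version in use). HTTP/2 PING frames stop the shared connection from being treated as idle and torn down between tests:

```rust
// Sketch: enable HTTP/2 keep-alive pings on the plugin's gRPC server.
// The intervals are illustrative, not recommendations.
use std::time::Duration;
use tonic::transport::Server;

fn build_server() -> Server {
    Server::builder()
        // Send a PING every 30s so the connection never looks idle.
        .http2_keepalive_interval(Some(Duration::from_secs(30)))
        // Drop the connection if a PING goes unanswered for 10s.
        .http2_keepalive_timeout(Some(Duration::from_secs(10)))
}
```

On the client side, tonic's `Endpoint` has corresponding knobs (e.g. `keep_alive_while_idle`), but those would be a plugin-driver change rather than a WS-plugin one.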
a
OK, so that sounds like a few options to me. Let me try to summarise, and let me know if I am getting any of it wrong:
• Shared channel for requests - requires changes to the plugin driver
• Cargo update - can be done on the WS plugin side
• Retry with backoff - requires changes to the plugin driver
• HTTP/2 keep-alive - can be done on the WS plugin side
I'll take a look at the keep-alive and cargo update and report back, as those are in my control
The lack of reproducibility of this issue is very frustrating. I go from periods of 20-30 runs without issue to a swathe of consistent failures
Thanks for looking into this with me 🙂
The cargo update didn't seem to have any effect. I have enabled keep-alive on the plugin server now. Not sure if it has remedied the situation or not, but I am not getting any more test failures. Not sure if I trust it though, as I might just be going through a "success" period 😵‍💫
I asked the whole team to test it and it worked for everyone who had not worked on the plugin. One team member had issues and after a reboot of his Mac the issue could no longer be reproduced.
u
Looks like I'm getting the same issue with our FFI-based tests, but only with the Protobuf plugin. The CSV plugin works fine, so it must be something on the plugin side that is causing this.
Disregard me
@Adam Cox I have managed to get a similar issue with the gRPC examples when I introduced the shared channel change. I was seeing the second test failing with a transport error (but sometimes all passed). Turns out this was due to each test getting a separate Tokio runtime: the race condition occurred when the second test tried to access the plugin just as the first test had finished, because the connection was associated with the first test's runtime, which was shutting down. If the request is made before the first test finishes, it is all good because that runtime can still execute the tasks; and if it is made after the first test's runtime has completed its shutdown, it also seems to be ok because it uses a connection associated with its own runtime.
a
OK, I think that makes sense to me. Still learning the whole Tokio / async Rust thing. So is the solution then to use `Handle::current().spawn()` for each test, or something like that? Is there a way this could impact a user running multiple contract tests in their unit test suite? I would think so. So is it possible for us to fix this on the plugin/driver side to prevent users from having to write their tests a certain way?
u
No, nothing like that. It's more about globally configured objects being shared between the tests. Each test will have to create all the things it needs.