Puppet Community #puppet-enterprise

bastelfreak

08/16/2022, 8:40 PM

while I really hate the appliances, I think they are common for enterprise customers? so if it's a general problem I assume more people would open support tickets?

nlew

08/16/2022, 8:40 PM

This sort of issue comes up relatively frequently.

bastelfreak

08/16/2022, 8:40 PM

oho!

bastelfreak

08/16/2022, 8:41 PM

tell me more 😄

nlew

08/16/2022, 8:41 PM

Well just in terms of 👋 “some network device seems to be causing trouble” 👋

bastelfreak

08/16/2022, 8:41 PM

meh 😞

bastelfreak

08/16/2022, 8:41 PM

yeah

bastelfreak

08/16/2022, 8:42 PM

okay, will do some tcpdumping tomorrow I guess

nlew

08/16/2022, 8:42 PM

We’ve been discussing how we can make it easier to manage/diagnose, and whether there are changes we could make to the software to make it more resilient

nlew

08/16/2022, 8:42 PM

But unidirectional packet loss is the worst

➕ 1

bastelfreak

08/16/2022, 8:42 PM

do it like consul: support providing a list of brokers to the pxp-agent. let them pick a random broker and manage failover

bastelfreak

08/16/2022, 8:43 PM

so we can eliminate loadbalancers (unless required for network separation)

nlew

08/16/2022, 8:45 PM

It does support multiple brokers now, but I think it tries them in order, which is not so helpful for load balancing.

bastelfreak

08/16/2022, 8:45 PM

yep. they are tried in order. I could randomize them. but support told me that's not recommended and loadbalancers are preferred
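
The "random broker plus failover" idea from this exchange can be sketched as below. This is only an illustration of the client-side selection logic, not how pxp-agent works: the function name `connect_with_failover`, the `try_connect` callback, and the broker hostnames are all hypothetical.

```python
import random
from typing import Callable, Optional, Sequence


def connect_with_failover(
    brokers: Sequence[str],
    try_connect: Callable[[str], bool],
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Try brokers in a shuffled order; return the first that accepts, else None."""
    rng = rng or random.Random()
    candidates = list(brokers)   # copy, so the caller's configured order is untouched
    rng.shuffle(candidates)      # a random order spreads agents across brokers
    for broker in candidates:
        if try_connect(broker):  # hypothetical connection attempt
            return broker
    return None                  # every broker refused; the caller decides when to retry


if __name__ == "__main__":
    # Hypothetical broker hostnames, for illustration only.
    brokers = ["broker-a.example.com", "broker-b.example.com", "broker-c.example.com"]
    unreachable = {"broker-a.example.com"}
    chosen = connect_with_failover(brokers, lambda b: b not in unreachable)
    print("connected to", chosen)
```

Shuffling per agent gives a rough load spread without a load balancer, while the in-order loop over the shuffled list still provides failover when a broker is unreachable.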

nlew

08/16/2022, 8:46 PM

Even without a load balancer in the middle, there’s still likely to be other picky network devices along the route. 😕

bastelfreak

08/16/2022, 8:46 PM

yes. but it would eliminate one potential error source

nlew

08/16/2022, 8:47 PM

Yeah for sure.

npwalker

08/16/2022, 8:47 PM

I thought you connected some of them directly without the LB and there were fewer or no errors?

bastelfreak

08/16/2022, 8:48 PM

ah well. yes. I've two locations. one is small and has its own compiler. the pxp-agents/puppet-agents in that location connect directly to the local compiler

bastelfreak

08/16/2022, 8:48 PM

the other location has the primary + 4 compilers. puppet-agent/pxp-agent there connect to the f5 and then to the compilers

csharpsteen

08/16/2022, 9:06 PM

In general, "random % of nodes don't respond to Orchestrator" is caused by network devices dropping connections that are supposed to be persistent. Same was true of MCollective.

bastelfreak

08/16/2022, 9:11 PM

I will see if I can do some debugging tomorrow

bastelfreak

08/17/2022, 1:18 PM

coming back to the PXP debugging from last night: since 2020 the 15min idle timeout in the pcp-broker isn't 15min and it's also not hardcoded anymore. It got changed to 6min and is configurable: https://github.com/puppetlabs/pcp-broker/pull/227

masterjc

08/18/2022, 3:58 AM

the status of the service itself is "Active: active (exited)" and not "Active: active (running)". Is there a way to ensure it starts if it failed while keeping the oneshot type? (per my understanding, oneshot causes it to be active but exited)

bastelfreak

08/19/2022, 2:19 PM

again, it's me! debugging pxp agent. A question about the following log:

2022-08-18 14:27:41.658329 DEBUG puppetlabs.cpp_pcp_client.connector:335 - Sending heartbeat ping
2022-08-18 14:27:46.658799 WARN  puppetlabs.cpp_pcp_client.connection:670 - WebSocket onPongTimeout event
2022-08-18 14:27:48.247769 DEBUG puppetlabs.cpp_pcp_client.connection:655 - WebSocket onPong event
2022-08-18 14:29:41.658706 DEBUG puppetlabs.cpp_pcp_client.connector:335 - Sending heartbeat ping

My understanding:
• pxp-agent sends a keepalive every 120s
• pcp-broker responds
• pxp-agent expects the response within 5s
• a "WebSocket onPong event" between onPongTimeout and the next heartbeat means that the agent receives the response, but too late? (I have a lot of random onPongTimeout log entries)
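
One way to check that reading across a whole log file is to pair each heartbeat ping with the next pong and measure the delay. The following is a rough sketch that assumes only the line format quoted above; the function names `parse_ts` and `pong_latencies` are made up for this example and are not part of any Puppet tooling.

```python
from datetime import datetime
from typing import Iterable, List, Optional, Tuple

# The four log lines quoted in the message above.
LOG = """\
2022-08-18 14:27:41.658329 DEBUG puppetlabs.cpp_pcp_client.connector:335 - Sending heartbeat ping
2022-08-18 14:27:46.658799 WARN  puppetlabs.cpp_pcp_client.connection:670 - WebSocket onPongTimeout event
2022-08-18 14:27:48.247769 DEBUG puppetlabs.cpp_pcp_client.connection:655 - WebSocket onPong event
2022-08-18 14:29:41.658706 DEBUG puppetlabs.cpp_pcp_client.connector:335 - Sending heartbeat ping
"""

Row = Tuple[datetime, Optional[float], bool]


def parse_ts(line: str) -> datetime:
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.ffffff' timestamp (26 characters)."""
    return datetime.strptime(line[:26], "%Y-%m-%d %H:%M:%S.%f")


def pong_latencies(lines: Iterable[str]) -> List[Row]:
    """Return one (ping time, seconds until first pong or None, timed_out) row per ping."""
    rows: List[Row] = []
    ping: Optional[datetime] = None
    latency: Optional[float] = None
    timed_out = False
    for line in lines:
        if "Sending heartbeat ping" in line:
            if ping is not None:
                rows.append((ping, latency, timed_out))  # close the previous ping
            ping, latency, timed_out = parse_ts(line), None, False
        elif ping is None:
            continue  # ignore events seen before the first ping
        elif "onPongTimeout event" in line:
            timed_out = True
        elif "onPong event" in line and latency is None:
            latency = (parse_ts(line) - ping).total_seconds()
    if ping is not None:
        rows.append((ping, latency, timed_out))
    return rows


if __name__ == "__main__":
    for sent, latency, timed_out in pong_latencies(LOG.splitlines()):
        print(sent, latency, "TIMEOUT" if timed_out else "ok")
```

For the four lines quoted above, the first ping is flagged as timed out with a pong arriving about 6.6 seconds after it, which matches the "response arrives, but too late" interpretation. The final ping has no pong yet because the excerpt ends there.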

bastelfreak

08/19/2022, 2:19 PM

as a workaround I wanted to increase the 5s timeout to 10s. but that's hardcoded 😞

nlew

08/19/2022, 4:59 PM

Yep that’s right. The timeout is hardcoded but even if you increased it, I saw in the logs at least one pong that came in 59 seconds after the ping.