# puppet
  • b

    bastelfreak

    10/08/2024, 8:02 PM
    is there a chance that you have a thundering herd problem?
  • e

    Elliott

    10/08/2024, 8:02 PM
    like a backoff induced flood?
  • e

    Elliott

    10/08/2024, 8:02 PM
    no, i'm directly setting --server based on time of day, so it's an even distribution
  • b

    bastelfreak

    10/08/2024, 8:03 PM
    there's a chance that at a specific time frame more agents hit the server than usual
  • b

    bastelfreak

    10/08/2024, 8:03 PM
    okay
  • b

    bastelfreak

    10/08/2024, 8:03 PM
    do you run https://github.com/puppetlabs/puppet_operational_dashboards to get some metrics?
  • e

    Elliott

    10/08/2024, 8:03 PM
    yep
  • b

    bastelfreak

    10/08/2024, 8:04 PM
    what do requested/available jruby instances look like?
  • e

    Elliott

    10/08/2024, 8:04 PM
    here's a recent one
  • e

    Elliott

    10/08/2024, 8:05 PM
    it seems catalog compilation is really laggy at first
  • b

    bastelfreak

    10/08/2024, 8:05 PM
    that really looks like a thundering herd problem 😄
  • e

    Elliott

    10/08/2024, 8:07 PM
    yeah, the thing is, if this happens without a server restart it's fine
  • b

    bastelfreak

    10/08/2024, 8:07 PM
    you should distribute your agent requests more
  • e

    Elliott

    10/08/2024, 8:07 PM
    this is what it looks like if the server didn't do a restart
  • b

    bastelfreak

    10/08/2024, 8:08 PM
    are all your agents coming in during this 20 min window?
  • b

    bastelfreak

    10/08/2024, 8:12 PM
    also you said 'moving load back'. is that a load balanced setup? when do you move the agents back? I guess too early
  • e

    Elliott

    10/08/2024, 8:18 PM
    i have 2 servers that i migrate between every 30 minutes, at :15 and :45. the primary server runs a script that can reload the CRL/restart puppetserver at :00 and the secondary does it at :30. this way there is a 15 minute window before and after the potential server restart to give agents enough time to finish their runs.
  • e

    Elliott

    10/08/2024, 8:18 PM
    agents are hard coded to each server based on this time frame using a script
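
The flip schedule described above can be sketched roughly as follows. This is a hypothetical illustration, not Elliott's actual script: the hostnames and the function name `server_for_minute` are made up, and assigning the :15 to :45 window to the primary is an inference from the restart times (primary at :00, secondary at :30), chosen so that each possible restart sits in the middle of the other server's window.

```python
def server_for_minute(
    minute: int,
    primary: str = "primary.example.com",      # hypothetical hostname
    secondary: str = "secondary.example.com",  # hypothetical hostname
) -> str:
    """Pick the server an agent should pass to --server for a given minute.

    Assumption: the primary may restart at :00, so it only takes traffic
    from :15 up to (not including) :45. The secondary may restart at :30,
    so it takes traffic from :45 through :14 of the next hour.
    """
    if not 0 <= minute <= 59:
        raise ValueError(f"minute must be in 0..59, got {minute}")
    return primary if 15 <= minute < 45 else secondary
```

With this layout every agent run that starts inside a window has at least 15 minutes before the server it is talking to can restart, which matches the buffer described in the message above.
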
  • e

    Elliott

    10/08/2024, 8:18 PM
    agents run hourly according to fqdn_mod_by distribution
  • e

    Elliott

    10/08/2024, 8:19 PM
    and within each minute of that hour they're distributed by the second according to fqdn_mod_by
  • e

    Elliott

    10/08/2024, 8:19 PM
    it's a very good distribution (i checked at one point)
  • e

    Elliott

    10/08/2024, 8:19 PM
    so all agents check in within 1 hour
  • e

    Elliott

    10/08/2024, 8:19 PM
    14000 agents
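
One way to sanity-check a hostname-based spread like the one described here is to hash each FQDN into a second of the hour and count agents per minute. The sketch below is only an assumption-laden stand-in: `fqdn_mod_by` is mentioned in the thread by name only, so this uses a plain SHA-256 hash instead of reproducing it, and the hostnames are synthetic.

```python
import hashlib
from collections import Counter


def run_slot(fqdn: str, period_s: int = 3600) -> int:
    """Map an FQDN to a stable second within the run period (hypothetical scheme)."""
    digest = hashlib.sha256(fqdn.encode("utf-8")).hexdigest()
    return int(digest, 16) % period_s


def minute_histogram(fqdns) -> Counter:
    """Count how many agents would start their run in each minute of the hour."""
    return Counter(run_slot(fqdn) // 60 for fqdn in fqdns)


# Synthetic fleet of 14000 hosts, matching the agent count from the thread.
hosts = [f"node{i:05d}.example.com" for i in range(14000)]
hist = minute_histogram(hosts)
busiest = max(hist.values())
quietest = min(hist.get(m, 0) for m in range(60))
```

If `busiest` and `quietest` are close to each other, the schedule itself is flat, and a spike on the server side would have to come from something other than agent timing.
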
  • b

    bastelfreak

    10/08/2024, 8:19 PM
    according to your graph it's quite bad 😄
  • e

    Elliott

    10/08/2024, 8:19 PM
    200 of them are on a 10 minute run interval though (canary env)
  • b

    bastelfreak

    10/08/2024, 8:19 PM
    lol
  • b

    bastelfreak

    10/08/2024, 8:20 PM
    you do over 28k catalog compilations with two servers, each with 30 jruby instances?
  • e

    Elliott

    10/08/2024, 8:20 PM
    that bad graph is what it looks like when the puppetserver container has restarted... the 2nd graph is the same flip but when cache is intact
  • e

    Elliott

    10/08/2024, 8:20 PM
    14k compilations per hour, 30 jrubies yes
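
The numbers in this exchange allow a rough back-of-the-envelope check. The sketch below assumes one catalog compilation per agent run, that the 200 canary agents are part of the 14000, and that only one server with 30 JRuby instances takes traffic at a time; none of these are confirmed beyond what the messages say, and the function name is made up.

```python
def compile_budget(
    total_agents: int = 14000,
    canary_agents: int = 200,
    canary_interval_min: int = 10,
    jrubies: int = 30,
):
    """Estimate hourly compilations and the mean compile time the pool can sustain."""
    hourly_agents = total_agents - canary_agents
    # Hourly agents compile once per hour, canaries once per interval.
    runs_per_hour = hourly_agents + canary_agents * (60 // canary_interval_min)
    runs_per_second = runs_per_hour / 3600
    # With every JRuby busy all the time, this is the longest average
    # compile time that still keeps up with the arrival rate.
    max_mean_compile_s = jrubies * 3600 / runs_per_hour
    return runs_per_hour, runs_per_second, max_mean_compile_s


runs_per_hour, runs_per_second, max_mean_compile_s = compile_budget()
```

Under these assumptions the canary agents push the total somewhat above the 14k figure, and the resulting per-compile budget is an upper bound at full utilization, so real headroom would be smaller.
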
  • b

    bastelfreak

    10/08/2024, 8:24 PM
    so on the second graph it looks like your agents do the first requests at ~8:15 and finish at 8:45, but nothing before/after