# community-support
r
Hey folks, I'm having a tricky issue with VPN authentication breaking my gradle builds, and I'd be curious to see if anyone else has come across this. Our artifact repository is accessible via our company's Cloudflare VPN, which we also have integrated with Okta.
😢 The Problem
Cloudflare does not automatically disconnect VPN sessions when the Okta/IDP session expires, leaving the developer machine in a half-in/half-out "Connected, but not Authenticated" state. Unfortunately, getting into this state isn't super obvious to developers. The menu bar icon gets a tiny yellow dot, and most internet connectivity still works just fine.
Here's where it gets weird. When we inevitably and inadvertently run gradle builds in this state, we get an error like this:
> Could not resolve com.puppycrawl.tools:checkstyle:9.3.
     Required by:
         project :<project-path>
      > Could not resolve com.puppycrawl.tools:checkstyle:9.3.
         > Could not parse POM https://<repository-host>/<repository-path>/com/puppycrawl/tools/checkstyle/9.3/checkstyle-9.3.pom
            > Already seen doctype.
What appears to be happening is that Gradle is receiving a raw HTML file instead of a POM file. It's coming from Cloudflare, before the request even reaches our repository, where checkstyle has been successfully proxied and stored. What's really problematic is that this appears to corrupt the local gradle cache, and even after re-auth-ing with CF, the only way to fix this is to either re-run the build with
--refresh-dependencies
or clear the gradle cache. For backend repos this is a minor inconvenience; for android repos it's a nonstarter. We also ship a settings-plugin that needs to be resolved from our repository, so this issue can bite us first thing in the build process, which means we can't even bake in checks for this in our plugins/build scripts. Lastly, while the cloudflare CLI has a bunch of great commands, it does not have a command that we can use to programmatically detect this state.
🤔 Possible Solutions
init script -- push an init script to everyone's
~/.gradle/init.d
directory that checks for required credentials, and validates the network connection and access via HTTP request. Functionally, this works, but after reading the configuration cache docs, I'm concerned it won't be invoked on every build, thus defeating the purpose.
Is there a gradle property/setting I need to tweak so we don't ingest/store these HTML files?
Is there something else I'm just not thinking of?
Open to try any ideas and hear any feedback short of "don't use the VPN", because that's out of my hands. Thanks for reading!
✔️ 1
m
I have no ideas for gradle. We had a similar problem with routers in a data center. The fix was to enable tcp keepalive on the client machines with a lowered start threshold of 2 minutes instead of the default 7.
e
sounds like Cloudflare is returning cacheable HTTP 200's for their interception pages? if so, that really seems like their problem
and any solution that tries to check the connection status at the start of the build is vulnerable to TOCTTOU, breaking the build again if it goes down in the middle
possible to configure https://developers.cloudflare.com/cloudflare-one/policies/gateway/block-page/#redirect-to-a-block-page to redirect to something that returns an HTTP 511 or something else that Gradle will definitely see as an error?
v
Technically speaking, gradle should reject storing the entry in its caches if the artifact does not align with the sha checksum. If a broken download corrupts Gradle's cache, then it sounds like a severe caching issue to me
e
checksums aren't mandatory and aren't checked by default
r
@ephemient that aligns with my further reading on the gradle docs: https://docs.gradle.org/current/userguide/dependency_verification.html It seems we can turn this on to prevent the local cache poisoning, but then every dependency bump requires an update to the metadata xml file, which could be suboptimal for the developer experience.
also, @ephemient thanks for flagging the "block page" docs. I'll take that to my sec team and see if that's something we can work with.
v
For each file that is the repository there must be a file containing the checksum of the file, typically md5 or sha1 https://cwiki.apache.org/confluence/plugins/servlet/mobile?contentId=9798541#content/view/9798541
e
v
I wonder what is the repository that does not provide the checksum
Maven2 layout says the checksum is mandatory
e
in OP's case, the checksum file will fail to be fetched correctly due to network conditions (same as the artifact itself)
💯 1
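To make that point concrete, here is a rough Kotlin sketch of a sidecar-checksum comparison. It is not how Gradle does it internally; `sha1Hex` and `matchesSha1` are made-up helper names, and the assumption is that the intercepted `.sha1` request gets the same HTML page as the artifact request.

```kotlin
import java.security.MessageDigest

// Hex-encoded SHA-1 of the given bytes.
fun sha1Hex(bytes: ByteArray): String =
    MessageDigest.getInstance("SHA-1").digest(bytes)
        .joinToString("") { "%02x".format(it) }

// Hypothetical helper: compares an artifact against the text of its
// sidecar checksum file. Some repositories write "<hash>  <filename>",
// so only the first whitespace-separated token is used.
fun matchesSha1(artifact: ByteArray, checksumFileText: String): Boolean {
    val expected = checksumFileText.trim().split(Regex("\\s+")).first().lowercase()
    // An HTML page served in place of the checksum file is not 40 hex chars,
    // so the comparison fails before any hashing happens.
    if (!Regex("[0-9a-f]{40}").matches(expected)) return false
    return expected == sha1Hex(artifact)
}
```

With a genuine POM and its real checksum this returns true. If both requests are intercepted, the "checksum" is a login page, so the check fails, which at least turns a silently stored HTML file into a visible error.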
r
Yeah, that's spot on. What's happening is we're getting a raw HTML doc instead of the POM.xml file that's being requested, and then we end up with an error including the message
Already seen DOCTYPE
v
push an init script to everyone's
~/.gradle/init.d
directory that checks for required credentials, and validates the network connection and access via HTTP request. Functionally, this works, but after reading the configuration cache docs, I'm concerned it won't be invoked on every build, thus defeating the purpose.
That is not an issue, because the dependency resolution result is part of the configuration cache entry. If a configuration cache entry is reused, there will be no dependency resolution done. If dependency resolution needs to be done, then a new configuration cache entry will be created.
Open to try any ideas and hear any feedback short of "don't use the VPN", because that's out of my hands.
Just a wild idea, but you asked for it. 😄 If CF really answers with a nonsense response but an HTTP 200 result code, you could have a proxy on the developer machine that is used by the Gradle builds, verifies that the connection works or that the response for a requested POM is not HTML, and returns a proper HTTP error response otherwise. But I actually assume that CF is properly answering with a non-200 response. But Gradle also remembers that failure-to-resolve for a while unless the retry timeout is expired or you use the manual
--refresh-dependencies
. And in that case, such a proxy would probably also not help any further.
If a broken download corrupts Gradle's cache, then it sounds like a severe caching issue to me
Afair the cache is not corrupt, it rightfully remembers that the resolution failed until the retry timeout expired or you tell it manually to retry.
checksums [...] aren't checked by default
I wonder whether this might be a bug or should be reported as feature request. It saves the effort of downloading the checksum file and calculating the checksum, but for the price of not detecting download errors or bitflips or ...
For each file that is the repository there must be a file containing the checksum of the file, typically md5 or sha1
[...]
Maven2 layout says the checksum is mandatory
While not every repository follows this "must", it is pretty useless anyway, as it just says "typically md5 or sha1", but it does not require any specific checksum, so even when adhering to the "must", no consumer could rely on the file being there, as it could be any algorithm. 😕
e
CF is probably returning 200's, e.g. https://blocked.teams.cloudflare.com/
😱 1
v
in OP’s case, the checksum file will fail to be fetched correctly due to network conditions (same as the artifact itself)
I’ve filed https://github.com/gradle/gradle/issues/34158 so the Gradle team could reconsider and start requiring at least one checksum for the retrieved artifacts. Frankly speaking, I can hardly imagine a case where the repository is missing checksums, and if users really have those types of repositories, they could opt out of the checksum validation.
👌 1
a
The only workaround I can think of is creating a Gradle plugin that starts a localhost server that proxies the Maven repositories and validates the responses. It won't solve the root cause, but it will at least make it visible and prevent invalid data being cached. If you're interested, here's an example of a localhost server in a Gradle plugin (although the purpose is different) https://github.com/JetBrains/intellij-platform-gradle-plugin/blob/v2.0.1/src/main/kotlin/org/jetbrains/intellij/platform/gradle/shim/Shim.kt
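For the "validates the responses" part, a minimal sketch of what such a proxy might check before handing a response back to Gradle. Everything here is an assumption for illustration: `ProxyReply`, `looksLikeHtml`, and `validate` are made-up names, the extension list is a guess, and answering with 502 is just one status Gradle would not mistake for a valid file. The actual server and upstream fetch are left out.

```kotlin
// Outcome the local proxy would pass back to Gradle (hypothetical type).
class ProxyReply(val status: Int, val body: ByteArray)

// Heuristic: does the payload start like an HTML document?
fun looksLikeHtml(body: ByteArray): Boolean {
    val head = String(body.copyOf(minOf(body.size, 512)), Charsets.UTF_8)
        .trimStart('\uFEFF', ' ', '\t', '\r', '\n')
        .lowercase()
    return head.startsWith("<!doctype html") || head.startsWith("<html")
}

// Assumed policy: a repository file should never be HTML, so a "successful"
// HTML answer for one is replaced with an error status.
fun validate(path: String, upstreamStatus: Int, contentType: String?, body: ByteArray): ProxyReply {
    val isRepoFile = listOf(".pom", ".jar", ".module", ".sha1", ".md5").any { path.endsWith(it) }
    val isHtml = contentType?.lowercase()?.startsWith("text/html") == true || looksLikeHtml(body)
    return if (isRepoFile && upstreamStatus in 200..299 && isHtml) {
        ProxyReply(502, "Upstream returned HTML for $path".toByteArray())
    } else {
        ProxyReply(upstreamStatus, body) // pass everything else through untouched
    }
}
```

Checking both the content type and the first bytes is deliberate: an interception page may arrive with a misleading or missing `Content-Type` header.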
e
as a Gradle plugin, it'll be tricky to have it start before any resolution from remote repositories starts. but maybe doable if it's only project repositories affected and not plugin repositories
v
Maybe something like
configurations.matching { it.isCanBeResolved }.configureEach {
    withDependencies {
        // start the proxy
    }
}
and same for
buildscript
configurations. Starting the proxy can be done in an auto-closeable shared build service, so that it is also shut down at the end of the build.
a
good point, maybe creating an init plugin and packing it into a custom Gradle dist would make sure it's launched first?
r
Hey everyone, thanks for chiming in. I was able to catch someone in the "stale" state before they resolved it, so I have much more concrete info. The stale response from CF is a 302 redirect, with a text/html content type and location that points to our cloudflare domain. So, I can deterministically catch it now and fail the build before dependency downloads are attempted.
@Vampire -- you're spot on about the configuration cache <> dependency downloading problem. If the configuration cache has a hit, no dependencies will be downloaded anyway. Also, if the build fails during the init script, we don't write an entry to cache. If it's successful, it seems we do write the cache entry, and then re-builds using the same inputs skip the init script and dependency downloading, so we're good there.
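For anyone wanting to do something similar, a rough Kotlin sketch of the kind of check described above (not the actual init script). `ProbeResult`, `isStaleVpnSession`, and `probe` are made-up names, and `cloudflareDomain` is a placeholder for whatever domain the redirect points at in your setup.

```kotlin
import java.net.HttpURLConnection
import java.net.URI

// What the probe saw, without following redirects (hypothetical type).
class ProbeResult(val status: Int, val contentType: String?, val location: String?)

// Matches the symptom described above: a 302 with a text/html content type
// whose Location header points at the Cloudflare domain.
fun isStaleVpnSession(result: ProbeResult, cloudflareDomain: String): Boolean {
    if (result.status != 302) return false
    val isHtml = result.contentType?.lowercase()?.startsWith("text/html") == true
    // Relative or malformed Location values yield no host and do not match.
    val host = result.location
        ?.let { runCatching { URI(it).host }.getOrNull() }
        ?.lowercase() ?: return false
    val domain = cloudflareDomain.lowercase()
    return isHtml && (host == domain || host.endsWith(".$domain"))
}

// Issues a single GET against a repository URL with redirects disabled,
// so the 302 itself is observed instead of the login page behind it.
fun probe(url: String): ProbeResult {
    val conn = URI(url).toURL().openConnection() as HttpURLConnection
    try {
        conn.instanceFollowRedirects = false
        conn.connectTimeout = 5_000
        conn.readTimeout = 5_000
        return ProbeResult(conn.responseCode, conn.contentType, conn.getHeaderField("Location"))
    } finally {
        conn.disconnect()
    }
}
```

An init script could call `probe(...)` on a known repository URL and throw when `isStaleVpnSession(...)` is true. As noted earlier in the thread, this only checks the state at the start of the build, so a session that expires mid-build would still slip through.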
e
Gradle will follow 302's; does the destination of the redirect return 200?
r
I'll have to try to catch someone in the state again to be sure. I suspect no because it's an okta web login page, and gradle can't execute the login, so it either gets stuck or gets a non-200
e
if it's a login page it's probably a 200. that's why Gradle is caching that, once the user gets into the bad state
🙌 1
r
Thank you for explaining that. It seems we're going to handle the problem way upstream at the cloudflare level now, so my init script will just live on as an emergency fallback, not a live check so much anymore.
👌 1