# caching
j
I am exploring solutions for the following scenario and am looking for feedback/thoughts/ideas. I need to download and extract tools from a repository (zip or 7z). Sometimes these are huge (1 GB+), so performance and not wasting disk space (by having the same tool extracted multiple times in different locations) are quite important. Currently, the repository may be a folder on a shared drive. This is because we are transitioning a large legacy setup to a Gradle-based solution. Therefore, we also need to support a few ugly things for the time being.
Our current solution is based on standard dependency management: the folder on the drive is modeled as an Ivy repository with a custom layout. The download-only-once behavior and the caching of the zip files work well for us. What gives me trouble is the extraction step. Right now we use an Artifact Transform, but it is not working out for the following reasons:
• (1) The transform implementation is part of the custom plugin classpath. Therefore, each time we change something else in our plugins, the transform is outdated, even though the transform itself has not been touched for months. As things are under heavy development, this happens daily. Initially I thought we could accept this, as at some point there will be releases that are used for longer. But given the sheer size of everything that is extracted, even that is problematic.
• There are a few requirements, for legacy reasons, that we cannot get rid of in the transition phase we are in right now:
  ◦ (2) Some tools need to be placed into a specific destination location. This is bad and eventually should not happen anymore. But legacy... we cannot change everything at once.
  ◦ (3) Some tools are more like large systems. When running, they modify something in their own installation directory. This is also not right, but we need to support it for a few selected older tools for now.
In the previous build system, users were able to define which files of a tool should be excluded from checksum checks. If it were not for (1), I would consider doing something hacky/nasty/inefficient to support the legacy requirements. But all of this together led me to the conclusion that we should explore alternatives to Artifact Transforms. I am currently thinking about using a Task to do the extraction. As we basically use custom tasks for everything, it would not be too difficult to register and wire one task per tool instead of the transform. However, relying only on the standard UP-TO-DATE mechanism would not solve issues (1) and (2). (Although (2) could maybe be solved with some kind of filtering when defining the outputs of the task.) Solution: I am thinking of having a kind of "2nd-level" check in the task action, similar to what
@Incremental
tasks do:
• When the task first runs, I store a hash of the output somewhere.
  ◦ Here I may also do some filtering, if required for selected tools, to cater for (3).
• Before I actually extract the zip, I hash the existing output folder and check whether the hash has changed. Only when it has changed do I clear the output and re-extract.
My questions:
• I am hoping to reuse an existing Gradle service for the solution I sketched, even though it is internal API. I am looking at
ChecksumService
, but that's for single files. Is there something I can use out of the box to process the whole destination directory? @wolfs maybe you have a pointer for me?
• Any alternative ideas?
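For illustration, here is a minimal, JDK-only sketch of that second-level check. The names (`hashDirectory`, `needsReExtract`, `recordHash`) are mine, and this deliberately uses plain `MessageDigest` over relative paths and file contents rather than any Gradle service:

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.List;
import java.util.stream.Stream;

public class SecondLevelCheck {

    /** Hash every file's relative path and content under dir, in stable (sorted) order. */
    static String hashDirectory(Path dir) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        if (Files.isDirectory(dir)) {
            try (Stream<Path> files = Files.walk(dir)) {
                List<Path> sorted = files.filter(Files::isRegularFile).sorted().toList();
                for (Path f : sorted) {
                    md.update(dir.relativize(f).toString().getBytes(StandardCharsets.UTF_8));
                    md.update(Files.readAllBytes(f));
                }
            }
        }
        return HexFormat.of().formatHex(md.digest());
    }

    /** The "2nd-level" check: re-extract only when the stored hash
        no longer matches what is actually on disk. */
    static boolean needsReExtract(Path installDir, Path hashFile) throws Exception {
        if (!Files.exists(hashFile)) return true;
        return !Files.readString(hashFile).trim().equals(hashDirectory(installDir));
    }

    /** After a successful extraction, remember the output state. */
    static void recordHash(Path installDir, Path hashFile) throws Exception {
        Files.writeString(hashFile, hashDirectory(installDir));
    }
}
```

The hash file would live outside the installation directory, so it does not feed into its own checksum; filtering for requirement (3) would go into `hashDirectory`.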
w
I think it is not too bad an idea to have the second-level cache. I am not sure you should necessarily keep the outputs in the same location for the cache; I am also not sure how expensive that would be. If you keep them in the same location, Gradle will clean up the outputs when using
InputChanges
and the changes are non-incremental. Changing the task classpath is non-incremental… For a regular task, that won't happen right now, so you might make it work. For actually obtaining the output hashes, I wonder if you would be able to use [the
OutputFileChanges
from up-to-date checks](https://github.com/gradle/gradle/blob/e3a8405909d78e6e586893e4fdcaece11c855968/pla[…]cution/history/changes/DefaultExecutionStateChangeDetector.java). Gradle has the information already, so it would be better to reuse it. It might currently be impossible to access from a task action, though. For reading snapshots from disk,
FileSystemAccess
would be the right service, though again, it is internal and might change. If you use it to read the outputs during the task action, you might need to call
FileSystemAccess.invalidate()
to make sure Gradle doesn't reuse a possibly changed output.
e
a
The scenario sounds similar to something the IntelliJ Platform Gradle Plugin has to do. In order to build a plugin for a specific flavour of IntelliJ, it needs to download and extract the distribution, which can be big. There are other parts to the problem too, but the workaround is that IJPGP spins up a localhost server that presents itself to Gradle as a regular Ivy repository, while internally it downloads and extracts the IJ dist. The benefit is that Gradle can treat the files as regular dependencies and do its own caching. The shim server only has to download and unpack the files. https://github.com/JetBrains/intellij-platform-gradle-plugin/blob/v2.0.1/src/main/kotlin/org/jetbrains/intellij/platform/gradle/shim/Shim.kt
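The real shim is in the link above; purely to illustrate the shape of the idea (this is not IJPGP's code, and it omits the Ivy metadata generation entirely), a localhost server handing out files from an already-extracted directory can be very small:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.file.Files;
import java.nio.file.Path;

/** Toy localhost shim: serves files from an already-extracted directory over
 *  HTTP so a build tool could consume them like plain repository artifacts. */
public class RepoShim {
    private final HttpServer server;

    RepoShim(Path rootDir) throws IOException {
        Path root = rootDir.toAbsolutePath().normalize();
        // Port 0 = pick any free port; the build script would be pointed at port().
        server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            Path file = root.resolve(exchange.getRequestURI().getPath().substring(1)).normalize();
            if (file.startsWith(root) && Files.isRegularFile(file)) {
                byte[] body = Files.readAllBytes(file);
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream os = exchange.getResponseBody()) {
                    os.write(body);
                }
            } else {
                exchange.sendResponseHeaders(404, -1); // not found, no body
            }
            exchange.close();
        });
        server.start();
    }

    int port() {
        return server.getAddress().getPort();
    }

    void stop() {
        server.stop(0);
    }
}
```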
j
Thanks! This is all very helpful. @ephemient we have tweaked those values already. But in the long run the problem remains that the same thing is extracted although it already exists. And when all the tools together already take up a lot of space and are then all extracted again (because the plugin classpath changed), even having just two copies of everything (until the cleanup kicks in) is sometimes too much. @wolfs thanks for confirming the idea and for the pointers. I did not mean a task with
InputChanges
(but a plain one); I just mentioned it for comparison. I will experiment to see whether I can get to the task outputs as you described. @Adam that's very interesting. Do you know some more details? If the extracted folder is treated as an Ivy repo, each file needs to be presented as a separate downloadable artifact, right? You can't download a folder. Does the approach create some artificial metadata that lists all the individual files that were extracted as artifacts?
a
Do you know some more details? If the extracted folder is treated as an Ivy repo, each file needs to be presented as a separate downloadable artifact, right? You can't download a folder.
Yeah, that's right. I think it creates an IvyModule for each requested file, on demand: https://github.com/JetBrains/intellij-platform-gradle-plugin/blob/6af3294df1691ee1a1d817fecac1bb6a22f4b37a/src/main/kotlin/org/jetbrains/intellij/platform/gradle/shim/PluginArtifactoryShim.kt#L65-L74
j
Looks like
FileSystemAccess
is exactly what I was looking for. We will try this approach first. Thanks again for the pointers @wolfs!
This is working pretty well. If you are interested: Full example: https://github.com/jjohannes/gradle-demos/tree/main/tool-installation-task Task: https://github.com/jjohannes/gradle-demos/blob/main/tool-installation-task/gradle/plugins/src/main/java/org/example/ToolInstall.java I store the hash of the previous execution myself; I didn't find a way to get it from Gradle. But that's also good, because we potentially want to reuse the installation across several projects (as the transforms would). So it would be different "instances" of the task in different projects reusing the same folder, which we can do if we manage the hash, and where it is stored and identified, ourselves. (I know parallel running builds are a problem in theory, we'll see...)
To close the loop on this one: I ended up implementing a "service" (maybe more a "utility") that is used directly from inside a task that uses certain tools, to ensure that these tools are installed. No additional transforms, no additional tasks. This also has the advantage that if a task is FROM-CACHE, the tool used inside the task is not installed upfront only for us to see later that it is never used. Here is the example implementation: • https://github.com/jjohannes/gradle-demos/tree/main/toolchain-management I ended up using
DependencyManagementServices
and
FileSystemAccess
. I created this issue to discuss whether these could become available as public API in some form. I think this would be really useful for a couple of use cases: • https://github.com/gradle/gradle/issues/32225
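To make the install-on-first-use idea concrete: the linked repo contains the real implementation; the following is only a JDK-only sketch with hypothetical names (`ToolInstaller.ensureInstalled`), which guards extraction with a marker file holding the archive's checksum instead of Gradle's internal services:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.MessageDigest;
import java.util.Comparator;
import java.util.HexFormat;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Install-on-first-use: extract an archive only when a marker file does not
 *  already record the archive's checksum. Meant to be called from inside a
 *  task action right before the tool is actually needed. */
public class ToolInstaller {

    static Path ensureInstalled(Path zip, Path installDir) throws Exception {
        String zipHash = sha256(zip);
        Path marker = installDir.resolve(".installed");
        if (Files.exists(marker) && Files.readString(marker).equals(zipHash)) {
            return installDir; // already installed from this exact archive
        }
        deleteRecursively(installDir); // stale or partial install: start over
        extract(zip, installDir);
        Files.writeString(marker, zipHash);
        return installDir;
    }

    static String sha256(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        md.update(Files.readAllBytes(file));
        return HexFormat.of().formatHex(md.digest());
    }

    static void extract(Path zip, Path destDir) throws IOException {
        Path dest = destDir.toAbsolutePath().normalize();
        Files.createDirectories(dest);
        try (ZipInputStream in = new ZipInputStream(Files.newInputStream(zip))) {
            for (ZipEntry e; (e = in.getNextEntry()) != null; ) {
                Path out = dest.resolve(e.getName()).normalize();
                if (!out.startsWith(dest)) throw new IOException("zip slip: " + e.getName());
                if (e.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    Files.copy(in, out, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }

    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) return;
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException ex) {
                    throw new UncheckedIOException(ex);
                }
            });
        }
    }
}
```

Hashing the archive (rather than the extracted output) is a simplification; it catches a changed or replaced archive but not tools that mutate their own installation directory, which is what the output-directory hashing in the thread addresses.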
a
Thanks Jendrik, I played around with this over the weekend. I have some questions and thoughts: I added a custom
outputs.upToDateWhen { hashFile.readText() == currentChecksum }
check to my ToolInstallerTask; otherwise the content could change on disk and Gradle wouldn't realise and wouldn't re-run the task. Maybe there's a better way? I think the internal util for generating a checksum could be re-implemented with public types (use a FileCollection to access all files and their relative paths, and Java's MessageDigest to create a checksum). I'll attach a demo below. However, the Gradle checksum utils have some reference to Gradle's VFS - is there some benefit to using the VFS?
• Use Java's MessageDigest to compute a checksum.
• Use the FileCollection's asFileTree.visit to include the relative path of each file in the hash, as well as the file content.
j
outputs.upToDateWhen { ...
Maybe your use case is different than mine, but I explicitly do NOT want the task to rerun on changes to the tool installation. The tool is an implementation detail of the task action. I do not care what state it is in to decide whether the task runs or not. Only if the task needs to run and the tool is not present (or has an invalid installation) do I want to download/extract it.
...util for generating a checksum could be re-implemented with public types
Thanks for sharing. The reasons I used the Gradle service: 1. I just don't want to think about it or maintain any of my own code in this area. 2. I believe the Gradle implementation does caching, so you can repeatedly call
read()
for the same folder and the checksum won't be re-computed (that's why there is also an
invalidate()
). Concretely, one colleague on the project was concerned that with this solution, if, say, 100 tasks use the same tool within one build run, they would all re-compute the checksum. And since some tools are larger than 1 GB, this may be expensive (though I don't know how expensive it really is).
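The read()/invalidate() pattern described here could be approximated without the internal service by memoizing checksums per location; a plain-Java sketch (hypothetical names, not Gradle's FileSystemAccess):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Process-wide memoization of directory checksums, mimicking the
 *  read()/invalidate() pattern described above (sketch, not Gradle's API). */
public class ChecksumCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> compute;

    ChecksumCache(Function<String, String> compute) {
        this.compute = compute;
    }

    /** Returns the cached checksum, computing it at most once per location. */
    String read(String location) {
        return cache.computeIfAbsent(location, compute);
    }

    /** Drop a cached entry, e.g. after a task wrote into the location. */
    void invalidate(String location) {
        cache.remove(location);
    }
}
```

With this, 100 tasks reading the same tool location within one build process would trigger a single checksum computation; the caveat, as with the real service, is that anyone modifying the location must remember to invalidate.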
a
I explicitly do NOT want the task to rerun on changes to the tool installation
What about if the tool installation dir is deleted? Without checking the install dir, Gradle is only aware of the checksum file. If the checksum file doesn't change, will the deleted tool be re-installed?
j
a
I think that check won't get called if the task is cached & up-to-date though?
j
Yes. But then I don't care. Because when the task is not executing, I don't need the tool.