Right now, we're not setting any CPU limit on the pods we schedule. This means our workloads run without constraints and can hog the hosts they're running on if there are no namespace-wide default resource limits.
Unlike memory requests/limits, CPU limits are enforced at the CPU-scheduling level (throttling), so IIUC it's not possible to be evicted due to excessive CPU usage.
So let's start defaulting to a reasonable, not-too-large value and ensure we set both the CPU requests and limits. This also allows cgroups-aware applications to derive the correct number of "equivalent CPUs" to use for multi-threaded/multi-processing steps.
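As a rough illustration only (the 2-CPU figure comes from the follow-up below; the map shape and variable names are assumptions, not the library's actual code), the idea is to attach a matching request/limit pair to every container we schedule:

```
// Illustrative sketch: set the CPU request and limit to the same value on the
// container spec we generate. The 2-CPU default and map shape are assumptions.
def cpus = 2
def containerResources = [
    requests: [cpu: "${cpus}"],
    limits:   [cpu: "${cpus}"],
]
```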
As of 7b552e1134, we're enforcing CPU resource limits on our pods. The current defaults allocate 2 CPUs and then use them to run 8 parallel kola tests and a kola upgrade test. Reduce oversubscription by having kola parallelize to the number of available CPUs by default. There's still some oversubscription because we're running the kola upgrade test in parallel with that, but that currently only needs 1 CPU.
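A hedged sketch of the kola side of this (the use of `nproc` here is a simplification for illustration; the real change may derive the CPU count differently, e.g. from the cgroup quota):

```
// Size kola's parallelism from the CPUs available to the pod instead of a
// hardcoded 8. `nproc` is an approximation used for this sketch only.
def ncpus = shwrapCapture("nproc")
shwrap("cosa kola run --parallel ${ncpus}")
```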
job.yaml: switch to minimal image; use :latest tag
Since the role of the container here is super minimal, let's just use the minimal image and follow the :latest tag so that we never have to worry about being on an EOL release.
When deploying the RHCOS pipeline we ran into a case where there was a limitrange set with a default memory limit:
```
oc describe limitrange
Type       Resource  Min  Max  Default Request  Default Limit  Max Limit/Request Ratio
----       --------  ---  ---  ---------------  -------------  -----------------------
Container  cpu       -    -    300m             4              -
Container  memory    -    -    400Mi            6656Mi         -
```
In this case, we also needed to bump the memory for the COSA pod to 8Gi. Bumping it to 8Gi (via cosaPod()) set the request but not the limit, so the pod inherited the default limit of 6656Mi and we got an error:
```
Invalid value: "8Gi": must be less than or equal to memory limit
```
Let's just set the limit to the same value as the request here. We'll adapt later if we find we need more knobs.
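A minimal sketch of what that looks like (cosaPod()'s actual parameter and variable names may differ; this shows the shape, not the real code):

```
// Whatever memory value the caller asks for becomes both the request and the
// limit, so a namespace LimitRange default like 6656Mi can't undercut an 8Gi
// request. Names here are assumptions.
def memory = "8Gi"
def resources = [
    requests: [memory: memory],
    limits:   [memory: memory],  // same value as the request; no extra knob yet
]
```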
This will be used by the pipeline to process some of the templated fields in `config.yaml`. Putting it here allows us to work around the fact that the `SimpleTemplateEngine` API would normally have to be allowed by an administrator first.
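For context, this is the kind of use the pipeline has in mind (the template string and binding keys below are made up for illustration); calling `groovy.text.SimpleTemplateEngine` directly from a Jenkinsfile would normally trip the script-security approval step, whereas a trusted shared library can use it freely:

```
import groovy.text.SimpleTemplateEngine

// Render a templated string against a map of values. The field names here
// (stream, arch) are purely illustrative.
def rendered = new SimpleTemplateEngine()
    .createTemplate('stream: ${stream}, arch: ${arch}')
    .make([stream: "testing-devel", arch: "x86_64"])
    .toString()
assert rendered == "stream: testing-devel, arch: x86_64"
```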
vars/utils: add compat symlink from /srv/fcos to /srv/coreos
Commit 3719f69 ("tree: drop FCOS wording from various places") broke multiple upstream CI jobs which still reference the old location. We need to temporarily support both locations as we migrate them.
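A hedged sketch of the compat shim (where exactly it lives in the utils step is an assumption):

```
// Keep the old /srv/fcos path resolving to /srv/coreos while upstream CI jobs
// migrate to the new location.
shwrap("""
    if [ ! -e /srv/fcos ]; then
        ln -s /srv/coreos /srv/fcos
    fi
""")
```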
kolaTestIso: port over some changes from kola.groovy
This mostly ports over 8804c62 and 042c0ba from kola.groovy, which make it so multiple runs can happen at the same time and also allow the caller to give a marker (some identifying info) about the run that will be stored in its output filename. This makes it easier to find which archive to download in the Jenkins web UI.
{kola,kolaTestIso}: reduce number of try/finally blocks
This makes the finally block (the log collection) more generic and allows it to run at the end, so that we can reduce the number of scopes in our code.
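Roughly the shape being described (the stage names and the `collectLogs()` helper below are placeholders, not the library's actual code):

```
// One outer try/finally whose finally block collects logs once at the very
// end, instead of nesting a try/finally per stage.
try {
    stage("Kola") { /* run the parallel kola tests */ }
    stage("Kola:Upgrade") { /* run the upgrade test */ }
} finally {
    collectLogs()  // hypothetical stand-in for the generic log collection
}
```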
{kola,kolaTestIso}: initial stab at multi-arch support
The rule here is mostly to run things through either `cosa` commands or run generic bash commands through `cosa shell --`. This is also easier if the `cosa` command inside `shwrap()` is already running from the cosa dir, so let's use `dir(cosaDir)` to try to enable that simplification.
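A sketch of the pattern (the specific commands are illustrative):

```
// Everything goes through a `cosa` subcommand, or through `cosa shell --` for
// generic bash, and we run from inside the cosa dir so shwrap() calls don't
// have to juggle absolute paths.
dir(cosaDir) {
    shwrap("cosa kola run")
    shwrap("cosa shell -- test -d tmp/kola")
}
```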
kola: simplify tests on non-QEMU or if user passed extraArgs
We only need to worry about basic-qemu-scenarios and running reprovision tests separately when the platform is QEMU (the default). Loosely detect whether the target platform is something other than QEMU by checking if the user provided platformArgs; if we're running against non-QEMU, just run all the tests up front rather than in separate runs.
Similarly, if the user passed in `extraArgs`, let's assume they want greater control and could be passing arguments that conflict with our `--tag` specifications in the invocations below. Given this assumption, it's better to just run with the extraArgs in a single invocation up front.
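A loose sketch of that decision (the exact parameter spelling and surrounding plumbing are assumptions):

```
// platformArgs implies a non-QEMU target and extraArgs means the caller wants
// control, so either one collapses everything into one up-front invocation.
if (params.platformArgs || params.extraArgs) {
    shwrap("cosa kola run ${params.platformArgs ?: ''} ${params.extraArgs ?: ''}")
} else {
    // default QEMU path: basic-qemu-scenarios, reprovision tests, etc. go in
    // separate tagged invocations (elided)
}
```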
{kola,kolaTestIso}: remove the unnecessary stage declaration
When nesting `parallel` calls like we do in the `bump-lockfile` job in our FCOS pipeline, the extra stage declaration confuses the Blue Ocean view. Let's just drop it.
I'm seeing some concurrency issues when running the bump-lockfile job. It turns out that when kolaTestIso runs concurrently, these variables step on each other across runs. Let's scope them appropriately.
In this commit, we collapse all test ISO run definitions into a single map (testIsoRuns) rather than two (testIsoRuns1, testIsoRuns2), while still retaining the property of only running two parallel runs at any given time.
The reason for doing this is that I noticed an s390x run would first run `s390x:kola:metal` and then `s390x:kola:multipath`, serially. In that case, it would be more appropriate to run both of them together in a single parallel run.
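One way this can look (the `collate(2)` batching is an assumption about how the two-at-a-time limit is kept, not necessarily the commit's exact mechanism):

```
def testIsoRuns = [
    "s390x:kola:metal":     { /* metal test ISO run */ },
    "s390x:kola:multipath": { /* multipath test ISO run */ },
    // ... every other run now lives in this one map
]
// run at most two branches at a time
testIsoRuns.entrySet().toList().collate(2).each { batch ->
    parallel(batch.collectEntries { [(it.key): it.value] })
}
```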
{kola,kolaTestIso}: workaround issue with accessDenied from dir()
When using `dir()` in the kola and kolaTestIso jobs, we get a permission denied error if the directory isn't under the env.WORKSPACE directory.
Let's go back to using a `cd ${cosaDir}` for now to work around this while we find better solutions. This `cd ${cosaDir}` will have no effect on multi-arch, where we'll be operating in a remote session anyway.
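The workaround in sketch form (the command itself is illustrative):

```
// An explicit cd instead of dir(), which Jenkins rejects for paths outside
// env.WORKSPACE. On multi-arch the cd is a harmless no-op since we're in a
// remote session anyway.
shwrap("""
    cd ${cosaDir}
    cosa kola run
""")
```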
I think with 380ffa5, the optimization from 362e995 is no longer required. In its current form, the kola-azure and kola-openstack tests in the pipeline don't have the kola tests broken out into a separate stage (they skip the upgrade test), so it's hard to find a failure in the Blue Ocean view when one happens.
This will help us copy over credentials to the remote node if we're inside a remote session. Having the code in a single place means we're less likely to copy/paste and make mistakes.
{kola,kolaTestIso}: remove Arch from stage run titles by default
Including the arch everywhere is a bit overboard. Where it's really needed is in our pipeline bump-lockfile job where we run tests from all architectures at once. Everywhere else it's not really welcome.
Let's change the model a bit to include the marker given by the client in the stage titles. Otherwise we won't add anything to them.
If we pass the full directory name to `tar -c`, it'll want to recreate the whole structure on extraction. Instead, use `-C` so we only pass the final directory to `tar`. That ensures we still create at least one directory on extraction instead of filling up the working directory. We used to do this, but I think it got lost in the recent enhancements to these steps.
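A sketch of the fix (the paths and variable names are hypothetical):

```
// -C changes into the parent first, so the archive carries only `builddir/`
// and extraction recreates exactly one directory rather than the full path.
def parent = "/path/to"   // hypothetical parent of the directory we archive
def name = "builddir"     // hypothetical final path component
shwrap("tar -C ${parent} -cf dir.tar ${name}")
```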
withPodmanRemote: add port 22 to CONTAINER_HOST env var
There is a regression in podman [1] that causes podman remote sessions to not work unless the port is specified. Let's just add the port so we can get unblocked.
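Roughly what that looks like (the user, host, and socket path below are placeholders; the explicit `:22` is the actual point):

```
// The podman regression means remote sessions only work when the port is
// spelled out in the connection URL.
env.CONTAINER_HOST = "ssh://builder@${remoteHost}:22/run/user/1000/podman/podman.sock"
```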
AFAIK, we've never used these functions and we've moved away from gangplank for now in favour of `podman remote`. Let's delete them. We can always restore them in the future if needed.
We started using this convention in a few places in the pipeline. Let's add a helper for it. Make it top-level so it feels as ergonomic as an `error` call.
Currently, we're putting the tarball in `/tmp`, but if we're running on the Jenkins controller, that's a semi-permanent shared location. Since our generated tarball has a predictable name, we additionally run the risk of racing with other `buildImage` calls on the controller and then passing the wrong `dir.tar` to `oc start-build`.
Fix this by putting the tarball in the workspace itself. That way, it's properly lifecycled to the job run, and there's no chance of racing with other jobs.
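A sketch of the result (the source directory and build config names are illustrative):

```
// The tarball lives in the per-run workspace, so it's lifecycled with the job
// and can't collide with another buildImage call's dir.tar on the controller.
def tarball = "${env.WORKSPACE}/dir.tar"
shwrap("tar -C ${parentDir} -cf ${tarball} ${srcDirName}")
shwrap("oc start-build --from-archive=${tarball} my-buildconfig")
```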
buildImage: drop support for `workspace` parameter
The current trend is to avoid these kinds of parameters in favour of requiring the caller to instead use `dir(...) { }` when calling us. But anyway, the only caller I know of using this function (cosa) isn't currently passing that parameter, so we can safely drop it.
kola: fix allowUpgradeFail by removing id from log check
If the kola upgrade test fails early enough that logs weren't even created, then we'll get a failure like:
```
10:50:22 + cosa shell -- test -d /home/jenkins/agent/workspace/build/tmp/kola-5JOIO/kola-upgrade
10:50:22 error: failed to execute cmd-shell: exit status 1
```
Let's try to remove the log from consideration if that's the case.
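A sketch of that guard (the directory variable and the surrounding list are assumptions; `shwrapRc` and the `cosa shell -- test -d` check come from the library and the log above):

```
// Only try to collect the upgrade test's log directory if it actually exists;
// an early failure means kola never created it.
def upgradeLogDir = "${kolaTmpDir}/kola-upgrade"
if (shwrapRc("cosa shell -- test -d ${upgradeLogDir}") != 0) {
    logDirs.remove(upgradeLogDir)  // drop it from what we archive
}
```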
We had it in `shwrap`, but not `shwrapCapture` and `shwrapRc`. We're hitting an issue right now where `podman` wants to create `$HOME/.ssh` when using the remote stuff, but because we're running unprivileged, we get a permission denied error.
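Roughly the shape of the change (an assumed sketch of one of the helpers, not the library's actual code):

```
// Export HOME to the workspace in every shwrap variant, not just shwrap()
// itself, so podman can create $HOME/.ssh when running unprivileged.
def shwrapCapture(cmd) {
    return sh(returnStdout: true, script: """
        export HOME=${env.WORKSPACE}
        ${cmd}
    """).trim()
}
```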
Revert "Set `HOME` at the pod level instead of in shwrap helpers"
This reverts commit 3577892a9cac46964a5f44be8881a9ef7741df8c.
Setting the `HOME` var to the workspace from the pod definition doesn't work because it introduces a chicken-and-egg problem: the workspace isn't yet allocated since the pod isn't running yet.
This isn't a pure revert since I wanted to keep 33d4910 ("Add `umask` workaround to other shwrap helpers"), which came after.