When building software, we need to be fast, and we need to minimise wasted time in the process. We need to get critical updates and features out to users as quickly as possible, and we need to increase the productivity of teams who rely on an efficient edit-and-compile cycle. Introducing remote execution to builds helps achieve this, and for this there is Bazel's Remote Execution API, which lets us parallelise the construction and validation phases of the software development cycle and thus massively reduce elapsed time.
The API specifies protocols for both client and server. Assuming your client is Bazel, there are several options for the server implementation: Buildfarm, Buildbarn, BuildGrid, or Google's Remote Build Execution (RBE). The first three are open source projects which need to be set up on-prem and backed by appropriate computing resources, whereas RBE is a managed service on Google Cloud Platform. But how do the different solutions compare? That's where the Remote Execution API Testing Project comes in. This blog post introduces the project and its goals, the progress so far, what's up next, and how you can get involved.
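To make this concrete, here is a minimal sketch of pointing a Bazel client at a remote execution endpoint. The address and instance name are placeholders rather than values from any of these projects; substitute whatever your Buildfarm, Buildbarn, BuildGrid or RBE deployment exposes.

    # Dispatch build actions to a remote execution service.
    # The endpoint and instance name below are placeholders.
    bazel build \
      --remote_executor=grpc://remote-exec.example.com:50051 \
      --remote_instance_name=main \
      --jobs=64 \
      //...

The same flags can be kept in a .bazelrc file, so that every developer and CI job picks them up automatically.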
Testing the Remote Execution API
The project is a community-driven initiative born out of discussions between folks working on Buildfarm, Buildbarn and BuildGrid, who thought it would be a good idea to collaborate on an independent 'acid test' for all of the implementations. We are particularly interested in which implementations make the best use of resources and deliver the best execution times. To find out, we aim to gather reproducible metrics on the behaviour of each system under comparison and track how those metrics trend from run to run, allowing us to make judgements based on the data.
Progress so far
The initial goal for getting the project off the ground was to put some basic Remote Execution testing in place. For this, we used GitLab CI pipelines, together with Terraform, Kubernetes, and AWS, to run a weekly Bazel build of Abseil against each of the server implementations; upon completion, a PASS/FAIL badge is displayed. The pipeline works by building the latest release of Bazel (with Bazel), building the latest Docker images for each server implementation, and using Terraform to deploy a small Kubernetes cluster with EKS. After that, we use kubectl to deploy the services required by each server implementation, set off the Abseil build job with Bazel, and post the results to the README. This verifies that Remote Execution works with Bazel across all server implementations, and lets us see if anything falls over after one of the components has been updated. It has already allowed us to catch a few bugs and raise them with the appropriate upstream projects.
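As a rough sketch of those stages (the target names, image names and paths here are illustrative stand-ins, not the project's actual pipeline definitions), a single run looks something like this:

    # Illustrative outline of the weekly pipeline; paths and names are
    # hypothetical stand-ins for the real GitLab CI job definitions.
    bazel build //src:bazel                      # build the latest Bazel, with Bazel
    docker build -t buildgrid:latest buildgrid/  # rebuild a server's Docker image
    terraform -chdir=infra apply -auto-approve   # stand up a small EKS cluster
    kubectl apply -f k8s/buildgrid/              # deploy that server's services
    bazel build \
      --remote_executor=grpc://"$CLUSTER_ENDPOINT":50051 \
      @com_google_absl//absl/...                 # set off the Abseil build job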
Once these pipelines were up and running, we focused on adding tests to record performance, capturing end-to-end build times. For this we needed a bigger project than Abseil, so we chose to build Bazel itself. Whilst this was a good first step and highlighted some basic speed differences between the projects, we wanted to dig deeper and see finer-grained metrics such as CPU cost and memory usage. To do this, we added a monitoring stack to the Kubernetes cluster, comprising Prometheus, metrics servers (cAdvisor and node-exporter) and Grafana. The metrics servers collect data from the cluster at the pod and node level and expose an endpoint for Prometheus to scrape. The data is presented in Grafana, which queries Prometheus using PromQL; for each test run, a dashboard is created showing CPU and memory usage for each pod. The results are stored as GitLab artifacts in PDF format and also pushed to a Grafana dashboard hosting site, allowing for interaction with the data. See here for examples.
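For a flavour of what sits behind those dashboards, here are two illustrative PromQL queries over the cAdvisor metrics, issued through Prometheus's HTTP API. The Prometheus host and the namespace label are assumptions for the example, not our actual deployment values.

    # Per-pod CPU usage over the last five minutes, in cores.
    curl -G http://prometheus.example.com:9090/api/v1/query \
      --data-urlencode 'query=sum(rate(container_cpu_usage_seconds_total{namespace="remote-exec"}[5m])) by (pod)'

    # Per-pod working-set memory, in bytes.
    curl -G http://prometheus.example.com:9090/api/v1/query \
      --data-urlencode 'query=sum(container_memory_working_set_bytes{namespace="remote-exec"}) by (pod)'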
Up next
So far the project has created an overview of the compatibility status between different Remote Execution build clients and server implementations. We'd like to continue to enhance this and include the status of all projects in this space. For example, Bazel is not the only build client that works with the Remote Execution API; there are others for which we've added tests, such as BuildStream and RECC. These tests don't yet work with all of the server implementations, but we're working on adding that coverage as well.
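As an example, exercising RECC against a server looks roughly like the following; the endpoint is a placeholder, and the recc documentation covers the full set of configuration variables:

    # Prefix a compiler invocation with recc to dispatch the compile
    # to a remote execution endpoint (the address is a placeholder).
    export RECC_SERVER=remote-exec.example.com:50051
    recc g++ -c hello.cpp -o hello.o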
We'd also like to take the performance analysis further by enabling gRPC tracing in the server implementations using OpenTracing. This should produce a wealth of data which can be analysed to identify where inefficiencies lie, and should prove invaluable in steering each server implementation towards performance improvements.
How can you get involved?
We would very much welcome any contributions that help us move closer to our goals. If you're interested in the problems Remote Execution is trying to solve (reducing elapsed time and wasted developer time by parallelising builds), then we'd like to hear from you.
Are you working with another build client or server implementation that conforms to the Remote Execution API, and would you like to see it tested as part of this framework? Have you chosen Bazel as your build tool and now need to decide which Remote Execution solution to use? Do you have any test projects that could be contributed?
If so, come and chat to us on Slack, tweet me @LaurenceUrhegyi or email me directly at laurence.urhegyi@codethink.com.