
Investigate ASAN test timeouts on CNCF runners #2295

@ddelnano

Description

Since migrating our build infrastructure to CNCF-managed runners (#2277), the following test targets on the ASAN build have been consistently hitting their timeouts:

  • //src/vizier/services/agent/shared/manager:heartbeat_test
  • //src/vizier/services/agent/shared/manager:registration_test
  • //src/carnot/builtins:collections_test

BuildBuddy history shows these tests running up against the original 2-minute timeout on CNCF infrastructure, even though they complete quickly on my dev machine. As a result, the timeout was temporarily increased in #2294 to unblock other work.

Although the average test time in BuildBuddy isn't near the 2-minute threshold, I've also seen uploads to the BEP API time out (see the log below), so my anecdotal evidence is that these test timeouts happen more often than BuildBuddy reports.

//src/vizier/services/agent/shared/manager:heartbeat_test               TIMEOUT in 120.5s
  /github/home/.cache/bazel/_bazel_root/56ec069a32c4abebc78228236a835895/execroot/px/bazel-out/k8-dbg/testlogs/src/vizier/services/agent/shared/manager/heartbeat_test/test.log
//src/vizier/services/agent/shared/manager:registration_test            TIMEOUT in 120.5s
  /github/home/.cache/bazel/_bazel_root/56ec069a32c4abebc78228236a835895/execroot/px/bazel-out/k8-dbg/testlogs/src/vizier/services/agent/shared/manager/registration_test/test.log

Executed 296 out of 296 tests: 294 tests pass and 2 fail locally.
There were tests whose specified size is too big. Use the --test_verbose_timeout_warnings command line option to see which ones these are.
INFO: Build completed, 2 tests FAILED, 1530 total actions
INFO: Build completed, 2 tests FAILED, 1530 total actions
ERROR: The Build Event Protocol upload timed out. com.google.common.util.concurrent.TimeoutFuture$TimeoutFutureException: Timed out: NonCancellationPropagatingFuture@6ce6bba6[status=PENDING, info=[delegate=[SettableFuture@29e4285e[status=PENDING]]]]
Bazel returned code 38, ignoring...
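
Because the BEP upload itself can time out, BuildBuddy may never receive some of these results. One way to cross-check is to count `TIMEOUT` lines directly in the raw CI console output. The following is a minimal sketch, not something that exists in the repo: the function name `timed_out_targets` and the regular expression are assumptions based only on the summary-line format shown in the log above.

```python
import re

# Matches Bazel test summary lines such as:
#   //pkg:target    TIMEOUT in 120.5s
# The pattern is an assumption derived from the log excerpt in this issue.
SUMMARY_RE = re.compile(
    r"^(//\S+)\s+(PASSED|FAILED|TIMEOUT|FLAKY)\s+in\s+([\d.]+)s"
)


def timed_out_targets(log_text):
    """Return {target: seconds} for every summary line reporting TIMEOUT."""
    results = {}
    for line in log_text.splitlines():
        match = SUMMARY_RE.match(line.strip())
        if match and match.group(2) == "TIMEOUT":
            results[match.group(1)] = float(match.group(3))
    return results


# Sample input copied from the log above (test.log path lines omitted).
SAMPLE = """\
//src/vizier/services/agent/shared/manager:heartbeat_test               TIMEOUT in 120.5s
//src/vizier/services/agent/shared/manager:registration_test            TIMEOUT in 120.5s
"""

if __name__ == "__main__":
    for target, seconds in timed_out_targets(SAMPLE).items():
        print(f"{target}: {seconds}s")
```

Running this over the saved console logs of several CI runs and tallying the targets would give a timeout count that does not depend on the BEP upload succeeding.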

This issue tracks investigating the performance regression, implementing the underlying fix, and reverting the temporary timeout increase once the issue is resolved.
