Failure: MasterCI #311- RegressionTestsRelease: S3 (aws_s3, 1), S3 (gcs, 1), S3 (minio, 1) - Code 76: Cannot lock /var/lib/clickhouse/status (CANNOT_OPEN_FILE) #1175

@CarlosFelipeOR

Description

All three S3 backends (AWS, GCS, MinIO) in MasterCI #311 are failing due to the same underlying ClickHouse restart error.

During steps that update storage.xml and trigger a ClickHouse restart, the server fails to start and logs:

Code: 76. DB::Exception: Cannot lock file /var/lib/clickhouse/status.
Another server instance in same directory is already running. (CANNOT_OPEN_FILE)

Because the server does not come back up, the tests waiting for config reload or health checks eventually hit ExpectTimeoutError, causing the scenarios to fail.

This issue is backend-agnostic and appears to be caused by an incomplete stop sequence leaving orphaned ClickHouse processes that still hold the status file lock.
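To make the "Cannot lock file" symptom concrete, here is a minimal sketch of how two instances contend for one status file. It assumes the status file is guarded by an flock-style exclusive advisory lock, which this issue does not state explicitly; the temporary path and the `try_lock` helper are hypothetical stand-ins, not part of ClickHouse or the test framework.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Open path and try to take an exclusive, non-blocking advisory lock.

    Returns the open file object on success (keep it open to hold the lock),
    or None if another open file description already holds the lock.
    """
    f = open(path, "a+")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        # The lock is held elsewhere; this is the situation behind Code 76.
        f.close()
        return None
    return f


# Hypothetical stand-in for /var/lib/clickhouse/status.
status_path = os.path.join(tempfile.mkdtemp(), "status")

first = try_lock(status_path)   # the "old" server that was never stopped
second = try_lock(status_path)  # the "new" server started on top of it

first.close()                   # the old server exits and releases the lock
third = try_lock(status_path)   # a start after a clean stop can lock again
```

Under these assumptions `second` is `None` for as long as `first` stays open, which mirrors a new server refusing to start while the previous one is still alive.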

Summary (Root Cause & Resolution)

The restart failures were not caused by orphaned ClickHouse processes as initially suspected.
The actual root cause was a regression introduced on 2025-10-17, when a new known issue (XFail) was added for ClickHouse versions >= 25.8.

This XFail caused the PID-termination retry loop to exit prematurely, leading the framework to:

  • remove /tmp/clickhouse-server.pid while the server was still running, and
  • attempt to start a new ClickHouse instance without killing the previous one.

This created a race condition that consistently produced:

Code: 76. DB::Exception: Cannot lock file /var/lib/clickhouse/status.
Another server instance in same directory is already running. (CANNOT_OPEN_FILE)

Additionally, we identified that overly broad XFail patterns (e.g., using /*) contributed to the problem.
Such patterns can unintentionally match retry iterations, preventing the retry loop from executing properly.
These broad patterns will be corrected in the S3 test suite, and a future cleanup will update all suites to use precise XFail paths to avoid similar issues.
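The effect of a broad pattern can be illustrated with a small sketch. It uses Python's `fnmatch` as a stand-in for the framework's own pattern matcher, which may behave differently, and every test path below is hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical test paths; the real suite's names may differ.
paths = [
    "/s3/minio/disk/restart",                         # test the XFail was meant for
    "/s3/minio/disk/restart/stop clickhouse/try #1",  # retry iteration nested under it
    "/s3/minio/disk/restart/stop clickhouse/try #2",  # another retry iteration
]

broad = "/s3/minio/*"               # "*" also matches "/" in fnmatch
precise = "/s3/minio/disk/restart"  # names exactly one test

broad_hits = [p for p in paths if fnmatchcase(p, broad)]
precise_hits = [p for p in paths if fnmatchcase(p, precise)]

print(len(broad_hits), "paths matched by the broad pattern")
print(len(precise_hits), "path matched by the precise pattern")
```

In this sketch the broad pattern also captures the nested retry iterations, so a failing iteration would be treated as an expected failure instead of triggering another attempt. The precise pattern matches only the intended test.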

The fix (PR #70) updates the restart logic to terminate all running ClickHouse server processes regardless of PID file state, fully resolving the issue.
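The idea behind that fix can be sketched as follows. This is not the code from PR #70: `stop_all_servers`, its parameters, and the escalation to SIGKILL are illustrative assumptions. Process listing and signalling are passed in as callables so the logic does not depend on the PID file.

```python
import os
import signal
import time


def stop_all_servers(list_pids, kill, pid_file, attempts=50, delay=0.1):
    """Terminate every running server process, then remove the PID file.

    list_pids: callable returning the PIDs of running server processes.
    kill:      callable(pid, sig) that delivers a signal, e.g. os.kill.
    """
    for attempt in range(attempts):
        pids = list_pids()
        if not pids:
            break
        # Ask politely first, then escalate for the second half of the attempts.
        sig = signal.SIGTERM if attempt < attempts // 2 else signal.SIGKILL
        for pid in pids:
            try:
                kill(pid, sig)
            except ProcessLookupError:
                pass  # the process exited between listing and signalling
        time.sleep(delay)

    remaining = list_pids()
    if remaining:
        raise TimeoutError("server processes still running: %r" % remaining)

    # Only after every process is gone is it safe to drop the PID file.
    if os.path.exists(pid_file):
        os.remove(pid_file)
```

In a real harness `list_pids` might wrap a process listing such as `pgrep -f clickhouse-server` and `kill` might be `os.kill`; both are assumptions here. The key design point is that the PID file is removed only after the process list is empty, never before.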

A follow-up fix (PR #72) was needed because PR #70 introduced a retry-loop syntax error.
This issue became visible in the /version test on ClickHouse 22.8, where the retry logic was actually triggered and failed.
PR #72 fixes the syntax and removes the obsolete XFail, ensuring the retry loop works correctly.

This was not an upstream ClickHouse problem.


CI Run

MasterCI #311

Artifacts and Reports

ClickHouse® CI Workflow Run Report

Suite Report:
