Description
All three S3 backends (AWS, GCS, MinIO) in MasterCI #311 are failing due to the same underlying ClickHouse restart error.
During steps that update storage.xml and trigger a ClickHouse restart, the server fails to start and logs:
Code: 76. DB::Exception: Cannot lock file /var/lib/clickhouse/status.
Another server instance in same directory is already running. (CANNOT_OPEN_FILE)
Because the server does not come back up, the tests waiting for config reload or health checks eventually hit ExpectTimeoutError, causing the scenarios to fail.
This issue is backend-agnostic and appears to be caused by an incomplete stop sequence leaving orphaned ClickHouse processes that still hold the status file lock.
Summary (Root Cause & Resolution)
The restart failures were not caused by orphaned ClickHouse processes as initially suspected.
The actual root cause was a regression introduced on 2025-10-17, when a new known issue (XFail) was added for ClickHouse >= 25.8.
This XFail caused the PID-termination retry loop to exit prematurely, leading the framework to:
- remove /tmp/clickhouse-server.pid while the server was still running, and
- attempt to start a new ClickHouse instance without killing the previous one.
This created a race condition that consistently produced:
Code: 76. DB::Exception: Cannot lock file /var/lib/clickhouse/status.
Another server instance in same directory is already running.
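The lock conflict behind this error can be reproduced in isolation. The sketch below assumes the status file is guarded by an exclusive, non-blocking advisory lock in the style of flock, and it runs on Linux/Unix only. The helper names try_lock and demo and the temporary path are hypothetical; nothing here is taken from ClickHouse or the test framework.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Try to take an exclusive, non-blocking lock on path.

    Returns the open file descriptor on success, or None if another
    holder already has the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def demo():
    """Simulate an old server, a premature restart, and a clean restart."""
    with tempfile.TemporaryDirectory() as workdir:
        # Stand-in for /var/lib/clickhouse/status (assumption for the demo).
        status = os.path.join(workdir, "status")
        old_server = try_lock(status)   # the old instance holds the lock
        premature = try_lock(status)    # new instance started too early
        os.close(old_server)            # the old instance finally exits
        clean = try_lock(status)        # restart after the old one is gone
        result = (old_server is not None, premature is None, clean is not None)
        if clean is not None:
            os.close(clean)
        return result
```

In this model, removing the PID file changes nothing: the second start fails for as long as the first holder is alive, which matches the behaviour described above.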
Additionally, we identified that overly broad XFail patterns (e.g., using /*) contributed to the problem.
Such patterns can unintentionally match retry iterations, preventing the retry loop from executing properly.
These broad patterns will be corrected in the S3 test suite, and a future cleanup will update all suites to use precise XFail paths to avoid similar issues.
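To illustrate how a trailing /* can catch more than intended, here is a minimal sketch that assumes glob-style matching and uses Python's fnmatchcase as a stand-in for the framework's own matcher. The test paths, including the "try 1" name for a retry iteration, are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical test paths; the last one stands in for a retry iteration
# created under the stop step.
TEST_PATHS = [
    "/s3/minio/some known failing test",
    "/s3/minio/restart/stop clickhouse",
    "/s3/minio/restart/stop clickhouse/try 1",
]

BROAD = "/s3/minio/*"                          # matches everything below /s3/minio
PRECISE = "/s3/minio/some known failing test"  # matches only the intended test


def matching(pattern, paths=TEST_PATHS):
    """Return the paths that the given XFail-style pattern would cover."""
    return [path for path in paths if fnmatchcase(path, pattern)]
```

Under these assumptions, matching(BROAD) returns all three paths, including the retry iteration, while matching(PRECISE) returns only the test that was meant to be marked.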
The fix (PR #70) updates the restart logic to terminate all running ClickHouse server processes regardless of PID file state, fully resolving the issue.
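The idea of stopping every server process regardless of the PID file can be sketched as follows. This is not the code from PR #70; it is a minimal illustration in which the function names, the pgrep pattern, and the timeout values are assumptions. Process discovery and signalling are passed in as callables so the logic can be exercised without real processes.

```python
import os
import signal
import subprocess
import time


def pgrep_pids(pattern="clickhouse[- ]server"):
    """List PIDs whose command line matches pattern (assumes pgrep is installed)."""
    result = subprocess.run(["pgrep", "-f", pattern], capture_output=True, text=True)
    return [int(token) for token in result.stdout.split()]


def _signal_all(pids, kill, sig):
    """Send sig to each PID, ignoring processes that have already exited."""
    for pid in pids:
        try:
            kill(pid, sig)
        except ProcessLookupError:
            pass


def stop_all_servers(list_pids, kill, timeout=30.0, poll=0.5):
    """Stop every server process found by list_pids, ignoring any PID file.

    Sends SIGTERM first, polls until no process is left, and escalates to
    SIGKILL once the timeout expires. Returns True if no server process
    remains afterwards.
    """
    deadline = time.monotonic() + timeout
    _signal_all(list_pids(), kill, signal.SIGTERM)
    while time.monotonic() < deadline:
        if not list_pids():
            return True
        time.sleep(poll)
    _signal_all(list_pids(), kill, signal.SIGKILL)
    return not list_pids()
```

A caller would run something like stop_all_servers(pgrep_pids, os.kill) and only remove the PID file and start a new instance after it returns True.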
A follow-up fix (PR #72) was needed because PR #70 introduced a syntax error in the retry loop.
The syntax error became visible in the /version test on ClickHouse 22.8, where the retry logic was actually triggered and failed.
PR #72 fixes the syntax and removes the obsolete XFail, ensuring the retry loop works correctly.
This was not an upstream ClickHouse problem.
CI Run
Artifacts and Reports
ClickHouse® CI Workflow Run Report
Suite Report: