@ngxson
Collaborator

ngxson commented Dec 21, 2025

Fix #18234

This includes quite a few changes, but at a high level:

  • Access to server_context_impl class members is now highly restricted. Only some pointers such as vocab, model, and mctx are exposed.
  • Any static data (e.g. model name, context size) must now be rendered into server_context_meta. This prevents any access to non-thread-safe data inside server_context_impl.
  • From server_routes, the HTTP layer can only access some pointers such as vocab, model, and mctx. Any other data MUST be passed through server_context_meta.

As a consequence:

  • /models and /v1/models can no longer be accessed during model loading. Doing so is NOT thread-safe and can cause a data race.
  • However, /models and /v1/models can now be accessed while the server is sleeping, because they no longer access server_context_impl directly.

Also includes some other fixes described in #18263 (comment) to make things safer.

cc @ServeurpersoCom, I would appreciate it if you could do some testing, thanks!

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Dec 21, 2025

Stress testing to reproduce the race condition with and without this PR

(root|~) cat race.sh
#!/bin/bash
# Stress test to detect the data race in #18234

set -o pipefail

readonly BASE_URL="https://www.serveurperso.com/ia/webui"
readonly MODEL_A="MoE-Qwen3-30B-A3B-Instruct-2507"
readonly MODEL_B="MoE-Qwen3-30B-A3B-Thinking-2507"
readonly ITERATIONS=100
readonly PARALLEL_REQUESTS=20

# Colors for old-school logging
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'

log() { echo -e "${GREEN}[$(date '+%H:%M:%S.%3N')]${NC} $*"; }
err() { echo -e "${RED}[$(date '+%H:%M:%S.%3N')] ERROR:${NC} $*" >&2; }
warn() { echo -e "${YELLOW}[$(date '+%H:%M:%S.%3N')] WARN:${NC} $*"; }

# Heavy test payload to force cache_prompt
generate_payload() {
    python3 -c "print('Test '*500)"
}

# /v1/models request - main target of the race
hammer_models_endpoint() {
    local req_id=$1
    local start=$(date +%s%N)

    local response=$(curl -s -w "\n%{http_code}" \
        "${BASE_URL}/v1/models" \
        -H "Content-Type: application/json" \
        --connect-timeout 2 \
        --max-time 5 2>&1)

    local http_code=$(echo "$response" | tail -1)
    local body=$(echo "$response" | head -n -1)
    local duration=$((($(date +%s%N) - start) / 1000000))

    if [[ "$http_code" != "200" ]]; then
        err "REQ-${req_id}: /v1/models failed: HTTP $http_code (${duration}ms)"
        echo "$body" | head -5
        return 1
    fi

    # Check JSON consistency
    if ! echo "$body" | jq -e '.data | length' >/dev/null 2>&1; then
        err "REQ-${req_id}: Invalid JSON response"
        return 1
    fi

    log "REQ-${req_id}: /v1/models OK (${duration}ms)"
}

# Completion request to force state transitions
hammer_completion() {
    local req_id=$1
    local model=$2
    local payload=$(generate_payload)

    local start=$(date +%s%N)
    local response=$(curl -s -w "\n%{http_code}" -N \
        "${BASE_URL}/v1/chat/completions" \
        -H "Content-Type: application/json" \
        -d '{
            "model": "'"$model"'",
            "messages": [{"role": "user", "content": "'"$payload"'"}],
            "stream": true,
            "max_tokens": 10,
            "cache_prompt": false
        }' \
        --connect-timeout 2 \
        --max-time 10 2>&1)

    local http_code=$(echo "$response" | tail -1)
    local duration=$((($(date +%s%N) - start) / 1000000))

    if [[ "$http_code" != "200" ]]; then
        err "REQ-${req_id}: Completion failed: HTTP $http_code (${duration}ms, model: $model)"
        return 1
    fi

    log "REQ-${req_id}: Completion OK (${duration}ms, model: $model)"
}

# Mixed attack: models + completions in parallel
parallel_assault() {
    local wave=$1
    local pids=()
    local failures=0

    warn "WAVE $wave: Launching $PARALLEL_REQUESTS parallel requests..."

    # Launch in parallel
    for i in $(seq 1 $PARALLEL_REQUESTS); do
        local req_id="${wave}-${i}"

        # Alternate between models and completions
        if (( i % 3 == 0 )); then
            hammer_models_endpoint "$req_id" &
        elif (( i % 2 == 0 )); then
            hammer_completion "$req_id" "$MODEL_A" &
        else
            hammer_completion "$req_id" "$MODEL_B" &
        fi

        pids+=($!)
    done

    # Wait and count failures
    for pid in "${pids[@]}"; do
        if ! wait "$pid"; then
            ((failures++))
        fi
    done

    if (( failures > 0 )); then
        err "WAVE $wave: $failures/$PARALLEL_REQUESTS requests failed"
        return 1
    else
        log "WAVE $wave: ALL $PARALLEL_REQUESTS requests succeeded"
        return 0
    fi
}

# Race test during model swap
race_during_swap() {
    warn "Testing race during model swap..."

    # Trigger swap to MODEL_B
    hammer_completion "SWAP-1" "$MODEL_B" &
    local swap_pid=$!

    # Hammer /v1/models during the swap
    sleep 0.1
    for i in {1..10}; do
        hammer_models_endpoint "SWAP-${i}" &
    done

    wait
}

# Main stress test
main() {
    log "=== llama.cpp Data Race Hunter ==="
    log "Target: $BASE_URL"
    log "Models: $MODEL_A, $MODEL_B"
    log "Parallel requests: $PARALLEL_REQUESTS"
    log "Iterations: $ITERATIONS"
    echo

    # Check that the server responds
    if ! curl -s --connect-timeout 2 "${BASE_URL}/v1/models" >/dev/null; then
        err "Server unreachable at $BASE_URL"
        exit 1
    fi

    local total_failures=0
    local start_time=$(date +%s)

    # Phase 1: assault in waves
    for wave in $(seq 1 $ITERATIONS); do
        if ! parallel_assault "$wave"; then
            ((total_failures++))
        fi

        # Short delay to observe transitions
        sleep 0.05
    done

    # Phase 2: race during swap
    warn "=== Phase 2: Model swap race test ==="
    race_during_swap

    # Final report
    local duration=$(($(date +%s) - start_time))
    echo
    log "=== Test completed in ${duration}s ==="

    if (( total_failures > 0 )); then
        err "FAILED: $total_failures/$ITERATIONS waves had failures"
        err "Data race likely present - check server logs"
        exit 1
    else
        log "SUCCESS: All $ITERATIONS waves passed"
        log "No obvious race detected (but check server logs for assertions/crashes)"
        exit 0
    fi
}

main "$@"
(root|~)

Results without this PR merged:

(root|~) ./race.sh
[21:30:46.093] === llama.cpp Data Race Hunter ===
[21:30:46.093] Target: https://www.serveurperso.com/ia/webui
[21:30:46.094] Models: MoE-Qwen3-30B-A3B-Instruct-2507, MoE-Qwen3-30B-A3B-Thinking-2507
[21:30:46.095] Parallel requests: 20
[21:30:46.095] Iterations: 100

[21:30:46.318] WARN: WAVE 1: Launching 20 parallel requests...
[21:30:46.551] ERROR: REQ-1-3: /v1/models failed: HTTP 503 (231ms)
[21:30:46.551] ERROR: REQ-1-9: /v1/models failed: HTTP 503 (231ms)
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>503 Service Unavailable</title>
</head><body>
<h1>Service Unavailable</h1>
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>503 Service Unavailable</title>
</head><body>
<h1>Service Unavailable</h1>
[21:30:46.559] ERROR: REQ-1-15: /v1/models failed: HTTP 503 (238ms)
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>503 Service Unavailable</title>
</head><body>
<h1>Service Unavailable</h1>
[21:30:46.560] ERROR: REQ-1-18: /v1/models failed: HTTP 503 (239ms)
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>503 Service Unavailable</title>
</head><body>
<h1>Service Unavailable</h1>
[21:30:46.563] ERROR: REQ-1-5: Completion failed: HTTP 503 (238ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:30:46.567] ERROR: REQ-1-6: /v1/models failed: HTTP 503 (246ms)
[21:30:46.567] ERROR: REQ-1-12: /v1/models failed: HTTP 503 (246ms)
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>503 Service Unavailable</title>
</head><body>
<h1>Service Unavailable</h1>
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>503 Service Unavailable</title>
</head><body>
<h1>Service Unavailable</h1>
[21:30:46.568] ERROR: REQ-1-4: Completion failed: HTTP 503 (243ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:30:46.723] ERROR: REQ-1-16: Completion failed: HTTP 503 (398ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:30:46.771] ERROR: REQ-1-11: Completion failed: HTTP 503 (445ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:30:46.775] ERROR: REQ-1-10: Completion failed: HTTP 503 (449ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:30:46.777] ERROR: REQ-1-1: Completion failed: HTTP 503 (452ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:30:46.780] ERROR: REQ-1-7: Completion failed: HTTP 503 (454ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:30:46.782] ERROR: REQ-1-8: Completion failed: HTTP 503 (456ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:30:46.785] ERROR: REQ-1-14: Completion failed: HTTP 503 (459ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:30:46.785] ERROR: REQ-1-2: Completion failed: HTTP 503 (459ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:30:46.786] ERROR: REQ-1-20: Completion failed: HTTP 503 (460ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:30:46.791] ERROR: REQ-1-17: Completion failed: HTTP 503 (465ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:30:46.797] ERROR: REQ-1-13: Completion failed: HTTP 503 (471ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:30:46.950] ERROR: REQ-1-19: Completion failed: HTTP 503 (623ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:30:46.951] ERROR: WAVE 1: 20/20 requests failed
[21:30:47.002] WARN: WAVE 2: Launching 20 parallel requests...
^C
(root|~)

The script seems to be working, though perhaps it is a little too heavy-handed. We'll see what happens with the PR!

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Dec 21, 2025

With this PR

I'm surprised by the improvement, because I really overloaded the server with this LLM-made script. I think I need to narrow it down, but it's good proof that the PR makes things much more reliable! Some strange behavior remains, such as the HTTP 000 error:

(root|~) ./race.sh
[21:40:00.674] === llama.cpp Data Race Hunter ===
[21:40:00.674] Target: https://www.serveurperso.com/ia/webui
[21:40:00.675] Models: MoE-Qwen3-30B-A3B-Instruct-2507, MoE-Qwen3-30B-A3B-Thinking-2507
[21:40:00.676] Parallel requests: 20
[21:40:00.676] Iterations: 100

[21:40:00.923] WARN: WAVE 1: Launching 20 parallel requests...
[21:40:01.176] REQ-1-3: /v1/models OK (240ms)
[21:40:01.179] REQ-1-18: /v1/models OK (239ms)
[21:40:01.180] REQ-1-12: /v1/models OK (239ms)
[21:40:01.180] REQ-1-9: /v1/models OK (239ms)
[21:40:01.181] REQ-1-15: /v1/models OK (239ms)
[21:40:01.182] REQ-1-6: /v1/models OK (239ms)
[21:40:10.787] ERROR: REQ-1-14: Completion failed: HTTP 500 (9849ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:10.788] ERROR: REQ-1-4: Completion failed: HTTP 500 (9849ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:10.788] ERROR: REQ-1-10: Completion failed: HTTP 500 (9849ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:10.788] ERROR: REQ-1-20: Completion failed: HTTP 500 (9849ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:10.788] ERROR: REQ-1-8: Completion failed: HTTP 500 (9849ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:10.788] ERROR: REQ-1-16: Completion failed: HTTP 500 (9849ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:10.788] ERROR: REQ-1-2: Completion failed: HTTP 500 (9849ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:10.945] ERROR: REQ-1-19: Completion failed: HTTP 000 (10006ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:10.945] ERROR: REQ-1-11: Completion failed: HTTP 000 (10007ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:10.945] ERROR: REQ-1-7: Completion failed: HTTP 000 (10007ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:10.945] ERROR: REQ-1-5: Completion failed: HTTP 000 (10007ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:10.945] ERROR: REQ-1-13: Completion failed: HTTP 000 (10007ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:10.946] ERROR: REQ-1-17: Completion failed: HTTP 000 (10007ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:10.946] ERROR: REQ-1-1: Completion failed: HTTP 000 (10007ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:10.946] ERROR: WAVE 1: 14/20 requests failed
[21:40:10.997] WARN: WAVE 2: Launching 20 parallel requests...
[21:40:11.250] REQ-2-12: /v1/models OK (239ms)
[21:40:11.250] REQ-2-15: /v1/models OK (239ms)
[21:40:11.257] REQ-2-18: /v1/models OK (246ms)
[21:40:11.263] REQ-2-9: /v1/models OK (252ms)
[21:40:11.265] REQ-2-6: /v1/models OK (255ms)
[21:40:11.266] REQ-2-3: /v1/models OK (255ms)
[21:40:11.401] ERROR: REQ-2-1: Completion failed: HTTP 500 (396ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:11.401] ERROR: REQ-2-5: Completion failed: HTTP 500 (396ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:11.401] ERROR: REQ-2-7: Completion failed: HTTP 500 (395ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:11.401] ERROR: REQ-2-11: Completion failed: HTTP 500 (396ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:11.401] ERROR: REQ-2-13: Completion failed: HTTP 500 (395ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:11.401] ERROR: REQ-2-19: Completion failed: HTTP 500 (394ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:11.401] ERROR: REQ-2-17: Completion failed: HTTP 500 (395ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:13.523] REQ-2-2: Completion OK (2518ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:13.608] REQ-2-10: Completion OK (2603ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:13.608] REQ-2-14: Completion OK (2604ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:13.609] REQ-2-16: Completion OK (2602ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:13.869] REQ-2-20: Completion OK (2863ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:13.879] REQ-2-8: Completion OK (2873ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:13.879] REQ-2-4: Completion OK (2873ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:13.880] ERROR: WAVE 2: 7/20 requests failed
[21:40:13.931] WARN: WAVE 3: Launching 20 parallel requests...
[21:40:14.179] REQ-3-6: /v1/models OK (234ms)
[21:40:14.179] REQ-3-12: /v1/models OK (235ms)
[21:40:14.180] REQ-3-3: /v1/models OK (236ms)
[21:40:14.182] REQ-3-15: /v1/models OK (237ms)
[21:40:14.188] REQ-3-9: /v1/models OK (241ms)
[21:40:14.196] REQ-3-18: /v1/models OK (249ms)
[21:40:14.280] REQ-3-4: Completion OK (341ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:14.879] ERROR: REQ-3-14: Completion failed: HTTP 500 (939ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:14.879] ERROR: REQ-3-17: Completion failed: HTTP 500 (939ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:14.879] ERROR: REQ-3-1: Completion failed: HTTP 500 (941ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:14.880] ERROR: REQ-3-5: Completion failed: HTTP 500 (941ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:14.880] ERROR: REQ-3-11: Completion failed: HTTP 500 (939ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:14.880] ERROR: REQ-3-7: Completion failed: HTTP 500 (941ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:15.516] ERROR: REQ-3-19: Completion failed: HTTP 500 (1576ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:15.516] ERROR: REQ-3-13: Completion failed: HTTP 500 (1576ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
^C
(root|~)
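
To see at a glance which failure modes dominate a run like the one above, the log can be tallied by status code. This is only a sketch: `summarize_codes` is a hypothetical helper, not part of race.sh, and it assumes every result line carries the `HTTP NNN` text that the script's `err` messages print.

```bash
# Hypothetical helper (not part of race.sh): count how often each HTTP status
# code appears in the stress-test output read from stdin.
summarize_codes() {
    # Extract every "HTTP NNN" token, group identical codes, print "code count".
    grep -oE 'HTTP [0-9]{3}' | sort | uniq -c | awk '{ print $3 " " $1 }'
}

# Usage (assumption: stderr is merged so the ERROR lines are included):
#   ./race.sh 2>&1 | summarize_codes
```

A run dominated by 500 would point at the server rejecting work, while a run dominated by 000 would point at requests that never received an HTTP response at all.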

@ServeurpersoCom
Collaborator

I'll try to narrow down the HTTP 000 case.

@ngxson
Collaborator Author

ngxson commented Dec 21, 2025

Quite interesting. I think this script could also be useful for testing changes related to batching.

btw, looking at your "No PR" results, I suppose the 503 errors occurred because the test doesn't wait until the server has started, right?
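
If startup timing were the cause, one way to rule it out would be to poll until the endpoint returns a successful status before launching the waves. Below is a minimal sketch; `wait_until_ready` is a hypothetical helper, not part of race.sh. Note that the script's pre-flight check calls curl without `-f`, and curl exits with status 0 on an HTTP 503 unless `-f`/`--fail` is given, so that check alone would not catch a server that answers 503.

```bash
# Hypothetical helper (not part of race.sh): retry a probe command until it
# succeeds or the attempt budget is exhausted.
#   $1 = max attempts, $2 = delay between attempts (seconds), rest = probe command
wait_until_ready() {
    local attempts=$1
    local delay=$2
    shift 2
    local i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0    # probe succeeded
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    return 1            # never became ready
}

# Usage (assumption: BASE_URL as defined in race.sh; -f turns HTTP >= 400 into
# a non-zero curl exit status):
#   wait_until_ready 30 1 curl -sf --connect-timeout 2 -o /dev/null "${BASE_URL}/v1/models" \
#       || { echo "server not ready" >&2; exit 1; }
```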

@ngxson
Collaborator Author

ngxson commented Dec 21, 2025

The HTTP 000 case has a duration of exactly 10 seconds, which seems quite suspicious. It's probably the curl timeout, with the error code defaulting to 000?

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Dec 21, 2025

> Quite interesting. I think this script could also be useful for testing changes related to batching.
> btw, looking at your "No PR" results, I suppose the 503 errors occurred because the test doesn't wait until the server has started, right?

The server was already started. I think the 503 errors in the "No PR" results are the data race your PR fixes!
Your PR works perfectly for the initial concurrency problem. In any case, it does not prevent my server from working, AND it improves reliability against a script written in a brute-force/DoS-attack style.

@ServeurpersoCom
Collaborator

> The HTTP 000 case has a duration of exactly 10 seconds, which seems quite suspicious. It's probably the curl timeout, with the error code defaulting to 000?

So yes, the HTTP 000 at 10s is definitely a curl timeout, with the error code defaulting to 000. The HTTP 000 at 2s is, I think, a reverse proxy timeout.
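
For reference, curl's `%{http_code}` write-out is 000 whenever no HTTP response was received, so the code alone cannot tell a client-side timeout from a dropped connection. curl's exit status can: 28 is an operation timeout, 7 is a failed connect, 52 is an empty reply from the server, and 56 is a failure while receiving data. A minimal sketch follows; `classify_curl_failure` is a hypothetical helper, not part of race.sh.

```bash
# Hypothetical helper (not part of race.sh): explain a failed request from
# curl's exit status ($1) and the %{http_code} write-out value ($2).
classify_curl_failure() {
    local curl_exit=$1
    local http_code=$2

    # A real status code means the server (or a proxy) did answer.
    if [ "$http_code" != "000" ]; then
        echo "http_error"
        return 0
    fi

    case "$curl_exit" in
        28)    echo "client_timeout" ;;      # --connect-timeout or --max-time hit
        7)     echo "connect_failed" ;;      # could not connect to the host
        52|56) echo "connection_dropped" ;;  # empty reply / receive failure
        *)     echo "other_curl_error" ;;
    esac
}

# Usage. Declare the variable first: `local response=$(curl ...)` would
# overwrite $? with the exit status of `local` itself.
#   local response rc
#   response=$(curl -s -w "\n%{http_code}" --max-time 10 "$url"); rc=$?
#   classify_curl_failure "$rc" "$(echo "$response" | tail -1)"
```

Comparing the measured duration against the timeout, as the scripts in this thread do, is a reasonable heuristic; the exit status removes the guesswork when the two values are close.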

@ServeurpersoCom
Collaborator

Better script:

#!/bin/bash
# Stress test
# Tests concurrent access to /v1/models and /v1/chat/completions
# to verify thread-safety of server_context_meta

set -o pipefail

readonly BASE_URL="https://www.serveurperso.com/ia/webui"
readonly MODEL_A="MoE-Qwen3-30B-A3B-Instruct-2507"
readonly MODEL_B="MoE-Qwen3-30B-A3B-Thinking-2507"
readonly ITERATIONS=100
readonly PARALLEL_REQUESTS=20

# Timeout configuration
# /v1/models should respond quickly (metadata read-only)
readonly MODELS_TIMEOUT=5
# Completions can take longer under load
readonly COMPLETION_TIMEOUT=15

# Colors for logging
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
CYAN='\033[0;36m'
NC='\033[0m'

log() { echo -e "${GREEN}[$(date '+%H:%M:%S.%3N')]${NC} $*"; }
err() { echo -e "${RED}[$(date '+%H:%M:%S.%3N')] ERROR:${NC} $*" >&2; }
warn() { echo -e "${YELLOW}[$(date '+%H:%M:%S.%3N')] WARN:${NC} $*"; }
info() { echo -e "${CYAN}[$(date '+%H:%M:%S.%3N')] INFO:${NC} $*"; }

# Generate heavy payload to force cache_prompt processing
generate_payload() {
    python3 -c "print('Test '*500)"
}

# Test /v1/models endpoint - main target for data race detection
# This endpoint reads server metadata and was vulnerable to concurrent access
hammer_models_endpoint() {
    local req_id=$1
    local start=$(date +%s%N)

    local response=$(curl -s -w "\n%{http_code}" \
        "${BASE_URL}/v1/models" \
        -H "Content-Type: application/json" \
        --connect-timeout 2 \
        --max-time $MODELS_TIMEOUT 2>&1)

    local http_code=$(echo "$response" | tail -1)
    local body=$(echo "$response" | head -n -1)
    local duration=$((($(date +%s%N) - start) / 1000000))

    if [[ "$http_code" != "200" ]]; then
        # HTTP 000 with duration < MODELS_TIMEOUT indicates server-side timeout (potential bug)
        # HTTP 000 with duration >= MODELS_TIMEOUT is curl timeout (expected under extreme load)
        if [[ "$http_code" == "000" ]] && (( duration < MODELS_TIMEOUT * 1000 )); then
            err "REQ-${req_id}: /v1/models SERVER TIMEOUT: HTTP $http_code (${duration}ms) - server closed connection early!"
        else
            err "REQ-${req_id}: /v1/models failed: HTTP $http_code (${duration}ms)"
        fi
        echo "$body" | head -5
        return 1
    fi

    # Verify JSON integrity - data race can cause corrupted responses
    if ! echo "$body" | jq -e '.data | length' >/dev/null 2>&1; then
        err "REQ-${req_id}: Invalid JSON response (possible data race!)"
        return 1
    fi

    log "REQ-${req_id}: /v1/models OK (${duration}ms)"
}

# Test completion endpoint to trigger server state transitions
# Heavy payload forces queue pressure and slot allocation
hammer_completion() {
    local req_id=$1
    local model=$2
    local payload=$(generate_payload)

    local start=$(date +%s%N)
    local response=$(curl -s -w "\n%{http_code}" -N \
        "${BASE_URL}/v1/chat/completions" \
        -H "Content-Type: application/json" \
        -d '{
            "model": "'"$model"'",
            "messages": [{"role": "user", "content": "'"$payload"'"}],
            "stream": true,
            "max_tokens": 10,
            "cache_prompt": false
        }' \
        --connect-timeout 2 \
        --max-time $COMPLETION_TIMEOUT 2>&1)

    local http_code=$(echo "$response" | tail -1)
    local duration=$((($(date +%s%N) - start) / 1000000))

    if [[ "$http_code" != "200" ]]; then
        # HTTP 500 is expected when queue is full (fail-fast behavior)
        # HTTP 000 with early timeout indicates server issue
        if [[ "$http_code" == "500" ]]; then
            warn "REQ-${req_id}: Completion queue full: HTTP $http_code (${duration}ms, model: $model) - expected under load"
        elif [[ "$http_code" == "000" ]] && (( duration < COMPLETION_TIMEOUT * 1000 )); then
            err "REQ-${req_id}: Completion SERVER TIMEOUT: HTTP $http_code (${duration}ms, model: $model) - server closed connection early!"
        else
            err "REQ-${req_id}: Completion failed: HTTP $http_code (${duration}ms, model: $model)"
        fi
        return 1
    fi

    log "REQ-${req_id}: Completion OK (${duration}ms, model: $model)"
}

# Parallel assault wave - mix of /v1/models and completions
# This simulates real-world concurrent access patterns
parallel_assault() {
    local wave=$1
    local pids=()
    local failures=0

    info "WAVE $wave: Launching $PARALLEL_REQUESTS parallel requests (mixed /v1/models + completions)"

    # Launch requests in parallel
    # Pattern: 1/3 are /v1/models, 2/3 are completions alternating between MODEL_A and MODEL_B
    for i in $(seq 1 $PARALLEL_REQUESTS); do
        local req_id="${wave}-${i}"

        if (( i % 3 == 0 )); then
            # Test /v1/models endpoint (data race target)
            hammer_models_endpoint "$req_id" &
        elif (( i % 2 == 0 )); then
            # Test completions with MODEL_A
            hammer_completion "$req_id" "$MODEL_A" &
        else
            # Test completions with MODEL_B (forces model switching in multi-model setup)
            hammer_completion "$req_id" "$MODEL_B" &
        fi

        pids+=($!)
    done

    # Wait for all requests and count failures
    for pid in "${pids[@]}"; do
        if ! wait "$pid"; then
            ((failures++))
        fi
    done

    if (( failures > 0 )); then
        err "WAVE $wave: $failures/$PARALLEL_REQUESTS requests failed"
        return 1
    else
        log "WAVE $wave: ALL $PARALLEL_REQUESTS requests succeeded"
        return 0
    fi
}

# Test race condition during model context changes
# This triggers server state transitions while hammering /v1/models
race_during_model_transition() {
    info "Phase 2: Testing /v1/models stability during model transitions"

    # Trigger model activity with MODEL_B
    hammer_completion "TRANSITION-1" "$MODEL_B" &
    local trigger_pid=$!

    # Immediately hammer /v1/models while server handles the completion
    sleep 0.1
    for i in {1..10}; do
        hammer_models_endpoint "TRANSITION-${i}" &
    done

    wait
}

# Main stress test
main() {
    log "=== llama.cpp Data Race Stress Test ==="
    log "Purpose: Detect thread-safety issues in server_context metadata access"
    log "Target: $BASE_URL"
    log "Models: $MODEL_A, $MODEL_B"
    log "Parallel requests per wave: $PARALLEL_REQUESTS"
    log "Total waves: $ITERATIONS"
    log "Timeouts: /v1/models=${MODELS_TIMEOUT}s, completions=${COMPLETION_TIMEOUT}s"
    echo

    # Pre-flight check
    info "Checking server availability..."
    if ! curl -s --connect-timeout 2 "${BASE_URL}/v1/models" >/dev/null; then
        err "Server unreachable at $BASE_URL"
        exit 1
    fi
    log "Server is reachable"
    echo

    local total_failures=0
    local start_time=$(date +%s)

    # Phase 1: Sustained parallel assault
    info "=== Phase 1: Sustained parallel assault ($ITERATIONS waves) ==="
    info "Each wave: $((PARALLEL_REQUESTS / 3)) /v1/models + $((PARALLEL_REQUESTS * 2 / 3)) completions"
    echo

    for wave in $(seq 1 $ITERATIONS); do
        if ! parallel_assault "$wave"; then
            ((total_failures++))
        fi

        # Small delay to observe server state transitions
        sleep 0.05
    done

    echo
    # Phase 2: Race during transitions
    warn "=== Phase 2: Testing during model transitions ==="
    race_during_model_transition

    # Final report
    local duration=$(($(date +%s) - start_time))
    echo
    log "=== Test completed in ${duration}s ==="
    log "Total waves: $ITERATIONS"
    log "Failed waves: $total_failures"
    echo

    if (( total_failures > 0 )); then
        err "FAILED: $total_failures/$ITERATIONS waves had failures"
        err "Possible data race detected - check server logs for:"
        err "  - ThreadSanitizer warnings (if compiled with -fsanitize=thread)"
        err "  - Crashes or assertion failures"
        err "  - Corrupted JSON responses"
        err "  - SERVER TIMEOUT messages (connection closed before curl timeout)"
        exit 1
    else
        log "SUCCESS: All $ITERATIONS waves passed"
        log "No data race detected in this test run"
        log "For comprehensive validation, run with ThreadSanitizer enabled"
        exit 0
    fi
}

main "$@"
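
A side note on the Phase 1 banner printed by the script above: `PARALLEL_REQUESTS * 2 / 3` uses integer division, so for 20 requests it reports 13 completions, while the `i % 3` / `i % 2` dispatch in `parallel_assault` actually sends 14. The sketch below replays that dispatch logic to count the split; `count_split` is a hypothetical helper, not part of the script.

```bash
# Hypothetical helper: replay the dispatch rules of parallel_assault for
# requests 1..n and print "<models> <model_a> <model_b>".
count_split() {
    local n=$1
    local i=1 models=0 model_a=0 model_b=0
    while [ "$i" -le "$n" ]; do
        if [ $((i % 3)) -eq 0 ]; then
            models=$((models + 1))      # /v1/models request
        elif [ $((i % 2)) -eq 0 ]; then
            model_a=$((model_a + 1))    # completion with MODEL_A
        else
            model_b=$((model_b + 1))    # completion with MODEL_B
        fi
        i=$((i + 1))
    done
    echo "$models $model_a $model_b"
}

count_split 20   # prints "6 7 7": 6 /v1/models requests and 14 completions
```

Computing the completion count as the total minus the /v1/models count would keep the banner consistent with the requests actually sent.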

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Dec 21, 2025

I've already restarted the Windows runner. I'll have to test it on my Windows machine! I'm trying a server-queue.cpp/h patch.

@ngxson
Collaborator Author

ngxson commented Dec 21, 2025

> I'm trying a server-queue.cpp/h patch.

If that was for the GGML_ASSERT(idx < states.size()) error, I hope the last commit will fix it

@ServeurpersoCom
Collaborator

prompt processing progress, n_tokens = 1355, batch.n_tokens = 1355, progress = 1.000000
[59717] slot update_slots: id 3 | task 0 | prompt done, n_tokens = 1355, batch.n_tokens = 1355
srv operator(): http client error: Failed to read connection
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv operator(): instance name=Dense-Devstral-Small-2-24B-Instruct-2512 exited with status 1
srv log_server_r: request: GET /v1/models 127.0.0.1 200

This is from my smartphone. I'll check tomorrow morning.

@ngxson
Collaborator Author

ngxson commented Dec 21, 2025

I spotted some more bugs along the way, fixed in the latest commit(s):

  1. index is required by server_response_reader but defaulted to -1, which caused some crashes. The hotfix is to default it to 0, but the proper fix (left as a TODO) is to get rid of the index altogether.
  2. The server_http_req object was deleted too soon. In 121c7e7, I fixed this by tying its lifetime to the response object. This should mimic the exact behavior of httplib's res and req objects.

Edit: hmm, the LoRA endpoint can also cause a data race in the same way /models did. It needs to be fixed too.

@ngxson
Collaborator Author

ngxson commented Dec 22, 2025

So it turns out the endpoints for LoRA hot-swap can also cause a data race, because they read data directly off server_context. I refactored a large part of the LoRA handling to make it safe.

@ServeurpersoCom
Collaborator

Compared to what I did from the phone (where the server wasn't working), after these 5 commits everything works this time: the last DoS script no longer shows an HTTP 000 error, and the server recovers after a while. It seems more robust.

@stephensrmmartin

Is this error related to this PR?

srv    operator(): http client error: Failed to read connection
srv  log_server_r: request: POST /v1/chat/completions 172.17.0.1 500
srv    operator(): instance name=default:latest exited with status 1

This is using the full rocm docker image, running the llama-server with args:

docker run --device /dev/dri --device /dev/kfd -e LLAMA_CACHE=/models -e HSA_OVERRIDE_GFX_VERSION=10.3.0 -v /var/cache/llama.cpp:/models -p 11435:8080 --entrypoint /app/llama-server --detach --name llama.cpp-rocm-7.1-pr localhost/llama.cpp:full-rocm-7.1-pr --models-preset /models/llama.cpp.conf --host 0.0.0.0

I am using the ngxson:xsn/server_data_race branch for building. I am using the sleep option in the models preset file.

If the above error is not related to this PR, let me know and I'll open a different ticket.

Note I did not experience this error prior to updating today.

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Dec 22, 2025

> Is this error related to this PR?
>
> srv    operator(): http client error: Failed to read connection
> srv  log_server_r: request: POST /v1/chat/completions 172.17.0.1 500
> srv    operator(): instance name=default:latest exited with status 1
>
> This is using the full rocm docker image, running the llama-server with args:
>
> docker run --device /dev/dri --device /dev/kfd -e LLAMA_CACHE=/models -e HSA_OVERRIDE_GFX_VERSION=10.3.0 -v /var/cache/llama.cpp:/models -p 11435:8080 --entrypoint /app/llama-server --detach --name llama.cpp-rocm-7.1-pr localhost/llama.cpp:full-rocm-7.1-pr --models-preset /models/llama.cpp.conf --host 0.0.0.0
>
> I am using the ngxson:xsn/server_data_race branch for building. I am using the sleep option in the models preset file.
>
> If the above error is not related to this PR, let me know and I'll open a different ticket.
>
> Note I did not experience this error prior to updating today.

Looks similar to what I hit. When a child exits with an error (status 1), the router seems to keep trying to proxy to it.
I opened #18237 for this case. It could be a ROCm OOM or something else killing your child process.
Quick fix: restart the router. This PR fixes the data race on /v1/models but not the dead-child cleanup yet.
Edit: Not sure, because there is no "Failed to read connection" in my log from my last test...

@ngxson
Collaborator Author

ngxson commented Dec 22, 2025

@stephensrmmartin the best way to isolate the problem is to use a CPU-only build.

@stephensrmmartin

> @stephensrmmartin the best way to isolate the problem is to use a CPU-only build.

It was a red herring. An error was indeed thrown earlier, due to an unrelated issue. The actual issue had to do with using ROCm 7.1; with 7.0, that error does not occur, and the subsequent HTTP error no longer occurs either.

@ngxson
Collaborator Author

ngxson commented Dec 22, 2025

@ServeurpersoCom the tests now pass on CI; I would appreciate it if you could re-run the stress test on your side!

@ServeurpersoCom
Collaborator

> @ServeurpersoCom the tests now pass on CI; I would appreciate it if you could re-run the stress test on your side!

I did it early this morning. There have been no new commits since, and the result is clean: no more "HTTP 000" errors over the last 5 commits.
I was waiting for the AMX runner to start before merging, but it seemed stuck. Good call restarting everything.

Re-testing:

(root|~/scripts) ./race.sh
[12:03:40.850] === llama.cpp Data Race Stress Test ===
[12:03:40.851] Purpose: Detect thread-safety issues in server_context metadata access
[12:03:40.852] Target: https://www.serveurperso.com/ia/webui
[12:03:40.852] Models: MoE-Qwen3-30B-A3B-Instruct-2507, MoE-Qwen3-30B-A3B-Thinking-2507
[12:03:40.853] Parallel requests per wave: 20
[12:03:40.854] Total waves: 100
[12:03:40.854] Timeouts: /v1/models=5s, completions=15s

[12:03:40.855] INFO: Checking server availability...
[12:03:41.077] Server is reachable

[12:03:41.078] INFO: === Phase 1: Sustained parallel assault (100 waves) ===
[12:03:41.079] INFO: Each wave: 6 /v1/models + 13 completions

[12:03:41.080] INFO: WAVE 1: Launching 20 parallel requests (mixed /v1/models + completions)
[12:03:41.319] WARN: REQ-1-1: Completion queue full: HTTP 500 (231ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:41.321] REQ-1-9: /v1/models OK (228ms)
[12:03:41.323] REQ-1-6: /v1/models OK (229ms)
[12:03:41.324] REQ-1-12: /v1/models OK (231ms)
[12:03:41.324] WARN: REQ-1-7: Completion queue full: HTTP 500 (237ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:41.324] WARN: REQ-1-13: Completion queue full: HTTP 500 (236ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:41.328] REQ-1-3: /v1/models OK (234ms)
[12:03:41.329] WARN: REQ-1-11: Completion queue full: HTTP 500 (240ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:41.340] REQ-1-15: /v1/models OK (246ms)
[12:03:41.340] REQ-1-18: /v1/models OK (246ms)
[12:03:41.347] WARN: REQ-1-19: Completion queue full: HTTP 500 (258ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:41.538] WARN: REQ-1-17: Completion queue full: HTTP 500 (448ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:41.543] WARN: REQ-1-5: Completion queue full: HTTP 500 (452ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:41.998] WARN: REQ-1-16: Completion queue full: HTTP 500 (908ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:41.998] WARN: REQ-1-2: Completion queue full: HTTP 500 (910ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:41.999] WARN: REQ-1-8: Completion queue full: HTTP 500 (911ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:41.999] WARN: REQ-1-20: Completion queue full: HTTP 500 (910ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:41.999] WARN: REQ-1-10: Completion queue full: HTTP 500 (909ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:41.999] WARN: REQ-1-14: Completion queue full: HTTP 500 (910ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:41.999] WARN: REQ-1-4: Completion queue full: HTTP 500 (911ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:42.000] ERROR: WAVE 1: 14/20 requests failed
[12:03:42.051] INFO: WAVE 2: Launching 20 parallel requests (mixed /v1/models + completions)
[12:03:42.346] REQ-2-3: /v1/models OK (281ms)
[12:03:42.361] REQ-2-9: /v1/models OK (296ms)
[12:03:42.368] WARN: REQ-2-13: Completion queue full: HTTP 500 (309ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:42.368] REQ-2-15: /v1/models OK (303ms)
[12:03:42.369] WARN: REQ-2-1: Completion queue full: HTTP 500 (311ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:42.371] WARN: REQ-2-5: Completion queue full: HTTP 500 (313ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:42.372] REQ-2-6: /v1/models OK (308ms)
[12:03:42.374] WARN: REQ-2-11: Completion queue full: HTTP 500 (315ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:42.375] REQ-2-12: /v1/models OK (311ms)
[12:03:42.376] WARN: REQ-2-17: Completion queue full: HTTP 500 (316ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:42.376] REQ-2-18: /v1/models OK (311ms)
[12:03:42.619] WARN: REQ-2-7: Completion queue full: HTTP 500 (559ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:42.621] WARN: REQ-2-19: Completion queue full: HTTP 500 (561ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:43.029] WARN: REQ-2-14: Completion queue full: HTTP 500 (970ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:43.029] WARN: REQ-2-2: Completion queue full: HTTP 500 (971ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:43.029] WARN: REQ-2-10: Completion queue full: HTTP 500 (971ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:43.029] WARN: REQ-2-4: Completion queue full: HTTP 500 (971ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:43.029] WARN: REQ-2-8: Completion queue full: HTTP 500 (970ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:43.030] WARN: REQ-2-20: Completion queue full: HTTP 500 (970ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:43.030] WARN: REQ-2-16: Completion queue full: HTTP 500 (970ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:43.030] ERROR: WAVE 2: 14/20 requests failed
[12:03:43.082] INFO: WAVE 3: Launching 20 parallel requests (mixed /v1/models + completions)
[12:03:43.374] WARN: REQ-3-5: Completion queue full: HTTP 500 (285ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:43.374] WARN: REQ-3-11: Completion queue full: HTTP 500 (285ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:43.374] WARN: REQ-3-13: Completion queue full: HTTP 500 (285ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:43.375] WARN: REQ-3-7: Completion queue full: HTTP 500 (286ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:43.379] REQ-3-15: /v1/models OK (284ms)
[12:03:43.380] REQ-3-12: /v1/models OK (284ms)
[12:03:43.380] REQ-3-6: /v1/models OK (285ms)
[12:03:43.381] WARN: REQ-3-1: Completion queue full: HTTP 500 (292ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:43.381] REQ-3-9: /v1/models OK (284ms)
[12:03:43.381] REQ-3-3: /v1/models OK (286ms)
[12:03:43.381] WARN: REQ-3-19: Completion queue full: HTTP 500 (291ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:43.381] WARN: REQ-3-17: Completion queue full: HTTP 500 (291ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:43.386] REQ-3-18: /v1/models OK (290ms)
[12:03:44.017] WARN: REQ-3-16: Completion queue full: HTTP 500 (927ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:44.017] WARN: REQ-3-10: Completion queue full: HTTP 500 (927ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:44.017] WARN: REQ-3-8: Completion queue full: HTTP 500 (928ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:44.017] WARN: REQ-3-14: Completion queue full: HTTP 500 (928ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:44.017] WARN: REQ-3-20: Completion queue full: HTTP 500 (927ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:44.018] WARN: REQ-3-2: Completion queue full: HTTP 500 (929ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:44.018] WARN: REQ-3-4: Completion queue full: HTTP 500 (929ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:44.018] ERROR: WAVE 3: 14/20 requests failed
[12:03:44.070] INFO: WAVE 4: Launching 20 parallel requests (mixed /v1/models + completions)
[12:03:44.325] REQ-4-12: /v1/models OK (242ms)
[12:03:44.326] REQ-4-9: /v1/models OK (243ms)
[12:03:44.329] REQ-4-3: /v1/models OK (245ms)
[12:03:44.331] WARN: REQ-4-1: Completion queue full: HTTP 500 (254ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:44.333] REQ-4-6: /v1/models OK (250ms)
[12:03:44.334] WARN: REQ-4-7: Completion queue full: HTTP 500 (257ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:44.336] REQ-4-18: /v1/models OK (252ms)
[12:03:44.337] REQ-4-15: /v1/models OK (253ms)
[12:03:44.338] WARN: REQ-4-11: Completion queue full: HTTP 500 (261ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:44.342] WARN: REQ-4-17: Completion queue full: HTTP 500 (264ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:44.350] WARN: REQ-4-5: Completion queue full: HTTP 500 (273ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:44.354] WARN: REQ-4-13: Completion queue full: HTTP 500 (276ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:44.354] WARN: REQ-4-19: Completion queue full: HTTP 500 (277ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:44.973] WARN: REQ-4-4: Completion queue full: HTTP 500 (896ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:44.973] WARN: REQ-4-2: Completion queue full: HTTP 500 (896ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:44.973] WARN: REQ-4-16: Completion queue full: HTTP 500 (896ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:44.973] WARN: REQ-4-8: Completion queue full: HTTP 500 (896ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:44.973] WARN: REQ-4-14: Completion queue full: HTTP 500 (895ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:44.973] WARN: REQ-4-10: Completion queue full: HTTP 500 (896ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:44.973] WARN: REQ-4-20: Completion queue full: HTTP 500 (894ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:44.974] ERROR: WAVE 4: 14/20 requests failed
[12:03:45.025] INFO: WAVE 5: Launching 20 parallel requests (mixed /v1/models + completions)
[12:03:45.313] REQ-5-3: /v1/models OK (270ms)
[12:03:45.331] REQ-5-15: /v1/models OK (292ms)
[12:03:45.338] REQ-5-9: /v1/models OK (300ms)
[12:03:45.341] REQ-5-12: /v1/models OK (300ms)
[12:03:45.345] WARN: REQ-5-19: Completion queue full: HTTP 500 (311ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:45.347] REQ-5-18: /v1/models OK (308ms)
[12:03:45.348] REQ-5-6: /v1/models OK (310ms)
[12:03:45.349] WARN: REQ-5-5: Completion queue full: HTTP 500 (316ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:45.356] WARN: REQ-5-7: Completion queue full: HTTP 500 (323ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:45.367] WARN: REQ-5-1: Completion queue full: HTTP 500 (334ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:45.368] WARN: REQ-5-11: Completion queue full: HTTP 500 (334ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:45.370] WARN: REQ-5-13: Completion queue full: HTTP 500 (336ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:45.372] WARN: REQ-5-17: Completion queue full: HTTP 500 (338ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:45.987] WARN: REQ-5-8: Completion queue full: HTTP 500 (954ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:45.987] WARN: REQ-5-2: Completion queue full: HTTP 500 (955ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:45.987] WARN: REQ-5-16: Completion queue full: HTTP 500 (953ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:45.987] WARN: REQ-5-10: Completion queue full: HTTP 500 (955ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:45.987] WARN: REQ-5-4: Completion queue full: HTTP 500 (955ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:45.987] WARN: REQ-5-20: Completion queue full: HTTP 500 (954ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:45.988] WARN: REQ-5-14: Completion queue full: HTTP 500 (954ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:45.988] ERROR: WAVE 5: 14/20 requests failed
[12:03:46.039] INFO: WAVE 6: Launching 20 parallel requests (mixed /v1/models + completions)
[12:03:46.295] REQ-6-3: /v1/models OK (243ms)
[12:03:46.299] REQ-6-9: /v1/models OK (246ms)
[12:03:46.300] REQ-6-12: /v1/models OK (242ms)
[12:03:46.303] WARN: REQ-6-7: Completion queue full: HTTP 500 (256ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:46.306] REQ-6-15: /v1/models OK (252ms)
[12:03:46.306] WARN: REQ-6-17: Completion queue full: HTTP 500 (259ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:46.309] REQ-6-6: /v1/models OK (256ms)
[12:03:46.314] WARN: REQ-6-5: Completion queue full: HTTP 500 (267ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:46.314] WARN: REQ-6-11: Completion queue full: HTTP 500 (267ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:46.315] WARN: REQ-6-1: Completion queue full: HTTP 500 (267ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:46.317] WARN: REQ-6-19: Completion queue full: HTTP 500 (269ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:46.320] REQ-6-18: /v1/models OK (265ms)
[12:03:46.549] WARN: REQ-6-13: Completion queue full: HTTP 500 (500ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:46.953] WARN: REQ-6-10: Completion queue full: HTTP 500 (907ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:46.953] WARN: REQ-6-8: Completion queue full: HTTP 500 (905ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:46.953] WARN: REQ-6-4: Completion queue full: HTTP 500 (907ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:46.954] WARN: REQ-6-16: Completion queue full: HTTP 500 (905ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:46.954] WARN: REQ-6-20: Completion queue full: HTTP 500 (905ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:46.954] WARN: REQ-6-2: Completion queue full: HTTP 500 (907ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:46.954] WARN: REQ-6-14: Completion queue full: HTTP 500 (906ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:46.955] ERROR: WAVE 6: 14/20 requests failed
[12:03:47.006] INFO: WAVE 7: Launching 20 parallel requests (mixed /v1/models + completions)
[12:03:47.268] REQ-7-6: /v1/models OK (248ms)
[12:03:47.269] REQ-7-15: /v1/models OK (249ms)
[12:03:47.270] REQ-7-18: /v1/models OK (250ms)
[12:03:47.276] REQ-7-12: /v1/models OK (256ms)
[12:03:47.294] WARN: REQ-7-7: Completion queue full: HTTP 500 (280ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:47.296] WARN: REQ-7-11: Completion queue full: HTTP 500 (282ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:47.296] REQ-7-3: /v1/models OK (276ms)
[12:03:47.298] REQ-7-9: /v1/models OK (278ms)
[12:03:47.299] WARN: REQ-7-17: Completion queue full: HTTP 500 (285ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:47.305] WARN: REQ-7-5: Completion queue full: HTTP 500 (291ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:47.309] WARN: REQ-7-13: Completion queue full: HTTP 500 (293ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:47.311] WARN: REQ-7-1: Completion queue full: HTTP 500 (297ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:47.315] WARN: REQ-7-19: Completion queue full: HTTP 500 (301ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:47.940] WARN: REQ-7-16: Completion queue full: HTTP 500 (926ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:47.940] WARN: REQ-7-14: Completion queue full: HTTP 500 (925ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:47.941] WARN: REQ-7-4: Completion queue full: HTTP 500 (927ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:47.941] WARN: REQ-7-8: Completion queue full: HTTP 500 (927ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:47.941] WARN: REQ-7-20: Completion queue full: HTTP 500 (926ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:47.941] WARN: REQ-7-10: Completion queue full: HTTP 500 (927ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:47.941] WARN: REQ-7-2: Completion queue full: HTTP 500 (927ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:47.941] ERROR: WAVE 7: 14/20 requests failed
[12:03:47.993] INFO: WAVE 8: Launching 20 parallel requests (mixed /v1/models + completions)
[12:03:48.252] REQ-8-3: /v1/models OK (245ms)
[12:03:48.252] REQ-8-6: /v1/models OK (245ms)
[12:03:48.258] WARN: REQ-8-1: Completion queue full: HTTP 500 (256ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:48.263] REQ-8-9: /v1/models OK (256ms)
[12:03:48.263] WARN: REQ-8-5: Completion queue full: HTTP 500 (262ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:48.265] WARN: REQ-8-7: Completion queue full: HTTP 500 (264ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:48.265] WARN: REQ-8-11: Completion queue full: HTTP 500 (264ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:48.266] REQ-8-15: /v1/models OK (258ms)
[12:03:48.266] REQ-8-18: /v1/models OK (257ms)
[12:03:48.268] REQ-8-12: /v1/models OK (260ms)
[12:03:48.271] WARN: REQ-8-17: Completion queue full: HTTP 500 (270ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:48.489] WARN: REQ-8-19: Completion queue full: HTTP 500 (488ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:48.493] WARN: REQ-8-13: Completion queue full: HTTP 500 (491ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:48.910] WARN: REQ-8-14: Completion queue full: HTTP 500 (908ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:48.910] WARN: REQ-8-20: Completion queue full: HTTP 500 (908ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:48.910] WARN: REQ-8-16: Completion queue full: HTTP 500 (908ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:48.910] WARN: REQ-8-2: Completion queue full: HTTP 500 (909ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:48.910] WARN: REQ-8-10: Completion queue full: HTTP 500 (909ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:48.910] WARN: REQ-8-4: Completion queue full: HTTP 500 (909ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:48.910] WARN: REQ-8-8: Completion queue full: HTTP 500 (909ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:48.911] ERROR: WAVE 8: 14/20 requests failed
[12:03:48.963] INFO: WAVE 9: Launching 20 parallel requests (mixed /v1/models + completions)
[12:03:49.220] REQ-9-18: /v1/models OK (242ms)
[12:03:49.221] REQ-9-6: /v1/models OK (244ms)
[12:03:49.224] WARN: REQ-9-5: Completion queue full: HTTP 500 (254ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:49.230] REQ-9-3: /v1/models OK (253ms)
[12:03:49.234] REQ-9-9: /v1/models OK (256ms)
[12:03:49.234] REQ-9-15: /v1/models OK (257ms)
[12:03:49.235] REQ-9-12: /v1/models OK (258ms)
[12:03:49.238] WARN: REQ-9-1: Completion queue full: HTTP 500 (267ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:49.239] WARN: REQ-9-11: Completion queue full: HTTP 500 (268ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:49.241] WARN: REQ-9-13: Completion queue full: HTTP 500 (271ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:49.244] WARN: REQ-9-7: Completion queue full: HTTP 500 (272ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:49.258] WARN: REQ-9-19: Completion queue full: HTTP 500 (286ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:49.452] WARN: REQ-9-17: Completion queue full: HTTP 500 (480ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:49.882] WARN: REQ-9-2: Completion queue full: HTTP 500 (911ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:49.883] WARN: REQ-9-20: Completion queue full: HTTP 500 (911ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:49.883] WARN: REQ-9-16: Completion queue full: HTTP 500 (912ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:49.883] WARN: REQ-9-8: Completion queue full: HTTP 500 (912ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:49.883] WARN: REQ-9-10: Completion queue full: HTTP 500 (912ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:49.883] WARN: REQ-9-4: Completion queue full: HTTP 500 (913ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:49.883] WARN: REQ-9-14: Completion queue full: HTTP 500 (912ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:49.884] ERROR: WAVE 9: 14/20 requests failed
[12:03:49.935] INFO: WAVE 10: Launching 20 parallel requests (mixed /v1/models + completions)
[12:03:50.238] REQ-10-3: /v1/models OK (290ms)
[12:03:50.271] REQ-10-6: /v1/models OK (322ms)
[12:03:50.272] REQ-10-12: /v1/models OK (323ms)
[12:03:50.273] REQ-10-9: /v1/models OK (324ms)
[12:03:50.282] REQ-10-15: /v1/models OK (333ms)
[12:03:50.285] WARN: REQ-10-1: Completion queue full: HTTP 500 (343ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:50.286] REQ-10-18: /v1/models OK (338ms)
[12:03:50.290] WARN: REQ-10-17: Completion queue full: HTTP 500 (348ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:50.302] WARN: REQ-10-13: Completion queue full: HTTP 500 (360ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:50.309] WARN: REQ-10-5: Completion queue full: HTTP 500 (366ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:50.311] WARN: REQ-10-7: Completion queue full: HTTP 500 (369ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:50.320] WARN: REQ-10-11: Completion queue full: HTTP 500 (377ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:50.368] WARN: REQ-10-19: Completion queue full: HTTP 500 (424ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:50.946] WARN: REQ-10-20: Completion queue full: HTTP 500 (1003ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:50.946] WARN: REQ-10-14: Completion queue full: HTTP 500 (1004ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:50.947] WARN: REQ-10-2: Completion queue full: HTTP 500 (1005ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:50.947] WARN: REQ-10-10: Completion queue full: HTTP 500 (1003ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:50.947] WARN: REQ-10-16: Completion queue full: HTTP 500 (1003ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:50.947] WARN: REQ-10-4: Completion queue full: HTTP 500 (1004ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:50.947] WARN: REQ-10-8: Completion queue full: HTTP 500 (1005ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:50.947] ERROR: WAVE 10: 14/20 requests failed
[12:03:50.999] INFO: WAVE 11: Launching 20 parallel requests (mixed /v1/models + completions)
[12:03:51.261] REQ-11-3: /v1/models OK (250ms)
[12:03:51.262] REQ-11-9: /v1/models OK (249ms)
[12:03:51.274] REQ-11-6: /v1/models OK (262ms)
[12:03:51.274] WARN: REQ-11-5: Completion queue full: HTTP 500 (268ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:51.279] REQ-11-12: /v1/models OK (267ms)
[12:03:51.279] WARN: REQ-11-11: Completion queue full: HTTP 500 (272ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:51.279] REQ-11-15: /v1/models OK (267ms)
[12:03:51.280] REQ-11-18: /v1/models OK (267ms)
[12:03:51.281] WARN: REQ-11-1: Completion queue full: HTTP 500 (275ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:51.284] WARN: REQ-11-7: Completion queue full: HTTP 500 (278ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:51.288] WARN: REQ-11-19: Completion queue full: HTTP 500 (281ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:51.295] WARN: REQ-11-13: Completion queue full: HTTP 500 (288ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:51.302] WARN: REQ-11-17: Completion queue full: HTTP 500 (294ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:51.927] WARN: REQ-11-4: Completion queue full: HTTP 500 (920ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:51.927] WARN: REQ-11-2: Completion queue full: HTTP 500 (921ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:51.927] WARN: REQ-11-8: Completion queue full: HTTP 500 (921ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:51.928] WARN: REQ-11-10: Completion queue full: HTTP 500 (921ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:51.928] WARN: REQ-11-14: Completion queue full: HTTP 500 (920ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:51.928] WARN: REQ-11-20: Completion queue full: HTTP 500 (920ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:51.928] WARN: REQ-11-16: Completion queue full: HTTP 500 (921ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:51.928] ERROR: WAVE 11: 14/20 requests failed
[12:03:51.980] INFO: WAVE 12: Launching 20 parallel requests (mixed /v1/models + completions)
[12:03:52.279] REQ-12-6: /v1/models OK (285ms)
[12:03:52.305] REQ-12-12: /v1/models OK (310ms)
[12:03:52.308] REQ-12-3: /v1/models OK (315ms)
[12:03:52.313] REQ-12-9: /v1/models OK (319ms)
[12:03:52.315] WARN: REQ-12-7: Completion queue full: HTTP 500 (328ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:52.317] REQ-12-18: /v1/models OK (323ms)
[12:03:52.321] WARN: REQ-12-5: Completion queue full: HTTP 500 (333ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:52.325] REQ-12-15: /v1/models OK (328ms)
[12:03:52.325] WARN: REQ-12-1: Completion queue full: HTTP 500 (337ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:52.334] WARN: REQ-12-11: Completion queue full: HTTP 500 (346ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:52.356] WARN: REQ-12-13: Completion queue full: HTTP 500 (367ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:52.367] WARN: REQ-12-17: Completion queue full: HTTP 500 (378ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:52.367] WARN: REQ-12-19: Completion queue full: HTTP 500 (378ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:52.973] WARN: REQ-12-16: Completion queue full: HTTP 500 (985ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:52.973] WARN: REQ-12-4: Completion queue full: HTTP 500 (985ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:52.973] WARN: REQ-12-20: Completion queue full: HTTP 500 (985ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:52.973] WARN: REQ-12-2: Completion queue full: HTTP 500 (986ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:52.974] WARN: REQ-12-8: Completion queue full: HTTP 500 (985ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:52.974] WARN: REQ-12-10: Completion queue full: HTTP 500 (986ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:52.974] WARN: REQ-12-14: Completion queue full: HTTP 500 (985ms, model: MoE-Qwen3-30B-A3B-Instruct-2507) - expected under load
[12:03:52.975] ERROR: WAVE 12: 14/20 requests failed
[12:03:53.026] INFO: WAVE 13: Launching 20 parallel requests (mixed /v1/models + completions)
[12:03:53.343] REQ-13-3: /v1/models OK (304ms)
[12:03:53.364] REQ-13-6: /v1/models OK (325ms)
[12:03:53.371] WARN: REQ-13-1: Completion queue full: HTTP 500 (338ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:53.372] REQ-13-12: /v1/models OK (331ms)
[12:03:53.373] WARN: REQ-13-11: Completion queue full: HTTP 500 (340ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:53.374] WARN: REQ-13-5: Completion queue full: HTTP 500 (341ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:53.376] WARN: REQ-13-13: Completion queue full: HTTP 500 (342ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:53.376] REQ-13-9: /v1/models OK (336ms)
[12:03:53.379] REQ-13-15: /v1/models OK (339ms)
[12:03:53.382] WARN: REQ-13-7: Completion queue full: HTTP 500 (347ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:53.382] REQ-13-18: /v1/models OK (341ms)
[12:03:53.384] WARN: REQ-13-19: Completion queue full: HTTP 500 (349ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
[12:03:53.384] WARN: REQ-13-17: Completion queue full: HTTP 500 (349ms, model: MoE-Qwen3-30B-A3B-Thinking-2507) - expected under load
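
A run this long is easier to judge from a tally than by eye. The sketch below is a hypothetical helper, not part of race.sh: `tally_log` and the `race.log` file name are assumptions for illustration. It counts the three outcomes that matter in the output above and treats only "HTTP 000" as a failure, on the assumption that 000 is the code the script reports when a request got no HTTP response at all, while the "Completion queue full: HTTP 500" warnings are expected under load.

```bash
#!/bin/bash
# Hypothetical helper (not part of race.sh): tally the outcome of a saved run.
# Assumes the log uses the message formats shown in the output above.

tally_log() {
    local log_file="$1"
    local ok queue_full dropped

    # grep -c prints the number of matching lines but exits 1 when that
    # number is 0, so "|| true" keeps this usable under "set -e".
    ok=$(grep -cF '/v1/models OK' "$log_file" || true)
    queue_full=$(grep -cF 'HTTP 500' "$log_file" || true)
    dropped=$(grep -cF 'HTTP 000' "$log_file" || true)

    echo "models_ok=${ok} queue_full=${queue_full} http_000=${dropped}"

    # Succeed only when no request was dropped without a response.
    [ "$dropped" -eq 0 ]
}
```

One way to use it would be to save the output with `./race.sh | tee race.log` and then call `tally_log race.log`; the function's exit status is non-zero only if at least one "HTTP 000" line is present.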

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Dec 22, 2025

For the last commit: applied and retested, with about the same result and no regression for my standard use.

@ngxson
Collaborator Author

ngxson commented Dec 22, 2025

I think this should be pretty much ready to merge then.

The last 2 commits are nits, so we don't need to test again; it's fine as long as the CI workflow passes.

It would be nice if you could have a quick look at the code and approve, @ServeurpersoCom. I also have this mirrored PR with AI-assisted review: ngxson#63

@ServeurpersoCom
Collaborator

I always test for my Linux/Nvidia use case on several models in practice and do some reviews with GPT High/Codex + Claude Thinking/Code before merging :)

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Dec 22, 2025

macOS-latest-cmake-arm64 failed here but passed on https://github.com/ngxson/llama.cpp/actions/runs/20431225417/job/58702205221?pr=63. I'll retry at the end.

@ngxson
Collaborator Author

ngxson commented Dec 22, 2025

Yes, it's expected that some tests may fail randomly on hosted hardware. For server changes, it is enough that the server workflows pass.

@ngxson ngxson merged commit 6ce863c into ggml-org:master Dec 22, 2025
69 of 71 checks passed

Development

Successfully merging this pull request may close these issues.

server: (bug) data race on /v1/models and LoRA endpoints
