
Conversation

@ngxson ngxson commented Dec 21, 2025

Make sure to read the contributing guidelines before submitting a PR

Summary by CodeRabbit

  • Bug Fixes

    • More robust streaming response lifecycle and request lifetime handling to reduce dropped/partial streams.
    • Loading-state behavior tightened: non-root endpoints now return loading status while initializing.
  • Refactor

    • Server metadata and route responses enriched and centralized — endpoints now expose more model and UI metadata.
    • Per-request adapter (LORA) handling moved to per-call scope for predictable per-request behavior.
    • Task batching and indexing improved; tasks can be posted with front (high) priority for faster handling.


coderabbitai bot commented Dec 21, 2025

Walkthrough

Replaced server_context::get_info() with an expanded server_context::get_meta(); moved response generation to queue-based constructors and a create_response() factory; changed LoRA handling from vector to std::map<int,float>; made tokens_to_str vocab-centric; introduced request ownership for streaming HTTP responses and priority posting for tasks.
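To make the vector → map LoRA change concrete, here is a minimal sketch; the simplified adapter struct and the explicit `base` parameter are assumptions (per the review comments further down, the real `construct_lora_list` takes only the config map and reads the base list from `params_base.lora_adapters`).

```cpp
// Minimal sketch of map-based per-request LoRA handling; the adapter struct
// is simplified and the explicit `base` parameter is an assumption.
#include <map>
#include <vector>

struct common_adapter_lora_info {
    float scale = 0.0f; // the real struct also carries path, pointers, etc.
};

// copy the base adapters, apply per-request scale overrides, disable the rest
static std::vector<common_adapter_lora_info> construct_lora_list(
        const std::vector<common_adapter_lora_info> & base,
        const std::map<int, float>                  & config) {
    std::vector<common_adapter_lora_info> output = base; // copy
    for (size_t i = 0; i < output.size(); ++i) {
        auto it = config.find((int) i);
        output[i].scale = (it != config.end()) ? it->second : 0.0f;
    }
    return output;
}
```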

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Context metadata & routes**<br>`tools/server/server-context.h`, `tools/server/server-context.cpp`, `tools/server/server.cpp` | Renamed `server_context_info` → `server_context_meta` and `get_info()` → `get_meta()` with many new fields; `server_routes` now holds a meta pointer, uses `create_response(bool)` to build response generators, and `update_meta()` is invoked after load; route construction signatures adjusted. |
| **HTTP request lifetime & streaming**<br>`tools/server/server-http.cpp` | Added `using server_http_req_ptr = std::unique_ptr<server_http_req>`; changed `process_handler_response()` to accept the request as an rvalue and capture request ownership into streaming lambdas to ensure proper lifetime and cleanup (see the sketch below the table). |
| **Queue and response reader API**<br>`tools/server/server-queue.h`, `tools/server/server-queue.cpp` | `post_task`/`post_tasks` gained a `bool front` parameter (default `false`) for priority posting; tasks are assigned explicit indices before state creation; `has_next`/`wait_for_all`/result assembly updated to use explicit indices and cancellation state. |
| **LoRA parsing & token-string API**<br>`tools/server/server-common.h`, `tools/server/server-common.cpp` | `parse_lora_request()` signature changed to `std::map<int, float> parse_lora_request(const json & data)` (removed base vector overload); `tokens_to_str` helper refactored to accept `const llama_vocab *`, with a new vocab-based overload and a context-based API delegating to it. |
| **Task types, indexing, and LoRA result shape**<br>`tools/server/server-task.h`, `tools/server/server-task.cpp` | `task_params.lora` and `set_lora` changed from `std::vector<...>` to `std::map<int, float>`; added `SERVER_TASK_TYPE_GET_LORA` and `server_task_result_get_lora`; moved `index` to `server_task`/`server_task_result` as `size_t`; removed many `get_index()` virtuals; `params_from_json_cmpl()` now accepts `const llama_vocab *` and `n_ctx_slot`. |
| **CLI usage**<br>`tools/cli/cli.cpp` | Replaced `ctx_server.get_info()` calls with `ctx_server.get_meta()` to obtain server metadata (subsequent usage unchanged). |
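Here is that request-lifetime sketch, with heavily simplified types; the real code also wires a chunked content provider, and the HTTP layer is assumed to hold its own reference to the response while `on_complete` runs.

```cpp
// Sketch only: unique ownership at the handler boundary is promoted to shared
// ownership so streaming callbacks can extend the request's lifetime.
#include <functional>
#include <memory>

struct server_http_req { /* headers, body, should_stop, ... */ };
struct server_http_res {
    std::function<void()> on_complete;
};

using server_http_req_ptr = std::unique_ptr<server_http_req>;

static void process_handler_response(server_http_req_ptr && req,
                                     std::shared_ptr<server_http_res> res) {
    // promote to shared_ptr: the streaming lambda below now co-owns the request
    std::shared_ptr<server_http_req> req_shared = std::move(req);
    res->on_complete = [req_shared, res]() mutable {
        req_shared.reset(); // request released only once the stream completes
        res.reset();        // breaks the cycle created by capturing `res` here
    };
}
```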

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant HTTP as HTTP Handler
    participant Routes as server_routes
    participant Queue as Task Queue
    participant Worker as Background Worker
    participant ResGen as server_res_generator

    Client->>HTTP: POST /v1/completions (streaming)
    HTTP->>HTTP: allocate request (unique_ptr)
    HTTP->>Routes: routes.create_response(bypass_sleep)
    Routes->>ResGen: create response generator (queues + sleep params)
    HTTP->>ResGen: post_task(task, front=false)  -- moves request into handler
    HTTP->>Client: send initial headers (200)

    Worker->>Queue: fetch next task
    Worker->>Worker: process task (use model/vocab, per-request LORA)
    Worker->>Queue: post_result(result with index)

    ResGen->>Queue: wait_for_all / has_next
    Queue-->>ResGen: deliver ordered results by index
    ResGen->>Client: stream chunk(s)
    alt more chunks
        ResGen->>Queue: wait_for_next
        Queue-->>ResGen: next result
        ResGen->>Client: stream chunk
    end

    Note right of ResGen: request shared ownership captured by streaming lambda\nand reset on completion
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Files needing extra attention:
    • tools/server/server-http.cpp — verify moved request ownership, lambda captures, and destruction ordering to avoid use-after-free.
    • tools/server/server-queue.* — confirm thread-safety and correctness of front-priority posting, index assignment, and batch result assembly.
    • tools/server/server-context.* & tools/server/server.cpp — ensure all call sites updated for get_meta() and that update_meta() is invoked at correct times.
    • tools/server/server-task.* & tools/server/server-common.* — validate LoRA parse semantics, serialization, and tokenization changes from context→vocab plus n_ctx_slot usage.

Poem

🐇
I hopped through queues and meta-fields,
Mapped each LoRA and kept my shields,
Tokens now told by vocab's song,
Requests held safe the whole stream-long,
A little rabbit cheers the build.

Pre-merge checks

❌ Failed checks (2 warnings, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The PR description contains only a template reminder, without the substantive explanation of the changes, objectives, or technical details required for code review. | Provide a detailed description explaining the data race issue, its root cause, the solution implemented, and how the changes in each file address the problem. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 20.00%, below the required threshold of 80.00%. | Run `@coderabbitai generate docstrings` to improve docstring coverage. |
| Title check | ❓ Inconclusive | The title "(FOR CI) Xsn/server data race" is vague and uses non-descriptive parenthetical notation that doesn't clearly convey what change is being made. | Replace it with a clear, specific title describing the main fix, e.g., "Fix server data race condition in context and response handling" or "Add thread-safety protections to server metadata and task queue". |



Comment @coderabbitai help to get the list of available commands and usage tips.

@ngxson ngxson marked this pull request as ready for review December 22, 2025 10:55

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tools/server/server-task.cpp (1)

1151-1157: Static variable in to_json_anthropic causes state leakage across requests.

The static bool text_block_started variable at line 1154 persists across different requests/invocations, which can cause incorrect behavior when handling multiple concurrent streaming requests. Each request should have its own state.

🔎 Proposed fix

Consider moving this state into the result object or tracking it differently. The state should be per-request, not global. Since this is in a partial result's to_json, the tracking should be part of the result state or derived from n_decoded and oaicompat_msg_diffs content.

```diff
 json server_task_result_cmpl_partial::to_json_anthropic() {
     json events = json::array();
     bool first = (n_decoded == 1);
-    static bool text_block_started = false;
+    // Determine if text block was started by checking if we have prior content
+    // This needs to be tracked in the result state, not as a static variable
+    bool text_block_started = false;  // TODO: track this in task_result_state

     if (first) {
         text_block_started = false;
```

Note: A complete fix requires tracking text_block_started in the per-request state (e.g., in task_result_state).
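A hedged sketch of that direction, with the flag promoted to a per-request member (the struct and field names here are hypothetical):

```cpp
// Hypothetical per-request state replacing the function-local static: each
// streaming request owns its flag, so concurrent streams cannot interfere.
struct anthropic_stream_state {
    int  n_decoded          = 0;
    bool text_block_started = false; // reset per request, not per process
};
```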

🧹 Nitpick comments (3)
tools/server/server-queue.cpp (1)

390-408: Correct batch result handling with index validation.

The wait_for_all implementation:

  • Pre-allocates results vector to correct size (line 393)
  • Uses result->index for proper placement (line 404)
  • Includes bounds check and duplicate detection assertions (lines 405-406)

The batch_res.results.clear() on line 392 is redundant since results is default-initialized empty, but it's harmless.

🔎 Optional: Remove redundant clear()
```diff
 server_response_reader::batch_response server_response_reader::wait_for_all(const std::function<bool()> & should_stop) {
     batch_response batch_res;
-    batch_res.results.clear();
     batch_res.results.resize(id_tasks.size());
```
tools/server/server-common.cpp (1)

118-129: Consider validating lora id before insertion.

The function accepts id = -1 as a default when parsing fails, but inserts it into the map without validation. A negative id might indicate a parsing error or invalid input.

🔎 Proposed validation
```diff
 std::map<int, float> parse_lora_request(const json & data) {
     std::map<int, float> lora;

     // set value
     for (const auto & entry : data) {
         int id      = json_value(entry, "id", -1);
+        if (id < 0) {
+            throw std::invalid_argument("lora entry must have a valid 'id' field");
+        }
         float scale = json_value(entry, "scale", 0.0f);
         lora[id] = scale;
     }

     return lora;
 }
```
tools/server/server-context.cpp (1)

3071-3082: Consider a cleaner pattern for preventing accidental ctx_server access.

The bool server_ctx; shadowing trick (lines 3076-3078) is creative but fragile. A future maintainer might remove it thinking it's dead code. Consider alternative approaches:

  • Extract sleeping-safe handlers into a separate helper that doesn't capture ctx_server
  • Use a static analysis annotation or [[maybe_unused]] with a more descriptive name

The current approach works but could be more robust.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e1f15b4 and 35a3bc3.

📒 Files selected for processing (11)
  • tools/cli/cli.cpp
  • tools/server/server-common.cpp
  • tools/server/server-common.h
  • tools/server/server-context.cpp
  • tools/server/server-context.h
  • tools/server/server-http.cpp
  • tools/server/server-queue.cpp
  • tools/server/server-queue.h
  • tools/server/server-task.cpp
  • tools/server/server-task.h
  • tools/server/server.cpp
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{cpp,h,hpp}: Always format C++ code using git clang-format before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary

Files:

  • tools/server/server.cpp
  • tools/server/server-queue.h
  • tools/server/server-http.cpp
  • tools/cli/cli.cpp
  • tools/server/server-common.cpp
  • tools/server/server-task.h
  • tools/server/server-common.h
  • tools/server/server-context.cpp
  • tools/server/server-queue.cpp
  • tools/server/server-task.cpp
  • tools/server/server-context.h
**/*.{cpp,h,hpp,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible

Files:

  • tools/server/server.cpp
  • tools/server/server-queue.h
  • tools/server/server-http.cpp
  • tools/cli/cli.cpp
  • tools/server/server-common.cpp
  • tools/server/server-task.h
  • tools/server/server-common.h
  • tools/server/server-context.cpp
  • tools/server/server-queue.cpp
  • tools/server/server-task.cpp
  • tools/server/server-context.h
🧠 Learnings (4)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-12-20T21:18:14.768Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 62
File: tools/server/server-context.cpp:591-603
Timestamp: 2025-12-20T21:18:14.768Z
Learning: In ngxson/llama.cpp server implementation, holding mutex_tasks during callback_sleeping_state() (which calls handle_sleeping_state) is intentional behavior. No tasks should be allowed to be pushed into the queue during sleeping state transition, even though this blocks all queue operations (post, defer, wait_until_no_sleep) for the duration of model unload/reload (potentially minutes).

Applied to files:

  • tools/server/server-queue.h
  • tools/server/server-context.cpp
  • tools/server/server-queue.cpp
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to src/llama.cpp : Maintain core library implementation in `src/llama.cpp` with careful attention to API contracts and backward compatibility

Applied to files:

  • tools/server/server-common.cpp
  • tools/server/server-common.h
  • tools/server/server-task.cpp
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to include/llama.h : Exercise careful consideration when making changes to the public API in `include/llama.h`, as API stability is critical

Applied to files:

  • tools/server/server-common.cpp
🧬 Code graph analysis (4)
tools/server/server-queue.h (1)
tools/server/server-queue.cpp (5)
  • post_task (332-339)
  • post_task (332-332)
  • task (421-421)
  • post_tasks (341-351)
  • post_tasks (341-341)
tools/server/server-common.h (1)
tools/server/server-common.cpp (8)
  • parse_lora_request (118-129)
  • parse_lora_request (118-118)
  • tokens_to_str (1426-1433)
  • tokens_to_str (1426-1426)
  • tokens_to_str (1435-1438)
  • tokens_to_str (1435-1435)
  • tokens_to_str (1440-1442)
  • tokens_to_str (1440-1440)
tools/server/server-context.cpp (3)
tools/server/server-common.cpp (8)
  • are_lora_equal (131-144)
  • are_lora_equal (131-133)
  • lora_should_clear_cache (104-116)
  • lora_should_clear_cache (104-106)
  • oaicompat_chat_params_parse (832-1071)
  • oaicompat_chat_params_parse (832-835)
  • format_prompt_rerank (1639-1681)
  • format_prompt_rerank (1639-1644)
common/common.cpp (6)
  • lora (1171-1173)
  • lora (1171-1171)
  • common_token_to_piece (1550-1554)
  • common_token_to_piece (1550-1550)
  • common_token_to_piece (1556-1570)
  • common_token_to_piece (1556-1556)
tools/server/server-common.h (1)
  • json_value (35-47)
tools/server/server-task.cpp (1)
tools/server/server-common.cpp (2)
  • parse_lora_request (118-129)
  • parse_lora_request (118-118)
🔇 Additional comments (49)
tools/server/server-common.h (2)

110-110: LGTM! Simplified LoRA parsing API.

The signature change from std::vector<common_adapter_lora_info> to std::map<int, float> simplifies the API by directly returning an id-to-scale mapping. This aligns with the PR's goal of streamlining LoRA handling and is consistent with the implementation in server-common.cpp (lines 117-128).


326-326: LGTM! Vocab-centric token conversion overload.

Adding this overload that accepts const llama_vocab * directly enables token-to-string conversion without requiring a full context pointer. This supports the PR's thread-safety improvements by allowing conversion using vocabulary metadata rather than context state.
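As a rough sketch of how the overload pair can fit together: the delegation via llama_get_model + llama_model_get_vocab is described later in this review, and the body of the vocab-based core shown here is only one plausible implementation using llama.h's llama_token_to_piece.

```cpp
#include <string>
#include <vector>
#include "llama.h"

// vocab-based core: one plausible body using the public llama.h API
std::string tokens_to_str(const llama_vocab * vocab, const std::vector<llama_token> & tokens) {
    std::string out;
    for (llama_token t : tokens) {
        char buf[128];
        const int n = llama_token_to_piece(vocab, t, buf, sizeof(buf), 0, true);
        if (n > 0) {
            out.append(buf, n);
        }
    }
    return out;
}

// context-based overload delegating to the vocab-based one
std::string tokens_to_str(const llama_context * ctx, const std::vector<llama_token> & tokens) {
    const llama_model * model = llama_get_model(ctx);
    return tokens_to_str(llama_model_get_vocab(model), tokens);
}
```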

tools/server/server-queue.h (4)

8-8: LGTM!

Required include for the new std::vector<std::function<void(bool)>> callback collection.


29-31: LGTM! Multiple sleeping state callbacks support.

Changing from a single callback to a vector allows multiple components to register for sleeping state notifications. The comment at lines 84-85 correctly notes these must be registered before start_loop() is called, which ensures thread-safety during iteration.


100-102: LGTM!

Using push_back to accumulate callbacks aligns with the vector-based storage.


177-180: LGTM! Priority task posting support.

The new front parameter enables high-priority task insertion, useful for time-sensitive operations. The implementation in server-queue.cpp correctly handles index assignment for batched tasks.
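A sketch of both server-queue.h changes together; the member and method names are assumptions based on the comments above.

```cpp
#include <deque>
#include <functional>
#include <vector>

struct server_task_stub { int id = 0; };

struct server_queue_sketch {
    // register all observers before start_loop(); afterwards the vector is
    // only read, so iteration needs no extra locking
    std::vector<std::function<void(bool)>> callback_sleeping_state;
    std::deque<server_task_stub>           tasks;

    void on_sleeping_state_changed(std::function<void(bool)> cb) {
        callback_sleeping_state.push_back(std::move(cb));
    }

    void post(server_task_stub task, bool front = false) {
        if (front) {
            tasks.push_front(std::move(task)); // high priority, e.g. cancellation
        } else {
            tasks.push_back(std::move(task));
        }
    }
};
```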

tools/server/server-task.h (9)

9-9: LGTM!

Required include for std::map usage in LoRA adapter handling.


27-27: LGTM!

New task type SERVER_TASK_TYPE_GET_LORA complements the existing SERVER_TASK_TYPE_SET_LORA for read operations.


65-65: LGTM! Map-based LoRA configuration.

Using std::map<int, float> for adapter ID to scale mapping is cleaner than the previous vector-based approach and aligns with the simplified parse_lora_request API.


110-113: Acknowledged TODO for future refactoring.

The size_t type is appropriate for array indexing. The TODO noting plans to move index mapping to response_reader is a reasonable future improvement.


145-145: LGTM!

Consistent with the task_params.lora type change to std::map<int, float>.


155-159: Good change for thread-safety: decoupling from context.

Replacing llama_context * ctx with const llama_vocab * vocab plus int n_ctx_slot reduces dependency on the full context object. This supports the PR's goal of preventing data races from HTTP threads by using stable vocabulary data rather than mutable context state.


170-178: LGTM! Simplified create_child signature.

Removing the idx parameter simplifies the API since index management is now handled elsewhere (at task posting time per the server-queue.cpp implementation).


219-221: Acknowledged TODO for future refactoring.

Moving the index field to the base server_task_result struct centralizes batched task mapping. The TODO indicates this is a stepping stone toward a cleaner mapping approach in response_reader.


445-454: LGTM! New result type for GET_LORA tasks.

The server_task_result_get_lora struct properly encapsulates LoRA adapter information including the info, invocation string, and tokens. The nested lora struct provides good organization.

tools/server/server-context.h (6)

12-41: LGTM! Expanded metadata structure.

The renamed server_context_meta struct (from server_context_info) provides comprehensive model and vocabulary metadata. This expansion supports the meta-driven routing approach and reduces the need to access context state directly from HTTP handlers.


60-61: Good thread-safety documentation.

The comment clarifying that get_llama_context() is not thread-safe and should only be used from the main thread is valuable for preventing accidental misuse.


66-68: LGTM! Thread-safety documented.

The get_meta() accessor replaces get_info() with appropriate thread-safety documentation.


76-83: Simplified constructor and guarded update pattern.

The simplified constructor and update_meta() method provide a clear pattern for metadata initialization. The comment at line 80 correctly documents the thread-safety constraint.


110-115: LGTM!

Updated signature to include files parameter for multimodal support.


121-129: LGTM! Proper encapsulation with const correctness.

  • Using std::unique_ptr<const server_context_meta> allows late initialization while preserving const semantics for the metadata.
  • The const server_context_impl & reference enforces read-only access to the server context from routes.
  • Adding queue_tasks and queue_results references enables proper task/response flow without exposing the full context.
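A condensed sketch of the pattern these comments describe (field list abbreviated; that get_meta() returns the struct by value is an assumption):

```cpp
#include <memory>
#include <string>

struct server_context_meta {
    std::string model_name;
    int         slot_n_ctx = 0;
    // ... many more model/vocab fields per the review
};

struct server_routes_sketch {
    std::unique_ptr<const server_context_meta> meta; // late-initialized, then read-only

    // main thread only: called once after load_model() succeeds and before the
    // HTTP layer flips is_ready, so handlers never observe a null meta
    void update_meta(const server_context_meta & m) {
        meta = std::make_unique<const server_context_meta>(m);
    }
};
```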
tools/cli/cli.cpp (1)

219-219: LGTM!

Updated to use the new get_meta() API. All accessed fields (has_inp_image, has_inp_audio, build_info, model_name) are available in the expanded server_context_meta structure.

tools/server/server.cpp (2)

122-122: LGTM - Cleaner routes initialization.

Removing the readiness predicate from the constructor simplifies the initialization and decouples route creation from HTTP readiness state. The route handlers will be protected by the middleware that checks is_ready before dispatching requests.


255-256: Good fix for data race prevention.

Calling routes.update_meta(ctx_server) after successful model load ensures that metadata is populated only when the model context is fully initialized. Combined with is_ready.store(true) on the next line, this guarantees clients won't access incomplete metadata.

tools/server/server-task.cpp (5)

35-37: LGTM - Clean map-based lora serialization.

The iteration over this->lora as a map with {first, second} pairs is consistent with the new std::map<int, float> representation from parse_lora_request.


147-151: Good API change to vocab-centric approach.

Changing from llama_context* to llama_vocab* with explicit n_ctx_slot parameter:

  1. Reduces coupling to full context object
  2. Prevents potential data races by not accessing context during parameter parsing
  3. Makes the slot-specific context size explicit rather than derived

This aligns with the PR objective of preventing data races from HTTP threads.


222-230: Simplified lora parsing.

The new parse_lora_request returns a map directly from the JSON data without requiring the base adapter vector. The empty map fallback on line 229 is appropriate when no lora is specified.


242-248: Correct use of n_ctx_slot for penalty defaults.

Using n_ctx_slot instead of llama_n_ctx(ctx) ensures the penalty window defaults are based on the slot's context size, which is semantically correct for per-slot processing.


1325-1347: LGTM - Well-structured lora result serialization.

The to_json() implementation properly:

  • Iterates through loras with correct indexing
  • Includes all required fields (id, path, scale, task_name, prompt_prefix)
  • Conditionally adds alora fields only when tokens are present
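A hedged sketch of that serialization shape; the adapter field names and the alora JSON keys are guesses guided by the keys listed above, not the actual code.

```cpp
#include <string>
#include <vector>
#include <nlohmann/json.hpp>
using json = nlohmann::ordered_json;

struct lora_entry { // assumed fields, mirroring the listed JSON keys
    std::string      path, task_name, prompt_prefix;
    float            scale = 0.0f;
    std::string      alora_invocation_string;
    std::vector<int> alora_tokens;
};

json loras_to_json(const std::vector<lora_entry> & loras) {
    json result = json::array();
    for (size_t i = 0; i < loras.size(); ++i) {
        json entry = {
            {"id",            i},
            {"path",          loras[i].path},
            {"scale",         loras[i].scale},
            {"task_name",     loras[i].task_name},
            {"prompt_prefix", loras[i].prompt_prefix},
        };
        if (!loras[i].alora_tokens.empty()) { // alora fields only when tokens exist
            entry["alora_invocation_string"] = loras[i].alora_invocation_string;
            entry["alora_invocation_tokens"] = loras[i].alora_tokens;
        }
        result.push_back(std::move(entry));
    }
    return result;
}
```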
tools/server/server-http.cpp (3)

181-195: Correct 503 handling during server loading.

Returning 503 with a JSON error for non-HTML endpoints when the server is not ready is the correct behavior. The comment explains the rationale: preventing data races and inconsistent states by blocking all endpoint access during loading.


336-371: Good request lifetime management for streaming.

The changes properly address request/response lifecycle during streaming:

  1. server_http_req_ptr (unique_ptr) provides clear ownership semantics
  2. Both request and response are converted to shared_ptr for the streaming path (lines 345-346)
  3. The on_complete lambda captures both and resets them to ensure proper destruction order

This fixes the data race where the httplib request object could be destroyed before the response stream completes.


373-398: LGTM - Consistent request allocation pattern.

Both get and post handlers now:

  1. Allocate request as unique_ptr
  2. Dereference for handler call
  3. Move ownership into process_handler_response

This ensures the request outlives any streaming response.
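A sketch of that three-step pattern against cpp-httplib's real Post() signature; the server_http_req contents, handler, and process_handler_response bodies are stubbed assumptions.

```cpp
#include <memory>
#include "httplib.h"

struct server_http_req { /* filled from httplib::Request in the real code */ };
struct server_http_res { /* status, body or stream generator */ };
using server_http_req_ptr = std::unique_ptr<server_http_req>;

static server_http_res handler(server_http_req &) { return {}; }
static void process_handler_response(server_http_req_ptr &&, server_http_res &&) {}

void register_route(httplib::Server & svr) {
    svr.Post("/completions", [](const httplib::Request &, httplib::Response &) {
        auto req = std::make_unique<server_http_req>();            // 1) allocate as unique_ptr
        auto res = handler(*req);                                  // 2) dereference for the call
        process_handler_response(std::move(req), std::move(res));  // 3) move ownership in
    });
}
```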

tools/server/server-queue.cpp (3)

166-181: LGTM - Multi-callback sleeping state handling.

Iterating over callback_sleeping_state collection allows multiple observers to be notified when entering/exiting sleep state. This is cleaner than a single callback when multiple components need to react to state changes.


332-351: Proper task index initialization and front parameter.

Key improvements:

  • task.index = 0 for single task (line 334)
  • tasks[i].index = i for batch tasks (line 346)
  • front parameter enables priority posting for cancellation tasks

The explicit index assignment ensures results can be correctly associated with their originating tasks.
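Sketched below with trimmed types (names assumed from the comments above): index 0 for a single task, sequential indices for a batch, and the front flag forwarded to the queue.

```cpp
#include <set>
#include <utility>
#include <vector>

struct server_task { int id = 0; size_t index = 0; };

struct task_sink { // stands in for the real task queue
    void post(server_task &&, bool /*front*/) {}
    void post(std::vector<server_task> &&, bool /*front*/) {}
};

struct response_reader_sketch {
    std::set<int> id_tasks;
    task_sink     queue;

    void post_task(server_task && task, bool front = false) {
        task.index = 0;                 // a single task always maps to slot 0
        id_tasks.insert(task.id);
        queue.post(std::move(task), front);
    }

    void post_tasks(std::vector<server_task> && tasks, bool front = false) {
        for (size_t i = 0; i < tasks.size(); ++i) {
            tasks[i].index = i;         // explicit index before state creation
            id_tasks.insert(tasks[i].id);
        }
        queue.post(std::move(tasks), front);
    }
};
```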


376-378: Index-based result association.

Using result->index to locate the correct state for updating is consistent with the task index assignment. The GGML_ASSERT(idx < states.size()) provides a safety check.

tools/server/server-common.cpp (1)

1426-1442: LGTM - Clean vocab-centric token-to-string conversion.

The changes properly:

  1. Update the private template helper to accept const llama_vocab*
  2. Add public overload that extracts vocab from context via llama_get_model + llama_model_get_vocab
  3. Add direct vocab-based public overload for callers that already have the vocab

This aligns with the broader vocab-centric refactoring in the PR.

tools/server/server-context.cpp (13)

509-532: LGTM - well-documented thread safety constraints.

The friend declaration enables proper pimpl pattern access, and the destructor correctly handles the sleeping state to prevent double-free. The comments on lines 513-515 clearly document the thread-safety requirements for accessing the public pointers.


1051-1062: LGTM - per-request LoRA configuration implementation.

The function correctly constructs a per-request LoRA list by copying the base adapters and applying scale overrides from the configuration map. Adapters not specified in the config are disabled (scale = 0.0f), which is the expected behavior for per-request LoRA.


2820-2829: LGTM - proper sleep state synchronization.

The constructor correctly handles the sleeping state to prevent data races. When sleep_idle_seconds < 0, sleeping is disabled so bypass is set. Otherwise, wait_until_no_sleep() ensures the server is awake before proceeding, addressing the data race concerns mentioned in the PR objectives.


3055-3065: LGTM - clean factory pattern and constructor initialization.

The create_response() factory properly passes the sleep parameters, and the constructor correctly initializes references to the internal components. This design decouples route handlers from the public server_context interface while maintaining proper access to the task and result queues.


3253-3295: LGTM - props endpoint correctly uses meta for thread-safe access.

The handler properly uses the meta struct for model/context information instead of accessing ctx_server directly, which aligns with the goal of preventing data races during sleeping state. The queue_tasks.is_sleeping() call should be thread-safe.


3695-3716: LGTM - data race fix for LoRA endpoint.

This handler correctly addresses the data race by using the task queue to retrieve LoRA adapter information instead of directly accessing params_base.lora_adapters. The SERVER_TASK_TYPE_GET_LORA task ensures the read happens on the main processing thread, preventing concurrent access issues from HTTP threads.


3718-3745: LGTM - consistent task-based pattern for setting LoRA adapters.

The handler follows the same thread-safe pattern as get_lora_adapters, using the task queue to modify LoRA settings. The parse_lora_request call converts the request body to the expected std::map<int, float> format used by construct_lora_list.


2846-2857: LGTM - completions handler properly synchronizes with sleeping state.

The create_response() call ensures the server is awake before proceeding, making subsequent ctx_server access safe. The signature change to pass server_http_req by const reference is appropriate.


3565-3566: LGTM - consistent use of vocab-based token-to-piece API.

This confirms the common_token_to_piece(vocab, token) signature is the standard API. The usage is correct and consistent with the function signature shown in the relevant code snippets.


3600-3601: LGTM - vocab-based detokenization.

The usage of tokens_to_str(ctx_server.vocab, tokens) is consistent with the vocab-centric API design throughout the refactored code.


3840-3841: LGTM - uses meta for pooling type check.

The check correctly uses meta->pooling_type instead of directly querying the context, consistent with the meta-centric design for accessing model properties.


2801-2802: No action needed. Both overloads of common_token_to_piece exist in common/common.h: one accepting const llama_context* and another accepting const llama_vocab*. Lines 2801-2802 correctly use the context-based overload, while line 1825 correctly uses the vocab-based overload. Both usages are valid.

Likely an incorrect or invalid review comment.


2972-2973: Request lifetime is safely managed during streaming, but the reference capture pattern is fragile.

The lambda on line 2972 captures &req by reference and is stored in res->next. While this appears risky at first glance, the request object remains valid during streaming because:

  1. The server_http_req object is converted to a shared_ptr in process_handler_response() (line ~343)
  2. Both the request and response are kept alive as shared pointers during streaming via chunked_content_provider() and on_complete()
  3. The streaming callback is only invoked while these shared pointers are in scope

However, this pattern relies on implicit lifetime guarantees from cpp-httplib's streaming mechanism. To improve robustness and clarity, consider capturing the request by value or using req.should_stop in a way that doesn't depend on reference capture semantics.
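One hedged shape of that suggestion, assuming the stop predicate can be exposed as a copyable std::function (a hypothetical accessor, not the current API):

```cpp
#include <functional>

struct response_sketch {
    std::function<bool()> next; // returns whether streaming should continue
};

// capture the predicate by value: the chunk callback holds no reference to the
// request object, so the request's lifetime no longer matters here
void wire_stream(response_sketch & res, std::function<bool()> should_stop) {
    res.next = [should_stop]() { return !should_stop(); };
}
```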


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
tools/server/server-context.cpp (1)

1051-1062: Consider validating LORA adapter indices or documenting the behavior.

The construct_lora_list function silently ignores adapter indices in the config map that are out of bounds (>= params_base.lora_adapters.size()). While this might be intentional for API flexibility, it could lead to silent failures that are hard to debug.

Consider either:

  1. Adding validation to return an error if invalid indices are provided
  2. Adding a warning log when indices are ignored
  3. Documenting this behavior clearly in the function comment
🔎 Example: Add validation
```diff
 std::vector<common_adapter_lora_info> construct_lora_list(const std::map<int, float> & config) {
     std::vector<common_adapter_lora_info> output = params_base.lora_adapters; // copy
+    // Validate config indices
+    for (const auto & [idx, scale] : config) {
+        if (idx < 0 || idx >= (int)output.size()) {
+            SRV_WRN("ignoring invalid lora adapter index %d (valid range: 0-%d)\n", idx, (int)output.size() - 1);
+        }
+    }
     for (size_t i = 0; i < output.size(); ++i) {
         auto it = config.find(i);
         if (it != config.end()) {
             output[i].scale = it->second;
         } else {
             output[i].scale = 0.0f;
         }
     }
     return output;
 }
```
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 35a3bc3 and 982eda7.

📒 Files selected for processing (1)
  • tools/server/server-context.cpp
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{cpp,h,hpp}: Always format C++ code using git clang-format before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary

Files:

  • tools/server/server-context.cpp
**/*.{cpp,h,hpp,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible

Files:

  • tools/server/server-context.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-12-20T21:18:14.768Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 62
File: tools/server/server-context.cpp:591-603
Timestamp: 2025-12-20T21:18:14.768Z
Learning: In ngxson/llama.cpp server implementation, holding mutex_tasks during callback_sleeping_state() (which calls handle_sleeping_state) is intentional behavior. No tasks should be allowed to be pushed into the queue during sleeping state transition, even though this blocks all queue operations (post, defer, wait_until_no_sleep) for the duration of model unload/reload (potentially minutes).

Applied to files:

  • tools/server/server-context.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
  • GitHub Check: ubuntu-24-wasm-webgpu
  • GitHub Check: android-ndk-build (arm64-snapdragon, --preset arm64-android-snapdragon-release)
  • GitHub Check: ubuntu-24-cmake-webgpu
  • GitHub Check: windows-msys2 (CLANG64, clang-x86_64, Release)
  • GitHub Check: ios-xcode-build
  • GitHub Check: ggml-ci-x64-cpu-low-perf
  • GitHub Check: android-build
  • GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
  • GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
  • GitHub Check: ubuntu-latest-llguidance
  • GitHub Check: ubuntu-24-cmake-vulkan
  • GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
  • GitHub Check: macOS-latest-cmake-x64
  • GitHub Check: ubuntu-latest-cmake-rpc
  • GitHub Check: ubuntu-24-cmake-vulkan-deb
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: server-windows
🔇 Additional comments (2)
tools/server/server-context.cpp (2)

2822-2829: LGTM: Well-structured refactoring of response generation.

The refactoring to centralize response generation through create_response() and the bypass_sleep mechanism is well-designed:

  1. Clear separation: Endpoints that only need metadata use bypass_sleep=true, while those needing context access use bypass_sleep=false
  2. Consistent pattern: All route handlers follow the same pattern of calling create_response() first
  3. Safety: The wait_until_no_sleep() call ensures ctx_server members aren't accessed during sleeping state
  4. Documentation: The dummy variable pattern (lines 3077-3078, 3258-3259) helps document sleeping state constraints

The approach aligns with your preference for simplification (based on learnings) while maintaining safety.

Also applies to: 3055-3057, 3067-3082


3059-3065: No action required: meta initialization and thread-safety are properly designed.

The meta member is intentionally not initialized in the constructor. It uses std::unique_ptr for late initialization via the update_meta() method, which is called only when ctx_http.is_ready is false. Thread-safety is ensured by the atomic is_ready flag: handlers (which access meta) are only invoked after is_ready is set to true, establishing proper happens-before semantics. The documented comments and atomic synchronization primitive confirm this design is intentional and correct.

Likely an incorrect or invalid review comment.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tools/server/server-context.cpp (1)

2970-3014: Capturing req by reference in the streaming lambda may cause use-after-free.

The lambda at line 2972 captures &req, but this lambda is stored in res->next and may be invoked after the original handle_completions_impl function returns. If req goes out of scope before the lambda is called, accessing req.should_stop() (lines 2985, 3014) would be undefined behavior.

Verify that server_http_req outlives the streaming response. If the HTTP framework guarantees req remains valid for the duration of streaming, this is safe. Otherwise, capture necessary state by value or use a different mechanism.

```bash
#!/bin/bash
# Check the lifetime of server_http_req and how streaming responses are handled
rg -n "server_http_req" --type cpp -A 5 -B 2 | head -100
```
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 982eda7 and b1b8098.

📒 Files selected for processing (3)
  • tools/server/server-context.cpp
  • tools/server/server-queue.cpp
  • tools/server/server-queue.h
🚧 Files skipped from review as they are similar to previous changes (1)
  • tools/server/server-queue.h
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{cpp,h,hpp}: Always format C++ code using git clang-format before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary

Files:

  • tools/server/server-context.cpp
  • tools/server/server-queue.cpp
**/*.{cpp,h,hpp,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible

Files:

  • tools/server/server-context.cpp
  • tools/server/server-queue.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-12-20T21:18:14.768Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 62
File: tools/server/server-context.cpp:591-603
Timestamp: 2025-12-20T21:18:14.768Z
Learning: In ngxson/llama.cpp server implementation, holding mutex_tasks during callback_sleeping_state() (which calls handle_sleeping_state) is intentional behavior. No tasks should be allowed to be pushed into the queue during sleeping state transition, even though this blocks all queue operations (post, defer, wait_until_no_sleep) for the duration of model unload/reload (potentially minutes).

Applied to files:

  • tools/server/server-context.cpp
  • tools/server/server-queue.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: build-cmake-pkg / linux
  • GitHub Check: ggml-ci-arm64-cpu-high-perf
  • GitHub Check: windows-2022-cmake-cuda (12.4)
  • GitHub Check: ggml-ci-arm64-cpu-high-perf-sve
  • GitHub Check: ggml-ci-arm64-cpu-kleidiai
  • GitHub Check: ggml-ci-arm64-cpu-low-perf
  • GitHub Check: ubuntu-22-cmake-sycl-fp16
  • GitHub Check: ios-xcode-build
  • GitHub Check: windows-latest-cmake-hip
  • GitHub Check: macOS-latest-cmake-arm64-webgpu
  • GitHub Check: ubuntu-22-cmake-hip
  • GitHub Check: ubuntu-24-wasm-webgpu
  • GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
  • GitHub Check: macOS-latest-cmake-x64
  • GitHub Check: ubuntu-latest-cmake-rpc
  • GitHub Check: ubuntu-24-cmake-webgpu
  • GitHub Check: ubuntu-24-cmake-vulkan
  • GitHub Check: server-windows
🔇 Additional comments (20)
tools/server/server-queue.cpp (4)

328-334: LGTM!

The post_task function correctly initializes task.index = 0 before posting and properly forwards the front parameter to the queue. The assertion ensures single-use semantics per reader.


337-346: LGTM!

The post_tasks function correctly assigns sequential indices to each task and forwards the front parameter. The index assignment before state creation ensures proper ordering for batch result collection.


370-375: LGTM!

Using result->index directly is cleaner than a getter method. The bounds check at line 373 ensures safe access to the states vector.


386-406: LGTM!

The batch result handling correctly:

  1. Pre-sizes the results vector to match id_tasks.size()
  2. Uses result->index for placement with proper bounds assertion
  3. Detects duplicate results with the nullptr check

The assertions at lines 401-402 provide good runtime validation for invariants.

tools/server/server-context.cpp (16)

509-532: LGTM!

The restructuring of server_context_impl with explicit friend declaration and clear public/private separation improves encapsulation. The destructor correctly avoids double-free by checking the sleeping state before calling destroy().


1051-1062: LGTM!

The construct_lora_list function correctly:

  1. Creates a copy of the base LoRA adapters
  2. Applies per-request scale overrides from the config map
  3. Defaults unspecified adapters to scale 0.0f

This is cleaner than the previous vector-based approach.


1064-1082: LGTM!

The updated LoRA handling in launch_slot_with_task correctly uses the new construct_lora_list helper and maintains the cache-clearing logic when LoRA configurations change.


1811-1836: Variable shadowing issue has been addressed.

The inner loop variable is now correctly named j (line 1824) instead of i, which fixes the variable shadowing bug flagged in the previous review. The outer loop's i at line 1817 is no longer shadowed.


1837-1849: LGTM!

The SERVER_TASK_TYPE_SET_LORA case correctly uses construct_lora_list with the new map-based request format and provides useful logging.


2785-2813: LGTM!

The get_meta() function provides a clean, consolidated way to expose server metadata. The structure initialization is complete and properly accesses internal state through the impl pointer.


2818-2829: LGTM!

The server_res_generator constructor properly handles the bypass_sleep optimization by short-circuiting when sleeping is disabled (sleep_idle_seconds < 0).


3055-3065: LGTM!

The create_response helper and server_routes constructor properly initialize the routing infrastructure with references to the queue and server context.


3071-3082: Good defensive practice with the dummy variable.

The pattern of declaring bool server_ctx; with the comment "do NOT delete this line" is a clever way to prevent accidental use of ctx_server in endpoints that should work during sleeping state. This provides compile-time safety.


3253-3293: LGTM!

The get_props endpoint properly accesses metadata through the meta pointer and correctly uses queue_tasks.is_sleeping() to report sleeping state. The conditional addition of chat_template_tool_use is clean.


3565-3566: Prefer vocab-based tokenization for consistency.

The call to common_token_to_piece(ctx_server.vocab, token) correctly uses the vocab directly rather than going through the context, which aligns with the refactoring mentioned in the AI summary.


3600-3601: Consistent with vocab-centric approach.

tokens_to_str(ctx_server.vocab, tokens) correctly uses the vocabulary directly, maintaining consistency with the refactoring.


3695-3716: LGTM!

The get_lora_adapters handler correctly posts a task and waits for the result using the new task-based approach for LoRA information retrieval.


3718-3745: LGTM!

The post_lora_adapters handler correctly uses parse_lora_request to convert the request body and posts the task. The dynamic cast assertion ensures type safety.


2846-2857: The initialization flow is correct; no action needed. meta is guaranteed to be initialized before route handlers execute. update_meta() is called immediately after load_model() succeeds (line 255 of server.cpp) and before ctx_http.is_ready is set to true (line 256). Route handlers won't be called until is_ready is true, ensuring meta->slot_n_ctx and meta->model_name are always available when accessed.


3833-3840: Ensure consistent pooling_type checks between reranking and embeddings endpoints.

Line 3617 checks params.pooling_type for the reranking endpoint, while line 3840 checks meta->pooling_type for embeddings. These represent different values (user-configured vs. model's native type). Verify this intentional difference is correct: reranking validates user configuration, while embeddings validates model capability. If both should check the same value or if user-provided pooling overrides should be considered, update for consistency.
