
Conversation

@ParticularlyPythonicBS (Member) commented Sep 23, 2025

Internal Bucket Support for Team-Only Datasets

This PR introduces comprehensive support for internal (team-only) R2 buckets alongside the existing production (public) bucket system.

Datasets published to the internal bucket are visible via `datamanager pull` for users with the appropriate API access, or in the Cloudflare bucket page.

Configuration Required

Add to your .env file:

R2_INTERNAL_BUCKET="your-internal-bucket-name"
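
For reference, the internal bucket name is consumed through the settings object alongside the existing production bucket. A minimal sketch of the intended routing, using the names from the review discussion below rather than the exact project code:

from datamanager.config import settings  # assumes the existing settings object

def target_bucket_for(alias: str | None) -> str:
    # "internal" routes to the bucket configured via R2_INTERNAL_BUCKET;
    # anything else falls back to the existing production bucket setting.
    return settings.internal_bucket if alias == "internal" else settings.bucket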

Summary by CodeRabbit

  • New Features

    • Added internal (team-only) bucket support: datasets record a target bucket (production|internal). CLI prepare/pull gain a --bucket option, interactive flows differentiate production vs internal, listings show per-dataset bucket, and publish/delete/deploy operations target each dataset’s bucket.
  • Documentation

    • Updated README, setup, usage, workflow, and env examples to document multi-bucket behavior and the R2_INTERNAL_BUCKET variable.
  • Tests

    • Expanded tests to cover internal-bucket flows, alias resolution, permission checks across production/staging/internal, and related workflows.
  • Chores

    • CI workflow exposes R2_INTERNAL_BUCKET to the publish step; pre-commit tooling updated.

- Add multi-bucket architecture supporting production (public) and internal (team-only) R2 buckets
- Implement bucket routing based on dataset manifest configuration
- Add --bucket option to prepare and pull commands
- Update CI/CD workflows to handle internal bucket publications and deletions

coderabbitai bot commented Sep 23, 2025

Walkthrough

Adds per-dataset bucket support (production vs internal) across CI publish script, CLI, core R2 APIs, config, manifest, docs, env, and tests. Operations (prepare, pull, verify, delete, publish) now target a dataset's configured bucket; manifest entries include a new "bucket" field and helpers to get/update it.
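
For illustration, a manifest entry with the new field might look roughly like this (field names are taken from the helpers and tests reviewed below; the values and exact schema are hypothetical):

manifest_entry = {
    "fileName": "example.sqlite",   # looked up by get_dataset_bucket()/update_dataset_bucket()
    "bucket": "internal",           # new field: "production" (the default) or "internal"
    "history": [
        {"r2_object_key": "example/example_v1.sqlite", "sha256": "<hash>"},
    ],
}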

Changes

  • Publish script bucketization (.github/scripts/publish_script.py): Adds an INTERNAL_BUCKET constant, groups objects by target bucket, performs bucket-specific batch deletes and server-side copies, and updates logs/commit messages to include destination bucket context.
  • Workflow env injection (.github/workflows/publish.yml): Exposes R2_INTERNAL_BUCKET to the Publish Script step environment.
  • Core R2 API bucket support (src/datamanager/core.py, src/datamanager/config.py): Adds the internal_bucket setting; resolves bucket aliases; extends upload/download/delete/pull_and_verify to accept an optional bucket and compute the target bucket; adds upload_to_internal; verify checks internal bucket permissions.
  • CLI bucket-aware flows (src/datamanager/__main__.py): Adds a --bucket option to prepare and pull; threads the bucket through logic and R2 keys; adds an internal interactive prepare path; persists the dataset bucket in the manifest; updates menus and signatures.
  • Manifest helpers and data (src/datamanager/manifest.py, manifest.json, tests/conftest.py): Adds get_dataset_bucket and update_dataset_bucket and includes them in __all__; manifest entries gain a "bucket" field; a new internal dataset entry is added; test fixture manifests are updated.
  • Docs: setup/usage/workflow/README (README.md, docs/source/setup.md, docs/source/usage.md, docs/source/workflow.md): Documents the internal bucket concept, adds R2_INTERNAL_BUCKET to env docs and examples, and updates CLI examples and the workflow narrative to reflect multi-bucket behavior and access notes.
  • Env example (env.example): Adds an R2_INTERNAL_BUCKET placeholder variable.
  • Tests for bucket-aware behavior (tests/test_core.py, tests/test_main.py, tests/test_manifest.py): Expands verify tests to include the internal bucket; adds and updates tests asserting manifest bucket defaults and propagation; adds tests for internal prepare/pull flows and alias resolution.
  • Pre-commit bump (.pre-commit-config.yaml): Bumps the uv-pre-commit rev and adds a pyupgrade hook.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant CLI as CLI (__main__.py)
  participant Manifest as Manifest
  participant Core as Core (R2 APIs)
  participant R2Prod as R2: Production
  participant R2Int as R2: Internal

  rect rgb(240,248,255)
  note over User,CLI: Prepare dataset
  User->>CLI: prepare --bucket=<production|internal>
  CLI->>Manifest: update_dataset_bucket(name, bucket)
  CLI->>Core: upload_to_staging(file, staging_key, bucket)
  CLI->>Manifest: write history entry (staging_key, bucket)
  end

  rect rgb(245,255,240)
  note over User,CLI: Pull dataset
  User->>CLI: pull --bucket=<production|internal>
  CLI->>Manifest: get_dataset_bucket(name)
  CLI->>Core: pull_and_verify(object_key, sha, out, bucket)
  alt bucket=production
    Core->>R2Prod: GET object_key
  else bucket=internal
    Core->>R2Int: GET object_key
  end
  Core-->>CLI: verified
  CLI-->>User: file saved
  end
sequenceDiagram
  autonumber
  participant GH as GitHub Actions
  participant Pub as publish_script.py
  participant R2P as R2: Production
  participant R2I as R2: Internal

  rect rgb(255,250,240)
  note over GH,Pub: Publish step
  GH->>Pub: env (R2_INTERNAL_BUCKET,…)
  Pub->>Pub: Group objects by bucket
  par Delete production objects
    Pub->>R2P: Batch DELETE(keys)
  and Delete internal objects
    Pub->>R2I: Batch DELETE(keys)
  end
  par Copy production objects
    Pub->>R2P: Server-side COPY(staging_key -> object_key)
  and Copy internal objects
    Pub->>R2I: Server-side COPY(staging_key -> object_key)
  end
  Pub-->>GH: logs incl. bucket context
  end
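
The diagram above compresses the publish step; a rough boto3 sketch of the per-bucket grouping, batch delete, and server-side copy it describes (the helper name and object structure are illustrative, not the actual publish_script.py code):

from collections import defaultdict

def publish_grouped(client, objects, production_bucket, internal_bucket, staging_bucket):
    """objects: iterable of dicts with 'bucket' ('production'|'internal'),
    'object_key', and 'staging_key' fields (assumed shape)."""
    by_bucket = defaultdict(list)
    for obj in objects:
        target = internal_bucket if obj["bucket"] == "internal" else production_bucket
        by_bucket[target].append(obj)

    for bucket, objs in by_bucket.items():
        # Batch-delete stale keys in the target bucket (the S3 API caps at 1000 keys per call).
        keys = [{"Key": o["object_key"]} for o in objs]
        for i in range(0, len(keys), 1000):
            client.delete_objects(Bucket=bucket, Delete={"Objects": keys[i : i + 1000]})
        # Server-side copy each staged object into its target bucket.
        for o in objs:
            client.copy_object(
                Bucket=bucket,
                Key=o["object_key"],
                CopySource={"Bucket": staging_bucket, "Key": o["staging_key"]},
            )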

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

A hop to Prod, a skip inside,
Two buckets where the bytes now ride.
I stash my carrots, staged just right,
Then copy under moonlit night.
With paws on manifests I twirl—internal secrets, public swirl. 🥕✨

Pre-merge checks

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title succinctly and accurately summarizes the primary change (adding support for an internal, team-only R2 bucket), aligns with the PR objectives and with the code, config, docs, and test changes described in the summary, and is concise, specific, and free of noise.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, above the required 80.00% threshold.


coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/test_manifest.py (1)

507-515: Fix failing expectation: pass bucket alias to core or adjust assertion

CI shows the 4th arg to pull_and_verify is a real bucket name (e.g., "DUMMY") instead of "internal". Either:

  • Pass the alias through to core and let core resolve it, or
  • Update the test to check against settings.internal_bucket.

Given the rest of the suite expects aliases at boundaries, prefer passing the alias and resolving inside core.

Apply these diffs:

  1. Resolve alias in core and accept both alias or raw bucket names:
--- a/src/datamanager/core.py
+++ b/src/datamanager/core.py
@@
-from typing import Any, TypedDict
+from typing import Any, TypedDict
+from typing import Optional  # if needed elsewhere
@@
-def upload_to_r2(
-    client: S3Client, file_path: Path, object_key: str, bucket: str = None
-) -> None:
+def upload_to_r2(
+    client: S3Client, file_path: Path, object_key: str, bucket: str | None = None
+) -> None:
@@
-    target_bucket = bucket or settings.bucket
+    target_bucket = _resolve_bucket(bucket)
@@
-def download_from_r2(
-    client: S3Client, object_key: str, download_path: Path, bucket: str = None
-) -> None:
+def download_from_r2(
+    client: S3Client, object_key: str, download_path: Path, bucket: str | None = None
+) -> None:
@@
-    target_bucket = bucket or settings.bucket
+    target_bucket = _resolve_bucket(bucket)
@@
-def pull_and_verify(
-    object_key: str, expected_hash: str, output_path: Path, bucket: str = None
-) -> bool:
+def pull_and_verify(
+    object_key: str, expected_hash: str, output_path: Path, bucket: str | None = None
+) -> bool:
@@
-        download_from_r2(client, object_key, output_path, bucket)
+        download_from_r2(client, object_key, output_path, bucket)
@@
-def delete_from_r2(client: S3Client, object_key: str, bucket: str = None) -> None:
+def delete_from_r2(client: S3Client, object_key: str, bucket: str | None = None) -> None:
@@
-    target_bucket = bucket or settings.bucket
+    target_bucket = _resolve_bucket(bucket)

Add this helper (outside the changed ranges, e.g., near other helpers):

def _resolve_bucket(bucket: str | None) -> str:
    # Accept aliases or raw bucket names.
    if bucket is None or bucket == "production":
        return settings.bucket
    if bucket == "internal":
        return settings.internal_bucket
    return bucket
  1. Pass the alias from CLI to core:
--- a/src/datamanager/__main__.py
+++ b/src/datamanager/__main__.py
@@
-    # Determine which bucket to use
-    target_bucket = (
-        settings.internal_bucket if bucket == "internal" else settings.bucket
-    )
-
     success = core.pull_and_verify(
         version_entry["r2_object_key"],
         version_entry["sha256"],
         final_path,
-        target_bucket,
+        bucket,  # pass alias; core resolves to bucket name
     )
  1. Use alias for diff download too (core resolves to actual):
--- a/src/datamanager/__main__.py
+++ b/src/datamanager/__main__.py
@@
-            # Download from the appropriate bucket
-            target_bucket = (
-                settings.internal_bucket if bucket == "internal" else settings.bucket
-            )
-            core.download_from_r2(
-                client, latest_version["r2_object_key"], old_path, target_bucket
-            )
+            # Download from the appropriate bucket (alias; core resolves)
+            core.download_from_r2(
+                client, latest_version["r2_object_key"], old_path, bucket
+            )
🧹 Nitpick comments (14)
docs/source/workflow.md (1)

44-49: Clarify how the “appropriate bucket” is chosen.

Suggest explicitly stating that CI selects the target bucket based on the dataset’s bucket field in manifest.json to remove ambiguity.

- It copies the data from the staging bucket to the appropriate target bucket (production or internal).
+ It copies the data from the staging bucket to the appropriate target bucket (production or internal), as specified by each dataset’s bucket field in manifest.json.
.github/workflows/publish.yml (1)

36-45: Fail fast if R2_INTERNAL_BUCKET (and other required env) are unset.

Add a preflight validation step so the job errors with a clear message when any required variable is missing.

       - name: Install project and dependencies
         run: |
           uv sync --all-extras --dev
           uv pip install -e . --no-deps

+      - name: Validate required env
+        run: |
+          set -euo pipefail
+          for v in R2_ACCOUNT_ID R2_ACCESS_KEY_ID R2_SECRET_ACCESS_KEY R2_PRODUCTION_BUCKET R2_STAGING_BUCKET R2_INTERNAL_BUCKET; do
+            if [ -z "${!v:-}" ]; then
+              echo "::error::Missing required env: $v"
+              exit 1
+            fi
+          done
+        env:
+          R2_ACCOUNT_ID: ${{ secrets.R2_ACCOUNT_ID }}
+          R2_ACCESS_KEY_ID: ${{ secrets.R2_ACCESS_KEY_ID }}
+          R2_SECRET_ACCESS_KEY: ${{ secrets.R2_SECRET_ACCESS_KEY }}
+          R2_PRODUCTION_BUCKET: ${{ vars.R2_PRODUCTION_BUCKET }}
+          R2_STAGING_BUCKET: ${{ vars.R2_STAGING_BUCKET }}
+          R2_INTERNAL_BUCKET: ${{ vars.R2_INTERNAL_BUCKET }}
+
       - name: Run Publish Script
         run: uv run .github/scripts/publish_script.py
         env:
           R2_ACCOUNT_ID: ${{ secrets.R2_ACCOUNT_ID }}
           R2_ACCESS_KEY_ID: ${{ secrets.R2_ACCESS_KEY_ID }}
           R2_SECRET_ACCESS_KEY: ${{ secrets.R2_SECRET_ACCESS_KEY }}
           R2_PRODUCTION_BUCKET: ${{ vars.R2_PRODUCTION_BUCKET }} # Use repo variable
           R2_STAGING_BUCKET: ${{ vars.R2_STAGING_BUCKET }} # Use repo variable
           R2_INTERNAL_BUCKET: ${{ vars.R2_INTERNAL_BUCKET }} # Use repo variable
docs/source/usage.md (1)

55-65: Note permission requirements for internal pulls.

Add a short note pointing users to verify if they get AccessDenied when pulling internal datasets.

-Downloads a dataset from the appropriate R2 bucket (production or internal) and verifies its integrity. The bucket is determined by the dataset's configuration in the manifest.
+Downloads a dataset from the appropriate R2 bucket (production or internal) and verifies its integrity. The bucket is determined by the dataset's configuration in the manifest. Note: Pulling an internal dataset requires access to the internal bucket. If you see AccessDenied, run `uv run datamanager verify`.
src/datamanager/config.py (2)

6-11: Also read from OS environment, not only .env.

The current code ignores exported environment variables when no .env file is present; reading both improves usability in CI and interactive shells.

 from dotenv import find_dotenv, dotenv_values
+import os
 import warnings

 _ENV_PATH = Path(find_dotenv()) if find_dotenv() else None
 _ENV = dotenv_values(_ENV_PATH) if _ENV_PATH else {}

13-22: Apply env fallback in _need().

Prefer OS env first, then .env, then warn and use dummy.

 def _need(var: str) -> str:
-    val = _ENV.get(var)
+    val = os.getenv(var, _ENV.get(var))
     if val is None:
         warnings.warn(
             f"Env var {var} is missing – using dummy value for offline/test mode",
             RuntimeWarning,
         )
         val = "DUMMY"
     return str(val)
docs/source/setup.md (1)

3-8: LGTM; minor doc polish suggestion.

Consider adding a one-liner that access to Internal bucket should be tightly scoped to team members (principle of least privilege). Optional.

Also applies to: 46-47

tests/test_core.py (3)

91-105: Avoid brittle order-dependence in verify_r2_access results.

Tests assume [production, staging, internal] ordering. Please ensure core.verify_r2_access guarantees this order, or expose a bucket identifier so tests can map results by name instead of index.
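
If verify_r2_access ever returns a bucket identifier per entry, the tests could map results by name instead of index; a rough sketch (the "bucket" key is hypothetical and not part of the current return shape):

results = core.verify_r2_access()
by_bucket = {r["bucket"]: r for r in results}  # hypothetical key
assert by_bucket["internal"]["exists"] is True
assert by_bucket["internal"]["permissions"]["read"] is True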


116-124: Broaden assertions to staging/internal for read-only case.

Optional: assert the same read-only behavior for staging and internal to increase coverage.

 results = core.verify_r2_access()
-prod_result = results[0]
-
-assert prod_result["exists"] is True
-assert prod_result["permissions"]["read"] is True
-assert prod_result["permissions"]["write"] is False
-assert prod_result["permissions"]["delete"] is False
-assert "Partial access: [read]" in prod_result["message"]
+for r in results:
+    assert r["exists"] is True
+    assert r["permissions"]["read"] is True
+    assert r["permissions"]["write"] is False
+    assert r["permissions"]["delete"] is False
+    assert "Partial access: [read]" in r["message"]

132-138: Broaden assertions to all buckets for not-found case.

Given head_bucket raises for each call, all buckets should reflect not-found.

 results = core.verify_r2_access()
-prod_result = results[0]
-
-assert prod_result["exists"] is False
-assert not any(prod_result["permissions"].values())
-assert "Bucket not found" in prod_result["message"]
+for r in results:
+    assert r["exists"] is False
+    assert not any(r["permissions"].values())
+    assert "Bucket not found" in r["message"]
src/datamanager/manifest.py (2)

204-218: Type‑narrow the return and clarify defaulting

Constrain the return type to the allowed literals and keep the current defaulting behavior.

-from typing import Any, Optional
+from typing import Any, Optional, Literal
...
-def get_dataset_bucket(name: str) -> str:
+def get_dataset_bucket(name: str) -> Literal["production", "internal"]:
@@
-    return dataset.get("bucket", "production")
+    return dataset.get("bucket", "production")

220-234: Validate input and avoid unnecessary write on no‑op

  • Validate that only "production" or "internal" are accepted.
  • Return early if dataset is not found to avoid noop writes.
-from typing import Any, Optional
+from typing import Any, Optional, Literal
@@
-def update_dataset_bucket(name: str, bucket: str) -> None:
+def update_dataset_bucket(name: str, bucket: Literal["production", "internal"]) -> None:
@@
-    for item in data:
+    updated = False
+    for item in data:
         if item.get("fileName") == name:
             item["bucket"] = bucket
-            break
-    write_manifest(data)
+            updated = True
+            break
+    if updated:
+        write_manifest(data)
.github/scripts/publish_script.py (1)

49-55: Docs generation: avoid public links for internal datasets

For internal datasets, the generated "Download" link will point to a public base URL and likely be inaccessible. Prefer a label instead.

Outside this hunk, adjust finalize_dataset_docs:

# inside the loop over datasets, before building download_link
is_internal = dataset.get("bucket") == "internal"
...
download_link = "N/A"
if r2_object_key:
    if is_internal:
        download_link = "Internal-only"
    else:
        full_url = f"{BASE_DOWNLOAD_URL}{r2_object_key}"
        download_link = f"[Download]({full_url})"
src/datamanager/__main__.py (2)

265-271: Unused ctx in _run_prepare_logic

ctx isn’t used; consider removing the parameter or prefixing it with an underscore to satisfy linters. Keeping it is fine if you plan to use it soon.


385-390: Consider restricting --bucket to choices

Define an Enum or use typer.Choice to prevent invalid values.

Example:

class Bucket(str, Enum):
    production = "production"
    internal = "internal"

bucket: Bucket = typer.Option(Bucket.production, "--bucket", "-b", case_sensitive=False)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0ec6e33 and 0372b4e.

📒 Files selected for processing (16)
  • .github/scripts/publish_script.py (5 hunks)
  • .github/workflows/publish.yml (1 hunks)
  • README.md (6 hunks)
  • docs/source/setup.md (2 hunks)
  • docs/source/usage.md (2 hunks)
  • docs/source/workflow.md (1 hunks)
  • env.example (1 hunks)
  • manifest.json (2 hunks)
  • src/datamanager/__main__.py (13 hunks)
  • src/datamanager/config.py (1 hunks)
  • src/datamanager/core.py (7 hunks)
  • src/datamanager/manifest.py (2 hunks)
  • tests/conftest.py (1 hunks)
  • tests/test_core.py (1 hunks)
  • tests/test_main.py (4 hunks)
  • tests/test_manifest.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tests/test_manifest.py (2)
tests/conftest.py (1)
  • test_repo (16-82)
src/datamanager/manifest.py (3)
  • get_dataset_bucket (204-217)
  • update_dataset_bucket (220-233)
  • read_manifest (33-52)
tests/test_main.py (2)
tests/conftest.py (1)
  • test_repo (16-82)
src/datamanager/manifest.py (1)
  • get_dataset (78-92)
src/datamanager/__main__.py (2)
src/datamanager/core.py (2)
  • pull_and_verify (108-134)
  • download_from_r2 (81-105)
src/datamanager/manifest.py (1)
  • get_dataset (78-92)
🪛 GitHub Actions: CI
tests/test_main.py

[error] 514-514: Test 'test_pull_internal_bucket' failed: expected bucket parameter 'internal' but actual value 'DUMMY' (likely due to offline/test mode dummy values).

🪛 Ruff (0.13.1)
src/datamanager/core.py

62-62: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


82-82: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


109-109: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


203-203: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

src/datamanager/__main__.py

266-266: Unused function argument: ctx

(ARG001)

🔇 Additional comments (33)
docs/source/usage.md (1)

97-101: LGTM.

Clearer verify description aligning with three-bucket support.

tests/conftest.py (1)

46-46: LGTM; ensure legacy manifests remain compatible.

Adding bucket: "production" is good. Please confirm the loader defaults to production when bucket is absent so older manifests don’t break; add/keep a test for that.
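
A rough pytest sketch of such a backward-compatibility check, modeled on the fixture usage elsewhere in this suite (assumes the test_repo fixture and the manifest helpers shown below):

import os
from pathlib import Path

from datamanager import manifest

def test_legacy_manifest_defaults_to_production(test_repo: Path) -> None:
    os.chdir(test_repo)
    # Simulate an old manifest entry that predates the "bucket" field.
    data = manifest.read_manifest()
    for item in data:
        item.pop("bucket", None)
    manifest.write_manifest(data)
    assert manifest.get_dataset_bucket(data[0]["fileName"]) == "production"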

src/datamanager/config.py (1)

31-31: LGTM.

internal_bucket addition aligns with docs and workflow.

env.example (1)

6-6: LGTM.

Example env updated for internal bucket.

src/datamanager/manifest.py (1)

24-26: Export list update looks good

New APIs are properly exported.

tests/test_manifest.py (5)

36-36: LGTM: Bucket field asserted

Asserting default "production" on existing dataset is correct per fixtures.


52-53: LGTM: Bucket preserved across history update

Ensures update path doesn't drop the field.


55-64: LGTM: get_dataset_bucket tests

Covers both present and missing dataset cases.


66-76: LGTM: update_dataset_bucket tests

Covers update and no‑op behaviors.


476-500: LGTM: Internal bucket prepare path

Validates internal flow and manifest fields.

manifest.json (2)

4-4: LGTM: Default bucket added to existing dataset

Matches code paths that default to "production".


36-51: LGTM: Example internal dataset entry

Structure aligns with create flow (bucket="internal", v1 history).

.github/scripts/publish_script.py (4)

137-142: Good: per‑dataset bucket routing for deletions

Bucket selection logic is clear and isolated.


174-213: Good: batch delete grouped per bucket

Chunking and error handling are appropriate.


241-255: Good: publish targets correct bucket and logs context

Server‑side copy uses target bucket; logs are clear.


15-15: R2_INTERNAL_BUCKET exported in CI
publish.yml exports R2_INTERNAL_BUCKET from repo vars: .github/workflows/publish.yml:44.

README.md (8)

62-69: LGTM: Feature bullets updated for multi-bucket and verify

Clear and consistent with CLI.


70-87: LGTM: Bucket types section

Concise and actionable guidance.


93-97: LGTM: Prereqs list includes internal bucket

Matches new env requirements.


139-139: LGTM: Env var added

R2_INTERNAL_BUCKET documented.


169-174: LGTM: Prepare examples show bucket selection

Good defaults and internal option.


220-221: LGTM: CLI note on --bucket

Accurately reflects new behavior.


260-271: LGTM: Pull examples for production and internal

Covers both latest and specific version cases.


303-307: LGTM: Verify mentions all three buckets

Aligned with core.verify_r2_access.

src/datamanager/core.py (2)

325-339: LGTM: upload_to_internal helper

Useful for explicit internal uploads; consistent progress handling.


280-296: LGTM: verify includes internal bucket

Covers all three buckets; messages are coherent.

src/datamanager/__main__.py (2)

70-73: LGTM: Verify table headers clarified

Minor UX polish.


104-112: LGTM: Datasets list shows bucket clearly

Icons and labels improve visibility.

tests/test_main.py (5)

33-33: Default bucket assertion LGTM

Good check that new datasets default to production.


233-234: Interactive prepare correctly propagates bucket

Ensures default production bucket flows into core logic.


299-300: Interactive pull correctly propagates bucket

Production bucket is forwarded to the pull logic as expected.


476-500: Internal prepare path looks good

Covers internal bucket flag end-to-end and validates manifest persistence.


502-515: Fix failing assertion: compare against settings.internal_bucket, not literal "internal"

Replace the assertion with:
assert call_args[3] == settings.internal_bucket

Likely an incorrect or invalid review comment.

refactor: centralize bucket resolution logic in core module

- Add _resolve_bucket() helper function with PEP 604 union types
- Update upload_to_r2, download_from_r2, and delete_from_r2 to use helper
- Simplify bucket handling in main module to pass aliases directly
- Support both bucket aliases (production/internal) and raw bucket names consistently

codecov bot commented Sep 23, 2025

Codecov Report

❌ Patch coverage is 95.28024% with 16 lines in your changes missing coverage. Please review.

  • src/datamanager/__main__.py: 67.39% patch coverage (12 lines missing, 3 partials) ⚠️
  • tests/test_manifest.py: 97.50% patch coverage (0 lines missing, 1 partial) ⚠️


coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (5)
src/datamanager/core.py (2)

68-85: _resolve_bucket logic looks good

Covers None/"production"/"internal" and pass-through of raw names. Consider centralizing all resolution via this helper (see previous comment).


386-400: Avoid duplication: delegate upload_to_internal to upload_to_r2

Reuse the generic uploader with bucket="internal" to keep behavior consistent and reduce maintenance.

 def upload_to_internal(client: S3Client, file_path: Path, object_key: str) -> None:
-    """Uploads a file to the INTERNAL R2 bucket with a progress bar."""
-    file_size = file_path.stat().st_size
-    with Progress() as progress:
-        task = progress.add_task(
-            f"[blue]Uploading to internal: {file_path.name}...", total=file_size
-        )
-        client.upload_file(
-            str(file_path),
-            settings.internal_bucket,
-            object_key,
-            Callback=lambda bytes_transferred: progress.update(
-                task, advance=bytes_transferred
-            ),
-        )
+    """Uploads a file to the INTERNAL R2 bucket with a progress bar."""
+    upload_to_r2(client, file_path, object_key, bucket="internal")
src/datamanager/__main__.py (3)

260-266: Unused ctx argument in _run_prepare_logic

Ruff ARG001: ctx is unused. Rename to _ctx or remove the parameter.

-def _run_prepare_logic(
-    ctx: typer.Context, name: str, file: Path, bucket: str = "production"
-) -> None:
+def _run_prepare_logic(
+    _ctx: typer.Context, name: str, file: Path, bucket: str = "production"
+) -> None:

Call sites need no change since it’s positional.


377-382: Constrain bucket option to avoid typos

Restrict CLI bucket values to "production" or "internal" for better UX.

-from typing import Callable, Optional, Any
+from typing import Callable, Optional, Any, Literal
@@
-def prepare(
+def prepare(
@@
-    bucket: str = typer.Option(
+    bucket: Literal["production", "internal"] = typer.Option(
         "production",
         "--bucket",
         "-b",
         help="Target bucket: 'production' or 'internal'. Defaults to production.",
     ),

Consider applying the same to the pull command’s bucket option.


179-184: Apply the same choice constraint to pull command

Mirror the prepare command’s restriction here.

-    bucket: str = typer.Option(
+    bucket: Literal["production", "internal"] = typer.Option(
         "production",
         "--bucket",
         "-b",
         help="Bucket to pull from: 'production' or 'internal'. Defaults to production.",
     ),
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0372b4e and 2159e49.

📒 Files selected for processing (3)
  • src/datamanager/__main__.py (13 hunks)
  • src/datamanager/core.py (8 hunks)
  • src/datamanager/manifest.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/datamanager/manifest.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/datamanager/__main__.py (2)
src/datamanager/manifest.py (1)
  • get_dataset (78-92)
src/datamanager/core.py (1)
  • download_from_r2 (116-140)
🪛 Ruff (0.13.1)
src/datamanager/core.py

155-155: Do not catch blind exception: Exception

(BLE001)

src/datamanager/__main__.py

261-261: Unused function argument: ctx

(ARG001)

🔇 Additional comments (11)
src/datamanager/core.py (3)

96-101: upload_to_r2 correctly adopts alias resolution and PEP 604 type hints

Good use of _resolve_bucket and progress callback.

Also applies to: 107-109


116-138: download_from_r2 correctly resolves bucket and applies head/download to the same target

Looks consistent and safe. Errors are surfaced via ClientError.


343-357: verify_r2_access includes internal bucket — LGTM

Covers production, staging, internal. Good.

Please ensure settings.internal_bucket is non-empty in all deployment environments to avoid a false-negative “bucket not found”.
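
One way to surface a missing value early is a small guard before verification; a hedged sketch (the helper name and warning text are illustrative):

import warnings

from datamanager.config import settings

def warn_if_internal_bucket_unset() -> None:
    # The config falls back to a dummy value when the env var is missing,
    # so treat both empty and placeholder values as unset.
    if not settings.internal_bucket or settings.internal_bucket == "DUMMY":
        warnings.warn(
            "R2_INTERNAL_BUCKET is not configured; internal-bucket checks will report failures",
            RuntimeWarning,
        )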

src/datamanager/__main__.py (8)

70-73: Minor table header rename — LGTM

Clearer labeling in the verify report.


104-112: Display of dataset bucket in list view — LGTM

Good UX; defaults to production and uses clear icons.


120-127: Threading bucket alias through pull logic — LGTM

Passing the alias to core centralizes resolution as intended.

Also applies to: 146-151


199-206: Interactive pull uses dataset bucket — LGTM

Choice formatting, null handling, and using the manifest’s bucket are consistent.

Also applies to: 207-209, 215-217, 223-225, 254-255


296-299: Alias passed to core.download_from_r2 — LGTM

Keeps bucket resolution centralized.


392-427: Production prepare interactive path — LGTM

Flow and messaging look good.


430-468: Internal prepare interactive path — LGTM

Clear separation and user messaging for internal datasets.


779-781: TUI adds distinct internal/production prepare actions — LGTM

Good discoverability.

Comment on lines +262 to 270
def delete_from_r2(
    client: S3Client, object_key: str, bucket: str | None = None
) -> None:
    """Deletes an object from the R2 bucket."""
    target_bucket = _resolve_bucket(bucket)
    console.print(f"Attempting to delete [yellow]{object_key}[/] from R2...")
    try:
-        client.delete_object(Bucket=settings.bucket, Key=object_key)
+        client.delete_object(Bucket=target_bucket, Key=object_key)
        console.print("✅ Rollback deletion successful.")

⚠️ Potential issue

Clarify deletion target and narrow exception type

Improve observability by logging the resolved bucket, and avoid catching broad exceptions.

 def delete_from_r2(
     client: S3Client, object_key: str, bucket: str | None = None
 ) -> None:
     """Deletes an object from the R2 bucket."""
     target_bucket = _resolve_bucket(bucket)
-    console.print(f"Attempting to delete [yellow]{object_key}[/] from R2...")
+    console.print(
+        f"Attempting to delete [yellow]{object_key}[/] from R2 bucket [cyan]{target_bucket}[/]..."
+    )
     try:
         client.delete_object(Bucket=target_bucket, Key=object_key)
         console.print("✅ Rollback deletion successful.")
-    except Exception as e:
+    except ClientError as e:
         # This is a best-effort cleanup. We notify the user if it fails.
         console.print(
             f"[bold red]Warning:[/] Failed to delete R2 object during rollback: {e}"
         )
         console.print("You may need to manually delete the object from your R2 bucket.")

Also applies to: 271-277

coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (6)
tests/test_manifest.py (2)

115-119: Stabilize test environment by using the fixture

This test doesn’t chdir into the repo. It works today because read_manifest handles missing files, but using the test_repo fixture would make it fully isolated and order-independent.

Apply this diff:

-def test_get_dataset_bucket_nonexistent() -> None:
+def test_get_dataset_bucket_nonexistent(test_repo: Path) -> None:
     """Test getting bucket type for non-existent dataset."""
+    os.chdir(test_repo)
     bucket = manifest.get_dataset_bucket("non-existent.sqlite")
     assert bucket == "production"  # Should default to production

121-131: Optional: consider validating bucket values at API boundaries

Tests intentionally allow any string. If you later want stricter semantics, add a validator (e.g., in CLI) or a helper like is_valid_bucket() rather than changing update_dataset_bucket’s behavior.
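
A sketch of such a helper, named per the suggestion above (not currently part of manifest.py):

VALID_BUCKETS = ("production", "internal")

def is_valid_bucket(bucket: str) -> bool:
    """True only for the known bucket aliases; raw bucket names stay out of the manifest."""
    return bucket in VALID_BUCKETS

The CLI could call this before update_dataset_bucket and exit with a clear error for anything else, leaving the manifest helper itself permissive.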

tests/test_core.py (1)

264-268: Silence Ruff ARG001: remove unused fixture param

The mocker argument isn’t used here.

Apply this diff:

-def test_resolve_bucket_custom(mocker: MockerFixture) -> None:
+def test_resolve_bucket_custom() -> None:
     """Test _resolve_bucket with custom bucket name."""
     result = core._resolve_bucket("custom-bucket-name")
     assert result == "custom-bucket-name"
src/datamanager/core.py (1)

65-82: Clarify docstring: pass-through of non-alias names

The function may return a raw bucket name if it’s not an alias; reflect that in the docstring.

Apply this diff:

 def _resolve_bucket(bucket: str | None) -> str:
     """
-    Resolves bucket parameter to actual bucket name.
+    Resolves bucket parameter to an actual bucket name:
+    - None or "production" -> settings.bucket
+    - "internal" -> settings.internal_bucket
+    - any other value -> returned as-is (pass-through)
src/datamanager/__main__.py (2)

120-127: Default pull bucket to dataset’s configured bucket when unspecified

Safer DX: avoid accidental cross-bucket pulls. Consider making bucket optional and resolving from manifest if None.

Apply this diff:

-def _run_pull_logic(
-    name: str, version: str, output: Optional[Path], bucket: str = "production"
-) -> None:
+def _run_pull_logic(
+    name: str, version: str, output: Optional[Path], bucket: Optional[str] = None
+) -> None:
@@
-    success = core.pull_and_verify(
+    # Resolve default bucket from manifest if not provided
+    effective_bucket = bucket or manifest.get_dataset_bucket(name)
+    success = core.pull_and_verify(
         version_entry["r2_object_key"],
         version_entry["sha256"],
         final_path,
-        bucket,
+        effective_bucket,
     )

And in the CLI command:

-    bucket: str = typer.Option(
-        "production",
+    bucket: Optional[str] = typer.Option(
+        None,
         "--bucket",
         "-b",
-        help="Bucket to pull from: 'production' or 'internal'. Defaults to production.",
+        help="Bucket to pull from: 'production', 'internal', or a raw name. Defaults to the dataset's configured bucket.",
     ),

296-299: Use dataset’s bucket for updates, ignore user-supplied bucket

Prevents accidental cross-bucket mixing on updates of existing datasets.

Apply this diff:

-            # Download from the appropriate bucket using core helper
-            core.download_from_r2(
-                client, latest_version["r2_object_key"], old_path, bucket
-            )
+            # Use the dataset's configured bucket for existing datasets
+            effective_bucket = dataset.get("bucket", "production")
+            core.download_from_r2(
+                client, latest_version["r2_object_key"], old_path, effective_bucket
+            )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2159e49 and 8b9edd4.

📒 Files selected for processing (6)
  • .pre-commit-config.yaml (2 hunks)
  • src/datamanager/__main__.py (13 hunks)
  • src/datamanager/core.py (8 hunks)
  • tests/test_core.py (5 hunks)
  • tests/test_main.py (4 hunks)
  • tests/test_manifest.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
tests/test_manifest.py (2)
tests/conftest.py (1)
  • test_repo (16-82)
src/datamanager/manifest.py (4)
  • get_dataset_bucket (204-218)
  • update_dataset_bucket (221-234)
  • read_manifest (33-52)
  • write_manifest (55-65)
tests/test_core.py (1)
src/datamanager/core.py (9)
  • download_from_r2 (113-137)
  • resolve_bucket_alias (52-62)
  • _resolve_bucket (65-81)
  • upload_to_staging (343-357)
  • upload_to_internal (360-362)
  • pull_and_verify (140-167)
  • delete_from_r2 (236-250)
  • verify_r2_access (315-340)
  • _check_bucket_permissions (253-312)
tests/test_main.py (3)
tests/conftest.py (1)
  • test_repo (16-82)
src/datamanager/manifest.py (1)
  • get_dataset (78-92)
src/datamanager/__main__.py (1)
  • _run_prepare_logic (260-366)
src/datamanager/__main__.py (2)
src/datamanager/manifest.py (1)
  • get_dataset (78-92)
src/datamanager/core.py (1)
  • download_from_r2 (113-137)
🪛 Ruff (0.13.1)
tests/test_core.py

264-264: Unused function argument: mocker

(ARG001)

🔇 Additional comments (42)
.pre-commit-config.yaml (1)

24-28: Add pyupgrade target to match your minimum supported Python

Set pyupgrade's args to the project's minimum Python (e.g. --py311-plus for >=3.11). Detection of requires-python in this repo was inconclusive — confirm pyproject.toml and choose the matching flag.

   - repo: https://github.com/asottile/pyupgrade
     rev: v3.20.0
     hooks:
-      - id: pyupgrade
+      - id: pyupgrade
+        args: ["--py311-plus"]
tests/test_manifest.py (7)

36-36: LGTM: bucket field asserted on get_dataset

Good check that fixtures include the new "bucket" field.


52-53: LGTM: bucket preserved across history updates

Verifies that unrelated fields remain intact.


55-64: LGTM: defaults to production when unset/missing

Covers both present and non-existent dataset defaults.


66-76: LGTM: update_dataset_bucket basic path

Non-existent dataset no-op is acceptable per tests.


78-91: LGTM: backward-compat for old manifests

Solid coverage for missing bucket field.


93-103: LGTM: internal bucket retrieval

Covers explicit internal setting.


104-113: LGTM: production bucket retrieval

Covers explicit production setting.

tests/test_main.py (7)

33-33: LGTM: new datasets default to production bucket

Matches CLI default and manifest schema.


232-234: LGTM: interactive prepare passes explicit production bucket

Keeps core resolution centralized.


295-300: LGTM: interactive pull uses dataset’s bucket

Correctly derives from manifest for safer defaults.


476-501: LGTM: prepare with internal bucket

Validates end-to-end create path for internal datasets.


503-516: LGTM: pull with internal bucket

Ensures bucket param is propagated to core.


518-560: LGTM: object key prefix uses dataset stem

Prevents namespace collisions; independent of bucket.


562-604: LGTM: update path derives r2_dir from existing key

Ensures stable prefixing and expected version bumping.

tests/test_core.py (13)

16-37: LGTM: typed ClientError builder

Cleaner and mypy-friendly approach for consistent error shaping.


115-129: LGTM: verify includes internal bucket

Asserts presence and permissions across all three buckets.


197-207: LGTM: download error raises ClientError

Aligns with core.download_from_r2 behavior.


209-268: LGTM: bucket alias/custom resolution tests

Covers production, internal, None, and custom names.


270-307: LGTM: staging/internal upload wrappers

Correct buckets asserted without leaking callback internals.


309-333: LGTM: pull_and_verify download error path

Properly avoids cleanup on download failure.


335-353: LGTM: pull_and_verify success path

No unnecessary cleanup; verifies expected flow.


355-362: LGTM: delete_from_r2 best-effort handling

Matches current core behavior to not raise.


364-376: LGTM: verify_r2_access client creation failure

Produces connection sentinel entry.


378-392: LGTM: access denied path

Message and permissions map look correct.


394-407: LGTM: connection error path

Surfaces connection failure coherently.


409-428: LGTM: read-only permissions path

Partial access messaging is precise.


430-447: LGTM: no permissions path

Covers zero-permission condition.

src/datamanager/core.py (7)

52-63: LGTM: alias resolution delegates to internal helper

Centralizes behavior and supports pass-through.


93-106: LGTM: upload_to_r2 resolves bucket aliases

Correctly honors explicit, aliased, and default buckets.


113-138: LGTM: download_from_r2 resolves bucket aliases and surfaces errors

Head + download with progress and re-raise is appropriate.


140-168: LGTM: pull_and_verify narrows exceptions and cleans up on mismatch

Matches tests and improves error messaging.


236-251: Log resolved bucket and narrow exception handling

Include the resolved bucket in the log; prefer ClientError first, then fall back to a broad catch to preserve current behavior and tests.

Apply this diff:

 def delete_from_r2(
     client: S3Client, object_key: str, bucket: str | None = None
 ) -> None:
     """Deletes an object from the R2 bucket."""
     target_bucket = _resolve_bucket(bucket)
-    console.print(f"Attempting to delete [yellow]{object_key}[/] from R2...")
+    console.print(
+        f"Attempting to delete [yellow]{object_key}[/] from R2 bucket [cyan]{target_bucket}[/]..."
+    )
     try:
         client.delete_object(Bucket=target_bucket, Key=object_key)
         console.print("✅ Rollback deletion successful.")
-    except Exception as e:
+    except ClientError as e:
         # This is a best-effort cleanup. We notify the user if it fails.
         console.print(
             f"[bold red]Warning:[/] Failed to delete R2 object during rollback: {e}"
         )
         console.print("You may need to manually delete the object from your R2 bucket.")
+    except Exception as e:
+        console.print(
+            f"[bold red]Warning:[/] Unexpected error during R2 deletion: {e}"
+        )
+        console.print("You may need to manually delete the object from your R2 bucket.")

315-341: LGTM: verify includes internal bucket

Expanded surface while preserving connection error handling.


360-363: LGTM: upload_to_internal wrapper

Encapsulates alias usage cleanly.

src/datamanager/__main__.py (7)

70-73: LGTM: clearer table headers

Improves UX without behavior change.


100-116: LGTM: dataset list shows bucket context

Nice visual cue with internal/production icons.


334-348: LGTM: v1 object key uses dataset stem, not bucket name

Fixes potential namespace collisions across buckets.


375-388: LGTM: prepare command exposes bucket option

Aligns CLI with internal/production use-cases.


390-425: LGTM: interactive production prepare

Clear prompts and safe confirmation flow.


428-465: LGTM: interactive internal prepare

Good separation of flows and messaging.


775-779: LGTM: TUI adds internal prepare action

Completes the UX for team-only datasets.

Labels: enhancement (New feature or request)