feat: Add internal bucket support for team-only datasets #38
base: main
Conversation
- Add multi-bucket architecture supporting production (public) and internal (team-only) R2 buckets
- Implement bucket routing based on dataset manifest configuration
- Add --bucket option to prepare and pull commands
- Update CI/CD workflows to handle internal bucket publications and deletions
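To make the routing concrete, here is an illustrative manifest entry with the new field; the dataset name is hypothetical and the remaining fields are elided, so treat this as a sketch rather than the exact schema:

```python
# Hypothetical manifest entry shown as a Python literal; only "fileName" and the
# new "bucket" field are taken from this PR, the rest is abbreviated.
dataset_entry = {
    "fileName": "team_metrics.sqlite",  # example dataset name
    "bucket": "internal",               # "production" (default, public) or "internal" (team-only)
    # ... existing fields such as the version history remain unchanged
}
```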
Walkthrough
Adds per-dataset bucket support (production vs internal) across CI publish script, CLI, core R2 APIs, config, manifest, docs, env, and tests. Operations (prepare, pull, verify, delete, publish) now target a dataset's configured bucket; manifest entries include a new "bucket" field and helpers to get/update it.
Changes
Sequence Diagram(s)
```mermaid
sequenceDiagram
autonumber
actor User
participant CLI as CLI (__main__.py)
participant Manifest as Manifest
participant Core as Core (R2 APIs)
participant R2Prod as R2: Production
participant R2Int as R2: Internal
rect rgb(240,248,255)
note over User,CLI: Prepare dataset
User->>CLI: prepare --bucket=<production|internal>
CLI->>Manifest: update_dataset_bucket(name, bucket)
CLI->>Core: upload_to_staging(file, staging_key, bucket)
CLI->>Manifest: write history entry (staging_key, bucket)
end
rect rgb(245,255,240)
note over User,CLI: Pull dataset
User->>CLI: pull --bucket=<production|internal>
CLI->>Manifest: get_dataset_bucket(name)
CLI->>Core: pull_and_verify(object_key, sha, out, bucket)
alt bucket=production
Core->>R2Prod: GET object_key
else bucket=internal
Core->>R2Int: GET object_key
end
Core-->>CLI: verified
CLI-->>User: file saved
end
```

```mermaid
sequenceDiagram
autonumber
participant GH as GitHub Actions
participant Pub as publish_script.py
participant R2P as R2: Production
participant R2I as R2: Internal
rect rgb(255,250,240)
note over GH,Pub: Publish step
GH->>Pub: env (R2_INTERNAL_BUCKET,…)
Pub->>Pub: Group objects by bucket
par Delete production objects
Pub->>R2P: Batch DELETE(keys)
and Delete internal objects
Pub->>R2I: Batch DELETE(keys)
end
par Copy production objects
Pub->>R2P: Server-side COPY(staging_key -> object_key)
and Copy internal objects
Pub->>R2I: Server-side COPY(staging_key -> object_key)
end
Pub-->>GH: logs incl. bucket context
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/test_manifest.py (1)
507-515: Fix failing expectation: pass bucket alias to core or adjust assertion.
CI shows the 4th arg to pull_and_verify is a real bucket name (e.g., "DUMMY") instead of "internal". Either:
- Pass the alias through to core and let core resolve it, or
- Update the test to check against settings.internal_bucket.
Given the rest of the suite expects aliases at boundaries, prefer passing the alias and resolving inside core.
Apply these diffs:
- Resolve the alias in core and accept both aliases and raw bucket names:
```diff
--- a/src/datamanager/core.py
+++ b/src/datamanager/core.py
@@
-from typing import Any, TypedDict
+from typing import Any, TypedDict
+from typing import Optional  # if needed elsewhere
@@
-def upload_to_r2(
-    client: S3Client, file_path: Path, object_key: str, bucket: str = None
-) -> None:
+def upload_to_r2(
+    client: S3Client, file_path: Path, object_key: str, bucket: str | None = None
+) -> None:
@@
-    target_bucket = bucket or settings.bucket
+    target_bucket = _resolve_bucket(bucket)
@@
-def download_from_r2(
-    client: S3Client, object_key: str, download_path: Path, bucket: str = None
-) -> None:
+def download_from_r2(
+    client: S3Client, object_key: str, download_path: Path, bucket: str | None = None
+) -> None:
@@
-    target_bucket = bucket or settings.bucket
+    target_bucket = _resolve_bucket(bucket)
@@
-def pull_and_verify(
-    object_key: str, expected_hash: str, output_path: Path, bucket: str = None
-) -> bool:
+def pull_and_verify(
+    object_key: str, expected_hash: str, output_path: Path, bucket: str | None = None
+) -> bool:
@@
-    download_from_r2(client, object_key, output_path, bucket)
+    download_from_r2(client, object_key, output_path, bucket)
@@
-def delete_from_r2(client: S3Client, object_key: str, bucket: str = None) -> None:
+def delete_from_r2(client: S3Client, object_key: str, bucket: str | None = None) -> None:
@@
-    target_bucket = bucket or settings.bucket
+    target_bucket = _resolve_bucket(bucket)
```

Add this helper (outside the changed ranges, e.g., near other helpers):

```python
def _resolve_bucket(bucket: str | None) -> str:
    # Accept aliases or raw bucket names.
    if bucket is None or bucket == "production":
        return settings.bucket
    if bucket == "internal":
        return settings.internal_bucket
    return bucket
```
- Pass the alias from CLI to core:
```diff
--- a/src/datamanager/__main__.py
+++ b/src/datamanager/__main__.py
@@
-    # Determine which bucket to use
-    target_bucket = (
-        settings.internal_bucket if bucket == "internal" else settings.bucket
-    )
-
     success = core.pull_and_verify(
         version_entry["r2_object_key"],
         version_entry["sha256"],
         final_path,
-        target_bucket,
+        bucket,  # pass alias; core resolves to bucket name
     )
```
- Use alias for diff download too (core resolves to actual):
```diff
--- a/src/datamanager/__main__.py
+++ b/src/datamanager/__main__.py
@@
-    # Download from the appropriate bucket
-    target_bucket = (
-        settings.internal_bucket if bucket == "internal" else settings.bucket
-    )
-    core.download_from_r2(
-        client, latest_version["r2_object_key"], old_path, target_bucket
-    )
+    # Download from the appropriate bucket (alias; core resolves)
+    core.download_from_r2(
+        client, latest_version["r2_object_key"], old_path, bucket
+    )
```
🧹 Nitpick comments (14)
docs/source/workflow.md (1)
44-49: Clarify how the “appropriate bucket” is chosen.
Suggest explicitly stating that CI selects the target bucket based on the dataset’s bucket field in manifest.json to remove ambiguity.

```diff
-It copies the data from the staging bucket to the appropriate target bucket (production or internal).
+It copies the data from the staging bucket to the appropriate target bucket (production or internal), as specified by each dataset's bucket field in manifest.json.
```

.github/workflows/publish.yml (1)
36-45: Fail fast if R2_INTERNAL_BUCKET (and other required env vars) are unset.
Add a preflight validation step so the job errors with a clear message when any required variable is missing.

```diff
 - name: Install project and dependencies
   run: |
     uv sync --all-extras --dev
     uv pip install -e . --no-deps

+- name: Validate required env
+  run: |
+    set -euo pipefail
+    for v in R2_ACCOUNT_ID R2_ACCESS_KEY_ID R2_SECRET_ACCESS_KEY R2_PRODUCTION_BUCKET R2_STAGING_BUCKET R2_INTERNAL_BUCKET; do
+      if [ -z "${!v:-}" ]; then
+        echo "::error::Missing required env: $v"
+        exit 1
+      fi
+    done
+  env:
+    R2_ACCOUNT_ID: ${{ secrets.R2_ACCOUNT_ID }}
+    R2_ACCESS_KEY_ID: ${{ secrets.R2_ACCESS_KEY_ID }}
+    R2_SECRET_ACCESS_KEY: ${{ secrets.R2_SECRET_ACCESS_KEY }}
+    R2_PRODUCTION_BUCKET: ${{ vars.R2_PRODUCTION_BUCKET }}
+    R2_STAGING_BUCKET: ${{ vars.R2_STAGING_BUCKET }}
+    R2_INTERNAL_BUCKET: ${{ vars.R2_INTERNAL_BUCKET }}
+
 - name: Run Publish Script
   run: uv run .github/scripts/publish_script.py
   env:
     R2_ACCOUNT_ID: ${{ secrets.R2_ACCOUNT_ID }}
     R2_ACCESS_KEY_ID: ${{ secrets.R2_ACCESS_KEY_ID }}
     R2_SECRET_ACCESS_KEY: ${{ secrets.R2_SECRET_ACCESS_KEY }}
     R2_PRODUCTION_BUCKET: ${{ vars.R2_PRODUCTION_BUCKET }} # Use repo variable
     R2_STAGING_BUCKET: ${{ vars.R2_STAGING_BUCKET }} # Use repo variable
     R2_INTERNAL_BUCKET: ${{ vars.R2_INTERNAL_BUCKET }} # Use repo variable
```

docs/source/usage.md (1)
55-65: Note permission requirements for internal pulls.
Add a short note pointing users to verify if they get AccessDenied when pulling internal datasets.

```diff
-Downloads a dataset from the appropriate R2 bucket (production or internal) and verifies its integrity. The bucket is determined by the dataset's configuration in the manifest.
+Downloads a dataset from the appropriate R2 bucket (production or internal) and verifies its integrity. The bucket is determined by the dataset's configuration in the manifest. Note: Pulling an internal dataset requires access to the internal bucket. If you see AccessDenied, run `uv run datamanager verify`.
```

src/datamanager/config.py (2)
6-11: Also read from the OS environment, not only .env.
Current code ignores exported env vars if .env is not present; reading both improves usability in CI and shells.

```diff
 from dotenv import find_dotenv, dotenv_values
+import os
 import warnings

 _ENV_PATH = Path(find_dotenv()) if find_dotenv() else None
 _ENV = dotenv_values(_ENV_PATH) if _ENV_PATH else {}
```

13-22: Apply env fallback in _need().
Prefer OS env first, then .env, then warn and use dummy.

```diff
 def _need(var: str) -> str:
-    val = _ENV.get(var)
+    val = os.getenv(var, _ENV.get(var))
     if val is None:
         warnings.warn(
             f"Env var {var} is missing – using dummy value for offline/test mode",
             RuntimeWarning,
         )
         val = "DUMMY"
     return str(val)
```

docs/source/setup.md (1)
3-8: LGTM; minor doc polish suggestion.
Consider adding a one-liner that access to the Internal bucket should be tightly scoped to team members (principle of least privilege). Optional.
Also applies to: 46-47
tests/test_core.py (3)
91-105: Avoid brittle order-dependence in verify_r2_access results.
Tests assume [production, staging, internal] ordering. Please ensure core.verify_r2_access guarantees this order, or expose a bucket identifier so tests can map results by name instead of index.
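A minimal sketch of the order-independent variant, assuming each result dict gains a "bucket" key naming the bucket it describes (that key is not in the current code, and the import paths are assumptions):

```python
from datamanager import core
from datamanager.config import settings  # assumed import path

results = core.verify_r2_access()
by_bucket = {r["bucket"]: r for r in results}  # assumes a "bucket" key is added to each result

internal = by_bucket[settings.internal_bucket]
assert internal["exists"] is True
assert internal["permissions"]["read"] is True
```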
116-124: Broaden assertions to staging/internal for read-only case.
Optional: assert the same read-only behavior for staging and internal to increase coverage.

```diff
 results = core.verify_r2_access()
-prod_result = results[0]
-
-assert prod_result["exists"] is True
-assert prod_result["permissions"]["read"] is True
-assert prod_result["permissions"]["write"] is False
-assert prod_result["permissions"]["delete"] is False
-assert "Partial access: [read]" in prod_result["message"]
+for r in results:
+    assert r["exists"] is True
+    assert r["permissions"]["read"] is True
+    assert r["permissions"]["write"] is False
+    assert r["permissions"]["delete"] is False
+    assert "Partial access: [read]" in r["message"]
```

132-138: Broaden assertions to all buckets for not-found case.
Given head_bucket raises for each call, all buckets should reflect not-found.

```diff
 results = core.verify_r2_access()
-prod_result = results[0]
-
-assert prod_result["exists"] is False
-assert not any(prod_result["permissions"].values())
-assert "Bucket not found" in prod_result["message"]
+for r in results:
+    assert r["exists"] is False
+    assert not any(r["permissions"].values())
+    assert "Bucket not found" in r["message"]
```

src/datamanager/manifest.py (2)
204-218: Type-narrow the return and clarify defaulting.
Constrain the return type to the allowed literals and keep the current defaulting behavior.

```diff
-from typing import Any, Optional
+from typing import Any, Optional, Literal
 ...
-def get_dataset_bucket(name: str) -> str:
+def get_dataset_bucket(name: str) -> Literal["production", "internal"]:
@@
-    return dataset.get("bucket", "production")
+    return dataset.get("bucket", "production")
```
220-234: Validate input and avoid unnecessary write on no‑op
- Validate that only "production" or "internal" are accepted.
- Return early if dataset is not found to avoid noop writes.
```diff
-from typing import Any, Optional
+from typing import Any, Optional, Literal
@@
-def update_dataset_bucket(name: str, bucket: str) -> None:
+def update_dataset_bucket(name: str, bucket: Literal["production", "internal"]) -> None:
@@
-    for item in data:
+    updated = False
+    for item in data:
         if item.get("fileName") == name:
             item["bucket"] = bucket
-            break
-    write_manifest(data)
+            updated = True
+            break
+    if updated:
+        write_manifest(data)
```

.github/scripts/publish_script.py (1)
49-55: Docs generation: avoid public links for internal datasets.
For internal datasets, the generated "Download" link will point to a public base URL and likely be inaccessible. Prefer a label instead.
Outside this hunk, adjust finalize_dataset_docs:
```python
# inside the loop over datasets, before building download_link
is_internal = dataset.get("bucket") == "internal"
...
download_link = "N/A"
if r2_object_key:
    if is_internal:
        download_link = "Internal-only"
    else:
        full_url = f"{BASE_DOWNLOAD_URL}{r2_object_key}"
        download_link = f"[Download]({full_url})"
```

src/datamanager/__main__.py (2)
265-271: Unused ctx in _run_prepare_logic.
ctx isn't used; consider removing the parameter or prefixing it with an underscore to satisfy linters. Keeping it is OK if you plan to use it soon.

385-390: Consider restricting --bucket to choices.
Define an Enum or use typer.Choice to prevent invalid values.
Example:
```python
class Bucket(str, Enum):
    production = "production"
    internal = "internal"

bucket: Bucket = typer.Option(Bucket.production, "--bucket", "-b", case_sensitive=False)
```
📜 Review details
📒 Files selected for processing (16)
- .github/scripts/publish_script.py (5 hunks)
- .github/workflows/publish.yml (1 hunks)
- README.md (6 hunks)
- docs/source/setup.md (2 hunks)
- docs/source/usage.md (2 hunks)
- docs/source/workflow.md (1 hunks)
- env.example (1 hunks)
- manifest.json (2 hunks)
- src/datamanager/__main__.py (13 hunks)
- src/datamanager/config.py (1 hunks)
- src/datamanager/core.py (7 hunks)
- src/datamanager/manifest.py (2 hunks)
- tests/conftest.py (1 hunks)
- tests/test_core.py (1 hunks)
- tests/test_main.py (4 hunks)
- tests/test_manifest.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tests/test_manifest.py (2)
tests/conftest.py (1)
test_repo (16-82)
src/datamanager/manifest.py (3)
get_dataset_bucket (204-217), update_dataset_bucket (220-233), read_manifest (33-52)
tests/test_main.py (2)
tests/conftest.py (1)
test_repo (16-82)
src/datamanager/manifest.py (1)
get_dataset(78-92)
src/datamanager/__main__.py (2)
src/datamanager/core.py (2)
pull_and_verify (108-134), download_from_r2 (81-105)
src/datamanager/manifest.py (1)
get_dataset(78-92)
🪛 GitHub Actions: CI
tests/test_main.py
[error] 514-514: Test 'test_pull_internal_bucket' failed: expected bucket parameter 'internal' but actual value 'DUMMY' (likely due to offline/test mode dummy values).
🪛 Ruff (0.13.1)
src/datamanager/core.py
62-62: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
82-82: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
109-109: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
203-203: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
src/datamanager/__main__.py
266-266: Unused function argument: ctx
(ARG001)
🔇 Additional comments (33)
docs/source/usage.md (1)
97-101: LGTM. Clearer verify description aligning with three-bucket support.
tests/conftest.py (1)
46-46: LGTM; ensure legacy manifests remain compatible. Adding bucket: "production" is good. Please confirm the loader defaults to production when bucket is absent so older manifests don’t break; add/keep a test for that.
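A sketch of such a backward-compatibility test, assuming the manifest helpers and test_repo fixture shown elsewhere in this review; the test name, import path, and manifest mutation are illustrative:

```python
import os
from pathlib import Path

from datamanager import manifest  # assumed import path


def test_get_dataset_bucket_defaults_for_legacy_manifest(test_repo: Path) -> None:
    """Older manifests without a 'bucket' field should default to production."""
    os.chdir(test_repo)
    data = manifest.read_manifest()
    for item in data:
        item.pop("bucket", None)  # simulate a pre-upgrade manifest
    manifest.write_manifest(data)

    assert manifest.get_dataset_bucket(data[0]["fileName"]) == "production"
```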
src/datamanager/config.py (1)
31-31: LGTM. internal_bucket addition aligns with docs and workflow.
env.example (1)
6-6: LGTM. Example env updated for internal bucket.
src/datamanager/manifest.py (1)
24-26: Export list update looks good. New APIs are properly exported.
tests/test_manifest.py (5)
36-36: LGTM: Bucket field asserted. Asserting default "production" on the existing dataset is correct per fixtures.
52-53: LGTM: Bucket preserved across history update. Ensures the update path doesn't drop the field.
55-64: LGTM: get_dataset_bucket tests. Covers both present and missing dataset cases.
66-76: LGTM: update_dataset_bucket tests. Covers update and no-op behaviors.
476-500: LGTM: Internal bucket prepare path. Validates internal flow and manifest fields.
manifest.json (2)
4-4: LGTM: Default bucket added to existing dataset. Matches code paths that default to "production".
36-51: LGTM: Example internal dataset entry. Structure aligns with the create flow (bucket="internal", v1 history).
.github/scripts/publish_script.py (4)
137-142: Good: per-dataset bucket routing for deletions. Bucket selection logic is clear and isolated.
174-213: Good: batch delete grouped per bucket. Chunking and error handling are appropriate; a minimal sketch of the pattern appears after this group.
241-255: Good: publish targets correct bucket and logs context. Server-side copy uses the target bucket; logs are clear.
15-15: R2_INTERNAL_BUCKET exported in CI
publish.yml exports R2_INTERNAL_BUCKET from repo vars: .github/workflows/publish.yml:44.
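As referenced above, a minimal sketch of the per-bucket grouping and chunked deletion pattern; the function name and client handling are illustrative, not the actual publish_script.py code:

```python
from collections import defaultdict
from typing import Iterable


def delete_grouped(client, objects: Iterable[tuple[str, str]], chunk_size: int = 1000) -> None:
    """Group (bucket, key) pairs by bucket and delete them in batches."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for bucket, key in objects:
        grouped[bucket].append(key)
    for bucket, keys in grouped.items():
        # S3-compatible DeleteObjects accepts at most 1000 keys per request.
        for i in range(0, len(keys), chunk_size):
            chunk = keys[i : i + chunk_size]
            client.delete_objects(
                Bucket=bucket,
                Delete={"Objects": [{"Key": k} for k in chunk]},
            )
```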
README.md (8)
62-69: LGTM: Feature bullets updated for multi-bucket and verify. Clear and consistent with the CLI.
70-87: LGTM: Bucket types section. Concise and actionable guidance.
93-97: LGTM: Prereqs list includes internal bucket. Matches new env requirements.
139-139: LGTM: Env var added. R2_INTERNAL_BUCKET documented.
169-174: LGTM: Prepare examples show bucket selection. Good defaults and internal option.
220-221: LGTM: CLI note on --bucket. Accurately reflects new behavior.
260-271: LGTM: Pull examples for production and internal. Covers both latest and specific version cases.
303-307: LGTM: Verify mentions all three buckets. Aligned with core.verify_r2_access.
src/datamanager/core.py (2)
325-339: LGTM: upload_to_internal helper. Useful for explicit internal uploads; consistent progress handling.
280-296: LGTM: verify includes internal bucket. Covers all three buckets; messages are coherent.
src/datamanager/__main__.py (2)
70-73: LGTM: Verify table headers clarified. Minor UX polish.
104-112: LGTM: Datasets list shows bucket clearly. Icons and labels improve visibility.
tests/test_main.py (5)
33-33: Default bucket assertion LGTM. Good check that new datasets default to production.
233-234: Interactive prepare correctly propagates bucket. Ensures the default production bucket flows into core logic.
299-300: Interactive pull correctly propagates bucket. The production bucket is forwarded to the pull logic as expected.
476-500: Internal prepare path looks good. Covers the internal bucket flag end-to-end and validates manifest persistence.
502-515: Fix failing assertion: compare against settings.internal_bucket, not the literal "internal". Replace the assertion with:
```python
assert call_args[3] == settings.internal_bucket
```

Likely an incorrect or invalid review comment.
refactor: centralize bucket resolution logic in core module
- Add _resolve_bucket() helper function with PEP 604 union types
- Update upload_to_r2, download_from_r2, and delete_from_r2 to use the helper
- Simplify bucket handling in the main module to pass aliases directly
- Support both bucket aliases (production/internal) and raw bucket names consistently
Codecov Report: ❌ Patch coverage is …
Actionable comments posted: 4
🧹 Nitpick comments (5)
src/datamanager/core.py (2)
68-85: _resolve_bucket logic looks good. Covers None/"production"/"internal" and pass-through of raw names. Consider centralizing all resolution via this helper (see previous comment).
386-400: Avoid duplication: delegate upload_to_internal to upload_to_r2. Reuse the generic uploader with bucket="internal" to keep behavior consistent and reduce maintenance.
```diff
 def upload_to_internal(client: S3Client, file_path: Path, object_key: str) -> None:
-    """Uploads a file to the INTERNAL R2 bucket with a progress bar."""
-    file_size = file_path.stat().st_size
-    with Progress() as progress:
-        task = progress.add_task(
-            f"[blue]Uploading to internal: {file_path.name}...", total=file_size
-        )
-        client.upload_file(
-            str(file_path),
-            settings.internal_bucket,
-            object_key,
-            Callback=lambda bytes_transferred: progress.update(
-                task, advance=bytes_transferred
-            ),
-        )
+    """Uploads a file to the INTERNAL R2 bucket with a progress bar."""
+    upload_to_r2(client, file_path, object_key, bucket="internal")
```

src/datamanager/__main__.py (3)
260-266: Unused ctx argument in _run_prepare_logic.
Ruff ARG001: ctx is unused. Rename to _ctx or remove the parameter.

```diff
-def _run_prepare_logic(
-    ctx: typer.Context, name: str, file: Path, bucket: str = "production"
-) -> None:
+def _run_prepare_logic(
+    _ctx: typer.Context, name: str, file: Path, bucket: str = "production"
+) -> None:
```

Call sites need no change since it’s positional.

377-382: Constrain the bucket option to avoid typos.
Restrict CLI bucket values to "production" or "internal" for better UX.

```diff
-from typing import Callable, Optional, Any
+from typing import Callable, Optional, Any, Literal
@@
-def prepare(
+def prepare(
@@
-    bucket: str = typer.Option(
+    bucket: Literal["production", "internal"] = typer.Option(
         "production",
         "--bucket",
         "-b",
         help="Target bucket: 'production' or 'internal'. Defaults to production.",
     ),
```

Consider applying the same to the pull command’s bucket option.

179-184: Apply the same choice constraint to the pull command.
Mirror the prepare command’s restriction here.

```diff
-    bucket: str = typer.Option(
+    bucket: Literal["production", "internal"] = typer.Option(
         "production",
         "--bucket",
         "-b",
         help="Bucket to pull from: 'production' or 'internal'. Defaults to production.",
     ),
```
📜 Review details
📒 Files selected for processing (3)
- src/datamanager/__main__.py (13 hunks)
- src/datamanager/core.py (8 hunks)
- src/datamanager/manifest.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/datamanager/manifest.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/datamanager/__main__.py (2)
src/datamanager/manifest.py (1)
get_dataset (78-92)
src/datamanager/core.py (1)
download_from_r2(116-140)
🪛 Ruff (0.13.1)
src/datamanager/core.py
155-155: Do not catch blind exception: Exception
(BLE001)
src/datamanager/__main__.py
261-261: Unused function argument: ctx
(ARG001)
🔇 Additional comments (11)
src/datamanager/core.py (3)
96-101: upload_to_r2 correctly adopts alias resolution and PEP 604 type hints. Good use of _resolve_bucket and the progress callback.
Also applies to: 107-109
116-138: download_from_r2 correctly resolves the bucket and applies head/download to the same target. Looks consistent and safe. Errors are surfaced via ClientError.
343-357: verify_r2_access includes internal bucket — LGTM. Covers production, staging, internal. Good.
Please ensure settings.internal_bucket is non-empty in all deployment environments to avoid a false-negative “bucket not found”.
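One way to surface that early is a small guard at startup; a sketch assuming the existing settings object and its dummy-value fallback, with an illustrative function name:

```python
def ensure_internal_bucket(settings) -> None:
    """Fail fast if R2_INTERNAL_BUCKET was never configured."""
    if not settings.internal_bucket or settings.internal_bucket == "DUMMY":
        raise RuntimeError(
            "R2_INTERNAL_BUCKET is not configured; set it in .env or the environment."
        )
```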
src/datamanager/__main__.py (8)
70-73: Minor table header rename — LGTM. Clearer labeling in the verify report.
104-112: Display of dataset bucket in list view — LGTM. Good UX; defaults to production and uses clear icons.
120-127: Threading bucket alias through pull logic — LGTM. Passing the alias to core centralizes resolution as intended.
Also applies to: 146-151
199-206: Interactive pull uses dataset bucket — LGTM. Choice formatting, null handling, and use of the manifest’s bucket are consistent.
Also applies to: 207-209, 215-217, 223-225, 254-255
296-299: Alias passed to core.download_from_r2 — LGTM. Keeps bucket resolution centralized.
392-427: Production prepare interactive path — LGTM. Flow and messaging look good.
430-468: Internal prepare interactive path — LGTM. Clear separation and user messaging for internal datasets.
779-781: TUI adds distinct internal/production prepare actions — LGTM. Good discoverability.
```diff
 def delete_from_r2(
     client: S3Client, object_key: str, bucket: str | None = None
 ) -> None:
     """Deletes an object from the R2 bucket."""
     target_bucket = _resolve_bucket(bucket)
     console.print(f"Attempting to delete [yellow]{object_key}[/] from R2...")
     try:
-        client.delete_object(Bucket=settings.bucket, Key=object_key)
+        client.delete_object(Bucket=target_bucket, Key=object_key)
         console.print("✅ Rollback deletion successful.")
```
Clarify deletion target and narrow exception type
Improve observability by logging the resolved bucket, and avoid catching broad exceptions.
```diff
 def delete_from_r2(
     client: S3Client, object_key: str, bucket: str | None = None
 ) -> None:
     """Deletes an object from the R2 bucket."""
     target_bucket = _resolve_bucket(bucket)
-    console.print(f"Attempting to delete [yellow]{object_key}[/] from R2...")
+    console.print(
+        f"Attempting to delete [yellow]{object_key}[/] from R2 bucket [cyan]{target_bucket}[/]..."
+    )
     try:
         client.delete_object(Bucket=target_bucket, Key=object_key)
         console.print("✅ Rollback deletion successful.")
-    except Exception as e:
+    except ClientError as e:
         # This is a best-effort cleanup. We notify the user if it fails.
         console.print(
             f"[bold red]Warning:[/] Failed to delete R2 object during rollback: {e}"
         )
         console.print("You may need to manually delete the object from your R2 bucket.")
```
Actionable comments posted: 1
🧹 Nitpick comments (6)
tests/test_manifest.py (2)
115-119: Stabilize the test environment by using the fixture.
This test doesn’t chdir into the repo. It works today because read_manifest handles missing files, but using the test_repo fixture would make it fully isolated and order-independent.
Apply this diff:
```diff
-def test_get_dataset_bucket_nonexistent() -> None:
+def test_get_dataset_bucket_nonexistent(test_repo: Path) -> None:
     """Test getting bucket type for non-existent dataset."""
+    os.chdir(test_repo)
     bucket = manifest.get_dataset_bucket("non-existent.sqlite")
     assert bucket == "production"  # Should default to production
```
121-131: Optional: consider validating bucket values at API boundaries.
Tests intentionally allow any string. If you later want stricter semantics, add a validator (e.g., in the CLI) or a helper like is_valid_bucket() rather than changing update_dataset_bucket’s behavior.
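If stricter validation is wanted later, a hypothetical helper along those lines (not part of the current manifest API):

```python
VALID_BUCKETS = {"production", "internal"}


def is_valid_bucket(bucket: str) -> bool:
    """Return True only for the recognized bucket aliases."""
    return bucket in VALID_BUCKETS
```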
tests/test_core.py (1)
264-268: Silence Ruff ARG001: remove the unused fixture param.
The mocker argument isn’t used here.
Apply this diff:
```diff
-def test_resolve_bucket_custom(mocker: MockerFixture) -> None:
+def test_resolve_bucket_custom() -> None:
     """Test _resolve_bucket with custom bucket name."""
     result = core._resolve_bucket("custom-bucket-name")
     assert result == "custom-bucket-name"
```

src/datamanager/core.py (1)
65-82: Clarify docstring: pass-through of non-alias names.
The function may return a raw bucket name if it’s not an alias; reflect that in the docstring.
Apply this diff:
```diff
 def _resolve_bucket(bucket: str | None) -> str:
     """
-    Resolves bucket parameter to actual bucket name.
+    Resolves bucket parameter to an actual bucket name:
+    - None or "production" -> settings.bucket
+    - "internal" -> settings.internal_bucket
+    - any other value -> returned as-is (pass-through)
```

src/datamanager/__main__.py (2)
120-127: Default the pull bucket to the dataset’s configured bucket when unspecified.
Safer DX: avoids accidental cross-bucket pulls. Consider making bucket optional and resolving it from the manifest if None.
Apply this diff:
```diff
-def _run_pull_logic(
-    name: str, version: str, output: Optional[Path], bucket: str = "production"
-) -> None:
+def _run_pull_logic(
+    name: str, version: str, output: Optional[Path], bucket: Optional[str] = None
+) -> None:
@@
-    success = core.pull_and_verify(
+    # Resolve default bucket from manifest if not provided
+    effective_bucket = bucket or manifest.get_dataset_bucket(name)
+    success = core.pull_and_verify(
         version_entry["r2_object_key"],
         version_entry["sha256"],
         final_path,
-        bucket,
+        effective_bucket,
     )
```

And in the CLI command:

```diff
-    bucket: str = typer.Option(
-        "production",
+    bucket: Optional[str] = typer.Option(
+        None,
         "--bucket",
         "-b",
-        help="Bucket to pull from: 'production' or 'internal'. Defaults to production.",
+        help="Bucket to pull from: 'production', 'internal', or a raw name. Defaults to the dataset's configured bucket.",
     ),
```
296-299: Use the dataset’s bucket for updates; ignore the user-supplied bucket.
Prevents accidental cross-bucket mixing on updates of existing datasets.
Apply this diff:
```diff
-    # Download from the appropriate bucket using core helper
-    core.download_from_r2(
-        client, latest_version["r2_object_key"], old_path, bucket
-    )
+    # Use the dataset's configured bucket for existing datasets
+    effective_bucket = dataset.get("bucket", "production")
+    core.download_from_r2(
+        client, latest_version["r2_object_key"], old_path, effective_bucket
+    )
```
📜 Review details
📒 Files selected for processing (6)
- .pre-commit-config.yaml (2 hunks)
- src/datamanager/__main__.py (13 hunks)
- src/datamanager/core.py (8 hunks)
- tests/test_core.py (5 hunks)
- tests/test_main.py (4 hunks)
- tests/test_manifest.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
tests/test_manifest.py (2)
tests/conftest.py (1)
test_repo (16-82)
src/datamanager/manifest.py (4)
get_dataset_bucket (204-218), update_dataset_bucket (221-234), read_manifest (33-52), write_manifest (55-65)
tests/test_core.py (1)
src/datamanager/core.py (9)
download_from_r2 (113-137), resolve_bucket_alias (52-62), _resolve_bucket (65-81), upload_to_staging (343-357), upload_to_internal (360-362), pull_and_verify (140-167), delete_from_r2 (236-250), verify_r2_access (315-340), _check_bucket_permissions (253-312)
tests/test_main.py (3)
tests/conftest.py (1)
test_repo (16-82)
src/datamanager/manifest.py (1)
get_dataset (78-92)
src/datamanager/__main__.py (1)
_run_prepare_logic(260-366)
src/datamanager/__main__.py (2)
src/datamanager/manifest.py (1)
get_dataset (78-92)
src/datamanager/core.py (1)
download_from_r2(113-137)
🪛 Ruff (0.13.1)
tests/test_core.py
264-264: Unused function argument: mocker
(ARG001)
🔇 Additional comments (42)
.pre-commit-config.yaml (1)
24-28: Add a pyupgrade target to match your minimum supported Python.
Set pyupgrade's args to the project's minimum Python (e.g. --py311-plus for >=3.11). Detection of requires-python in this repo was inconclusive — confirm pyproject.toml and choose the matching flag.

```diff
 - repo: https://github.com/asottile/pyupgrade
   rev: v3.20.0
   hooks:
-    - id: pyupgrade
+    - id: pyupgrade
+      args: ["--py311-plus"]
```

tests/test_manifest.py (7)
36-36: LGTM: bucket field asserted on get_dataset. Good check that fixtures include the new "bucket" field.
52-53: LGTM: bucket preserved across history updates. Verifies that unrelated fields remain intact.
55-64: LGTM: defaults to production when unset/missing. Covers both present and non-existent dataset defaults.
66-76: LGTM: update_dataset_bucket basic path. Non-existent dataset no-op is acceptable per tests.
78-91: LGTM: backward-compat for old manifests. Solid coverage for the missing bucket field.
93-103: LGTM: internal bucket retrieval. Covers the explicit internal setting.
104-113: LGTM: production bucket retrieval. Covers the explicit production setting.
tests/test_main.py (7)
33-33: LGTM: new datasets default to the production bucket. Matches the CLI default and manifest schema.
232-234: LGTM: interactive prepare passes an explicit production bucket. Keeps core resolution centralized.
295-300: LGTM: interactive pull uses the dataset’s bucket. Correctly derives from the manifest for safer defaults.
476-501: LGTM: prepare with internal bucket. Validates the end-to-end create path for internal datasets.
503-516: LGTM: pull with internal bucket. Ensures the bucket param is propagated to core.
518-560: LGTM: object key prefix uses the dataset stem. Prevents namespace collisions; independent of bucket.
562-604: LGTM: update path derives r2_dir from the existing key. Ensures stable prefixing and expected version bumping.
tests/test_core.py (13)
16-37: LGTM: typed ClientError builder. Cleaner and mypy-friendly approach for consistent error shaping.
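For reference, a minimal sketch of such a builder; the helper name and defaults are illustrative, and the actual one in tests/test_core.py may differ:

```python
from botocore.exceptions import ClientError


def make_client_error(code: str, operation: str = "HeadBucket") -> ClientError:
    """Build a botocore ClientError with a given error code for mocking."""
    return ClientError({"Error": {"Code": code, "Message": code}}, operation)
```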
115-129: LGTM: verify includes internal bucket. Asserts presence and permissions across all three buckets.
197-207: LGTM: download error raises ClientError. Aligns with core.download_from_r2 behavior.
209-268: LGTM: bucket alias/custom resolution tests. Covers production, internal, None, and custom names.
270-307: LGTM: staging/internal upload wrappers. Correct buckets asserted without leaking callback internals.
309-333: LGTM: pull_and_verify download error path. Properly avoids cleanup on download failure.
335-353: LGTM: pull_and_verify success path. No unnecessary cleanup; verifies the expected flow.
355-362: LGTM: delete_from_r2 best-effort handling. Matches current core behavior of not raising.
364-376: LGTM: verify_r2_access client creation failure. Produces a connection sentinel entry.
378-392: LGTM: access denied path. Message and permissions map look correct.
394-407: LGTM: connection error path. Surfaces connection failure coherently.
409-428: LGTM: read-only permissions path. Partial access messaging is precise.
430-447: LGTM: no permissions path. Covers the zero-permission condition.
src/datamanager/core.py (7)
52-63: LGTM: alias resolution delegates to the internal helper. Centralizes behavior and supports pass-through.
93-106: LGTM: upload_to_r2 resolves bucket aliases. Correctly honors explicit, aliased, and default buckets.
113-138: LGTM: download_from_r2 resolves bucket aliases and surfaces errors. Head + download with progress and re-raise is appropriate.
140-168: LGTM: pull_and_verify narrows exceptions and cleans up on mismatch. Matches tests and improves error messaging.
236-251: Log the resolved bucket and narrow exception handling. Include the resolved bucket in the log; prefer ClientError first, then fall back to a broad catch to preserve current behavior and tests.
Apply this diff:
```diff
 def delete_from_r2(
     client: S3Client, object_key: str, bucket: str | None = None
 ) -> None:
     """Deletes an object from the R2 bucket."""
     target_bucket = _resolve_bucket(bucket)
-    console.print(f"Attempting to delete [yellow]{object_key}[/] from R2...")
+    console.print(
+        f"Attempting to delete [yellow]{object_key}[/] from R2 bucket [cyan]{target_bucket}[/]..."
+    )
     try:
         client.delete_object(Bucket=target_bucket, Key=object_key)
         console.print("✅ Rollback deletion successful.")
-    except Exception as e:
+    except ClientError as e:
         # This is a best-effort cleanup. We notify the user if it fails.
         console.print(
             f"[bold red]Warning:[/] Failed to delete R2 object during rollback: {e}"
         )
         console.print("You may need to manually delete the object from your R2 bucket.")
+    except Exception as e:
+        console.print(
+            f"[bold red]Warning:[/] Unexpected error during R2 deletion: {e}"
+        )
+        console.print("You may need to manually delete the object from your R2 bucket.")
```
315-341: LGTM: verify includes internal bucket. Expanded surface while preserving connection error handling.
360-363: LGTM: upload_to_internal wrapper. Encapsulates alias usage cleanly.
src/datamanager/__main__.py (7)
70-73: LGTM: clearer table headers. Improves UX without behavior change.
100-116: LGTM: dataset list shows bucket context. Nice visual cue with internal/production icons.
334-348: LGTM: v1 object key uses the dataset stem, not the bucket name. Fixes potential namespace collisions across buckets.
375-388: LGTM: prepare command exposes the bucket option. Aligns the CLI with internal/production use-cases.
390-425: LGTM: interactive production prepare. Clear prompts and a safe confirmation flow.
428-465: LGTM: interactive internal prepare. Good separation of flows and user messaging.
775-779: LGTM: TUI adds an internal prepare action. Completes the UX for team-only datasets.
Internal Bucket Support for Team-Only Datasets
This PR introduces comprehensive support for internal (team-only) R2 buckets alongside the existing production (public) bucket system.
Datasets in the internal bucket are visible via a datamanager pull with the right API access, or on the Cloudflare bucket page.
Configuration Required
Add to your .env file:
R2_INTERNAL_BUCKET="your-internal-bucket-name"
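Once the variable is set, a sketch of pulling a team-only dataset; the dataset name, object key, and hash are placeholders, and the import paths are assumptions based on the modules discussed in this review:

```python
from pathlib import Path

from datamanager import core, manifest  # assumed import paths

name = "team_metrics.sqlite"                # hypothetical internal dataset
bucket = manifest.get_dataset_bucket(name)  # returns "internal" for team-only datasets
core.pull_and_verify(
    "example/object/key.sqlite",            # placeholder r2_object_key from the manifest
    "<expected sha256>",                    # placeholder hash from the manifest
    Path(name),
    bucket,                                 # alias; core resolves it to the configured bucket
)
```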
Summary by CodeRabbit
New Features
Documentation
Tests
Chores