
Conversation

@SOORAJTS2001 SOORAJTS2001 commented Nov 29, 2025

fix: modify url_download to perform multipart downloads using multi-threading when supported by the server, which improves performance and consumes fewer resources.
If the server doesn't provide a content-length, or if there is any data to POST, fall back to _url_download instead.

This is a follow up for the discussion in #6236

Summary by CodeRabbit

Release Notes

  • New Features
    • Downloads now support parallel multipart transfers with configurable segment count for improved speed.
    • Added automatic detection of server capabilities to intelligently switch between parallel and standard download modes.
    • Enhanced download reliability with improved retry logic and timeout controls.


…ding when supported by the server, which shows better performance and consumes less resources

If the server doesn't give `content-length` or if there is any `data` to POST, fall back to `_url_download`
@mr-avocado mr-avocado bot moved this to Review Requested in Default project Nov 29, 2025
@coderabbitai coderabbitai bot commented Nov 29, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

The change introduces range-based multipart download functionality to the avocado/utils/download.py module. It adds a new internal _download_range helper function for fetching specific byte ranges with retry logic, and updates the url_download function to accept a segments parameter for controlling parallelism. The implementation uses ThreadPoolExecutor for concurrent downloads and includes HEAD-based server capability checks (Content-Length and Accept-Ranges headers) to determine whether to use multipart downloads or fall back to the single-download path. A USER_AGENT constant is introduced, and downloaded parts are merged into the final file with per-part cleanup. Typing annotations are added throughout.
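As a rough illustration of this flow, here is a minimal, self-contained sketch. The names (_download_range, url_download, USER_AGENT, segments) follow the walkthrough above, but the bodies, defaults, agent string, and the simplified fallback are assumptions for illustration, not the PR's exact code:

import os
import shutil
import socket
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.error import HTTPError

USER_AGENT = "example-agent/1.0"  # the PR pins a browser-like string instead


def _download_range(url, start, end, part_file, retries=3, delay=1, timeout=30):
    # Fetch the inclusive byte range [start, end] into part_file,
    # retrying transient failures a fixed number of times.
    request = urllib.request.Request(
        url, headers={"Range": f"bytes={start}-{end}", "User-Agent": USER_AGENT}
    )
    for attempt in range(1, retries + 1):
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                with open(part_file, "wb") as f:
                    shutil.copyfileobj(response, f)
            return
        except (socket.timeout, TimeoutError, HTTPError):
            if attempt == retries:
                raise
            time.sleep(delay)


def url_download(url, filename, segments=4, timeout=30):
    # Probe the server with HEAD for Content-Length and Accept-Ranges.
    headers = urllib.request.urlopen(
        urllib.request.Request(url, method="HEAD"), timeout=timeout
    ).headers
    size = int(headers.get("Content-Length", -1))
    if size <= 0 or "bytes" not in headers.get("Accept-Ranges", "").lower():
        raise NotImplementedError("single-download fallback elided in this sketch")

    part_size = size // segments
    path = Path(filename)
    part_files = [str(path.parent / f"{path.name}.part{i}") for i in range(segments)]

    with ThreadPoolExecutor(max_workers=segments) as pool:
        futures = []
        for i, part in enumerate(part_files):
            start = i * part_size
            end = size - 1 if i == segments - 1 else start + part_size - 1
            futures.append(pool.submit(_download_range, url, start, end, part))
        for future in futures:
            future.result()  # surface any worker exception

    # Merge the parts in order and remove them.
    with open(filename, "wb") as out:
        for part in part_files:
            with open(part, "rb") as pf:
                shutil.copyfileobj(pf, out)
            os.remove(part)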

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 50.00%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title check ✅ Passed — The title clearly and concisely describes the main change: adding multi-part download capability to the url_download function, which aligns with the primary objective of the PR.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
avocado/utils/download.py (1)

116-119: Use bare raise to preserve the original traceback.

Using raise e re-raises the exception but loses the original traceback context. Use bare raise instead.

         except (socket.timeout, TimeoutError, HTTPError) as e:
             if attempt == retries:
-                raise e
+                raise
             time.sleep(delay)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 74b7379 and 3aa3af6.

📒 Files selected for processing (1)
  • avocado/utils/download.py (2 hunks)
🧰 Additional context used
🪛 Ruff (0.14.6)
avocado/utils/download.py

110-112: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


113-113: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


115-115: Consider moving this statement to an else block

(TRY300)


118-118: Use raise without specifying exception name

Remove exception name

(TRY201)


137-139: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


138-138: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)

🔇 Additional comments (2)
avocado/utils/download.py (2)

25-27: LGTM!

The new imports and USER_AGENT constant are appropriate for the multipart download functionality.

Also applies to: 35-36


125-125: Note: I was unable to access the repository to verify the project's minimum supported Python version, so the following is based on the original concern:


Verify Python version compatibility for union type syntax.

The bytes | None syntax requires Python 3.10+. If the project supports Python versions earlier than 3.10, use Optional[bytes] from typing instead, or ensure from __future__ import annotations is imported at the top of the file. Verify the minimum Python version requirement in the project's configuration (setup.py, setup.cfg, or pyproject.toml).
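A minimal sketch of the two compatible spellings (the function names are hypothetical; only the annotation style is the point):

from __future__ import annotations  # must be the first statement in the module;
                                    # makes annotations lazy strings, so
                                    # `bytes | None` parses on 3.8/3.9 too

from typing import Optional


def fetch_new_style(url: str, data: bytes | None = None) -> None:
    """Relies on the __future__ import above on pre-3.10 interpreters."""


def fetch_old_style(url: str, data: Optional[bytes] = None) -> None:
    """Works everywhere without the __future__ import."""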

Comment on lines 137 to 147
    headers = urllib.request.urlopen(
        urllib.request.Request(url, method="HEAD"), timeout=10
    ).headers  # Using HEAD to get the content length
    size = int(headers.get("Content-Length", -1))
    supports_range = "bytes" in headers.get("Accept-Ranges", "").lower()

    if size <= 0 or data or not supports_range:
        # If the server doesn't provide the size or accept ranges, or if we
        # want to send data to the server (POST method), switch to a single
        # download with urlopen
        _url_download(url=url, filename=filename, data=data)
        return

⚠️ Potential issue | 🟠 Major

Missing error handling for HEAD request.

The HEAD request can raise HTTPError, URLError, or socket.timeout exceptions, causing the function to crash without falling back to _url_download. Additionally, the hardcoded 10-second timeout ignores the timeout parameter.

-    headers = urllib.request.urlopen(
-        urllib.request.Request(url, method="HEAD"), timeout=10
-    ).headers  # Using HEAD method to get the content length
-    size = int(headers.get("Content-Length", -1))
-    supports_range = "bytes" in headers.get("Accept-Ranges", "").lower()
-
-    if size <= 0 or data or not supports_range:
+    try:
+        head_response = urllib.request.urlopen(
+            urllib.request.Request(url, method="HEAD"), timeout=timeout
+        )
+        headers = head_response.headers
+        size = int(headers.get("Content-Length", -1))
+        supports_range = "bytes" in headers.get("Accept-Ranges", "").lower()
+    except (socket.timeout, HTTPError, urllib.error.URLError):
+        size = -1
+        supports_range = False
+
+    if size <= 0 or data or not supports_range:

Note: You'll need to add urllib.error to the imports for URLError.

Committable suggestion skipped: line range outside the PR's diff.


…In` instead of `assertEqual` due to the error shown about deprecated pkg_resources
@SOORAJTS2001 (Author)

Hi @clebergnu, can you please take a look at this PR?

@clebergnu clebergnu self-requested a review December 4, 2025 13:52
@clebergnu (Contributor)

/packit copr-build

@SOORAJTS2001 (Author)

Thanks @clebergnu, I have resolved the conflict

@clebergnu (Contributor)

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant improvement to the url_download function by implementing multi-part downloading using a thread pool. This is a great enhancement for performance when downloading large files from servers that support range requests. My review focuses on improving the robustness of the new implementation, particularly concerning the cleanup of temporary files in case of download or merge failures, to prevent leaving garbage files on the filesystem.

@clebergnu clebergnu (Contributor) left a comment

Thank you so much for this contribution. Please look into the review points by me and the AI reviews. Thanks!


log = logging.getLogger(__name__)

USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
@clebergnu (Contributor)

Is there a reason for this specific user agent?

@SOORAJTS2001 (Author)

Not really. Servers do not inherently require a user agent to enable a download, since the User-Agent header is technically optional under the HTTP specification. However, many servers are configured to block or modify responses for requests without a user agent, or with an unrecognized one.
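As an illustration, a minimal urllib sketch that attaches the header; the URL and agent string here are placeholders, not the PR's values:

import urllib.request

# The exact string is not important; what matters is that the header is present.
request = urllib.request.Request(
    "https://example.com/file.bin",  # placeholder URL
    headers={"User-Agent": "avocado-download-example/1.0"},
)
with urllib.request.urlopen(request, timeout=30) as response:
    payload = response.read()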

    # Merge the split files and remove them
    with open(filename, "wb") as f:
        for part in part_files:
            with open(part, "rb") as pf:
@clebergnu (Contributor)

This may cause the unintentional overwrite of files. For example:

avocado.utils.download.url_download('https://avocado-project.org/data/assets/jeos-25-64.qcow2.xz', '/tmp/t.xz')

will overwrite any existing /tmp/t.xz.part* files. Ideally the temporary files would have genuinely temporary names (such as a unique suffix), and there would be protection against overwriting existing files.

@SOORAJTS2001 (Author)

Hi @clebergnu,
Shouldn't the temp files be overwritten on every download of the same file, since we don't know whether those files were completely downloaded or not?

@SOORAJTS2001 (Author)

Thanks,

  1. I will add a switchable option between parallel and conventional download
  2. I will fix the inconsistency in the part-file names and in deleting them

If there is anything missing, please let me know

@codecov codecov bot commented Dec 16, 2025

Codecov Report

❌ Patch coverage is 83.33333% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.52%. Comparing base (50bcca0) to head (a9d7c22).
⚠️ Report is 31 commits behind head on master.

Files with missing lines | Patch % | Lines
avocado/utils/download.py | 83.33% | 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6247      +/-   ##
==========================================
+ Coverage   73.47%   73.52%   +0.04%     
==========================================
  Files         206      206              
  Lines       22494    22523      +29     
==========================================
+ Hits        16528    16560      +32     
+ Misses       5966     5963       -3     

☔ View full report in Codecov by Sentry.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
avocado/utils/download.py (2)

139-148: CRITICAL: HEAD request lacks error handling and uses incorrect timeout.

This issue was flagged in previous reviews but remains unaddressed. The HEAD request on lines 139-141 can raise HTTPError, urllib.error.URLError, or socket.timeout exceptions, causing the function to crash without falling back to _url_download. Additionally, the hardcoded 10-second timeout ignores the timeout parameter.

🔎 Proposed fix
+    try:
-    headers = urllib.request.urlopen(
-        urllib.request.Request(url, method="HEAD"), timeout=10
-    ).headers  # Using HEAD method to get the content length
-    size = int(headers.get("Content-Length", -1))
-    supports_range = "bytes" in headers.get("Accept-Ranges", "").lower()
+        head_response = urllib.request.urlopen(
+            urllib.request.Request(url, method="HEAD"), timeout=timeout
+        )
+        headers = head_response.headers
+        size = int(headers.get("Content-Length", -1))
+        supports_range = "bytes" in headers.get("Accept-Ranges", "").lower()
+    except (socket.timeout, HTTPError, urllib.error.URLError):
+        size = -1
+        supports_range = False

     if segments == 1 or size <= 0 or data or not supports_range:
         # Use single download when size/range is unavailable, request is POST, or segment size is 1
         _url_download(url=url, filename=filename, data=data)
         return

Note: You'll need to add urllib.error to the imports for URLError.


152-152: MAJOR: Temp file naming may overwrite existing files.

The temp file naming pattern temp{i}_{path.name} could unintentionally collide with existing user files. As noted in previous reviews, if the target directory already contains files like temp0_myfile.txt, they will be silently overwritten during the download process.

Consider using more unique temp filenames to prevent collisions.

🔎 Proposed fix using Python's tempfile module
+    import tempfile
+
     part_size = size // segments
     path = Path(filename)  # takes absolute path
-    part_files = [str(path.parent / f"temp{i}_{path.name}") for i in range(segments)]
+    # Generate unique temp filenames in the same directory
+    part_files = []
+    for i in range(segments):
+        fd, temp_path = tempfile.mkstemp(
+            prefix=f"{path.stem}_part{i}_",
+            suffix=path.suffix or ".tmp",
+            dir=path.parent
+        )
+        os.close(fd)  # Close the file descriptor, we'll write to it later
+        part_files.append(temp_path)

Alternatively, use a UUID-based naming scheme:

+    import uuid
+
     part_size = size // segments
     path = Path(filename)
-    part_files = [str(path.parent / f"temp{i}_{path.name}") for i in range(segments)]
+    download_id = uuid.uuid4().hex[:8]
+    part_files = [str(path.parent / f".{path.name}.{download_id}.part{i}") for i in range(segments)]
🧹 Nitpick comments (2)
avocado/utils/download.py (2)

118-121: Consider using bare raise instead of raise e.

On line 120, use raise without the exception variable to preserve the original traceback more cleanly. This is a Python best practice when re-raising within an except block.

🔎 Suggested fix
         except (socket.timeout, TimeoutError, HTTPError) as e:
             if attempt == retries:
-                raise e
+                raise
             time.sleep(delay)

175-187: Excellent error handling and cleanup implementation!

The try/except/finally structure properly addresses the cleanup concerns raised in previous reviews:

  • Successfully downloaded parts are merged into the final file
  • On any failure, the incomplete destination file is removed
  • The finally block ensures all temporary part files are always cleaned up
  • Graceful fallback to single-part download improves reliability

The bare Exception catch (line 175) is flagged by static analysis but is reasonable here since any failure should trigger the fallback behavior, and the finally block guarantees cleanup.
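For reference, a condensed, runnable sketch of that try/except/finally shape; merge_parts and fallback are hypothetical stand-ins for the PR's inline code and its _url_download fallback:

import logging
import os
import shutil

log = logging.getLogger(__name__)


def merge_parts(part_files, filename, fallback):
    # Merge downloaded parts into filename; on any failure, remove the
    # half-written destination and invoke the single-download fallback.
    try:
        with open(filename, "wb") as out:
            for part in part_files:
                with open(part, "rb") as pf:
                    shutil.copyfileobj(pf, out)
    except Exception as e:
        if os.path.exists(filename):
            os.remove(filename)
        log.warning("Multipart download failed (%s). Falling back.", e)
        fallback()
    finally:
        # Runs on success and failure alike: part files never outlive the call.
        for part in part_files:
            if os.path.exists(part):
                os.remove(part)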

If you prefer more explicit exception handling, consider catching specific exceptions:

🔎 Optional: More specific exception handling
-    except Exception as e:
+    except (socket.timeout, TimeoutError, HTTPError, OSError, IOError) as e:
         # If anything fails, remove the incomplete destination file and switch to single-part download
         if os.path.exists(filename):
             os.remove(filename)
         log.warning(
             "Multipart download failed (%s). Falling back to single-part download.", e
         )
         _url_download(url=url, filename=filename, data=data)

However, the current approach is also acceptable given the fallback strategy.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c97438c and 63a6aec.

📒 Files selected for processing (1)
  • avocado/utils/download.py
🧰 Additional context used
🪛 Ruff (0.14.10)
avocado/utils/download.py

112-114: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


115-115: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


117-117: Consider moving this statement to an else block

(TRY300)


120-120: Use raise without specifying exception name

Remove exception name

(TRY201)


139-141: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


140-140: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


175-175: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (4)
avocado/utils/download.py (4)

25-28: LGTM! Imports are appropriate for the new multipart download functionality.

Also applies to: 31-31


153-156: LGTM! Small file handling is correct.

The check for part_size == 0 properly handles files smaller than the number of segments by falling back to single download. This addresses the edge case flagged in previous reviews.


158-165: LGTM! Byte range calculation is correct.

The task function correctly calculates byte ranges for each segment, ensuring the last segment captures all remaining bytes. The logic properly handles the inclusive nature of HTTP Range headers (0-based, end-inclusive).
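To make the arithmetic concrete, here is a tiny self-checking example; the values and variable names are arbitrary, not taken from the PR:

size = 10_003          # Content-Length reported by the server
segments = 4
part_size = size // segments  # 2500

ranges = []
for i in range(segments):
    start = i * part_size
    # HTTP Range is end-inclusive; the last segment absorbs the remainder
    end = size - 1 if i == segments - 1 else start + part_size - 1
    ranges.append((start, end))

assert ranges == [(0, 2499), (2500, 4999), (5000, 7499), (7500, 10_002)]
assert sum(end - start + 1 for start, end in ranges) == size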


166-174: LGTM! Parallel download and merge logic is sound.

The use of ThreadPoolExecutor for I/O-bound parallel downloads is appropriate, and the merge operation correctly combines part files using shutil.copyfileobj.
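A toy, network-free illustration of why this design is sound: workers may finish in any order, but each writes only its own part file, and the sequential merge re-imposes the correct byte order. All names and data below are made up:

import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor

chunks = [b"alpha-", b"beta-", b"gamma"]  # stand-ins for three ranged GETs
part_files = []
for i in range(len(chunks)):
    fd, name = tempfile.mkstemp(suffix=f".part{i}")
    os.close(fd)
    part_files.append(name)


def fetch(i):
    # Each worker writes only its own part file, so completion order
    # doesn't matter; order is restored by the sequential merge below.
    with open(part_files[i], "wb") as f:
        f.write(chunks[i])


with ThreadPoolExecutor(max_workers=3) as pool:
    list(pool.map(fetch, range(len(chunks))))

fd, merged = tempfile.mkstemp(suffix=".bin")
os.close(fd)
with open(merged, "wb") as dst:
    for part in part_files:
        with open(part, "rb") as src:
            shutil.copyfileobj(src, dst)
        os.remove(part)

with open(merged, "rb") as f:
    assert f.read() == b"alpha-beta-gamma"
os.remove(merged)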

@SOORAJTS2001 (Author)

Hi @clebergnu, I have made the changes requested in the review comments

Thanks!
