Multi-part download for url_download in utils/download.py
#6247
base: master
Conversation
…ding when supported by the server, which shows better performance and consumes fewer resources. If the server doesn't give `content-length` or if there is any `data` to POST, switch back to `_url_download`.
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough: The change introduces range-based multipart download functionality to the …

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks: ❌ Failed checks (1 warning), ✅ Passed checks (2 passed)
Actionable comments posted: 3
🧹 Nitpick comments (1)
avocado/utils/download.py (1)
116-119: Use bare `raise` to preserve the original traceback. Using `raise e` re-raises the exception but loses the original traceback context. Use bare `raise` instead.

```diff
 except (socket.timeout, TimeoutError, HTTPError) as e:
     if attempt == retries:
-        raise e
+        raise
     time.sleep(delay)
```
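As a self-contained illustration of why the bare `raise` matters, here is a minimal retry loop in the same shape as the code under review (the `retry` and `flaky` names are hypothetical, not from the PR):

```python
import time

def retry(func, retries=3, delay=0.01):
    """Call func up to `retries` times; on the final failure, a bare
    `raise` re-raises with the original traceback intact."""
    for attempt in range(1, retries + 1):
        try:
            return func()
        except (TimeoutError, OSError):
            if attempt == retries:
                raise  # bare raise keeps the original traceback, unlike `raise e`
            time.sleep(delay)

calls = []

def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("transient")
    return "ok"

print(retry(flaky))  # succeeds on the third attempt
```

With `raise e`, the traceback would point at the `raise` statement itself; with bare `raise`, it still points at the line inside `func` that failed.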
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
avocado/utils/download.py (2 hunks)
🧰 Additional context used
🪛 Ruff (0.14.6)
avocado/utils/download.py
110-112: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
113-113: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
115-115: Consider moving this statement to an else block
(TRY300)
118-118: Use raise without specifying exception name
Remove exception name
(TRY201)
137-139: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
138-138: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
🔇 Additional comments (2)
avocado/utils/download.py (2)
25-27: LGTM!The new imports and
USER_AGENTconstant are appropriate for the multipart download functionality.Also applies to: 35-36
125-125: I'm unable to access the repository to verify the project's Python version requirements and existing code patterns due to connectivity issues. However, based on the original concern, I can provide a rewritten review comment:
Verify Python version compatibility for union type syntax.
The `bytes | None` syntax requires Python 3.10+. If the project supports Python versions earlier than 3.10, use `Optional[bytes]` from `typing` instead, or ensure `from __future__ import annotations` is imported at the top of the file. Verify the minimum Python version requirement in the project's configuration (setup.py, setup.cfg, or pyproject.toml).
```python
headers = urllib.request.urlopen(
    urllib.request.Request(url, method="HEAD"), timeout=10
).headers  # Using HEAD method to get the content length
size = int(headers.get("Content-Length", -1))
supports_range = "bytes" in headers.get("Accept-Ranges", "").lower()

if size <= 0 or data or not supports_range:
    # if the server doesn't provide the size or an accepted range, or if we
    # want to send data to the server (POST method), switch to single
    # download with urlopen
    _url_download(url=url, filename=filename, data=data)
    return
```
Missing error handling for HEAD request.
The HEAD request can raise `HTTPError`, `URLError`, or `socket.timeout` exceptions, causing the function to crash without falling back to `_url_download`. Additionally, the hardcoded 10-second timeout ignores the `timeout` parameter.
```diff
+    try:
-        headers = urllib.request.urlopen(
-            urllib.request.Request(url, method="HEAD"), timeout=10
-        ).headers  # Using HEAD method to get the content length
-        size = int(headers.get("Content-Length", -1))
-        supports_range = "bytes" in headers.get("Accept-Ranges", "").lower()
+        head_response = urllib.request.urlopen(
+            urllib.request.Request(url, method="HEAD"), timeout=timeout
+        )
+        headers = head_response.headers
+        size = int(headers.get("Content-Length", -1))
+        supports_range = "bytes" in headers.get("Accept-Ranges", "").lower()
+    except (socket.timeout, HTTPError, urllib.error.URLError):
+        size = -1
+        supports_range = False

     if size <= 0 or data or not supports_range:
```

Note: You'll need to add `urllib.error` to the imports for `URLError`.
Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.14.6)
137-139: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
138-138: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
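The fallback proposed above can be sketched as a standalone helper. This is a minimal illustration, not the PR's code: `probe_head` is a hypothetical name, and catching `urllib.error.URLError` also covers `HTTPError`, which subclasses it.

```python
import socket
import urllib.error
import urllib.request

def probe_head(url, timeout=30):
    """Probe a URL with a HEAD request. Return (size, supports_range);
    on any network or URL error fall back to (-1, False) so the caller
    can switch to a plain single-part download."""
    try:
        headers = urllib.request.urlopen(
            urllib.request.Request(url, method="HEAD"), timeout=timeout
        ).headers
        size = int(headers.get("Content-Length", -1))
        supports_range = "bytes" in headers.get("Accept-Ranges", "").lower()
    except (socket.timeout, urllib.error.URLError, ValueError):
        size, supports_range = -1, False
    return size, supports_range
```

Because every failure path collapses to `(-1, False)`, the caller's existing `if size <= 0 or data or not supports_range:` check routes all error cases to the single-part download.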
…lower python versions
…In` instead of `assertEqual` due to the error shown about deprecated pkg_resources
Hi @clebergnu, can you please take a look at this PR?
/packit copr-build
Thanks @clebergnu, I have resolved the conflict.
/gemini review |
Code Review
This pull request introduces a significant improvement to the url_download function by implementing multi-part downloading using a thread pool. This is a great enhancement for performance when downloading large files from servers that support range requests. My review focuses on improving the robustness of the new implementation, particularly concerning the cleanup of temporary files in case of download or merge failures, to prevent leaving garbage files on the filesystem.
clebergnu
left a comment
Thank you so much for this contribution. Please look into the review points by me and the AI reviews. Thanks!
```python
log = logging.getLogger(__name__)


USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
```
Is there a reason for this specific user agent?
Not really. Servers do not inherently require a user agent to enable a download, as the User-Agent header is technically optional according to the HTTP specification. However, many servers are configured to block or modify responses for requests without a user agent or with an unrecognized one.
avocado/utils/download.py
Outdated
```python
# Merge the split files and remove them
with open(filename, "wb") as f:
    for part in part_files:
        with open(part, "rb") as pf:
```
This may cause the unintentional overwrite of files, such as:
```python
avocado.utils.download.url_download('https://avocado-project.org/data/assets/jeos-25-64.qcow2.xz', '/tmp/t.xz')
```
Will overwrite existing /tmp/t.xz.part* files. Ideally the temporary files will have temporary names (like a suffix) and also there will be protection against overwriting existing ones.
Hi @clebergnu,
Shouldn't the temp files be overwritten for the same file download every time, because we don't know whether those files were completely downloaded or not?
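The collision risk being discussed can be avoided entirely by letting the OS pick unique names. This is a sketch using `tempfile.mkstemp`, not the PR's actual code; the helper name `make_part_files` is hypothetical:

```python
import os
import tempfile
from pathlib import Path

def make_part_files(filename, segments):
    """Create uniquely named, empty part files next to the target file.
    mkstemp never reuses an existing name, so pre-existing user files
    (e.g. leftovers from an interrupted download) are never clobbered."""
    path = Path(filename)
    part_files = []
    for i in range(segments):
        fd, temp_path = tempfile.mkstemp(
            prefix=f".{path.name}.part{i}.", dir=path.parent
        )
        os.close(fd)  # the download worker re-opens the file for writing
        part_files.append(temp_path)
    return part_files
```

Since each run gets fresh names, partially downloaded leftovers from a previous run are simply ignored rather than trusted or overwritten.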
Thanks! If there is anything missing, please let me know.
Codecov Report: ❌ Patch coverage is …
Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master    #6247      +/-  ##
==========================================
+ Coverage   73.47%   73.52%    +0.04%
==========================================
  Files         206      206
  Lines       22494    22523      +29
==========================================
+ Hits        16528    16560      +32
+ Misses       5966     5963       -3
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 0
♻️ Duplicate comments (2)
avocado/utils/download.py (2)
139-148: CRITICAL: HEAD request lacks error handling and uses incorrect timeout. This issue was flagged in previous reviews but remains unaddressed. The HEAD request on lines 139-141 can raise `HTTPError`, `urllib.error.URLError`, or `socket.timeout` exceptions, causing the function to crash without falling back to `_url_download`. Additionally, the hardcoded 10-second timeout ignores the `timeout` parameter.

🔎 Proposed fix

```diff
+    try:
-        headers = urllib.request.urlopen(
-            urllib.request.Request(url, method="HEAD"), timeout=10
-        ).headers  # Using HEAD method to get the content length
-        size = int(headers.get("Content-Length", -1))
-        supports_range = "bytes" in headers.get("Accept-Ranges", "").lower()
+        head_response = urllib.request.urlopen(
+            urllib.request.Request(url, method="HEAD"), timeout=timeout
+        )
+        headers = head_response.headers
+        size = int(headers.get("Content-Length", -1))
+        supports_range = "bytes" in headers.get("Accept-Ranges", "").lower()
+    except (socket.timeout, HTTPError, urllib.error.URLError):
+        size = -1
+        supports_range = False

     if segments == 1 or size <= 0 or data or not supports_range:
         # Use single download when size/range is unavailable, request is POST, or segment size is 1
         _url_download(url=url, filename=filename, data=data)
         return
```

Note: You'll need to add `urllib.error` to the imports for `URLError`.
152-152: MAJOR: Temp file naming may overwrite existing files. The temp file naming pattern `temp{i}_{path.name}` could unintentionally collide with existing user files. As noted in previous reviews, if the target directory already contains files like `temp0_myfile.txt`, they will be silently overwritten during the download process. Consider using more unique temp filenames to prevent collisions.

🔎 Proposed fix using Python's tempfile module

```diff
+import tempfile
+
 part_size = size // segments
 path = Path(filename)  # takes absolute path
-part_files = [str(path.parent / f"temp{i}_{path.name}") for i in range(segments)]
+# Generate unique temp filenames in the same directory
+part_files = []
+for i in range(segments):
+    fd, temp_path = tempfile.mkstemp(
+        prefix=f"{path.stem}_part{i}_",
+        suffix=path.suffix or ".tmp",
+        dir=path.parent
+    )
+    os.close(fd)  # Close the file descriptor, we'll write to it later
+    part_files.append(temp_path)
```

Alternatively, use a UUID-based naming scheme:

```diff
+import uuid
+
 part_size = size // segments
 path = Path(filename)
-part_files = [str(path.parent / f"temp{i}_{path.name}") for i in range(segments)]
+download_id = uuid.uuid4().hex[:8]
+part_files = [str(path.parent / f".{path.name}.{download_id}.part{i}") for i in range(segments)]
```
🧹 Nitpick comments (2)
avocado/utils/download.py (2)
118-121: Consider using bare `raise` instead of `raise e`. On line 120, use `raise` without the exception variable to preserve the original traceback more cleanly. This is a Python best practice when re-raising within an except block.

🔎 Suggested fix

```diff
 except (socket.timeout, TimeoutError, HTTPError) as e:
     if attempt == retries:
-        raise e
+        raise
     time.sleep(delay)
```
175-187: Excellent error handling and cleanup implementation! The try/except/finally structure properly addresses the cleanup concerns raised in previous reviews:
- Successfully downloaded parts are merged into the final file
- On any failure, the incomplete destination file is removed
- The finally block ensures all temporary part files are always cleaned up
- Graceful fallback to single-part download improves reliability
The bare
Exceptioncatch (line 175) is flagged by static analysis but is reasonable here since any failure should trigger the fallback behavior, and the finally block guarantees cleanup.If you prefer more explicit exception handling, consider catching specific exceptions:
🔎 Optional: More specific exception handling
```diff
-    except Exception as e:
+    except (socket.timeout, TimeoutError, HTTPError, OSError, IOError) as e:
         # If anything fails, remove the incomplete destination file and switch to single-part download
         if os.path.exists(filename):
             os.remove(filename)
         log.warning(
             "Multipart download failed (%s). Falling back to single-part download.", e
         )
         _url_download(url=url, filename=filename, data=data)
```

However, the current approach is also acceptable given the fallback strategy.
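The cleanup contract praised above — the partial output is removed on failure and the part files are always deleted — can be shown in isolation. This is a hedged sketch; `merge_parts` is a hypothetical helper, not the function in the PR:

```python
import os
import shutil

def merge_parts(part_files, filename):
    """Concatenate part files into filename. On any failure, remove the
    incomplete destination and re-raise; the finally block guarantees the
    temporary parts are deleted on both success and failure."""
    try:
        with open(filename, "wb") as out:
            for part in part_files:
                with open(part, "rb") as pf:
                    shutil.copyfileobj(pf, out)
    except Exception:
        if os.path.exists(filename):
            os.remove(filename)
        raise
    finally:
        for part in part_files:
            if os.path.exists(part):
                os.remove(part)
```

The key design point is the split of responsibilities: `except` owns the destination (never leave a half-merged file), `finally` owns the temporaries (never leave garbage on disk).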
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
avocado/utils/download.py
🧰 Additional context used
🪛 Ruff (0.14.10)
avocado/utils/download.py
112-114: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
115-115: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
117-117: Consider moving this statement to an else block
(TRY300)
120-120: Use raise without specifying exception name
Remove exception name
(TRY201)
139-141: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
140-140: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
175-175: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (4)
avocado/utils/download.py (4)
25-28: LGTM! Imports are appropriate for the new multipart download functionality. Also applies to: 31-31
153-156: LGTM! Small file handling is correct. The check for `part_size == 0` properly handles files smaller than the number of segments by falling back to single download. This addresses the edge case flagged in previous reviews.
158-165: LGTM! Byte range calculation is correct. The task function correctly calculates byte ranges for each segment, ensuring the last segment captures all remaining bytes. The logic properly handles the inclusive nature of HTTP Range headers (0-based, end-inclusive).
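The range math described in this comment can be written as a small pure function — a sketch of the idea, not the PR's exact code:

```python
def byte_ranges(size, segments):
    """Split `size` bytes into inclusive (start, end) pairs suitable for
    HTTP Range headers; the last segment absorbs any remainder bytes."""
    part_size = size // segments
    ranges = []
    for i in range(segments):
        start = i * part_size
        # Range headers are end-inclusive; the last segment runs to size - 1
        end = size - 1 if i == segments - 1 else start + part_size - 1
        ranges.append((start, end))
    return ranges

print(byte_ranges(10, 3))  # [(0, 2), (3, 5), (6, 9)]
```

A worker for segment `i` would then send `Range: bytes={start}-{end}`, and the inclusive ends guarantee the segments tile the file exactly with no gaps or overlaps.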
166-174: LGTM! Parallel download and merge logic is sound. The use of `ThreadPoolExecutor` for I/O-bound parallel downloads is appropriate, and the merge operation correctly combines part files using `shutil.copyfileobj`.
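The executor-plus-merge pattern this comment approves can be sketched generically. Here `fetch_part` is a hypothetical callable standing in for the PR's per-segment download task; the names are illustrative only:

```python
import shutil
from concurrent.futures import ThreadPoolExecutor

def fetch_and_merge(fetch_part, part_files, filename):
    """Run one fetch_part(index, path) call per segment in a thread pool,
    then concatenate the part files in order with shutil.copyfileobj."""
    with ThreadPoolExecutor(max_workers=len(part_files)) as pool:
        futures = [
            pool.submit(fetch_part, i, path) for i, path in enumerate(part_files)
        ]
        for future in futures:
            future.result()  # re-raises any exception from a worker thread
    # All parts finished; merge in segment order so bytes land contiguously
    with open(filename, "wb") as out:
        for part in part_files:
            with open(part, "rb") as pf:
                shutil.copyfileobj(pf, out)
```

Calling `future.result()` on every future is what makes a worker failure abort the merge instead of producing a silently corrupt file.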
Hi @clebergnu, I have made the commented changes. Thanks!
fix: modify `url_download` to do multipart download using multi-threading when supported by the server, which shows better performance and consumes fewer resources.

If the server doesn't give `content-length` or if there is any `data` to POST, switch back to `_url_download` instead.

This is a follow-up to the discussion in #6236.
Summary by CodeRabbit
Release Notes