Multi-part download for url_download in utils/download.py
#6247
base: master
Conversation
…ding when supported by the server, which shows better performance and consumes fewer resources. If the server doesn't give `content-length` or if there is any `data` to POST, switch back to `_url_download`.
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough: The change introduces range-based multipart download functionality to the …

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks: ❌ Failed checks (1 warning), ✅ Passed checks (2 passed)
Actionable comments posted: 3
🧹 Nitpick comments (1)
avocado/utils/download.py (1)
116-119: Use bare `raise` to preserve the original traceback. Using `raise e` re-raises the exception but loses the original traceback context. Use bare `raise` instead.

```diff
 except (socket.timeout, TimeoutError, HTTPError) as e:
     if attempt == retries:
-        raise e
+        raise
     time.sleep(delay)
```
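As a self-contained illustration of why the bare `raise` matters, here is a minimal retry loop in the same shape as the code under review (the `retry` and `flaky` names are hypothetical, not from the PR):

```python
import time

def retry(func, retries=3, delay=0.01):
    """Call func up to `retries` times; on the final failure, a bare
    `raise` re-raises with the original traceback intact."""
    for attempt in range(1, retries + 1):
        try:
            return func()
        except (TimeoutError, OSError):
            if attempt == retries:
                raise  # bare raise keeps the original traceback, unlike `raise e`
            time.sleep(delay)

calls = []

def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("transient")
    return "ok"

print(retry(flaky))  # succeeds on the third attempt
```

With `raise e`, the traceback would point at the `raise` statement itself; with bare `raise`, it still points at the line inside `func` that failed.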
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
avocado/utils/download.py (2 hunks)
🧰 Additional context used
🪛 Ruff (0.14.6)
avocado/utils/download.py
110-112: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
113-113: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
115-115: Consider moving this statement to an else block
(TRY300)
118-118: Use raise without specifying exception name
Remove exception name
(TRY201)
137-139: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
138-138: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
🔇 Additional comments (2)
avocado/utils/download.py (2)
25-27: LGTM!The new imports and
USER_AGENTconstant are appropriate for the multipart download functionality.Also applies to: 35-36
125-125: I'm unable to access the repository to verify the project's Python version requirements and existing code patterns due to connectivity issues. However, based on the original concern, I can provide a rewritten review comment:
Verify Python version compatibility for union type syntax.
The `bytes | None` syntax requires Python 3.10+. If the project supports Python versions earlier than 3.10, use `Optional[bytes]` from `typing` instead, or ensure `from __future__ import annotations` is imported at the top of the file. Verify the minimum Python version requirement in the project's configuration (setup.py, setup.cfg, or pyproject.toml).
```python
headers = urllib.request.urlopen(
    urllib.request.Request(url, method="HEAD"), timeout=10
).headers  # Using HEAD method to get the content length
size = int(headers.get("Content-Length", -1))
supports_range = "bytes" in headers.get("Accept-Ranges", "").lower()

if size <= 0 or data or not supports_range:
    # if the server doesn't provide the size or an accepted range, or if we
    # want to send data to the server (POST method), switch to single
    # download with urlopen
    _url_download(url=url, filename=filename, data=data)
    return
```
Missing error handling for HEAD request.
The HEAD request can raise `HTTPError`, `URLError`, or `socket.timeout` exceptions, causing the function to crash without falling back to `_url_download`. Additionally, the hardcoded 10-second timeout ignores the `timeout` parameter.
```diff
+    try:
-        headers = urllib.request.urlopen(
-            urllib.request.Request(url, method="HEAD"), timeout=10
-        ).headers  # Using HEAD method to get the content length
-        size = int(headers.get("Content-Length", -1))
-        supports_range = "bytes" in headers.get("Accept-Ranges", "").lower()
+        head_response = urllib.request.urlopen(
+            urllib.request.Request(url, method="HEAD"), timeout=timeout
+        )
+        headers = head_response.headers
+        size = int(headers.get("Content-Length", -1))
+        supports_range = "bytes" in headers.get("Accept-Ranges", "").lower()
+    except (socket.timeout, HTTPError, urllib.error.URLError):
+        size = -1
+        supports_range = False

     if size <= 0 or data or not supports_range:
```

Note: You'll need to add `urllib.error` to the imports for `URLError`.
Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.14.6)
137-139: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
138-138: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
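The fallback proposed above can be sketched as a standalone helper. This is a minimal illustration, not the PR's code: `probe_head` is a hypothetical name, and catching `urllib.error.URLError` also covers `HTTPError`, which subclasses it.

```python
import socket
import urllib.error
import urllib.request

def probe_head(url, timeout=30):
    """Probe a URL with a HEAD request. Return (size, supports_range);
    on any network or URL error fall back to (-1, False) so the caller
    can switch to a plain single-part download."""
    try:
        headers = urllib.request.urlopen(
            urllib.request.Request(url, method="HEAD"), timeout=timeout
        ).headers
        size = int(headers.get("Content-Length", -1))
        supports_range = "bytes" in headers.get("Accept-Ranges", "").lower()
    except (socket.timeout, urllib.error.URLError, ValueError):
        size, supports_range = -1, False
    return size, supports_range
```

Because every failure path collapses to `(-1, False)`, the caller's existing `if size <= 0 or data or not supports_range:` check routes all error cases to the single-part download.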
…lower python versions
…In` instead of `assertEqual` due to the error shown about deprecated pkg_resources
Hi @clebergnu, can you please take a look at this PR?
/packit copr-build
Thanks @clebergnu, I have resolved the conflict.
/gemini review |
Code Review
This pull request introduces a significant improvement to the url_download function by implementing multi-part downloading using a thread pool. This is a great enhancement for performance when downloading large files from servers that support range requests. My review focuses on improving the robustness of the new implementation, particularly concerning the cleanup of temporary files in case of download or merge failures, to prevent leaving garbage files on the filesystem.
clebergnu
left a comment
Thank you so much for this contribution. Please look into the review points by me and the AI reviews. Thanks!
```python
log = logging.getLogger(__name__)


USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
```
Is there a reason for this specific user agent?
Not really. Servers do not inherently require a user agent to enable a download, as the User-Agent header is technically optional according to the HTTP specification. However, many servers are configured to block or modify responses for requests without a user agent or with an unrecognized one.
avocado/utils/download.py
Outdated
```python
# Merge the split files and remove them
with open(filename, "wb") as f:
    for part in part_files:
        with open(part, "rb") as pf:
```
This may cause the unintentional overwrite of files, such as:
```python
avocado.utils.download.url_download('https://avocado-project.org/data/assets/jeos-25-64.qcow2.xz', '/tmp/t.xz')
```
Will overwrite existing /tmp/t.xz.part* files. Ideally the temporary files will have temporary names (like a suffix) and also there will be protection against overwriting existing ones.
Hi @clebergnu,
Shouldn't the temp files be overwritten for the same file download every time, because we don't know whether those files were completely downloaded or not?
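The collision risk being discussed can be avoided entirely by letting the OS pick unique names. This is a sketch using `tempfile.mkstemp`, not the PR's actual code; the helper name `make_part_files` is hypothetical:

```python
import os
import tempfile
from pathlib import Path

def make_part_files(filename, segments):
    """Create uniquely named, empty part files next to the target file.
    mkstemp never reuses an existing name, so pre-existing user files
    (e.g. leftovers from an interrupted download) are never clobbered."""
    path = Path(filename)
    part_files = []
    for i in range(segments):
        fd, temp_path = tempfile.mkstemp(
            prefix=f".{path.name}.part{i}.", dir=path.parent
        )
        os.close(fd)  # the download worker re-opens the file for writing
        part_files.append(temp_path)
    return part_files
```

Since each run gets fresh names, partially downloaded leftovers from a previous run are simply ignored rather than trusted or overwritten.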
Thanks! If there is anything missing, please let me know.
Codecov Report: ❌ Patch coverage is …
Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master    #6247      +/-  ##
==========================================
+ Coverage   73.47%   73.52%    +0.04%
==========================================
  Files         206      206
  Lines       22494    22523      +29
==========================================
+ Hits        16528    16560      +32
+ Misses       5966     5963       -3
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 0
♻️ Duplicate comments (2)
avocado/utils/download.py (2)
139-148: CRITICAL: HEAD request lacks error handling and uses incorrect timeout. This issue was flagged in previous reviews but remains unaddressed. The HEAD request on lines 139-141 can raise `HTTPError`, `urllib.error.URLError`, or `socket.timeout` exceptions, causing the function to crash without falling back to `_url_download`. Additionally, the hardcoded 10-second timeout ignores the `timeout` parameter.

🔎 Proposed fix

```diff
+    try:
-        headers = urllib.request.urlopen(
-            urllib.request.Request(url, method="HEAD"), timeout=10
-        ).headers  # Using HEAD method to get the content length
-        size = int(headers.get("Content-Length", -1))
-        supports_range = "bytes" in headers.get("Accept-Ranges", "").lower()
+        head_response = urllib.request.urlopen(
+            urllib.request.Request(url, method="HEAD"), timeout=timeout
+        )
+        headers = head_response.headers
+        size = int(headers.get("Content-Length", -1))
+        supports_range = "bytes" in headers.get("Accept-Ranges", "").lower()
+    except (socket.timeout, HTTPError, urllib.error.URLError):
+        size = -1
+        supports_range = False

     if segments == 1 or size <= 0 or data or not supports_range:
         # Use single download when size/range is unavailable, request is POST, or segment size is 1
         _url_download(url=url, filename=filename, data=data)
         return
```

Note: You'll need to add `urllib.error` to the imports for `URLError`.
152-152: MAJOR: Temp file naming may overwrite existing files. The temp file naming pattern `temp{i}_{path.name}` could unintentionally collide with existing user files. As noted in previous reviews, if the target directory already contains files like `temp0_myfile.txt`, they will be silently overwritten during the download process. Consider using more unique temp filenames to prevent collisions.

🔎 Proposed fix using Python's tempfile module

```diff
+import tempfile
+
 part_size = size // segments
 path = Path(filename)  # takes absolute path
-part_files = [str(path.parent / f"temp{i}_{path.name}") for i in range(segments)]
+# Generate unique temp filenames in the same directory
+part_files = []
+for i in range(segments):
+    fd, temp_path = tempfile.mkstemp(
+        prefix=f"{path.stem}_part{i}_",
+        suffix=path.suffix or ".tmp",
+        dir=path.parent
+    )
+    os.close(fd)  # Close the file descriptor, we'll write to it later
+    part_files.append(temp_path)
```

Alternatively, use a UUID-based naming scheme:

```diff
+import uuid
+
 part_size = size // segments
 path = Path(filename)
-part_files = [str(path.parent / f"temp{i}_{path.name}") for i in range(segments)]
+download_id = uuid.uuid4().hex[:8]
+part_files = [str(path.parent / f".{path.name}.{download_id}.part{i}") for i in range(segments)]
```
🧹 Nitpick comments (2)
avocado/utils/download.py (2)
118-121: Consider using bare `raise` instead of `raise e`. On line 120, use `raise` without the exception variable to preserve the original traceback more cleanly. This is a Python best practice when re-raising within an except block.

🔎 Suggested fix

```diff
 except (socket.timeout, TimeoutError, HTTPError) as e:
     if attempt == retries:
-        raise e
+        raise
     time.sleep(delay)
```
175-187: Excellent error handling and cleanup implementation! The try/except/finally structure properly addresses the cleanup concerns raised in previous reviews:
- Successfully downloaded parts are merged into the final file
- On any failure, the incomplete destination file is removed
- The finally block ensures all temporary part files are always cleaned up
- Graceful fallback to single-part download improves reliability
The bare
Exceptioncatch (line 175) is flagged by static analysis but is reasonable here since any failure should trigger the fallback behavior, and the finally block guarantees cleanup.If you prefer more explicit exception handling, consider catching specific exceptions:
🔎 Optional: More specific exception handling
```diff
-    except Exception as e:
+    except (socket.timeout, TimeoutError, HTTPError, OSError, IOError) as e:
         # If anything fails, remove the incomplete destination file and switch to single-part download
         if os.path.exists(filename):
             os.remove(filename)
         log.warning(
             "Multipart download failed (%s). Falling back to single-part download.", e
         )
         _url_download(url=url, filename=filename, data=data)
```

However, the current approach is also acceptable given the fallback strategy.
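The cleanup contract praised above — the partial output is removed on failure and the part files are always deleted — can be shown in isolation. This is a hedged sketch; `merge_parts` is a hypothetical helper, not the function in the PR:

```python
import os
import shutil

def merge_parts(part_files, filename):
    """Concatenate part files into filename. On any failure, remove the
    incomplete destination and re-raise; the finally block guarantees the
    temporary parts are deleted on both success and failure."""
    try:
        with open(filename, "wb") as out:
            for part in part_files:
                with open(part, "rb") as pf:
                    shutil.copyfileobj(pf, out)
    except Exception:
        if os.path.exists(filename):
            os.remove(filename)
        raise
    finally:
        for part in part_files:
            if os.path.exists(part):
                os.remove(part)
```

The key design point is the split of responsibilities: `except` owns the destination (never leave a half-merged file), `finally` owns the temporaries (never leave garbage on disk).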
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
avocado/utils/download.py
🧰 Additional context used
🪛 Ruff (0.14.10)
avocado/utils/download.py
112-114: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
115-115: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
117-117: Consider moving this statement to an else block
(TRY300)
120-120: Use raise without specifying exception name
Remove exception name
(TRY201)
139-141: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
140-140: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
175-175: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (4)
avocado/utils/download.py (4)
25-28: LGTM! Imports are appropriate for the new multipart download functionality. Also applies to: 31-31
153-156: LGTM! Small file handling is correct. The check for `part_size == 0` properly handles files smaller than the number of segments by falling back to single download. This addresses the edge case flagged in previous reviews.
158-165: LGTM! Byte range calculation is correct. The task function correctly calculates byte ranges for each segment, ensuring the last segment captures all remaining bytes. The logic properly handles the inclusive nature of HTTP Range headers (0-based, end-inclusive).
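The range math described in this comment can be written as a small pure function — a sketch of the idea, not the PR's exact code:

```python
def byte_ranges(size, segments):
    """Split `size` bytes into inclusive (start, end) pairs suitable for
    HTTP Range headers; the last segment absorbs any remainder bytes."""
    part_size = size // segments
    ranges = []
    for i in range(segments):
        start = i * part_size
        # Range headers are end-inclusive; the last segment runs to size - 1
        end = size - 1 if i == segments - 1 else start + part_size - 1
        ranges.append((start, end))
    return ranges

print(byte_ranges(10, 3))  # [(0, 2), (3, 5), (6, 9)]
```

A worker for segment `i` would then send `Range: bytes={start}-{end}`, and the inclusive ends guarantee the segments tile the file exactly with no gaps or overlaps.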
166-174: LGTM! Parallel download and merge logic is sound. The use of `ThreadPoolExecutor` for I/O-bound parallel downloads is appropriate, and the merge operation correctly combines part files using `shutil.copyfileobj`.
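The executor-plus-merge pattern this comment approves can be sketched generically. Here `fetch_part` is a hypothetical callable standing in for the PR's per-segment download task; the names are illustrative only:

```python
import shutil
from concurrent.futures import ThreadPoolExecutor

def fetch_and_merge(fetch_part, part_files, filename):
    """Run one fetch_part(index, path) call per segment in a thread pool,
    then concatenate the part files in order with shutil.copyfileobj."""
    with ThreadPoolExecutor(max_workers=len(part_files)) as pool:
        futures = [
            pool.submit(fetch_part, i, path) for i, path in enumerate(part_files)
        ]
        for future in futures:
            future.result()  # re-raises any exception from a worker thread
    # All parts finished; merge in segment order so bytes land contiguously
    with open(filename, "wb") as out:
        for part in part_files:
            with open(part, "rb") as pf:
                shutil.copyfileobj(pf, out)
```

Calling `future.result()` on every future is what makes a worker failure abort the merge instead of producing a silently corrupt file.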
Hi @clebergnu, I have made the commented changes. Thanks!
fix: modify `url_download` to do multipart download using multi-threading when supported by the server, which shows better performance and consumes fewer resources.

If the server doesn't give `content-length` or if there is any `data` to POST, switch back to `_url_download` instead.

This is a follow-up to the discussion in #6236.
Summary by CodeRabbit
Release Notes