Fix: Shape mismatch in extend mode causing AssertionError #373
Description
This PR fixes a critical bug in extend mode where the target_latents and x0 tensor shapes don't match after padding/trimming operations, causing the pipeline to crash with an AssertionError.
Problem
When using the extend mode (extending audio left/right), the following error occurs:
Root Cause
The shape mismatch happens due to:
max_infer_fame_length (240 seconds)
These operations can create a 1-5 frame difference between target_latents and x0.
Solution
Added automatic shape alignment before the assertion check:
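A minimal sketch of this kind of alignment, not the exact code from the PR. It assumes the tensors are PyTorch tensors whose last dimension is the frame axis; the helper name align_to_x0 and the example shapes are hypothetical.

```python
import torch
import torch.nn.functional as F


def align_to_x0(target_latents: torch.Tensor, x0: torch.Tensor) -> torch.Tensor:
    """Pad or trim target_latents on the right so its frame count matches x0.

    Assumption: the frame axis is the last dimension of both tensors.
    """
    diff = x0.shape[-1] - target_latents.shape[-1]
    if diff > 0:
        # target_latents is shorter: pad with zeros on the right
        target_latents = F.pad(target_latents, (0, diff), mode="constant", value=0.0)
    elif diff < 0:
        # target_latents is longer: trim excess frames from the right
        target_latents = target_latents[..., : x0.shape[-1]]
    return target_latents


# Hypothetical shapes: (batch, channels, height, frames)
x0 = torch.randn(1, 8, 16, 100)
shorter = torch.randn(1, 8, 16, 97)
longer = torch.randn(1, 8, 16, 103)

aligned_short = align_to_x0(shorter, x0)
aligned_long = align_to_x0(longer, x0)

# The shape assertion that previously failed would now pass
assert aligned_short.shape == x0.shape
assert aligned_long.shape == x0.shape
```

Padding and trimming only on the right keeps the frames that both tensors share at the same indices.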
Logic:
target_latents is shorter: pad with zeros on the right
target_latents is longer: trim excess frames from the right
Impact
Before Fix:
After Fix:
Testing
Tested scenarios:
Files Changed
acestep/pipeline_ace_step.py - Added shape alignment logic in extend mode
Severity
🔴 CRITICAL - Without this fix, extend mode is completely broken.
Additional Notes
This fix addresses the issue reported by users when using the Upload tab with Text2Music Parameters in extend mode. The shape mismatch was causing the pipeline to fail before generating any audio.