Description
Dear VAMB team,
With the latest development branch of VAMB, I get the following error after fitting, partway through clustering:
2025-03-02 03:17:13.478 | INFO | Epoch: 300 Loss: 6.42182e-01 CE: 5.69886e-01 AB: 1.67774e-03 SSE: 4.97392e-02 KLD: 2.09010e-02 Batchsize: 4096
2025-03-02 03:17:13.483 | INFO | Encoding to latent representation
2025-03-02 03:19:42.538 | INFO | Trained VAE and encoded in 13238.23 seconds.
2025-03-02 03:19:42.538 | INFO | Clustering
2025-03-02 03:19:42.538 | INFO | Windowsize: 300
2025-03-02 03:19:42.538 | INFO | Min successful thresholds detected: 15
2025-03-02 03:19:42.538 | INFO | Max clusters: None
2025-03-02 03:19:42.538 | INFO | Use CUDA for clustering: True
2025-03-02 03:19:42.538 | INFO | Binsplitter: "_"
2025-03-02 03:20:01.036 | ERROR | An error has been caught in function 'main', process 'MainProcess' (95274), thread 'MainThread' (123169634703168):
Traceback (most recent call last):
File "/data/shiraz/miniconda3/envs/vamb/bin/vamb", line 8, in <module>
sys.exit(main())
│ │ └ <function main at 0x700490ba8e00>
│ └ <built-in function exit>
└ <module 'sys' (built-in)>
> File "/home/shiraz/src/vamb/vamb/__main__.py", line 2398, in main
run(runner, opt.common.general)
│ │ │ │ └ <vamb.__main__.GeneralOptions object at 0x700490b969d0>
│ │ │ └ <vamb.__main__.BinnerCommonOptions object at 0x700490ade510>
│ │ └ <vamb.__main__.BinDefaultOptions object at 0x700490ade660>
│ └ functools.partial(<function run_bin_default at 0x700490ba8220>, <vamb.__main__.BinDefaultOptions object at 0x700490ade660>)
└ <function run at 0x700490b7f060>
File "/home/shiraz/src/vamb/vamb/__main__.py", line 700, in run
runner()
└ functools.partial(<function run_bin_default at 0x700490ba8220>, <vamb.__main__.BinDefaultOptions object at 0x700490ade660>)
File "/home/shiraz/src/vamb/vamb/__main__.py", line 1332, in run_bin_default
cluster_and_write_files(
└ <function cluster_and_write_files at 0x700490b7fb00>
File "/home/shiraz/src/vamb/vamb/__main__.py", line 1199, in cluster_and_write_files
for i, cluster in enumerate(clusters):
│ │ └ <itertools.islice object at 0x7004930999e0>
│ └ <vamb.cluster.Cluster object at 0x700490b43e60>
└ 96
File "/home/shiraz/src/vamb/vamb/cluster.py", line 297, in __next__
cluster, _, points = self.find_cluster()
│ └ <function ClusterGenerator.find_cluster at 0x7004c0f34fe0>
└ ClusterGenerator(5582932 points, 97 clusters)
File "/home/shiraz/src/vamb/vamb/cluster.py", line 542, in find_cluster
medoid, distances = self.wander_medoid(seed)
│ │ │ └ 1756270
│ │ └ <function ClusterGenerator.wander_medoid at 0x7004c0f34ea0>
│ └ ClusterGenerator(5582932 points, 97 clusters)
└ 1596310
File "/home/shiraz/src/vamb/vamb/cluster.py", line 424, in wander_medoid
sample_cluster, sample_distances, sample_density = self.sample_medoid(
│ │ │ └ <function ClusterGenerator.sample_medoid at 0x7004c0f35080>
│ │ └ ClusterGenerator(5582932 points, 97 clusters)
│ └ tensor([0.4708, 0.4492, 0.5287, ..., 0.5651, 0.5209, 0.4904], device='cuda:0')
└ tensor([ 590411, 637452, 956697, 958690, 965290, 985399, 987508, 1300302,
1303055, 1334212, 1601852, 1603650, 1...
File "/home/shiraz/src/vamb/vamb/cluster.py", line 613, in sample_medoid
distances = _calc_distances(self.matrix, medoid)
│ │ │ └ 3758147
│ │ └ <member 'matrix' of 'ClusterGenerator' objects>
│ └ ClusterGenerator(5582932 points, 97 clusters)
└ <function _calc_distances at 0x7004c0f351c0>
File "/home/shiraz/src/vamb/vamb/cluster.py", line 664, in _calc_distances
dists = 0.5 - matrix.matmul(matrix[index])
│ │ │ └ 3758147
│ │ └ tensor([[-0.1746, 0.0333, -0.3253, ..., -0.1780, 0.0398, 0.0627],
│ │ [-0.0008, -0.0626, -0.1380, ..., -0.1338, 0.0...
│ └ <method 'matmul' of 'torch._C.TensorBase' objects>
└ tensor([[-0.1746, 0.0333, -0.3253, ..., -0.1780, 0.0398, 0.0627],
[-0.0008, -0.0626, -0.1380, ..., -0.1338, 0.0...
File "/data/shiraz/miniconda3/envs/vamb/lib/python3.13/site-packages/torch/_tensor.py", line 39, in wrapped
return f(*args, **kwargs)
│ │ └ {}
│ └ (tensor([ 3.0695e-02, 5.8454e-02, 1.8414e-02, ..., -4.3887e-02,
│ -7.0716e-02, -4.4601e-05], device='cuda:0'), 0.5)
└ <function Tensor.__rsub__ at 0x70057b9e9e40>
File "/data/shiraz/miniconda3/envs/vamb/lib/python3.13/site-packages/torch/_tensor.py", line 1028, in __rsub__
return _C._VariableFunctions.rsub(self, other)
│ │ │ │ └ 0.5
│ │ │ └ tensor([ 3.0695e-02, 5.8454e-02, 1.8414e-02, ..., -4.3887e-02,
│ │ │ -7.0716e-02, -4.4601e-05], device='cuda:0')
│ │ └ <staticmethod(<built-in method rsub of type object at 0x70057a8ca2a0>)>
│ └ <torch._C._VariableFunctionsClass object at 0x7005aa182ec0>
└ <module 'torch._C' from '/data/shiraz/miniconda3/envs/vamb/lib/python3.13/site-packages/torch/_C.cpython-313-x86_64-linux-gnu...
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 22.00 MiB. GPU 0 has a total capacity of 23.65 GiB of which 3.81 MiB is free. Including non-PyTorch memory, this process has 23.64 GiB memory in use. Of the allocated memory 23.11 GiB is allocated by PyTorch, and 73.38 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
The output folder contains:
model.pt, latent.npz, log.txt, and vae_clusters_metadata.tsv, but the latter only has info on ninety-something clusters, so this file was probably not finished being written. There is no clusters.tsv.
Expected behaviour:
We've clustered much larger data sets on this size of GPU (24 GB VRAM) earlier, so it's unexpected that the GPU runs out of memory.
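As a rough sanity check on that expectation, here is a back-of-the-envelope size estimate in Python. The point count and the 23.11 GiB figure are taken from the traceback above; the latent dimensionality of 32 is an assumption (it is not read from this run), and `mib` is just a helper for this sketch. One float32 distance vector over all contigs comes to about 21.3 MiB, which would be consistent with the failed 22.00 MiB request if the allocator rounds requests up to a 2 MiB multiple (again an assumption, not something the log shows).

```python
# Back-of-the-envelope GPU memory check for the clustering step.
# n_points and the 23.11 GiB figure come from the traceback in this report;
# LATENT_DIM = 32 is an assumption, not a value read from this run.

MIB = 1024 ** 2
GIB = 1024 ** 3
BYTES_PER_FLOAT32 = 4

n_points = 5_582_932   # from "ClusterGenerator(5582932 points, 97 clusters)"
LATENT_DIM = 32        # assumed latent dimensionality


def mib(n_bytes: float) -> float:
    """Convert a byte count to MiB."""
    return n_bytes / MIB


# One distance vector holds a single float32 per contig.
distance_vector_bytes = n_points * BYTES_PER_FLOAT32

# The latent matrix holds LATENT_DIM float32 values per contig.
latent_matrix_bytes = n_points * LATENT_DIM * BYTES_PER_FLOAT32

# How many distance-vector-sized tensors add up to the memory that
# PyTorch reported as allocated when the error was raised.
allocated_bytes = 23.11 * GIB
vectors_in_allocated = allocated_bytes / distance_vector_bytes

print(f"distance vector: {mib(distance_vector_bytes):.1f} MiB")
print(f"latent matrix:   {latent_matrix_bytes / GIB:.2f} GiB")
print(f"23.11 GiB is roughly {vectors_in_allocated:.0f} distance vectors")
```

Under these assumptions the latent matrix itself is well under 1 GiB, so the matrix alone does not account for 23 GiB in use. The estimate says nothing about where the remaining memory goes; it only shows that the input size by itself should fit.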
Data set size:
Total n samples: 217
Total filtered assembly size: 46GB (+ 2kb)
Total n contigs filtered: 5.6M
Total BAM file size (all samples): 808 GB
composition.npz: 1.4 GB
abundance.npz: 848MB
executed command:
vamb bin default --outdir bin --fasta assembly.fna --bamdir map -o '_' --cuda
After this failed (after fitting, during clustering), I reran from the saved composition and abundance files:
vamb bin default --composition bin/composition.npz --abundance bin/abundance.npz --outdir bin2 --cuda -o '_'
but this failed with the same error.
input data structure:
Assemblies from each sample were filtered to contigs >= 2 kb and headers were renamed as: "sampleName_contigNumber"
tail assembly.fna:
>3H3_19995
TAAATAAAGCTCAGGATGCATATGATATAGCAGAGGAAGAATACAATGCATCAAAAACAT
TAGTGGAAAGCTATAAAGCTGAAATGGATTCTTATGCTGCTTCAAGAAATGGTACAACTT
GGCACGATAGACAAGACGCATGGCAAAAAAAGTACGATGAGGCAAGAGGGGATTGGGAGA
CAGAGAAAGGTAATTTAGGGGGAAAACAGAAAAATTATGAAAACGCAAAGAATGCTTTAA
CTGAAGCTCAGAGTGTGTTAGAGAATTATAAGGATCCTCAGGAAAAAATAATTGCGGAAT
TTCTTTTAGAAATGTTGAATGTTAATCCAGATCCAACTATATTAGAGTCATATATTGAAG
AACTTAATAACAGAATCACTGAGTTGAAAACGAATACAGATATTATCGGAAATTATAGTG
AATTAGTAGCAGATACCCAGGTATTAACTGTACTTGTGAAGAATTATGTGTTTTGGGTGA
AAGAAACATCGGAGAATGATTCTGAAATTAATACAATGCTGGGAGAACTTGGAAAGGAAA
TTCAGACCCTGACTCACAAACATTTAATGAGCAATATGCGTCATGGGAAGAAACTTGGCA
GAATAGGATAAATTGTCTGGAAAACTTAATTCAGGAATTACCCCAATTTTTTGAAAGCCA
AAAAAAGGAATTAAAAAATACGGTGATTAACACTGAACTCCTATCGGAATATAATGCAAA
TGAAAAAATGAGTATACTGGATGAACTAAGGAGAAGTAAAGTTTCGGATATTAATGTTAT
TGAAAAGGTTTGTTCTTTGTTATTCGGTAAATATCGTGGAATCGCATGGTTTTCTCTGGG
ACTGGCTGTATTTTTTGATATATCTTCATTGCTGGCTGGACTTTTTATTTATGGACTTTC
AAAGAAAAAGGCTAAGAATTAGAGAAGTAGAAATGCAGAAAACATTTGAATTCGGTAACA
CAGAGGAAGGCTTCGAGACGATTGCCAAGATATTCTTTGAGCAATTGACTCAAGAGCTGG
AGAAAGCGTATAATGATCACATAAAATAACGCTGTACAGAATGCTGAGAGCGAACTATGA
GAAAGATGATGTTGTTCAGTTAGGATAAGTACGAAGGGAGAAAAGAGCTAAAGGATAAGG
CACTTAAAATACAAAAAGTCAAAGGAGAAAATAGAAGCTATGAAAAATGAAAAAATATTA
ACTATAATTAAAGGACAGGAATTTAAATTAAGCCTCAAAGACAAGATTGAAATAAACGAC
ATTTTTTATGATCAATATCTGGAAGCTGCAGCCATGCTGGAGAATATCGTGGCAAATGAG
GAACGCGATAAACAGCCAGATTGGAAGAAAGCGGAGACGGAGAATAATATTATTGCTTTT
TGCGGAGAACGTGGCGAGGGTAAAAGCAGCGCAATGTTTACTTTTATTAATGCAGTAGTT
AACGAGAAAGAGCAAAAAGAATCTACTATATTTGCGCAGTGCGAGAATGTCAAAAATACT
GTTTTTTCAGAACCTATTGTTATTGATCCTTCTGCTTTTGATAATGTACATAATGTTTTA
GATATTATTATCGCTTCGCTTTATCGGAAATTTTCTGATAAATACGATGTTTCACCTGAA
AGATTTGCTAATTATAGAAGGGAAGAGTTATTAAACGAATTTCAAAAGGTATATAAGGAT
ATTTCTTTGCTTAACGATCCTGTTAAAATGCTGGAAGAGGAATATGATTACGAGGGAAGC
ATAGAAAAGATATCAAAGATGGGGGAAAGTCTGCGGTTAAGACGTGACTTGAGTAATTTG
GTAAAGTTATATCTGGATTATATGATGACGGAAGATTCTCGTAACCAATATACTTCAAAA
AAACTTTTGATTGCAATCGATGATCTGGACATGTGCAATGCCAATGCGTATAAAATGGCT
GAACAGATACGTAAGTATTTAATTATTCCAGATATTGTCATAGTGATGGCACTTAAAGTG
GAACAATTGCAGCTTTGTGT
>3H3_19996
CTGGCATTCGCGGTCGGAAAAGATATCGCAGGTCAGACGGTTGTCAGCGACATTGCGAAG
ATGCCGCATCTTCTGATCGCGGGTGCGACCGGATCGGGTAAATCAGTCTGCATCAATACG
CTGATCATGAGTGTGATTTATAAAGCGAAGCCGAGCGAGGTCAAGCTCATCATGATCGAC
CCCAAGGTGGTTGAGTTAAGTGTATACAACGGTATTCCGCATCTTCTGATTCCGGTTGTG
ACCGACCCGAAAAAGGCGTCCGGCGCCCTCAACTGGGCGGTGGCAGAGATGACCGACCGT
TACCAGAAGTTTGCAAAATACGGCGTGCGCGATCTTAAGGGTTTCAATGCCAAGGTTGAG
TCGATCGCGGATATCGACGATCCGAAGAAACCGGAGAAACTGCCGCAGATCATCATTATC
GTGGATGAGCTTGCCGATCTGATGATGGTAGCGCCGGGCGAAGTAGAAGATTCGATCTGT
CGTCTGGCCCAGCTGGCGCGTGCAGCGGGCATTCATCTGGTGATTGCGACGCAGCGACCA
TCGGTCAATGTCATCACCGGTGTGATCAAGGCAAATATTCCGTCGAGAATTGCTTTTTCC
GTCTCTTCCGGAGTGGATTCCCGTACCATTATTGATATGAATGGCGCGGAGAAACTGCTC
GGAAAGGGCGATATGCTGTTCTATCCGTCGGGCTATCAGAAACCGCAGCGTGTACAGGGC
GCGTTTGTGTCCGACAATGAGGTTTCGGATGTGGTCGGATTTTTGAAACAGGAGGGGCTG
ACCGCAGAGTACAGCGCCGAGGTTGAGTCCAAGATCCGTTCGACGGCGATGGATGCGGGC
TTTGGCGGCGGTGAGCGCGATGCCTATTTTGCACAGGCAGGTAAATTTATTATTGAGAAA
GACAAAGCATCCATCGGCATGCTGCAGCGTATGTTTAAGATTGGCTTCAACCGTGCCGCG
CGTATCATGGATCAGCTTGCAGATGCAGGCGTGGTCGGTGAGGAAGAAGGCACGAAGCCG
CGTAAGGTGCTGATGAGCATGGAGCAGTTTGAAAACATGATGGAAGAAGGATATTAAAAA
GCGAAATCATTATAAAATATCAAAATCAGGAGGTATCCTATGAAGCTTAAAAGCTGGCTG
AAAGAATTGCCGTATACATTATTGCAGGGCAGCCTGGAGACGGAGGTTGACGAGGTGGTC
TACGATTCAAGAAAGGCGGCGCCGGGGACGGTGTTTGTGTGTATGCGCGGGGCAAACGTC
GATTCGCATACCTTTATCCCGGATGTGGTTGAAAAGGGCGCGCCGGTTCTGGTGGTCGAG
CATCCGGTTGAGGCGCCTGAGAACGTGACGGTCATTCAAGTGGAGAATGGACGGAACGCG
CTATCGCTTCTTTCGGCGGCACGTTTTGATTATCCGGCGCGGAAGATGACGGCGATCGGT
GTGACCGGCACGAAGGGAAAGACGACTACCACTTATATGATCAAGGCGATTCTGGAGGCA
GCCGGACAGAAGACGGGTCTCATCGGAACGAACGGCGCTGTGATCGGTGAGAATCATTAT
CCGACCAAAAATACGACGCCAGAATCCTACATTTTGCAAGAATATTTTGCAAAAATGGTG
GAAGCGGGCTGCCGTTACATCGTGATGGAGGTTTCTTCCCAAAGCTACCTGATGCACCGG
GTAGATGGACTTTTCTTTGATTATGGAATTTTCCTGAATATTTCCAATGATCATATTGGC
CCGAATGAACATGCAAGTTTTGAAGAATATCTTTACTACAAAAAGCAGCTTTTGAAAAAC
TGCCGGACAGCGCTCGTCAACCGCGATGATCCGTACTTTGATGCGATCGTAGAAGGGGCG
ACAGCAGAGATCCTGACCTTTTCATTGGAACAGGCGGCTGATTTTACAGCGGATGACATT
CACTATGTACGCGAACATGATTTCGTGGGCGTCGAATTTCAGACGCATGGACGATATGAG
AGCGATCTGCGTGTCGGCAT
...
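To make the naming scheme concrete, here is a minimal Python sketch of how headers like the ones above split into sample and contig at the "_" separator. `split_header` and `samples_from_headers` are hypothetical helpers written for this report, not VAMB functions, and splitting at the first occurrence of the separator is an assumption about how the binsplitter treats the name.

```python
# Sketch: split contig headers of the form "sampleName_contigNumber".
# Both helpers are hypothetical and only illustrate the naming scheme;
# splitting at the FIRST separator is an assumption.


def split_header(header: str, separator: str = "_") -> tuple[str, str]:
    """Return (sample, contig) for a FASTA header such as '>3H3_19995'."""
    fields = header.lstrip(">").split()
    name = fields[0] if fields else ""
    sample, found, contig = name.partition(separator)
    if not found or not sample or not contig:
        raise ValueError(
            f"header {header!r} is not of the form sample{separator}contig"
        )
    return sample, contig


def samples_from_headers(headers, separator: str = "_") -> set[str]:
    """Collect the distinct sample names seen in a list of headers."""
    return {split_header(h, separator)[0] for h in headers}


# The two headers shown in the tail of assembly.fna above.
headers = [">3H3_19995", ">3H3_19996"]
print(split_header(headers[0]))
print(sorted(samples_from_headers(headers)))
```

A check like this can be run over all headers to confirm that every contig carries a sample prefix and that the set of prefixes matches the BAM file names.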
ls -ltrh map/ | tail:
-rw-r--r-- 1 shiraz data 3.7G Feb 26 00:16 3G1.bam
-rw-r--r-- 1 shiraz data 3.2G Feb 26 02:14 3H2.bam
-rw-r--r-- 1 shiraz data 3.9G Feb 26 02:42 3H1.bam
-rw-r--r-- 1 shiraz data 3.8G Feb 26 02:52 3G3.bam
-rw-r--r-- 1 shiraz data 3.5G Feb 26 02:58 3G2.bam
-rw-r--r-- 1 shiraz data 3.8G Feb 26 04:58 3H3.bam
ls -ltrh fq/ | tail:
lrwxrwxrwx 1 shiraz data 90 Feb 16 09:19 3H1_1.fq.gz
lrwxrwxrwx 1 shiraz data 90 Feb 16 09:19 3H1_2.fq.gz
lrwxrwxrwx 1 shiraz data 90 Feb 16 09:19 3H2_2.fq.gz
lrwxrwxrwx 1 shiraz data 90 Feb 16 09:19 3H2_1.fq.gz
lrwxrwxrwx 1 shiraz data 90 Feb 16 09:19 3H3_1.fq.gz
lrwxrwxrwx 1 shiraz data 90 Feb 16 09:19 3H3_2.fq.gz
samtools view 3H3.bam | head:
A00821:408:HC2G5DSXY:1:1455:13883:2581 2121 1A1_1 1701 4 105H42M = 1701 0 GCAGAGAGCTTTCATCTCTGCCCAAATTTGTTTTCTGGACAA FFFFFFFFF:F:FFFFFFFFFFFFFFFFFFF:FFFFFFFFFF NM:i:3 MD:Z:24T2C13TAS:i:31 XS:i:27 SA:Z:1F10_2005,2976,-,64S79M4S,4,9; XA:Z:2C11_22171,-131,35M112S,2;2G10_11848,+662,112S35M,2;
A00821:408:HC2G5DSXY:1:1455:13856:2754 2121 1A1_1 1701 4 105H42M = 1701 0 GCAGAGAGCTTTCATCTCTGCCCAAATTTGTTTTCTGGACAA FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF, NM:i:3 MD:Z:24T2C13TAS:i:31 XS:i:27 SA:Z:1F10_2005,2976,-,64S79M4S,4,9; XA:Z:2C11_22171,-131,35M112S,2;2G10_11848,+662,112S35M,2;
A00821:408:HC2G5DSXY:1:1420:21522:22200 99 1A1_1 45086 0 113S20M14S = 45086 20 GGATTCGTTCCCTCTCGACCAGAATGTTTGCTCTCTGATTCGTTATGCAATGTATGCATGTGAGCAGCATCTTCTGGTTAAGCCCAGCGTTATCATTGCAATGGGTGAGCCGTGTGACGGAGAGCTGATGCTTCATGAGGCATACAG FFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:20 MC:Z:22S20M105S AS:i:20 XS:i:20 XA:Z:1B11_55,+195567,113S20M14S,0;
A00821:408:HC2G5DSXY:1:1421:22761:11600 99 1A1_1 45086 0 113S20M14S = 45086 20 GGATTCGTTCCCTCTCGACCAGAATGTTTGCTCTCTGATTCGTTATGCAATGTATGCATGTGAGCAGCATCTTCTGGTTAAGCCCAGCGTTATCATTGCAATGGGTGAGCCGTGTGACGGAGAGCTGATGCTTCATGAGGCATACAG FF,FFFFFF,FFF:FFFFFFFFFFFFFFF:FFFF:FFFFFFFFFFFFFFFF:FFFFFFF:F:FFFFFFFF::FF:::FFFF::F:F:FFFFFFFFF:FFFFF,FFFFFFFF,FFFFFFF:F:FFFFFFF:FFF,FF:FFF:FFF:FF NM:i:0 MD:Z:20 MC:Z:22S20M105S AS:i:20 XS:i:20 XA:Z:1B11_55,+195567,113S20M14S,0;
A00821:408:HC2G5DSXY:1:1420:21522:22200 147 1A1_1 45086 0 22S20M105S = 45086 -20 ATCATTGCAATGGGTGAGCCGTGTGACGGAGAGCTGATGCTTCATGAGGCATACAGGCAGAGTGATTATTTCGGGAATGTGCCGATTTTCCAGATTGATCCGACATATGGTCATGAACCCAAAGACTTTGAATATGTCGCTGGTCAG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:20 MC:Z:113S20M14