Out of CUDA memory on 24 GB GPU and 200-sample data set #401

@shiraz-shah

Dear VAMB team,
With the latest development branch of VAMB I'm getting a CUDA out-of-memory error after fitting, in the middle of clustering:

2025-03-02 03:17:13.478 | INFO    | 		Epoch: 300  Loss: 6.42182e-01  CE: 5.69886e-01  AB: 1.67774e-03  SSE: 4.97392e-02  KLD: 2.09010e-02  Batchsize: 4096
2025-03-02 03:17:13.483 | INFO    | 	Encoding to latent representation
2025-03-02 03:19:42.538 | INFO    | 	Trained VAE and encoded in 13238.23 seconds.

2025-03-02 03:19:42.538 | INFO    | Clustering
2025-03-02 03:19:42.538 | INFO    | 	Windowsize: 300
2025-03-02 03:19:42.538 | INFO    | 	Min successful thresholds detected: 15
2025-03-02 03:19:42.538 | INFO    | 	Max clusters: None
2025-03-02 03:19:42.538 | INFO    | 	Use CUDA for clustering: True
2025-03-02 03:19:42.538 | INFO    | 	Binsplitter: "_"
2025-03-02 03:20:01.036 | ERROR   | An error has been caught in function 'main', process 'MainProcess' (95274), thread 'MainThread' (123169634703168):
Traceback (most recent call last):

  File "/data/shiraz/miniconda3/envs/vamb/bin/vamb", line 8, in <module>
    sys.exit(main())
    │   │    └ <function main at 0x700490ba8e00>
    │   └ <built-in function exit>
    └ <module 'sys' (built-in)>

> File "/home/shiraz/src/vamb/vamb/__main__.py", line 2398, in main
    run(runner, opt.common.general)
    │   │       │   │      └ <vamb.__main__.GeneralOptions object at 0x700490b969d0>
    │   │       │   └ <vamb.__main__.BinnerCommonOptions object at 0x700490ade510>
    │   │       └ <vamb.__main__.BinDefaultOptions object at 0x700490ade660>
    │   └ functools.partial(<function run_bin_default at 0x700490ba8220>, <vamb.__main__.BinDefaultOptions object at 0x700490ade660>)
    └ <function run at 0x700490b7f060>

  File "/home/shiraz/src/vamb/vamb/__main__.py", line 700, in run
    runner()
    └ functools.partial(<function run_bin_default at 0x700490ba8220>, <vamb.__main__.BinDefaultOptions object at 0x700490ade660>)

  File "/home/shiraz/src/vamb/vamb/__main__.py", line 1332, in run_bin_default
    cluster_and_write_files(
    └ <function cluster_and_write_files at 0x700490b7fb00>

  File "/home/shiraz/src/vamb/vamb/__main__.py", line 1199, in cluster_and_write_files
    for i, cluster in enumerate(clusters):
        │  │                    └ <itertools.islice object at 0x7004930999e0>
        │  └ <vamb.cluster.Cluster object at 0x700490b43e60>
        └ 96

  File "/home/shiraz/src/vamb/vamb/cluster.py", line 297, in __next__
    cluster, _, points = self.find_cluster()
                         │    └ <function ClusterGenerator.find_cluster at 0x7004c0f34fe0>
                         └ ClusterGenerator(5582932 points, 97 clusters)

  File "/home/shiraz/src/vamb/vamb/cluster.py", line 542, in find_cluster
    medoid, distances = self.wander_medoid(seed)
    │                   │    │             └ 1756270
    │                   │    └ <function ClusterGenerator.wander_medoid at 0x7004c0f34ea0>
    │                   └ ClusterGenerator(5582932 points, 97 clusters)
    └ 1596310

  File "/home/shiraz/src/vamb/vamb/cluster.py", line 424, in wander_medoid
    sample_cluster, sample_distances, sample_density = self.sample_medoid(
    │               │                                  │    └ <function ClusterGenerator.sample_medoid at 0x7004c0f35080>
    │               │                                  └ ClusterGenerator(5582932 points, 97 clusters)
    │               └ tensor([0.4708, 0.4492, 0.5287,  ..., 0.5651, 0.5209, 0.4904], device='cuda:0')
    └ tensor([ 590411,  637452,  956697,  958690,  965290,  985399,  987508, 1300302,
              1303055, 1334212, 1601852, 1603650, 1...

  File "/home/shiraz/src/vamb/vamb/cluster.py", line 613, in sample_medoid
    distances = _calc_distances(self.matrix, medoid)
                │               │    │       └ 3758147
                │               │    └ <member 'matrix' of 'ClusterGenerator' objects>
                │               └ ClusterGenerator(5582932 points, 97 clusters)
                └ <function _calc_distances at 0x7004c0f351c0>

  File "/home/shiraz/src/vamb/vamb/cluster.py", line 664, in _calc_distances
    dists = 0.5 - matrix.matmul(matrix[index])
                  │      │      │      └ 3758147
                  │      │      └ tensor([[-0.1746,  0.0333, -0.3253,  ..., -0.1780,  0.0398,  0.0627],
                  │      │                [-0.0008, -0.0626, -0.1380,  ..., -0.1338,  0.0...
                  │      └ <method 'matmul' of 'torch._C.TensorBase' objects>
                  └ tensor([[-0.1746,  0.0333, -0.3253,  ..., -0.1780,  0.0398,  0.0627],
                            [-0.0008, -0.0626, -0.1380,  ..., -0.1338,  0.0...

  File "/data/shiraz/miniconda3/envs/vamb/lib/python3.13/site-packages/torch/_tensor.py", line 39, in wrapped
    return f(*args, **kwargs)
           │  │       └ {}
           │  └ (tensor([ 3.0695e-02,  5.8454e-02,  1.8414e-02,  ..., -4.3887e-02,
           │            -7.0716e-02, -4.4601e-05], device='cuda:0'), 0.5)
           └ <function Tensor.__rsub__ at 0x70057b9e9e40>
  File "/data/shiraz/miniconda3/envs/vamb/lib/python3.13/site-packages/torch/_tensor.py", line 1028, in __rsub__
    return _C._VariableFunctions.rsub(self, other)
           │  │                  │    │     └ 0.5
           │  │                  │    └ tensor([ 3.0695e-02,  5.8454e-02,  1.8414e-02,  ..., -4.3887e-02,
           │  │                  │              -7.0716e-02, -4.4601e-05], device='cuda:0')
           │  │                  └ <staticmethod(<built-in method rsub of type object at 0x70057a8ca2a0>)>
           │  └ <torch._C._VariableFunctionsClass object at 0x7005aa182ec0>
           └ <module 'torch._C' from '/data/shiraz/miniconda3/envs/vamb/lib/python3.13/site-packages/torch/_C.cpython-313-x86_64-linux-gnu...

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 22.00 MiB. GPU 0 has a total capacity of 23.65 GiB of which 3.81 MiB is free. Including non-PyTorch memory, this process has 23.64 GiB memory in use. Of the allocated memory 23.11 GiB is allocated by PyTorch, and 73.38 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The output folder contains:

model.pt, latent.npz, log.txt, and vae_clusters_metadata.tsv, but the last of these only has info on ninety-something clusters, so it was probably not finished being written. There is no clusters.tsv.

Expected behaviour:

We've clustered much larger data sets on this size of GPU (24 GB VRAM) earlier, so it's unexpected that the GPU runs out of memory.
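As a rough sanity check on the numbers in the traceback, the sketch below estimates how much GPU memory a single distance vector and the latent matrix should need. The point count (5582932) and the memory figures come from the log above; float32 storage and a 32-dimensional latent space are assumptions, not something the log confirms.

```python
# Back-of-the-envelope GPU memory estimate for the clustering step.
# ASSUMPTIONS (not confirmed by the log): values are stored as float32
# and the latent space has 32 dimensions.
N_POINTS = 5_582_932        # from "ClusterGenerator(5582932 points, 97 clusters)"
LATENT_DIMS = 32            # assumed
BYTES_PER_FLOAT32 = 4       # assumed dtype
MIB = 1024 ** 2
GIB = 1024 ** 3

ALLOCATED_GIB = 23.11       # "allocated by PyTorch" in the error message


def distance_vector_bytes(n_points: int) -> int:
    """Bytes needed for one distance value per point."""
    return n_points * BYTES_PER_FLOAT32


def latent_matrix_bytes(n_points: int, n_dims: int) -> int:
    """Bytes needed for the full (n_points x n_dims) latent matrix."""
    return n_points * n_dims * BYTES_PER_FLOAT32


distance_mib = distance_vector_bytes(N_POINTS) / MIB
matrix_gib = latent_matrix_bytes(N_POINTS, LATENT_DIMS) / GIB
# How many distance-sized vectors would fit in the allocated memory.
vectors_equivalent = ALLOCATED_GIB * GIB / distance_vector_bytes(N_POINTS)

print(f"one distance vector: {distance_mib:.1f} MiB")
print(f"latent matrix:       {matrix_gib:.2f} GiB")
print(f"allocated memory ~ {vectors_equivalent:.0f} distance vectors")
```

Under these assumptions one distance vector is about 21.3 MiB, which is close to the 22.00 MiB allocation that failed, and the latent matrix is well under 1 GiB. Neither accounts for 23.11 GiB on its own; that much memory corresponds to roughly a thousand distance-sized vectors, which would be consistent with intermediate tensors accumulating during clustering, although the log alone does not prove that.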

Data set size:

Total n samples: 217
Total filtered assembly size: 46 GB (contigs >= 2 kb)
Total n contigs filtered: 5.6M
Total BAM file size (all samples): 808 GB
composition.npz: 1.4 GB
abundance.npz: 848MB

executed command:

vamb bin default --outdir bin --fasta assembly.fna --bamdir map -o '_' --cuda
Upon failure (after fitting, during clustering), I ran it again from the saved composition and abundance files:
vamb bin default --composition bin/composition.npz --abundance bin/abundance.npz --outdir bin2 --cuda -o '_'
but it failed with the same error.

input data structure:

Assemblies from each sample were filtered to contigs >= 2 kb and headers were renamed as "sampleName_contigNumber".
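To illustrate the header convention and how it relates to the '_' separator passed with -o, here is a minimal sketch that splits a header into sample name and contig number. The helper `split_header` is hypothetical, and splitting on the first occurrence of the separator is an assumption; VAMB's own parsing may differ.

```python
# Hypothetical helper illustrating the "sampleName_contigNumber" convention.
# ASSUMPTION: the sample name is everything before the FIRST separator.


def split_header(header: str, separator: str = "_") -> tuple[str, str]:
    """Split a FASTA header like '>3H3_19995' into (sample, contig)."""
    # Drop the leading '>' if present, plus surrounding whitespace.
    name = header.strip()
    if name.startswith(">"):
        name = name[1:]
    sample, found, contig = name.partition(separator)
    if not found or not sample or not contig:
        raise ValueError(f"header {header!r} does not match sample{separator}contig")
    return sample, contig


headers = [">3H3_19995", ">3H3_19996"]  # taken from the tail of assembly.fna
samples = {split_header(h)[0] for h in headers}
print(samples)
```

With this convention a sample name that itself contains the separator would be split in the wrong place, so it may be worth checking that none of the 217 sample names contain an underscore.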

tail assembly.fna:

>3H3_19995
TAAATAAAGCTCAGGATGCATATGATATAGCAGAGGAAGAATACAATGCATCAAAAACAT
TAGTGGAAAGCTATAAAGCTGAAATGGATTCTTATGCTGCTTCAAGAAATGGTACAACTT
GGCACGATAGACAAGACGCATGGCAAAAAAAGTACGATGAGGCAAGAGGGGATTGGGAGA
CAGAGAAAGGTAATTTAGGGGGAAAACAGAAAAATTATGAAAACGCAAAGAATGCTTTAA
CTGAAGCTCAGAGTGTGTTAGAGAATTATAAGGATCCTCAGGAAAAAATAATTGCGGAAT
TTCTTTTAGAAATGTTGAATGTTAATCCAGATCCAACTATATTAGAGTCATATATTGAAG
AACTTAATAACAGAATCACTGAGTTGAAAACGAATACAGATATTATCGGAAATTATAGTG
AATTAGTAGCAGATACCCAGGTATTAACTGTACTTGTGAAGAATTATGTGTTTTGGGTGA
AAGAAACATCGGAGAATGATTCTGAAATTAATACAATGCTGGGAGAACTTGGAAAGGAAA
TTCAGACCCTGACTCACAAACATTTAATGAGCAATATGCGTCATGGGAAGAAACTTGGCA
GAATAGGATAAATTGTCTGGAAAACTTAATTCAGGAATTACCCCAATTTTTTGAAAGCCA
AAAAAAGGAATTAAAAAATACGGTGATTAACACTGAACTCCTATCGGAATATAATGCAAA
TGAAAAAATGAGTATACTGGATGAACTAAGGAGAAGTAAAGTTTCGGATATTAATGTTAT
TGAAAAGGTTTGTTCTTTGTTATTCGGTAAATATCGTGGAATCGCATGGTTTTCTCTGGG
ACTGGCTGTATTTTTTGATATATCTTCATTGCTGGCTGGACTTTTTATTTATGGACTTTC
AAAGAAAAAGGCTAAGAATTAGAGAAGTAGAAATGCAGAAAACATTTGAATTCGGTAACA
CAGAGGAAGGCTTCGAGACGATTGCCAAGATATTCTTTGAGCAATTGACTCAAGAGCTGG
AGAAAGCGTATAATGATCACATAAAATAACGCTGTACAGAATGCTGAGAGCGAACTATGA
GAAAGATGATGTTGTTCAGTTAGGATAAGTACGAAGGGAGAAAAGAGCTAAAGGATAAGG
CACTTAAAATACAAAAAGTCAAAGGAGAAAATAGAAGCTATGAAAAATGAAAAAATATTA
ACTATAATTAAAGGACAGGAATTTAAATTAAGCCTCAAAGACAAGATTGAAATAAACGAC
ATTTTTTATGATCAATATCTGGAAGCTGCAGCCATGCTGGAGAATATCGTGGCAAATGAG
GAACGCGATAAACAGCCAGATTGGAAGAAAGCGGAGACGGAGAATAATATTATTGCTTTT
TGCGGAGAACGTGGCGAGGGTAAAAGCAGCGCAATGTTTACTTTTATTAATGCAGTAGTT
AACGAGAAAGAGCAAAAAGAATCTACTATATTTGCGCAGTGCGAGAATGTCAAAAATACT
GTTTTTTCAGAACCTATTGTTATTGATCCTTCTGCTTTTGATAATGTACATAATGTTTTA
GATATTATTATCGCTTCGCTTTATCGGAAATTTTCTGATAAATACGATGTTTCACCTGAA
AGATTTGCTAATTATAGAAGGGAAGAGTTATTAAACGAATTTCAAAAGGTATATAAGGAT
ATTTCTTTGCTTAACGATCCTGTTAAAATGCTGGAAGAGGAATATGATTACGAGGGAAGC
ATAGAAAAGATATCAAAGATGGGGGAAAGTCTGCGGTTAAGACGTGACTTGAGTAATTTG
GTAAAGTTATATCTGGATTATATGATGACGGAAGATTCTCGTAACCAATATACTTCAAAA
AAACTTTTGATTGCAATCGATGATCTGGACATGTGCAATGCCAATGCGTATAAAATGGCT
GAACAGATACGTAAGTATTTAATTATTCCAGATATTGTCATAGTGATGGCACTTAAAGTG
GAACAATTGCAGCTTTGTGT
>3H3_19996
CTGGCATTCGCGGTCGGAAAAGATATCGCAGGTCAGACGGTTGTCAGCGACATTGCGAAG
ATGCCGCATCTTCTGATCGCGGGTGCGACCGGATCGGGTAAATCAGTCTGCATCAATACG
CTGATCATGAGTGTGATTTATAAAGCGAAGCCGAGCGAGGTCAAGCTCATCATGATCGAC
CCCAAGGTGGTTGAGTTAAGTGTATACAACGGTATTCCGCATCTTCTGATTCCGGTTGTG
ACCGACCCGAAAAAGGCGTCCGGCGCCCTCAACTGGGCGGTGGCAGAGATGACCGACCGT
TACCAGAAGTTTGCAAAATACGGCGTGCGCGATCTTAAGGGTTTCAATGCCAAGGTTGAG
TCGATCGCGGATATCGACGATCCGAAGAAACCGGAGAAACTGCCGCAGATCATCATTATC
GTGGATGAGCTTGCCGATCTGATGATGGTAGCGCCGGGCGAAGTAGAAGATTCGATCTGT
CGTCTGGCCCAGCTGGCGCGTGCAGCGGGCATTCATCTGGTGATTGCGACGCAGCGACCA
TCGGTCAATGTCATCACCGGTGTGATCAAGGCAAATATTCCGTCGAGAATTGCTTTTTCC
GTCTCTTCCGGAGTGGATTCCCGTACCATTATTGATATGAATGGCGCGGAGAAACTGCTC
GGAAAGGGCGATATGCTGTTCTATCCGTCGGGCTATCAGAAACCGCAGCGTGTACAGGGC
GCGTTTGTGTCCGACAATGAGGTTTCGGATGTGGTCGGATTTTTGAAACAGGAGGGGCTG
ACCGCAGAGTACAGCGCCGAGGTTGAGTCCAAGATCCGTTCGACGGCGATGGATGCGGGC
TTTGGCGGCGGTGAGCGCGATGCCTATTTTGCACAGGCAGGTAAATTTATTATTGAGAAA
GACAAAGCATCCATCGGCATGCTGCAGCGTATGTTTAAGATTGGCTTCAACCGTGCCGCG
CGTATCATGGATCAGCTTGCAGATGCAGGCGTGGTCGGTGAGGAAGAAGGCACGAAGCCG
CGTAAGGTGCTGATGAGCATGGAGCAGTTTGAAAACATGATGGAAGAAGGATATTAAAAA
GCGAAATCATTATAAAATATCAAAATCAGGAGGTATCCTATGAAGCTTAAAAGCTGGCTG
AAAGAATTGCCGTATACATTATTGCAGGGCAGCCTGGAGACGGAGGTTGACGAGGTGGTC
TACGATTCAAGAAAGGCGGCGCCGGGGACGGTGTTTGTGTGTATGCGCGGGGCAAACGTC
GATTCGCATACCTTTATCCCGGATGTGGTTGAAAAGGGCGCGCCGGTTCTGGTGGTCGAG
CATCCGGTTGAGGCGCCTGAGAACGTGACGGTCATTCAAGTGGAGAATGGACGGAACGCG
CTATCGCTTCTTTCGGCGGCACGTTTTGATTATCCGGCGCGGAAGATGACGGCGATCGGT
GTGACCGGCACGAAGGGAAAGACGACTACCACTTATATGATCAAGGCGATTCTGGAGGCA
GCCGGACAGAAGACGGGTCTCATCGGAACGAACGGCGCTGTGATCGGTGAGAATCATTAT
CCGACCAAAAATACGACGCCAGAATCCTACATTTTGCAAGAATATTTTGCAAAAATGGTG
GAAGCGGGCTGCCGTTACATCGTGATGGAGGTTTCTTCCCAAAGCTACCTGATGCACCGG
GTAGATGGACTTTTCTTTGATTATGGAATTTTCCTGAATATTTCCAATGATCATATTGGC
CCGAATGAACATGCAAGTTTTGAAGAATATCTTTACTACAAAAAGCAGCTTTTGAAAAAC
TGCCGGACAGCGCTCGTCAACCGCGATGATCCGTACTTTGATGCGATCGTAGAAGGGGCG
ACAGCAGAGATCCTGACCTTTTCATTGGAACAGGCGGCTGATTTTACAGCGGATGACATT
CACTATGTACGCGAACATGATTTCGTGGGCGTCGAATTTCAGACGCATGGACGATATGAG
AGCGATCTGCGTGTCGGCAT
...

ls -ltrh map/ | tail:

-rw-r--r-- 1 shiraz data 3.7G Feb 26 00:16 3G1.bam
-rw-r--r-- 1 shiraz data 3.2G Feb 26 02:14 3H2.bam
-rw-r--r-- 1 shiraz data 3.9G Feb 26 02:42 3H1.bam
-rw-r--r-- 1 shiraz data 3.8G Feb 26 02:52 3G3.bam
-rw-r--r-- 1 shiraz data 3.5G Feb 26 02:58 3G2.bam
-rw-r--r-- 1 shiraz data 3.8G Feb 26 04:58 3H3.bam

ls -ltrh fq/ | tail:

lrwxrwxrwx 1 shiraz data 90 Feb 16 09:19 3H1_1.fq.gz
lrwxrwxrwx 1 shiraz data 90 Feb 16 09:19 3H1_2.fq.gz
lrwxrwxrwx 1 shiraz data 90 Feb 16 09:19 3H2_2.fq.gz
lrwxrwxrwx 1 shiraz data 90 Feb 16 09:19 3H2_1.fq.gz
lrwxrwxrwx 1 shiraz data 90 Feb 16 09:19 3H3_1.fq.gz
lrwxrwxrwx 1 shiraz data 90 Feb 16 09:19 3H3_2.fq.gz

samtools view 3H3.bam | head:

A00821:408:HC2G5DSXY:1:1455:13883:2581	2121	1A1_1	1701	4	105H42M	=	1701	0	GCAGAGAGCTTTCATCTCTGCCCAAATTTGTTTTCTGGACAA	FFFFFFFFF:F:FFFFFFFFFFFFFFFFFFF:FFFFFFFFFF	NM:i:3	MD:Z:24T2C13TAS:i:31	XS:i:27	SA:Z:1F10_2005,2976,-,64S79M4S,4,9;	XA:Z:2C11_22171,-131,35M112S,2;2G10_11848,+662,112S35M,2;
A00821:408:HC2G5DSXY:1:1455:13856:2754	2121	1A1_1	1701	4	105H42M	=	1701	0	GCAGAGAGCTTTCATCTCTGCCCAAATTTGTTTTCTGGACAA	FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,	NM:i:3	MD:Z:24T2C13TAS:i:31	XS:i:27	SA:Z:1F10_2005,2976,-,64S79M4S,4,9;	XA:Z:2C11_22171,-131,35M112S,2;2G10_11848,+662,112S35M,2;
A00821:408:HC2G5DSXY:1:1420:21522:22200	99	1A1_1	45086	0	113S20M14S	=	45086	20	GGATTCGTTCCCTCTCGACCAGAATGTTTGCTCTCTGATTCGTTATGCAATGTATGCATGTGAGCAGCATCTTCTGGTTAAGCCCAGCGTTATCATTGCAATGGGTGAGCCGTGTGACGGAGAGCTGATGCTTCATGAGGCATACAG	FFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:20	MC:Z:22S20M105S	AS:i:20	XS:i:20	XA:Z:1B11_55,+195567,113S20M14S,0;
A00821:408:HC2G5DSXY:1:1421:22761:11600	99	1A1_1	45086	0	113S20M14S	=	45086	20	GGATTCGTTCCCTCTCGACCAGAATGTTTGCTCTCTGATTCGTTATGCAATGTATGCATGTGAGCAGCATCTTCTGGTTAAGCCCAGCGTTATCATTGCAATGGGTGAGCCGTGTGACGGAGAGCTGATGCTTCATGAGGCATACAG	FF,FFFFFF,FFF:FFFFFFFFFFFFFFF:FFFF:FFFFFFFFFFFFFFFF:FFFFFFF:F:FFFFFFFF::FF:::FFFF::F:F:FFFFFFFFF:FFFFF,FFFFFFFF,FFFFFFF:F:FFFFFFF:FFF,FF:FFF:FFF:FF	NM:i:0	MD:Z:20	MC:Z:22S20M105S	AS:i:20	XS:i:20	XA:Z:1B11_55,+195567,113S20M14S,0;
A00821:408:HC2G5DSXY:1:1420:21522:22200	147	1A1_1	45086	0	22S20M105S	=	45086	-20	ATCATTGCAATGGGTGAGCCGTGTGACGGAGAGCTGATGCTTCATGAGGCATACAGGCAGAGTGATTATTTCGGGAATGTGCCGATTTTCCAGATTGATCCGACATATGGTCATGAACCCAAAGACTTTGAATATGTCGCTGGTCAG	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:20	MC:Z:113S20M14

Labels: Needs more info (issue cannot be resolved until we get more information)
