
Mosdepth fails when run with less than 8GB RAM on ~50MB CRAM #245

dennishendriksen opened this issue Dec 9, 2024 · 21 comments

Hi @brentp,

Mosdepth is using more memory than expected when run with a .bed file on a ~50 MB .cram file:

    local args=()
    args+=("--threads"  "1")
    # enabling/disabling this line doesn't make a difference
    # args+=("--no-per-base")
    args+=("--by" "nanopore.bed")
    args+=("--thresholds" "1,5,10,15,20,30,50,100")
    args+=("--fasta" "<cut>/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz")
    args+=("mosdepth")
    args+=("vip_fam0_HG002.cram")
    ${CMD_MOSDEPTH} "${args[@]}"

We are running mosdepth 0.3.8 using a Singularity/Apptainer image built from docker://quay.io/biocontainers/mosdepth:0.3.8--hd299d5a_0.

I was expecting mosdepth to be able to run with much less memory, based on the numbers reported in the README and your comment in #161. Running with 16 GB resolves the issue.

Data to reproduce the issue:
https://download.molgeniscloud.org/downloads/vip/_dev/github/mosdepth/
https://download.molgeniscloud.org/downloads/vip/resources/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
https://download.molgeniscloud.org/downloads/vip/images/mosdepth-0.3.8.sif

Is this the behavior that you would expect?

Best regards,
@dennishendriksen

brentp (Owner) commented Dec 13, 2024

It should require only ~1 GB of memory, perhaps 2. Do you see the same memory use if you remove --by and --thresholds?
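For a comparable peak-RSS number, GNU time can be used if it is installed, e.g.:

$ /usr/bin/time -v <mosdepth command> 2>&1 | grep 'Maximum resident'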

dennishendriksen (Author) commented Dec 16, 2024

Hi @brentp,

I did some tests and this is the median mosdepth physical memory usage:

8.3GB   # with    --by and --thresholds
6.8GB   # without --by and --thresholds

So there is a difference, but both values are way outside the 1-2 GB range. Any thoughts? I'm not familiar with Nim, but I was wondering whether this could be related to a memory-management setting?

Greetings,
@dennishendriksen

LudvigOlsen (Contributor):

When running mosdepth on a ~280 MB BAM file (hg38) using a ~2 GB .bed.gz file (~8 GB uncompressed) in --by (and setting --mapq 20 --min-frag-len 100 --max-frag-len 220 --no-per-base), my system uses >40 GB of memory.

When running mosdepth, the "Max mem used" slowly builds up to that level and then stops increasing (perhaps once it is done with chromosome 1).

I'm considering whether I should chunk the BED file to reduce the RAM usage, although splitting it into multiple calls might increase the running time.

Note that I am using a fork of mosdepth, but I don't think I've changed anything relevant to RAM usage in the standard use case.

brentp (Owner) commented Jan 9, 2025

That sounds excessive! Can you verify that mosdepth from the releases here has the same memory usage?

LudvigOlsen (Contributor):

@brentp I just installed mosdepth via bioconda into a new environment. It uses the same amount of RAM, unfortunately.

The files are available online as part of a preprint, so it should be easy for you to reproduce:

The BED file is available here: https://zenodo.org/records/14215762
It is called "whole_genome.mappable.binned_10bp.bed.gz"

The test BAM file is available here: https://zenodo.org/records/13909979

Something like the following:

$ mosdepth --by whole_genome.mappable.binned_10bp.bed.gz --threads 1 --mapq 20 --min-frag-len 100 --max-frag-len 220 --no-per-base <OUTPUT_PATH> IC38.hg38.downsampled.aligned.sorted.markdup.bam

brentp (Owner) commented Jan 9, 2025

OK, just a quick point: does whole_genome.mappable.binned_10bp.bed.gz have, for example, 24.9 million 10 bp intervals for chromosome 1? That would be the reason. I do not optimize for BED files of this size, so that is likely the issue.
You can also use --by 10 and then filter to the mappable regions afterwards.
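For example (illustrative only; bedtools and a BED of mappable regions are assumed, and mosdepth writes the per-window output to <prefix>.regions.bed.gz when --by is used):

$ bedtools intersect -u -a <prefix>.regions.bed.gz -b mappable_regions.bed > mappable_bins.bed

(-f 1.0 could be added to keep only bins fully contained in a mappable region.)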

LudvigOlsen (Contributor):

Something on that scale, yes! Would --by 10 have a much lower memory requirement?

I need to be sure I always have the same 10 bp intervals, as I use a lot of other reference information for the same bins. But if that behavior is stable, I could perhaps recompute everything to use the mosdepth bins. It would be a lot of work, but if it heavily reduces memory usage, it could be the best approach.

brentp (Owner) commented Jan 9, 2025

Yes, --by 10 should have a lower memory requirement, and it is consistent: it will always create 0..10, 10..20, etc.
I am downloading your test data on a slow connection so it will take a while, but you could try --by 10.
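That is, the command above with the BED file swapped for a fixed window size:

$ mosdepth --by 10 --threads 1 --mapq 20 --min-frag-len 100 --max-frag-len 220 --no-per-base <OUTPUT_PATH> IC38.hg38.downsampled.aligned.sorted.markdup.bam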

LudvigOlsen (Contributor):

Perfect, thanks! It spent 5 min and <4 GB, compared to 15 min and 42 GB. That would probably be worth rerunning everything for!

brentp (Owner) commented Jan 9, 2025

Phew! Glad to hear it! I am surprised that the BED file is so bad. I am looking into it.

LudvigOlsen (Contributor):

Great! I'm lucky that all my intervals are 10 bp, but of course it would be great to have similar performance for BED files, if possible :-)

LudvigOlsen (Contributor) commented Jan 13, 2025

EDIT: Disregard this post.


I tried disabling the sorting in bed_to_table(). It seems to have done the trick (3.14 GB RAM, 4 min). If you document clearly that the BED file must already be sorted (or perhaps add a --sort flag?), shouldn't it be fine to disable this? @brentp

proc bed_to_table(bed: string): TableRef[string, seq[region_t]] =
  var bed_regions = newTable[string, seq[region_t]]()
  var kstr = kstring_t(l:0, m: 0, s: nil)
  var hf = hts_open(cstring(bed), "r")
  while hts_getline(hf, cint(10), addr kstr) > 0:
    if kstr.s[0] == 't' and ($kstr.s).startswith("track "):
      continue
    if kstr.s[0] == '#':
      continue
    var v = bed_line_to_region($kstr.s)
    if v == nil: continue
    bed_regions.mgetOrPut(v.chrom, new_seq[region_t]()).add(v)

  #  ---- DISABLED THIS SORTING ----
  # since it is read into mem, can also well sort.
  #for chrom, ivs in bed_regions.mpairs:
  #    sort(ivs, proc (a, b: region_t): int = int(a.start) - int(b.start))

  hts.free(kstr.s)
  return bed_regions

I should say, I haven't checked that the output is the same, but my BED file is already sorted, so I assume it should be.
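(A quick check would be something like cmp <(zcat old.regions.bed.gz) <(zcat new.regions.bed.gz), assuming the two runs wrote to different output prefixes.)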

I can make a PR if you want it (e.g., with the --sort option).


EDIT: I guess this could be considered a breaking change. In that case, would a --dont-sort option be better? And what should the default be, given that the cost of sorting here (instead of beforehand) is so much higher (assuming we run on more than one sample)?

brentp (Owner) commented Jan 13, 2025

Hmm, good sleuthing! This should not increase the memory, so there should be a way we can improve it without a flag. I'll have a look this week.

brentp (Owner) commented Jan 15, 2025

So removing that section does not change the memory use for me! Are you sure that's the issue?

LudvigOlsen (Contributor):

Hmm.. Let me just check that it properly called the version I had changed.

LudvigOlsen (Contributor):

ahhh, damnit, I was still testing with --by 10... Sorry about that. Afternoon findings are unreliable, I guess :D

I don't really get it though. Aren't you just iterating over the intervals similarly to what --by 10 does?

Btw, unrelated, but I was thinking of adding an option (in my fork at least) for calculating fragment midpoint coverage, so it can be used for things like cfDNA profiling of transcription factor binding sites. I noticed that positions in between paired reads are not counted in the coverage. That seems to be the standard definition of coverage, but wouldn't it be easy to calculate the actual fragment coverage using something like arr[min(rec.start, rec.matepos)] += 1; arr[min(rec.start, rec.matepos) + rec.isize] -= 1; and then skipping the mate read? (This approach could be selectable via a flag, of course.)
I don't want to clutter this issue with that, though, so if it's an interesting discussion, should I open a separate issue for it? There might be a good reason not to count the middle positions that I'm not aware of.
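A minimal, self-contained sketch of that difference-array idea (plain Nim, no hts-nim; the Frag type and field names are just stand-ins for the real record fields, and the rec.start == rec.matepos case still needs deduplication):

type
  Frag = object      # simplified stand-in for a paired-end record
    start: int       # leftmost position of this read
    matePos: int     # leftmost position of its mate
    isize: int       # template (fragment) length as reported in the BAM

proc addFragment(depth: var seq[int32], rec: Frag) =
  # count the whole fragment span once using the difference-array trick;
  # only the leftmost read of each pair contributes, the mate is skipped
  if rec.isize <= 0 or rec.start > rec.matePos: return
  let stop = min(rec.start + rec.isize, depth.len)
  depth[rec.start] += 1
  if stop < depth.len: depth[stop] -= 1

when isMainModule:
  var depth = newSeq[int32](1_000)
  addFragment(depth, Frag(start: 100, matePos: 300, isize: 260))
  for i in 1 ..< depth.len:   # prefix sum -> per-base fragment coverage
    depth[i] += depth[i-1]
  echo depth[150], " ", depth[400]   # 1 and 0: inside vs. outside the 100..360 fragment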

brentp (Owner) commented Jan 15, 2025

For --by with a BED file, it reads everything into memory, and Nim strings use 64 bits for the length, 64 bits for the capacity, and n bytes for the actual data.
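As a rough back-of-envelope (the per-interval cost is an assumption, not a measurement): an ~8 GB uncompressed BED at roughly 25-30 bytes per line is on the order of 300 million intervals; if each in-memory region costs somewhere around 100-130 bytes (the chrom string header and data, the start/stop fields, plus allocator/GC overhead), that already lands in the 30-40 GB range, which matches the numbers reported above.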

What you describe for fragment coverage could work under --fast-mode, where we ignore CIGAR operations.

LudvigOlsen (Contributor) commented Jan 15, 2025

The memory use seems to grow over time, as if it is increasing as it moves through the chromosome or something. So, unless it is incredibly slow at reading/uncompressing the BED file, it may not be from the initial loading into memory?

> What you describe for fragment coverage could work under --fast-mode, where we ignore CIGAR operations.

Right, so if it's a proper pair and the rec.start position is before rec.matepos, then count it. (What is the difference between .matepos and .mate_pos, by the way? I was only able to find mate_pos in the docs.)
I will think about what to do when rec.start == rec.matepos, so we only need to track those reads for deduplication. I will make a PR to move the discussion to, and then you can always reject the idea if it isn't meaningful.

brentp (Owner) commented Jan 15, 2025

Yes, I'm open to a PR; please keep me posted. I'm picky about adding features and don't want you to do a bunch of work only to run into my hesitations.

LudvigOlsen (Contributor) commented Jan 15, 2025

I fully understand the need for having only essential functionality from a maintenance and usability perspective. That's why I have a fork with the functionality I need. This seems generally relevant though.

LudvigOlsen (Contributor):

If I quit just after calling bed_to_table(), it reaches a maximum memory usage of 39.45 GB. Nothing I tried reduced this, though (my Nim knowledge is also very limited). Perhaps some per-chromosome chunking is necessary?
