Iterator query over network stuck with CRAM on FTP #1877

Open
rick-heig opened this issue Jan 22, 2025 · 7 comments · May be fixed by #1878

rick-heig commented Jan 22, 2025

Hello,
I am accessing CRAM files over the network, and sometimes sam_itr_querys() gets stuck indefinitely (while still downloading data).

I have tested HTSlib 1.16 and 1.21 (checked out from the git tags) and get the same behaviour in both.

This may be related to issue #604.

I open my files the following way and iterate over regions with sam_itr_querys():

        htsFile *fp = hts_open(cram_file.c_str(), "r");
        if (!fp) {
            std::string error("Cannot open ");
            error += cram_file;
            throw DataCallerError(error);
        }
        hts_idx_t *idx = sam_index_load2(fp, cram_file.c_str(), (cram_file + ".crai").c_str());
        if (!idx) {
            throw DataCallerError(std::string("Failed to load index file"));
        }
        sam_hdr_t *hdr = sam_hdr_read(fp);
        if (!hdr) {
            std::string error("Failed to read header from file ");
            error += cram_file;
            throw DataCallerError(error);
        }

        hts_itr_t *iter = NULL; /* initialise so the first-iteration check is safe */
        while (...) { /* Iterate over many regions */
            if (iter) {
                 sam_itr_destroy(iter);
                 iter = NULL;
            }
            iter = sam_itr_querys(idx, hdr, region.c_str()); /* reuse the outer iter, no redeclaration */
            ... do some work, e.g., pile up of reads ...
        }
       

Sometimes it works and I can access the CRAM file data; other times it gets stuck and runs indefinitely, downloading data continuously. If I rerun, the query normally returns quickly and downloads only a little data.

When I interrupt my program I get the following backtrace:

  * frame #0: 0x00007ff80ba8dd1a libsystem_kernel.dylib`__select + 10
    frame #1: 0x000000010010430c phase_caller`wait_perform(fp=0x000000010124c340) at hfile_libcurl.c:729:17 [opt]
    frame #2: 0x0000000100105710 phase_caller`libcurl_read(fpv=0x000000010124c340, bufferv=0x0000000102809000, nbytes=<unavailable>) at hfile_libcurl.c:834:17 [opt]
    frame #3: 0x0000000100049d86 phase_caller`refill_buffer(fp=0x000000010124c340) at hfile.c:186:13 [opt]
    frame #4: 0x000000010004a0ee phase_caller`hread2(fp=<unavailable>, destv=0x0000700007d75960, nbytes=43, nread=65493) at hfile.c:339:23 [opt]
    frame #5: 0x00000001000ca179 phase_caller`cram_seek [inlined] hread(fp=0x000000010124c340, buffer=0x0000700007d75960, nbytes=65536) at hfile.h:244:56 [opt]
    frame #6: 0x00000001000ca127 phase_caller`cram_seek(fd=<unavailable>, offset=11493247130, whence=<unavailable>) at cram_io.c:5453:20 [opt]
    frame #7: 0x00000001000bea42 phase_caller`cram_seek_to_refpos(fd=0x00000001003af000, r=0x0000700007d85af8) at cram_index.c:583:22 [opt]
    frame #8: 0x00000001000cabd4 phase_caller`cram_set_voption(fd=0x00000001003af000, opt=<unavailable>, args=0x0000700007d85ac0) at cram_io.c:5815:17 [opt]
    frame #9: 0x00000001000ca789 phase_caller`cram_set_option(fd=<unavailable>, opt=<unavailable>) at cram_io.c:5703:9 [opt]
    frame #10: 0x0000000100063b94 phase_caller`cram_itr_query(idx=0x000000010124c9d0, tid=16, beg=<unavailable>, end=248678, readrec=<unavailable>) at sam.c:1696:19 [opt]
    frame #11: 0x0000000100057b9e phase_caller`hts_itr_querys(idx=0x000000010124c9d0, reg="chr17:248676-248678", getid=(phase_caller`bam_name2id at sam.h:780), hdr=0x000000010124ce30, itr_query=(phase_caller`cram_itr_query at sam.c:1681), readrec=<unavailable>) at hts.c:4161:12 [opt]
    frame #12: 0x0000000100063d21 phase_caller`sam_itr_querys(idx=<unavailable>, hdr=<unavailable>, region=<unavailable>) at sam.c:1757:12 [opt] [artificial]

I tested with the following CRAM file and index:

ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239334/NA12878.final.cram
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239334/NA12878.final.cram.crai

I have managed to execute a few thousand queries, but sometimes it gets stuck after only a few.

If you have any insights into what to look for, I can try some debugging.
Thanks.
Rick

jkbonfield (Contributor) commented:

Just checking as I don't think this is likely the problem, but are you specifying a local reference? Sometimes CRAM can wedge trying to get a reference out of the EBI (we plan to remove this feature at some point as it's now become unreliable).
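For context, a local reference can be attached explicitly right after opening the file; a minimal sketch using hts_set_fai_filename(), where the fasta path is a placeholder:

    // Point the CRAM decoder at a local indexed reference so it never
    // needs to fetch reference slices from the EBI server.
    // "GRCh38.fa" is a placeholder; a .fai index should sit beside it.
    if (hts_set_fai_filename(fp, "GRCh38.fa") != 0) {
        // reference could not be attached; handle the error
    }

Setting the REF_PATH/REF_CACHE environment variables is the other usual route.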

I also think it being related to #604 is definitely a possibility. If that issue is right about errors not being handled, I could imagine indefinite try-fail loops when something breaks with the connection. I'm not sure what happened with that issue.

rick-heig (Author) commented:

Hello,
Thank you for your response.

I have a local cache with reference files in ~/.cache/hts-ref, which is used by both my program and samtools. I don't think this is the problem: with the same set of queries (about 7000+), the first run got through 2000+ of them without issue, whereas now it hits the problem after just a few. Also, when I run the same program with a local (downloaded) copy of the CRAM file instead of the FTP address, there is no issue at all.

During my runs I encounter the following cases:

  1. When I call iter = sam_itr_querys(idx, hdr, region.c_str()); I get a NULL iterator without any error message, and when I then try to close the file I get:
     [W::hts_close] EOF marker is absent. The input is probably truncated
     Probably because the connection is down.
  2. When I try to reopen the file (this also sometimes happens the first time I try to open it):
     [E::hts_open_format] Failed to open file "ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239334/NA12878.final.cram" : Connection reset by peer
  3. Sometimes the file opens but the .crai index does not:
     [E::easy_errno] Libcurl reported error 78 (Remote file not found)
     [E::cram_index_load] Could not retrieve index file for 'ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239334/NA12878.final.cram.crai'
  4. The case described above, where the program is stuck in the backtrace shown and continuously pulls data from the network for regions that should amount to a few kB at most.

I wonder if this is a problem on their side: maybe the FTP server drops connections, or maybe my access pattern triggers some kind of DDoS protection, since I make many small random accesses to the CRAM file.

Cases 1-3 may be on their side, but case 4 sometimes occurs for a region that is fairly small (a query and pileup of reads intersecting ±1 bp around a position). When there is no issue my program continues rapidly, but when the issue occurs it gets stuck with the backtrace shown in the first post and downloads hundreds of MB without returning from sam_itr_querys().

Maybe a connection drop causes it to retry indefinitely here: https://github.com/samtools/htslib/blob/develop/hfile_libcurl.c#L833

There might be some edge case that is not handled; a workaround sketch for the transient open failures follows below.
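
For cases 2 and 3, a bounded retry around the open calls might work around the transient failures; a minimal sketch under that assumption (the helper name, retry count and backoff are arbitrary choices, not htslib recommendations):

    #include <string>
    #include <unistd.h>
    #include <htslib/sam.h>

    // Hypothetical helper: retry hts_open() a few times with backoff,
    // since the FTP server sometimes resets connections transiently.
    static htsFile *open_with_retries(const std::string &cram_file, int max_attempts) {
        for (int attempt = 0; attempt < max_attempts; ++attempt) {
            htsFile *fp = hts_open(cram_file.c_str(), "r");
            if (fp) return fp;
            sleep(1u << attempt); // 1s, 2s, 4s, ... between attempts
        }
        return NULL; // all attempts failed; caller reports the error
    }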


Also, I checked my pileup loop against the bcftools code; it handles multiple regions a bit differently. I basically do the following (once the file is opened as shown in the first post):

for (all regions) { /* pseudocode: loop over the query regions */
    if (iter) {
        sam_itr_destroy(iter);
        iter = NULL;
    }
    iter = sam_itr_querys(idx, hdr, region.c_str());
    if (!iter) { /* ... handle error ... */ }

    bam_plp_t s_plp = bam_plp_init(pileup_filter, (void *)&dc);
    while ((v_plp = bam_plp_auto(s_plp, &curr_tid, &curr_pos, &n_plp)) != NULL) {
        /* ... process reads ... */
    }
    bam_plp_reset(s_plp);
    bam_plp_destroy(s_plp);
}
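
For completeness, the pileup_filter callback above pulls reads from the current iterator; a minimal sketch, assuming a hypothetical DataCaller context dc that carries the open file and iterator:

    // Sketch of a bam_plp_auto_f callback. DataCaller is a hypothetical
    // layout; the real context may carry more state.
    struct DataCaller {
        htsFile *fp;
        hts_itr_t *iter;
    };

    static int pileup_filter(void *data, bam1_t *b) {
        DataCaller *dc = (DataCaller *)data;
        int ret;
        while ((ret = sam_itr_next(dc->fp, dc->iter, b)) >= 0) {
            // Skip reads that should not contribute to the pileup
            if (b->core.flag & (BAM_FUNMAP | BAM_FSECONDARY | BAM_FQCFAIL | BAM_FDUP))
                continue;
            break;
        }
        return ret; // < 0 tells bam_plp_auto() there is no more data
    }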

This works well with local files, but maybe I missed something, or this is not the right approach for remote files.

Thank you for your insights

daviesrob (Member) commented:

From the stack trace, it looks like the CRAM reader is trying to recover from a failed SEEK_SET by reading lots of data and discarding it.

The place it has got stuck in cram_seek_to_refpos() is here, which can only be reached if the SEEK_SET on the line above failed. Also, if cram_seek(..., ..., SEEK_CUR) fails to seek directly, it goes into a read-and-discard loop here until it reaches the right data. From the stack trace we can see that it's trying to seek forward 11493247130 bytes, which would explain the large amount of data being transferred.

I'm not sure why cram_seek_to_refpos() has this fallback behaviour, and I don't think it's helping much in this case...
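
To illustrate, the fallback amounts to something like this (a simplified sketch, not the exact htslib code):

    #include <htslib/hfile.h>

    // Emulating a forward seek on a stream by reading and discarding
    // `offset` bytes. Over a network handle, an 11 GB offset means an
    // 11 GB download, which matches the behaviour reported above.
    static int seek_forward_by_reading(hFILE *fp, off_t offset) {
        char buf[65536];
        while (offset > 0) {
            size_t n = offset < (off_t)sizeof(buf) ? (size_t)offset : sizeof(buf);
            ssize_t got = hread(fp, buf, n);
            if (got <= 0) return -1; // EOF or error: the emulated seek fails
            offset -= got;
        }
        return 0;
    }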

If you're querying lots of locations on a single file, it may be better to download the entire file with something like wget or curl and then work on the local copy. This is likely to put less stress on the EBI servers than attempting to do lots of index lookups on the remote copy.

If you're only looking at a few locations, you could try switching from ftp: to https:. The EBI's ftp server responds to both, and the latter may be more reliable.
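
That is, the same path with only the scheme changed:

    // Same file as before, fetched over HTTPS instead of FTP:
    htsFile *fp = hts_open(
        "https://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239334/NA12878.final.cram",
        "r");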

jkbonfield (Contributor) commented Jan 23, 2025:

Cases 1-3 demonstrate that the ftp server is unreliable, most likely due to overloading. I recall hearing that the ftp server is actually a fuse layer behind the scenes, so it's possible this is shared with the https interface and has the same timeouts.

Case 4 is clearly a bug somewhere in how we're handling error cases. We shouldn't hang indefinitely.

I can't recall why I added the fallback to read-and-discard. I assume it was for some purpose involving streaming. Samtools has traditionally had a rather counterintuitive mix of index searches and stream-and-filter searches depending on command and options, so perhaps this was a (probably unwise) attempt to duplicate that behaviour. CRAM got squashed into a single ~1Mb commit on merge, so most of the history of why things were done got lost. I'll have to hunt and see if I have early copies of it from before it was merged.

Edit: ah of course - that's io_lib. :) That function originated in jkbonfield/io_lib@a2b1660b#diff-32b45d66820284cc8463fecdc3814d2a7f3dc33d355f9760f156a2a95264fd57R172-R187, and it always had the fallback.

I think it's a misfeature.

rick-heig (Author) commented:

Thank you @daviesrob
That seems to be the case. Maybe I can patch this behaviour out when accessing remote files, as reading instead of seeking generates a lot of traffic and wasted time.

As my program makes small accesses at pinpoint locations (a few thousand of them), I am trying to make it faster than downloading the whole CRAM, especially since I need to access the CRAM files of all 3202 samples. (I have similarly run my program on datasets with 200k+ samples, but with the files mounted over a network mount point.)

Thank you @jkbonfield, I will still try https for cases 1-3 to see if it makes a difference.

I will also see if I can patch out this misfeature, for example by returning an error and wrapping the call in a try-fail loop until it either seeks correctly or reaches a maximum number of attempts.
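
Something like the following, assuming the read-and-discard fallback were first patched to fail fast instead of spinning (the wrapper is a hypothetical sketch; names and limits are illustrative):

    // Bounded retry around the region query: only useful once the
    // underlying seek failure surfaces as an error rather than an
    // endless read-and-discard loop.
    static hts_itr_t *query_with_retries(const hts_idx_t *idx, sam_hdr_t *hdr,
                                         const char *region, int max_attempts) {
        for (int attempt = 0; attempt < max_attempts; ++attempt) {
            hts_itr_t *iter = sam_itr_querys(idx, hdr, region);
            if (iter) return iter;
        }
        return NULL; // give up after max_attempts failed queries
    }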

For the moment I only access a single sample for the tests; it will probably get much worse when I run multi-threaded with many samples...

If I cannot get around it, I'll have to download all 3202 CRAM files and run locally on our cluster. It is a pity, because for each sample I only need to query a few hundred MB from the CRAM.

Thanks for the clear explanations about the code.

rick-heig (Author) commented:

Thank you @daviesrob for suggesting https; I wasn't aware the files were also available over https.

For the moment https works without issues, at a rate of about 2-4 pileup queries per second over 3 bp regions for a single sample. (I am trying 7000 regions, so at that rate it should take roughly half an hour to an hour.)

I'll see whether the processing finishes for this sample, and then how it scales to multiple samples.

If this works for me I'll be all set, but it may still be interesting to change the behaviour of htslib for case 4.

rick-heig (Author) commented:

Hello,
I can confirm that I have no problems accessing the CRAM files over https.

Thanks

jkbonfield added a commit to jkbonfield/htslib that referenced this issue Jan 23, 2025
This was a feature that came all the way from the initial index
support added to io_lib, but I think it's a misfeature.  The
consequence is that, on a failed SEEK_SET (e.g. network error or file
corruption), it falls back to a read-and-discard loop to simulate the
seek via SEEK_CUR.

This may perhaps be of use when querying stdin, but it's highly
unlikely that we would be doing that while also having an index on
disk, and it's not something we support with other formats.

Fixes samtools#1877