Iterator query over network stuck with CRAM on FTP #1877

Open
rick-heig opened this issue Jan 22, 2025 · 7 comments · May be fixed by #1878

rick-heig commented Jan 22, 2025

Hello,
I am accessing CRAM files over the network, and sometimes sam_itr_querys() gets stuck indefinitely (while still downloading data).

I have tested HTSlib 1.16 and 1.21 (checked out from the git tags) and get the same behaviour in both.

This may be related to issue #604.

I open my files the following way and iterate over regions with sam_itr_querys():

        htsFile *fp = hts_open(cram_file.c_str(), "r");
        if (!fp) {
            std::string error("Cannot open ");
            error += cram_file;
            throw DataCallerError(error);
        }
        hts_idx_t *idx = sam_index_load2(fp, cram_file.c_str(), (cram_file + ".crai").c_str());
        if (!idx) {
            throw DataCallerError(std::string("Failed to load index file"));
        }
        sam_hdr_t *hdr = sam_hdr_read(fp);
        if (!hdr) {
            std::string error("Failed to read header from file ");
            error += cram_file;
            throw DataCallerError(error);
        }

        hts_itr_t *iter = NULL; /* initialise so the first-iteration check is safe */
        while (...) { /* Iterate over many regions */
            if (iter) {
                 sam_itr_destroy(iter);
                 iter = NULL;
            }
            iter = sam_itr_querys(idx, hdr, region.c_str()); /* reuse the outer iter, no redeclaration */
            ... do some work, e.g., pile up of reads ...
        }
       

Sometimes it works and I can access the CRAM file data; other times it gets stuck and runs indefinitely, downloading data continuously. If I rerun, the query normally returns quickly and downloads only a little data.

When I interrupt my program I get the following backtrace:

  * frame #0: 0x00007ff80ba8dd1a libsystem_kernel.dylib`__select + 10
    frame #1: 0x000000010010430c phase_caller`wait_perform(fp=0x000000010124c340) at hfile_libcurl.c:729:17 [opt]
    frame #2: 0x0000000100105710 phase_caller`libcurl_read(fpv=0x000000010124c340, bufferv=0x0000000102809000, nbytes=<unavailable>) at hfile_libcurl.c:834:17 [opt]
    frame #3: 0x0000000100049d86 phase_caller`refill_buffer(fp=0x000000010124c340) at hfile.c:186:13 [opt]
    frame #4: 0x000000010004a0ee phase_caller`hread2(fp=<unavailable>, destv=0x0000700007d75960, nbytes=43, nread=65493) at hfile.c:339:23 [opt]
    frame #5: 0x00000001000ca179 phase_caller`cram_seek [inlined] hread(fp=0x000000010124c340, buffer=0x0000700007d75960, nbytes=65536) at hfile.h:244:56 [opt]
    frame #6: 0x00000001000ca127 phase_caller`cram_seek(fd=<unavailable>, offset=11493247130, whence=<unavailable>) at cram_io.c:5453:20 [opt]
    frame #7: 0x00000001000bea42 phase_caller`cram_seek_to_refpos(fd=0x00000001003af000, r=0x0000700007d85af8) at cram_index.c:583:22 [opt]
    frame #8: 0x00000001000cabd4 phase_caller`cram_set_voption(fd=0x00000001003af000, opt=<unavailable>, args=0x0000700007d85ac0) at cram_io.c:5815:17 [opt]
    frame #9: 0x00000001000ca789 phase_caller`cram_set_option(fd=<unavailable>, opt=<unavailable>) at cram_io.c:5703:9 [opt]
    frame #10: 0x0000000100063b94 phase_caller`cram_itr_query(idx=0x000000010124c9d0, tid=16, beg=<unavailable>, end=248678, readrec=<unavailable>) at sam.c:1696:19 [opt]
    frame #11: 0x0000000100057b9e phase_caller`hts_itr_querys(idx=0x000000010124c9d0, reg="chr17:248676-248678", getid=(phase_caller`bam_name2id at sam.h:780), hdr=0x000000010124ce30, itr_query=(phase_caller`cram_itr_query at sam.c:1681), readrec=<unavailable>) at hts.c:4161:12 [opt]
    frame #12: 0x0000000100063d21 phase_caller`sam_itr_querys(idx=<unavailable>, hdr=<unavailable>, region=<unavailable>) at sam.c:1757:12 [opt] [artificial]

I tested with the following CRAM file and index:

ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239334/NA12878.final.cram
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239334/NA12878.final.cram.crai

I have managed to execute a few thousand queries, but sometimes it gets stuck after only a few.

If you have any insights into what to look for, I can try some debugging.
Thanks.
Rick

jkbonfield (Contributor) commented:

Just checking as I don't think this is likely the problem, but are you specifying a local reference? Sometimes CRAM can wedge trying to get a reference out of the EBI (we plan to remove this feature at some point as it's now become unreliable).
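For context, a local reference can be attached explicitly right after opening the file; a minimal sketch using hts_set_fai_filename(), where the fasta path is a placeholder:

    // Point the CRAM decoder at a local indexed reference so it never
    // needs to fetch reference slices from the EBI server.
    // "GRCh38.fa" is a placeholder; a .fai index should sit beside it.
    if (hts_set_fai_filename(fp, "GRCh38.fa") != 0) {
        // reference could not be attached; handle the error
    }

Setting the REF_PATH/REF_CACHE environment variables is the other usual route.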

I also think it being related to #604 is definitely a possibility. If that issue is right about errors not being handled, I could imagine indefinite try-fail loops when something breaks with the connection. I'm not sure what happened with that issue.

rick-heig (Author) commented:

Hello,
Thank you for your response.

I have a local cache with reference files in ~/.cache/hts-ref, which is used by both my program and samtools. I don't think this is the problem: with the same set of queries (about 7000+), the first run got through 2000+ of them without issue, whereas now it hits the problem after just a few. Also, when I run the same program with a local (downloaded) copy of the CRAM file instead of the FTP address, there is no issue at all.

During my runs I encounter the following cases:

  1. When I call iter = sam_itr_querys(idx, hdr, region.c_str()); I get a NULL iterator without any error message, and when I then try to close the file I get:
     [W::hts_close] EOF marker is absent. The input is probably truncated
     Probably because the connection is down.
  2. When I try to reopen the file (this also sometimes happens the first time I try to open it):
     [E::hts_open_format] Failed to open file "ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239334/NA12878.final.cram" : Connection reset by peer
  3. Sometimes the file opens but the .crai index does not:
     [E::easy_errno] Libcurl reported error 78 (Remote file not found)
     [E::cram_index_load] Could not retrieve index file for 'ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239334/NA12878.final.cram.crai'
  4. The case described above, where the program is stuck in the backtrace shown and continuously pulls data from the network for regions that should amount to a few kB at most.

I wonder if this is a problem on their side: maybe the FTP server drops connections, or maybe my access pattern triggers some kind of DDoS protection, since I make many small random accesses to the CRAM file.

Cases 1-3 may be on their side, but case 4 sometimes occurs for a region that is fairly small (a query and pileup of reads intersecting ±1 bp around a position). When there is no issue my program continues rapidly, but when the issue occurs it gets stuck with the backtrace shown in the first post and downloads hundreds of MB without returning from sam_itr_querys().

Maybe a connection drop causes it to retry indefinitely here: https://github.com/samtools/htslib/blob/develop/hfile_libcurl.c#L833

There might be some edge case that is not handled; a workaround sketch for the transient open failures follows below.
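
For cases 2 and 3, a bounded retry around the open calls might work around the transient failures; a minimal sketch under that assumption (the helper name, retry count and backoff are arbitrary choices, not htslib recommendations):

    #include <string>
    #include <unistd.h>
    #include <htslib/sam.h>

    // Hypothetical helper: retry hts_open() a few times with backoff,
    // since the FTP server sometimes resets connections transiently.
    static htsFile *open_with_retries(const std::string &cram_file, int max_attempts) {
        for (int attempt = 0; attempt < max_attempts; ++attempt) {
            htsFile *fp = hts_open(cram_file.c_str(), "r");
            if (fp) return fp;
            sleep(1u << attempt); // 1s, 2s, 4s, ... between attempts
        }
        return NULL; // all attempts failed; caller reports the error
    }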


Also, I checked my pileup loop against the bcftools code; it handles multiple regions a bit differently. I basically do the following (once the file is opened as shown in the first post):

for (all regions) { /* pseudocode: loop over the query regions */
    if (iter) {
        sam_itr_destroy(iter);
        iter = NULL;
    }
    iter = sam_itr_querys(idx, hdr, region.c_str());
    if (!iter) { /* ... handle error ... */ }

    bam_plp_t s_plp = bam_plp_init(pileup_filter, (void *)&dc);
    while ((v_plp = bam_plp_auto(s_plp, &curr_tid, &curr_pos, &n_plp)) != NULL) {
        /* ... process reads ... */
    }
    bam_plp_reset(s_plp);
    bam_plp_destroy(s_plp);
}
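
For completeness, the pileup_filter callback above pulls reads from the current iterator; a minimal sketch, assuming a hypothetical DataCaller context dc that carries the open file and iterator:

    // Sketch of a bam_plp_auto_f callback. DataCaller is a hypothetical
    // layout; the real context may carry more state.
    struct DataCaller {
        htsFile *fp;
        hts_itr_t *iter;
    };

    static int pileup_filter(void *data, bam1_t *b) {
        DataCaller *dc = (DataCaller *)data;
        int ret;
        while ((ret = sam_itr_next(dc->fp, dc->iter, b)) >= 0) {
            // Skip reads that should not contribute to the pileup
            if (b->core.flag & (BAM_FUNMAP | BAM_FSECONDARY | BAM_FQCFAIL | BAM_FDUP))
                continue;
            break;
        }
        return ret; // < 0 tells bam_plp_auto() there is no more data
    }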

This works well with local files, but maybe I missed something, or this is not the right approach for remote files.

Thank you for your insights

daviesrob (Member) commented:

From the stack trace, it looks like the CRAM reader is trying to recover from a failed SEEK_SET by reading lots of data and discarding it.

The place it has got stuck in cram_seek_to_refpos() is here, which can only be reached if the SEEK_SET on the line above failed. Also, if cram_seek(..., ..., SEEK_CUR) fails to seek directly, it goes into a read-and-discard loop here until it reaches the right data. From the stack trace we can see that it's trying to seek forward 11493247130 bytes, which would explain the large amount of data being transferred.

I'm not sure why cram_seek_to_refpos() has this fallback behaviour, and I don't think it's helping much in this case...
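
To illustrate, the fallback amounts to something like this (a simplified sketch, not the exact htslib code):

    #include <htslib/hfile.h>

    // Emulating a forward seek on a stream by reading and discarding
    // `offset` bytes. Over a network handle, an 11 GB offset means an
    // 11 GB download, which matches the behaviour reported above.
    static int seek_forward_by_reading(hFILE *fp, off_t offset) {
        char buf[65536];
        while (offset > 0) {
            size_t n = offset < (off_t)sizeof(buf) ? (size_t)offset : sizeof(buf);
            ssize_t got = hread(fp, buf, n);
            if (got <= 0) return -1; // EOF or error: the emulated seek fails
            offset -= got;
        }
        return 0;
    }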

If you're querying lots of locations on a single file, it may be better to download the entire file with something like wget or curl and then work on the local copy. This is likely to put less stress on the EBI servers than attempting to do lots of index lookups on the remote copy.

If you're only looking at a few locations, you could try switching from ftp: to https:. The EBI's ftp server responds to both, and the latter may be more reliable.
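
That is, the same path with only the scheme changed:

    // Same file as before, fetched over HTTPS instead of FTP:
    htsFile *fp = hts_open(
        "https://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239334/NA12878.final.cram",
        "r");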

jkbonfield (Contributor) commented Jan 23, 2025:

Cases 1-3 demonstrate that the ftp server is unreliable, most likely due to overloading. I recall hearing that the ftp server is actually a fuse layer behind the scenes, so it's possible this is shared with the https interface and has the same timeouts.

Case 4 is clearly a bug somewhere in how we're handling error cases. We shouldn't hang indefinitely.

I can't recall why I added the fallback to read-and-discard. I assume it was for some purpose involving streaming. Samtools has traditionally had a rather counterintuitive mix of index searches and stream-and-filter searches depending on command and options, so perhaps this was a (probably unwise) attempt to duplicate that behaviour. CRAM got squashed into a single ~1Mb commit on merge, so most of the history of why things were done got lost. I'll have to hunt and see if I have early copies of it from before it was merged.

Edit: ah of course - that's io_lib. :) That function originated in jkbonfield/io_lib@a2b1660b#diff-32b45d66820284cc8463fecdc3814d2a7f3dc33d355f9760f156a2a95264fd57R172-R187, and it always had the fallback.

I think it's a misfeature.

rick-heig (Author) commented:

Thank you @daviesrob
That seems to be the case. Maybe I can patch this behaviour out when accessing remote files, as reading instead of seeking generates a lot of traffic and wasted time.

As my program makes small accesses at pinpoint locations (a few thousand of them), I am trying to make it faster than downloading the whole CRAM, especially since I need to access the CRAM files of all 3202 samples. (I have similarly run my program on datasets with 200k+ samples, but with the files mounted over a network mount point.)

Thank you @jkbonfield, I will still try https for cases 1-3 to see if it makes a difference.

I will also see if I can patch out this misfeature, for example by returning an error and wrapping the call in a try-fail loop until it either seeks correctly or reaches a maximum number of attempts.
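
Something like the following, assuming the read-and-discard fallback were first patched to fail fast instead of spinning (the wrapper is a hypothetical sketch; names and limits are illustrative):

    // Bounded retry around the region query: only useful once the
    // underlying seek failure surfaces as an error rather than an
    // endless read-and-discard loop.
    static hts_itr_t *query_with_retries(const hts_idx_t *idx, sam_hdr_t *hdr,
                                         const char *region, int max_attempts) {
        for (int attempt = 0; attempt < max_attempts; ++attempt) {
            hts_itr_t *iter = sam_itr_querys(idx, hdr, region);
            if (iter) return iter;
        }
        return NULL; // give up after max_attempts failed queries
    }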

For the moment I only access a single sample for the tests; it will probably get much worse when I run multi-threaded with many samples...

If I cannot get around it, I'll have to download all 3202 CRAM files and run locally on our cluster. It is a pity, because for each sample I only need to query a few hundred MB from the CRAM.

Thanks for the clear explanations about the code.

rick-heig (Author) commented:

Thank you @daviesrob for suggesting https; I wasn't aware the files were also available over https.

For the moment https works without issues, at a rate of about 2-4 pileup queries per second over 3 bp regions for a single sample. (I am trying 7000 regions, so at that rate it should take roughly half an hour to an hour.)

I'll see whether the processing finishes for this sample, and then how it scales to multiple samples.

If this works for me I'll be all set, but it may still be interesting to change the behaviour of htslib for case 4.

rick-heig (Author) commented:

Hello,
I can confirm that I have no problems accessing the CRAM files over https.

Thanks

jkbonfield added a commit to jkbonfield/htslib that referenced this issue Jan 23, 2025
This was a feature that came all the way from the initial index
support added to io_lib, but I think it's a misfeature.  The
consequence is that, on a failed SEEK_SET (e.g. network error or file
corruption), it falls back to a read-and-discard loop to simulate the
seek via SEEK_CUR.

This may perhaps be of use when querying stdin, but it's highly
unlikely that we would be doing that while also having an index on
disk, and it's not something we support with other formats.

Fixes samtools#1877