Iterator query over network stuck with CRAM on FTP #1877
Just checking, as I don't think this is likely the problem, but are you specifying a local reference? Sometimes CRAM can wedge trying to get a reference out of the EBI (we plan to remove this feature at some point as it's now become unreliable). I also think it being related to #604 is definitely a possibility. If that issue is correct about errors not being handled, then I could imagine indefinite try-fail-try-fail loops when something breaks with the connection. I'm not sure what happened with that issue.
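For reference, supplying a local reference means either setting the REF_PATH / REF_CACHE environment variables before running, or pointing the decoder at a faidx-indexed fasta from code. A rough sketch of the in-code route, with made-up paths:

```cpp
#include <htslib/sam.h>

// Rough sketch, hypothetical paths: point the CRAM decoder at a local
// faidx-indexed reference so it never needs to contact the EBI server.
// Setting REF_PATH / REF_CACHE in the environment achieves the same thing
// without any code changes.
int main() {
    samFile *fp = sam_open("sample.cram", "r");
    if (!fp) return 1;
    hts_set_fai_filename(fp, "/data/refs/GRCh38.fa");  // local reference
    // ... read the header and query as usual ...
    sam_close(fp);
    return 0;
}
```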
Hello, I have a local cache with reference files. During my runs I encounter the following cases:
Probably because the connection is down.
I wonder if this is a problem on their side; maybe the FTP server drops connections, or maybe my accesses trigger some kind of DDoS protection because I do many small random accesses to the CRAM file. Cases 1-3 may be on their side, but sometimes it gets stuck in case 4 for a region that is fairly small (a query and pileup of the reads intersecting +-1bp around a position). When there is no issue my program continues rapidly, but when the issue occurs it gets stuck with the backtrace shown in the first post and downloads hundreds of MB without returning from the query. Maybe there is an issue with a connection drop and it indefinitely retries here https://github.com/samtools/htslib/blob/develop/hfile_libcurl.c#L833 for some reason. There might be some edge case that is not handled.

Also, I checked my pileup loop against the bcftools code; how multiple regions are handled seems a bit different. I basically do the following (once the file is opened as shown in the first post):

for (all regions) {
    // Tear down the iterator from the previous region, if any
    if (iter) {
        sam_itr_destroy(iter);
        iter = NULL;
    }
    iter = sam_itr_querys(idx, hdr, region.c_str());
    if (!iter) { /* ... handle error ... */ }

    // pileup_filter is the callback that feeds reads to the pileup engine
    bam_plp_t s_plp = bam_plp_init(pileup_filter, (void*)&dc);
    while ((v_plp = bam_plp_auto(s_plp, &curr_tid, &curr_pos, &n_plp)) != 0) {
        // ... process reads ...
    }
    bam_plp_reset(s_plp);
    bam_plp_destroy(s_plp);
}

This works well with local files, but maybe I missed something, or this is not the way to go for accessing remote files. Thank you for your insights.
From the stack trace, it looks like this is due to the cram reader trying to recover from a failed seek. The place it's got stuck in is a fallback that reads and discards data instead of seeking, which is why it keeps downloading. I'm not sure why the seek failed in the first place.

If you're querying lots of locations on a single file, it may be better to download the entire file up front and query the local copy.

If you're only looking at a few locations, you could try switching from ftp:// to https://.
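Switching protocols shouldn't need any code changes beyond the URL string passed to the open call; something along these lines (host and path are placeholders):

```cpp
#include <htslib/sam.h>
#include <cstdio>

// Placeholder URL: the same file served over https instead of ftp.
// htslib's libcurl-based hfile backend handles both schemes, so only the
// string given to sam_open() changes.
int main() {
    samFile *fp = sam_open("https://example.org/path/sample.cram", "r");
    if (!fp) { fprintf(stderr, "open failed\n"); return 1; }
    sam_hdr_t *hdr = sam_hdr_read(fp);   // quick sanity check of the connection
    fprintf(stderr, "header %s\n", hdr ? "read OK" : "read failed");
    sam_hdr_destroy(hdr);
    sam_close(fp);
    return 0;
}
```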
Cases 1-3 are demonstrating that the ftp server is unreliable, most likely due to overloading. I recall hearing that the ftp server is actually a fuse layer behind the scenes, so it's possible this is shared with the https interface and also has the same timeouts.

Case 4 is clearly a bug somewhere in how we're handling error cases. We shouldn't hang indefinitely. I can't recall why I added the fallback to read-and-discard. I assume it was for some purpose involving streaming. Samtools has traditionally had a rather counterintuitive mix of index searches and stream-and-filter searches depending on command and options, so perhaps this was a (probably unwise) attempt to duplicate that behaviour. CRAM got squashed into a single ~1Mb commit on merge, so most of the history of why things were done got lost. I'll have to hunt and see if I have early copies of it from before it was merged.

Edit: ah of course - that's io_lib. :) That function originated in jkbonfield/io_lib@a2b1660b#diff-32b45d66820284cc8463fecdc3814d2a7f3dc33d355f9760f156a2a95264fd57R172-R187, and it was always like that, with the fallback. I think it's a misfeature.
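For anyone following along, the fallback in question is essentially this pattern (a schematic sketch against the hFILE API, not the actual cram_io.c code):

```cpp
#include <htslib/hfile.h>

// Schematic illustration only: when the real SEEK_SET fails, simulate a
// forward seek by reading and throwing the data away. Over FTP this turns
// a cheap seek into a potentially enormous sequential download, and the
// caller never sees an error unless the reads themselves fail.
static int seek_or_skip(hFILE *fp, off_t pos) {
    if (hseek(fp, pos, SEEK_SET) >= 0)
        return 0;                            // normal path: the seek worked

    char buf[65536];
    off_t cur = htell(fp);
    while (cur < pos) {
        off_t left = pos - cur;
        size_t want = left < (off_t)sizeof(buf) ? (size_t)left : sizeof(buf);
        ssize_t got = hread(fp, buf, want);
        if (got <= 0) return -1;             // EOF or read error
        cur += got;
    }
    return 0;
}
```

The problem over a network is obvious: a failed seek silently becomes a large download instead of an error the caller could act on.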
Thank you @daviesrob. As my program does small accesses at pinpoint locations (a few thousand of them), I am trying to make it faster than downloading the whole CRAM, especially since I need to access all 3202 samples' CRAM files. (Similarly, I ran my program on datasets with 200k+ samples, but with files mounted over a network mount point.)

Thank you @jkbonfield. I will still try https for cases 1-3 to see if it makes a difference. I will also see if I can patch out this misfeature, for example by returning an error and wrapping the call in a try-fail loop until it seeks correctly or reaches a maximum number of attempts (see the sketch below). For the moment I only access a single sample for the tests; it will probably get much worse when I run multi-threaded with many samples... If I cannot get around it I'll have to download all 3202 CRAM files and run locally on our cluster. It is a pity because for each sample I only need to query a few hundred MBs from the CRAM. Thanks for the clear explanations about the code.
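Something like this is what I have in mind (a hypothetical helper, assuming the patched library returns an error rather than hanging):

```cpp
#include <htslib/sam.h>
#include <unistd.h>   // sleep()

// Hypothetical helper sketching the bounded try-fail loop described above.
// It only helps once the underlying call actually returns an error instead
// of hanging, i.e. after the read-and-discard fallback has been patched out.
static hts_itr_t *query_with_retries(const hts_idx_t *idx, sam_hdr_t *hdr,
                                     const char *region, int max_attempts) {
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        hts_itr_t *iter = sam_itr_querys(idx, hdr, region);
        if (iter) return iter;        // success
        sleep(1u << attempt);         // simple exponential back-off
    }
    return NULL;                      // give up; caller handles the failure
}
```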
Thank you @daviesrob for suggesting https, I wasn't aware the files were also available over https. For the moment https works without issues, at a rate of about 2-4 pileup queries per second over 3bp regions for a single sample. (I am trying 7000 regions, so it should take about half an hour.) I'll see if the processing finishes for the sample and then how it scales to multiple samples. If this works for me I'll be all set, but it may still be interesting to change the behavior of htslib in case 4.
Hello, Thanks
This was a feature that came all the way from the initial index support added to io_lib, but I think it's a misfeature. The consequence of it is that on failing with a SEEK_SET (e.g. network error, or file corruption) it falls back to doing a read-and-discard loop to simulate the seek via SEEK_CUR. This may perhaps be of use when querying stdin, but it's highly unlikely for us to be doing that while also having an index on disk, and it's not something we support with other formats. Fixes samtools#1877
Hello,
I am accessing CRAM files over the network and sometimes sam_itr_querys gets stuck indefinitely (while still downloading data). I have tested HTSlib 1.16 and 1.21 (git checkout of the tag) and get the same behaviour.
This may be related to issue #604.
I open my files the following way and iterate on regions with sam_itr_query():

Sometimes it works well and I can access the CRAM file data, and sometimes it gets stuck and executes indefinitely. When I check network activity it downloads data continuously. If I rerun, the query normally returns quickly and downloads only a little data.
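For context, the open/query sequence is roughly the following (the URL, region and cleanup details here are placeholders, not my exact code):

```cpp
#include <htslib/sam.h>
#include <cstdio>

int main() {
    const char *url = "https://example.org/data/sample.cram";   // placeholder URL

    samFile *fp = sam_open(url, "r");
    if (!fp) { fprintf(stderr, "failed to open %s\n", url); return 1; }

    sam_hdr_t *hdr = sam_hdr_read(fp);
    hts_idx_t *idx = sam_index_load(fp, url);   // for remote files htslib also fetches the index
    if (!hdr || !idx) { fprintf(stderr, "failed to read header/index\n"); return 1; }

    hts_itr_t *iter = sam_itr_querys(idx, hdr, "chr20:1000000-1000002");
    bam1_t *b = bam_init1();
    while (iter && sam_itr_next(fp, iter, b) >= 0) {
        // ... process each read overlapping the region ...
    }

    bam_destroy1(b);
    if (iter) sam_itr_destroy(iter);
    hts_idx_destroy(idx);
    sam_hdr_destroy(hdr);
    sam_close(fp);
    return 0;
}
```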
When I interrupt my program I get the following backtrace:
I tested with the following CRAM file:
I have managed to execute a few thousand queries, and sometimes it gets stuck after only a few.
If you have any insights what to look for I can try some debugging.
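For instance, I could turn up htslib's logging before opening the files, assuming (perhaps wrongly) that the libcurl layer reports something useful at trace level:

```cpp
#include <htslib/hts_log.h>

// Ask htslib to log at trace level before any file is opened; whether the
// libcurl hfile backend emits anything helpful here is an assumption to test.
int main() {
    hts_set_log_level(HTS_LOG_TRACE);
    // ... open the CRAM files and run the queries as usual ...
    return 0;
}
```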
Thanks.
Rick