Okay, I will be a little more specific.
With the .bai file you can randomly access any position in a BAM file. The goal is to access the region that contains reads unmapped.
samtools view -f 4 input.bam
will produce reads unmapped, but if its mate is mapped to a specific chromosome, these reads are "aligned" to a chromosome, i.e. they have a rname and a position.
I want to try and access those reads that do not contain a rname. Yes they will not be aligned (i.e. -f 4) but the set of reads that I want wont include those that have a rname.
FURTHER EXPLAINED
samtools idxstats input.bam
> ...
chrUn_gl000241 ? ? ?
chrUn_gl000242 ? ? ?
chrUn_gl000243 ? ? ?
chrUn_gl000244 ? ? ?
chrUn_gl000245 ? ? ?
chrUn_gl000246 ? ? ?
chrUn_gl000247 ? ? ?
chrUn_gl000248 ? ? ?
chrUn_gl000249 ? ? ?
* ? ? ? #<--------------------------------------I WANT THIS
I want to access the last region (i.e. the reads that do not contain a rname).
samtools view input.bam '*'
doesn't work.
These are always located at the end of a BAM file, so it seems inefficient to run throughout a BAM file to just grab these reads (i.e. I want the random access if possible).
Let me know if this makes sense.
THANKS!!!
(Additional comments: In a sorted bam file all the paired reads with both read and its pair unmapped will be moved to the bottom of the bam file. The user asked if there is any fast way to access those reads directly without going through the whole bam file. The samtools command samtools view -f 12 input.bam
takes too long.)
I don't think it's fair to say an aligner "generally" ignores unmapped reads. For instance,
bwa
doesn't do what you say.You are right. i didnt mean that they ignore it. I meant that you can pretty much control it. For bowtie you can provide --un option and it will make a new set of fastq files containing unmapped reads. I was not clear if he has used BWA.
All reads present in the FASTQ file are present in the BAM file regardless of the alignment. I'm not 100% sure of the aligner used (think its novoalign) but I hope this information helps.
I updated Question if this makes more sense
Sorry I didnt get your new question. Let me ask you:
1) Do you want reads where both reads in a pair are unmapped? 2) Do you want read where the read itself is unmapped and its paired-read can be mapped or unmapped?
I want to target a region in a BAM file that isn't aligned to anything. So yes, it would be the case that both reads in a pair are unmapped. I was just curious if this is considered a separate region in a BAM file for random access. when you run
samtools idxstats input.bam
it shows a distinct region
OK i got where is the confusion. As you want reads where both read and its pair are unmapped you thought that in a sorted bam file these reads should be at the very bottom and I think you are correct. For the other case, where one read is mapped and its pair not mapped, these reads will occur together in a sorted bam file somewhere in between depending on the coordinates of the aligned read. So if you want to filter all reads where both read and its pair are unmapped, you should use -f12 option as mentioned by "matted" . You shouldnt worry about where they exist in bam. You can get them using this command.
Yes, and I understand this. Just thought maybe samtools had some representation where you could randomly access this region in a BAM file (because all of these reads are located at the end so it is kind of time consuming to pull them all via samtools view -f 12)
I dont think so. You will have to go through the whole file it seems. If it is a big file it may take some time. Atleast I am not aware of any short cut that you can use.
Thanks for the help...this was my first post and I figured this might be a challenge to explain :)
You are always welcome. You started with a nice question. :-)
At least this old htslib supports such queries. Old samtools does not.