I want to split my BAM file into paired aligned reads and unpaired aligned reads to make downstream analysis much easier.
1) Is this possible to do in one run through the file with samtools?
2) If this is not possible with samtools directly, is it possible with the pysam wrapper library?
If not, perhaps just running the two filters in parallel should reduce IO due to caching...
PS: I want the pairs in the paired file to be on contiguous lines, so that lines 0 and 1 contain the first pair, lines 2 and 3 contain the second pair, and so on.
3) Is sorting the file by name enough to ensure that paired reads end up next to each other?
Sorry for these possibly dumb questions, I am a complete samtools/paired end reads newb.
Wow, how have I missed the -U option!
Nice to know about, but would it work exactly like I wanted? Filtering for the paired reads would send both the unpaired and the unaligned reads to the -U file, I guess. Still, thanks.
Edit: no, it would work with a tweak: I can write the -U output to standard out and then pipe it through a second filter that removes the unmapped reads.
Just pipe things if you want to also filter unmapped reads:
samtools view -bF 4 foo.bam | samtools view -bF 8 -o paired.bam -U singletons.bam -
or something like that.
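For the pysam part of the question, here is a rough sketch of the routing logic those two filters apply, written against plain FLAG integers so it runs without pysam. The names `route` and `split_flags` are made up for illustration. It assumes every read was paired in sequencing (bit 0x1 set): -F 4 drops unmapped reads, -F 8 keeps reads whose mate is mapped, and -U receives the ones that fail that second filter.

```python
# SAM FLAG bits tested by the two filters (values from the SAM specification).
READ_UNMAPPED = 0x4   # the read itself is unmapped
MATE_UNMAPPED = 0x8   # the read's mate is unmapped


def route(flag):
    """Return the destination for a record with this FLAG value.

    Hypothetical helper: unmapped reads are dropped (like -F 4), reads whose
    mate is unmapped are singletons (they fail -F 8), the rest are paired.
    Assumes bit 0x1 (paired in sequencing) is set on every read.
    """
    if flag & READ_UNMAPPED:
        return "drop"
    if flag & MATE_UNMAPPED:
        return "singletons"
    return "paired"


def split_flags(flags):
    """Route many FLAG values in a single pass over the input."""
    buckets = {"paired": [], "singletons": [], "drop": []}
    for flag in flags:
        buckets[route(flag)].append(flag)
    return buckets
```

With pysam the same test could probably be applied to `read.flag` while looping once over the input file and writing each read to one of two output files, but I have not benchmarked that against the samtools pipe.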
It depends on what you mean by paired: 'paired read' = 1, 'properly paired' = 8.
I actually went for "mate unmapped", to keep discordant alignments with the properly paired.
8 is mate unmapped, 2 is PROPER_PAIR, afaics.
I guess it's a question of whether "paired in sequence" or "both mates of a pair aligned" is needed.
I guess I was being unclear, but 8 was what I intended. I can't see what 1 should mean unless paired and unpaired reads are sometimes mixed in fastq files before alignment.
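Since the flag values caused some confusion here, a small sketch that decodes the four bits discussed in this thread may help. The bit values are from the SAM specification. The `FLAG_BITS` table and the `describe` helper are just illustrative names, and only these four bits are covered.

```python
# The four FLAG bits discussed in this thread (SAM specification values).
FLAG_BITS = {
    0x1: "paired in sequencing",
    0x2: "properly paired",
    0x4: "read unmapped",
    0x8: "mate unmapped",
}


def describe(flag):
    """List the names of the bits above that are set in a FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


print(describe(99))  # ['paired in sequencing', 'properly paired']
print(describe(73))  # ['paired in sequencing', 'mate unmapped']
```

So bit 1 only says the read came from a paired library, while bit 8 says its mate did not align, which is the distinction the singleton filter relies on.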