is it a good practice to remove all the unmapped reads with:
samtools view -F 4
after they've been mapped with bwa ? The bam files would be smaller and the remaining operations would be faster isn't it ?
or shall I regret it later ?
is it a good practice to remove all the unmapped reads with:
samtools view -F 4
after they've been mapped with bwa ? The bam files would be smaller and the remaining operations would be faster isn't it ?
or shall I regret it later ?
To future-proof your data, it seems reasonable to hold on to the unmapped reads. For one, future algorithms may do a better job and allow you to recover some of that data. For another, future genome builds may resolve poorly assembled regions and allow additional reads to be mapped.
Neither of these improvements is likely to enable huge discoveries, but the cost you're paying in storage is pretty minimal, compared to the costs of sample collection and sequencing. The speed hit probably isn't as bad as you think either, since the bam is indexed. Smart algorithms will make use of that information and not even have to consider those unmapped reads.
To save space, you could as well delete your FASTQ files instead, and keep the BAM file with the unaligned reads. Now that would make Peter happy, wouldn't it?
There is a tradeoff. If you want to call variants getting rid of the unaligned reads mades things go a bit faster. However, having all the reads in one place is very convient if you ever need to go back to a project.
I usually discard the unaligned reads since dedup will throw away a bunch of reads one step downstream.
What you can do also is to build a wrapper around the bwa sampe step. When this step generates the SAM file, check the flag on the fly and split into several files like a 'bam' and an 'unmapped.bam'. In Perl, it goes roughly like this :
my $command = 'bwa sampe -P -s ref.fa sai1 sai2 fq1 fq2'
my $pid = open my $COM, '-|', $command
or croak "Could not exec $command : $!";
# Splitting output stream between several files
while( my $read = <$COM> ) {
chomp $read;
next if($read =~ m!^\@!); # Skip header lines
if($read1) {
# Second read in a pair
$read2 = $read;
# Process your read1 and read2 and split between several files
# if you want. For instance pairs for which there is at least
# one read mapped on one side, and unmapped pairs on the other
# side. (By checking the flag)
($read1, $read2) = (undef,undef); # Move to next pair
}
else {
# First read in a pair
$read1 = $read;
}
}
close $COM or croak 'Failed to close command : ' . $command;
This way you keep all the reads of your sample (in several files) but you can process only the interesting reads if you want to.
You can also use the remapper of segemehl and try to map the unmapped reads.
I found a manual and a presentation about the remapper in the internet:
In order to remove all the unmapped reads, shouldn't we use the above command? :
samtools view -F12 samfile
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
GATK realigner will use some unmapped reads when doing local realignment.
What is the final answer for this question? For variant calling analysis should we remove unmapped regions in the .bam file?
no
why? Is there any use of unmapped bam file during variant calling analysis?