the title says it all - this is driving me nuts
I have Illumina shotgun data from cow manure and I am trying to remove the cow reads.
I have downloaded the latest cow genome and used it as reference to map any cow reads and keep everything that doesn't map.
I tried 2 methods:
#1 directly with Bowtie2
$bowtie2 -p 16 -x /cow_genome_db -1 *_R1_001.trim.fq.gz -2 *_R2_001.trim.fq.gz --un-conc reads_not_host
$ mv reads_not_host.1 reads_not_host.1.fq
$ mv reads_not_host.2 reads_not_host.2.fq
#2 using Bowtie2 and samtools
$bowtie2 -p 16 -x cow_genome_db -1 *_R1_001.trim.fq.gz -2 *_R2_001.trim.fq.gz -S ALL.sam
$samtools view -bS ALL.sam > ALL.bam
$samtools view -b -f 12 -F 256 ALL.bam > both_unmapped.bam
$samtools sort -n -m 5G -@ 2 both_unmapped.bam -o both_unmapped_sorted.bam
$samtools fastq -@ 8 both_unmapped_sorted.bam -1 SAMPLE_host_removed_R1.fastq.gz -2 SAMPLE_host_removed_R2.fastq.gz
According to the log
41565230 reads; of these:
41565230 (100.00%) were paired; of these:
41346602 (99.47%) aligned concordantly 0 times
128863 (0.31%) aligned concordantly exactly 1 time
89765 (0.22%) aligned concordantly >1 times
----
41346602 pairs aligned concordantly 0 times; of these:
3035 (0.01%) aligned discordantly 1 time
----
41343567 pairs aligned 0 times concordantly or discordantly; of these:
82687134 mates make up the pairs; of these:
81906525 (99.06%) aligned 0 times
7953 (0.01%) aligned exactly 1 time
772656 (0.93%) aligned >1 times
1.47% overall alignment rate
But when I check the number of reads in each file before and after (wc -l /4
) - there is more reads after removing the host!!!
2379013 before.fq.1.gz
2370650 before.fq.2.gz
using the first method
2410958 after.fq.1.gz
2407276 after.fq.2.gz
using the second method
2415432 after.fq.1.gz
2529372 after.fq.1.gz
So what is going on?
It is not logical to get more reads as you realize.
What you are likely looking at is presence of duplicate reads because of the secondary/supplementary alignments present in your BAM files. You can filter these reads using
-F 2048
SAM flag to remove those duplicates (see Remove supplementary alignments from a bam file ) .thanks, will give it a try and write back
OK,
Right after getting the unmapped.bam and before sorting the bam I added an extra step to remove duplicate reads
but I got the exact same number of reads in the end...
So either there weren't any duplicate alignments, or I didn't remove them?
Can you check
Most reads should have 2 counts. Those that have more than 2 are ones you need to investigate.
yeap, as far I can see all of them have exactly 2 counts (sorted.bam) - there is sth weird going on here but not sure what
If you are willing to try an alternative then I suggest you try using
bbsplit.sh
( BBSplit syntax for generating builds for the reference genome and how to call different builds. ) orseal.sh
(https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/seal-guide/ ) from BBMap suite to bin the reads.thanks for the suggestions - will try all this! At this point this is purely curiosity, I know the host reads are very very few that will only make a negligible difference but I d like to find out whats going on...