Hi!
I would like to highlight hg19/virus ratio for Illumina paired reads. My first step was to remove hg19 reads (using -f4 samtools
) and then align remaining reads against virus.
I have noticed that using the same parameters I got different results for same run if I process paired reads (50% hg19, 50% virus) or single reads (3% hg19, 97% virus).
bwa mem -k 20 hg19.fasta singlereads.fastq > out.sam
bwa mem -k 20 hg19.fasta paired1.fastq paired2.fastq > out2.sam
For paired reads, I have blasted hg19 aligned reads and it occurs that human reads are actually virus.
I have noticed in cigar string that a lot of human mapped are closed to seed 20. Also, if I try to rise seed to 50 (-k 50
) I got an expected ratio (3% hg19, 97% virus).
There is only limited homology between hg19 and the virus and this virus is not integrated to human genome.
Does someone can explain to me why for the same seed I haven't the same results if I give my reads single or paired??
Thanks
What are these percentage exactly ? Are they the proportion of reads mapped from the total reads ? If so, I'm surprised that these add up to 100% ; that every read that does not map to the human genome can map to the virus. Also, it could be useful to see the output of
samtools flagstat
on your reads mapped on human and your reads mapped on the virus.thank you for your comment.
You're right, percentages are ratio between mapped and total reads, and 3 and 97% are no real values but more a global average for all samples. For one random sample : 3.03% for hg19, 96.3% for virus, so 0.67% are non mapped. Here are the outputs of
samtools flagstat
:1) for paired against hg 19 (k20)
for paired (now single fastq) minus hg 19 against virus (k20)
2) for single against hg 19 (k20)
for single minus hg 19 against virus (k20)
3) for paired against hg 19 (k50)
for paired (now single fastq) minus hg 19 (k50) against virus (k20)
Ok this seems weird. I have two comments that might help you figure out what is going on:
-f4
filter.