I have paired-end RNAseq data (Illumina). I have assembled a transcriptome with SPAdes. Now I want to call SNPs. There is no reference genome available yet for the species I am actually working with. That's why I thought I map the reads against the transcriptome (with bwa) to call SNPs (with bcftools mpileup) for some preliminary tests (3 samples). This yielded ~90% heterozygous SNPs after quality filtering (QUAL>30, DP>30, MQ>30, MQB>0.4, GQ>30). Since the species is a predominantly self fertilizing plant, this is way too high.
To check if something is wrong with this specific species, I did the same procedure for another selfing species for which a reference is available (same sequencing project, again 3 samples). I got ~25% heterozygous SNPs when I mapped the reads against the transcriptome and 3% when I mapped them against the reference genome (with STAR, only uniquely mapped reads (255) retained). 3% is realistic and fits to previous results, 25% is again way too high.
So, I am wondering what is going on. The only thing I can come up with is, that the two different aligners cause this difference. But in both cases 90-95% of the reads mapped to the genome/transcriptome and in both cases only reads that were mapped with high confidence were retained. Why do the two procedures produce two so different levels of heterozygousity?
I' be happy about any input, since I am stuck.
hi i am looking forward to map SNPs from transcriptome data. refernce genome is also available. but can you point me to a beginners guide. where to start? and list of tools that i will need? i am a beginner in NGS please
Please do not post new questions as
answers
in existing threads. If you do a google search withrnaseq snp site:biostars.org
you will find existing threads. If they don't provide adequate information then please post a new question.