SNP visualization with nucleotide %

9.8 years ago

skbrimer ▴ 740

Good afternoon group,

I'm looking for a way to graph the variability of the base calls across the genome as a percentage of the coverage at each base. I'm hoping to use this visual to check whether there are SNPs that occur across the genome at the same frequency, which might let me pick out different haplotypes for my virus. I would also like to use it to find areas of higher variability and see if, in those areas, I can use one of the current viral reconstruction algorithms to do some targeted haplotyping.

I know this should be doable. I have a good reference, and I have a BAM file with all my sorted reads. I'm just not sure how to put them together in a graph. I have been reading the documentation for Rsamtools and it looks like I can use it to read in the file. From there I figured I could write a loop to tally all of the observed nt at each bp in the reference to get the % of each nt at that position. Then I would like to make a line graph of all the nts that are not ref against the % at which they show up.
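A minimal sketch of that tally, written in Python rather than R purely for illustration. It assumes the per-position base counts have already been extracted from the BAM by some other tool (that step is not shown), and the function name `non_ref_percentages` and the toy counts are made up for the example.

```python
from typing import Dict, List, Tuple

BASES = "ACGT"


def non_ref_percentages(
    reference: str, counts: List[Dict[str, int]]
) -> List[Tuple[int, str, float]]:
    """Return (1-based position, base, percent of coverage) for each
    non-reference base observed at each reference position."""
    results = []
    for i, (ref_base, site) in enumerate(zip(reference.upper(), counts)):
        depth = sum(site.get(b, 0) for b in BASES)
        if depth == 0:
            continue  # no coverage at this position, nothing to report
        for base in BASES:
            n = site.get(base, 0)
            if base != ref_base and n > 0:
                results.append((i + 1, base, 100.0 * n / depth))
    return results


# Toy input (invented numbers, not from any real sample):
# one dict of base counts per reference position.
reference = "ACAC"
counts = [
    {"A": 50, "G": 50},  # position 1: half the reads show G
    {"C": 75, "T": 25},  # position 2: a quarter of the reads show T
    {"A": 100},          # position 3: reference only
    {},                  # position 4: no coverage
]

for pos, base, pct in non_ref_percentages(reference, counts):
    print(f"{pos}\t{base}\t{pct:.1f}%")
```

The resulting (position, base, percent) rows can be handed to any plotting library to draw percent against position, one line per non-reference base.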

I'm just not sure how to use the ref and the aln.bam file together to get started, can someone point me in the correct direction?

Sean

snp R alignment • 3.5k views

ADD COMMENT • link updated 2.9 years ago by Ram 45k • written 9.8 years ago by skbrimer ▴ 740

9.8 years ago

Brian Bushnell 20k

You can visualize this nicely with IGV. It just requires a sorted, indexed bam file, and the reference fasta, as input, and it functions in a GUI.

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 9.8 years ago by Brian Bushnell 20k

Hi Brian,

I have done this as well, and you are correct that it makes a nice picture; I guess it is exactly what I described in my question. However, I do not know of a way to extract the information from IGV. My genome is small, viral, and haploid, so I use freebayes to get the variants, but I cannot make haplotype calls; I just get a single call across the genome. That is why I was trying to use R to see if I could build different consensus sequences from the SNPs at different count levels.

Is this a good idea or a bad idea?

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 9.8 years ago by skbrimer ▴ 740

Maybe tweak your freebayes parameters; try -C 2 -F 0.01 (a minimum alternate count of 2 and a minimum alternate fraction of 0.01).

ADD REPLY • link 9.8 years ago by apelin20 ▴ 490

Thanks for the advice, I have been playing with the parameters, but I guess I'm not understanding how to extract the individual haplotypes out.

https://goo.gl/photos/WbjkqfmiYrm4ofsH9

In the linked screenshot you can see it calls one of the SNPs but not the other. The ref is ACAC and the call by freebayes is GCAC, but it should also have GTAC. I do not understand why it doesn't, or how to extract that information other than manually (please no).
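One way to avoid counting by hand is to tally what each read shows across the window of interest. The following is only a sketch: it assumes every read aligns without indels or clipping, that each read is given as a (leftmost 1-based reference position, aligned bases) pair already pulled from the BAM, and the function name `window_haplotypes` and the toy reads are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def window_haplotypes(
    reads: Iterable[Tuple[int, str]], start: int, end: int
) -> Dict[str, float]:
    """Tally the sequence each read shows across [start, end]
    (1-based, inclusive) and return each sequence's share of the
    reads that span the whole window."""
    tally = Counter()
    for pos, seq in reads:
        read_end = pos + len(seq) - 1
        # Skip reads that do not cover the full window.
        if pos <= start and read_end >= end:
            tally[seq[start - pos : end - pos + 1]] += 1
    total = sum(tally.values())
    return {hap: n / total for hap, n in tally.items()} if total else {}


# Toy reads (invented for illustration), window = positions 31-34.
reads = [
    (29, "TTGCACGG"),  # spans the window, shows GCAC
    (31, "GTACTT"),    # spans the window, shows GTAC
    (30, "AGCACT"),    # spans the window, shows GCAC
    (33, "ACGG"),      # starts inside the window, skipped
]

for hap, frac in window_haplotypes(reads, 31, 34).items():
    print(hap, f"{frac:.2f}")
```

Real alignments with insertions, deletions, or soft clips would need their CIGAR strings taken into account before slicing, which this sketch deliberately leaves out.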

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 9.8 years ago by skbrimer ▴ 740

Maybe post the header of the freebayes generated VCF files, and show the line which has your call. It is odd behaviour.

ADD REPLY • link 9.8 years ago by apelin20 ▴ 490

Sure thing!

	##fileformat=VCFv4.1
	##fileDate=20151023
	##source=freeBayes v0.9.21-19-gc003c1e
	##reference=/home/sean/Desktop/templates/IBV_RefSeq.fasta
	##phasing=none
	##commandline="freebayes -f /home/sean/Desktop/templates/IBV_RefSeq.fasta -p1 -F 0.01 -C 2 sample3_map.sorted.bam"
	##filter="QUAL > 10"
	##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
	##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth at the locus">
	##INFO=<ID=DPB,Number=1,Type=Float,Description="Total read depth per bp at the locus; bases in reads overlapping / bases in haplotype">
	##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
	##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
	##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">
	##INFO=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count, with partial observations recorded fractionally">
	##INFO=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observations, with partial observations recorded fractionally">
	##INFO=<ID=PRO,Number=1,Type=Float,Description="Reference allele observation count, with partial observations recorded fractionally">
	##INFO=<ID=PAO,Number=A,Type=Float,Description="Alternate allele observations, with partial observations recorded fractionally">
	##INFO=<ID=QR,Number=1,Type=Integer,Description="Reference allele quality sum in phred">
	##INFO=<ID=QA,Number=A,Type=Integer,Description="Alternate allele quality sum in phred">
	##INFO=<ID=PQR,Number=1,Type=Float,Description="Reference allele quality sum in phred for partial observations">
	##INFO=<ID=PQA,Number=A,Type=Float,Description="Alternate allele quality sum in phred for partial observations">
	##INFO=<ID=SRF,Number=1,Type=Integer,Description="Number of reference observations on the forward strand">
	##INFO=<ID=SRR,Number=1,Type=Integer,Description="Number of reference observations on the reverse strand">
	##INFO=<ID=SAF,Number=A,Type=Integer,Description="Number of alternate observations on the forward strand">
	##INFO=<ID=SAR,Number=A,Type=Integer,Description="Number of alternate observations on the reverse strand">
	##INFO=<ID=SRP,Number=1,Type=Float,Description="Strand balance probability for the reference allele: Phred-scaled upper-bounds estimate of the probability of observing the deviation between SRF and SRR given E(SRF/SRR) ~ 0.5, derived using Hoeffding's inequality">
	##INFO=<ID=SAP,Number=A,Type=Float,Description="Strand balance probability for the alternate allele: Phred-scaled upper-bounds estimate of the probability of observing the deviation between SAF and SAR given E(SAF/SAR) ~ 0.5, derived using Hoeffding's inequality">
	##INFO=<ID=AB,Number=A,Type=Float,Description="Allele balance at heterozygous sites: a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous">
	##INFO=<ID=ABP,Number=A,Type=Float,Description="Allele balance probability at heterozygous sites: Phred-scaled upper-bounds estimate of the probability of observing the deviation between ABR and ABA given E(ABR/ABA) ~ 0.5, derived using Hoeffding's inequality">
	##INFO=<ID=RUN,Number=A,Type=Integer,Description="Run length: the number of consecutive repeats of the alternate allele in the reference genome">
	##INFO=<ID=RPP,Number=A,Type=Float,Description="Read Placement Probability: Phred-scaled upper-bounds estimate of the probability of observing the deviation between RPL and RPR given E(RPL/RPR) ~ 0.5, derived using Hoeffding's inequality">
	##INFO=<ID=RPPR,Number=1,Type=Float,Description="Read Placement Probability for reference observations: Phred-scaled upper-bounds estimate of the probability of observing the deviation between RPL and RPR given E(RPL/RPR) ~ 0.5, derived using Hoeffding's inequality">
	##INFO=<ID=RPL,Number=A,Type=Float,Description="Reads Placed Left: number of reads supporting the alternate balanced to the left (5') of the alternate allele">
	##INFO=<ID=RPR,Number=A,Type=Float,Description="Reads Placed Right: number of reads supporting the alternate balanced to the right (3') of the alternate allele">
	##INFO=<ID=EPP,Number=A,Type=Float,Description="End Placement Probability: Phred-scaled upper-bounds estimate of the probability of observing the deviation between EL and ER given E(EL/ER) ~ 0.5, derived using Hoeffding's inequality">
	##INFO=<ID=EPPR,Number=1,Type=Float,Description="End Placement Probability for reference observations: Phred-scaled upper-bounds estimate of the probability of observing the deviation between EL and ER given E(EL/ER) ~ 0.5, derived using Hoeffding's inequality">
	##INFO=<ID=DPRA,Number=A,Type=Float,Description="Alternate allele depth ratio. Ratio between depth in samples with each called alternate allele and those without.">
	##INFO=<ID=ODDS,Number=1,Type=Float,Description="The log odds ratio of the best genotype combination to the second-best.">
	##INFO=<ID=GTI,Number=1,Type=Integer,Description="Number of genotyping iterations required to reach convergence or bailout.">
	##INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
	##INFO=<ID=CIGAR,Number=A,Type=String,Description="The extended CIGAR representation of each alternate allele, with the exception that '=' is replaced by 'M' to ease VCF parsing. Note that INDEL alleles do not have the first matched base (which is provided by default, per the spec) referred to by the CIGAR.">
	##INFO=<ID=NUMALT,Number=1,Type=Integer,Description="Number of unique non-reference alleles in called genotypes at this position.">
	##INFO=<ID=MEANALT,Number=A,Type=Float,Description="Mean number of unique non-reference allele observations per sample with the corresponding alternate alleles.">
	##INFO=<ID=LEN,Number=A,Type=Integer,Description="allele length">
	##INFO=<ID=MQM,Number=A,Type=Float,Description="Mean mapping quality of observed alternate alleles">
	##INFO=<ID=MQMR,Number=1,Type=Float,Description="Mean mapping quality of observed reference alleles">
	##INFO=<ID=PAIRED,Number=A,Type=Float,Description="Proportion of observed alternate alleles which are supported by properly paired read fragments">
	##INFO=<ID=PAIREDR,Number=1,Type=Float,Description="Proportion of observed reference alleles which are supported by properly paired read fragments">
	##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
	##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality, the Phred-scaled marginal (or unconditional) probability of the called genotype">
	##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy">
	##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
	##FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count">
	##FORMAT=<ID=QR,Number=1,Type=Integer,Description="Sum of quality of the reference observations">
	##FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count">
	##FORMAT=<ID=QA,Number=A,Type=Integer,Description="Sum of quality of the alternate observations">
	#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NOSM
	gi|9626535|ref|NC_001451.1| 31 . ACAC GCAC 1071.25 . AB=0;ABP=0;AC=1;AF=1;AN=1;AO=89;CIGAR=1X3M;DP=160;DPB=161.5;DPRA=0;EPP=106.094;EPPR=0;GTI=0;LEN=1;MEANALT=5;MQM=38.7079;MQMR=0;NS=1;NUMALT=1;ODDS=53.3488;PAIRED=0;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=1877;QR=0;RO=0;RPL=12;RPP=106.094;RPPR=0;RPR=77;RUN=1;SAF=89;SAP=196.271;SAR=0;SRF=0;SRP=0;SRR=0;TYPE=snp GT:DP:RO:QR:AO:QA:GL 1:160:0:0:89:1877:-158.749,0

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 9.8 years ago by skbrimer ▴ 740

Here is your problem. You set a QUAL > 10 filter. Lower-frequency variants have a smaller QUAL value (because an allele seen at a lower frequency is more likely to be due to chance).
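To see how much of the depth a called allele actually accounts for, the AO and DP values in the INFO column can be compared directly. A small sketch using a subset of the INFO fields from the record posted above (AO=89, DP=160, RO=0); the helper names `parse_info` and `alt_fractions` are made up for this example.

```python
from typing import Dict, List


def parse_info(info: str) -> Dict[str, str]:
    """Split a VCF INFO string of KEY=VALUE pairs into a dict."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def alt_fractions(info: str) -> List[float]:
    """Return AO / DP for each alternate allele in the record.
    AO is comma-separated when there is more than one ALT."""
    fields = parse_info(info)
    depth = int(fields["DP"])
    return [int(ao) / depth for ao in fields["AO"].split(",")]


# Subset of the INFO column from the record at position 31.
info = "AO=89;DP=160;RO=0;MEANALT=5"
fields = parse_info(info)

print(alt_fractions(info))
# Observations counted as neither the reference nor the called alternate:
print(int(fields["DP"]) - int(fields["RO"]) - int(fields["AO"]))
```

Here the called GCAC allele covers 89 of 160 observations, and with RO=0 the remaining 71 are counted as neither reference nor the called alternate. That may be where the GTAC reads sit, though the record alone cannot confirm it.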

ADD REPLY • link 9.8 years ago by apelin20 ▴ 490

Thank you for the help. When I look at the unfiltered data at the same spot, it still does not make the call. Both the filtered and unfiltered runs make calls in other areas of the genome, so I know it's working. I will try some lower frequency parameters to see if I can get it to show up.

What is the next step after this, though? How do I create a list of possible haplotypes from this VCF file?

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 9.8 years ago by skbrimer ▴ 740

You can't, you only have one call. ACAC and GCAC are your haplotypes.

ADD REPLY • link 9.8 years ago by apelin20 ▴ 490