Entering edit mode
6.2 years ago
fabian.grammes
▴
10
Hi
I want to call all possible indels from sanger reads. Since it's sanger reads I have low coverage (~6 reads) for my region. I've tried samtools+bcftools
but this approach misses some small indels
Example:
samtools mpileup -s -r ssa19:73307890-73307910 -m 1 -f $gen data/bam/$sample.bam
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
ssa19 73307890 C 7 ,,,,,,, >B55F5= ]]]]]]]
ssa19 73307891 A 7 ,,-17ggggaggcggacctcag,-17ggggaggcggacctcag,-17ggggaggcggacctcag,,, 6<00:.8 ]]]]]]]
ssa19 73307892 G 4 ,,,, ;:49 ]]]]
ssa19 73307893 G 4 ,,,, <<76 ]]]]
ssa19 73307894 G 4 ,,,, <=76 ]]]]
ssa19 73307895 G 4 ,,,, A?B8 ]]]]
ssa19 73307896 A 4 ,,,, =<@1 ]]]]
ssa19 73307897 G 4 ,,,-2gc, =J27 ]]]]
ssa19 73307898 G 3 ,,, :C/ ]]]
ssa19 73307899 C 3 ,,, :=9 ]]]
ssa19 73307900 G 3 ,,, ?E; ]]]
ssa19 73307901 G 3 ,,, ?B5 ]]]
ssa19 73307902 A 2 ,,-1c 58 ]]
ssa19 73307903 C 2 ,-2ct,+6tgtgga 8/ ]]
ssa19 73307904 C 0
ssa19 73307905 T 3 ,,, =5. ]]]
ssa19 73307906 C 3 ,,, =<5 ]]]
ssa19 73307907 A 4 ,,,, 3:96 ]]]]
ssa19 73307908 G 4 ,,,, 6>>4 ]]]]
ssa19 73307909 G 4 ,,,, ?E=9 ]]]]
ssa19 73307910 G 4 ,,,, <B=: ]]]]
Now from the output of samtools
I can see 4 deletions [73307891, 73307897, 73307902, 73307903] - which I can also see in IGV. However, when running bcftools
(see code below) I only get the deletion at 73307891:
samtools mpileup -s -r ssa19:73307890-73307910 -m 1 -uf $gen data/bam/$sample.bam \
| bcftools call -m -Ov --skip-variants snps | grep -v "^##"
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT data/bam/FT1KO1.bam
ssa19 73307891 . AGGGGAGGCGGACCTCAGGG AGG 124 . INDEL;IDV=3;IMF=0.428571;DP=7;VDB=0.0506481;SGB=-0.511536;MQ0F=0;AC=2;AN=2;DP4=0,0,0,3;MQ=60 GT:PL 1/1:151,9,0
I would like all indels. Any idea how I could achieve this ?
Thanks, F
Take your Sanger sequence, and the reference sequence and perform a pairwise alignment, e.g. using EMBOSS-Needle. Do not use a next-generation-sequencing tool like samtools for Sanger sequencing.
Thanks for the reply
Hello fabian.grammes ,
next-generation-sequencing tool are designed for data that were obtained from NGS. Its algorithms take into account what's the background of the data.
Could you please describe how you converted sanger sequencing data into a bam file? It is not clear to me how you can say that you have 6 reads? The concept of reads doesn't exist in sanger sequencing.
Please read about the XY-Problem. ATpoint tried to show you, how you maybe can solve your real problem. It's a characteristic of a good community not just to answer the question that people ask, but try to get to the bottom of the real problem.
fin swimmer
Hey
The
.ab1
files were transformed 2.fq
, and quality trimmed (mott http://biopython.org/DIST/docs/api/Bio.SeqIO-module.html) traces of the vector eliminated withcutadapt
and then mapped to my genome usingbwa
. I'm considering just mapping to the amplicons instead. 6 reads means 6.ab1
sequences from the same individual but different clones.I read about it, however I still would like to know why bcftools misses/ does not call 3 of my indels
As finswimmer said, bcftools is designed for NGS data, where a genome is fragmentated into smaller pieces, then massive-parallel-sequenced and the reads aggregated to reconstruct the underlying sequence. This technology with its involving chemistry and characteristics is taken into account by this tool, including the quality scores to decide about true variants and "false" artifacts. Sanger sequencing uses a whole different approach. It does not supply quality scores that are 1to1 interchangable with the quality values from NGS (no matter what this Biophyton module does), and most importantly, the coverage is way lower. This is acceptable because the accuracy is higher than in NGS. NGS compensates this by throughput per region. Still, the Sanger reads are not (even though it is technically possible) a proper inut for an NGS tool. Multiple- or pairwise sequence alignments as suggested above are the way to go for you. As for why bcftools misses some indels, well, because it was not trained to work with Sanger data.
I appreciate the comments. I'll try to make it more clear:
bcftools
processes the output fromsamtools mpileup
. In the out put fromsamtools mpileup
I can identify the 4 indels (subset from my original post)But after processing this with
bcftools
the 3 last indels disappear. Now this is in my view a strictbcftools
question: I want to know whybcftools
does this. If anyone knows please let me know.bcftools
assumes your sample is diploid by default. Saying this it is not possible that it will report more than 2 variants covering the same position.Furthermore in the world of variant calling in ngs data a single read showing a certain variant is just one point that goes into the calculation of the properbility for a variant. Variant Caller askes more question like: "How many read s and what fraction show this variant? Whats the mapping quality of the reads? What's the base quality of the reads? How does these value look like on reads which show the variant and those having no variant? Where in the read does the variant appears? ...."
If you have 6 sanger sequences from 6 clones than you have strictly speaking 6 different samples that have to be analyzed independently. So there is no need in your case to take ngs tools for analysing. Just take these 6 sequences and do a pairwise sequence alignment for each of them like ATpoint said before.
fin swimmer
Hey! I appreciate the discussion, and the ploidy point is interesting. But I don't think it applies in this case. - Because only the 4th indel would be polyploid - When I run
bcftools
on a smaller region, avoiding the 1st large indel I do not any other indelsWhen I worked on Sanger sequencing I used phred/phrap/consed and that was great for finding indels. You might give that a try.