Question

Indel calling from sanger reads

0

Entering edit mode

7.0 years ago

fabian.grammes ▴ 10

Hi

I want to call all possible indels from sanger reads. Since it's sanger reads I have low coverage (~6 reads) for my region. I've tried samtools+bcftools but this approach misses some small indels

Example:

samtools mpileup -s -r ssa19:73307890-73307910 -m 1 -f $gen data/bam/$sample.bam

[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
ssa19   73307890    C   7   ,,,,,,, >B55F5= ]]]]]]]
ssa19   73307891    A   7   ,,-17ggggaggcggacctcag,-17ggggaggcggacctcag,-17ggggaggcggacctcag,,, 6<00:.8 ]]]]]]]
ssa19   73307892    G   4   ,,,,    ;:49    ]]]]
ssa19   73307893    G   4   ,,,,    <<76    ]]]]
ssa19   73307894    G   4   ,,,,    <=76    ]]]]
ssa19   73307895    G   4   ,,,,    A?B8    ]]]]
ssa19   73307896    A   4   ,,,,    =<@1    ]]]]
ssa19   73307897    G   4   ,,,-2gc,    =J27    ]]]]
ssa19   73307898    G   3   ,,, :C/ ]]]
ssa19   73307899    C   3   ,,, :=9 ]]]
ssa19   73307900    G   3   ,,, ?E; ]]]
ssa19   73307901    G   3   ,,, ?B5 ]]]
ssa19   73307902    A   2   ,,-1c   58  ]]
ssa19   73307903    C   2   ,-2ct,+6tgtgga  8/  ]]
ssa19   73307904    C   0           
ssa19   73307905    T   3   ,,, =5. ]]]
ssa19   73307906    C   3   ,,, =<5 ]]]
ssa19   73307907    A   4   ,,,,    3:96    ]]]]
ssa19   73307908    G   4   ,,,,    6>>4    ]]]]
ssa19   73307909    G   4   ,,,,    ?E=9    ]]]]
ssa19   73307910    G   4   ,,,,    <B=:    ]]]]

Now from the output of samtools I can see 4 deletions [73307891, 73307897, 73307902, 73307903] - which I can also see in IGV. However, when running bcftools (see code below) I only get the deletion at 73307891:

samtools mpileup -s -r ssa19:73307890-73307910 -m 1 -uf $gen data/bam/$sample.bam \
    | bcftools call -m -Ov --skip-variants snps | grep -v "^##"

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  data/bam/FT1KO1.bam
ssa19   73307891    .   AGGGGAGGCGGACCTCAGGG    AGG 124 .   INDEL;IDV=3;IMF=0.428571;DP=7;VDB=0.0506481;SGB=-0.511536;MQ0F=0;AC=2;AN=2;DP4=0,0,0,3;MQ=60    GT:PL   1/1:151,9,0

I would like all indels. Any idea how I could achieve this ?

Thanks, F

bcftools samtools sequencing • 2.0k views

ADD COMMENT • link updated 7.0 years ago by Ram 45k • written 7.0 years ago by fabian.grammes ▴ 10

3

Entering edit mode

Take your Sanger sequence, and the reference sequence and perform a pairwise alignment, e.g. using EMBOSS-Needle. Do not use a next-generation-sequencing tool like samtools for Sanger sequencing.

ADD REPLY • link 7.0 years ago by ATpoint 89k

0

Entering edit mode

Thanks for the reply

Why not ?
That does not really answer the question

ADD REPLY • link updated 7.0 years ago by finswimmer 16k • written 7.0 years ago by fabian.grammes ▴ 10

1

Entering edit mode

Hello fabian.grammes ,

Why not ?

next-generation-sequencing tool are designed for data that were obtained from NGS. Its algorithms take into account what's the background of the data.

Could you please describe how you converted sanger sequencing data into a bam file? It is not clear to me how you can say that you have 6 reads? The concept of reads doesn't exist in sanger sequencing.

That does not really answer the question

Please read about the XY-Problem. ATpoint tried to show you, how you maybe can solve your real problem. It's a characteristic of a good community not just to answer the question that people ask, but try to get to the bottom of the real problem.

fin swimmer

ADD REPLY • link 7.0 years ago by finswimmer 16k

0

Entering edit mode

Hey

Could you please describe how you converted sanger sequencing data into a bam file? It is not clear to me how you can say that you have 6 reads? The concept of reads doesn't exist in sanger sequencing.

The .ab1 files were transformed 2 .fq, and quality trimmed (mott http://biopython.org/DIST/docs/api/Bio.SeqIO-module.html) traces of the vector eliminated with cutadapt and then mapped to my genome using bwa. I'm considering just mapping to the amplicons instead. 6 reads means 6 .ab1 sequences from the same individual but different clones.

Please read about the XY-Problem. ATpoint tried to show you, how you maybe can solve your real problem.

I read about it, however I still would like to know why bcftools misses/ does not call 3 of my indels

ADD REPLY • link 7.0 years ago by fabian.grammes ▴ 10

1

Entering edit mode

As finswimmer said, bcftools is designed for NGS data, where a genome is fragmentated into smaller pieces, then massive-parallel-sequenced and the reads aggregated to reconstruct the underlying sequence. This technology with its involving chemistry and characteristics is taken into account by this tool, including the quality scores to decide about true variants and "false" artifacts. Sanger sequencing uses a whole different approach. It does not supply quality scores that are 1to1 interchangable with the quality values from NGS (no matter what this Biophyton module does), and most importantly, the coverage is way lower. This is acceptable because the accuracy is higher than in NGS. NGS compensates this by throughput per region. Still, the Sanger reads are not (even though it is technically possible) a proper inut for an NGS tool. Multiple- or pairwise sequence alignments as suggested above are the way to go for you. As for why bcftools misses some indels, well, because it was not trained to work with Sanger data.

ADD REPLY • link 7.0 years ago by ATpoint 89k

0

Entering edit mode

I appreciate the comments. I'll try to make it more clear:

bcftools processes the output from samtools mpileup. In the out put from samtools mpileup I can identify the 4 indels (subset from my original post)

ssa19   73307891    A   7   ,,-17ggggaggcggacctcag,-17ggggaggcggacctcag,-17ggggaggcggacctcag,,, 6<00:.8 ]]]]]]]
ssa19   73307897    G   4   ,,,-2gc,    =J27    ]]]]
ssa19   73307902    A   2   ,,-1c   58  ]]
ssa19   73307903    C   2   ,-2ct,+6tgtgga  8/  ]]

But after processing this with bcftools the 3 last indels disappear. Now this is in my view a strict bcftools question: I want to know why bcftools does this. If anyone knows please let me know.

ADD REPLY • link 7.0 years ago by fabian.grammes ▴ 10

0

Entering edit mode

bcftools assumes your sample is diploid by default. Saying this it is not possible that it will report more than 2 variants covering the same position.

Furthermore in the world of variant calling in ngs data a single read showing a certain variant is just one point that goes into the calculation of the properbility for a variant. Variant Caller askes more question like: "How many read s and what fraction show this variant? Whats the mapping quality of the reads? What's the base quality of the reads? How does these value look like on reads which show the variant and those having no variant? Where in the read does the variant appears? ...."

If you have 6 sanger sequences from 6 clones than you have strictly speaking 6 different samples that have to be analyzed independently. So there is no need in your case to take ngs tools for analysing. Just take these 6 sequences and do a pairwise sequence alignment for each of them like ATpoint said before.

fin swimmer

ADD REPLY • link 7.0 years ago by finswimmer 16k

0

Entering edit mode

Hey! I appreciate the discussion, and the ploidy point is interesting. But I don't think it applies in this case. - Because only the 4th indel would be polyploid - When I run bcftools on a smaller region, avoiding the 1st large indel I do not any other indels

ADD REPLY • link 7.0 years ago by fabian.grammes ▴ 10

0

Entering edit mode

When I worked on Sanger sequencing I used phred/phrap/consed and that was great for finding indels. You might give that a try.

ADD REPLY • link 7.0 years ago by Fabio Marroni ★ 3.0k