Distance to reads ends
1
0
Entering edit mode
9.3 years ago
Max Ivon ▴ 140

Hello everyone!
Im using SAMTOOLS, GATK UG and HC for SNP calling from target sequencing paired-end data. In contrast to all the others, UG produces massive amount of mutations near read starts/ends. Quality trimming actually is not ably to handle this problem completely since there is sometimes aligments errors near read ends. I didnt find any usefull information about root mean square distance to reads ends or something like this that GATK keeps in information fields in VCF. So im interesting if there is any opportunity to skip mutations falling near read start/end using GATK or how can i avoid this problem. And also is there any ready solution to calculate mean distance or root mean square distance to the nearest read end from single base position based on sam/bam file?

Thanks in advance

samtools GATK sequencing SNP • 2.6k views
ADD COMMENT
0
Entering edit mode

Not clear to me: one variant in a VCF != one mismatch in one read .

ADD REPLY
0
Entering edit mode

It depends on the parameters used for variant calling (dont know for sure if is it possible for GATK to report each mismatch or not, but samtools definitely can). But I understand that generally its true. Im talking about systematic error (a lot of mismatches/alignemt errors) near the read ends (usually first b.p. of the read), which are reported by GATK UG as high quality mutations, but they surely are not. And the main quastion is: is there any ready solution to calculate the mean root square distance to the closest reads ends from selected base pair position based on the sam/bam file. I think this problem is rather sufficient in target sequencing and there should be ready solutions to solve it.

ADD REPLY
2
Entering edit mode
9.3 years ago

Using https://github.com/lindenb/jvarkit/wiki/SamFixCigar to get the 'X' in the cigar string, and https://github.com/lindenb/jvarkit/wiki/BioAlcidae to print the closest distance between a 'X' and the ends of the reads.

The javascript part:

while(iter.hasNext())
    {
    var rec = iter.next();
    if(rec.getReadUnmappedFlag()) continue;
    var cigar = rec.getCigar();
    if( cigar == null ) continue;

     var varPos=-1;
    var readpos=0;
    for(var i=0;i< cigar.numCigarElements();++i)
        {
        var ce = cigar.getCigarElement(i);
        var op = ce.getOperator();
        if( ! op.consumesReferenceBases()) continue;
        for(var x=0;x < ce.getLength();++x)
            {
            if( op.name() == "X")
                {
                   var distance = Math.min(readpos, (rec.getReadLength()-1)-readpos);
                   if( varPos ==-1 ||  distance< varPos )
                       {
                       varPos= distance;
                       }
                }
            readpos++;
            }
        }
    if(varPos!=-1)
        {
        out.println(varPos);
        }
    }

invoke:

java -jar dist-1.133/samfixcigar.jar -r ref.fa input.bam | java -jar dist-1.133/bioalcidae.jar -f script.js -F sam

output:

13.0
2.0
12.0
13.0
24.0
18.0
9.0
2.0
17.0
17.0
2.0
11.0
1.0
6.0
7.0
1.0
1.0
9.0
14.0
22.0
0.0
1.0
2.0
0.0
ADD COMMENT

Login before adding your answer.

Traffic: 2477 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6