Counting Reads In Sam Format
2
1
Entering edit mode
13.5 years ago
Pdubois ▴ 30

Typically mapping output will show many reads that are very close to each other. Meaning the difference of mapping location differ within 10 positions.

Is there any existing tool to cluster such reads from SAM output and compute the frequency of reads mapped in the cluster?

Surely we can write an all-against-all read comparison, but it takes very longtime to run. Perhaps there are faster tool around to do this task?

sam map next-gen sequencing counts • 4.3k views
ADD COMMENT
0
Entering edit mode

Do you want non-redundant clusters (i.e. each read is in exactly one cluster), or (like Pierre's answer, I think) for each read all reads within 10 positions? And if the former, do you want transitive closure (single linkage), or some other type?

ADD REPLY
0
Entering edit mode

Do you want non-redundant clusters (i.e. each read is in exactly one cluster), or (like Pierre's answer, I think) for each read all reads within 10 positions? And if the former, do you want transitive closure (single linkage), or some other type? How do you want to deal with paired ends - should both ends be included in the same cluster?

ADD REPLY
3
Entering edit mode
13.5 years ago

as the BAM file are indexed I would use the following java program to count the reads in a 20pb window around one read. A first loop, reads all the SamRecord, and a second loop reads the records around the first one.

import java.io.File;

import net.sf.samtools.SAMFileReader;
import net.sf.samtools.SAMRecord;
import net.sf.samtools.SAMRecordIterator;


public class Biostar8724 
    {
    public static void main(String[] args)
        {
        try {
            int window_size=10;
            int optind=0;
            while(optind<args.length)
                {
                if(args[optind].equals("-h"))
                    {
                                    System.out.println("[bam file]");
                    return;
                    }
                else if(args[optind].equals("-w"))
                    {
                    window_size+=Integer.parseInt(args[++optind]);
                    }
                else if(args[optind].equals("--"))
                    {
                    optind++;
                    break;
                    }
                else if(args[optind].startsWith("-"))
                    {
                    System.err.println("Unnown option: "+args[optind]);
                    return;
                    }
                else
                    {
                    break;
                    }
                ++optind;
                }
            if(optind+1!=args.length)
                {
                System.err.println("BAM Missing");
                return;
                }
            File bamFile=new File(args[optind++]);
            SAMFileReader samFileReader1=new SAMFileReader(bamFile);
            SAMFileReader samFileReader2=new SAMFileReader(bamFile);
            for (final SAMRecord rec1 : samFileReader1)
                {
                if(rec1.getReadUnmappedFlag()) continue;
                if(rec1.getReadFailsVendorQualityCheckFlag()) continue;
                if(rec1.getDuplicateReadFlag()) continue;
                int count=0;
                SAMRecordIterator iter=samFileReader2.queryOverlapping(rec1.getReferenceName(),
                        Math.max(1,rec1.getUnclippedStart()-window_size),
                        rec1.getUnclippedEnd()+window_size);
                while(iter.hasNext())
                    {
                    SAMRecord rec2=iter.next();
                    if(rec2.getReadUnmappedFlag()) continue;
                    if(rec2.getReadFailsVendorQualityCheckFlag()) continue;
                    if(rec2.getReadName().equals(rec1.getReadName())) continue;
                    if(rec2.getDuplicateReadFlag()) continue;
                    ++count;
                    }
                iter.close();
                System.out.println(rec1.getReadName()+"\t"+rec1.getReferenceName()+"\t"+rec1.getUnclippedStart()+"\t"+rec1.getUnclippedEnd()+"\t"+count);
                }
            samFileReader1.close();
            samFileReader2.close();
            } 
        catch (Exception e)
            {
            e.printStackTrace();
            }

        }
    }

compile & execute:

javac -cp path/to/sam-1.36.jar:path/to/picard-1.36.jar:. Biostar8724.java 
java -cp path/to/sam-1.36.jar:path/to/picard-1.36.jar:. Biostar8724  /path/to/file.bam | head
IL31_4368:1:35:14014:18146    chr1    9995    10046    56
IL31_4368:1:6:19181:5899    chr1    9996    10048    60
IL31_4368:1:99:8873:5346    chr1    9996    10050    60
IL31_4368:1:106:6092:8497    chr1    9996    10047    58
IL31_4368:1:40:16308:9538    chr1    10002    10055    61
IL31_4368:1:106:14676:10848    chr1    10007    10060    60
IL31_4368:1:23:18397:11046    chr1    10007    10060    61
IL31_4368:1:42:7251:11444    chr1    10009    10062    61
IL31_4368:1:96:9905:17106    chr1    10010    10063    61
IL31_4368:1:106:14676:10848    chr1    10014    10067    60

edit: another idea, as the records are sorted you could also only use only one loop and put the records in a queue, removing the most distant records at each iteration.

ADD COMMENT
2
Entering edit mode
13.5 years ago
Ketil 4.1k

If you want single linkage clustering, it should be easy to parse SAM files to generate this with a linear traversal, just make sure the SAM file is sorted first. If you want to take paired ends into account, it becomes slightly more complicated.

For completeness, here's a single linkage solution, assuming we ignore paired reads, and that the distance of ten is between beginning of reads. The input is a list of (read,position), trivially parseable from the SAM file. ∞ is maxint for whatever type positions are.

clusters = go [] ∞ input

go c p0 ((rd,pos):rest) | pos-10 < p0 = go (rd:c) pos rest          -- add rd to the cluster
                        | otherwise   = c : go [] ∞ ((rd,pos):rest) -- output cluster and continue
go _ _ [] = []

(untested :-)

ADD COMMENT
0
Entering edit mode

Hi Ketil - sorry to appear ignorant, but I have no idea what language this is, and can't parse it as a result. Could you enlighten me?

ADD REPLY
0
Entering edit mode

This is Haskell. 'clusters' calls the recursive helper function 'go', which takes a cluster accumulator 'c', a current position 'p0', and the input list of SAM records, the first element being a pair of a read 'rd' and its position 'pos'. When 'pos'-10 < 'p0', we put 'rd' on top of the current cluster and recurse, if not we return the current cluster, and start a new one on the remaining input. The last line terminates the recursion when input is empty. Clear? :-)

ADD REPLY
0
Entering edit mode

Good to see someone using Haskell for NGS, this language never went into my brain, functional programming !!!

ADD REPLY

Login before adding your answer.

Traffic: 2530 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6