I'm looking to iterate through an indexed BAM file using picard and perform various tests on both a read and it's mate. For some I would need the full SAMRecord for the mate so I can't just use the getMate() methods of the record. I can read through the file using the iterator but once I add in a line that creates the mate record the program slows to a crawl. I've tried the methods SAMFileReader methods queryMate(), query(). The rest of the query() methods are some variant of those (queryContained, queryAlignmentStart, queryOverlapping) and all end up calling the same path.
I've traced down that one of the instances of speed loss ocurs when the BAMFileSpan is created in BAMFileReader.createIndexIterator(). Also it appears that no matter what indexing regions are created, an iterator is created that must re-traverse the whole file rather than looking at the file by an offset as would be the case with samtools mpileup.
Is there a way to resolve this? Currently putting one line into my read loop changes the normal read time of about 20 seconds to not completing within hours.
A bare bones version of the code is below. As is, it should run very quickly, even still on the order of minutes for large bam files, while if you comment out either of the methods for finding the mate record, it does not finish under reasonable time scope.
import java.util.Date;
import net.sf.samtools.SAMFileHeader;
import net.sf.samtools.SAMFileReader;
import net.sf.samtools.SAMRecord;
import net.sf.samtools.SAMRecordIterator;
import net.sf.samtools.SAMFileReader.ValidationStringency;
import net.sf.samtools.Cigar;
import org.apache.commons.io.FilenameUtils;
public class main {
public static void main(String[] args) {
Date startDate = new Date();
System.out.println("Start Time is " + startDate.toString());
String baseName = FilenameUtils.getBaseName(args[0]);
String inputPath = FilenameUtils.getFullPath(args[0]);
// Create 2 samfile readers from the BAM file
// note that this needs to be done twice because you cannot query while using the iterator
SAMFileReader inputSam = null;
SAMFileReader inputSamForQuery = null;
try
{
inputSam=new SAMFileReader(new File(args[0]), new File(args[0] + ".bai"));
inputSamForQuery = new SAMFileReader(new File(args[0]), new File(args[0] + ".bai"));
}
catch (NullPointerException e1)
{
System.out.println("bamFileIn not a valid bam file or lacking *.bai index file");
}
inputSam.setValidationStringency(ValidationStringency.SILENT);
SAMRecordIterator iter=inputSam.iterator();
SAMFileWriterFactory sf=new SAMFileWriterFactory();
sf.setCreateIndex(false);
SAMFileHeader inHeader = inputSam.getFileHeader();
long counter = 0;
while(iter.hasNext())
{
SAMRecord rec=iter.next();
int matePos = rec.getMateAlignmentStart();
int mateRefInd = rec.getMateReferenceIndex();
String mateName = rec.getMateReferenceName();
// SAMRecord mateRec = inputSamForQuery.queryMate(rec);
// ..........or..............
// SAMRecord mateRec = null;
// SAMRecordIterator iterMate = inputSamForQuery.query(mateName, matePos, 0, true);
// while (iterMate.hasNext())
// {
// mateRec = iterMate.next();
// if (mateRec.getReadName().equals(name))
// {
// break;
// }
// }
// iterMate.close();
Cigar recCigar = rec.getCigar();
String recCigStr = recCigar.toString();
// Cigar mateCigar = mateRec.getCigar();
// String mateCigStr = mateCigar.toString();
int recPos = rec.getAlignmentStart();
// int matePos = mateRec.getAlignmentStart();
}
iter.close();
inputSam.close();
Date endDate = new Date();
System.out.println("End time is " + endDate.toString());
}
}
In the line specified, I was trying to force the user to create an indexed *.bai file before running. The correction you gave will work, but if the file isn't there, it just won't include indexing. For this case sorting the file first will work as a work-around since I'm only ever looking at the read and mate pair. But, am I to assume from the response then that using picard you cannot simply look at region and grab the reads without the overhead processing of iterating back through the file to get there?
As long as your trying to get the mate for the properly paired reads (mate is on the same chromosome, not too far away), your could try to buffer the reads. See my code for Reverse read direction in a bam file https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/biostar/Biostar76892.java
Unfortunately, not necessarily; one of the conditions I'm to be looking for are discordant pairs with at least some cases where the mate is on a different chromosome. I considered two buffers, one similar to what you implemented, and one that stores the SAMRecord for those that aren't 'close enough' with detail about the mate such that every new record searches the list to see if they're a previous mate, prior to executing anything else. But not only is that convoluted but at risk of processing issues as well if the list grows large.