Extract paired-end reads with one end falls within a given region from a bam file
2
0
Entering edit mode
2.3 years ago
Wang Cong ▴ 10

I have paired-end sequencing reads and two ends of a read may not come from the same region. I want to extract those that have at least one end mapping on a certain region of the genome.

I have found the following command. But it seems that this command does not extract the read whose one end falls within the region while the other one does not.

samtools view input.bam "Chr10:18000-45500" > output.bam
samtools • 1.7k views
ADD COMMENT
1
Entering edit mode

I think bedtools pairtobed can do what you want.

ADD REPLY
0
Entering edit mode

liorglic pairtobed uses a 6 column bed because I am getting an error with a 3 col bed. I even tried creating a 6 col bed adding . however that does not work. Any clues ?

ADD REPLY
0
Entering edit mode
2.3 years ago

using samjdk http://lindenb.github.io/jvarkit/SamJdk.html to get the read or it's mate in "RF01:67-68"

 java -jar dist/samjdk.jar -e 'final String contig="RF01"; final int start = 67; final int end = 68; if(!record.getReadUnmappedFlag() && record.getContig().equals(contig) && !(record.getStart() > end || record.getEnd() < start)) { return true; } if(record.getReadPairedFlag() && !record.getMateUnmappedFlag() && record.getMateReferenceName().equals(contig)) { final int mstart = record.getMateAlignmentStart(); final int mend = SAMUtils.getMateCigar(record)!=null?SAMUtils.getMateAlignmentEnd(record):mstart; return !(mstart > end || mend < start); } return false;'  in.bam 

EDIT:

see also Extract reads within given region, and their mates ; How to extract reads in a specific region from a BAM file and also their mates mapped elsewhere ? ; Filtering out of both read pairs from BAM file if any of the two align within interval

ADD COMMENT
0
Entering edit mode
2.3 years ago
yhoogstrate ▴ 150

If it is about also extracting the corresponding mates, I think I have written an utility for that. Though it was designed for finding / subsetting fusion reads.

It works like this:

dr-disco bam-extract 'chr1:123-456' 'chr7:123-456' input.bam sorted.output.bam

That way it reports all reads and all their aligned segments and mates, that have at least one mate aligned in any of these regions and thus all fusion candidates. However, if you just use the same location twice, you will have the equivalent of what you need:

dr-disco bam-extract -o 'chr10:18000-45500' 'chr10:18000-45500' input.bam output.bam

https://github.com/yhoogstrate/dr-disco

ADD COMMENT

Login before adding your answer.

Traffic: 2534 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6