Question

Extracting Read Pairs From Region So That All Pairs Remain Intact

1

Entering edit mode

12.1 years ago

Davy ▴ 410

Hi all, I have some whole exome data and I am only interested in the exons of 4 genes, which I have in a bed file.

I ,perhap naively, tried extracting only those reads using this bed file with samtools

samtools view -b -L 4genes.bed myBAM.bam > myBAM.4genes.bam

which works fine, except when I go to convert the BAM file back to fastq with pircard it throws an error because of reads with missing mates.

java -Xmx4g -jar /path/to/picard/SamToFastQ.jar I=myBAM.4genes.bam O=my4genes.fastq VALIDATION_STRNGENCY=LENIENT

Runtime.totalMemory()=2029715456
To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp
Exception in thread "main" net.sf.picard.PicardException: Found 2638 unpaired mates
    at net.sf.picard.sam.SamToFastq.doWork(SamToFastq.java:190)
    at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
    at net.sf.picard.sam.SamToFastq.main(SamToFastq.java:121)

I also tried intersectBed for this but it had the same result. I guess what I am after is all the reads in my regions plus any mates of those reads that fall outside those regions.

I am hoping there is a tool and or option in samtools or something that I can use for this, but if I must I will program something to do this using python and pysam. I'm just being lazy and trying to find out if anyone else has done this already?

Cheers, Davy

B02V0ACXX110721:2:2205:9893:149433    163    1    10024    0    76M    =    10056    107 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA
ACCCTAACCCTAA    CCCFFFFFHHHHGJJIJJJIJJJJJHGGIIIHHIJIJGIGJJJHIJJIJIHIJJJGECHJHHGGFFFFECEEDDD=    MD:Z:76    PG:Z:bwa.15    RG:Z:B02V0.2    AM:i:0    NM:i:0    SM:i:0    MQ:i:17    UQ:i:0
B02V0ACXX110721:4:1108:21203:4558    163    1    10039    0    66M10S    =    10188    183    ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC
CTAACCCTAACCC    B@B7D3BD+AC;?3C??A@G:AAF@?CFF;)?DHBFDFDDGG3DH@F3=B@28)).6=>E76CE@###########    MD:Z:66    PG:Z:bwa.2    RG:Z:B02V0.4    AM:i:0    NM:i:0    SM:i:0    MQ:i:0UQ:i:0
B02FEACXX110730:8:2308:14858:37039    675    1    10042    0    63M13S    =    10184    176    CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA
ACCCTAACCCTAA    ;?>@?BC<9BC?E<>>D=BB5BEDCB?CEF<B(AF;EC:<C+6A8AA>DD>CF7ED>?,<<C##############    MD:Z:63    PG:Z:bwa.8    RG:Z:B02FE.8    AM:i:0    NM:i:0    SM:i:0    MQ:i:0OQ:Z:@<@D?BB;?FF>D9E?F;BC9AEA@@FGGG8?)?F9DB>;B)0?9B?8=C=@F2@FA;(66?##############    UQ:i:0

samtools bam picard fastq • 7.9k views

ADD COMMENT • link updated 7.0 years ago by cmdcolin ★ 4.3k • written 12.1 years ago by Davy ▴ 410

0

Entering edit mode

Try: bamtools filter -isMateMapped true -in myBAM.4genes.bam -out myBAM.4genes.bothmapped.bam

ADD REPLY • link 12.1 years ago by skymningen ▴ 350

0

Entering edit mode

Thanks, but this gives the same result.

ADD REPLY • link 12.1 years ago by Davy ▴ 410

0

Entering edit mode

The lazy and slow way is to do the samtools filtering, then extract the read names that it gave, then grep in the original bam file for those read names. It will pull out each read that mapped in the region, or its mate, assuming the names are the same (which they should be). It will be quite slow as the bam grows, but on the other hand it is trivially parallelizable if you want to split up the bam.

ADD REPLY • link 12.1 years ago by matted 7.8k

0

Entering edit mode

Can you show us some sample reads? I'm thinking the read names might be goofy and that's throwing off some of your tools. This happened to me when working with TCGA BAM files.

ADD REPLY • link 12.1 years ago by Dan D 7.4k

0

Entering edit mode

Reads added above. Thanks for the assistance.

ADD REPLY • link 12.1 years ago by Davy ▴ 410

0

Entering edit mode

Could you figure out the solution to this problem? I am trying to do something similar: I want to extract the reads mapping to my region of interest and also extract it's pair even if is mapped out of region or unmapped. What I am trying to do is:

(i) extracting the read names which map to the region of interest samtools view coordinate-sorted-WXS.bam chr2:29907-400156 chr2:33030-33050 | awk '{print $1}' | sort | uniq > mapped_reads.txt

(ii) extracting the reads by name: For this I found these two tools but both of them are giving me bam files which have the same issue described above–not both pairs are included in bam file. These tools are:

ExtractReads: http://timoast.github.io/2015/10/12/ExtractReads/

SamGrep https://github.com/lindenb/jvarkit/wiki/SamGrep/

If anyone of you have resolved this please let me know.

Thanks, Arun

ADD REPLY • link 8.6 years ago by akhattri ▴ 50

score 0 · Answer 1 · 2013-09-16

0

Entering edit mode

12.1 years ago

Pierre Lindenbaum 166k

filter with samtools view and the required flag 'read mapped in proper pair'

or fix myBAM.4genes.bam with (? I think that should work) picard FixMateInformation http://picard.sourceforge.net/command-line-overview.shtml#FixMateInformation

ADD COMMENT • link 12.1 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

So I tried FixMateInformation, and that didn't work. I also tried (emphasis on tried) filtering with samtools view -f (tried 2 and 3 here) but I didn't really get anywhere. Samtools flagstat reports 0 singletons but picard now says "Found 2587 unpaired mates". Any other suggestions, let me know, but for the meantimes I think I will begin investigating how to do this in pysam.

ADD REPLY • link 12.1 years ago by Davy ▴ 410

score 0 · Answer 2 · 2013-09-16

0

Entering edit mode

12.1 years ago

Ashutosh Pandey 12k

I am not sure if I understood you correctly but why can't you extract those reads using two steps:

1) Get all the reads that are mapped with in a region that you specified. I think you have already solved it.

2) Now you can get all the unique read ids from the new bam file and extract both forward and reverse reads corresponding to that read id from the original bam file. You can use a single grep command with multiple strings or read ids. Of course , this is not an efficient way but it should work.

ADD COMMENT • link 12.1 years ago by Ashutosh Pandey 12k

0

Entering edit mode

Would love a good script that handles step 2

ADD REPLY • link 7.0 years ago by cmdcolin ★ 4.3k

score 0 · Answer 3 · 2018-10-16

A new software called Bazam is built for doing this type of task https://github.com/ssadedin/bazam

It is designed to output a FASTQ for realignment and doesn't re-output the original BAM data as currently implemented I think but I made a request to maybe enable BAM/SAM output of the tool. Preprint here https://www.biorxiv.org/content/early/2018/10/04/433003