Extracting Read Pairs From Region So That All Pairs Remain Intact
3
1
Entering edit mode
11.3 years ago
Davy ▴ 410

Hi all, I have some whole exome data and I am only interested in the exons of 4 genes, which I have in a bed file.

I ,perhap naively, tried extracting only those reads using this bed file with samtools

samtools view -b -L 4genes.bed myBAM.bam > myBAM.4genes.bam

which works fine, except when I go to convert the BAM file back to fastq with pircard it throws an error because of reads with missing mates.

java -Xmx4g -jar /path/to/picard/SamToFastQ.jar I=myBAM.4genes.bam O=my4genes.fastq VALIDATION_STRNGENCY=LENIENT

Runtime.totalMemory()=2029715456
To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp
Exception in thread "main" net.sf.picard.PicardException: Found 2638 unpaired mates
    at net.sf.picard.sam.SamToFastq.doWork(SamToFastq.java:190)
    at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
    at net.sf.picard.sam.SamToFastq.main(SamToFastq.java:121)

I also tried intersectBed for this but it had the same result. I guess what I am after is all the reads in my regions plus any mates of those reads that fall outside those regions.

I am hoping there is a tool and or option in samtools or something that I can use for this, but if I must I will program something to do this using python and pysam. I'm just being lazy and trying to find out if anyone else has done this already?

Cheers, Davy

B02V0ACXX110721:2:2205:9893:149433    163    1    10024    0    76M    =    10056    107 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA
ACCCTAACCCTAA    CCCFFFFFHHHHGJJIJJJIJJJJJHGGIIIHHIJIJGIGJJJHIJJIJIHIJJJGECHJHHGGFFFFECEEDDD=    MD:Z:76    PG:Z:bwa.15    RG:Z:B02V0.2    AM:i:0    NM:i:0    SM:i:0    MQ:i:17    UQ:i:0
B02V0ACXX110721:4:1108:21203:4558    163    1    10039    0    66M10S    =    10188    183    ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC
CTAACCCTAACCC    B@B7D3BD+AC;?3C??A@G:AAF@?CFF;)?DHBFDFDDGG3DH@F3=B@28)).6=>E76CE@###########    MD:Z:66    PG:Z:bwa.2    RG:Z:B02V0.4    AM:i:0    NM:i:0    SM:i:0    MQ:i:0UQ:i:0
B02FEACXX110730:8:2308:14858:37039    675    1    10042    0    63M13S    =    10184    176    CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA
ACCCTAACCCTAA    ;?>@?BC<9BC?E<>>D=BB5BEDCB?CEF<B(AF;EC:<C+6A8AA>DD>CF7ED>?,<<C##############    MD:Z:63    PG:Z:bwa.8    RG:Z:B02FE.8    AM:i:0    NM:i:0    SM:i:0    MQ:i:0OQ:Z:@<@D?BB;?FF>D9E?F;BC9AEA@@FGGG8?)?F9DB>;B)0?9B?8=C=@F2@FA;(66?##############    UQ:i:0
samtools bam picard fastq • 7.1k views
ADD COMMENT
0
Entering edit mode

Try: bamtools filter -isMateMapped true -in myBAM.4genes.bam -out myBAM.4genes.bothmapped.bam

ADD REPLY
0
Entering edit mode

Thanks, but this gives the same result.

ADD REPLY
0
Entering edit mode

The lazy and slow way is to do the samtools filtering, then extract the read names that it gave, then grep in the original bam file for those read names. It will pull out each read that mapped in the region, or its mate, assuming the names are the same (which they should be). It will be quite slow as the bam grows, but on the other hand it is trivially parallelizable if you want to split up the bam.

ADD REPLY
0
Entering edit mode

Can you show us some sample reads? I'm thinking the read names might be goofy and that's throwing off some of your tools. This happened to me when working with TCGA BAM files.

ADD REPLY
0
Entering edit mode

Reads added above. Thanks for the assistance.

ADD REPLY
0
Entering edit mode

Could you figure out the solution to this problem? I am trying to do something similar: I want to extract the reads mapping to my region of interest and also extract it's pair even if is mapped out of region or unmapped. What I am trying to do is:

(i) extracting the read names which map to the region of interest samtools view coordinate-sorted-WXS.bam chr2:29907-400156 chr2:33030-33050 | awk '{print $1}' | sort | uniq > mapped_reads.txt

(ii) extracting the reads by name: For this I found these two tools but both of them are giving me bam files which have the same issue described aboveā€“not both pairs are included in bam file. These tools are:

ExtractReads: http://timoast.github.io/2015/10/12/ExtractReads/

SamGrep https://github.com/lindenb/jvarkit/wiki/SamGrep/

If anyone of you have resolved this please let me know.

Thanks, Arun

ADD REPLY
0
Entering edit mode
11.3 years ago

filter with samtools view and the required flag 'read mapped in proper pair'

or fix myBAM.4genes.bam with (? I think that should work) picard FixMateInformation http://picard.sourceforge.net/command-line-overview.shtml#FixMateInformation

ADD COMMENT
0
Entering edit mode

So I tried FixMateInformation, and that didn't work. I also tried (emphasis on tried) filtering with samtools view -f (tried 2 and 3 here) but I didn't really get anywhere. Samtools flagstat reports 0 singletons but picard now says "Found 2587 unpaired mates". Any other suggestions, let me know, but for the meantimes I think I will begin investigating how to do this in pysam.

ADD REPLY
0
Entering edit mode
11.3 years ago

I am not sure if I understood you correctly but why can't you extract those reads using two steps:

1) Get all the reads that are mapped with in a region that you specified. I think you have already solved it.

2) Now you can get all the unique read ids from the new bam file and extract both forward and reverse reads corresponding to that read id from the original bam file. You can use a single grep command with multiple strings or read ids. Of course , this is not an efficient way but it should work.

ADD COMMENT
0
Entering edit mode

Would love a good script that handles step 2

ADD REPLY
0
Entering edit mode
6.2 years ago
cmdcolin ★ 4.0k

A new software called Bazam is built for doing this type of task https://github.com/ssadedin/bazam

It is designed to output a FASTQ for realignment and doesn't re-output the original BAM data as currently implemented I think but I made a request to maybe enable BAM/SAM output of the tool. Preprint here https://www.biorxiv.org/content/early/2018/10/04/433003

ADD COMMENT

Login before adding your answer.

Traffic: 2222 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6