Question

Extracting reads mapped to any of multiple references

2

Entering edit mode

9.2 years ago

novice ★ 1.1k

I can extract reads mapped to a single reference sequence by first aligning the reads using, say, BWA, and processing with SAMtools:

samtools view -b -f 0x2 alignment.sam | samtools fastq - -1 mapped_1.fastq -2 mapped_2.fastq

Is there a similar pipeline I can use to extract reads mapped to multiple references? That is, if a read was found in any of the given references, it should be included in the output.

Here's an idea I have:

align the reads to each reference separately (producing multiple SAM files)
extract the reads mapped to each reference using the above pipeline
sort and uniq the names of the reads from all output files, storing the names in file.txt
grep the names in file.txt from one of the SAM files and store in a new SAM file
convert SAM to FASTQ

I'm guessing this is probably not the most efficient way to solve the problem. It also doesn't catch when a read might be mapped to one of the references without its pair. In this case, I'd like to extract both reads.

samtools alignment • 6.0k views

ADD COMMENT • link updated 9.2 years ago by thackl ★ 3.0k • written 9.2 years ago by novice ★ 1.1k

0

Entering edit mode

9.2 years ago

thackl ★ 3.0k

If you store your bwa output in bam, rather than sam, you an use samtools view -b -f 0x2 alignment.bam REGION, with REGION being a single or multiple reference sequence IDs

 samtools view -b -f 0x2 alignment.bam ref1 ref2 refx | samtools fastq ...

ADD COMMENT • link 9.2 years ago by thackl ★ 3.0k

score 3 · Accepted Answer · 2016-06-27

The most efficient way would be to map to all the references at once. You can do this with any aligner if you just concatenate the references. Alternatively you could map to all at once using BBSplit, which can produce one fastq output file per reference, containing all the reads (or pairs) that best map to that reference. But if you simply want all reads that map to any reference in a single fastq file, along with their unmapped mates, you can do this:

1) Concatenate the references with cat.
2) Map the reads with this command:
bbmap.sh ref=references.fa in1=read1.fq.gz in2=read2.fq.gz outm=mapped.fq.gz outu=unmapped.fq.gz

If one or both reads in a pair map to anything, both will go to "outm"; if neither map, both will go to "outu". The output files will be interleaved, but if you use "outm1" and "outm2" (and etc.) you can get the output in dual files.