Question

Extract unmapped reads as pairs from (S/B)AM and create corresponding split read files.

0

Entering edit mode

3.6 years ago

jens ▴ 40

I have a set of 23 fastq files concerning different treatments (each is a pair of /1 and /2 files so 46 overall). I got hold of a transcriptome assembly for the organism.

Mapping the fastq files to the assembly revealed mapping rates between 50% and 90%.

My aim then was to amend the assembly. Approach:

extract all the unmapped pairs (even if one of the mate pairs has actually been mapped).
Merge all samples
Create separate fastq files for the mates
perform a transcriptome assembly
merge assemblies

All this sounds easy, however, I am already stuck at 1-3.

1) I used samtools to extract the unmapped read pairs using '-f 4' but there are other options (8, 12) that might be relevant and it is not clear to me whether I actually extracted both mates for each pair.

2) Merging the samples should be as easy as 'cat *fastq > merged.fq'. However, it is unclear whether the order of mates in the file is relevant to downstream tools and I don't think that the order was preserved in 1) whether I used unprocessed SAM files, but clearly not when using position sorted BAM files.

3) I tried a bunch of tools to get this to work. seqtk seq requires the input file to be interleaved and that doesn't seem to be the case. Hydra bamtofastq actually extracted two separate files and there I saw that step 1) is probably off already.

4) I tried using Trinity in the first attempt but then the fastq files were not correct so it failed.

5) no idea how to do that yet. I might just append the files and deal with duplications further down the road.

Does anyone know a proper workflow for this? The size of the dataset might matter here: 80GB unmapped read.

bam sam unmapped paired fastq • 1.7k views

ADD COMMENT • link updated 3.5 years ago by GenoMax 147k • written 3.6 years ago by jens ▴ 40

0

Entering edit mode

I wrote my own code to solve the problem. For what it is worth, it is available on NPM: https://www.npmjs.com/package/samfilter. It was also fast enough for my purpose (10 min for ~30GB SAM files using 1 core). Not multithreaded or intended to become at any point, but you could just start multiple instances to work on several files at the same time.

ADD REPLY • link 3.5 years ago by jens ▴ 40

score 1 · Answer 1 · 2021-04-14

1

Entering edit mode

3.6 years ago

GenoMax 147k

You may want to try reformat.sh from BBMap suite.

reformat.sh in=your.bam out1=R1.fq.gz out2=R2.fq.gz unamppedonly=t pairedonly=t primaryonly=t

That should extract reads that are unmapped, properly paired and are single copy. Set pairedonly=f if you want single reads that may be unmapped.

This assumes that your aligned file contains all the necessary information. You should name sort your BAM file before doing this conversion.

ADD COMMENT • link 3.6 years ago by GenoMax 147k

0

Entering edit mode

I just tried to do this for a bam file that contained previously extracted reads (name sorted it first) but reformat failed for that:

Input is being processed as paired java.lang.AssertionError: E00461:128:GW170713312:6:1101:991:8622 0 -1 + 629 779 00001000000000000000000 10 0 NGCGGGCATAGGTAATAGTACCCTGATATGCGGCTTGTGAGCGTCTGGCGGTAGGACTCTGTAGATGTAGATGTCGGTGGTATCCGTATCAGTAAGATATGTTGGTAGTGGAGGTGGGGTTTTGTGTTATTCGGCGGAGTTGGGATGCTTG !A<<-A7-----<FJ-<--<-FAJ--77<<<-77---<----7F-7--7--7-----7--<7F--<<<FJ<---<7---7A--7-7AA-77---7-7-7--------7<77-)A))-<-))---7-------7--)-)))-7-7--))--7 . . . . E00461:128:GW170713312:6:1101:991:8622 133 Transcript_15180 630 0 * = 630 0 NGCGGGCATAGGTAATAGTACCCTGATATGCGGCTTGTGAGCGTCTGGCGGTAGGACTCTGTAGATGTAGATGTCGGTGGTATCCGTATCAGTAAGATATGTTGGTAGTGGAGGTGGGGTTTTGTGTTATTCGGCGGAGTTGGGATGCTTG !A<<-A7-----<FJ-<--<-FAJ--77<<<-77---<----7F-7--7--7-----7--<7F--<<<FJ<---<7---7A--7-7AA-77---7-7-7--------7<77-)A))-<-))---7-------7--)-)))-7-7--))--7 YT:Z:UP at stream.SamReadInputStream.toReadList(SamReadInputStream.java:126) at stream.SamReadInputStream.fillBuffer(SamReadInputStream.java:90) at stream.SamReadInputStream.hasMore(SamReadInputStream.java:54) at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:668) at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:657)

So I guessed 'your.bam' needs to be the complete bam file, and reformat needs to extract the unmapped reads itself. I name_sorted the unextracted bam file and tried again but to no avail. The error I received:

Set INTERLEAVED to true Could not find sambamba. Found samtools 1.10 Input is being processed as paired java.lang.AssertionError: There is something wrong with the read pairing. 0, 0, true, true, 0, 0 at stream.ConcurrentGenericReadInputStream.readLists(ConcurrentGenericReadInputStream.java:440) at stream.ConcurrentGenericReadInputStream.run0(ConcurrentGenericReadInputStream.java:207) at stream.ConcurrentGenericReadInputStream.run(ConcurrentGenericReadInputStream.java:183) at java.base/java.lang.Thread.run(Thread.java:834)

Is that a known problem?

ADD REPLY • link 3.6 years ago by jens ▴ 40

0

Entering edit mode

There is something wrong with the read pairing.

Based on that error there is something inconsistent with your original alignments. Perhaps they were done with reads that were not in sync across R1/R2 files.

ADD REPLY • link 3.5 years ago by GenoMax 147k