I have a set of 23 paired-end fastq samples from different treatments (each sample consists of a /1 and a /2 file, so 46 files overall). I got hold of a transcriptome assembly for the organism.
Mapping the fastq files to the assembly revealed mapping rates between 50% and 90%.
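(If it matters, per-sample rates like these can be read off samtools flagstat; a minimal sketch, BAM names are placeholders:)

```
# Print the mapping-rate line from samtools flagstat for each sample
# (sample*.bam are placeholder names for the per-sample alignments)
for bam in sample*.bam; do
    echo "== ${bam}"
    samtools flagstat "${bam}" | grep 'mapped ('
done
```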
My aim then was to amend the assembly. The plan:
1) extract all unmapped read pairs (even if one mate of the pair was actually mapped)
2) merge all samples
3) create separate fastq files for the two mates
4) perform a transcriptome assembly of those reads
5) merge the new assembly with the existing one
All this sounds easy; however, I am already stuck at steps 1-3.
1) I used samtools to extract the unmapped reads with '-f 4', but there are other flag combinations (8, 12) that might be relevant, and it is not clear to me whether I actually extracted both mates of each pair.
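For reference, this is what I think the flag logic should look like to keep every pair where at least one mate is unmapped (a per-sample sketch, not validated):

```
# Flag bits: 4 = this read unmapped, 8 = mate unmapped
# -f requires the bit(s) to be set, -F requires them to be unset
in=sample.bam   # placeholder for one sample's alignment

samtools view -b -f 4 -F 8 "$in" > read_unmapped_mate_mapped.bam
samtools view -b -f 8 -F 4 "$in" > read_mapped_mate_unmapped.bam
samtools view -b -f 12     "$in" > both_unmapped.bam   # 12 = 4 + 8

# One BAM per sample containing every pair with at least one unmapped mate
samtools merge -f "${in%.bam}.unmapped_pairs.bam" \
    read_unmapped_mate_mapped.bam read_mapped_mate_unmapped.bam both_unmapped.bam
```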
2) Merging the samples should be as easy as 'cat *fastq > merged.fq'. However, it is unclear to me whether the order of the mates in the file matters to downstream tools, and I am not sure the order was preserved in 1) when I used unprocessed SAM files, but it clearly was not when using position-sorted BAM files.
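What I am considering to get around the ordering problem: name-sort the per-sample unmapped BAMs before any FASTQ extraction, so both mates of a pair end up adjacent again (a sketch, untested on the full dataset):

```
# Position sorting separates mates; a name sort puts them back next to each other.
for bam in *.unmapped_pairs.bam; do
    samtools sort -n -o "${bam%.bam}.namesorted.bam" "$bam"
done
```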
3) I tried a bunch of tools to get this to work. seqtk seq expects the input file to be interleaved, which does not seem to be the case here. Hydra's bamtofastq did extract two separate files, and there I saw that step 1) is probably already off.
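For step 3, the direction I am leaning towards is samtools fastq on the name-sorted BAMs, then concatenating per mate (a sketch, assuming a samtools version that provides the fastq subcommand; not run end to end yet):

```
# Split each name-sorted BAM into /1 and /2 files; singletons and
# reads without a READ1/READ2 flag go to /dev/null here.
for bam in *.namesorted.bam; do
    prefix="${bam%.namesorted.bam}"
    samtools fastq -1 "${prefix}_1.fq" -2 "${prefix}_2.fq" \
        -0 /dev/null -s /dev/null "$bam"
done

# Merge across samples; the two globs expand in the same (alphabetical) order,
# so the pairing between left.fq and right.fq should be preserved.
cat *_1.fq > left.fq
cat *_2.fq > right.fq
```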
4) I tried Trinity in a first attempt, but the fastq files were not correct, so it failed.
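For completeness, the Trinity call I had in mind once the mate files are sane (memory/CPU values are placeholders, not a recommendation):

```
# Assemble the merged unmapped reads; adjust resources to the machine at hand
Trinity --seqType fq \
        --left left.fq --right right.fq \
        --max_memory 100G --CPU 16 \
        --output trinity_unmapped
```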
5) No idea how to do this yet. I might just concatenate the assemblies and deal with duplicates further down the road.
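If it comes down to that, the naive version of step 5 would be plain concatenation, possibly followed by clustering to collapse near-identical contigs (cd-hit-est is just one option I have not tried on this data; the identity threshold is arbitrary):

```
# combined assembly = original assembly + assembly of the unmapped reads
# (trinity_unmapped.fasta is a placeholder for the Trinity output FASTA)
cat original_assembly.fasta trinity_unmapped.fasta > combined.fasta

# Optional: collapse contigs that are >=95% identical (threshold is arbitrary)
cd-hit-est -i combined.fasta -o combined.nr.fasta -c 0.95 -T 8 -M 16000
```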
Does anyone know a proper workflow for this? The size of the dataset might matter here: roughly 80 GB of unmapped reads.
I wrote my own code to solve the problem. For what it is worth, it is available on NPM: https://www.npmjs.com/package/samfilter. It was also fast enough for my purposes (about 10 min for a ~30 GB SAM file on a single core). It is not multithreaded, nor intended to ever become so, but you can simply start multiple instances to work on several files at the same time.