Hi. I have paired end reads (75 bp) from a Nextseq run. Read 1 is 18 bp and Read 2 is 52 bp. R2 is where all the cDNA information is, and R1 is UMI (8) and Barcode (10bp). I've already run umi_tools
to detect the UMIs and append those to read file info (https://github.com/CGATOxford/UMI-tools). So R1 now just has the 10 bp barcode left. I think the most efficient way is to take that 10 bp barcode and join it directly to the R2 read on the 5' end. Then I can take the joined R2+R1 fastqs through a program likereaper
to demultiplex based on barcodes.
So my question is, what tool can I use for a simple join like that? I've seen tools like ea-utils that seem to do a join and merge based on overlapping bps between R1 and R2, but that's not what I want. I just want to add R1 to R2, no merge or overlap. Any guidance would be great.
Which of the two read headers would you want to keep?
Hi. Thanks for the reply. R2 header.
I am sure there is a
paste
solution that will work. @Pierre, the master of these one liners should be by soon.Here is a bad solution until someone improves on it (you will have to uncompress the files, replace
YOUR_SEQUENCER_ID
)Thanks. I'll give it a try. But can I use .fastq.gz? Sorry, I'm not a Unix pro, so this is my attempt:
join <(nl gzip -cd |cat AC1-10_S34_R1_001.fastq.gz) <(nl gzip -cd |cat AC1-10_S34_R2_001.fastq.gz) | awk -F ' ' '{if (@machineID ) {print $4" "$5;} else {print $2$3;}}' > new.fastq
Sorry I meant to say
sequencer ID
(not your machine ID in the sense of computer). It should be the@STRING
at beginning of all read headers. Changed.It worked! Very nice. I need to spend more time learning the power of
awk
. Here is my update for for.gz
input and output. I will probably now make some kind of loop to process all my files. Thank you!