Question

Tool to append Illumina Read 1 to Read 2 for downstream demultiplexing

0

Entering edit mode

8.4 years ago

achamess ▴ 90

Hi. I have paired end reads (75 bp) from a Nextseq run. Read 1 is 18 bp and Read 2 is 52 bp. R2 is where all the cDNA information is, and R1 is UMI (8) and Barcode (10bp). I've already run umi_tools to detect the UMIs and append those to read file info (https://github.com/CGATOxford/UMI-tools). So R1 now just has the 10 bp barcode left. I think the most efficient way is to take that 10 bp barcode and join it directly to the R2 read on the 5' end. Then I can take the joined R2+R1 fastqs through a program likereaper to demultiplex based on barcodes.

So my question is, what tool can I use for a simple join like that? I've seen tools like ea-utils that seem to do a join and merge based on overlapping bps between R1 and R2, but that's not what I want. I just want to add R1 to R2, no merge or overlap. Any guidance would be great.

sequencing RNA-Seq • 4.5k views

ADD COMMENT • link 8.4 years ago by achamess ▴ 90

1

Entering edit mode

Which of the two read headers would you want to keep?

ADD REPLY • link 8.4 years ago by GenoMax 153k

0

Entering edit mode

Hi. Thanks for the reply. R2 header.

ADD REPLY • link 8.4 years ago by achamess ▴ 90

1

Entering edit mode

I am sure there is a paste solution that will work. @Pierre, the master of these one liners should be by soon.

ADD REPLY • link 8.4 years ago by GenoMax 153k

1

Entering edit mode

Here is a bad solution until someone improves on it (you will have to uncompress the files, replace YOUR_SEQUENCER_ID)

join <(nl cat R1.fastq) <(nl cat R2.fastq) | awk -F ' ' '{if ($2 ~ /^@YOUR_SEQUENCER_ID/ ) {print $4" "$5;} else {print $2$3;}}' > new.fastq

ADD REPLY • link 8.4 years ago by GenoMax 153k

0

Entering edit mode

Thanks. I'll give it a try. But can I use .fastq.gz? Sorry, I'm not a Unix pro, so this is my attempt:

join <(nl gzip -cd |cat AC1-10_S34_R1_001.fastq.gz) <(nl gzip -cd |cat AC1-10_S34_R2_001.fastq.gz) | awk -F ' ' '{if (@machineID ) {print $4" "$5;} else {print $2$3;}}' > new.fastq

ADD REPLY • link 8.4 years ago by achamess ▴ 90

0

Entering edit mode

Sorry I meant to say sequencer ID (not your machine ID in the sense of computer). It should be the @STRING at beginning of all read headers. Changed.

ADD REPLY • link 8.4 years ago by GenoMax 153k

0

Entering edit mode

It worked! Very nice. I need to spend more time learning the power of awk. Here is my update for for .gz input and output. I will probably now make some kind of loop to process all my files. Thank you!

ADD REPLY • link 8.4 years ago by achamess ▴ 90

score 3 · Accepted Answer · 2017-04-10

3

Entering edit mode

8.4 years ago

achamess ▴ 90

Moving your answer here (with my modifications).

join <(zcat AC1-10_S34_R1_001.fastq.gz | nl ) <(zcat AC1-10_S34_R2_001.fastq.gz|nl) | awk -F ' ' '{if ($2 ~ /^@NB501800/ ) {print $4" "$5;} else {print $2$3;}}' | gzip > new.fastq.gz

Here is the output:

@NB501800:5:HKYVYAFXX:1:11111:19375:5952 2:N:0:TCCGGCTTAT+GGACTCATTG
GTGCGGATGATCTTACGCTTGTAGGCCAGCCTGGGTGGATATATATTGTGTTCCAAGCCAACTTGGTCTA
++
AAAAAEEEEEEEEEEEEEAAAAAE/EEE/<EEE/<E/EEEAEEEEEA/EE<EEEEE/EEEEEEA/AEEEE

ADD COMMENT • link 8.4 years ago by achamess ▴ 90

0

Entering edit mode

Perfect. Go ahead and accept this answer (green check mark) to provide closure to the thread.

ADD REPLY • link 8.4 years ago by GenoMax 153k