Question

different read names in paired-end data

3.7 years ago

debitboro ▴ 270

Dear all,

I have paired-end reads generated on an ABI SOLiD 4 system, stored in two FASTQ files, R1.fastq and R2.fastq. Looking at the two files, I found that the read names (headers) do not match, as shown below. This causes problems in the analysis (for example, when trimming the reads with cutadapt).

R1.fastq:

@SRR3159522.1 2_33_78 length=50
GGGATCAAAGGTGCCTAAGAAAGTTCTCACTAAGGGNATCTTCTACGCC
+SRR3159522.1 2_33_78 length=50
CCCDFFFFHHHHHJJJGJJJJJJJJIIGIIIIIJJJ#1?CGHDHHGIJI
@SRR3159522.2 2_36_51 length=50
CTGGTGCGAAAAGGTGAAATAAAAAAGAAGAACGAAGAAGCCGGTGCCA
+SRR3159522.2 2_36_51 length=50
BBCFDFFFHHHHHJGHHIJIJJJJJJIGIIJJIIIJJIGGIJJJHIHHH
@SRR3159522.3 2_36_551 length=50
CCACACCGGGTAAGCTGGTTTGGCGATGCGGGATGATCCGAACGTGGAG
...
...

R2.fastq:

@SRR3159522.27470956 2_33_78 length=35
TGTTTNNNNNNNNNNNNAAATGCCAGATCCACAA
+SRR3159522.27470956 2_33_78 length=35
BCBFF############23AGHHHIJJIHIJJJJ
@SRR3159522.27470957 2_36_51 length=35
GTATGCTCCGTNANAGTCTACCAGCACTGACCAG
+SRR3159522.27470957 2_36_51 length=35
BB@FFFFFHHH#2#3AEHIJJIIJJIJJJJJIJJ
@SRR3159522.27470958 2_36_551 length=35
GTCCTGNTNNNNNNNTGAACCAACACCTTTTGTG
...
...

As you can see, the headers of the paired reads do not match: the SRR index differs between the two files, although the second field (for example 2_33_78) is the same.
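Before renaming anything, it may be worth checking that both files really list the pairs in the same order. In the excerpts above, the second header field (2_33_78, 2_36_51, ...) agrees record by record. The following is only a minimal sketch, assuming four-line FASTQ records and that this second field identifies the pair; the helper names spot_id and first_mismatch are made up for illustration.

```python
from itertools import islice, zip_longest
from typing import Iterable, Optional


def spot_id(header: str) -> str:
    """Return the second whitespace-separated field of a FASTQ header."""
    fields = header.split()
    if len(fields) < 2:
        raise ValueError(f"header has no second field: {header!r}")
    return fields[1]


def first_mismatch(r1_lines: Iterable[str], r2_lines: Iterable[str]) -> Optional[int]:
    """Return the 1-based record number of the first pair whose second
    header fields differ (or where one file ends early), else None."""
    # Assumption: every record is exactly four lines, so headers sit on
    # lines 0, 4, 8, ... of each file.
    r1_headers = islice(r1_lines, 0, None, 4)
    r2_headers = islice(r2_lines, 0, None, 4)
    pairs = zip_longest(r1_headers, r2_headers)
    for number, (h1, h2) in enumerate(pairs, start=1):
        if h1 is None or h2 is None or spot_id(h1) != spot_id(h2):
            return number
    return None
```

With real files this could be called as first_mismatch(open("R1.fastq"), open("R2.fastq")); a result of None would suggest that renumbering R2 by record position is safe.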

When I used cutadapt to trim the reads, I got a name-matching error. I tried to replace the read names in R2.fastq with those from R1.fastq so that the headers agree, but I don't know how to do it. I want to transform R2.fastq as follows:

@SRR3159522.1 2_33_78 length=35
TGTTTNNNNNNNNNNNNAAATGCCAGATCCACAA
+SRR3159522.1 2_33_78 length=35
BCBFF############23AGHHHIJJIHIJJJJ
@SRR3159522.2 2_36_51 length=35
GTATGCTCCGTNANAGTCTACCAGCACTGACCAG
+SRR3159522.2 2_36_51 length=35
BB@FFFFFHHH#2#3AEHIJJIIJJIJJJJJIJJ
@SRR3159522.3 2_36_551 length=35
GTCCTGNTNNNNNNNTGAACCAACACCTTTTGTG
...
...

Can someone help me?

fastq file read name header ABI-Solid • 1.3k views

updated 3.7 years ago by rpolicastro 13k • written 3.7 years ago by debitboro ▴ 270

score 1 · Answer 1 · 2021-03-18

3.7 years ago

rpolicastro 13k

seqkit solution

seqkit replace -p '(^SRR[0-9]+\.)[0-9]+' -r '${1}{nr}' R2.fastq

@SRR3159522.1 2_33_78 length=35
TGTTTNNNNNNNNNNNNAAATGCCAGATCCACAA
+
BCBFF############23AGHHHIJJIHIJJJJ
@SRR3159522.2 2_36_51 length=35
GTATGCTCCGTNANAGTCTACCAGCACTGACCAG
+
BB@FFFFFHHH#2#3AEHIJJIIJJIJJJJJIJJ

I don't think anything after the + in a FASTQ file is strictly necessary, so you can probably remove it from the R1 file too. Someone please correct me if I'm wrong.
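For anyone without seqkit at hand, the same renaming can be sketched in plain Python. This is only a minimal sketch, assuming four-line FASTQ records (no wrapped sequences) and that R2.fastq lists the reads in the same order as R1.fastq; the function name renumber_fastq is made up for illustration. Like the seqkit output above, it writes a bare + line.

```python
import re
from typing import Iterable, Iterator

# Matches the SRA-style prefix and read index at the start of a header,
# e.g. "@SRR3159522.27470956" -> prefix "@SRR3159522.", index "27470956".
HEADER_RE = re.compile(r"^(@SRR[0-9]+\.)[0-9]+")


def renumber_fastq(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTQ lines with each read index replaced by the 1-based
    record number, keeping the rest of the header unchanged."""
    record = []
    count = 0
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) < 4:
            continue
        count += 1
        header, seq, _plus, qual = record
        new_header, hits = HEADER_RE.subn(
            lambda m: m.group(1) + str(count), header, count=1
        )
        if hits != 1:
            raise ValueError(f"unexpected header: {header!r}")
        yield new_header
        yield seq
        yield "+"  # bare separator, so it cannot disagree with the new name
        yield qual
        record = []
    if record:
        raise ValueError("truncated FASTQ: incomplete final record")


sample = [
    "@SRR3159522.27470956 2_33_78 length=35",
    "TGTTTNNNNNNNNNNNNAAATGCCAGATCCACAA",
    "+SRR3159522.27470956 2_33_78 length=35",
    "BCBFF############23AGHHHIJJIHIJJJJ",
]
renamed = list(renumber_fastq(sample))
print(renamed[0])  # @SRR3159522.1 2_33_78 length=35
```

On a real file, the generator can be fed an open file handle and its output written line by line to a new file, so the whole FASTQ never has to fit in memory.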


The text after the plus has to either match the read name exactly or be empty.
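That rule matters here because renaming only the @ line of a record would leave an old name behind on the + line. A minimal sketch of the check, assuming the header and separator are passed as plain strings without the trailing newline (the function name plus_line_ok is made up for illustration):

```python
def plus_line_ok(header: str, plus: str) -> bool:
    """Return True if the '+' line is bare or repeats the header text exactly."""
    if not header.startswith("@") or not plus.startswith("+"):
        return False
    # Either nothing follows the '+', or it matches everything after the '@'.
    return plus == "+" or plus[1:] == header[1:]


header = "@SRR3159522.1 2_33_78 length=35"
print(plus_line_ok(header, "+"))                                       # True
print(plus_line_ok(header, "+SRR3159522.27470956 2_33_78 length=35"))  # False
```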

3.7 years ago by swbarnes2 14k

Thank you, rpolicastro, it works fine.

3.7 years ago by debitboro ▴ 270