Question

Simple FASTQ/A manipulation... how to add a single adapter sequence to 5' of all reads?

0

6.4 years ago

quickquark • 0

Hi everyone, and thanks in advance! I'm used to doing lots of trimming, substituting, etc on large FASTQ/A files, but now I need to add sequence arbitrarily at the beginning of all reads and I'm coming up short! Been searching a couple hours for a method via toolkit (fastx_toolkit, BBmap, etc.) or simple command (sed, awk, etc.).

So I'm looking to go from something like this:

>header
GTCTCAGATCGGAAGAGCACACGT
>header
CCGGTCCTGGTTGCAGATCGGAAG
>header
GTATCTCCTAAGATATAACAGGTTG
>header
AGGTACAGGTTGGATGATAAGTCC

to this:

>header
AAAAAAGTCTCAGATCGGAAGAGCACACGT
>header
AAAAAACCGGTCCTGGTTGCAGATCGGAAG
>header
AAAAAAGTATCTCCTAAGATATAACAGGTTG
>header
AAAAAAAGGTACAGGTTGGATGATAAGTCC

Alternatively, I can do the same with FASTQ files (also extending the quality lines to match), if there's already a tool out there for that. I'm not interested in quality at this point, as I've already merged paired-end reads with PandaSeq and filtered out anything but the highest-quality reads.

FASTA FASTQ • 2.8k views

ADD COMMENT • link 6.4 years ago by quickquark • 0

0

While you have been given possible solutions below, you would be breaking FASTQ format if you do not add corresponding scores on the quality line. The example you showed above is neither valid FASTA nor valid FASTQ format.

ADD REPLY • link 6.4 years ago by GenoMax 152k

0

Ah yes, sorry, I should have been more accurate with that in case others come across this. I'll edit it to look like a real FASTA.

ADD REPLY • link 6.4 years ago by quickquark • 0

1

quickquark : Please test @Pierre's solution. It should work and if it does you should accept that too. You can accept more than one answer if they work.

ADD REPLY • link 6.4 years ago by GenoMax 152k

score 3 · Accepted Answer · 2019-03-06

6.4 years ago

mbelmadani ★ 1.4k

sed will do that:

$ sed 's|^\([^@>].*\)|AAAAAA\1|' fastq.fq
@header
AAAAAAGTCTCAGATCGGAAGAGCACACGT
@header
AAAAAACCGGTCCTGGTTGCAGATCGGAAG
@header
AAAAAAGTATCTCCTAAGATATAACAGGTTG
@header
AAAAAAAGGTACAGGTTGGATGATAAGTCC

$ sed 's|^\([^@>].*\)|AAAAAA\1|' fasta.fa
>header
AAAAAAGTCTCAGATCGGAAGAGCACACGT
>header
AAAAAACCGGTCCTGGTTGCAGATCGGAAG
>header
AAAAAAGTATCTCCTAAGATATAACAGGTTG
>header
AAAAAAAGGTACAGGTTGGATGATAAGTCC

The first part between separators (^\([^@>].*\)) matches any line that does not start with @ or >, and captures the whole line in a group (the escaped parentheses). The first character has to sit inside the group: if the character class is left outside it, the first base of every read is consumed by the match and dropped from the output. The second part is the replacement, which means replace with AAAAAA followed by group 1, i.e. the captured line.

Update: Added > to the non-matching character class so it works for FASTA files as well. See also the comment below about FASTQ and multi-line FASTA files.
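For FASTA files whose sequences wrap over several lines, one option is to add the adapter only to the first sequence line after each header. The following is a minimal awk sketch, not a tested pipeline: it assumes headers start with >, and the file names wrapped.fa and wrapped_adapter.fa are made up for illustration.

```shell
# Build a small wrapped FASTA for illustration (hypothetical file name).
printf '>read1\nGTCTCAGATCGG\nAAGAGCACACGT\n>read2\nCCGGTCCTGGTT\n' > wrapped.fa

# Prepend the adapter only to the first sequence line after each header.
awk -v adapter="AAAAAA" '
    /^>/  { print; first = 1; next }        # header: print it, flag the next line
    first { $0 = adapter $0; first = 0 }    # first sequence line: add the adapter
          { print }                         # print every sequence line
' wrapped.fa > wrapped_adapter.fa

cat wrapped_adapter.fa
```

Note that the first line of each record ends up six characters longer than the others, so the original wrap width is not preserved; most tools accept that, but re-wrapping would need an extra step.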

ADD COMMENT • link 6.4 years ago by mbelmadani ★ 1.4k

1

manuel.belmadani : You should update your solution to reflect the change OP made to the original question when you have a chance.

ADD REPLY • link 6.4 years ago by GenoMax 152k

0

I added the > in the character class. Just be careful that your FASTA files don't have reads over multiple lines, or it'll break (it adds AAAAAA at the beginning of every non-header line, even if multiple lines are part of the same contiguous read). That use case is a bit more complicated than the input provided in the original question. Same thing if you have a complete FASTQ file (i.e. with the quality scores); then you'd have to avoid editing the quality header (the + line) and the quality line. Something like what Pierre suggested would work to edit only the 2nd line of every 4-line record: sed '2~4 s/^/AAAAAA/' fastq.fq (the quality line would then still need six matching scores added to stay valid, as GenoMax points out above).
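For a complete four-line FASTQ record, the quality line has to grow by the same number of characters as the sequence line. The following is a minimal awk sketch under stated assumptions: strictly four lines per record (no wrapped sequences), and an arbitrary placeholder quality character I (Phred 40 in Phred+33 encoding) for the added bases. The file names reads.fq and reads_adapter.fq and the variable names are made up for illustration.

```shell
# Build a one-record FASTQ for illustration (hypothetical file name).
printf '@read1\nGTCTCAGATCGG\n+\nFFFFFFFFFFFF\n' > reads.fq

# Line 2 of each record is the sequence, line 4 is the quality string.
awk -v adapter="AAAAAA" -v qual="I" '
    BEGIN {
        # Build a quality prefix exactly as long as the adapter.
        for (i = 0; i < length(adapter); i++) pad = pad qual
    }
    NR % 4 == 2 { $0 = adapter $0 }   # sequence line: prepend the adapter
    NR % 4 == 0 { $0 = pad $0 }       # quality line: prepend matching scores
    { print }
' reads.fq > reads_adapter.fq

cat reads_adapter.fq
```

Addressing lines by position (NR % 4) rather than by their first character matters here, because a quality string can itself start with @ or +.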

ADD REPLY • link 6.4 years ago by mbelmadani ★ 1.4k

score 2 · Accepted Answer · 2019-03-06

6.4 years ago

Pierre Lindenbaum 166k

sed '2~2 s/^/AAAAAA/' input.txt
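The 2~2 address (first~step) is a GNU sed extension that selects line 2 and then every 2nd line after it, so it assumes exactly one sequence line per header. It may not be available in other sed implementations; an awk equivalent might look like the sketch below, where input.txt is a small made-up example and output.txt is a hypothetical output name.

```shell
# Build a two-record, single-line FASTA for illustration.
printf '>header\nGTCTCAGATCGGAAGAGCACACGT\n>header\nCCGGTCCTGGTTGCAGATCGGAAG\n' > input.txt

# Prepend the adapter to every even-numbered line (the sequence lines).
awk 'NR % 2 == 0 { $0 = "AAAAAA" $0 } { print }' input.txt > output.txt

cat output.txt
```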

ADD COMMENT • link 6.4 years ago by Pierre Lindenbaum 166k