Question

How to add the sample name to the end of read headers

0

6.2 years ago

juan.galarza • 0

I need to add the sample name to the end of every read header in each FASTA sample. For example, I have:

#Sample1
#>read1
#ATGC
#Sample2
#>read1
#ATGC

Desired output:

#Sample1
#>read1/Sample1
#ATGC
#Sample2
#>read1/Sample2
#ATGC

I can do it one file at a time using sed:

sed 's/read1/read1\/Sample1/g' Sample1.fasta > Sample1_tagged.fasta

However I have hundreds of fasta samples. Any tips on how to do it all at once will be highly appreciated.

fasta relablel header • 2.2k views

ADD COMMENT • link updated 17 months ago by GenoMax 147k • written 6.2 years ago by juan.galarza • 0

1

juan.galarza, are Sample1 and Sample2 file names? Without sufficient information this becomes an XY problem, and the solutions posted here will be of no use. If they are separate files:

$ awk -v OFS="\n" '/^>/ {getline seq} {print $0"/"FILENAME,seq}' Sample*

or (with GNU sed, whose F command prints the current file name):

$ sed -e ' />/ F' Sample* | paste  - - - | awk '{print $2"/"$1"\n"$3}'

>read1/Sample1
ATGC
>read1/Sample2
ATGC

Input files (Sample1 and Sample2):

$ tail -n+1 Sample*
==> Sample1 <==
>read1
ATGC

==> Sample2 <==
>read1
ATGC
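
Both commands above assume each sequence sits on a single line directly after its header. As a minimal sketch for FASTA files whose sequences wrap over several lines, the header lines can be rewritten and every other line passed through untouched. The scratch directory and the sample contents below are made up for illustration and are not from the original posts.

```shell
# Work in a scratch directory so no real data is touched (illustrative setup).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two small inputs; Sample1 has a sequence wrapped over two lines.
printf '>read1\nATGC\nGGCC\n>read2\nTTAA\n' > Sample1
printf '>read1\nATGC\n' > Sample2

# Append "/<file name>" to header lines only; the bare pattern 1 prints every line.
tagged=$(awk '/^>/ {$0 = $0 "/" FILENAME} 1' Sample*)
printf '%s\n' "$tagged"
```

Because only lines starting with ">" are modified, the number of sequence lines per record does not matter.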

ADD REPLY • link 6.2 years ago by cpad0112 21k

0

sed '/^>/s/$/\/SAMPLE/' in.fa > out.fa

ADD REPLY • link 6.2 years ago by Pierre Lindenbaum 164k

0

This would append the string "SAMPLE" to each FASTA header, which is different from the output the OP intended. The OP wants to append the sample name (Sample1, Sample2, Sample3, etc.) to each sequence header. From the OP's post, it seems the OP has several files named Sample1, Sample2, etc., and each file contains FASTA sequences.

ADD REPLY • link 6.2 years ago by cpad0112 21k

0

Thank you for your answers, cpad0112 and Pierre Lindenbaum. Indeed, I have several files named Sample1.fa, Sample2.fa, etc., each with sequences in FASTA format. I would like to append the file name to the sequence IDs within those files. For example, the sequence IDs from file Sample1.fa would be

>read1/Sample1
>read2/Sample1

and IDs from file Sample2.fa would be

>read1/Sample2
>read2/Sample2

The awk solution does this; however, it writes everything to a single output. Ideally I would like the relabelled sequences written to a corresponding file per sample, i.e. all sequences from Sample1.fa written to Sample1_relabel.fa, sequences from Sample2.fa written to Sample2_relabel.fa, and so on.

ADD REPLY • link updated 6.2 years ago by GenoMax 147k • written 6.2 years ago by juan.galarza • 0

0

juan.galarza :

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

Thank you!

ADD REPLY • link 6.2 years ago by GenoMax 147k

0

I know this thread is old, but I have been trying to find a one-liner to append a single word to the end of all FASTA headers in a file, and for some reason none of the ones I've found actually work. For example, if I have a file with the format

>header1
ACTG
>header2
CGTA

The sed line above (sed '/^>/s/$/\/SAMPLE/' in.fa > out.fa) outputs:

/SAMPLEheader1
ACTG
/SAMPLEheader2
CGTA

...which is kind of useless. Something must be amiss, but I have no earthly idea what it is.

ADD REPLY • link 17 months ago by clmattson • 0

0

$ more te.fa
>header1
ACTG
>header2
CGTA

If you simply want to append SAMPLE at the end:

$ sed '/^>/s/$/SAMPLE/' te.fa
>header1SAMPLE
ACTG
>header2SAMPLE
CGTA

You may want to separate the word with an underscore, in which case:

$ sed '/^>/s/$/_SAMPLE/' < te.fa > ot.fa

$ more ot.fa 
>header1_SAMPLE
ACTG
>header2_SAMPLE
CGTA

These examples use sed on a unix/linux system. What OS are you using?
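
The thread does not establish why the label showed up at the start of the header. One possible cause, stated here as an assumption rather than a diagnosis, is a file saved with Windows (CRLF) line endings: the text is appended after the carriage return, so a terminal draws it over the beginning of the line. A minimal sketch for checking and working around that, using a made-up te.fa in a scratch directory:

```shell
# Work in a scratch directory (illustrative setup, not from the thread).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical input saved with Windows (CRLF) line endings.
printf '>header1\r\nACTG\r\n>header2\r\nCGTA\r\n' > te.fa

# Count lines containing a carriage return; a non-zero count means CRLF is present.
crlf_lines=$(grep -c "$(printf '\r')" te.fa)
echo "lines with CR: $crlf_lines"

# Strip carriage returns first, then append the label to header lines.
fixed=$(tr -d '\r' < te.fa | sed '/^>/ s/$/_SAMPLE/')
printf '%s\n' "$fixed"
```

If the count is zero, line endings are not the problem and the cause lies elsewhere.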

ADD REPLY • link 17 months ago by GenoMax 147k

score 2 · Answer 1 · 2018-09-01

2

6.2 years ago

cpad0112 21k

Try this, juan.galarza:

$ for i in *.fa; do awk -v OFS="\n" '/^>/ {getline seq} {print $0"/"FILENAME,seq}' "$i" > "${i%%.*}_relabel.fa"; done

Note: As a precaution, back up your files and test the script on a few samples first.

If you have GNU parallel on your machine, you can try:

$ parallel  "awk -v OFS=\"\n\" '/^>/ {getline seq} {print \$0\"/\"FILENAME,seq}' {} > {.}_relabel.fa" ::: *.fa

You can also dry-run the command:

$ parallel  --dry-run "awk -v OFS=\"\n\" '/^>/ {getline seq} {print \$0\"/\"FILENAME,seq}' {} > {.}_relabel.fa" ::: *.fa
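
Two details of the commands above are worth checking against your data: FILENAME includes the extension, so headers come out as >read1/Sample1.fa rather than >read1/Sample1, and the getline approach assumes one sequence line per header. As a minimal sketch of a variant that strips the .fa extension and leaves wrapped sequences intact, the loop below passes the sample name in through a hypothetical awk variable called name. The scratch directory and sample contents are invented for illustration.

```shell
# Work in a scratch directory with made-up inputs (illustrative setup).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>read1\nATGC\nGGCC\n>read2\nTTAA\n' > Sample1.fa
printf '>read1\nATGC\n' > Sample2.fa

for f in *.fa; do
  # Skip outputs of an earlier run so they are not relabelled again.
  case "$f" in *_relabel.fa) continue ;; esac
  name="${f%.fa}"   # Sample1.fa -> Sample1
  # Append "/<sample name>" to header lines only; 1 prints every line.
  awk -v name="$name" '/^>/ {$0 = $0 "/" name} 1' "$f" > "${name}_relabel.fa"
done

cat Sample1_relabel.fa
```

Note that awk interprets backslash escapes in values passed with -v, so this sketch assumes file names without backslashes.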

ADD COMMENT • link 6.2 years ago by cpad0112 21k

0

Thank you! The for loop did the trick. I didn't try the parallel options since I don't have GNU parallel on my machine.

ADD REPLY • link 6.2 years ago by juan.galarza • 0

0

Can you elaborate on why you do not have it? Perhaps your reason is covered at https://oletange.wordpress.com/2018/03/28/excuses-for-not-installing-gnu-parallel/

ADD REPLY • link 6.2 years ago by ole.tange ★ 4.5k