Hi,
I have very little experience with scripts. I want to change my FASTA sequence headers (I have hundreds of FASTA sequences per file) from very long headers to headers with the sample name (CM1) followed by ascending numbers. Basically, I want to go from this:
>M00505:63:000000000AWRE0:1:1101:11224:1105_1:N:0:NCTACGCT+CTAAGCCT/M00505:63:000000000AWRE0:1:1101:11224:1105_2:N:0:NCTACGCT+CTAAGCCT;size=290797;
GGGTTAGTAGGTTGGTCATGCCTCTGGTATGTACTGGTCTCACTGATTCCTAACCCCTGATGAACCGTAATGCCATTAATTTGGTGTTGCGGGGAATTTGGACTGAAGGGGGAAAAAATTAGAGTGTTTAAAGCAAGCTA
>M00505:63:000000000AWRE0:1:1107:23836:6960_1:N:0:GCTACGCT+CTAAGCCT/M00505:63:000000000AWRE0:1:1107:23836:6960_2:N:0:GCTACGCT+CTAAGCCT;size=2;
GGGTTAGTAGGTTGGTCAGCCCTCTGGTATGTACTGGTCTCACTGATTCCTCCTTTTCCATGAACCGTAATGCCATTAATTTTGAATTGCGGGAAATTTGAACTGTTACTTTGAAAAAATTAGAGTGTTTAAAGCAAGCT
>M00505:63:000000000AWRE0:1:1103:16981:11028_1:N:0:GATACGCT+CTAAGCCT/M00505:63:000000000AWRE0:1:1103:16981:11028_2:N:0:GATACGCT+CTAAGCCT;size=1;
GGGTTAGTAGGTTGGTCATGCCTCTGGTATGTACTGRTCTCACTGATTCCTCCTTCCTGACGAACTGTAATGCCATTAATTTGGTGTTGCAGGRAATTTGGACTGTTACTTTGAAAAAATTAGAGTGTTTAAAGCAAGCT
To this:
>CM1_1
GGGTTAGTAGGTTGGTCATGCCTCTGGTATGTACTGGTCTCACTGATTCCTAACCCCTGATGAACCGTAATGCCATTAATTTGGTGTTGCGGGGAATTTGGACTGAAGGGGGAAAAAATTAGAGTGTTTAAAGCAAGCTA
>CM1_2
GGGTTAGTAGGTTGGTCAGCCCTCTGGTATGTACTGGTCTCACTGATTCCTCCTTTTCCATGAACCGTAATGCCATTAATTTTGAATTGCGGGAAATTTGAACTGTTACTTTGAAAAAATTAGAGTGTTTAAAGCAAGCT
>CM1_3
GGGTTAGTAGGTTGGTCATGCCTCTGGTATGTACTGRTCTCACTGATTCCTCCTTCCTGACGAACTGTAATGCCATTAATTTGGTGTTGCAGGRAATTTGGACTGTTACTTTGAAAAAATTAGAGTGTTTAAAGCAAGCT
I am able to do this with two separate scripts using:
sed 's/>.*/&CM1/' file.fa > output.fa
cat output.fa | perl -ane 'if(/\>/){$a++;print ">$a\n"}else{print;}' > output2.fa
But I would like to do it all in one step. Any ideas?
THANK YOU!!
Molly
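One possible way to do the renaming in a single step is with awk. This is only a sketch under a few assumptions: headers are the lines that start with ">", the sample name is passed in through a variable (called `name` here, a name chosen just for this example), and the file names `file.fa` and `output.fa` are placeholders. The first line only creates a tiny made-up test file so the example runs on its own.

```shell
# Hypothetical miniature input; replace file.fa with your real file.
printf '>long_header_one;size=3;\nACGT\nGGCC\n>long_header_two;size=1;\nTTAA\n' > file.fa

# For every header line (starts with ">"), print ">" + sample name + "_" + a running counter.
# Every other line (the sequence) is printed unchanged, whatever its length.
awk -v name="CM1" '/^>/ { print ">" name "_" (++n); next } { print }' file.fa > output.fa
```

Because only lines beginning with ">" are rewritten, sequence lines should pass through untouched, whether the FASTA is linearized or wrapped over several lines.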
I'm a bit new to scripting, so can you explain what the for loop after the initialization does? I understood that (1..100) is perhaps the total number of lines, but what does the loop do after that? If you could explain it, that would be really good.
Check the updated answer.
This was a cool approach! When I tried your suggestion, it got the header name correct (finally!) but it cut huge portions of my sequences. How do I get it to not do that? (I first linearized all the FASTA files and then ran your scripts, but I still get this issue of cut sequences.) Now it looks like:
But my sequences are supposed to be a lot longer than that. Thanks so much for your input! Any suggestion for this problem?