Hi,
I would like to know how to remove the newline from a certain part of my file, but not all of it.
I am piping the output of my program into sed in order to convert the file to a specific format. The input file looks like this:
>sctg_0002_0001 length=2745
TCCCCCTCCCGTACCGGTTTGCGCTATTATACCGGCCTTGAATCGAGCAAAGGCTCCAAACAATTTCATTACAAACAGATTGGGGATGTATGACGTGGCT
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
TTGACACGCTTGTTTCTGATGTCATCACCCATGAAGAGCTGTTATTTGGCCACCTGGCGTTCCTGCCTAAGCGTTGAGTGAATATTAAACACCTCTGCCC
>sctg_0003_0001 length=2175
CAACAACCACTCTTAGCGCTGCTTGCCGCTGCCGATACCGAACGGGATGCGGTAGTCGCTGCTCTGCTCACCCAGACTCACGGTCAGGTTGCCCTGAGTA
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
...
This is what I need to do:
convert the ">" symbol into the string "SEQUENCE_ID="
remove everything after the double spaces in the header, e.g. " length=2745"
add part of the next header to the current one - this should look like this: sctg_0002_0001e_0003_0001b
- 0002_0001 is part of the first header
- 0003_0001 is part of the second header.
delete the newlines within the sequence itself, so that each FASTA sequence ends up on one single line.
add the string "SEQUENCE_TEMPLATE=" to the beginning of the sequence line.
add the symbol "=" on a line of its own after each sequence line
This is what I have done so far:
perl convert_FastA.pl ScaffoldContigs.fasta | sed -e '/^>/ s/>/SEQUENCE_ID=/' | sed -e ':a;N;$!ba;/^SEQUENCE_ID=/ ! s/\n//'
The output of the first part (the Perl script) looks like the sample above. The first sed command replaces the ">" with the required pattern.
In the end, it should look like this:
SEQUENCE_ID=sctg_0002_0001e_0003_0001b
SEQUENCE_TEMPLATE=CCCCCTCCCGTACCGGTTTGCGCTA...
=
SEQUENCE_ID=sctg_0003_0001e_0001_0001b
SEQUENCE_TEMPLATE=CAACAACCACTCTTAGCGCTGCTTG...
=
I tried to delete the newlines with sed, but it didn't work the way I imagined. It deletes either all of them or none. Besides, I couldn't find any way to "save" the next header line in order to put part of it into the header of the sequence before it.
I would appreciate any help I can get.
Thanks, Assa
I don't completely understand the part where you add the next header to the current one. So you just want to append the next header to the current header, separated by the letter 'e', and then append the letter 'b' to the end?
It might be better to just write a script for something complicated like this.
The title and tags associated with your post do not describe the problem at all. For future reference, better tags and titles help to get the right people looking at your question :)
Your specs say to add part of the next header to the current one. What do you do when there is no next header, i.e., you're processing header n?
The last one takes the second part of the first header, but I can probably do that manually if it can't be done by the script.
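Since sed cannot easily look ahead to the next header, one possible approach is to replace both sed stages with a single awk pass. What follows is only a sketch under a few assumptions: the ID is the first whitespace-delimited word of each header, the part borrowed from the next header is everything from its first underscore onward (e.g. "_0003_0001"), and the last record wraps around to the first header as described above. The function name convert_fasta is made up for illustration.

```shell
# Hypothetical helper: reads FASTA on stdin, writes the target format on stdout.
convert_fasta() {
  awk '
    # Print the buffered record; nxt is the ID that supplies the "e...b" suffix.
    function emit(nxt,   suffix) {
      suffix = nxt
      sub(/^[^_]*/, "", suffix)        # "sctg_0003_0001" becomes "_0003_0001"
      printf "SEQUENCE_ID=%se%sb\nSEQUENCE_TEMPLATE=%s\n=\n", id, suffix, seq
    }
    /^>/ {
      cur = substr($1, 2)              # drop ">" and everything after the first blank
      if (id != "") emit(cur)          # the previous record can now be completed
      else first = cur                 # remember the first ID for the wrap-around
      id = cur
      seq = ""
      next
    }
    id != "" { seq = seq $0 }          # join the sequence lines of the current record
    END { if (id != "") emit(first) }  # last record borrows from the first header
  '
}
```

It would slot into the existing pipeline in place of the two sed commands, for example: perl convert_FastA.pl ScaffoldContigs.fasta | convert_fasta. Each record is only printed once the following header has been read, so just one sequence is held in memory at a time.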