In the header, the pieces of information in parentheses follow one another and are each separated by a single space (just as written above).
All I want to do is retain "ABCD" (the very first piece of information) in the header of every sequence. I want to loop through all the headers in the file and return something like this:
I don't know why the sequence shows up next to the header when I post this here!
Of course, it is a fasta file, so the sequences are directly below the headers.
Thanks for all your replies! However, I was looking to retain the actual sequences along with the shortened headers. In fact, I was able to shorten the header to its first word myself before posting here. The problem was that I could not figure out a way to shorten the header and keep the sequences intact. So I was wondering if there is any way to retain the actual sequences along with the shortened headers.
In short, I want to keep the entire sequence intact for every entry and just shorten the headers.
So you don't have a standard fasta format sequence file where first line is an identifier >some_id and the sequence follows on the second line? If you had said this yesterday then I would have reset the formatting. My apologies.
Your sequence is present on the same line as the identifier and you want to keep it on that line after shortening the header. Is that correct?
If the length of the extra stuff is always the same in all sequences, then see if this works: cut -d ' ' -f1,7 your.fa > new.fa
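To make it clearer what that cut command does to each line, here is a rough Python emulation. It is only a sketch: the function name cut_fields and the example records are made up for illustration, and it assumes single spaces as delimiters. The point to notice is that cut passes through any line with no delimiter unchanged, so sequence lines without spaces are left intact, and a requested field that does not exist is simply skipped.

```python
def cut_fields(line, fields=(1, 7), delim=" "):
    """Emulate `cut -d ' ' -f1,7` for one line (without its trailing newline)."""
    if delim not in line:
        # cut prints lines that contain no delimiter unchanged
        return line
    parts = line.split(delim)
    # Keep the requested 1-based fields that exist, rejoined with the delimiter
    return delim.join(parts[i - 1] for i in fields if i <= len(parts))


# Hypothetical records, for illustration only
print(cut_fields(">ABCD (a) (b) (c) (d) (e)"))       # header with six fields
print(cut_fields("ACGTACGTTTGA"))                     # sequence line, no spaces
print(cut_fields(">ABCD (a) (b) (c) (d) (e) ACGT"))  # sequence on the header line
```

So on a regular fasta file with the sequence on its own line, the header is cut down to its first word (as long as it has fewer than seven fields) and the sequence lines should come through untouched.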
I am leaving this post here since child posts will disappear if I delete this. Content is no longer applicable.
Sorry again, genomax2, the formatting got disrupted again in my post today. The file is exactly as it was posted yesterday (top of my post): the sequence IS on the second line (beneath the header), as it should be in a regular fasta file
and so on for all the headers and sequences... for the rest of the whole fasta file.
I am hoping the post will come up correctly formatted this time. Otherwise, please know that my file looks just like a regular fasta file (as correctly formatted by genomax2 yesterday). Hope this helps!
For future reference, it is safer to use the "code" formatting tool (101010 button) when formatting things like code/file formats. I have done this for you (and also reset the format of the original question).
So you do have a normal fasta file (since there is no formal spec for fasta this would do).
Edit: @Alex's solution as posted above did not work with the example data posted today (which is not the same as the original post).
Thanks genomax2! In fact, the answer you posted yesterday (the cut command) is doing the job fine. I must have made a mistake somewhere, which gave me a different result last night. I also appreciate you pointing out the formatting tool button. I am new to the forum and will take care of these things before posting next time.
My issue is solved, and thanks to all who took the time to contribute answers. I learnt several new ways of handling such scripting situations in the future. Thanks everyone!
My awk statement should shorten headers and preserve sequences in FASTA files. I'm unclear what the issue is on your side, but if you want to post a snippet of your file and what results you're getting, I'd be happy to try to help.
import os

# Input FASTA file to shorten (edit this path)
input_path = "/home/ankit/RWork/GISAID/hCoV-19_spikenuc0810/spikenuc0810_seqkit_output.fasta"
# Output FASTA file; the default name is "spikenuc0810_Final.fasta" but you can replace it
output_path = "/home/ankit/RWork/GISAID/hCoV-19_spikenuc0810/spikenuc0810_Final.fasta"
# Number of leading header characters to keep, e.g. here the first 47 characters of each header are written
header_index_upto = 47

if not os.path.exists(output_path):
    print("[Creating file...]\n[Please wait...]\n")
    with open(input_path) as infile, open(output_path, "w") as outfile:
        first_record = True
        for line in infile:
            line = line.strip()
            if line.startswith(">"):
                # End the previous record's sequence line before starting a new header
                if not first_record:
                    outfile.write("\n")
                outfile.write(line[:header_index_upto] + "\n")
                first_record = False
            else:
                # Join wrapped sequence lines onto a single line
                outfile.write(line)
        # Terminate the last sequence line
        if not first_record:
            outfile.write("\n")
    print("[Done...]")
else:
    print("File already exists...\nTerminating the process")
Copy and paste this code into your text editor and save the file with a .py extension.
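The script above trims headers to a fixed number of characters. Since the original question asks to keep only the first word of each header (e.g. ABCD), a variant that splits on whitespace might fit better. This is just a minimal sketch: the function name shorten_headers and the example records are hypothetical, and it assumes the ID comes directly after ">" with no space in between.

```python
import io


def shorten_headers(infile, outfile):
    """Copy FASTA records, keeping only the first word of each header line."""
    for line in infile:
        if line.startswith(">"):
            # First whitespace-delimited token, e.g. ">ABCD (x) (y)" -> ">ABCD"
            outfile.write(line.split()[0] + "\n")
        else:
            # Sequence lines are copied through unchanged
            outfile.write(line)


# Hypothetical records, for illustration only
example = io.StringIO(">ABCD (info1) (info2)\nACGTACGT\nTTGA\n>EFGH (info3)\nGGCC\n")
result = io.StringIO()
shorten_headers(example, result)
print(result.getvalue(), end="")
```

For real files the same function should work with open file handles, for example open("your.fa") for reading and open("new.fa", "w") for writing. Unlike the fixed-length version, this one leaves the line wrapping of the sequences as it is in the input.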
With reformat.sh from the BBMap suite: reformat.sh in=your.fa out=new.fa trd=t
I have reformatted your post to show the correct format of fasta files.
Okay. Thanks for that...