5.9 years ago
jamesdong
I have a file containing multiple DNA sequences in FASTA format, like this:
>XM_123456
ACTGTATGC
>XM_298778
ATACACA
...
I want to extract every subsequence within a given length range from each sequence. For example, for lengths of 5 to 9 nucleotides, the output file (FASTA format) would look like this:
>XM_123456(1-5)
ACTGT
>XM_123456(1-6)
ACTGTA
...
>XM_123456(1-9)
ACTGTATGC
>XM_123456(2-6)
CTGTA
...
>XM_123456(2-9)
CTGTATGC
>XM_123456(3-7)
TGTAT
...
>XM_123456(3-9)
TGTATGC
>XM_123456(4-8)
GTATG
>XM_123456(4-9)
GTATGC
....
>XM_298778(1-5)
ATACA
>XM_298778(1-6)
ATACAC
>XM_298778(1-7)
ATACACA
>XM_298778(2-6)
TACAC
>XM_298778(2-7)
TACACA
Can anyone help me? Thanks in advance.
What input files do you have? You mentioned you have a fasta file, but do you also have a file which maps the names to the lengths you need? What does this file look like?
Reasonably certain that seqkit should be able to do this ( https://github.com/shenwei356/seqkit ). You can take a look at the manual.
@jrj.healey, thank you for the reply. I don't have a file that maps the names to the lengths. I only have the FASTA file and the required range of lengths (5 to 9 nucleotides). These are my starting point, and I want to get the results from them.
Are these lengths in the correct corresponding order to the sequences in the fasta?
Basically what I'm getting at is there is no way to match up what length you require with the sequence you need it from based on your question at the moment.
Thanks for your good question about the mapping. If you use seqkit sliding -s 1 -W 5, it will also partially achieve that goal (it produces only the windows of length 5).
@genomax, many thanks, I will try it. I piped the file through seqkit sliding -s 1 -W 5 and it works very well, though I have to set the length each time. Thanks, buddy.
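Since the seqkit sliding approach needs one run per window length, here is a minimal Python sketch that enumerates all lengths in one pass. It is only an illustration under assumptions: the function names read_fasta and fragments are made up for this example (they are not part of seqkit), the default range of 5 to 9 comes from the question, and positions in the headers are 1-based and inclusive as in the desired output.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines (hypothetical helper)."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the ID, i.e. the header text up to the first whitespace.
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def fragments(name: str, seq: str, min_len: int = 5,
              max_len: int = 9) -> Iterator[Tuple[str, str]]:
    """Yield (header, subsequence) for every window of min_len..max_len bases."""
    for start in range(len(seq) - min_len + 1):
        # The longest window that still fits at this start position.
        longest = min(max_len, len(seq) - start)
        for length in range(min_len, longest + 1):
            end = start + length
            # Report 1-based, inclusive coordinates in the header.
            yield f"{name}({start + 1}-{end})", seq[start:end]


if __name__ == "__main__":
    # Demo input taken from the question; replace with open("input.fasta")
    # (a placeholder file name) to process a real file.
    demo = [">XM_123456", "ACTGTATGC", ">XM_298778", "ATACACA"]
    for name, seq in read_fasta(demo):
        for header, frag in fragments(name, seq):
            print(f">{header}")
            print(frag)
```

Sequences shorter than the minimum length produce no output. For very long sequences the number of fragments grows quickly, so writing the records out as they are generated, as the sketch does, avoids holding them all in memory.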
@jamesdong: Please use ADD REPLY/ADD COMMENT when responding to existing posts to keep threads logically organized. Using the Chrome browser helps if you are posting from China and can't access these buttons in a different browser.
Yes, you are right, thank you for your help!
Please avoid terms such as "buddy". This is a professional forum where a certain etiquette needs to be followed.
Thank you for reminding me of this.
This looks like homework, or an XY problem. You should at least tell us what you have tried, why you want to do this, and what the original purpose is.
Thank you for the reply. It is a simple problem that comes from an original idea: there are many metabolic fragments of DNA or proteins in the body, and I want to generate the predicted DNA or protein fragments before I evaluate their properties.