I have a file with many sequences in FASTA format. All of these sequences are about 7.2-7.5 kb long. I want to retain the first 1000 nt and the last 1200 nt of each sequence and delete all the nucleotides in between. I would appreciate it if anybody could guide me on how this is done.
Why did you tag R? Others have given some viable options, but you didn't post anything that you had tried using R (e.g. a code snippet), or say that you need to use R. This would be easy with Biopython.
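As a rough illustration of the trimming asked for in the question, here is a minimal sketch that uses only the Python standard library rather than Biopython. The function names `read_fasta` and `keep_ends` and the toy record are made up for this example; the defaults of 1000 and 1200 come from the question. Sequences shorter than the two ends combined are returned unchanged, which is an assumption about the desired behaviour.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        yield header, "".join(chunks)


def keep_ends(seq, head=1000, tail=1200):
    """Keep the first `head` and last `tail` characters and drop the middle."""
    if len(seq) <= head + tail:
        return seq  # assumption: nothing to cut, leave the sequence as it is
    return seq[:head] + seq[len(seq) - tail:]


# Toy record (hypothetical data), trimmed to 2 + 2 bases for readability.
toy = [">seq1", "AACCGGTT", "ACGT"]
for header, seq in read_fasta(toy):
    print(f">{header}")
    print(keep_ends(seq, head=2, tail=2))  # prints AAGT
```

For a real file, the list `toy` could be replaced by an open file handle, and the `head` and `tail` arguments left at their defaults.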
Example code to retain the first two and last two bases of FASTA sequences in a single file:
my fasta file:
code:
output:
Download seqkit from here
Another solution:
Dear Vinayjrao and Pierre,
Thank you so much guys for sharing the code lines. I was able to make the desired files with the help of your code lines.
Best!
s_bio
If those answers have helped you then consider "upvoting" and "accepting" (use green check mark) to provide closure for this thread.
Thanks for the useful answer. I have an additional problem, since some sequences are missing from my files. Starting from:
Is it possible to get the following using seqkit concat?
I am using: $ seqkit concat File*.fas -o File3.fas, and I get a File3 whose sequences have different lengths depending on the missing data. (Of course, I have more than two FASTA files, and there are missing sequences in all of them.)
Thanks!!!
What is the source of the Ns in seq2? Padding? @OP
I forgot to specify that in my study each file corresponds to a different PCR-amplified genetic marker, and that seq1-3 correspond to different species. The Ns in seq2 stand for the marker from File 2 that is missing for that species (e.g. we couldn't amplify it, or it is missing from the reference database).
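To make the situation described above concrete, here is a minimal Python sketch of concatenating per-marker data while filling a missing marker with Ns. It is not a seqkit feature and not code from this thread: the function name `concat_markers`, the species IDs and the toy sequences are all hypothetical, and it assumes each marker has already been read into a dictionary of ID to sequence. The fill width is taken as the longest sequence present for that marker.

```python
def concat_markers(markers, fill="N"):
    """Concatenate markers per species, filling absent markers with `fill`.

    `markers` is a list of dicts (one per marker file) mapping a species ID
    to its sequence for that marker.
    """
    # Collect every species ID, keeping first-seen order.
    ids = []
    for marker in markers:
        for seq_id in marker:
            if seq_id not in ids:
                ids.append(seq_id)

    result = {seq_id: "" for seq_id in ids}
    for marker in markers:
        # Width of this marker's block: its longest sequence.
        width = max((len(s) for s in marker.values()), default=0)
        for seq_id in ids:
            seq = marker.get(seq_id, "")
            # Absent or shorter sequences are padded on the right.
            result[seq_id] += seq + fill * (width - len(seq))
    return result


# Toy data (hypothetical): seq2 has no sequence for the second marker.
file1 = {"seq1": "AAAAA", "seq2": "CCCCC", "seq3": "GGGGG"}
file2 = {"seq1": "TTTTT", "seq3": "AAAAA"}
for seq_id, seq in concat_markers([file1, file2]).items():
    print(f">{seq_id}")
    print(seq)
```

Because the Ns are inserted per marker, they end up at the position of the missing marker rather than at the end of the concatenated sequence.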
amartinez.ull : Please use ADD REPLY/ADD COMMENT when responding to existing posts to keep threads logically organized. This comment/question should have gone under @shenwei's answer.
I will do so next time. It is my first post here, and I am still not very familiar with the rules. I apologize.
@amartinez.ull : If the length of the sequences is fixed, you can follow this post: "What is the fastest way to add 'Ns' to variable length sequences in a .fasta such that they have the same length". See the post by Petr Ponomarenko and upvote the OP.
Thanks! That seems like a potential solution, but as far as I can see, it still requires that I manually add the names of the missing markers to "File 2" (following the names in my original example) and then use that function. There must be a quicker way to do it...
You don't have to fill anything in. Concatenate the files using seqkit and then use this script; the input would be stdin. My concern is the case where the lengths are not fixed: then take the maximum sequence length, store it in a variable, and use it to pad the sequence for each ID.
In this example (posted in the OP), the length of each sequence is fixed at 15 bp, so it is easier to pad. If the padding is done to the length of the longest sequence, then you need to find the longest sequence and its length, store it in a variable, and then apply the above code.
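The pad-to-the-longest-sequence idea described above can be sketched in Python as follows. This is only an illustration, not the script referred to in the thread; the function name `pad_to_longest` and the toy records are hypothetical, and the records are assumed to be already parsed into (header, sequence) pairs.

```python
def pad_to_longest(records, fill="N"):
    """Right-pad every sequence with `fill` up to the longest sequence length."""
    records = list(records)
    # Find the maximum length first and store it, then pad each record to it.
    longest = max((len(seq) for _, seq in records), default=0)
    return [(header, seq.ljust(longest, fill)) for header, seq in records]


# Toy records (hypothetical data) with unequal lengths.
records = [("seq1", "ACGTACGT"), ("seq2", "ACGT"), ("seq3", "ACGTAC")]
for header, seq in pad_to_longest(records):
    print(f">{header}")
    print(seq)
```

Note that this puts the Ns at the end of each sequence, so it equalises lengths but does not by itself place the padding where a missing marker would have been.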