7.4 years ago
Varshney
Hello everyone,
I have a genome assembly file in FASTA format. I have to trim the sequences in that file based on specific positions.
How can I do this with Perl or a shell script?
I have almost 2000 sequences in my fasta file and I have the required positions in a tab delimited file containing id, start and end.
It would be great if anyone could help me with this.
Thanks in advance!
Hey Varshney, could you post a small example of your data and required output...
Thank you for the answers, but how can I remove the sequences from the FASTA file based on their positions?
Again, can you please provide some sample data and output? Do you want the sequence to be cut out completely and the two leftover ends joined together, or do you want it to be masked in some way?
Varshney: Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.
Hey James,
I have multiple sequences in one FASTA file and another tab-delimited file containing the IDs, start and end positions, like this:
Seq ID Start End
jcf713497 1 374
jcf713573 1 2268
jcf7123620 17 474
jcf7123620 5675 5707
jcf7123757 1 507
So, how can I remove the sequences at these positions from the FASTA file?
Do you want the sequence to be cut out completely and the two leftover ends joined together, or do you want it to be masked in some way? Also, when you reply, don't make a new post; just click on the add comment box below my response.
I want the sequence to be cut out completely and the two leftover ends joined together.
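For the "cut out and join the ends" case, here is a minimal awk sketch that needs no extra tools. It is only an illustration under some assumptions: the file names (assembly.fasta, regions.tsv, trimmed.fasta) and the tiny demo records are made up, the coordinates are taken to be 1-based and inclusive, and the intervals for each ID are assumed to be non-overlapping and listed in ascending order, as in the table above. Sequences are written on a single line, and a record is dropped if nothing is left of it.

```shell
#!/bin/sh
# Tiny hypothetical inputs so the sketch can be run as-is.
printf '>seqA demo\nAAAACCCC\nGGGGTTTT\n>seqB\nACGTACGT\n>seqC\nACGT\n' > assembly.fasta
printf 'Seq ID\tStart\tEnd\nseqA\t1\t4\nseqA\t9\t12\nseqC\t1\t4\n' > regions.tsv

awk -F'\t' '
  # Cut the stored intervals out of the current record and print what is left.
  function emit(   k) {
    if (hdr == "") return
    # Work from the last interval to the first so earlier coordinates stay valid.
    for (k = n[id] + 0; k >= 1; k--)
      seq = substr(seq, 1, s[id, k] - 1) substr(seq, e[id, k] + 1)
    # Records that end up empty are skipped.
    if (seq != "") print hdr "\n" seq
  }
  # First file: the region table (the header row is skipped).
  NR == FNR {
    if (FNR > 1) { n[$1]++; s[$1, n[$1]] = $2; e[$1, n[$1]] = $3 }
    next
  }
  # Second file: FASTA records; the ID is the header text up to the first blank.
  /^>/ {
    emit()
    hdr = $0
    id = substr($0, 2)
    sub(/[ \t].*/, "", id)
    seq = ""
    next
  }
  { seq = seq $0 }
  END { emit() }
' regions.tsv assembly.fasta > trimmed.fasta

cat trimmed.fasta
```

With the demo input, seqA keeps only the bases outside positions 1-4 and 9-12, seqB is untouched because it has no entry in the table, and seqC disappears because the interval covers it completely.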
This answer requires a Unix machine with bedtools and sed installed:
I would advise being careful when running the sed command with the '-i' parameter as a Unix novice, as it applies changes "in place", meaning it edits the input file directly. So the suggested line of code will produce an empty output file (result.fasta), as the replacement of zeros with no characters is done directly in the masked.fasta file.
Additionally, it is important to note that if your sequence IDs contain any zeros, these zeros will also get replaced in this case. If you would like to replace stretches of zeros of a certain minimum length that you are sure do not occur in your sequence IDs, then this might be a better solution:
where 'N' is to be replaced with the minimum number of occurrences, and backslashes are needed to 'escape' special characters such as {}.
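To make the two points above concrete, here is a small sketch. The demo file content and the minimum length of 4 are hypothetical, and skipping header lines with /^>/! is an extra precaution of mine, not part of the original suggestion. Note that a masked stretch that is split across two wrapped FASTA lines is matched line by line, so a short leftover piece on one line could be missed.

```shell
#!/bin/sh
# Hypothetical masked file: masked bases are written as 0, and the ID contains zeros too.
printf '>jcf7100200\nACGT00000000ACGT\nTTTT0000\n' > masked.fasta

# No -i here: sed only reads masked.fasta and the redirect writes result.fasta,
# so the input file stays untouched.
# /^>/! applies the substitution only to lines that are not FASTA headers.
# 0\{4,\} matches runs of at least four zeros (4 stands in for N).
sed '/^>/!s/0\{4,\}//g' masked.fasta > result.fasta

cat result.fasta
```

The header line keeps its zeros, the runs of zeros in the sequence lines are removed, and masked.fasta is left as it was.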