Need help : cut the fasta sequences from a specific region
2
2
Entering edit mode
7.4 years ago
Varshney ▴ 20

Hello everyone,

I have a genome assembly file in the fasta format. I have to trim that sequences based on specific positions from that file.

How can i do this by Perl or shell script ?

I have almost 2000 sequences in my fasta file and I have the required positions in a tab delimited file containing id, start and end.

It will be great if anyone could help me on this.

Thanks in Advance !!

Assembly • 8.2k views
ADD COMMENT
0
Entering edit mode

Hey Varshney, could you post a small example of your data and required output...

ADD REPLY
0
Entering edit mode

Thank you for the answers, but how can i remove the seqs based on their positions from fasta file.

ADD REPLY
0
Entering edit mode

Again, can you please provide some sample data and output. Do you want the sequence to be cut out completely and the two leftover ends joined together, or do you want it to be masked in someway?

ADD REPLY
0
Entering edit mode

Varshney : Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

ADD REPLY
0
Entering edit mode

Hey James,

I have multi sequences in one fasta file and another tab delimited file which containing the ids, start, end. Like this:

Seq ID Start End

jcf713497 1 374

jcf713573 1 2268

jcf7123620 17 474

jcf7123620 5675 5707

jcf7123757 1 507

So, how can I remove these positioned sequences from fasta file ?

ADD REPLY
0
Entering edit mode

Do you want the sequence to be cut out completely and the two leftover ends joined together, or do you want it to be masked in someway? Also, when you reply, don't make a new post just click on the add comment box below my response.

ADD REPLY
0
Entering edit mode

I want the sequence to be cut out completely and the two leftover ends joined together.

ADD REPLY
3
Entering edit mode

This answer requires you to be working on a unix machine and have the bedtools and sed command installed:

# Mask your regions with a zero character
bedtools maskfasta -mc 0 -fi input.fasta -bed regions.bed -fo masked.fasta

# Replace the masked regions with no characters
sed -i 's/0//g' masked.fasta > result.fasta
ADD REPLY
0
Entering edit mode

I would advise being careful when running sed command with '-i' parameter as a unix novice, as it applies changes "in place", meaning it edits the input file directly. So the suggested line of code will produce an empty output file (result.fasta), as replacing of zeros with no characters is done directly in the masked.fasta file.

Additionally, the important thing to note, if your sequence IDs contain any zero(s), these zeros will also get replaced in this case. If you would like to replace stretches of zeros of a certain minimum length that you are sure does not occur in your seq. IDs, then this might be a better solution:

sed 's/0\{N,\}//g' masked.fasta > result.fasta

where 'N' is to be replaced with the minimum number of occurrences, and backslashes are needed to 'escape' special characters such as {}.

ADD REPLY
4
Entering edit mode
7.4 years ago
trausch ★ 1.9k

You can use samtools to extract regions from a FASTA sequence

samtools faidx hg19.fa chr1:20000-20100

ADD COMMENT
1
Entering edit mode
7.4 years ago
Chadi Saad ▴ 110

You can use bedtools getfasta :

bedtools getfasta -fi input_file.fa -bed regions_file.bed

ADD COMMENT

Login before adding your answer.

Traffic: 1150 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6