how to extract particular regions of a sequence from fasta file if its range for each sequence from fasta file is given in another txt file for all the sequences contains in fasta file ?
1
1
Entering edit mode
7.9 years ago
ak93sharma ▴ 20

Hi folks, I have a fasta file ( 716_seq.fasta ) with 716 sequences I want a specific region which is different in all 716 sequences and I have the range in a different txt file (range.txt ) for each 716 sequence. I am able to do this for only for a single range for all the sequences by providing its range in code. I want it takes second range for second sequence and 3rd range for 3rd sequence, and so on. Please help.

716_seq.fasta file like this:

>header_seq1_>xyz
ATGACATGACGATC
ATCGTGACGTACGT
ATCGA
>header_seq2_>mno
ATCGGCGGTATTTA
ACGGTGGA
>header_seq3_>pqr
ATCGATCAGTACGA
ACGATGACGAT

range.txt file like this:

4 -6
7- 8
2- 5

the possible outcome should be like this:

>header_seq1_>xyz
ACA
>header_seq2_>mno
GG
>header_seq3_>pqr
TCGA

i write a code which follows as

grep -v ">" pv.fasta |sed ':a;N;$!ba;s/\n//g' | cut -c34655-36384

but it gives same region ( takes same range 34655 to 36384) for all the sequeces and output follows as :

TGAAAACCTTATATTCCCTGAGGAGGTTCTACCCCGTGGAAACGCTCTTTAATGGAACTTTAGCTTTAGCTGGTCGTGACCAAGAAACCACCGGTTTCGCTTGGTGGGCCGGGAATGCCCGACTTATCAATTTATCTGGTAAACTACTTGGGGCTCATGTAGCCCATGCCGGATTAATCGTATTCTGGGCCGGAGCAATGACCCTATTTGAAGTGGCTCATTTCCTACCAGAGAAACCCATGTATGAACAAGGCTTGATTTTACTTCCGCACCTAGCGACTCTAGGTTGGGGGGTAGGTCCTGGTGGGGAAGTTATAGATACCTTTCCATACTTTGTATCTGGAGTACTTCACCTAATTTCCTCCGCAGTATTGGGCTTTGGCGGTATTTATCATGCACTTCTGGGCCCCGAGACTCTTGAAGAATCTTTTCCATTCTTCGGTTATGTATGGAAAGATAGAAATAAAATGACCACAATTTTAGGTATTCACTTAGTCTTGTTAGGTATAGGTGCTTTTCTTCTAGTATTCAAGGCTCTTTATTTTGGAGGCGTATATGATACCTGGGCTCCGGGCGGGGGGGATGTAAGAAAAATAACCAACTTAACCCTCAGTCCAGGCGTTATATTTGGTTATTTACTAAAATCGCCCTTTGGAGGAGAAGGCTGGATTGTTAGTGTGGATGATTTGGAAGATATAATTGGAGGGCATGTATGGTTAGGCTCCATTTGTATACTTGGTGGAGTTTGGCATATCTTAACCAAACCTTTTGCATGGGCTCGCCGTGCACTTGTATGGTCTGGTGAGGCTTACTTGTCTTATAGTTTAGGGGCTTTATCTGTCTTTGGCTTCATTGCTTGTTGCTTTGTATGGTTCAACAACACCGCCTATCCTAGTGAGTTTTACGGGCCAACTGGGCCAGAAGCTTCTCAAGCTCAAGCATTTACTTTTCTAGTTAGAGACCAACGTCTTGGGGCTAATGTAGGATCTGCTCAAGGACCTACTGGTTTAGGTAAATATCTAATGCGCTCTCCGACTGGAGAAGTGATTTTTGGAGGAGAAACTATGCGCTTTTGGGATCTGCGCGCTCCTTGGTTAGAACCTCTAAGGGGTCCAAATGGTTTGGACTTGAGTAGGCTGAAAAAAGACATACAACCTTGGCAAGAACGGCGTTCCGCAGAATATATGACTCATGCTCCTTTAGGTTCTTTAAATTCCGTGGGTGGCGTAGCTACCGAGATCAATGCAGTCAACTATGTCTCTCCTAGAAGTTGGTTAGCTACTTCGCATTTTGTTCTCGGGTTCTTCCTATTCGTAGGTCATCTGTGGCACGCGGGAAGGGCTCGTGCAGCTGCAGCAGGATTTGAAAAAGGAATCGATCGCGATTTTGAACCTGTTCTTTCCATGACCCCTCTTAACTGAGACAGGCGATCAGATGTTTGACATAGGAATCTCCAACATACAATACATATTGGGACCGGGTCATACTTAAAAAGTATTCGTTATTCCTTATCTTTTTTTTTTCAATCTATATCTAAATCGAATCTATTTTTTCTGGCTCGGCTATTCCACCTAGCCGAGCCATTCCGCCTTTTGGCCGGGCAAAACCGATAAAGAAATCTATTCGTCGAGCAAAAAAAGGAGAGAGAGGGATTCGAACCCTCGATAGTTCTTTGTTTAGAACTATACCGGTTTTCAAGACCGGAGCTATCAACCGCTCGGCCATCTCTC
RNA-Seq sed fasta sequence extract sequence • 4.8k views
ADD COMMENT
3
Entering edit mode

This looks like one of the problems from your classroom assignments.

What do you mean by "I am able to do this for only a single sequence by providing its range its takes huge time if I extract particular region one by one" ? Are you doing it manually ? What else did you try? Did you try to write a program, if yes, share the code and the exact error message. Or, did you try to google?

Biostars is not meant for providing ready answers.

ADD REPLY
1
Entering edit mode

i have already written a code which gives me same region from all sequence , my code follows as

grep -v ">" pv.fasta |sed ':a;N;$!ba;s/\n//g' | cut -c34655-36384

and gives output :

TGAAAACCTTATATTCCCTGAGGAGGTTCTACCCCGTGGAAACGCTCTTTAATGGAACTTTAGCTTTAGCTGGTCGTGACCAAGAAACCACCGGTTTCGCTTGGTGGGCCGGGAATGCCCGACTTATCAATTTATCTGGTAAACTACTTGGGGCTCATGTAGCCCATGCCGGATTAATCGTATTCTGGGCCGGAGCAATGACCCTATTTGAAGTGGCTCATTTCCTACCAGAGAAACCCATGTATGAACAAGGCTTGATTTTACTTCCGCACCTAGCGACTCTAGGTTGGGGGGTAGGTCCTGGTGGGGAAGTTATAGATACCTTTCCATACTTTGTATCTGGAGTACTTCACCTAATTTCCTCCGCAGTATTGGGCTTTGGCGGTATTTATCATGCACTTCTGGGCCCCGAGACTCTTGAAGAATCTTTTCCATTCTTCGGTTATGTATGGAAAGATAGAAATAAAATGACCACAATTTTAGGTATTCACTTAGTCTTGTTAGGTATAGGTGCTTTTCTTCTAGTATTCAAGGCTCTTTATTTTGGAGGCGTATATGATACCTGGGCTCCGGGCGGGGGGGATGTAAGAAAAATAACCAACTTAACCCTCAGTCCAGGCGTTATATTTGGTTATTTACTAAAATCGCCCTTTGGAGGAGAAGGCTGGATTGTTAGTGTGGATGATTTGGAAGATATAATTGGAGGGCATGTATGGTTAGGCTCCATTTGTATACTTGGTGGAGTTTGGCATATCTTAACCAAACCTTTTGCATGGGCTCGCCGTGCACTTGTATGGTCTGGTGAGGCTTACTTGTCTTATAGTTTAGGGGCTTTATCTGTCTTTGGCTTCATTGCTTGTTGCTTTGTATGGTTCAACAACACCGCCTATCCTAGTGAGTTTTACGGGCCAACTGGGCCAGAAGCTTCTCAAGCTCAAGCATTTACTTTTCTAGTTAGAGACCAACGTCTTGGGGCTAATGTAGGATCTGCTCAAGGACCTACTGGTTTAGGTAAATATCTAATGCGCTCTCCGACTGGAGAAGTGATTTTTGGAGGAGAAACTATGCGCTTTTGGGATCTGCGCGCTCCTTGGTTAGAACCTCTAAGGGGTCCAAATGGTTTGGACTTGAGTAGGCTGAAAAAAGACATACAACCTTGGCAAGAACGGCGTTCCGCAGAATATATGACTCATGCTCCTTTAGGTTCTTTAAATTCCGTGGGTGGCGTAGCTACCGAGATCAATGCAGTCAACTATGTCTCTCCTAGAAGTTGGTTAGCTACTTCGCATTTTGTTCTCGGGTTCTTCCTATTCGTAGGTCATCTGTGGCACGCGGGAAGGGCTCGTGCAGCTGCAGCAGGATTTGAAAAAGGAATCGATCGCGATTTTGAACCTGTTCTTTCCATGACCCCTCTTAACTGAGACAGGCGATCAGATGTTTGACATAGGAATCTCCAACATACAATACATATTGGGACCGGGTCATACTTAAAAAGTATTCGTTATTCCTTATCTTTTTTTTTTCAATCTATATCTAAATCGAATCTATTTTTTCTGGCTCGGCTATTCCACCTAGCCGAGCCATTCCGCCTTTTGGCCGGGCAAAACCGATAAAGAAATCTATTCGTCGAGCAAAAAAAGGAGAGAGAGGGATTCGAACCCTCGATAGTTCTTTGTTTAGAACTATACCGGTTTTCAAGACCGGAGCTATCAACCGCTCGGCCATCTCTC

ADD REPLY
1
Entering edit mode

samtools faidx is your friend (as are biopython and pyfasta).

ADD REPLY
0
Entering edit mode

For basic tasks like this, there are already a couple of tools present which are open source. For your task, checkout, bedtools getfasta here.

ADD REPLY
1
Entering edit mode

bedtools getfasta isn't going to help OP. He needs a different interval from each fasta entry.

ADD REPLY
1
Entering edit mode

This could easily be done using a (Bio)python script. Let me know if you can't figure it out.

ADD REPLY
4
Entering edit mode
7.9 years ago
Paul ★ 1.5k

Hi,

what about to try some pure bash scripting. Lets assume your range is called range.txt a fasta file is fasta.fa. Put all files in one directory and make solution.sh executable (chmod +x solution.sh):

cat solution.sh

#!/bin/bash
lines=$(awk 'END {print NR}' range.txt)
for ((a=1; a<= $lines ; a++))
 do 
 number=$(awk -v lines=$a 'NR == lines' range.txt)
 grep -v ">" fasta.fa | awk -v lines=$a 'NR == lines' | cut -c$number
done;

run it like

./solution.sh

This would be probably slow for very huge fasta files.

Cheers.

ADD COMMENT
0
Entering edit mode

@Paul , Thank you so much, its works well. I made a sample file before running it to a huge fasta file containing 716 genomes.

sample fasta and range file.

cat fasta.fa
>header1
ABCDEFGHIJGGGGGGGGGGG
>header2
KLMNOPQRSTUVWCCCCCCC


cat range.txt 
4-6
2-5

I made code like this

cat fasta.fa |sed -n 2~2p |sed '1q;d'|cut -c4-6
output:
DEF
cat fasta.fa |sed -n 2~2p |sed '2q;d'|cut -c2-5
output:
LMNO

I want to pass variable $L and $R in such way in my code that desired result will come, please help me out.

  R in `cat range.txt`
       L in {1..2}


cat fasta.fa |sed -n 2~2p |sed '$Lq;d'|cut -c$R
ADD REPLY
0
Entering edit mode

You are welcome. I tried it on the huge fasta file and it seems to me it works fast. Did my solution works?

ADD REPLY

Login before adding your answer.

Traffic: 1692 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6