Hi folks, I have a FASTA file (716_seq.fasta) with 716 sequences. I want to extract a specific region from each one. The region is different for every sequence, and the ranges are listed in a separate text file (range.txt), one range per sequence. So far I can only extract a single range, hard-coded in my command, from all the sequences. I want the first range to be applied to the first sequence, the second range to the second sequence, the third range to the third sequence, and so on. Please help.
The 716_seq.fasta file looks like this:
>header_seq1_>xyz
ATGACATGACGATC
ATCGTGACGTACGT
ATCGA
>header_seq2_>mno
ATCGGCGGTATTTA
ACGGTGGA
>header_seq3_>pqr
ATCGATCAGTACGA
ACGATGACGAT
The range.txt file looks like this:
4 -6
7- 8
2- 5
The expected output should look like this:
>header_seq1_>xyz
ACA
>header_seq2_>mno
GG
>header_seq3_>pqr
TCGA
I wrote the following command:
grep -v ">" pv.fasta |sed ':a;N;$!ba;s/\n//g' | cut -c34655-36384
but it extracts the same region (the fixed range 34655 to 36384) instead of a different range for each sequence, and the output is as follows:
TGAAAACCTTATATTCCCTGAGGAGGTTCTACCCCGTGGAAACGCTCTTTAATGGAACTTTAGCTTTAGCTGGTCGTGACCAAGAAACCACCGGTTTCGCTTGGTGGGCCGGGAATGCCCGACTTATCAATTTATCTGGTAAACTACTTGGGGCTCATGTAGCCCATGCCGGATTAATCGTATTCTGGGCCGGAGCAATGACCCTATTTGAAGTGGCTCATTTCCTACCAGAGAAACCCATGTATGAACAAGGCTTGATTTTACTTCCGCACCTAGCGACTCTAGGTTGGGGGGTAGGTCCTGGTGGGGAAGTTATAGATACCTTTCCATACTTTGTATCTGGAGTACTTCACCTAATTTCCTCCGCAGTATTGGGCTTTGGCGGTATTTATCATGCACTTCTGGGCCCCGAGACTCTTGAAGAATCTTTTCCATTCTTCGGTTATGTATGGAAAGATAGAAATAAAATGACCACAATTTTAGGTATTCACTTAGTCTTGTTAGGTATAGGTGCTTTTCTTCTAGTATTCAAGGCTCTTTATTTTGGAGGCGTATATGATACCTGGGCTCCGGGCGGGGGGGATGTAAGAAAAATAACCAACTTAACCCTCAGTCCAGGCGTTATATTTGGTTATTTACTAAAATCGCCCTTTGGAGGAGAAGGCTGGATTGTTAGTGTGGATGATTTGGAAGATATAATTGGAGGGCATGTATGGTTAGGCTCCATTTGTATACTTGGTGGAGTTTGGCATATCTTAACCAAACCTTTTGCATGGGCTCGCCGTGCACTTGTATGGTCTGGTGAGGCTTACTTGTCTTATAGTTTAGGGGCTTTATCTGTCTTTGGCTTCATTGCTTGTTGCTTTGTATGGTTCAACAACACCGCCTATCCTAGTGAGTTTTACGGGCCAACTGGGCCAGAAGCTTCTCAAGCTCAAGCATTTACTTTTCTAGTTAGAGACCAACGTCTTGGGGCTAATGTAGGATCTGCTCAAGGACCTACTGGTTTAGGTAAATATCTAATGCGCTCTCCGACTGGAGAAGTGATTTTTGGAGGAGAAACTATGCGCTTTTGGGATCTGCGCGCTCCTTGGTTAGAACCTCTAAGGGGTCCAAATGGTTTGGACTTGAGTAGGCTGAAAAAAGACATACAACCTTGGCAAGAACGGCGTTCCGCAGAATATATGACTCATGCTCCTTTAGGTTCTTTAAATTCCGTGGGTGGCGTAGCTACCGAGATCAATGCAGTCAACTATGTCTCTCCTAGAAGTTGGTTAGCTACTTCGCATTTTGTTCTCGGGTTCTTCCTATTCGTAGGTCATCTGTGGCACGCGGGAAGGGCTCGTGCAGCTGCAGCAGGATTTGAAAAAGGAATCGATCGCGATTTTGAACCTGTTCTTTCCATGACCCCTCTTAACTGAGACAGGCGATCAGATGTTTGACATAGGAATCTCCAACATACAATACATATTGGGACCGGGTCATACTTAAAAAGTATTCGTTATTCCTTATCTTTTTTTTTTCAATCTATATCTAAATCGAATCTATTTTTTCTGGCTCGGCTATTCCACCTAGCCGAGCCATTCCGCCTTTTGGCCGGGCAAAACCGATAAAGAAATCTATTCGTCGAGCAAAAAAAGGAGAGAGAGGGATTCGAACCCTCGATAGTTCTTTGTTTAGAACTATACCGGTTTTCAAGACCGGAGCTATCAACCGCTCGGCCATCTCTC
This looks like one of the problems from your classroom assignments.
What do you mean by "I am able to do this for only a single sequence by providing its range its takes huge time if I extract particular region one by one"? Are you doing it manually? What else did you try? Did you try to write a program? If yes, share the code and the exact error message. Or did you try to google?
Biostars is not meant for providing ready answers.
I have already written a command that extracts the same region from all the sequences. My command is as follows:
grep -v ">" pv.fasta |sed ':a;N;$!ba;s/\n//g' | cut -c34655-36384
and it gives the same output as shown in the question above.
samtools faidx is your friend (as are Biopython and pyfasta). For basic tasks like this, there are already a couple of open-source tools available. For your task, check out bedtools getfasta.
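To make the samtools faidx suggestion concrete: it accepts regions written as name:start-end (1-based, inclusive), so one option is to pair each FASTA header with the matching line of range.txt and build those region strings. Below is a rough sketch in Python, assuming the ranges are in the same order as the records and that the sequence name is the header text up to the first whitespace. The function name faidx_regions is my own, and I have not tested this against headers containing unusual characters.

```python
import re
import sys


def faidx_regions(fasta_lines, range_lines):
    """Build 'name:start-end' region strings, pairing the i-th header
    with the i-th range (assumption: both files are in the same order)."""
    # Sequence name = header text after '>' up to the first whitespace.
    names = [line[1:].split()[0] for line in fasta_lines if line.startswith(">")]
    # Pull the two integers out of lines such as '4 -6' or '7- 8'.
    spans = [re.findall(r"\d+", line) for line in range_lines if line.strip()]
    if len(names) != len(spans):
        raise ValueError(f"{len(names)} sequences but {len(spans)} ranges")
    regions = []
    for name, span in zip(names, spans):
        if len(span) != 2:
            raise ValueError(f"expected two numbers for {name}, got {span}")
        regions.append(f"{name}:{span[0]}-{span[1]}")
    return regions


# Usage: python make_regions.py 716_seq.fasta range.txt > regions.txt
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as fasta, open(sys.argv[2]) as ranges:
        print("\n".join(faidx_regions(fasta, ranges)))
```

If your samtools version supports the -r (--region-file) option, the resulting file can be passed as samtools faidx 716_seq.fasta -r regions.txt. Otherwise the regions can be given as arguments on the command line. Note that the output headers will be the region strings rather than the original headers.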
bedtools getfasta isn't going to help OP. He needs a different interval from each fasta entry.
This could easily be done using a (Bio)python script. Let me know if you can't figure it out.
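For reference, here is a minimal sketch of such a script in plain Python (no Biopython needed). It assumes range.txt has exactly one start-end line per record, in the same order as the FASTA file, and that coordinates are 1-based and inclusive, which is what the expected output in the question implies. The function names are my own.

```python
import re
import sys


def read_fasta(handle):
    """Yield (header, sequence) pairs; wrapped sequence lines are joined."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def read_ranges(handle):
    """Yield (start, end) integers from lines such as '4 -6' or '7- 8'."""
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        match = re.fullmatch(r"\s*(\d+)\s*-\s*(\d+)\s*", line)
        if match is None:
            raise ValueError(f"cannot parse range line: {line!r}")
        yield int(match.group(1)), int(match.group(2))


def extract_regions(fasta_handle, range_handle):
    """Pair the i-th record with the i-th range and yield (header, region)."""
    records = list(read_fasta(fasta_handle))
    ranges = list(read_ranges(range_handle))
    if len(records) != len(ranges):
        raise ValueError(f"{len(records)} sequences but {len(ranges)} ranges")
    for (header, seq), (start, end) in zip(records, ranges):
        # Convert 1-based inclusive coordinates to a Python slice.
        yield header, seq[start - 1:end]


# Usage: python extract_regions.py 716_seq.fasta range.txt > regions.fasta
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as fasta, open(sys.argv[2]) as ranges:
        for header, region in extract_regions(fasta, ranges):
            print(f">{header}\n{region}")
```

The script stops with an error if the number of ranges does not match the number of sequences, since pairing by position would otherwise go wrong without any warning.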