11 months ago
sapuizait
Hi all
I am trying to extract specific regions from sequences in FASTA files.
Example file:
>oneseq
GAATATGCTTTTCGCTATTTTGGTGGCAGAACAAAAGCAATTATGTATGACCAGGATAAAGTTTTTGTTGTAAGTGAGAATTTCGGCAATGTAATCTTTGTTCCGGAGT
>twoseq
TCTGGAGGGACTGCCGGCGCAAGCCGTGAGGAAGGATGGGATGACGTCAAATCAGCACGGCCCTTACGTCCGGGGCGACACACGTGTTACAATGGTCGGCACAGCGGGAAGCCATATGGTGACATAGAGCGGAACCCGAAAGCCGGTCTCAGTTCGGATCGGAGTCTGCAA
>threeseq
CATCCGGTAGTTTCTTTCCCATATCCTGCACCGCCGGAACCTTCTTCTCATAGGTCTGTAACTGTGTATTGTTTTCTGCCATACAAACACCGCCCTTTCTATGATATTTCAGATATTTCAAGCAATATTTCAAAAAATTAAATCTAATCTTAACTTTATTCCAACCCTT
I often use samtools for this, with something like:
samtools faidx test.fas oneseq:2-10 twoseq:3-10
However, this time I would like to do the opposite and extract the regions that are not specified. For the example above, I would like to get this as output:
>oneseq
G-TTCGCTATTTTGGTGGCAGAACAAAAGCAATTATGTATGACCAGGATAAAGTTTTTGTTGTAAGTGAGAATTTCGGCAATGTAATCTTTGTTCCGGAGT
>twoseq
TC-CTGCCGGCGCAAGCCGTGAGGAAGGATGGGATGACGTCAAATCAGCACGGCCCTTACGTCCGGGGCGACACACGTGTTACAATGGTCGGCACAGCGGGAAGCCATATGGTGACATAGAGCGGAACCCGAAAGCCGGTCTCAGTTCGGATCGGAGTCTGCAA
OR even better :)
>oneseq-upstream
G
>oneseq-downstream
TTCGCTATTTTGGTGGCAGAACAAAAGCAATTATGTATGACCAGGATAAAGTTTTTGTTGTAAGTGAGAATTTCGGCAATGTAATCTTTGTTCCGGAGT
>twoseq-upstream
TC
>twoseq-downstream
CTGCCGGCGCAAGCCGTGAGGAAGGATGGGATGACGTCAAATCAGCACGGCCCTTACGTCCGGGGCGACACACGTGTTACAATGGTCGGCACAGCGGGAAGCCATATGGTGACATAGAGCGGAACCCGAAAGCCGGTCTCAGTTCGGATCGGAGTCTGCAA
I am aware that this is possible if I first get each sequence's length and then work out the coordinates on either side of the selected region, but I was wondering whether there is an easier way to do this.
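To make the desired behaviour concrete, here is a minimal Python sketch of the "everything except the region" extraction. It is not a samtools feature. The function names (`parse_fasta`, `flanks`, `complement_regions`) are made up for illustration, and it assumes the same 1-based, inclusive coordinates that `samtools faidx` regions use. With string slicing the sequence length does not need to be looked up separately.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict of {name: sequence} (hypothetical helper)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def flanks(seq, start, end):
    """Return (upstream, downstream) around the 1-based inclusive region start..end."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError(f"region {start}-{end} is outside a sequence of length {len(seq)}")
    # Positions 1..start-1 are upstream; positions end+1..len(seq) are downstream.
    return seq[:start - 1], seq[end:]


def complement_regions(records, regions):
    """Build FASTA text with -upstream/-downstream records for each named region."""
    out = []
    for name, (start, end) in regions.items():
        up, down = flanks(records[name], start, end)
        out.append(f">{name}-upstream\n{up}")
        out.append(f">{name}-downstream\n{down}")
    return "\n".join(out)


# Toy input (shortened sequences, only for demonstration).
demo = ">oneseq\nGAATATGCTTTTCGC\n>twoseq\nTCTGGAGGGACTGCC\n"
print(complement_regions(parse_fasta(demo), {"oneseq": (2, 10), "twoseq": (3, 10)}))
```

For the single-record form with a `-` in place of the removed region, the two flanks can be joined with `"-".join(flanks(seq, start, end))`. A region that starts at position 1 or ends at the last base simply yields an empty flank.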
Thanks