Hi,
I am trying to remove ranges of bases from sequences in a FASTA file, at multiple places, based on the header IDs and positions that I specify.
For example, I have a file, A.fa:
>ID1
TTGTTCAACGGATCCACCTGTTGCCAAGAGTGCTTCAGTACATTGCTCACGGCTGAATCCCATATCCATCAAAGCACAAGATTTGAATTCACTCGAGGATCTGCTTCGTCGACCATTGGAAATGAAAAAATTACAATTACACATTGAATTTGTAAAGCTTGAAATTAATGAACTTACCAAAATAGATTTGCACACAGAAGCAACAGCTTGGCCGTGTTACAACTTGTAACGGGTAAAGACAAAATCGCTAACAACGGTTGTAGGCCACCATGTTCCACAAATTCACGACA
>ID2
ATGGTCGTCCGTTGAATTGT**TACTCAAAAT**TGCGTCGACAAATTTCATCACGTTCATAATGTAGTCAATGAGAACGATTGGAATGCGTTCGGAAGTAGATGATGAAGTCTGTGCAGATTCTTGTTCTGTATTCCCAGTTGCATTT
>ID3
TCTGCA**TTCT**GTCCA**TTGTC**ATCTCTGTGATTGTTGTACGGTGACGTACTTGCTTCTTCTTAGTCTTCATCTTCATCATCATTGCTACCTGCATTCATATCCGGATTATTTGTATAAGATTATTGGAAATGCCTAGCTACACAAATCCTTAAAATAAAAATAGGAAAAAAGTGTAAAAAAATAAAAGAAAAAAAATATTGAATGTAACTCACCTAAAGTAATA
I have another file (X.txt) with the FASTA header IDs and the positions to remove, which looks like this:
ID start end
ID2 20...30
ID3 6...10, 15...20
I would like to modify A.fa so that in sequence ID2 the bases between 20 and 30 are removed, and in ID3 the bases between 6 and 10 and between 15 and 20 are removed, producing B.fa, which looks like this:
>ID1
TTGTTCAACGGATCCACCTGTTGCCAAGAGTGCTTCAGTACATTGCTCACGGCTGAATCCCATATCCATCAAAGCACAAGATTTGAATTCACTCGAGGATCTGCTTCGTCGACCATTGGAAATGAAAAAATTACAATTACACATTGAATTTGTAAAGCTTGAAATTAATGAACTTACCAAAATAGATTTGCACACAGAAGCAACAGCTTGGCCGTGTTACAACTTGTAACGGGTAAAGACAAAATCGCTAACAACGGTTGTAGGCCACCATGTTCCACAAATTCACGACA
>ID2
ATGGTCGTCCGTTGAATTGTTGCGTCGACAAATTTCATCACGTTCATAATGTAGTCAATGAGAACGATTGGAATGCGTTCGGAAGTAGATGATGAAGTCTGTGCAGATTCTTGTTCTGTATTCCCAGTTGCATTT
>ID3
TCTGCAGTCCAATCTCTGTGATTGTTGTACGGTGACGTACTTGCTTCTTCTTAGTCTTCATCTTCATCATCATTGCTACCTGCATTCATATCCGGATTATTTGTATAAGATTATTGGAAATGCCTAGCTACACAAATCCTTAAAATAAAAATAGGAAAAAAGTGTAAAAAAATAAAAGAAAAAAAATATTGAATGTAACTCACCTAAAGTAATA
I have more than 100 IDs, each with different positions in X.txt, to apply to A.fa. Any help would be appreciated.
Thank you very much
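For reference, the requested edit can be sketched in plain Python without external tools. This is only a minimal sketch under stated assumptions: the positions in X.txt are taken to be 0-based and end-exclusive (as in BED), which is what the bolded bases in the ID2 and ID3 examples correspond to, and the function names (`parse_fasta`, `parse_ranges`, `remove_ranges`, `edit_fasta`) are hypothetical helpers, not part of any library.

```python
import re


def parse_fasta(text):
    """Return a list of (header, sequence) pairs; multi-line sequences are joined."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def parse_ranges(text):
    """Parse lines such as 'ID3 6...10, 15...20' into {id: [(start, end), ...]}."""
    ranges = {}
    for line in text.splitlines():
        fields = line.split(None, 1)
        if len(fields) < 2:
            continue
        pairs = re.findall(r"(\d+)\.{2,}(\d+)", fields[1])
        # A header line such as 'ID start end' has no numeric pairs and is skipped.
        if pairs:
            ranges.setdefault(fields[0], []).extend((int(s), int(e)) for s, e in pairs)
    return ranges


def remove_ranges(seq, intervals):
    """Delete every [start, end) interval from seq.

    Assumption: coordinates are 0-based and end-exclusive.
    Overlapping or unsorted intervals are tolerated.
    """
    kept, pos = [], 0
    for start, end in sorted(intervals):
        if start > pos:
            kept.append(seq[pos:start])
        pos = max(pos, end)
    kept.append(seq[pos:])
    return "".join(kept)


def edit_fasta(fasta_text, ranges_text):
    """Return FASTA text with the listed ranges removed; other records are unchanged."""
    ranges = parse_ranges(ranges_text)
    out = []
    for header, seq in parse_fasta(fasta_text):
        words = header.split()
        seq_id = words[0] if words else ""  # ID is the first word of the header
        out.append(">" + header)
        out.append(remove_ranges(seq, ranges.get(seq_id, [])))
    return "\n".join(out) + "\n"


# Small demo using the first bases of ID2 and ID3 from the example above.
demo_fasta = ">ID2\nATGGTCGTCCGTTGAATTGTTACTCAAAATTGCGTCG\n>ID3\nTCTGCATTCTGTCCATTGTCATCTCTG\n"
demo_ranges = "ID start end\nID2 20...30\nID3 6...10, 15...20\n"
print(edit_fasta(demo_fasta, demo_ranges))
```

To use it on real files, the contents of A.fa and X.txt could be read with `open(...).read()`, passed to `edit_fasta`, and the returned text written to B.fa. If the positions turn out to be 1-based and inclusive instead, each start would need to be reduced by one before calling `remove_ranges`.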
Hi, thank you for your response. I created the BED file for X.txt, and it looks as given below:
When I run step 3 from your answer, it excludes the entire node listed in X.txt from A.bed, not just the regions (start to end) given in the file.
Could you please let me know if there is a workaround for this?
I figured out a way to do it. Instead of 'intersect', 'subtract' works fine for this problem.
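To illustrate why the two operations behave differently, here is a small pure-Python sketch of the interval arithmetic. It assumes the 'intersect' and 'subtract' mentioned here are the bedtools subcommands of those names and that coordinates are 0-based, end-exclusive BED intervals; it does not call bedtools, and the function names are hypothetical.

```python
def intersect_intervals(whole, regions):
    """Return the parts of `regions` that fall inside `whole` = (start, end)."""
    start, end = whole
    hits = []
    for r_start, r_end in sorted(regions):
        r_start, r_end = max(r_start, start), min(r_end, end)
        if r_start < r_end:
            hits.append((r_start, r_end))
    return hits


def subtract_intervals(whole, regions):
    """Return the parts of `whole` = (start, end) NOT covered by any of `regions`."""
    start, end = whole
    kept, pos = [], start
    for r_start, r_end in sorted(regions):
        r_start, r_end = max(r_start, start), min(r_end, end)
        if r_start >= r_end:
            continue  # region lies outside the sequence
        if r_start > pos:
            kept.append((pos, r_start))
        pos = max(pos, r_end)
    if pos < end:
        kept.append((pos, end))
    return kept


# A 27-base sequence with the two ID3 regions from X.txt.
whole_sequence = (0, 27)
regions_to_drop = [(6, 10), (15, 20)]

print(intersect_intervals(whole_sequence, regions_to_drop))  # [(6, 10), (15, 20)]
print(subtract_intervals(whole_sequence, regions_to_drop))   # [(0, 6), (10, 15), (20, 27)]
```

Intersection yields the regions listed in X.txt themselves, whereas subtraction yields the flanking pieces that should be kept. Those kept pieces still have to be extracted and joined back together per ID to obtain one continuous sequence for each record.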