Exclude specified range of bases from multiple sequences in a FASTA file
1
0
Entering edit mode
3.1 years ago

Hi,

I am trying to remove ranges of bases from sequences in a FASTA file, possibly at several places per sequence, based on the header ID and the positions I specify.

For example, I have a file, A.fa:

>ID1
TTGTTCAACGGATCCACCTGTTGCCAAGAGTGCTTCAGTACATTGCTCACGGCTGAATCCCATATCCATCAAAGCACAAGATTTGAATTCACTCGAGGATCTGCTTCGTCGACCATTGGAAATGAAAAAATTACAATTACACATTGAATTTGTAAAGCTTGAAATTAATGAACTTACCAAAATAGATTTGCACACAGAAGCAACAGCTTGGCCGTGTTACAACTTGTAACGGGTAAAGACAAAATCGCTAACAACGGTTGTAGGCCACCATGTTCCACAAATTCACGACA
>ID2
ATGGTCGTCCGTTGAATTGT**TACTCAAAAT**TGCGTCGACAAATTTCATCACGTTCATAATGTAGTCAATGAGAACGATTGGAATGCGTTCGGAAGTAGATGATGAAGTCTGTGCAGATTCTTGTTCTGTATTCCCAGTTGCATTT
>ID3
TCTGCA**TTCT**GTCCA**TTGTC**ATCTCTGTGATTGTTGTACGGTGACGTACTTGCTTCTTCTTAGTCTTCATCTTCATCATCATTGCTACCTGCATTCATATCCGGATTATTTGTATAAGATTATTGGAAATGCCTAGCTACACAAATCCTTAAAATAAAAATAGGAAAAAAGTGTAAAAAAATAAAAGAAAAAAAATATTGAATGTAACTCACCTAAAGTAATA

I have another file (X.txt) that lists FASTA headers with the positions to remove. It looks like this:

ID start end 
ID2 20...30 
ID3  6...10, 15...20

I would like to modify A.fa so that in sequence ID2 the bases between 20 and 30 are excluded, and in ID3 the bases between 6 and 10 and between 15 and 20 are excluded, producing B.fa, which looks like this:

>ID1
TTGTTCAACGGATCCACCTGTTGCCAAGAGTGCTTCAGTACATTGCTCACGGCTGAATCCCATATCCATCAAAGCACAAGATTTGAATTCACTCGAGGATCTGCTTCGTCGACCATTGGAAATGAAAAAATTACAATTACACATTGAATTTGTAAAGCTTGAAATTAATGAACTTACCAAAATAGATTTGCACACAGAAGCAACAGCTTGGCCGTGTTACAACTTGTAACGGGTAAAGACAAAATCGCTAACAACGGTTGTAGGCCACCATGTTCCACAAATTCACGACA
>ID2 
ATGGTCGTCCGTTGAATTGTTGCGTCGACAAATTTCATCACGTTCATAATGTAGTCAATGAGAACGATTGGAATGCGTTCGGAAGTAGATGATGAAGTCTGTGCAGATTCTTGTTCTGTATTCCCAGTTGCATTT  
>ID3
TCTGCAGTCCATTTCTGTGATTGTTGTACGGTGACGTACTTGCTTCTTCTTAGTCTTCATCTTCATCATCATTGCTACCTGCATTCATATCCGGATTATTTGTATAAGATTATTGGAAATGCCTAGCTACACAAATCCTTAAAATAAAAATAGGAAAAAAGTGTAAAAAAATAAAAGAAAAAAAATATTGAATGTAACTCACCTAAAGTAATA

I have more than 100 IDs and different positions in X.txt to modify A.fa. Any help would be appreciated.

Thank you very much

FASTA Assembly • 1.3k views
3.1 years ago
nickp60 ▴ 60

I'd probably do something like the following, assuming you can convert the X.txt positions file you describe into a 1-feature-per-line bed file:

1) Index your sequences (taken from

samtools faidx A.fa

2) Make a BED file of the original sequences using the index:

awk 'BEGIN {FS="\t"}; {print $1 FS "0" FS $2}' A.fa.fai > A.bed

3) Remove the bad regions from the original BED file (note the -v):

bedtools intersect -a A.bed -b X.txt -v > A.goodregions.bed

4) Pull out the good regions:

bedtools getfasta -fi A.fa -bed A.goodregions.bed > A.goodregions.fa
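
The conversion of X.txt into a 1-feature-per-line BED file, which the steps above assume, could look roughly like the sketch below. The function name `xtxt_to_bed` is made up for illustration, and the parsing assumes X.txt is laid out exactly as in the question (a header row, then an ID followed by comma-separated `start...end` ranges). It also assumes the numbers can be used as BED coordinates unchanged, which matches the example in the question (for `20...30`, the first 20 bases are kept and the next 10 removed). If your positions are 1-based and inclusive instead, subtract 1 from each start.

```shell
# Hypothetical helper: turn the question's X.txt layout into BED lines.
# usage: xtxt_to_bed X.txt > X.bed
xtxt_to_bed() {
  awk '
    NR == 1 && $1 == "ID" { next }   # skip the "ID start end" header row
    NF < 2 { next }                  # skip blank or incomplete lines
    {
      id = $1
      ranges = ""
      # rejoin the remaining fields so "6...10, 15...20" becomes "6...10,15...20"
      for (i = 2; i <= NF; i++) ranges = ranges $i
      n = split(ranges, parts, ",")
      for (j = 1; j <= n; j++) {
        # each part is start...end; split on the run of dots
        if (split(parts[j], se, "[.]+") == 2)
          printf "%s\t%d\t%d\n", id, se[1], se[2]
      }
    }
  ' "$1"
}
```

For the X.txt in the question this prints one tab-separated line per region: one for ID2 and two for ID3.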

Hi, thank you for your response. I created the BED file for X.txt, and it looks like this:

NODE_1138     1535     4521
NODE_11674    1119     2587
NODE_11674    3000     3043
NODE_120      60144    62167

When I run step 3 from your answer, it excludes every node listed in X.txt from A.bed entirely, not just the regions (start - end) given in the file.

Could you please let me know if there is a workaround?


I figured out a way to do it. Using 'subtract' instead of 'intersect' in step 3 works fine for this problem.
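
For anyone following along, here is a rough sketch of the whole workflow with that change. `bedtools intersect -v` only reports A features that have no overlap with B at all, so any sequence with a region to remove is dropped entirely, whereas `bedtools subtract` trims just the overlapping part. One thing to watch: `getfasta` then writes every remaining piece as its own record, named like `ID3:0-6`, so the pieces still have to be stitched back together to get a B.fa like the one in the question. The function names (`join_pieces`, `remove_regions`) and the intermediate file names are invented for this sketch, and it assumes the pieces of each sequence come out consecutively and in coordinate order.

```shell
# Hypothetical helper: merge consecutive getfasta records that share an ID.
# ">ID3:0-6", ">ID3:10-15", ... become a single ">ID3" record.
join_pieces() {
  awk '
    /^>/ {
      id = substr($0, 2)
      sub(/:[0-9]+-[0-9]+$/, "", id)   # drop the ":start-end" suffix added by getfasta
      if (id != prev) {
        if (prev != "") printf "\n"    # finish the previous sequence line
        printf ">%s\n", id
        prev = id
      }
      next
    }
    { printf "%s", $0 }                # append this piece to the current sequence
    END { if (prev != "") printf "\n" }
  ' "$1"
}

# Hypothetical wrapper around the steps in the answer, using subtract.
# usage: remove_regions A.fa X.bed > B.fa
remove_regions() {
  fa=$1
  bed=$2
  samtools faidx "$fa"                                              # writes $fa.fai
  awk 'BEGIN {FS = OFS = "\t"} {print $1, 0, $2}' "$fa.fai" > "$fa.whole.bed"
  bedtools subtract -a "$fa.whole.bed" -b "$bed" > "$fa.keep.bed"   # trim, not drop
  bedtools getfasta -fi "$fa" -bed "$fa.keep.bed" -fo "$fa.keep.fa"
  join_pieces "$fa.keep.fa"
}
```

A sequence whose entire length is covered by regions in the BED file will not appear in the output at all.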
