Question

Exclude specified range of bases from multiple sequences in a FASTA file

3.1 years ago

Inquisitive8995 ▴ 280

Hi,

I am trying to remove ranges of bases from sequences in a FASTA file. Each sequence may need cuts in several places, and the cuts are chosen by header ID and by positions that I specify.

For example, I have a file, A.fa:

>ID1
TTGTTCAACGGATCCACCTGTTGCCAAGAGTGCTTCAGTACATTGCTCACGGCTGAATCCCATATCCATCAAAGCACAAGATTTGAATTCACTCGAGGATCTGCTTCGTCGACCATTGGAAATGAAAAAATTACAATTACACATTGAATTTGTAAAGCTTGAAATTAATGAACTTACCAAAATAGATTTGCACACAGAAGCAACAGCTTGGCCGTGTTACAACTTGTAACGGGTAAAGACAAAATCGCTAACAACGGTTGTAGGCCACCATGTTCCACAAATTCACGACA
>ID2
ATGGTCGTCCGTTGAATTGT**TACTCAAAAT**TGCGTCGACAAATTTCATCACGTTCATAATGTAGTCAATGAGAACGATTGGAATGCGTTCGGAAGTAGATGATGAAGTCTGTGCAGATTCTTGTTCTGTATTCCCAGTTGCATTT
>ID3
TCTGCA**TTCT**GTCCA**TTGTC**ATCTCTGTGATTGTTGTACGGTGACGTACTTGCTTCTTCTTAGTCTTCATCTTCATCATCATTGCTACCTGCATTCATATCCGGATTATTTGTATAAGATTATTGGAAATGCCTAGCTACACAAATCCTTAAAATAAAAATAGGAAAAAAGTGTAAAAAAATAAAAGAAAAAAAATATTGAATGTAACTCACCTAAAGTAATA

I have another file (X.txt) that lists the FASTA header IDs and the positions to remove. It looks like this:

ID start end 
ID2 20...30 
ID3  6...10, 15...20

I would like to modify A.fa so that bases between 20 and 30 are excluded from sequence ID2, and bases between 6 and 10 and between 15 and 20 are excluded from ID3. The result, B.fa, should look like this:

>ID1
TTGTTCAACGGATCCACCTGTTGCCAAGAGTGCTTCAGTACATTGCTCACGGCTGAATCCCATATCCATCAAAGCACAAGATTTGAATTCACTCGAGGATCTGCTTCGTCGACCATTGGAAATGAAAAAATTACAATTACACATTGAATTTGTAAAGCTTGAAATTAATGAACTTACCAAAATAGATTTGCACACAGAAGCAACAGCTTGGCCGTGTTACAACTTGTAACGGGTAAAGACAAAATCGCTAACAACGGTTGTAGGCCACCATGTTCCACAAATTCACGACA
>ID2 
ATGGTCGTCCGTTGAATTGTTGCGTCGACAAATTTCATCACGTTCATAATGTAGTCAATGAGAACGATTGGAATGCGTTCGGAAGTAGATGATGAAGTCTGTGCAGATTCTTGTTCTGTATTCCCAGTTGCATTT  
>ID3
TCTGCAGTCCATTTCTGTGATTGTTGTACGGTGACGTACTTGCTTCTTCTTAGTCTTCATCTTCATCATCATTGCTACCTGCATTCATATCCGGATTATTTGTATAAGATTATTGGAAATGCCTAGCTACACAAATCCTTAAAATAAAAATAGGAAAAAAGTGTAAAAAAATAAAAGAAAAAAAATATTGAATGTAACTCACCTAAAGTAATA

X.txt lists more than 100 IDs, each with different positions to remove from A.fa. Any help would be appreciated.

Thank you very much

FASTA Assembly • 1.3k views

ADD COMMENT • link updated 20 months ago by Ram 44k • written 3.1 years ago by Inquisitive8995 ▴ 280

score 1 · Answer 1 · 2021-10-05

3.1 years ago

nickp60 ▴ 60

I'd probably do something like the following, assuming you can convert the X.txt positions file you describe into a 1-feature-per-line bed file:

1) Index your sequences (taken from

samtools faidx A.fa

2) Make a BED file of the original sequences using the index:

awk 'BEGIN {FS="\t"}; {print $1 FS "0" FS $2}' A.fa.fai > A.bed

3) Remove the bad regions from the original BED file (note the -v):

bedtools intersect -a A.bed -b X.txt -v > A.goodregions.bed

4) Pull out the good regions:

bedtools getfasta  -fi A.fa -bed A.goodregions.bed > A.goodregions.fa
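
The steps above assume X.txt has already been converted to one feature per line. A minimal Python sketch of that conversion might look like the following. It assumes the "ID start...end, start...end" layout shown in the question, and it assumes the numbers can be used directly as BED start and end values (0-based, half-open), which is consistent with the ID2 example in the question. The name positions_to_bed is illustrative and not part of any tool.

```python
import re


def positions_to_bed(lines):
    """Turn 'ID3  6...10, 15...20' style lines into (id, start, end) tuples.

    Assumption: the numbers are used as-is as BED coordinates.
    """
    records = []
    for line in lines:
        fields = line.split(None, 1)
        if len(fields) < 2:
            continue  # blank line, or an ID with no ranges listed
        seq_id, rest = fields
        # The header line "ID start end" contains no "N...M" pair, so it
        # yields no records and is skipped naturally.
        for start, end in re.findall(r"(\d+)\s*\.\.\.\s*(\d+)", rest):
            records.append((seq_id, int(start), int(end)))
    return records


# The example rows from the question's X.txt.
example = ["ID start end", "ID2 20...30", "ID3  6...10, 15...20"]

# Print tab-separated BED lines, one feature per line.
for seq_id, start, end in positions_to_bed(example):
    print(f"{seq_id}\t{start}\t{end}")
```

Redirecting the printed lines to a file would give a three-column BED file to use in place of X.txt in step 3.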

ADD COMMENT • link 3.1 years ago by nickp60 ▴ 60

Hi, thank you for your response. I created the BED file for X.txt, and it looks like this:

NODE_1138     1535     4521
NODE_11674     1119    2587
NODE_11674     3000    3043
NODE_120      60144   62167

When I run step 3 from your answer, it excludes every node listed in X.txt from A.bed entirely, not just the regions (start to end) given in the file.

Could you please let me know if there is a workaround for this?

ADD REPLY • link 3.1 years ago by Inquisitive8995 ▴ 280

I figured out a way to do it. Instead of 'intersect', 'subtract' works fine for this problem.
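
The difference comes down to how overlaps are handled: 'intersect -v' reports only the features in A that have no overlap with B, so any node with an excluded region is dropped whole, while 'subtract' trims away the overlapping part and keeps the rest. Here is a rough Python model of both behaviours. It is not bedtools code, the function names are illustrative, and the sequence lengths are made up for the example.

```python
def intersect_v(features, exclude):
    """Keep only features with no overlap at all (models `intersect -v`)."""
    return [
        (chrom, start, end)
        for chrom, start, end in features
        if not any(c == chrom and s < end and e > start for c, s, e in exclude)
    ]


def subtract(features, exclude):
    """Trim overlapping parts away and keep the rest (models `subtract`)."""
    kept = []
    for chrom, start, end in features:
        cursor = start  # left edge of the part not yet emitted
        for c, s, e in sorted(exclude):
            if c != chrom or e <= cursor or s >= end:
                continue  # different sequence, or no overlap with the remainder
            if s > cursor:
                kept.append((chrom, cursor, s))
            cursor = max(cursor, e)
        if cursor < end:
            kept.append((chrom, cursor, end))
    return kept


# Whole-sequence features as in A.bed; the lengths here are hypothetical.
features = [("ID1", 0, 300), ("ID2", 0, 140), ("ID3", 0, 230)]
# Regions to remove, taken from the question's X.txt example.
exclude = [("ID2", 20, 30), ("ID3", 6, 10), ("ID3", 15, 20)]

print(intersect_v(features, exclude))
print(subtract(features, exclude))
```

With these inputs the first call keeps only ID1, while the second keeps ID1 whole plus the flanking pieces of ID2 and ID3.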

ADD REPLY • link 3.1 years ago by Inquisitive8995 ▴ 280
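
One thing to keep in mind with the bedtools route: 'bedtools getfasta' writes one FASTA record per BED interval, so the regions left after subtraction come back as separate fragments rather than one joined sequence per ID. If the goal is a B.fa with one record per original ID, as in the question, a minimal Python sketch of doing the removal directly could look like this. It assumes the ranges are 0-based and half-open (consistent with the ID2 example), and the function names are illustrative.

```python
def remove_ranges(sequence, ranges):
    """Return `sequence` without the 0-based, half-open [start, end) ranges."""
    pieces = []
    cursor = 0  # first position not yet copied or skipped
    for start, end in sorted(ranges):
        if start > cursor:
            pieces.append(sequence[cursor:start])
        cursor = max(cursor, end)
    pieces.append(sequence[cursor:])
    return "".join(pieces)


def read_fasta(lines):
    """Yield (id, sequence) pairs; the ID is the first word of the header."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            seq_id, chunks = (line[1:].split() or [""])[0], []
        elif line:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


# First 40 bases of ID2 from the question, with the range 20...30 from X.txt.
fasta = [">ID2", "ATGGTCGTCCGTTGAATTGTTACTCAAAATTGCGTCGACA"]
ranges_by_id = {"ID2": [(20, 30)]}

for seq_id, seq in read_fasta(fasta):
    print(f">{seq_id}")
    print(remove_ranges(seq, ranges_by_id.get(seq_id, [])))
```

Sequences whose ID is not listed in the ranges, such as ID1 in the question, pass through unchanged.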