Remove the last few bases from fasta sequences
2
1
Entering edit mode
7.4 years ago
jjrin ▴ 40

Hello

I was looking through my created fasta files and realized that the number of bases are consistently 6 above the number that it should be according to the coordinates. I am not a strong scripter so how would I remove the last 6 bases from every sequence in my fasta file? Note: each line in the sequence is on a new line so I can't just remove 6 from the end of every other, I would need to iterate through it as a fasta file format.

>gene1
CACTGATTTGAGTTTTTTTCATAAATCAGAAACCGTTTGAATTATAAAAA
AAAAAACCTCACACAGCTCAAACTCAACCTCTGAAAAATAACCCCCAAAG
TGTAGCACTTCTAGCCCAATTCTTTAAACTTTGGTTGAAGGCTTCTGCAT
AGAGACGCGGGAAAGACAGTTTTACTGTTTAGCACTCTATGGAGCAACAT
CTGTAGCAACACTACTGGGGGGCCAGCGCGAGTATATGGACACAAAACAT
CATGTAGTGTAGGATTTCTGAATAGCAATACACCCTTTGTGGTGATGTAA
CAATAAAGGAAAGGGCATATTTTTGATGATCATGAGGTGTAGCCCCT

I would like to remove "GCCCCT" from the end. This is a multifasta file so there are hundreds of similar cases within the same fasta file. Thank you!

RNA-Seq fasta • 6.4k views
ADD COMMENT
1
Entering edit mode

each line in the sequence is on a new line so I can't just remove 6 from the end of every other

You can linearize your fasta to make it a single line fasta instead of multi-line fasta.

ADD REPLY
1
Entering edit mode

This would be most trivial with a Biopython script. You should have a look at the tutorial and cookbook. Feel free to ask for help if you get stuck (but show us the code and what goes wrong). For sure there are also multiple other solutions.

ADD REPLY
0
Entering edit mode

I went over my data again and realized that it's not consistent with 6 bases being out of the coordinates but rather a certain number for each gene (that isn't consistent with all of the bases). How would I go about writing a script that determines how many should be in the sequence by subtracting these two coordinates and then splicing out any bases are past the threshold.

For example:

>MSTRG.5.1_Scaffold100_65415_65755_+
TATATTTATTAATTTATAAAAGTGTAATGATAACTACTTTTAAGACAGCT
GACACATTGTTTCATCCCTAGCATGGCAGGCCTAAGAGATATGTTCATCA
TGGGAAAGCTGTCCTCATTAGAGTCCCTCATCATACAGGAGAAGTCCCAC
ATCACACACACTATTCTTCCATTACTCACACGTCACATGAAGAAAACAGA
CCTGAAGGGTTCAGGACTGGACGTCCCTTGCTACCCATAAAAAATGAGGT
GAGCAAAGCAATTAGCATTTTGACCATCTATGGGAGTGACTAACTATAAG
AGTGACCATCTATAGGAGTGACCAACTATAAGAGTGATCAT

If you subtract 65755 and 65415 you get 340 but there are actually 346 bases here. How would I splice out bases based on the number from the coordinates.

ADD REPLY
1
Entering edit mode

I think you should try to understand why this is happening, instead of blindly removing some bases. What if the leading bases are to be removed? What if no bases are to be removed? What if the numbers are wrong, not the sequences?

Which brings us the question, how have these files been created?

ADD REPLY
0
Entering edit mode

These bases were created using galaxy and a tool called "Extract Genomic DNA". I contacted the writers of it and they said that the tool is likely buggy and that the additional bases are probably due to overlapping reads/exons. So once I spliced out exons and looked at just transcripts I still have a couple of additional bases left, that are just the next couple few from the reference genome. For now, it would be best if I just had the bases that are contained within the coordinates.

ADD REPLY
0
Entering edit mode

not consistent with 6 bases being out of the coordinates but rather a random number for each gene.

I'm not sure if I understand what's going on. But if you can somehow deduce how many nucleotides need to be removed you can put that too in the script for cutting. It obviously makes matters a bit more complicated.

ADD REPLY
0
Entering edit mode

The number of bases that need to be removed are the additional ones that aren't within the coordinates "reading frame". So if I have coordinates of 0-700 and have 750 bases, the last 50 need to be spliced out. I'm sorry for the confusion. So I have coordinates within the name of the gene how would I splice out genes based on it?

ADD REPLY
1
Entering edit mode

So the script would read the header of the fasta, determine the desired number of nucleotides and remove everything that is longer than that, and then move on to the next record.

ADD REPLY
1
Entering edit mode

my fasta file

cat test.fa
>seq
actgactgactgn

fasta file without last 3 bases

seqkit subseq -r 1:-4 test.fa
>seq
actgactgact

fasta file without last 7 bases

seqkit subseq -r 1:-8 test.fa
>seq
actgact

works on fasta file with multiple sequences/records.

ADD REPLY
1
Entering edit mode
7.4 years ago
st.ph.n ★ 2.7k

If you have the gene name, and they match the header in the fasta file, it is still fairly trivial to use python or biopython to trim your sequences.

  1. Linearize your fasta file from multi-line to single line. Pierre's awk statement is the easiest option in the link posted in my comment above. This will make it easier to work with.
  2. Make a file, tab-delimited with the gene name, start coordinate, and end coordinate
  3. Save the code below as trim_fasta.py, and run as trim_fasta.py coords.txt input.fasta output.fasta

See code below:

#!/usr/bin/env python
import sys

    with open(sys.argv[1], 'r') as f:
        coords = {}
        for line in f:
            coords[line.strip().split('\t')[0]] = (int(line.strip().split('\t')[1]), int(line.strip().split('\t')[2]))

    with open(sys.argv[2], 'r') as fasta: 
        with open(sys.argv[3], 'w') as out:
            for line in fasta:
                if line.startswith('>'):
                    header = line.strip().split('>')[1]
                    out.write('>' + header + '\n' + next(fasta).strip()[coords[header][0]:coords[header[1]])
ADD COMMENT
0
Entering edit mode

I added the See code below: line to make sure the markdown was rendered correctly between your numbered list and code block.

ADD REPLY
0
Entering edit mode

Shouldn't the second for line in f: be for line in fasta?

ADD REPLY
0
Entering edit mode

It should. I changed it. Thanks for editing the code block, the 101010 button wasn't working correctly. I'll leave it to the OP to figure out how it works, since I didn't put comments in.

ADD REPLY
0
Entering edit mode

Thank you so much! I had to fix the last line and include some brackets. I also created the linearized original fasta file.

But now I run into this error:

Traceback (most recent call last):
  File "/Users/margaretsaha/Desktop/trim_fasta.py", line 13, in <module>
    out.write('>' + header + '\n' + next(f).strip()[coords[header][0]:coords[header][1]])
ValueError: I/O operation on closed file

Do you have an idea what could be wrong?

This is the format of my original fasta file (but it is linearized, just ignore the bases' formatting right now, this is more for the header):

>MSTRG.5.1_Scaffold100_65415_65755_+
TATATTTATTAATTTATAAAAGTGTAATGATAACTACTTTTAAGACAGCT
GACACATTGTTTCATCCCTAGCATGGCAGGCCTAAGAGATATGTTCATCA
TGGGAAAGCTGTCCTCATTAGAGTCCCTCATCATACAGGAGAAGTCCCAC
ATCACACACACTATTCTTCCATTACTCACACGTCACATGAAGAAAACAGA
CCTGAAGGGTTCAGGACTGGACGTCCCTTGCTACCCATAAAAAATGAGGT
GAGCAAAGCAATTAGCATTTTGACCATCTATGGGAGTGACTAACTATAAG
AGTGACCATCTATAGGAGTGACCAACTATAAGAGTGATCAT
ADD REPLY
1
Entering edit mode

You're right, I missed a closing bracket. The issue is next(f) should be next(fasta), see edit.

If this answer solves your problem, please accept.

ADD REPLY
1
Entering edit mode
7.4 years ago
Charles Plessy ★ 2.9k

If you do not mind that the number of bases per lines gets changed, you can use the seqret command from the EMBOSS package.

$ cat gene1.fa 
>gene1
CACTGATTTGAGTTTTTTTCATAAATCAGAAACCGTTTGAATTATAAAAA
AAAAAACCTCACACAGCTCAAACTCAACCTCTGAAAAATAACCCCCAAAG
TGTAGCACTTCTAGCCCAATTCTTTAAACTTTGGTTGAAGGCTTCTGCAT
AGAGACGCGGGAAAGACAGTTTTACTGTTTAGCACTCTATGGAGCAACAT
CTGTAGCAACACTACTGGGGGGCCAGCGCGAGTATATGGACACAAAACAT
CATGTAGTGTAGGATTTCTGAATAGCAATACACCCTTTGTGGTGATGTAA
CAATAAAGGAAAGGGCATATTTTTGATGATCATGAGGTGTAGCCCCT

$ seqret -filter gene1.fa[:-7]
>gene1
CACTGATTTGAGTTTTTTTCATAAATCAGAAACCGTTTGAATTATAAAAAAAAAAACCTC
ACACAGCTCAAACTCAACCTCTGAAAAATAACCCCCAAAGTGTAGCACTTCTAGCCCAAT
TCTTTAAACTTTGGTTGAAGGCTTCTGCATAGAGACGCGGGAAAGACAGTTTTACTGTTT
AGCACTCTATGGAGCAACATCTGTAGCAACACTACTGGGGGGCCAGCGCGAGTATATGGA
CACAAAACATCATGTAGTGTAGGATTTCTGAATAGCAATACACCCTTTGTGGTGATGTAA
CAATAAAGGAAAGGGCATATTTTTGATGATCATGAGGTGTA
ADD COMMENT

Login before adding your answer.

Traffic: 1898 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6