Hello
I was looking through the fasta files I created and realized that the number of bases is consistently 6 more than it should be according to the coordinates. I am not a strong scripter, so how would I remove the last 6 bases from every sequence in my fasta file? Note: each sequence is wrapped over several lines, so I can't just remove 6 characters from the end of every other line; I would need to process the file record by record as fasta.
>gene1
CACTGATTTGAGTTTTTTTCATAAATCAGAAACCGTTTGAATTATAAAAA
AAAAAACCTCACACAGCTCAAACTCAACCTCTGAAAAATAACCCCCAAAG
TGTAGCACTTCTAGCCCAATTCTTTAAACTTTGGTTGAAGGCTTCTGCAT
AGAGACGCGGGAAAGACAGTTTTACTGTTTAGCACTCTATGGAGCAACAT
CTGTAGCAACACTACTGGGGGGCCAGCGCGAGTATATGGACACAAAACAT
CATGTAGTGTAGGATTTCTGAATAGCAATACACCCTTTGTGGTGATGTAA
CAATAAAGGAAAGGGCATATTTTTGATGATCATGAGGTGTAGCCCCT
I would like to remove "GCCCCT" from the end. This is a multifasta file so there are hundreds of similar cases within the same fasta file. Thank you!
You can linearize your fasta to make it a single line fasta instead of multi-line fasta.
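A minimal sketch of that linearization in plain Python, assuming the file fits the usual layout of a ">" header line followed by wrapped sequence lines. The function name `linearize_fasta` and the tiny example records are made up for illustration.

```python
def linearize_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines into one string."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record example; with a real file, pass an open file handle instead.
example = """>gene1
CACTG
ATTTG
>gene2
AAAA
CC
""".splitlines()

for header, seq in linearize_fasta(example):
    print(header)
    print(seq)
```

Once each sequence sits on a single line, removing a fixed number of trailing bases is an ordinary string slice such as `seq[:-6]`.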
This would be most trivial with a Biopython script. You should have a look at the tutorial and cookbook. Feel free to ask for help if you get stuck (but show us the code and what goes wrong). For sure there are also multiple other solutions.
I went over my data again and realized that the excess is not consistently 6 bases; rather, it is a different number for each gene. How would I go about writing a script that determines how many bases should be in the sequence by subtracting the two coordinates, and then removes any bases past that threshold?
For example:
If you subtract 65415 from 65755 you get 340, but there are actually 346 bases here. How would I trim the sequence based on the number derived from the coordinates?
I think you should try to understand why this is happening, instead of blindly removing some bases. What if the leading bases are to be removed? What if no bases are to be removed? What if the numbers are wrong, not the sequences?
Which brings us to the question: how were these files created?
These sequences were created using Galaxy and a tool called "Extract Genomic DNA". I contacted its authors and they said that the tool is likely buggy and that the additional bases are probably due to overlapping reads/exons. But once I spliced out exons and looked at just transcripts, I still had a few additional bases left, which are just the next few bases from the reference genome. For now, it would be best if I just had the bases that are contained within the coordinates.
I'm not sure if I understand what's going on. But if you can somehow deduce how many nucleotides need to be removed you can put that too in the script for cutting. It obviously makes matters a bit more complicated.
The bases that need to be removed are the additional ones that fall outside the coordinate range. So if I have coordinates of 0-700 and have 750 bases, the last 50 need to be removed. I'm sorry for the confusion. The coordinates are in the name of each gene, so how would I trim the sequences based on them?
So the script would read the header of the fasta, determine the desired number of nucleotides and remove everything that is longer than that, and then move on to the next record.
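A rough sketch of such a script in plain Python follows. The header layout is an assumption, since the thread does not show one: it supposes headers like ">gene1:65415-65755", with the coordinates written as start-end. The expected length is taken as end minus start, matching the arithmetic given in the question; if the coordinates are 1-based and inclusive, it would be one more. All function names are hypothetical.

```python
import re

# ASSUMPTION: the header contains coordinates as "start-end", e.g. ">gene1:65415-65755".
COORDS = re.compile(r"(\d+)-(\d+)")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of fasta lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def trim_to_coordinates(header, seq):
    """Cut seq down to the length implied by the header coordinates.

    Sequences that are already short enough, or whose header has no
    recognizable coordinates, are returned unchanged.
    """
    match = COORDS.search(header)
    if match is None:
        return seq
    start, end = int(match.group(1)), int(match.group(2))
    expected = abs(end - start)  # length as computed in the question: end - start
    return seq[:expected]


def trim_fasta(lines, width=50):
    """Return the output fasta lines, re-wrapped at `width` bases per line."""
    out = []
    for header, seq in read_fasta(lines):
        seq = trim_to_coordinates(header, seq)
        out.append(header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return out


# Hypothetical record: coordinates imply 10 bases, but 14 are present.
demo = [">geneA:0-10", "ACGTACG", "TACGTAC"]
print("\n".join(trim_fasta(demo)))
```

This only ever removes bases from the end of a record. It does not check whether the extra bases are really trailing ones, so the concerns raised above about understanding where they come from still apply.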
my fasta file
fasta file without last 3 bases
fasta file without last 7 bases
works on fasta file with multiple sequences/records.
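For comparison, removing a fixed number of trailing bases from every record can also be sketched in a few lines of Python. This is an illustration, not the commands used above: the function name `drop_trailing_bases` is made up, and splitting on ">" assumes that character appears only at the start of header lines.

```python
def drop_trailing_bases(fasta_text, n, width=50):
    """Remove the last n bases from every record and re-wrap at `width` per line."""
    out = []
    # Everything before the first ">" is ignored; each remaining piece is one record.
    for record in fasta_text.split(">")[1:]:
        header, _, body = record.partition("\n")
        seq = "".join(body.split())  # join wrapped lines, dropping all whitespace
        if n > 0:
            seq = seq[:-n]  # n == 0 is handled separately because seq[:-0] is empty
        out.append(">" + header.strip())
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


# Hypothetical two-record input; the second record ends in the "GCCCCT" from the question.
text = ">a\nACGTAC\nGT\n>b\nTTTTGCCCCT\n"
print(drop_trailing_bases(text, 6), end="")
```

With a real file, the text would come from something like `open("my.fasta").read()`, where the file name is a placeholder.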