I have 2 fasta files (File A, File B) each one contains sequence, How can I report the relative frequencies of the nucleotides in set A and set B with python?
Thanks in advance
I have 2 fasta files (File A, File B) each one contains sequence, How can I report the relative frequencies of the nucleotides in set A and set B with python?
Thanks in advance
Hi check the link in the comments and read the biopython documentation. Until then, try this:
#!/usr/bin/env python from Bio import SeqIO for fasta_path in ['/path/to/fasta1', '/path/to/fasta2']: count_nucletides = dict([(i,0) for i in "ACTG"]) for record in SeqIO.parse(fasta_path, 'fasta'): for nucleotide in count_nucletides: count_nucletides[nucleotide] += record.seq.count(nucleotide) print(fasta_path) print('\n'.join(['{}: {}'.format(i,count_nucletides[i]) for i in count_nucletides]))
Try this:
#!/usr/bin/env python3
import os
import sys
from Bio import SeqIO
def main(input_path: str, output_path: str) -> None:
mono = [(i,0) for i in "ACTG"]
di = [(i+y,0) for i in "ACTG" for y in "ACTG"]
count_nucletides = dict(mono + di)
with open(input_path, 'r') as input_handle:
with open(output_path, 'w') as output_handle:
output_handle.write(
'records,{}\n'.format(
','.join(map(lambda x: x[0], sorted(count_nucletides.items(), key=lambda x: x[0])))
)
)
for record in SeqIO.parse(input_handle, 'fasta'):
for nucleotide in count_nucletides:
count_nucletides[nucleotide] += record.seq.count(nucleotide)
count = list(map(lambda x: x[1], sorted(count_nucletides.items(), key=lambda x: x[0])))
output_handle.write("{},{}\n".format(
record.id,
','.join(map(str, count))
))
return None
if __name__ == "__main__":
main(
input_path=sys.argv[1],
output_path=sys.argv[2]
)
Save to a file "count_mono_di.py". Run:
python3 count_mono_di.py input.fasta output.csv
I use this sample to test:
>record_1
ACGTAGGGATACACATAGACATGACATGAAAACTCGATAGCTAGATCGATATAGCGCTAG
ACGTAGGGATACACATAGACATGACATGAAAACTCGATAGCTAGATCGATATAGCGCTAG
CGATATAGCGCTAG
>record_2
AAATGCTGATCGATCGATCGATATATATAGCGCTAGAGATAGATCGATAGCATCGATACG
AAATGCTGATCGATCGATCGATATATATAGCGCTAGAGATAGATCGATAGCATCGATACG
ATATAGCGCTAGAGATAGATCGATAGCATCGATACG
This was the output:
records,A,AA,AC,AG,AT,C,CA,CC,CG,CT,G,GA,GC,GG,GT,T,TA,TC,TG,TT
record_1,50,4,12,14,18,25,8,0,10,7,32,16,9,2,2,27,19,4,4,0
record_2,105,6,15,29,51,51,11,0,28,12,69,41,20,2,2,65,41,16,8,0
Hugo, I really appreciate your effort. This script helps! Would it be possible that you modify this script so that the result is the frequencies of mono- or dinucleotides instead of counts? For example, would it be possible that the output looks like below:
records,Freq_A, Freq_C, Freq_G, Freq_T, Freq_AA, Freq_AC, Freq_AG, Freq_AT, Freq_CA, Freq_CC, ....
record_1, values of corresponding frequencies
record_2, values of corresponding frequencies
Besides, I would like to acknowledge your effort in my publications when I use this script, please let me know how can I cite this script (your papers, or your GitHub website, etc) ?
Hi ! Well, i made a few mistakes so first i will correct then and format the output as you want:
#!/usr/bin/env python3
import os
from sys import argv
from collections import OrderedDict
from itertools import product
from Bio import SeqIO
def get_word_dict(word_size: int = 1, recursive = True):
count_dict = []
if not recursive:
for y in product("ACGT", repeat=word_size):
count_dict.append(''.join(y))
else:
for i in range(1, word_size + 1):
for y in product("ACGT", repeat=i):
count_dict.append(''.join(y))
return OrderedDict.fromkeys(count_dict, 0)
def freq_from_dict(nucleotide_dict, record):
for nucleotide in nucleotide_dict:
if len(nucleotide_dict) == 1:
nucleotide_dict[nucleotide] += record.seq.count(nucleotide)
else:
nucleotide_dict[nucleotide] += record.seq.count_overlap(nucleotide)
sum_dict = {}
for nucleotide in nucleotide_dict:
length_kmer = len(nucleotide)
if not length_kmer in sum_dict:
sum_dict[length_kmer] = sum([y for i,y in nucleotide_dict.items() if len(i) == length_kmer])
for nucleotide in nucleotide_dict:
nucleotide_dict[nucleotide] = nucleotide_dict[nucleotide]/sum_dict[len(nucleotide)]
return nucleotide_dict
def main(input_path: str, output_path: str, word_size: int = 1, recursive = True) -> None:
result = []
with open(input_path, 'r') as input_handle:
for record in SeqIO.parse(input_handle, 'fasta'):
freq_dict = freq_from_dict(get_word_dict(word_size=word_size, recursive=recursive), record=record)
result.append('{},{}\n'.format(record.id, ','.join(map(str ,freq_dict.values()))))
with open(output_path, 'w') as output_handle:
output_handle.write('records,{}\n'.format(','.join(freq_dict.keys())))
for i in result:
output_handle.write(i)
if __name__ == "__main__":
main(
input_path=argv[1],
output_path=argv[2],
word_size=(int(argv[3])),
recursive=(True if argv[4] else False)
)
I used the same sample as above, check the results, maybe I'm wrong again. To obtain the frequency, i divide the total count by the sum of all k mers of the same length.
I made other changes, now you can pass a word length and if you want to build the dictionary count recursively, for example if your word length is 3 the script will count all 3-mer, 2-mer and 1-mer.
You can run like this (mind the order of args):
# python3 count_uni_di.py input_path output_path word_size true
# or
# python3 count_uni_di.py input_path output_path word_size
python3 count_uni_di.py sample.fasta output.csv 2 true
As for acknowledgement, a positive vote is more than enough. Check out this EMBOSS tool, I think it's best suited for heavy use and citation. After that, if you still want to use this script i can post it on github as a snippet.
Hi, Hugo,
I really appreciate your effort in writing this script! I test your script with a fasta file containing 98,913 sequences, and your script works without any problem of generating a table containing mono- and dinucleotide frequencies of each sequence in the fasta file! I used EMBOSS tool to check a couple of sequences from the fasta file and the results I got from EMBOSS match the results from your script. I would say this is amazing! I hope you can put this on github so that other scientists (especially someone like me, who does not have computational background) could benefit from it. I am sure I will keep using your script and I will acknowledge it in my publications.
Regards,
Xiao
Hi, Hugo, I acknowledged you in my paper recently published. Thanks for your help! Xiao PS, paper link https://elifesciences.org/articles/83548#info
Hi Xiao,
Thanks for getting back to me! I'm so glad to hear that my script was able to make a difference for you.
I wanted to also thank you for including me in the acknowledgment section of your publication. I haven't had a chance to read the paper in full yet, but it looks very cool. I'm honored to be a part of your work and I'm glad my orcid information is there to help others find me.
Once again, thanks for your support. I hope your research continues to be successful and I wish you all the best !
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
The rationale behind getting nucleotides' relative frequencies is very simple: read each fasta file, discard headers, split sequences into nucleotides, and count each one independently. If you need help coding it you'd better show what you've tried so far.
duplicate: nucleotides content determination using python
Hi Hugo, I found this script super helpful. How can I change it to give the frequencies for triplets? I ultimately want to know the frequencies between two or more sequences of aminoacid changes. Thanks in advance!
Hello :v
Try this:
If that doesn't work, post a sample input and desired output so we can figure out how to do what you want ;)