Relative frequencies of the nucleotides in Fasta files

0

Entering edit mode

4.2 years ago

Amr ▴ 180

I have 2 fasta files (File A, File B) each one contains sequence, How can I report the relative frequencies of the nucleotides in set A and set B with python?

Thanks in advance

fasta • 6.2k views

ADD COMMENT • link updated 2.3 years ago by xiaoleiusc ▴ 140 • written 4.2 years ago by Amr ▴ 180

1

Entering edit mode

The rationale behind getting nucleotides' relative frequencies is very simple: read each fasta file, discard headers, split sequences into nucleotides, and count each one independently. If you need help coding it you'd better show what you've tried so far.

ADD REPLY • link 4.2 years ago by Jorge Amigo 14k

0

Entering edit mode

duplicate: nucleotides content determination using python

ADD REPLY • link 4.2 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Hi Hugo, I found this script super helpful. How can I change it to give the frequencies for triplets? I ultimately want to know the frequencies between two or more sequences of aminoacid changes. Thanks in advance!

ADD REPLY • link 2.5 years ago by cmavian • 0

0

Entering edit mode

Hello :v

Try this:

python3 count_uni_di.py sample.fasta output.csv 3

If that doesn't work, post a sample input and desired output so we can figure out how to do what you want ;)

ADD REPLY • link 2.5 years ago by hugo.avila ▴ 540

1

Entering edit mode

4.2 years ago

hugo.avila ▴ 540

Hi check the link in the comments and read the biopython documentation. Until then, try this:

#!/usr/bin/env python
from Bio import SeqIO
for fasta_path in ['/path/to/fasta1', '/path/to/fasta2']:
    count_nucletides = dict([(i,0) for i in "ACTG"])
    for record in SeqIO.parse(fasta_path, 'fasta'):
        for nucleotide in count_nucletides:
            count_nucletides[nucleotide] += record.seq.count(nucleotide)
    print(fasta_path)
    print('\n'.join(['{}: {}'.format(i,count_nucletides[i]) for i in count_nucletides]))

ADD COMMENT • link 4.2 years ago by hugo.avila ▴ 540

0

Entering edit mode

Thanks, does this script count mono-nucleotide frequencies (not di-nucleotide frequencies) of a fasta file?

ADD REPLY • link 3.6 years ago by xiaoleiusc ▴ 140

1

Entering edit mode

Hi, it counts only mono-nucleotide frequencies (A, C, G and T). If you want to count di-nucleotides, replace the 4° line for this:

dict([(i+y,0) for i in "ACTG" for y in "ACTG"])

it is not very pretty but i think will do the job.

ADD REPLY • link 3.6 years ago by hugo.avila ▴ 540

0

Entering edit mode

Thanks, Hugo, if I have a fasta file containing thousands of sequences, and I wonder if the script could generate a table containing mono- and di-nucleotides frequencies of each sequence in the fasta file?

ADD REPLY • link 3.6 years ago by xiaoleiusc ▴ 140

1

Entering edit mode

Try this:

#!/usr/bin/env python3

import os
import sys
from Bio import SeqIO

def main(input_path: str, output_path: str) -> None:
    mono = [(i,0) for i in "ACTG"]
    di = [(i+y,0) for i in "ACTG" for y in "ACTG"]
    count_nucletides = dict(mono + di)
    with open(input_path, 'r') as input_handle:
        with open(output_path, 'w') as output_handle:
            output_handle.write(
                'records,{}\n'.format(
                    ','.join(map(lambda x: x[0], sorted(count_nucletides.items(), key=lambda x: x[0])))
                )
            )
            for record in SeqIO.parse(input_handle, 'fasta'):
                for nucleotide in count_nucletides:
                    count_nucletides[nucleotide] += record.seq.count(nucleotide)
                count = list(map(lambda x: x[1], sorted(count_nucletides.items(), key=lambda x: x[0])))
                output_handle.write("{},{}\n".format(
                    record.id,
                    ','.join(map(str, count))
                ))
    return None

if __name__ == "__main__":
    main( 
        input_path=sys.argv[1],
        output_path=sys.argv[2]
    )

Save to a file "count_mono_di.py". Run:

python3 count_mono_di.py input.fasta output.csv

I use this sample to test:

>record_1
ACGTAGGGATACACATAGACATGACATGAAAACTCGATAGCTAGATCGATATAGCGCTAG
ACGTAGGGATACACATAGACATGACATGAAAACTCGATAGCTAGATCGATATAGCGCTAG
CGATATAGCGCTAG
>record_2
AAATGCTGATCGATCGATCGATATATATAGCGCTAGAGATAGATCGATAGCATCGATACG
AAATGCTGATCGATCGATCGATATATATAGCGCTAGAGATAGATCGATAGCATCGATACG
ATATAGCGCTAGAGATAGATCGATAGCATCGATACG

This was the output:

records,A,AA,AC,AG,AT,C,CA,CC,CG,CT,G,GA,GC,GG,GT,T,TA,TC,TG,TT
record_1,50,4,12,14,18,25,8,0,10,7,32,16,9,2,2,27,19,4,4,0
record_2,105,6,15,29,51,51,11,0,28,12,69,41,20,2,2,65,41,16,8,0

ADD REPLY • link 3.6 years ago by hugo.avila ▴ 540

0

Entering edit mode

Hugo, I really appreciate your effort. This script helps! Would it be possible that you modify this script so that the result is the frequencies of mono- or dinucleotides instead of counts? For example, would it be possible that the output looks like below:

records,Freq_A, Freq_C, Freq_G, Freq_T, Freq_AA, Freq_AC, Freq_AG, Freq_AT, Freq_CA, Freq_CC, ....

record_1, values of corresponding frequencies

record_2, values of corresponding frequencies

Besides, I would like to acknowledge your effort in my publications when I use this script, please let me know how can I cite this script (your papers, or your GitHub website, etc) ?

ADD REPLY • link 3.6 years ago by xiaoleiusc ▴ 140

1

Entering edit mode

Hi ! Well, i made a few mistakes so first i will correct then and format the output as you want:

#!/usr/bin/env python3

import os
from sys import argv
from collections import OrderedDict
from itertools import product
from Bio import SeqIO

def get_word_dict(word_size: int = 1, recursive = True):
    count_dict = []
    if not recursive:
        for y in product("ACGT", repeat=word_size):
            count_dict.append(''.join(y))
    else:
        for i in range(1, word_size + 1):
            for y in product("ACGT", repeat=i):
                count_dict.append(''.join(y))
    return OrderedDict.fromkeys(count_dict, 0)

def freq_from_dict(nucleotide_dict, record):
    for nucleotide in nucleotide_dict:
        if len(nucleotide_dict) == 1:
            nucleotide_dict[nucleotide] += record.seq.count(nucleotide)
        else:
            nucleotide_dict[nucleotide] += record.seq.count_overlap(nucleotide)
    sum_dict = {}
    for nucleotide in nucleotide_dict:
        length_kmer = len(nucleotide)
        if not length_kmer in sum_dict:
            sum_dict[length_kmer] = sum([y for i,y in nucleotide_dict.items() if len(i) == length_kmer])
    for nucleotide in nucleotide_dict:
        nucleotide_dict[nucleotide] = nucleotide_dict[nucleotide]/sum_dict[len(nucleotide)]
    return nucleotide_dict

def main(input_path: str, output_path: str, word_size: int = 1, recursive = True) -> None:
    result = []
    with open(input_path, 'r') as input_handle:
        for record in SeqIO.parse(input_handle, 'fasta'):
            freq_dict = freq_from_dict(get_word_dict(word_size=word_size, recursive=recursive), record=record)
            result.append('{},{}\n'.format(record.id, ','.join(map(str ,freq_dict.values()))))
    with open(output_path, 'w') as output_handle:
        output_handle.write('records,{}\n'.format(','.join(freq_dict.keys())))
        for i in result:
            output_handle.write(i)


if __name__ == "__main__":
    main( 
        input_path=argv[1],
        output_path=argv[2],
        word_size=(int(argv[3])),
        recursive=(True if argv[4] else False)
    )

I used the same sample as above, check the results, maybe I'm wrong again. To obtain the frequency, i divide the total count by the sum of all k mers of the same length.

I made other changes, now you can pass a word length and if you want to build the dictionary count recursively, for example if your word length is 3 the script will count all 3-mer, 2-mer and 1-mer.

You can run like this (mind the order of args):

# python3 count_uni_di.py input_path output_path word_size true
# or
# python3 count_uni_di.py input_path output_path word_size

python3 count_uni_di.py sample.fasta output.csv 2 true

As for acknowledgement, a positive vote is more than enough. Check out this EMBOSS tool, I think it's best suited for heavy use and citation. After that, if you still want to use this script i can post it on github as a snippet.

ADD REPLY • link 3.6 years ago by hugo.avila ▴ 540

1

Entering edit mode

Hi, Hugo,

I really appreciate your effort in writing this script! I test your script with a fasta file containing 98,913 sequences, and your script works without any problem of generating a table containing mono- and dinucleotide frequencies of each sequence in the fasta file! I used EMBOSS tool to check a couple of sequences from the fasta file and the results I got from EMBOSS match the results from your script. I would say this is amazing! I hope you can put this on github so that other scientists (especially someone like me, who does not have computational background) could benefit from it. I am sure I will keep using your script and I will acknowledge it in my publications.

Regards,

Xiao

ADD REPLY • link 3.6 years ago by xiaoleiusc ▴ 140

0

Entering edit mode

Hi, Hugo, May I know your full name? I would like to put your name into the acknowledgment session of my paper. Thanks!

ADD REPLY • link 2.9 years ago by xiaoleiusc ▴ 140

1

Entering edit mode

Hi, I'm glad this snippet helped you. Of course you can have my name, here is my orcid, it contains my information. Please attach it next to my name, it helps me to show up in Google searches.

ADD REPLY • link 2.9 years ago by hugo.avila ▴ 540

0

Entering edit mode

Hi, Hugo, I acknowledged you in my paper recently published. Thanks for your help! Xiao PS, paper link https://elifesciences.org/articles/83548#info

ADD REPLY • link 2.3 years ago by xiaoleiusc ▴ 140

0

Entering edit mode

Hi Xiao,

Thanks for getting back to me! I'm so glad to hear that my script was able to make a difference for you.

I wanted to also thank you for including me in the acknowledgment section of your publication. I haven't had a chance to read the paper in full yet, but it looks very cool. I'm honored to be a part of your work and I'm glad my orcid information is there to help others find me.

Once again, thanks for your support. I hope your research continues to be successful and I wish you all the best !

ADD REPLY • link 2.3 years ago by hugo.avila ▴ 540

1

Entering edit mode

Thank you Hugo!

ADD REPLY • link 2.3 years ago by xiaoleiusc ▴ 140

1

Entering edit mode

3.6 years ago

hugo.avila ▴ 540

Thanks !

	#!/usr/bin/env python3

	import os
	from sys import argv
	from collections import OrderedDict
	from itertools import product
	from Bio import SeqIO

	def get_word_dict(word_size: int = 1, recursive = True):
	"""Get a orderedDict of k-mers with zero values: {"A": 0, "C": 0 ...}
	Word size: k-mer length (3-mer = "AAA", "AAC" ...).
	recursive: If True will create k-1 k-mers as well: word size = 3 (3-mer, 2-mer, 1-mer).
	"""
	count_dict = []
	if not recursive:
	for y in product("ACGT", repeat=word_size):
	count_dict.append(''.join(y))
	else:
	for i in range(1, word_size + 1):
	for y in product("ACGT", repeat=i):
	count_dict.append(''.join(y))
	return OrderedDict.fromkeys(count_dict, 0)

	def freq_from_dict(nucleotide_dict, record):
	"""Get frequence of k-mers in a biopython SeqIO object.
	For k > 1, the count is overlapped: 2-mer and record="AAATGC" -> AA=2, AT=1, TG=1, GC=1.
	frequence = pattern/sum of all same length patterns. Ex: AA/(AA+AC+AG+AT+CA+CC+CG+CT+GA+GC+GG+GT+TA+TC+TG+TT)
	"""
	for nucleotide in nucleotide_dict:
	if len(nucleotide_dict) == 1:
	nucleotide_dict[nucleotide] += record.seq.count(nucleotide)
	else:
	nucleotide_dict[nucleotide] += record.seq.count_overlap(nucleotide)
	sum_dict = {}
	for nucleotide in nucleotide_dict:
	length_kmer = len(nucleotide)
	if not length_kmer in sum_dict:
	sum_dict[length_kmer] = sum([y for i,y in nucleotide_dict.items() if len(i) == length_kmer])
	for nucleotide in nucleotide_dict:
	nucleotide_dict[nucleotide] = nucleotide_dict[nucleotide]/sum_dict[len(nucleotide)]
	return nucleotide_dict

	def main(input_path: str, output_path: str, word_size: int = 1, recursive = True) -> None:
	""" Reads a fasta file and outputs a csv with fasta ids and k-mer frequencies.
	input_path and output_path: strings with paths.
	Word size: k-mer length (3-mer = "AAA", "AAC" ...).
	recursive: If True will create k-1 k-mers as well: word size = 3 (3-mer, 2-mer, 1-mer).
	"""
	result = []
	with open(input_path, 'r') as input_handle:
	for record in SeqIO.parse(input_handle, 'fasta'):
	freq_dict = freq_from_dict(get_word_dict(word_size=word_size, recursive=recursive), record=record)
	result.append('{},{}\n'.format(record.id, ','.join(map(str ,freq_dict.values()))))
	with open(output_path, 'w') as output_handle:
	output_handle.write('records,{}\n'.format(','.join(freq_dict.keys())))
	for i in result:
	output_handle.write(i)


	if __name__ == "__main__":
	main(
	input_path=argv[1],
	output_path=argv[2],
	word_size=(int(argv[3])),
	recursive=(True if argv[4] else False)
	)

view raw get_nucl_freq_from_fasta.py hosted with ❤ by GitHub

ADD COMMENT • link 3.6 years ago by hugo.avila ▴ 540

0

Entering edit mode

Anyone finding this thread by search, please use this script since it is hosted on GitHub and is the latest version.

ADD REPLY • link 2.5 years ago by GenoMax 151k

Login before adding your answer.