Question

Count duplicate sequences in a FASTA file using Python

4.9 years ago

jiseon824 • 0

Hello,

I am new to Python and bioinformatics.

I have to analyze the data in a massive FASTA file.

I want to count how many times each sequence is repeated, using Python.

test.fasta

>1234
cagatcaccttgaagtcgtctgctcctacgctggtgaaacctacac

>456
cagatcaccttgaagtcgtctgctcctacgctggtgaaacctacac

>67
cagatcaccttgaagtcgtctgctcctacgctggtgaaacctacac

>123
cagatcaccttgaagtcgtctgctcctacgctggtgaaacctacac

>57
gccttctctgggttctcactcagcactagtggagtgggtgtgggctggatccgtaagcccccaggaaaggccctggagtggcttgcactca

>35
cagatcaccttgaagtcgtctgctcctacgctggtgaaacctacac

>123
gccttctctgggttctcactcagcactagtggagtgggtgtgggctggatccgtaagcccccaggaaaggccctggagtggcttgcactca

>222
gccttctctgggttctcactcagcactagtggagtgggtgtgggctggatccgtaagcccccaggaaaggccctggagtggcttgcactca

Because I am new to Python, I have not been able to write any code myself. I searched online but could not find any example code that I could copy and follow.

Can someone help me count the number of times each sequence is duplicated?

If a reference is needed, I can make a reference file (CSV or FASTA).

[What I want, in a CSV file:] each sequence and the number of times it is repeated

cagatcaccttgaagtcgtctgctcctacgctggtgaaacctacac    5    
gccttctctgggttctcactcagcactagtggagtgggtgtgggctggatccgtaagcccccaggaaaggccctggagtggcttgcactca    3

or the ID from the reference file and the number of times it is repeated

ref#1       5
ref#2       3
.
.

Thank you in advance

rna-seq • 3.4k views

ADD COMMENT • link updated 4.9 years ago by Ram 45k • written 4.9 years ago by jiseon824 • 0

Do you want to know whether full-length sequences are repeated, or are you interested in the number of occurrences of a specific set of subsequence patterns?

ADD REPLY • link 4.9 years ago by Joe 22k

Hi,

I want to count the number of occurrences of specific reference sequences listed in a reference file. For example, if I make a reference file as below:

>ref#1
cagatcaccttgaagtcgtctgctcctacgctggtgaaacctacac

>ref#2
gccttctctgggttctcactcagcactagtggagtgggtgtgggctggatccgtaagcccccaggaaaggccctggagtggcttgcactca

then it should count the frequency of each reference. The actual reference sequences are longer than in this example, usually more than 500 bp. I have a FASTA file, and I have to analyze it to count the number of sequence reads matching each reference.
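A minimal sketch of the reference-based count described above, assuming a read counts toward a reference only when its full sequence matches that reference exactly (ignoring case). The file names and the helper names `read_fasta` and `count_by_reference` are hypothetical, chosen only for illustration. Reads that are substrings or near-matches of a 500 bp reference would need an alignment tool instead.

```python
from collections import Counter


def read_fasta(path):
    """Yield (ID, sequence) pairs from a FASTA file, one record at a time."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:].strip(), []
            else:
                chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def count_by_reference(reference_path, reads_path):
    """Count reads whose whole sequence equals a reference sequence."""
    # Map each reference sequence to its ID.
    # Assumption: no two references share the same sequence.
    ref_ids = {seq.lower(): ref_id for ref_id, seq in read_fasta(reference_path)}
    # Start every reference at zero so unmatched references still appear.
    counts = Counter({ref_id: 0 for ref_id in ref_ids.values()})
    for _, seq in read_fasta(reads_path):
        ref_id = ref_ids.get(seq.lower())
        if ref_id is not None:
            counts[ref_id] += 1
    return counts
```

Calling `count_by_reference("reference.fasta", "test.fasta")` with the two example files from this thread should give 5 for ref#1 and 3 for ref#2. Only the reference sequences are held in memory; the reads are streamed one record at a time.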

ADD REPLY • link updated 4.9 years ago by GenoMax 151k • written 4.9 years ago by jiseon824 • 0

Duplicates by sequence or by ID?

ADD REPLY • link 4.9 years ago by cpad0112 21k

Hi

You should go through this link: https://stackoverflow.com/questions/55226949/how-to-get-the-count-of-duplicated-sequences-in-fasta-file-using-python

You can easily redirect the output to a CSV file or to any other format you want.
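For readers who cannot follow the link, here is a minimal sketch of one way to do this with only the Python standard library. It is not necessarily the approach in the linked answer. It assumes duplicates mean exact full-length matches (ignoring case), and the function names `read_fasta`, `count_sequences` and `write_counts` are hypothetical.

```python
import csv
from collections import Counter


def read_fasta(path):
    """Yield (ID, sequence) pairs from a FASTA file, one record at a time."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:].strip(), []
            else:
                chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def count_sequences(path):
    """Count how many records share each exact sequence (case-insensitive)."""
    return Counter(seq.lower() for _, seq in read_fasta(path))


def write_counts(counts, out_path):
    """Write sequence/count rows to a CSV file, most frequent first."""
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["sequence", "count"])
        writer.writerows(counts.most_common())
```

A call such as `write_counts(count_sequences("test.fasta"), "counts.csv")` would produce the CSV; `counts.csv` is just an example output name. The file is read one line at a time, so memory use grows with the number of distinct sequences and not with the total number of records.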

ADD REPLY • link 4.9 years ago by gayachit ▴ 200

Thank you so much. It works well. :) I hope it also works well with my massive data set.

ADD REPLY • link 4.9 years ago by jiseon824 • 0