I have a lot of fasta files, each one with thousand of reads containing the hexameric motif "CCCTCT". The hexameric motif is highly continuous in most cases but interruptions may occur. I need to count the hexameric motif keeping the read ID and identify their structure (CCCTCT)n. These interruptions are real and do not come from Sequencing error
Example Sequence:(interruption in bold) ACATTTTTTTTCCACATCTGATGTGGAAAAAAAAAAAAATGAAATAGCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCTCCCTCTCCCTCTCCCCACGGTCTCCCTCTCATGCGGAGCCAAAGCTGGACTGTACTGCT
Strategy 1: Count the CCCTCT motif grep -on "CCCTCT" fasta.fa | cut -f 1 -d ":" | sort | uniq -c
Output: 23 CCCTCT As you can see we have one interruption (CCT) in this sequence which the codes doesn't detect.
Strategy 2: Identify reads with the same structure and count repeats in each read
from Bio import SeqIO from collections import defaultdict
dedup_records = defaultdict(list)
for record in SeqIO.parse("fasta.fa", "fasta"): # Use the sequence as the key and then have a list of id's as the value
dedup_records[str(record.seq)].append(record.id)
with open("Output.fasta", 'w') as output:
for seq, ids in sorted(dedup_records.items(), key=lambda t: len(t[1]), reverse=True): # Join the ids and write them out as the fasta
output.write(">{}_counts{}\n".format(ids[0], len(ids))) output.write(seq + "\n")
Output: list ranking of reads with identical structures but does not show the structure
ID/ccs_counts2 CCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCTCCCTCTCCCTCT.....
Can you please give me an idea of how to obtain the following output? example of a sequence of 23 hexamers with interruptions.
(CCCTCT)20 [INTERRUPTION] (CCCTCT)3
I have used expansion hunter, repeatFinder and repeatAnalysis-tool by PacBio but they were not particularly sensitive for this case. Any help would be appreciated.
Thanks
Please use
101010
button to format the code portion of your post properly. I would have done it but it looks like you wrapped your code around on multiple lines so I don't want to mess up the formatting.Take a look at
seqkit
(LINK).seqkit grep
(LINK) would help you to identify these patterns.Writing a script is the best way, by searching with a regular expression
(CCCTCT)+
. If two or more matches are returned, then count and extract the spacing sequences.seqkit locate
can find all the hexamers, but it is not easy to get the "spacing" sequences, using bedtools?I couldn't replicate this second example.
Output: [ERRO] Undefined function len [ERRO] xopen: no content [ERRO] xopen: no content
Oh, the feature is not released yet. download here.
If you only need to filter these sequences, use:
For details:
Thanks. This tutorial was really helpful