Counting hexamers in fasta sequence and identify its structure (and interruptions)
0
0
Entering edit mode
3.1 years ago
AMARU • 0

I have a lot of fasta files, each one with thousand of reads containing the hexameric motif "CCCTCT". The hexameric motif is highly continuous in most cases but interruptions may occur. I need to count the hexameric motif keeping the read ID and identify their structure (CCCTCT)n. These interruptions are real and do not come from Sequencing error

Example Sequence:(interruption in bold) ACATTTTTTTTCCACATCTGATGTGGAAAAAAAAAAAAATGAAATAGCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCTCCCTCTCCCTCTCCCCACGGTCTCCCTCTCATGCGGAGCCAAAGCTGGACTGTACTGCT

Strategy 1: Count the CCCTCT motif grep -on "CCCTCT" fasta.fa | cut -f 1 -d ":" | sort | uniq -c

Output: 23 CCCTCT As you can see we have one interruption (CCT) in this sequence which the codes doesn't detect.

Strategy 2: Identify reads with the same structure and count repeats in each read

from Bio import SeqIO from collections import defaultdict

dedup_records = defaultdict(list) 
for record in SeqIO.parse("fasta.fa", "fasta"): # Use the sequence as the key and then have a list of id's as the value 
     dedup_records[str(record.seq)].append(record.id) 
with open("Output.fasta", 'w') as output: 
     for seq, ids in sorted(dedup_records.items(), key=lambda t: len(t[1]), reverse=True): # Join the ids and write them out as the fasta 
          output.write(">{}_counts{}\n".format(ids[0], len(ids))) output.write(seq + "\n")

Output: list ranking of reads with identical structures but does not show the structure

ID/ccs_counts2 CCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCTCCCTCTCCCTCT.....

Can you please give me an idea of how to obtain the following output? example of a sequence of 23 hexamers with interruptions.

(CCCTCT)20 [INTERRUPTION] (CCCTCT)3

I have used expansion hunter, repeatFinder and repeatAnalysis-tool by PacBio but they were not particularly sensitive for this case. Any help would be appreciated.

Thanks

Sequence Biopython hexamers structure • 1.8k views
ADD COMMENT
0
Entering edit mode

Please use 101010 button to format the code portion of your post properly. I would have done it but it looks like you wrapped your code around on multiple lines so I don't want to mess up the formatting.

Take a look at seqkit (LINK). seqkit grep (LINK) would help you to identify these patterns.

ADD REPLY
0
Entering edit mode

Writing a script is the best way, by searching with a regular expression (CCCTCT)+. If two or more matches are returned, then count and extract the spacing sequences.

seqkit locate can find all the hexamers, but it is not easy to get the "spacing" sequences, using bedtools?

$ seqkit locate -i -r -p "(CCCTCT)+" test.fasta 
seqID   patternName     pattern strand  start   end     matched
test    (CCCTCT)+       (CCCTCT)+       +       52      171     CCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCTCCCTCT
test    (CCCTCT)+       (CCCTCT)+       +       175     186     CCCTCTCCCTCT
test    (CCCTCT)+       (CCCTCT)+       +       198     203     CCCTCT


$ seqkit locate -i -r -p "(CCCTCT)+" test.fasta \
    | csvtk mutate2 -t -n len -e 'len($matched)' \
    | csvtk mutate2 -t -n count -e '$len / 6' -L 0 \
    | csvtk cut -t -f seqID,count
seqID   count
test    20
test    2
test    1
ADD REPLY
0
Entering edit mode

I couldn't replicate this second example.

$seqkit locate -i -r -p "(CCCTCT)+" file.fa \
    | csvtk mutate2 -t -n len -e 'len($matched)' \
    | csvtk mutate2 -t -n count -e '$len / 6' -L 0 \
    | csvtk cut -t -f seqID,count

Output: [ERRO] Undefined function len [ERRO] xopen: no content [ERRO] xopen: no content

ADD REPLY
1
Entering edit mode

Oh, the feature is not released yet. download here.

If you only need to filter these sequences, use:

$ seqkit locate -i -r -p "(CCCTCT)+" test.fasta \
    | csvtk freq -t -f seqID
seqID   frequency
test    3

For details:

$ seqkit locate -i -r -p "(CCCTCT)+" test.fasta \
    | csvtk mutate2 -t -n len -e 'len($matched)' \
    | csvtk mutate2 -t -n count -e '$len / 6' -L 0  \
    | csvtk cut -t -f seqID,count \
    | csvtk fold -t -f seqID -v count
seqID   count
test    20; 2; 1
ADD REPLY
0
Entering edit mode

Thanks. This tutorial was really helpful

ADD REPLY

Login before adding your answer.

Traffic: 2435 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6