How to identify contigs containing the degenerate primer sequences using python?
0
0
Entering edit mode
6.2 years ago
xupbuy ▴ 30

I have a txt file with DNA sequences, each DNA sequence starting with ">Contig..." is called a contig. The file looks like this:

>Contig4679
CGGCGACGCCGGTGAGCCCACCGTTCCAGCGCAATGACAACAGCTGTAGCCCGCCCGAGA
GCGCCGTGAGGAACACGGCGGCGGGCACGAGGATGATGCGGCGGCCCACGGTCATGAGCA
CGATGGATGCGAACGAGGACAGCGCGATGCCGGTGGCGCCGAACGCCAGCACGCGCATGG
CGTCCACGCTCGGGTCATATTTGGGCAACAGGAGGTGCACCATCTCCGGGGCCCACACCG
CGCACAGCCCGGCCACGAGCGGCAACCCCACGGCGACGCCACGCACCAGTCGGTCGACGC
GCTCGCGGATCGCCGCGGGGTCCTGCCCGGCCTCGCTGTAGCGCTTCACCAACTGCGGGT
AGCTCACGTA
>Contig4680
ACCACTCACCCTACCACCTAGTCCTACAGCGTTATGTGGTTGGGCGGGTTGAGATGTTTT
TTAGAGACAACTCGAACTTCTCGCGCTGCTGGGCGGCTAAGTCTGGCTCCGCGTCGGCGA
GTTCGAGAAGCGCCAACTCGATCCGGTCGGCCGACAGCACGAGCTCCCGGGTCGGAATGA
GCTGCACCCGGTCGAGCGGCCGGATCGACCGCTGGTTCATGACGTCGAACTCGCGCATCG
ACAGGATCACGTCGTCGTCGATTTCCACGCGGATCGGATTCGGCGCCGAGGGGGGATAGA
CGGCGCTAATACACATCTCAGAGCCAACAAAAAAGGCAGAAACAACGAAACACATCCTCT
CCTAGAAAAA
>Contig4681
CACTCCTGCCGTCCCATCATCAGTAGCTCCTCGGGGGCGTAGGGCAACAGGGCGACGTTG
CGCAGGAAGAAGAGATAGGCGTCGCGGCCGACCGCGGTCTGCACCGGCATCGCGGCGACC
CGTGCGCTCAGCCACCCGCGGAAGCCTTCGAGCGCGGTCGCGGCACGATCGACCGCCGAG
TCGAGGCGCGGCTCTGCGTCGGCCGAGAGCCGCGGCTTCAGCTCGCGCGCCGTCTGCTTG
AGCCGCGGGCGGACGGTCTCGAGATCGGCGATGGCGAGCCGCGCAAATGAGCCGACCGCG
TCGGTGAGGTTGGCTTCCGCATGCTCCACCGTAATCGGGATGCGAGCGAGCTGTCTCTTA
TACACAACAC
>Contig4682
AGTCATGCTTGACGGTCGCTCTGTGGGTCAATTGGGGATATGCGCTCGTGCTCCTGGCTT
ATCCCCACGTTCTGCACAACACACGGCACGAGCAGTTCTCCGATGCGAAATTGCCCTACT
GCACGAGATGGATCTGACCTGCTACCGTTAACACATGGACACGCCCCTGACGCCGATGCC
ACCTGAAGCAGACGCGATTCGTGAAATCGCGCGCCTGCTCGTGGAGCAAGCCGAGGAAGC
GCTCCAGCGACACGACGCGCCTCTCCCGTAGCGAATCGCATTCGCGATCCCGGCCCTGTT
TTCTCGTTCTTTCAGAAAGGAGTCGACGTGTGTACGACAAAGAACTCCACGCGCGGAATC
GACTGCCCCG

I want to find out which contigs contain my degenerate primer sequences (and their reverse complementary sequences should also be considered) using python scripts, but don't know how. Any expert help me, please? Thank you so much!

I'd like some scripts that I can run like this:

Primer_finder.py -P1 GGRTCNCCIARYTGIGTICCIGTICCRTGIGC -P2 MGIGARGCIYTICARATGGAYCCICARCARMG -input contig.txt -output contigs_with_primer.txt

-P1 -P2: the degenerate primers. The output file should only contain contigs with both P1 and P2 (or their reverse complementary sequences).

Some explanation of the terms:

Degenerate primer sequences are the patterns to search. They are short DNA sequences, but with degeneracy. It means, normally DNA sequences contain A/G/C/T, but, for example, if there is an R, it means at this position, it can be either A or G. For example, AATRTGC means AATATGC or AATGTGC. Here's the table: R = A or G; Y = C or T; M = A or C; K = G or T; S = G or C; W = A or T; H = A or T or C; D = G or A or T; B = G or T or C; V = G or A or C; N = A or T or G or C.

Complementary sequences mean: A and T are complementary, G and C are complementary. Reverse complementary means first convert the sequences into complementary sequences and the reverse it, put the end to the front and put the front to the end. Example: ATTCCG reverse complementary: CGGAAT

sequence assembly python • 2.1k views
ADD COMMENT
0
Entering edit mode

Partly match between PCR primer and target sequence is enough for PCR amplication. Do you still want only full match?

ADD REPLY
0
Entering edit mode

(Only for full match)

# positive strand
$ cat seqs.fa | seqkit grep -s -d -p NACTGN | seqkit seq -n -i | tee 1.txt
Contig4679
Contig4682

# negative strand
$ cat seqs.fa | seqkit seq -r -p | seqkit grep -s -d -p TTAM | seqkit seq -n -i | tee 2.txt
Contig4681
Contig4682

# intersection
$ csvtk inter -H 1.txt 2.txt | tee id.txt
Contig4682

# retrieving sequences
$ seqkit grep -f id.txt seqs.fa > result.fa

ADD REPLY
0
Entering edit mode

Thank you. What if I want to add the mismatch function? For example, '-mismatch 2' means it can tolerate 2 mismatches, continuous mismatch or non-continuous match. Thank you very much.

ADD REPLY
0
Entering edit mode

Oh, mismatch supported using the last version.

ADD REPLY

Login before adding your answer.

Traffic: 2077 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6