I have a text file of DNA sequences. Each sequence, introduced by a header line starting with ">Contig...", is called a contig. The file looks like this:
>Contig4679
CGGCGACGCCGGTGAGCCCACCGTTCCAGCGCAATGACAACAGCTGTAGCCCGCCCGAGA
GCGCCGTGAGGAACACGGCGGCGGGCACGAGGATGATGCGGCGGCCCACGGTCATGAGCA
CGATGGATGCGAACGAGGACAGCGCGATGCCGGTGGCGCCGAACGCCAGCACGCGCATGG
CGTCCACGCTCGGGTCATATTTGGGCAACAGGAGGTGCACCATCTCCGGGGCCCACACCG
CGCACAGCCCGGCCACGAGCGGCAACCCCACGGCGACGCCACGCACCAGTCGGTCGACGC
GCTCGCGGATCGCCGCGGGGTCCTGCCCGGCCTCGCTGTAGCGCTTCACCAACTGCGGGT
AGCTCACGTA
>Contig4680
ACCACTCACCCTACCACCTAGTCCTACAGCGTTATGTGGTTGGGCGGGTTGAGATGTTTT
TTAGAGACAACTCGAACTTCTCGCGCTGCTGGGCGGCTAAGTCTGGCTCCGCGTCGGCGA
GTTCGAGAAGCGCCAACTCGATCCGGTCGGCCGACAGCACGAGCTCCCGGGTCGGAATGA
GCTGCACCCGGTCGAGCGGCCGGATCGACCGCTGGTTCATGACGTCGAACTCGCGCATCG
ACAGGATCACGTCGTCGTCGATTTCCACGCGGATCGGATTCGGCGCCGAGGGGGGATAGA
CGGCGCTAATACACATCTCAGAGCCAACAAAAAAGGCAGAAACAACGAAACACATCCTCT
CCTAGAAAAA
>Contig4681
CACTCCTGCCGTCCCATCATCAGTAGCTCCTCGGGGGCGTAGGGCAACAGGGCGACGTTG
CGCAGGAAGAAGAGATAGGCGTCGCGGCCGACCGCGGTCTGCACCGGCATCGCGGCGACC
CGTGCGCTCAGCCACCCGCGGAAGCCTTCGAGCGCGGTCGCGGCACGATCGACCGCCGAG
TCGAGGCGCGGCTCTGCGTCGGCCGAGAGCCGCGGCTTCAGCTCGCGCGCCGTCTGCTTG
AGCCGCGGGCGGACGGTCTCGAGATCGGCGATGGCGAGCCGCGCAAATGAGCCGACCGCG
TCGGTGAGGTTGGCTTCCGCATGCTCCACCGTAATCGGGATGCGAGCGAGCTGTCTCTTA
TACACAACAC
>Contig4682
AGTCATGCTTGACGGTCGCTCTGTGGGTCAATTGGGGATATGCGCTCGTGCTCCTGGCTT
ATCCCCACGTTCTGCACAACACACGGCACGAGCAGTTCTCCGATGCGAAATTGCCCTACT
GCACGAGATGGATCTGACCTGCTACCGTTAACACATGGACACGCCCCTGACGCCGATGCC
ACCTGAAGCAGACGCGATTCGTGAAATCGCGCGCCTGCTCGTGGAGCAAGCCGAGGAAGC
GCTCCAGCGACACGACGCGCCTCTCCCGTAGCGAATCGCATTCGCGATCCCGGCCCTGTT
TTCTCGTTCTTTCAGAAAGGAGTCGACGTGTGTACGACAAAGAACTCCACGCGCGGAATC
GACTGCCCCG
I want to find out which contigs contain my degenerate primer sequences (their reverse complements should also be considered) using a Python script, but I don't know how. Could anyone help, please? Thank you so much!
I'd like some scripts that I can run like this:
Primer_finder.py -P1 GGRTCNCCIARYTGIGTICCIGTICCRTGIGC -P2 MGIGARGCIYTICARATGGAYCCICARCARMG -input contig.txt -output contigs_with_primer.txt
-P1 -P2: the degenerate primers. The output file should only contain contigs with both P1 and P2 (or their reverse complementary sequences).
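A minimal sketch of the plumbing such a script would need: reading the contigs, parsing the four options shown above, and keeping only contigs that satisfy both primers. The function names (`read_contigs`, `keep_contigs`, `write_contigs`, `parse_args`, `exact_match`) are hypothetical, and `exact_match` is only a placeholder that does a plain substring test; a real matcher would also handle degenerate bases and reverse complements.

```python
import argparse


def read_contigs(lines):
    """Yield (header, sequence) pairs from FASTA-style lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous contig, if any
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def keep_contigs(contigs, primers, contains):
    """Keep contigs for which contains(primer, seq) is true for every primer."""
    return [(h, s) for h, s in contigs if all(contains(p, s) for p in primers)]


def write_contigs(contigs, handle, width=60):
    """Write contigs back out, wrapping sequences at `width` characters."""
    for header, seq in contigs:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


def parse_args(argv):
    """Parse the options used in the example command line."""
    parser = argparse.ArgumentParser(description="Keep contigs matching both primers")
    parser.add_argument("-P1", required=True, help="first degenerate primer")
    parser.add_argument("-P2", required=True, help="second degenerate primer")
    parser.add_argument("-input", required=True, help="contig file to read")
    parser.add_argument("-output", required=True, help="file for matching contigs")
    return parser.parse_args(argv)


def exact_match(primer, seq):
    # Placeholder (assumption): no degeneracy, no reverse complement
    return primer in seq
```

To turn this into a command-line tool, one could call `parse_args(sys.argv[1:])`, open `args.input`, pass the file object to `read_contigs`, filter with `keep_contigs`, and write the result to `args.output`.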
Some explanation of the terms:
Degenerate primer sequences are the patterns to search for. They are short DNA sequences with degeneracy: DNA sequences normally contain only A/G/C/T, but an R, for example, means that the position can be either A or G. So AATRTGC means AATATGC or AATGTGC. Here's the table: R = A or G; Y = C or T; M = A or C; K = G or T; S = G or C; W = A or T; H = A or T or C; D = G or A or T; B = G or T or C; V = G or A or C; N = A or T or G or C.
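One way to search for such a pattern is to translate each degenerate letter into a regular-expression character class. The sketch below follows the table above; `IUPAC` and `primer_to_regex` are hypothetical names. Note one assumption: the example primers contain the letter I, which is not in the table, and here it is treated like N (any base).

```python
import re

# Degenerate base -> set of bases it stands for (from the table above)
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "S": "CG", "W": "AT", "H": "ACT", "D": "AGT",
    "B": "CGT", "V": "ACG", "N": "ACGT",
    "I": "ACGT",  # assumption: I is not in the table; treated as "any base"
}


def primer_to_regex(primer):
    """Compile a degenerate primer into a regular expression."""
    parts = []
    for base in primer.upper():
        allowed = IUPAC[base]
        # Single bases stay literal; degenerate ones become a character class
        parts.append(allowed if len(allowed) == 1 else f"[{allowed}]")
    return re.compile("".join(parts))


pattern = primer_to_regex("AATRTGC")
print(pattern.pattern)                        # AAT[AG]TGC
print(bool(pattern.search("GGAATGTGCC")))     # True
```

A letter outside the table raises `KeyError`, which makes typos in a primer visible instead of silently matching nothing.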
Complementary sequences: A and T are complementary, and G and C are complementary. Reverse complementary means first converting the sequence into its complement and then reversing it, so the end comes first and the front comes last. Example: the reverse complement of ATTCCG is CGGAAT.
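The reverse complement can be sketched with a translation table. The degenerate codes are complemented too (for example R, meaning A or G, becomes Y, meaning T or C), so the function also works on the primers themselves. `COMPLEMENT` and `reverse_complement` are hypothetical names, and leaving I unchanged is an assumption, since I does not appear in the table.

```python
# Each base maps to its complement; degenerate codes map to the code
# covering the complementary bases (R<->Y, M<->K, H<->D, B<->V).
# Assumption: I is treated as a wildcard and left unchanged.
COMPLEMENT = str.maketrans(
    "ACGTRYMKSWHDBVNI",
    "TGCAYRKMSWDHVBNI",
)


def reverse_complement(seq):
    """Complement every base, then reverse the whole sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("ATTCCG"))   # CGGAAT
print(reverse_complement("AATRTGC"))  # GCAYATT
```

Searching a contig for both a primer and `reverse_complement(primer)` covers both strands without having to reverse-complement the contig itself.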
A partial match between the PCR primer and the target sequence is enough for PCR amplification. Do you still want only full matches?
(Only for full match)
Thank you. What if I want to add a mismatch option? For example, '-mismatch 2' would mean it can tolerate up to 2 mismatches, whether contiguous or not. Thank you very much.
Oh, mismatches are supported in the latest version.
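For illustration, one simple way a '-mismatch' option could work (not necessarily how the version mentioned above does it) is to slide the primer along the contig and count the positions where the contig base is not allowed by the primer. `find_with_mismatches` is a hypothetical name, and treating I as any base is an assumption.

```python
# Degenerate base -> bases it stands for; I is an assumption (any base)
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "S": "CG", "W": "AT", "H": "ACT", "D": "AGT",
    "B": "CGT", "V": "ACG", "N": "ACGT", "I": "ACGT",
}


def find_with_mismatches(primer, seq, max_mismatch=0):
    """Return start positions where primer fits seq with <= max_mismatch mismatches."""
    allowed = [IUPAC[base] for base in primer.upper()]
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - len(allowed) + 1):
        mismatches = 0
        for offset, bases in enumerate(allowed):
            if seq[start + offset] not in bases:
                mismatches += 1
                if mismatches > max_mismatch:
                    break  # too many mismatches in this window
        else:
            # Inner loop finished without break: window is acceptable
            hits.append(start)
    return hits


print(find_with_mismatches("AATRTGC", "GGAATCTGCC", 0))  # []
print(find_with_mismatches("AATRTGC", "GGAATCTGCC", 1))  # [2]
```

Mismatches are counted wherever they fall, so contiguous and non-contiguous mismatches are treated the same. This only covers substitutions, not insertions or deletions, and the reverse-complemented primer would need to be searched separately.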