Hi everyone,
I have a multi-FASTA file that was converted from a BWA BAM file. I want to extract only the sequences that contain a specific forward primer at the start and a specific reverse primer at the end. How can I do this with awk, sed, or grep? Thanks in advance.
The input file looks like this:
>M01015:63:000000000-D2M18:1:1101:17027:1479
TTCTCTCTTCTCTCTTCTTCCTCTTTTCTTTTCTCTCTCTTTTTTTTCTTCTTTTTCTTCTTTTTTTCTT
TTCCTTTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTCTCCTTTTTCCTTCTTTCTTTTTCTTTTTT
CTCCTCTTCTTTTTTCTTTTTTTTCTTTCTTTTTTTCTCCTTTTTTTTTTTTCTTTTTTTCTTCCCTTTT
TTTTTTTCCCTTTTTTCTTTTTTTTTTTCTTCCTTTTTTT
>M01015:63:000000000-D2M18:1:1101:17027:1479
TCCTCTCTCTCTCTTCTCCCTCCTCCCTTTCTCTCTTCTCTCTTTTCTCTTTCTTTTCTTTTTTCTCTTT
TCCCTTTTTCCCTTCTTTCTTTTCTTTTTTTTTTTCTTTTTTCTTTTCTTTTTTTTTTCTCTTTTTTTCT
TCTTTTTTTTTCTCCTCTTTTTTTTTTTCTTTTTTTTTTTTTTTCTCTTTTTCTTCTTTTTTTTTTTTTT
CTTTTTTTCTTTTTTTTTTTTTTTTTTCTCTCTCTTTTTT
>M01015:63:000000000-D2M18:1:1101:15901:1612
GGCACTCGTATCGATGCGGCCGCGTTCGTTTGTTTATACACCTGCTCGTGCTTGTTTATGCATCTGCCAT
CTCCCTTCTGCTTATTTCTGTCTCCGATGCCTCTGTACTCCTTAGCCTTTCAGCTCCTGCCGCCTGTTTC
CCTGTGATGCAACAAGCTTACTCTGCACCAATGATGCAGCAGCCAGCTCAATCTAACGCAGCCAGTGATT
AGTTAGACGCGTGCCTGTGATTAGTTAGACGCGTGCCAGT
>M01015:63:000000000-D2M18:1:1101:15901:1612
GCCTCTGTCCCTCTTCTACCTATTCCTTGCCCCCCTCTTCCTTATTCCTTCCCCGCCTCTTCCTTATCTC
TGCCTTCTTTCTTTTGACCTCTCTCCTTCCTCATTGGTGCAGCGTTAGCTTGTTGCTTCACTGGGAAACT
TGCGGCAGGAGCTGCCCTGCTTCTGCGTCCTGACTCTTCGCCTTCCGTAATTTCCCGTTCGGTGTTGCCT
GTTTCTTCTACCAGCTCGCTCAGTTTTTTATTCTTTCGA
>M01015:63:000000000-D2M18:1:1101:16395:1620
GGCACTCGTATCGATGCGGCCGCGGTTATCTCTTCCCGCTGCACTGCCTTTTAGGCGTTCTTTTGTTCCG
GCCCCCTCTCCCCCCGGGTTCCCTGCTTTCCCCTGTGCGCTATTCCTGTTCTAGATGCTTTACTGTCCCC
CTCCGCTCCCGGCTTCTCGGTCAGTTTCCCCGTGCTTAGTTAGACGCGTGCTTCTGGC
>M01015:63:000000000-D2M18:1:1101:16395:1620
GCCTCTAGCACGCGTCTAACTAATCACTTTCCCCCTCCCCGTTAATCCGGGTTCTGTCTTGTTCAGTCAT
TCCTCTCGCCCCGCCCTCGCTCACTGGCTCTTGCTGCCTACCCGGGTTCAGTACTCGCCGTCCCTTATGA
ACCCCTCTTTGGCCTTGCTCCGGGTGGTGTTTCCCGCGGCCGCATCGATACGAGTGCCCTGTTTCTTATA
CACTTCTGACGCTGCCGCCGAATATAGCGGTGTCGTTCTT
>M01015:63:000000000-D2M18:1:1101:15366:1643
GGCACTCGTATCGATGCGGCCGCGGTAAACTCCACCCGGACCAACGCCAAATAGTGTTTCATAAGGTACT
TCCCTTACTCCCCCCGTGTAGGCTGCTTTTGCCCCTCTTCTCTTGCTGGCCTAGATGAATTACTGTCCTC
TACCTAACCCTTCTTATCTGTCAGTTTCACCGTTTTTTGTTAGTCGCGTGCTCTTTGCCTTTTTCTTCTA
CCTCTCTCCTCTCTCACTATACTTCTGTCCATCTTTTTTT
>M01015:63:000000000-D2M18:1:1101:15366:1643
GCCTCTAGCACGCGTCTCACTAATCCCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTT
TCCTCTTTTCTTTCCTTTTCTCTCTTTTTTTTTCTCCTTTCCCTTTTTCTGTTCCTTCCGTCCCTTTTGT
TCCCCTTTTTTTCCTTTCTCCTTTTTTTTTTTTCCCTTGCTCTTTCTTTTCTCTTCTCCTTTTTCTTTTA
CTTATCTTCCGTTTCCTTCGTCTTTCTCTTTTTATATTTT
>M01015:63:000000000-D2M18:1:1101:17421:1643
GGCACTCGTATCGATGCGGCCGCGGTGATGTTAGTCGCGTGCCGTGTTTTGTTACACGCGTGCCAGTGAT
TAGTTAGACGCGTGCTAGAGGC
>M01015:63:000000000-D2M18:1:1101:17421:1643
GCCTCTAGCACGCGTCTAACTAATCACTGGCCCGCGTCTCTCTAATCTCGGCTCGCGTCTCACTTCCCCG
CGGCCGCATCGATACGAGTGCC
>M01015:63:000000000-D2M18:1:1101:16505:1648
GGCACTCGTATCGATGCGGCCGCGTGTGATTTCTTCGACTTGTCCTAGCGTCCTCTCTCTTATCTACTTC
TTCGACCCCTCTCGACTCCTTTTCATCTCCTATTCCCTTTTCTGCTTCCCTATATTCTCTTCTTTTTTCT
TTTTTTTTTTTTGCTTATTCTTCCTTATCACTTTTTTTTTTCTACTCTATGCTTCCTGTCTGTCTCGTTT
CTGCCTCGTTGGTTTATTTTTCCTGCCTCTTTCTTTTTTT
>M01015:63:000000000-D2M18:1:1101:16505:1648
GCCTCTAGCACGCGTCTAACTAATCACTCTCTTCCTTTTCTTTTCTTTTGCCTTGTCTCTTCTTCCCCTC
TCTTGCTTCCCTCTACTTCTTTTTTTTTTTTTTCTTCCGTCTCCTTCTTTTTTTCTTCTCTACTTTTTTT
TCTTCTTTTTTTTTCTTTCTCTTTTTTTCTTTCTTTTTTCTTTCTTTTTCTTCTTCTTTTTTTCTATTTT
CTTCTCTTCTACTCTCTTTTCTTTTTTCTTCTTTTTCTTT
>M01015:63:000000000-D2M18:1:1101:17397:1654
GGCACTCGTATCGATGCGGCCGCGGGTGATGTGATTAGTTATACGCGTGCTAGTGGC
>M01015:63:000000000-D2M18:1:1101:17397:1654
TCCTCTAGCACGCGTCTAACTAATCACATCACCCGCGGCCGCATCGATACGAGTGCC
I want to extract only the sequences (with headers) that contain "ggcactcgtatcgatgcggccgcg" at the beginning and "gtgattagttagacgcgtgctagaggc" at the end. The output should look like this:
M01015:63:000000000-D2M18:1:1101:17421:1643 GGCACTCGTATCGATGCGGCCGCGGTGATGTTAGTCGCGTGCCGTGTTTTGTTACACGCGTGCCAGTGAT TAGTTAGACGCGTGCTAGAGGC
@k.kathirvel93 try:
or
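For readers who want the filtering logic spelled out, here is a minimal Python sketch. It is not the command posted in this reply. It assumes the primers must match exactly (ignoring case) and that a sequence may wrap over several lines; the names `read_fasta` and `filter_by_primers` are made up for illustration.

```python
import io

FWD = "ggcactcgtatcgatgcggccgcg"      # forward primer from the question
REV = "gtgattagttagacgcgtgctagaggc"   # reverse primer from the question


def read_fasta(handle):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_primers(records, fwd=FWD, rev=REV):
    """Keep records whose sequence starts with fwd and ends with rev."""
    fwd, rev = fwd.upper(), rev.upper()
    for header, seq in records:
        s = seq.upper()
        if s.startswith(fwd) and s.endswith(rev):
            yield header, seq


# Two records copied from the input in the question.
EXAMPLE = """\
>M01015:63:000000000-D2M18:1:1101:17421:1643
GGCACTCGTATCGATGCGGCCGCGGTGATGTTAGTCGCGTGCCGTGTTTTGTTACACGCGTGCCAGTGAT
TAGTTAGACGCGTGCTAGAGGC
>M01015:63:000000000-D2M18:1:1101:17397:1654
GGCACTCGTATCGATGCGGCCGCGGGTGATGTGATTAGTTATACGCGTGCTAGTGGC
"""

if __name__ == "__main__":
    for header, seq in filter_by_primers(read_fasta(io.StringIO(EXAMPLE))):
        print(">" + header)
        print(seq)
```

Of the two example records, only `...17421:1643` is kept. The second one starts with the forward primer but its tail differs from the reverse primer by a few bases, so an exact match rejects it.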
Thanks @cpad0112, it worked nicely, but I need one more bit of help. After this filter I got a few reads that contain extra bases after the reverse primer sequence 'gtgattagttagacgcgtgctagaggc'. I want to remove those bases (everything after the reverse primer) and keep only the sequence between the forward and reverse primers. How can I do this?
Input reads look like:
I want output like:
@cpad0112 Can you help with this thread?
@k.kathirvel93 Use the seqkit command:
Your updated reverse primer has one extra 'c'. If it is a typo, that is fine. If it is not, add it to the code. Example FASTA input and output:
input:
output:
Thanks @cpad0112. By mistake, you put both my input and my expected output into a single file and filtered out only the expected output sequence. To be clear:
Input reads look like:
M01015:63:000000000-D2M18:1:1102:14195:28796 GGCACTCGTATCGATGCGGCCGCGGTAAACTCCACCCGGAGCAAGGCCAAATAGGGGTTCATAAGGTACGGCCCGTACTGAACCCGGGGAGGCTGCTTGAGCCAGGGAGCGATTGCTGGCCTAGATGAATGACTGTCCACGACAGAACCCGGCTTATCGGTCAGTTTCACCGTGATTAGTTAGACGCGTGCTAGAGGCCTGTCTCTTATACAAATCCCCGAGCCCACGAGACTCCTGAGCATCTCGTATG
I want output like:
M01015:63:000000000-D2M18:1:1102:14195:28796 GGCACTCGTATCGATGCGGCCGCGGTAAACTCCACCCGGAGCAAGGCCAAATAGGGGTTCATAAGGTACGGCCCGTACTGAACCCGGGGAGGCTGCTTGAGCCAGGGAGCGATTGCTGGCCTAGATGAATGACTGTCCACGACAGAACCCGGCTTATCGGTCAGTTTCACCGTGATTAGTTAGACGCGTGCTAGAGGCC
Note: I want to keep only the part of each read between the forward primer 'ggcactcgtatcgatgcggccgcg' and the reverse primer 'gtgattagttagacgcgtgctagaggc'; everything after the reverse primer has to be removed.
Thanks
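The trimming requested here can be sketched in a few lines of Python. This is only an illustration of the idea, not a command from the thread. It assumes exact, case-insensitive primer matches, that the forward primer sits at the very start of the read, and that the read should be cut after the first occurrence of the reverse primer; `trim_to_amplicon` is a hypothetical helper name.

```python
FWD = "ggcactcgtatcgatgcggccgcg"      # forward primer from the post
REV = "gtgattagttagacgcgtgctagaggc"   # reverse primer from the post


def trim_to_amplicon(seq, fwd=FWD, rev=REV):
    """Cut seq right after the reverse primer.

    Returns the trimmed sequence, or None when the read does not start
    with fwd or does not contain rev after it.
    """
    s = seq.upper()
    fwd, rev = fwd.upper(), rev.upper()
    if not s.startswith(fwd):
        return None
    pos = s.find(rev, len(fwd))       # first match after the forward primer
    if pos == -1:
        return None
    return seq[: pos + len(rev)]      # keep everything up to the end of rev


if __name__ == "__main__":
    # Synthetic read: primer, short insert, primer, then trailing bases.
    read = FWD.upper() + "GTAAACTCC" + REV.upper() + "CTGTCTCTTATACA"
    print(trim_to_amplicon(read))
```

With the reverse primer exactly as written in the note, the example read above should be cut at `...GCTAGAGGC`, which is one base shorter than the expected output shown (that one ends in an additional `C`).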
Got it. First reverse-complement the reverse primer (from gtgattagttagacgcgtgctagaggc to gcctctagcacgcgtctaactaatcac). Try the following:
input:
output:
Run this code against the OP's FASTA file and you will get only one sequence.
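The reverse-complement step can be double-checked with a short Python sketch (again, not the command used in this answer; `reverse_complement` is a name chosen for illustration and only handles plain A/C/G/T):

```python
# Map each base to its complement, preserving case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("gtgattagttagacgcgtgctagaggc"))
    # -> gcctctagcacgcgtctaactaatcac
```

The result matches the sequence quoted above. It also appears at the start of several second-in-pair reads in the OP's file (for example the mate of `...17421:1643` begins with `GCCTCTAGCACGCGTCTAACTAATCAC`).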