How to extract sequences from multi fasta file, which contains specific primers?
1
0
Entering edit mode
4.7 years ago
k.kathirvel93 ▴ 310

Hi EveryOne,

I have a multifasta file which is converted from BWA bam file. I want to extract only sequences contains specific forward primer on the start and reverse primer at the end. How can i do it with awk or sed or grep. Thanks in advance.

The Input file looks like this :

>M01015:63:000000000-D2M18:1:1101:17027:1479
TTCTCTCTTCTCTCTTCTTCCTCTTTTCTTTTCTCTCTCTTTTTTTTCTTCTTTTTCTTCTTTTTTTCTT
TTCCTTTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTCTCCTTTTTCCTTCTTTCTTTTTCTTTTTT
CTCCTCTTCTTTTTTCTTTTTTTTCTTTCTTTTTTTCTCCTTTTTTTTTTTTCTTTTTTTCTTCCCTTTT
TTTTTTTCCCTTTTTTCTTTTTTTTTTTCTTCCTTTTTTT

>M01015:63:000000000-D2M18:1:1101:17027:1479
TCCTCTCTCTCTCTTCTCCCTCCTCCCTTTCTCTCTTCTCTCTTTTCTCTTTCTTTTCTTTTTTCTCTTT
TCCCTTTTTCCCTTCTTTCTTTTCTTTTTTTTTTTCTTTTTTCTTTTCTTTTTTTTTTCTCTTTTTTTCT
TCTTTTTTTTTCTCCTCTTTTTTTTTTTCTTTTTTTTTTTTTTTCTCTTTTTCTTCTTTTTTTTTTTTTT
CTTTTTTTCTTTTTTTTTTTTTTTTTTCTCTCTCTTTTTT

>M01015:63:000000000-D2M18:1:1101:15901:1612
GGCACTCGTATCGATGCGGCCGCGTTCGTTTGTTTATACACCTGCTCGTGCTTGTTTATGCATCTGCCAT
CTCCCTTCTGCTTATTTCTGTCTCCGATGCCTCTGTACTCCTTAGCCTTTCAGCTCCTGCCGCCTGTTTC
CCTGTGATGCAACAAGCTTACTCTGCACCAATGATGCAGCAGCCAGCTCAATCTAACGCAGCCAGTGATT
AGTTAGACGCGTGCCTGTGATTAGTTAGACGCGTGCCAGT

>M01015:63:000000000-D2M18:1:1101:15901:1612
GCCTCTGTCCCTCTTCTACCTATTCCTTGCCCCCCTCTTCCTTATTCCTTCCCCGCCTCTTCCTTATCTC
TGCCTTCTTTCTTTTGACCTCTCTCCTTCCTCATTGGTGCAGCGTTAGCTTGTTGCTTCACTGGGAAACT
TGCGGCAGGAGCTGCCCTGCTTCTGCGTCCTGACTCTTCGCCTTCCGTAATTTCCCGTTCGGTGTTGCCT
GTTTCTTCTACCAGCTCGCTCAGTTTTTTATTCTTTCGA

>M01015:63:000000000-D2M18:1:1101:16395:1620
GGCACTCGTATCGATGCGGCCGCGGTTATCTCTTCCCGCTGCACTGCCTTTTAGGCGTTCTTTTGTTCCG
GCCCCCTCTCCCCCCGGGTTCCCTGCTTTCCCCTGTGCGCTATTCCTGTTCTAGATGCTTTACTGTCCCC
CTCCGCTCCCGGCTTCTCGGTCAGTTTCCCCGTGCTTAGTTAGACGCGTGCTTCTGGC

>M01015:63:000000000-D2M18:1:1101:16395:1620
GCCTCTAGCACGCGTCTAACTAATCACTTTCCCCCTCCCCGTTAATCCGGGTTCTGTCTTGTTCAGTCAT
TCCTCTCGCCCCGCCCTCGCTCACTGGCTCTTGCTGCCTACCCGGGTTCAGTACTCGCCGTCCCTTATGA
ACCCCTCTTTGGCCTTGCTCCGGGTGGTGTTTCCCGCGGCCGCATCGATACGAGTGCCCTGTTTCTTATA
CACTTCTGACGCTGCCGCCGAATATAGCGGTGTCGTTCTT

>M01015:63:000000000-D2M18:1:1101:15366:1643
GGCACTCGTATCGATGCGGCCGCGGTAAACTCCACCCGGACCAACGCCAAATAGTGTTTCATAAGGTACT
TCCCTTACTCCCCCCGTGTAGGCTGCTTTTGCCCCTCTTCTCTTGCTGGCCTAGATGAATTACTGTCCTC
TACCTAACCCTTCTTATCTGTCAGTTTCACCGTTTTTTGTTAGTCGCGTGCTCTTTGCCTTTTTCTTCTA
CCTCTCTCCTCTCTCACTATACTTCTGTCCATCTTTTTTT

>M01015:63:000000000-D2M18:1:1101:15366:1643
GCCTCTAGCACGCGTCTCACTAATCCCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTT
TCCTCTTTTCTTTCCTTTTCTCTCTTTTTTTTTCTCCTTTCCCTTTTTCTGTTCCTTCCGTCCCTTTTGT
TCCCCTTTTTTTCCTTTCTCCTTTTTTTTTTTTCCCTTGCTCTTTCTTTTCTCTTCTCCTTTTTCTTTTA
CTTATCTTCCGTTTCCTTCGTCTTTCTCTTTTTATATTTT

>M01015:63:000000000-D2M18:1:1101:17421:1643
GGCACTCGTATCGATGCGGCCGCGGTGATGTTAGTCGCGTGCCGTGTTTTGTTACACGCGTGCCAGTGAT
TAGTTAGACGCGTGCTAGAGGC

>M01015:63:000000000-D2M18:1:1101:17421:1643
GCCTCTAGCACGCGTCTAACTAATCACTGGCCCGCGTCTCTCTAATCTCGGCTCGCGTCTCACTTCCCCG
CGGCCGCATCGATACGAGTGCC

>M01015:63:000000000-D2M18:1:1101:16505:1648
GGCACTCGTATCGATGCGGCCGCGTGTGATTTCTTCGACTTGTCCTAGCGTCCTCTCTCTTATCTACTTC
TTCGACCCCTCTCGACTCCTTTTCATCTCCTATTCCCTTTTCTGCTTCCCTATATTCTCTTCTTTTTTCT
TTTTTTTTTTTTGCTTATTCTTCCTTATCACTTTTTTTTTTCTACTCTATGCTTCCTGTCTGTCTCGTTT
CTGCCTCGTTGGTTTATTTTTCCTGCCTCTTTCTTTTTTT

>M01015:63:000000000-D2M18:1:1101:16505:1648
GCCTCTAGCACGCGTCTAACTAATCACTCTCTTCCTTTTCTTTTCTTTTGCCTTGTCTCTTCTTCCCCTC
TCTTGCTTCCCTCTACTTCTTTTTTTTTTTTTTCTTCCGTCTCCTTCTTTTTTTCTTCTCTACTTTTTTT
TCTTCTTTTTTTTTCTTTCTCTTTTTTTCTTTCTTTTTTCTTTCTTTTTCTTCTTCTTTTTTTCTATTTT
CTTCTCTTCTACTCTCTTTTCTTTTTTCTTCTTTTTCTTT

>M01015:63:000000000-D2M18:1:1101:17397:1654
GGCACTCGTATCGATGCGGCCGCGGGTGATGTGATTAGTTATACGCGTGCTAGTGGC

>M01015:63:000000000-D2M18:1:1101:17397:1654
TCCTCTAGCACGCGTCTAACTAATCACATCACCCGCGGCCGCATCGATACGAGTGCC

I want to extract only sequences(With headers) contains "ggcactcgtatcgatgcggccgcg" sequnces at the beginning and "gtgattagttagacgcgtgctagaggc" at the end. Output sequences must be like this

M01015:63:000000000-D2M18:1:1101:17421:1643 GGCACTCGTATCGATGCGGCCGCGGTGATGTTAGTCGCGTGCCGTGTTTTGTTACACGCGTGCCAGTGAT TAGTTAGACGCGTGCTAGAGGC

Assembly genome next-gen sequencing sequence • 3.1k views
ADD COMMENT
1
Entering edit mode

@ k.kathirvel93 try:

$ seqkit seq -w 0 test.fa| grep  -EiB 1 'ggcactcgtatcgatgcggccgcg.*gtgattagttagacgcgtgctagaggc'                                
>M01015:63:000000000-D2M18:1:1101:17421:1643
GGCACTCGTATCGATGCGGCCGCGGTGATGTTAGTCGCGTGCCGTGTTTTGTTACACGCGTGCCAGTGATTAGTTAGACGCGTGCTAGAGGC

or

$ seqkit grep -srip ^ggcactcgtatcgatgcggccgcg test.fa | seqkit grep -srip gtgattagttagacgcgtgctagaggc$                         
>M01015:63:000000000-D2M18:1:1101:17421:1643
GGCACTCGTATCGATGCGGCCGCGGTGATGTTAGTCGCGTGCCGTGTTTTGTTACACGCG
TGCCAGTGATTAGTTAGACGCGTGCTAGAGGC
ADD REPLY
0
Entering edit mode

Thanks @cpad0112, it worked nice, but i need one more help. I got few reads after this filter, which contains more reads after the reverse primer sequence 'gtgattagttagacgcgtgctagaggc', So i want to eliminate those reads (reads after the reverse primer sequence) and keep only reads present in between Forward and Reverse primer sequence. How can succeed this?

Input reads are like

M01015:63:000000000-D2M18:1:1102:14195:28796 GGCACTCGTATCGATGCGGCCGCGGTAAACTCCACCCGGAGCAAGGCCAAATAGGGGTTCATAAGGTACGGCCCGTACTGAACCCGGGGAGGCTGCTTGAGCCAGGGAGCGATTGCTGGCCTAGATGAATGACTGTCCACGACAGAACCCGGCTTATCGGTCAGTTTCACCGTGATTAGTTAGACGCGTGCTAGAGGCCTGTCTCTTATACAAATCCCCGAGCCCACGAGACTCCTGAGCATCTCGTATG

I want output Like

M01015:63:000000000-D2M18:1:1102:14195:28796 GGCACTCGTATCGATGCGGCCGCGGTAAACTCCACCCGGAGCAAGGCCAAATAGGGGTTCATAAGGTACGGCCCGTACTGAACCCGGGGAGGCTGCTTGAGCCAGGGAGCGATTGCTGGCCTAGATGAATGACTGTCCACGACAGAACCCGGCTTATCGGTCAGTTTCACCGTGATTAGTTAGACGCGTGCTAGAGGCC

ADD REPLY
0
Entering edit mode

@cpad0112 Can you help with this thread?

ADD REPLY
0
Entering edit mode

k.kathirvel93 Use seqkit command:

seqkit grep -srip ^ggcactcgtatcgatgcggccgcg test.fa | seqkit grep -srip gtgattagttagacgcgtgctagaggc$

Your updated reverse primer has one c extra. If it is a typo, that is fine. If it is not, add it to the code. Example fasta and output:

input:

cat test.fa                                                                                                                  
>M01015:63:000000000-D2M18:1:1102:14195:28796 
GGCACTCGTATCGATGCGGCCGCGGTAAACTCCACCCGGAGCAAGGCCAAATAGGGGTTCATAAGGTACGGCCCGTACTGAACCCGGGGAGGCTGCTTGAGCCAGGGAGCGATTGCTGGCCTAGATGAATGACTGTCCACGACAGAACCCGGCTTATCGGTCAGTTTCACCGTGATTAGTTAGACGCGTGCTAGAGGCCTGTCTCTTATACAAATCCCCGAGCCCACGAGACTCCTGAGCATCTCGTATG
>M01015:63:000000000-D2M18:1:1102:14195:28796 
GGCACTCGTATCGATGCGGCCGCGGTAAACTCCACCCGGAGCAAGGCCAAATAGGGGTTCATAAGGTACGGCCCGTACTGAACCCGGGGAGGCTGCTTGAGCCAGGGAGCGATTGCTGGCCTAGATGAATGACTGTCCACGACAGAACCCGGCTTATCGGTCAGTTTCACCGTGATTAGTTAGACGCGTGCTAGAGGC

output:

seqkit grep -srip ^ggcactcgtatcgatgcggccgcg test.fa | seqkit grep -srip gtgattagttagacgcgtgctagaggc$                         
>M01015:63:000000000-D2M18:1:1102:14195:28796 
GGCACTCGTATCGATGCGGCCGCGGTAAACTCCACCCGGAGCAAGGCCAAATAGGGGTTC
ATAAGGTACGGCCCGTACTGAACCCGGGGAGGCTGCTTGAGCCAGGGAGCGATTGCTGGC
CTAGATGAATGACTGTCCACGACAGAACCCGGCTTATCGGTCAGTTTCACCGTGATTAGT
TAGACGCGTGCTAGAGGC
ADD REPLY
0
Entering edit mode

Thanks @cpad0112, by the mistake you have taken both my input and expected output in one single file and you filtered the expected output sequence alone. but clearly....

Input reads are like

M01015:63:000000000-D2M18:1:1102:14195:28796 GGCACTCGTATCGATGCGGCCGCGGTAAACTCCACCCGGAGCAAGGCCAAATAGGGGTTCATAAGGTACGGCCCGTACTGAACCCGGGGAGGCTGCTTGAGCCAGGGAGCGATTGCTGGCCTAGATGAATGACTGTCCACGACAGAACCCGGCTTATCGGTCAGTTTCACCGTGATTAGTTAGACGCGTGCTAGAGGCCTGTCTCTTATACAAATCCCCGAGCCCACGAGACTCCTGAGCATCTCGTATG

I want output Like

M01015:63:000000000-D2M18:1:1102:14195:28796 GGCACTCGTATCGATGCGGCCGCGGTAAACTCCACCCGGAGCAAGGCCAAATAGGGGTTCATAAGGTACGGCCCGTACTGAACCCGGGGAGGCTGCTTGAGCCAGGGAGCGATTGCTGGCCTAGATGAATGACTGTCCACGACAGAACCCGGCTTATCGGTCAGTTTCACCGTGATTAGTTAGACGCGTGCTAGAGGCC

Note : I want to keep reads only in between the F 'ggcactcgtatcgatgcggccgcg' and R 'gtgattagttagacgcgtgctagaggc' primer sequences, the reads after the Reverse primer has to be eliminated.

Thanks

ADD REPLY
1
Entering edit mode

got it. First reverse translate reverse primer (from gtgattagttagacgcgtgctagaggc to gcctctagcacgcgtctaactaatcac). Try following:

input:

cat test.fa                                                                                                                  

>M01015:63:000000000-D2M18:1:1102:14195:28796 
GGCACTCGTATCGATGCGGCCGCGGTAAACTCCACCCGGAGCAAGGCCAAATAGGGGTTCATAAGGTACGGCCCGTACTGAACCCGGGGAGGCTGCTTGAGCCAGGGAGCGATTGCTGGCCTAGATGAATGACTGTCCACGACAGAACCCGGCTTATCGGTCAGTTTCACCGTGATTAGTTAGACGCGTGCTAGAGGCCTGTCTCTTATACAAATCCCCGAGCCCACGAGACTCCTGAGCATCTCGTATG
>M01015:63:000000000-D2M18:1:1102:14195:28796 
GGCACTCGTATCGATGCGGCCGCGGTAAACTCCACCCGGAGCAAGGCCAAATAGGGGTTCATAAGGTACGGCCCGTACTGAACCCGGGGAGGCTGCTTGAGCCAGGGAGCGATTGCTGGCCTAGATGAATGACTGTCCACGACAGAACCCGGCTTATCGGTCAGTTTCACCGTGATTAGTTAGACGCGTGCTAGAGGC

output:

seqkit amplicon  -F ggcactcgtatcgatgcggccgcg -R gcctctagcacgcgtctaactaatcac test.fa   
>M01015:63:000000000-D2M18:1:1102:14195:28796 
GGCACTCGTATCGATGCGGCCGCGGTAAACTCCACCCGGAGCAAGGCCAAATAGGGGTTCATAAGGTACGGCCCGTACTGAACCCGGGGAGGCTGCTTGAGCCAGGGAGCGATTGCTGGCCTAGATGAATGACTGTCCACGACAGAACCCGGCTTATCGGTCAGTTTCACCGTGATTAGTTAGACGCGTGCTAGAGGC
>M01015:63:000000000-D2M18:1:1102:14195:28796 
GGCACTCGTATCGATGCGGCCGCGGTAAACTCCACCCGGAGCAAGGCCAAATAGGGGTTCATAAGGTACGGCCCGTACTGAACCCGGGGAGGCTGCTTGAGCCAGGGAGCGATTGCTGGCCTAGATGAATGACTGTCCACGACAGAACCCGGCTTATCGGTCAGTTTCACCGTGATTAGTTAGACGCGTGCTAGAGGC

Run this code against OP fasta file and you would get only one sequence.

ADD REPLY
2
Entering edit mode
4.7 years ago

try "grep" http://linuxcommand.org/lc3_man_pages/grep1.html or https://ss64.com/osx/grep.html

very important command line program to know. grep simply searches the file for you.

You asked "How can i do it with awk or sed or grep" and that's a homework question. Look up what each tool does. They are so popular there is going to be a Wikipedia page for each.

The Wikipedia page for AWK starts off quote "AWK is a domain-specific language designed for text processing and typically used as a data extraction and reporting tool. It is a standard feature of most Unix-like operating systems."

That sounds like just exactly what you want to do.

ADD COMMENT

Login before adding your answer.

Traffic: 1816 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6