Hi. I have a fasta file containing 18S and 16S sequences. Can I use sed or grep to find all headers containing 18S and print a new fasta file that includes only these headers and the respective sequences?
Edit: What I have:
>GQ118297.1 Tricorythus sp. BYU IGCEP206 16S ribosomal RNA gene, partial sequence; mitochondrial
TTGTCTCTTGGAAATATAGGAGATAGGGCCTGCCCAATGAAATTTCAATGGCCGCAGTAATTTGACTGTG
CAAAGGTAGCATAATCATTAGTTTTTTAATTGAGGACTGGTATGAAAGGCATAATGAGGTACTTGTTTTC
TTAAATAAAAGAATAAAATTTTACTTTTTAGTTAAAAGGCTAAAGTAATATAAAGGGACGAGAAGACCCT
ATAGAGTTTTATAAATTAAATTAATTTATTTTAGTAAAATTAAAGAATTGATGGAATTTATTTAGTTGGG
GAGATTTTGTAATAAAACTTATAATTATATAAACATTTATATATGATTTTAAGATCCATAAATTGATTAA
AAAATTAAATTACCTTAGGGATAACAGCGTTATTTTCTTGGAGAGTTCTTATCAATAGGAAAGTTTGCGA
CCTCGATGTTGGATTAAGAAAATAGTTAAATGAAGCCGTTTAATTAATAGGTCTGTTCGACCTTTAAAAT
CTTACA
>GQ118277.1 Teloganella sp. BYU IGCEP161 18S ribosomal RNA gene, partial sequence
GCTGTCTCAGTGCAAGCCTAATTAAAGTGAAACCGCAAATGGCTCATTAAATCAGTTATGGTTCATTAGA
TGATACATTATTTTACTTGGATAACTGTGGTAATTCTAGAGCTAATACATGCAAATTGAGTTCCCATCGG
TGACGGTAGGAACGCTTTTATTAGATCAAAACCAATACGTTTGCTTCGGCATACGATTTCAATGGTGATT
CTGAATAACTTTTTGATGATCGTACGGTCCTTGTATCGACGACAAATCTTTCAAATGTCTGCCTTATCAA
CTGTCGATGGTAGGCTCTGCGCCTACCATGGTTGTAACGGGTAACGGGGAATCAGGGTTCGATTCCGGAG
AGGGAGCCTGAGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAAATTACCCACTCCCGGCACG
GGGAGGTAGTGACGAAAAATAACGATACGGGACTCATCCGAGGCCCCGTAATCGGAATGAGMACACTTTA
AATMYTTTAACRAGTAYCYAWTGGAGGGCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTCCATTGG
CGTATATTAAAGTTGTTGCGGTTAAAAAGCTCGGTAGTTGGTATCATGTGTMTCGGACGGTCGGTTCGCC
NCNCNCGGTGTTCAACTGACCGGTCCGGACGTCCTGCCGGTGGGACCCGGTTCGCGCCGGGCCCCGT
What I would like to have after processing, i.e. after identifying all sequences with "18S" in the header and printing them into a new file:
>GQ118277.1 Teloganella sp. BYU IGCEP161 18S ribosomal RNA gene, partial sequence
GCTGTCTCAGTGCAAGCCTAATTAAAGTGAAACCGCAAATGGCTCATTAAATCAGTTATGGTTCATTAGA
TGATACATTATTTTACTTGGATAACTGTGGTAATTCTAGAGCTAATACATGCAAATTGAGTTCCCATCGG
TGACGGTAGGAACGCTTTTATTAGATCAAAACCAATACGTTTGCTTCGGCATACGATTTCAATGGTGATT
CTGAATAACTTTTTGATGATCGTACGGTCCTTGTATCGACGACAAATCTTTCAAATGTCTGCCTTATCAA
CTGTCGATGGTAGGCTCTGCGCCTACCATGGTTGTAACGGGTAACGGGGAATCAGGGTTCGATTCCGGAG
AGGGAGCCTGAGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAAATTACCCACTCCCGGCACG
GGGAGGTAGTGACGAAAAATAACGATACGGGACTCATCCGAGGCCCCGTAATCGGAATGAGMACACTTTA
AATMYTTTAACRAGTAYCYAWTGGAGGGCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTCCATTGG
CGTATATTAAAGTTGTTGCGGTTAAAAAGCTCGGTAGTTGGTATCATGTGTMTCGGACGGTCGGTTCGCC
NCNCNCGGTGTTCAACTGACCGGTCCGGACGTCCTGCCGGTGGGACCCGGTTCGCGCCGGGCCCCGT
No, you need something like awk.
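For what it's worth, an awk approach could look roughly like the sketch below. It assumes the header test is a plain substring match and that every record starts with a ">" line. The function name fasta_filter_header is made up for illustration, not something from a library.

```shell
# Print every FASTA record whose header line contains the given text.
# Hypothetical helper: fasta_filter_header PATTERN FILE
fasta_filter_header() {
    awk -v pat="$1" '
        /^>/ { keep = (index($0, pat) > 0) }  # header line: decide whether this record is kept
        keep                                  # print header and sequence lines of kept records
    ' "$2"
}
```

Something like `fasta_filter_header 18S input.fasta > only_18S.fasta` (both file names are placeholders) should then write only the 18S records. Since the keep flag persists until the next header, multi-line sequences come through whole.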
Whether or not you can do this depends on the contents of the file, and how precise you need to be. In the general case, no, it is not possible. I suggest you post the headers of some of your sequences, along with as much other data as possible. And clarify whether you wish to separate sequences based on their header or contents.
If you know a little bit of Python, Biopython SeqIO would be an easy and intuitive solution.
Unfortunately, I don't (yet), so Python does not help me at the moment.
Agreed, Biopython is totally the way to go for this.
Tmatamatamas : You have multiple options of correct answers below. Please "accept" (green check mark) as many as you wish to provide closure to this thread.
Hi, if you want to use grep, just try it with the parameters -A and -w.
The parameter -A means that grep prints lines of trailing context after each match (here, one line). The parameter -w matches only whole words, in our case 18S.
EDIT: This works only if 18S appears in the header and nowhere else. You also need a correctly formatted FASTA file: the first line, starting with >, is the header, and the second line holds the nucleotides.
OP wants to keep the entire sequence where the header contains the word 18S.
Edit: Moved to a comment since it is not an appropriate answer for the original question.
This is a Downvote:)
This would work if the fasta was a single-line fasta and not multi-line like the OP has.
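To make that concrete, here is a rough sketch of what a grep call with -w and one line of trailing context could look like on a single-line FASTA. This is an assumption about the general shape of such a command, not a quote of the original answer. The function name is hypothetical, and -A is a GNU/BSD grep option rather than POSIX.

```shell
# Only correct for single-line FASTA (one header line, one sequence line).
# Hypothetical helper: fasta_grep_single_line WORD FILE
fasta_grep_single_line() {
    # -w: match WORD only as a whole word; -A 1: also print the line after each match.
    # The second grep drops the "--" separators grep prints between match groups.
    grep -w -A 1 -- "$1" "$2" | grep -v '^--$'
}
```

On a multi-line FASTA like the OP's, this would print only the first 70 or so bases of each matching record, which is exactly the problem described here.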
st.ph.n, thank you for the clarity on the above answer. Is there any way to do this with grep?
Unless the FASTA files are linearised, as in the above comment, then no, not really. It might be possible, but it would be ugly and clunky, I'm sure.
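Linearising here means joining each record's sequence lines into a single line, so that every header is followed by exactly one sequence line. A minimal sketch, assuming a well-formed FASTA with ">" headers; the function name fasta_linearise is made up for illustration:

```shell
# Join the sequence lines of each FASTA record into one line.
# Hypothetical helper: fasta_linearise FILE   (reads stdin if no file is given)
fasta_linearise() {
    awk '
        /^>/ { if (inseq) printf "\n"; print; inseq = 0; next }  # header: close the previous sequence line, then print the header
        { printf "%s", $0; inseq = 1 }                           # sequence line: append without a newline
        END { if (inseq) printf "\n" }                           # terminate the last sequence line
    ' "$@"
}
```

Once the file is in that shape, a pipeline along the lines of `fasta_linearise input.fasta | grep -w -A 1 '18S' | grep -v '^--$'` should keep whole records (input.fasta is a placeholder), though the output sequences stay on one line rather than being re-wrapped.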