Question

How to extract FASTA headers that contain specific text, along with their sequences

0

4.2 years ago

Optimist ▴ 190

Greetings!!!

I have barrnap output for 100 Pseudomonas aeruginosa genomes.

The output looks like this (sequences have been trimmed to avoid huge lines on Biostars):

>16S_rRNA::Pseudomonas_aeruginosa_PAOC_Seq_1:6516148-6517679(-)
TGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGATGAAGGGAGCTTGCTCCTGGATTCAGCGGCGGACGGGTGAG
>23S_rRNA::Pseudomonas_aeruginosa_PAOC_Seq_1:6512786-6515674(-)
TCAAGTGAAGAAGCGCATACGGTGGATGCCTTGGCAGTCAGAGGCGATGAAAGACGTGGTAGCCTGCGAAAAGCT
>5S_rRNA::Pseudomonas_aeruginosa_PAOC_Seq_1:6512529-6512639(-)
TGACGATCATAGAGCGTTGGAACCACCTGATCCCTTCCCGAACTCAGAAGTGA

I want to extract only the 16S rRNA headers and sequences from all the outputs.

The output should look like this:

>16S_rRNA::Pseudomonas_aeruginosa_PAOC_Seq_1:6516148-6517679(-) 
TGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGATGAAGGGAGCTTGCTCCTGGATTCAGCGGCGGACGGGTGAG

How can I get this output?

Thank you all

barrnap 16srRNA fasta • 3.2k views

ADD COMMENT • link written 4.2 years ago by Optimist ▴ 190

2

What have you tried so far?

Did you have a look at, for instance, seqtk or SeqKit?

ADD REPLY • link 4.2 years ago by lieven.sterck 15k

0

filter_fasta.py has worked.

ADD REPLY • link 4.2 years ago by Optimist ▴ 190

0

With awk and a flattened FASTA (each sequence on a single line):

awk '/^>16S/{print;getline;print}' seq.fa

ADD REPLY • link 4.2 years ago by cpad0112 21k
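
If the sequences are wrapped over several lines, the getline approach only picks up the first sequence line. A minimal sketch that toggles a flag on every header line instead, so it should work for both flattened and wrapped FASTA. The function name extract_16s and the demo file are made up for illustration:

```shell
# Print every record whose header starts with ">16S_rRNA",
# whether the sequence is on one line or wrapped over several.
extract_16s() {
    # On each header line, decide whether to keep the record;
    # every line (header or sequence) is printed while keep is true.
    awk '/^>/ { keep = /^>16S_rRNA/ } keep' "$@"
}

# Tiny made-up example with a wrapped 16S sequence
printf '>16S_rRNA::demo:1-8(+)\nACGT\nTTGA\n>23S_rRNA::demo:9-12(+)\nGGCC\n' > demo_wrapped.fa
extract_16s demo_wrapped.fa
```

Because the function passes all of its arguments to awk, several barrnap output files can be given at once and the result redirected into a single file.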

0

Hi optimist,

Can you share your script? How did you use barrnap on 100 P. aeruginosa genomes?

Thank you so much!

ADD REPLY • link 3.1 years ago by Neel ▴ 20

Ram · Answer 1 · 2021-02-02

2

4.2 years ago

Optimist ▴ 190

I have found a solution to this.

seqkit rmdup -n -i -j <threads> <infilename> > outfilename

Thank you all for the responses

Cheers Have a great time!!!

ADD COMMENT • link updated 4.2 years ago by Ram 45k • written 4.2 years ago by Optimist ▴ 190

0

This would remove duplicates by sequence name (-n compares the full header), not by sequence. To remove duplicates by sequence, try: seqkit rmdup -s -i -j <threads> <infilename> > outfilename

ADD REPLY • link 4.2 years ago by cpad0112 21k
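
To make the difference between deduplicating by name and by sequence concrete, here is a plain-awk sketch of the by-sequence idea. It is only an illustration, not how seqkit works internally, and it assumes a flattened FASTA with exactly one sequence line per header. The function name dedup_by_seq and the demo file are hypothetical:

```shell
# Keep the first record for each distinct sequence, ignoring case.
# Assumes one header line followed by exactly one sequence line.
dedup_by_seq() {
    # paste joins header and sequence with a tab; awk keeps a pair
    # only the first time its upper-cased sequence is seen.
    paste - - < "$1" | awk -F '\t' '!seen[toupper($2)]++ { print $1; print $2 }'
}

# Made-up example: records a and b carry the same sequence in different case
printf '>a\nACGT\n>b\nacgt\n>c\nGGCC\n' > demo_dups.fa
dedup_by_seq demo_dups.fa
```

Here record b is dropped even though its header differs from a, which is exactly the behaviour to be aware of when many 16S copies are identical.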

score 2 · Answer 2 · 2021-02-02

If there are no line breaks in the sequences, I would do it like this:

paste - - <file.fasta | awk 'BEGIN{FS="\t";OFS="\n"}{if($1~/16S/){print $1,$2}}'

If you really have duplicate headers, i.e. identical strings, then you can deduplicate on the header column (note that if the sequences under identical headers differ, you will lose that information here):

paste - - <file.fasta | awk 'BEGIN{OFS=FS="\t"}{if($1~/16S/){print $0}}' | sort -t $'\t' -u -k1,1 | awk 'BEGIN{FS="\t";OFS="\n"}{print $1,$2}'
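
A single-pass alternative, as a sketch: awk can remember which 16S headers it has already printed and skip later records with the same header. Unlike the sort-based pipeline, this keeps the original record order and also copes with wrapped sequences. The function name dedup_16s_by_header and the demo file are made up; as above, only the first record under each header survives:

```shell
# Print 16S records, keeping only the first record for each exact header string.
dedup_16s_by_header() {
    # seen[] is incremented only for 16S headers (short-circuit &&),
    # so keep is true only for the first occurrence of each such header.
    awk '/^>/ { keep = (/^>16S_rRNA/ && !seen[$0]++) } keep' "$@"
}

# Made-up example: the 16S header appears twice
printf '>16S_rRNA::x:1-4(+)\nACGT\n>5S_rRNA::x:5-8(+)\nTTTT\n>16S_rRNA::x:1-4(+)\nACGT\n' > demo_hdr.fa
dedup_16s_by_header demo_hdr.fa
```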

score 1 · Answer 3 · 2021-02-02

1

4.2 years ago

Fatima ▴ 1000

If each sequence is on exactly one line:

grep -A 1 "16S_rRNA:" filename

Should do the trick!

ADD COMMENT • link 4.2 years ago by Fatima ▴ 1000
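
One caveat with grep -A: when the matches are not on adjacent lines (for example, a 23S record sits between two 16S records), grep prints a "--" separator line between the groups, and that line ends up in the FASTA. A small sketch that filters it out again; the function name extract_16s_grep and the demo file are hypothetical, and it still assumes one sequence line per header:

```shell
# Extract 16S headers plus the following sequence line,
# then drop the "--" group separators that grep -A inserts.
extract_16s_grep() {
    grep -A 1 '^>16S_rRNA' "$1" | grep -v '^--$'
}

# Made-up example with a 23S record between two 16S records
printf '>16S_rRNA::x:1-4(+)\nACGT\n>23S_rRNA::x:5-8(+)\nTTTT\n>16S_rRNA::y:1-4(+)\nGGCC\n' > demo_grep.fa
extract_16s_grep demo_grep.fa
```

Note that the filter would also drop a sequence line consisting solely of "--", which should not occur in unaligned rRNA sequences.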

0

This has worked too. Thank you.

Is there a way to remove duplicate headers along with the associated FASTA sequences?

It looks like most of the genomes have more than one 16S rRNA sequence (as expected).

ADD REPLY • link 4.2 years ago by Optimist ▴ 190

0

You can use cd-hit with a cutoff value of 1 (-c 1). It will remove the redundant sequences, but it will also remove sequences whose headers differ, as long as the sequences are identical. So, for example, when you have three identical sequences, the output file will contain only one of them (the representative).

ADD REPLY • link 4.2 years ago by Fatima ▴ 1000

0

Thank you

This might not help because most of the 16S rRNA seqs would be conserved (identical).

ADD REPLY • link 4.2 years ago by Optimist ▴ 190

0

Are the headers exactly identical?

grep "16S_rRNA" filename > headers 

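# This is not very efficient, but it should do the job.
# -F -x matches each header literally and as a whole line; -m 1 stops at the first occurrence, so a duplicated header is written only once.
sort -u headers | while IFS= read -r line ; do grep -F -x -m 1 -A 1 -- "${line}" filename >> output ; done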

ADD REPLY • link 4.2 years ago by Fatima ▴ 1000