Extracting Fasta sequences
7.2 years ago

Hello, I want to extract all the sequences that contain 'tRNA-Phe', along with their FASTA headers. I also want to extract the tRNA-Phe sequences from multiple files, again with the FASTA headers.

For example, the following record contains tRNA-Phe. I want every FASTA header together with the sequences that carry this key.

>JSAA01000083.1 Elizabethkingia anophelis strain nuh11 contig_83, whole genome shotgun sequence
1 genes found
1 tRNA-Phe c[5099,5172] 34 (gaa)
ggtttcttagctcagttggtagagcaatggattgaaaatccatgtgtccc
tggttcgattcctggagaaaccac

Please help.

linux commands genomics • 2.5k views
ADD COMMENT
1
Entering edit mode

Have you tried anything?

ADD REPLY
0
Entering edit mode

I tried using the grep command:

grep -w tRNA-Phe file.txt

It gave me this output:

1 tRNA-Phe c[5099,5172] 34 (gaa)
1 tRNA-Phe c[10265,10338] 34 (gaa)

but not the sequences or the FASTA headers.

ADD REPLY
2
Entering edit mode

There are lots of similar posts on this site; please search before asking.

seqkit grep --by-name --use-regexp --pattern tRNA-Phe  seqs.fa
ADD REPLY
0
Entering edit mode

It is not working.

ADD REPLY
0
Entering edit mode

What's the error? Try to provide more information when giving feedback.

I guess you did not install the tool, haha

ADD REPLY
0
Entering edit mode

I already have seqkit and regexp installed.

Can you send me your email address? I'll mail you the files.

ADD REPLY
0
Entering edit mode

Just paste the main error message here.

ADD REPLY
1
Entering edit mode

grep -w only reports matches that form a whole word. Did you try a normal grep?

Also, if you want the sequence as well, and assuming the sequence is all on one line, you can use grep -A1, which returns one line of "after context" following each line that matched your pattern.
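As a minimal sketch of the grep -A1 suggestion, the commands below build a small two-line-per-record FASTA file and pull out the matching header together with the line after it. The file name sample.fa, the temporary directory, and the records seq1 and seq2 are made up for illustration. The approach only works when the pattern appears in the header line itself and the sequence sits on a single line.

```shell
# Hypothetical two-line-per-record FASTA file, used only for illustration.
tmpdir=$(mktemp -d)
printf '%s\n' \
  '>seq1 tRNA-Phe' \
  'ggtttcttagctcag' \
  '>seq2 tRNA-Ser' \
  'agagaggtggccgag' > "$tmpdir/sample.fa"

# Print every line containing tRNA-Phe plus the one line that follows it.
grep -A1 'tRNA-Phe' "$tmpdir/sample.fa"
```

When several non-adjacent records match, GNU grep prints a "--" separator line between the groups, which has to be filtered out before the result is used as FASTA.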

Finally, why don't you install Bioawk and use that? It's super easy for these tasks.

ADD REPLY
0
Entering edit mode

The first line contains the FASTA header, the second line contains the number of genes found, the third line contains the gene number, name, location, and size, and the fourth and fifth lines contain the sequence.

>LNOG01000023.1 Elizabethkingia anophelis strain 0422 contig_7, whole genome shotgun sequence
3 genes found
1 tRNA-Phe c[9122,9195] 34 (gaa)
ggtttcttagctcagttggtagagcaatggattgaaaatccatgtgtccc
tggttcgattcctggagaaaccac
2 tRNA-Phe c[9762,9835] 34 (gaa)
ggtttcttagctcagttggtagagcaatggattgaaaatccatgtgtccc
tggttcgattcctggagaaaccac
3 tRNA-Ser [118406,118494] 35 (gga)
agagaggtggccgagtggtcgaaggcgcacgcctggaaagtgtgtatact
ccaaaagggtatcgagggttcgaatcccttcctctctgc

ADD REPLY
1
Entering edit mode

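That's interesting. Check https://en.m.wikipedia.org/wiki/FASTA_format

This is not FASTA format: each record has a single header followed by a gene count and then several gene entries, each with its own sequence.

You will need to write a script to handle this special format.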
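One way such a script could look is sketched below in awk, assuming the layout described earlier in the thread: a ">" header line, an "N genes found" line, then for each gene a line of the form "number name location length (anticodon)" followed by its sequence lines. The function name extract_trna and the sample file are hypothetical, and the match on the gene name is exact and case-sensitive, so 'tRNA-Phe' matches but 'tRNA-phe' does not.

```shell
# extract_trna GENE FILE... : print each header once, followed by the
# gene lines and sequence lines of every gene whose name equals GENE.
extract_trna() {
  key=$1; shift
  awk -v key="$key" '
    /^>/ { header = $0; shown = 0; keep = 0; next }   # new record
    /genes found/ { keep = 0; next }                  # skip the count line
    /^[0-9]+[ \t]+tRNA-/ {                            # a gene line
      keep = ($2 == key)
      if (keep) {
        if (!shown) { print header; shown = 1 }       # header once per record
        print
      }
      next
    }
    keep && NF { print }                              # sequence lines of a kept gene
  ' "$@"
}

# Hypothetical sample file built from the record posted above.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.txt" <<'EOF'
>LNOG01000023.1 Elizabethkingia anophelis strain 0422 contig_7, whole genome shotgun sequence
3 genes found
1 tRNA-Phe c[9122,9195] 34 (gaa)
ggtttcttagctcagttggtagagcaatggattgaaaatccatgtgtccc
tggttcgattcctggagaaaccac
2 tRNA-Phe c[9762,9835] 34 (gaa)
ggtttcttagctcagttggtagagcaatggattgaaaatccatgtgtccc
tggttcgattcctggagaaaccac
3 tRNA-Ser [118406,118494] 35 (gga)
agagaggtggccgagtggtcgaaggcgcacgcctggaaagtgtgtatact
ccaaaagggtatcgagggttcgaatcccttcctctctgc
EOF

extract_trna tRNA-Phe "$tmpdir/sample.txt"
```

Because awk reads all the files named on its command line in order, several input files can be passed in one call, for example extract_trna tRNA-Phe *.txt. Blank lines between entries are skipped, so the script should also cope with input that is spaced out the way the record was originally pasted.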

ADD REPLY
