Extract specific genes from a FASTA file.
1
1
Entering edit mode
5.4 years ago

Hello,

I want to extract a specific gene sequence from a multi-FASTA file using bash commands like awk,sed,grep etc. My FASTA file (3Lchr.fasta) comes from FlyBase and every gene in the chromosome has the following annotation type:

> FBgn0010431 type=gene; loc=3L:8406112..8407203; ID=FBgn0010431; name=mtrm; dbxref=FlyBase:FBgn0010431,FlyBase:FBan0018543,FlyBase_Annotation_IDs:CG18543,GB_protein:AAF50422,GB:AI546359,GB:AW943933,GB:AY061524,GB_protein:AAL29072,GB:U03288,GB_protein:AAA03463,UniProt/Swiss-Prot:Q23973,INTERPRO:IPR001660,INTERPRO:IPR011510,OrthoDB7_Drosophila:EOG7MHCJK,EntrezGene:38958,INTERPRO:IPR013761,BDGP_clone:LD47919,InterologFinder:38958,BIOGRID:64371,DPIM:FBgn0010431,DroID:FBgn0010431,DRSC:FBgn0010431,FLIGHT:FBgn0010431,FlyAtlas:CG18543-RA;CG7127-RA,FlyMine:FBgn0010431,Fly-FISH:CG18543,FLYGUT:FBgn0010431,GenomeRNAi:38958,INTERACTIVEFLY:/genebrief/matrimony.htm,modMine:FBgn0010431; derived_computed_cyto=66C11-66C11%3B Limits computationally determined from genome sequence between @P{lacW}l(3)L0139<up>L0139</up>@ and @P{PZ}Gug<up>03928</up>@%26@P{EP}CG6745<up>EP595b</up>@; gbunit=AE014296; MD5=ed5a26fca514a130e2b45aa01727082c; length=1092; release=r5.57; species=Dmel; 

GCCATTTGCCAGCTTGAATTCGCAAAGAACGCATCATTTTCTGTGTCACA
GTATCTAAAAAAAATGGAGAATTCTCGCACGCCCACGAACAAGACCAAAA
TTACGCTTAATCGCACGCCAACGCTAAAGGAGCGCAGATGGAACACCCTG
AAGGTGAACACCTCCAACGTGCGATGCTCTACTCCGATCTTTGGCAACTT
CCGTTCGCCCAATCTCTCGCCCATCGAGAATATGGGCACGAAGGGGAAGA
GTCCAGTGTCGCCCATGCGGTTCGCTACCTTCAAGAAAGTGCCAACGAAG
GTGCATCCCAAGCAGCAGCAGCAGCAGCAGCATCAGCACTGCCATCGCAC
TCAGCTTAAGCCCCCGCCATTCGTGCTGCCCAAGCCGCAGGAGGAGATCA
TCGAGCCGGAGCGAGAAATAAAGAGCTGCAGCAGCCCGGATACCTGTTCG
GATGACTCGAATATGGAGACCTCACTGGCCTTGGAGTCGCGTCGTCGTTC
CATCAAAGCATCGAACCACTCGTACGTGGTTAACCATGCCGCCAATGTGG
AACAGATTCTCATGCACATGGGCCTGGAGAACTATGTGACCAATTTCGAA
GAGGCTCACATCGATCTGGTGGAACTGGCATCCTTGGAGCGTGCTGATCT
TGTTAAAATCGGCCTAAATACCGATGAGGATTGCAACCGTATCATGGATG
TGCTCCACACTCTTTAAGTGAATCGCTGGCATGTCCATCTTTTATCATAC
ACTAAGATTTTAAAATGTTTTATGTGAAGAGAAACTTTGTGATTTTAATA
TATGTGGATAATTAAGTAGGTATTCTAGAATTATTGTAAAAGACCTGGTG
TCTTAAAGCTCGTCGTTTTTAGCCTTAAATAATAAACTCTAGTACACTTT
AAGAATCTATGTAATTCCTAAAATGTTGCTCCGAAGCTCATTTACAAATT
TAAATTTATTTCGTAAGCTAGCAAATTGCATAATACTTTTTGTCAAAAAA
AAAAAAAAAAATTGGGGTATAAACGCTTTTGTAAAACACACATTTTTAAA
ATTAATAAACTCATAAGAATCAAATGCATATTAAAAGGTTAC

I want to extract genes by gene names, in this case the name of the gene is mtrm. Reading some related posts, I managed to successfully extract the gene by its name in 2 steps. Firstly, I edited each of the headliners of the multi FASTA with the following pipeline:

awk -F '[/^>;,]' 'NF>1{print ">" $5} {print $1}' 3Lchr.fasta | awk NF | sed 's/ name=//' > 3Lchr_edit.fasta

Thus, each headline remains only with gene name, in my example >mtrm. This works only for FlyBase annotation type, and if you want to use it for a different annotation, you have to adjust the pipe for your case. Now, all the genes in the FASTA look like this:

>mtrm

GCCATTTGCCAGCTTGAATTCGCAAAGAACGCATCATTTTCTGTGTCACA
GTATCTAAAAAAAATGGAGAATTCTCGCACGCCCACGAACAAGACCAAAA
TTACGCTTAATCGCACGCCAACGCTAAAGGAGCGCAGATGGAACACCCTG

Now, I can extract any gene by its gene name and save it in another FASTA with awk:

awk -v seq="mtrm" -v RS='>' '$1 == seq {print RS $0}' 3Lchr_edit.fasta > mtrm.fasta

I hope these commands helps some of you because for me it was pretty hard to find a solution. If anyone knows/finds an easier or simplified solution in bash, I would appreciate sharing.

bash FASTA gene sequence • 6.0k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
3
Entering edit mode
5.4 years ago

A more general solution is to use by Wei Shen. SeqKit is a program that does quite a large number of convenient things with fasta/fastq files, and is efficient for big files.

Example: If I have a fasta file containing chitinase sequences and I want to extract sequences whose names contain the regular expression AHCHIT.*, the command would be

seqkit grep -r -p "AHCHIT.*" chitinase.wrp
>AHCHIT - A.hypogaea mRNA for chitinase (chit 2).
atctcgttcaagtcggcgctctggttgtggatgacagagcagaaaccaaaaccttcttgc
cacaacgtcatggttgggaattacgtgccaacagcatctgatagagcagcaaatagaacc
ttagggtttgggttggttacgaacatcatcaacggcggcctggactgcgg
>AHCHIT1A - A.hypogaea mRNA for chitinase (chit 1A).
atctcgttcaagtcggcgctctggttgtggatgacagcacaaggaaacaagccatcaagc
catgatgtcatcaccggaagatggacaccatcggccgcggacagggcggccggccgagtc
tccggatttggagtgatcacgaacatcatcaacggcggcctggactgcgg
>AHCHIT4 - A.hypogaea mRNA for chitinase (chit 4).
atctcgttcaagtcggcgctctggttgtggatgacaccacaagggaataagccctcgtgc
cacgacgtcatcaccaatgcatggaggccaaccgccactgactcggcggcaggtcgagcc
ccaggctacggtgtcatcacgaacatcatcaacggcggcctggactgcgg

SeqKit is also included as part of the BIRCH system, which ties together hundreds of popular bioinformatics programs in an intuitive graphic interface. An example using seqkit stat to generate statistics from NGS read files can be seen in the video https://www.youtube.com/watch?v=56T05sOcODI

ADD COMMENT
0
Entering edit mode

That website looks stuck in the early 90s with its glaring colors and random odd boxes. You seem to be associated with the tool, maybe hire a web designer to clean up the website a little?

ADD REPLY
0
Entering edit mode

I agree, the site uses pretty basic HTML. In fact, if I had it in the budget, I would probably do exactly that. The ToDo list is very long, and like a lot of things in bioinformatics, new features and bug fixes tend to get priority.

ADD REPLY

Login before adding your answer.

Traffic: 1992 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6