Question

Extract specific genes from a FASTA file.

1

Entering edit mode

5.4 years ago

alexandru.bologa.marian ▴ 50

Hello,

I want to extract a specific gene sequence from a multi-FASTA file using bash commands like awk,sed,grep etc. My FASTA file (3Lchr.fasta) comes from FlyBase and every gene in the chromosome has the following annotation type:

> FBgn0010431 type=gene; loc=3L:8406112..8407203; ID=FBgn0010431; name=mtrm; dbxref=FlyBase:FBgn0010431,FlyBase:FBan0018543,FlyBase_Annotation_IDs:CG18543,GB_protein:AAF50422,GB:AI546359,GB:AW943933,GB:AY061524,GB_protein:AAL29072,GB:U03288,GB_protein:AAA03463,UniProt/Swiss-Prot:Q23973,INTERPRO:IPR001660,INTERPRO:IPR011510,OrthoDB7_Drosophila:EOG7MHCJK,EntrezGene:38958,INTERPRO:IPR013761,BDGP_clone:LD47919,InterologFinder:38958,BIOGRID:64371,DPIM:FBgn0010431,DroID:FBgn0010431,DRSC:FBgn0010431,FLIGHT:FBgn0010431,FlyAtlas:CG18543-RA;CG7127-RA,FlyMine:FBgn0010431,Fly-FISH:CG18543,FLYGUT:FBgn0010431,GenomeRNAi:38958,INTERACTIVEFLY:/genebrief/matrimony.htm,modMine:FBgn0010431; derived_computed_cyto=66C11-66C11%3B Limits computationally determined from genome sequence between @P{lacW}l(3)L0139<up>L0139</up>@ and @P{PZ}Gug<up>03928</up>@%26@P{EP}CG6745<up>EP595b</up>@; gbunit=AE014296; MD5=ed5a26fca514a130e2b45aa01727082c; length=1092; release=r5.57; species=Dmel; 

GCCATTTGCCAGCTTGAATTCGCAAAGAACGCATCATTTTCTGTGTCACA
GTATCTAAAAAAAATGGAGAATTCTCGCACGCCCACGAACAAGACCAAAA
TTACGCTTAATCGCACGCCAACGCTAAAGGAGCGCAGATGGAACACCCTG
AAGGTGAACACCTCCAACGTGCGATGCTCTACTCCGATCTTTGGCAACTT
CCGTTCGCCCAATCTCTCGCCCATCGAGAATATGGGCACGAAGGGGAAGA
GTCCAGTGTCGCCCATGCGGTTCGCTACCTTCAAGAAAGTGCCAACGAAG
GTGCATCCCAAGCAGCAGCAGCAGCAGCAGCATCAGCACTGCCATCGCAC
TCAGCTTAAGCCCCCGCCATTCGTGCTGCCCAAGCCGCAGGAGGAGATCA
TCGAGCCGGAGCGAGAAATAAAGAGCTGCAGCAGCCCGGATACCTGTTCG
GATGACTCGAATATGGAGACCTCACTGGCCTTGGAGTCGCGTCGTCGTTC
CATCAAAGCATCGAACCACTCGTACGTGGTTAACCATGCCGCCAATGTGG
AACAGATTCTCATGCACATGGGCCTGGAGAACTATGTGACCAATTTCGAA
GAGGCTCACATCGATCTGGTGGAACTGGCATCCTTGGAGCGTGCTGATCT
TGTTAAAATCGGCCTAAATACCGATGAGGATTGCAACCGTATCATGGATG
TGCTCCACACTCTTTAAGTGAATCGCTGGCATGTCCATCTTTTATCATAC
ACTAAGATTTTAAAATGTTTTATGTGAAGAGAAACTTTGTGATTTTAATA
TATGTGGATAATTAAGTAGGTATTCTAGAATTATTGTAAAAGACCTGGTG
TCTTAAAGCTCGTCGTTTTTAGCCTTAAATAATAAACTCTAGTACACTTT
AAGAATCTATGTAATTCCTAAAATGTTGCTCCGAAGCTCATTTACAAATT
TAAATTTATTTCGTAAGCTAGCAAATTGCATAATACTTTTTGTCAAAAAA
AAAAAAAAAAATTGGGGTATAAACGCTTTTGTAAAACACACATTTTTAAA
ATTAATAAACTCATAAGAATCAAATGCATATTAAAAGGTTAC

I want to extract genes by gene names, in this case the name of the gene is mtrm. Reading some related posts, I managed to successfully extract the gene by its name in 2 steps. Firstly, I edited each of the headliners of the multi FASTA with the following pipeline:

awk -F '[/^>;,]' 'NF>1{print ">" $5} {print $1}' 3Lchr.fasta | awk NF | sed 's/ name=//' > 3Lchr_edit.fasta

Thus, each headline remains only with gene name, in my example >mtrm. This works only for FlyBase annotation type, and if you want to use it for a different annotation, you have to adjust the pipe for your case. Now, all the genes in the FASTA look like this:

>mtrm

GCCATTTGCCAGCTTGAATTCGCAAAGAACGCATCATTTTCTGTGTCACA
GTATCTAAAAAAAATGGAGAATTCTCGCACGCCCACGAACAAGACCAAAA
TTACGCTTAATCGCACGCCAACGCTAAAGGAGCGCAGATGGAACACCCTG

Now, I can extract any gene by its gene name and save it in another FASTA with awk:

awk -v seq="mtrm" -v RS='>' '$1 == seq {print RS $0}' 3Lchr_edit.fasta > mtrm.fasta

I hope these commands helps some of you because for me it was pretty hard to find a solution. If anyone knows/finds an easier or simplified solution in bash, I would appreciate sharing.

bash FASTA gene sequence • 6.0k views

ADD COMMENT • link updated 20 months ago by Ram 44k • written 5.4 years ago by alexandru.bologa.marian ▴ 50

0

Entering edit mode

crossposted on BIoinfomatics.SE

ADD REPLY • link 5.4 years ago by kamiljaron ▴ 230

Ram · Answer 1 · 2019-07-21

A more general solution is to use by Wei Shen. SeqKit is a program that does quite a large number of convenient things with fasta/fastq files, and is efficient for big files.

Example: If I have a fasta file containing chitinase sequences and I want to extract sequences whose names contain the regular expression AHCHIT.*, the command would be

seqkit grep -r -p "AHCHIT.*" chitinase.wrp
>AHCHIT - A.hypogaea mRNA for chitinase (chit 2).
atctcgttcaagtcggcgctctggttgtggatgacagagcagaaaccaaaaccttcttgc
cacaacgtcatggttgggaattacgtgccaacagcatctgatagagcagcaaatagaacc
ttagggtttgggttggttacgaacatcatcaacggcggcctggactgcgg
>AHCHIT1A - A.hypogaea mRNA for chitinase (chit 1A).
atctcgttcaagtcggcgctctggttgtggatgacagcacaaggaaacaagccatcaagc
catgatgtcatcaccggaagatggacaccatcggccgcggacagggcggccggccgagtc
tccggatttggagtgatcacgaacatcatcaacggcggcctggactgcgg
>AHCHIT4 - A.hypogaea mRNA for chitinase (chit 4).
atctcgttcaagtcggcgctctggttgtggatgacaccacaagggaataagccctcgtgc
cacgacgtcatcaccaatgcatggaggccaaccgccactgactcggcggcaggtcgagcc
ccaggctacggtgtcatcacgaacatcatcaacggcggcctggactgcgg

SeqKit is also included as part of the BIRCH system, which ties together hundreds of popular bioinformatics programs in an intuitive graphic interface. An example using seqkit stat to generate statistics from NGS read files can be seen in the video https://www.youtube.com/watch?v=56T05sOcODI