Hi all, I'm currently using bedtools to extract my CDS in FASTA format from a GFF3 file.
The command is: bedtools getfasta -fi scaf0_test.fa -bed run_augustus_0035.gff.out -fo fasta.out -s
The problem is that the output contains not only the coding part (which should have no internal stop codons) but also the non-coding part.
For example, here is my protein sequence from the GFF file:
MIAALVPREPPVGPPGRRAEGRGGIHCDATNNHTYTSPHIHSRHVHAVADRMSTLGKNLIGAPWRGVTTRGTTVSSVL
RAILQDSDKPESEHSGHACEQCGKLYKTVRTLYSHIMLKHPKDKEEVQCSVCDKKYSSALSLQKHVRYMHRYEHRCK
TCYRTFATSEMLVCHRESCYNNVSPCPVCNKIFDSRLALRNHINYNHPRSEESSVQERRQCNVCSRMFTSSRSLLNHM
AAIHPVGTTDCNLCGRTFNSMPAFRSHFLYKHGDHGVHCTKCHKLFATDTSLRRHMEKVHGKNNKPGFLCGICQTYFYK
SSDLVQHILANHKETSSD
And here is my DNA sequence from the output file (translated):
MIAALVPREPPVGPPGRRAEGRGGIHCDATNNHTYTSPHIHSRHVHAVADRMSTLGKNLIGAPWRGVTTRGTTVR*YVTNTG
VDSEEIKDRSCKHLQYRAPLYARA*FALIRHCADATTLPNITAFGIVTLPV**YKIN*THYCGVICAQS*TALHVMTC*RDECFRFQI
RRSLTLITMVVLIIMRTPRTIYT*TNPRNHILSP*SLRCFLTYIHVYKIYKNRSRTIRRGQVRSQYIECYIYTL*HPCLQSLSNCMRQ
YLCYRSQKQVCIFIEDPACLSETAIFRIRCKRKRHIFTQIFTSNLLNLPTLKFQRNKYPFKFIYF**ISTNLFNCKLSTKISCKKKEIF
FSSSSFFFSTDIYQLRRVCTYITENETRKDVNIINTLQNEC*NRYC*YRSKHSSVGN*YR*GTFAFFQFRIARHSTRLRQTGERA
FRARL*AMRQTL*NSEDALFAYHAETPQR*RGGAVFRLR*KVLVGTESTETRALHAPL*APM*NVLQDFCNV*NASLSQGKLLQ
QRVPMPRVQQDI*LPTCTSESYKL*SSKK*RKFGSRKEAMQRL*PHVY*FS*PPQPHGGYTPGRHD*LQFVRPHF*FHASLSKP
FPLQAR*PRRTLHKVS*IIRNRYKSSPAHGESTRQK**TRFLVRYMPNLFL*IE*PGAAHIGKP*GDIE*L
As you can see, there are a lot of stop codons. Do you know why I do not get exactly the same protein sequence? The FASTA DNA sequence should contain two concatenated CDS segments (one of about 74 codons and the other of about 256), so the translated sequence should be 330 amino acids long.
Here is an example of my GFF file:
# gffread v0.9.9
##gff-version 3
scaffold_0 AUGUSTUS mRNA 42655 44668 . - . ID=g1.t1;geneID=g1
scaffold_0 AUGUSTUS CDS 42655 43423 0.94 - 1 Parent=g1.t1
scaffold_0 AUGUSTUS CDS 44445 44668 0.82 - 0 Parent=g1.t1
scaffold_0 AUGUSTUS mRNA 51102 55274 . + . ID=g2.t1;geneID=g2
scaffold_0 AUGUSTUS CDS 51102 51192 0.60 + 0 Parent=g2.t1
scaffold_0 AUGUSTUS CDS 51310 51528 0.80 + 2 Parent=g2.t1
scaffold_0 AUGUSTUS CDS 52816 52845 0.64 + 2 Parent=g2.t1
scaffold_0 AUGUSTUS CDS 53114 53223 0.50 + 2 Parent=g2.t1
scaffold_0 AUGUSTUS CDS 53333 53633 0.91 + 0 Parent=g2.t1
scaffold_0 AUGUSTUS CDS 53981 54296 0.64 + 2 Parent=g2.t1
scaffold_0 AUGUSTUS CDS 54559 54581 0.94 + 1 Parent=g2.t1
scaffold_0 AUGUSTUS CDS 54783 54975 0.89 + 2 Parent=g2.t1
scaffold_0 AUGUSTUS CDS 55184 55274 0.66 + 1 Parent=g2.t1
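A quick length check on the g1.t1 rows above may help explain the extra residues. This is only a sketch in Python using the coordinates from the excerpt: the two CDS rows are 769 and 224 bases long (993 in total, i.e. 331 codons, which would fit a 330-residue protein plus a stop codon, assuming the stop is included in the CDS), whereas the mRNA row spans 2014 bases with the intron included.

```python
# Coordinates copied from the g1.t1 rows of the GFF excerpt (1-based, inclusive).
cds_segments = [(42655, 43423), (44445, 44668)]
mrna_span = (42655, 44668)


def span_length(start, end):
    """Length of a 1-based, inclusive interval."""
    return end - start + 1


cds_lengths = [span_length(s, e) for s, e in cds_segments]
cds_total = sum(cds_lengths)
mrna_length = span_length(*mrna_span)

print(cds_lengths)                   # [769, 224]
print(cds_total, cds_total // 3)     # 993 331
print(mrna_length, mrna_length // 3) # 2014 671
```

The translated output in the question is far longer than 330 residues and agrees with the protein only over the first 74, which is what one would expect if the sequence came from the mRNA row rather than from the joined CDS rows.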
Have you tried gffread, part of the GFF utilities (http://ccb.jhu.edu/software/stringtie/gff.shtml)?
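As far as I know, gffread can write the spliced CDS of each transcript to a FASTA file (its `-x` option, with the genome supplied through `-g`), which is the joined sequence the question is after. To illustrate what that splicing involves, here is a minimal Python sketch, not gffread's actual implementation: it groups CDS rows by their `Parent` attribute, joins the segments in coordinate order, and reverse-complements transcripts on the minus strand. The function names and the toy data are made up for illustration, and the genome is assumed to fit in memory as a dict.

```python
from collections import defaultdict

# Complement table for DNA, including N and lowercase (soft-masked) bases.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def spliced_cds(gff_lines, genome):
    """Join the CDS segments of each transcript into one sequence.

    gff_lines: iterable of GFF3 lines (tab- or space-separated).
    genome: dict mapping sequence name -> sequence string.
    """
    segments = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split(None, 8)
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        attrs = dict(kv.split("=", 1) for kv in fields[8].strip().split(";") if "=" in kv)
        parent = attrs.get("Parent")
        if parent is None:
            continue
        key = (parent, fields[0], fields[6])  # transcript, sequence name, strand
        segments[key].append((int(fields[3]), int(fields[4])))

    result = {}
    for (transcript, seqid, strand), spans in segments.items():
        # GFF coordinates are 1-based and inclusive, hence start - 1.
        seq = "".join(genome[seqid][s - 1:e] for s, e in sorted(spans))
        result[transcript] = reverse_complement(seq) if strand == "-" else seq
    return result


# Toy data, invented for this example: the same two segments on each strand.
toy_genome = {"s": "CCATGAAAGTCCCCTTTTAAGG"}
toy_gff = [
    "s x CDS 3 8 . + 0 Parent=t_plus",
    "s x CDS 15 20 . + 0 Parent=t_plus",
    "s x CDS 3 8 . - 0 Parent=t_minus",
    "s x CDS 15 20 . - 0 Parent=t_minus",
]
print(spliced_cds(toy_gff, toy_genome))
```

On the toy data, `t_plus` comes out as `ATGAAATTTTAA` and `t_minus` as its reverse complement, `TTAAAATTTCAT`.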
Hi, thanks for your help. How can I run bedtools getfasta on the CDS entries only? I cannot find such an argument on the command line.
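bedtools getfasta returns one record per row of the file passed to `-bed`, so the mRNA rows are extracted together with the CDS rows. I am not aware of a getfasta option that selects a feature type, so one workaround is to filter the GFF beforehand. Below is a small Python sketch of that filter; the function name and the output filename `cds_only.gff` are hypothetical.

```python
def keep_cds_rows(gff_lines):
    """Yield only the GFF rows whose feature type (column 3) is CDS."""
    for line in gff_lines:
        fields = line.split(None, 8)
        if len(fields) >= 8 and not line.startswith("#") and fields[2] == "CDS":
            yield line.rstrip("\n")


# Rows taken from the GFF excerpt in the question.
example = [
    "##gff-version 3",
    "scaffold_0 AUGUSTUS mRNA 42655 44668 . - . ID=g1.t1;geneID=g1",
    "scaffold_0 AUGUSTUS CDS 42655 43423 0.94 - 1 Parent=g1.t1",
    "scaffold_0 AUGUSTUS CDS 44445 44668 0.82 - 0 Parent=g1.t1",
]
cds_rows = list(keep_cds_rows(example))
print(len(cds_rows))  # 2

# To apply it to a real file (cds_only.gff is a hypothetical output name):
# with open("run_augustus_0035.gff.out") as src, open("cds_only.gff", "w") as dst:
#     dst.writelines(row + "\n" for row in keep_cds_rows(src))
```

The filtered file can then be passed to the original command in place of run_augustus_0035.gff.out. Note that each CDS segment would still come out as a separate FASTA record, so the segments of one transcript would need to be joined afterwards to obtain the full coding sequence.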
Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.
Ok, it is done, thank you.