Hello people, I need to extract the CDS fraction of a genome from a GFF3 file and a genome FASTA file. Do you know a good parser for that? At the moment I am using:
awk -v FS="\t" -v OFS="\t" '$3 == "CDS" {print $1, $4-1, $5, $1":"$4"-"$5":"$9}' mygenome.gff3 | bedtools getfasta -name -fi mygenome.fasta -bed - -fo cds.fa
Please note: in this annotation, exons are identified as cds1/cds2/cds3, etc.
At this point I only get each CDS segment separately for every transcript. My aim is, for each transcript, to obtain the complete CDS sequence by joining cds1, cds2, cds3, etc., while taking strand orientation into account.
In summary, I want a table like this: chrom | CDS coordinates | seq (CDS)
Do you have any clues for me? Thanks.
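One possible way to do the joining step, as a minimal Python sketch rather than a tested pipeline. It assumes that all CDS segments of a transcript share the same `Parent=` attribute in column 9 (the attribute layout of this annotation is not shown, so this is a guess; if only `ID=cds1`, `ID=cds2`, ... are present, the grouping key in `transcript_id` has to be adapted). The function names `read_fasta`, `transcript_id` and `joined_cds` are made up for this example. Segments are sorted by start, concatenated, and the joined sequence is reverse-complemented for minus-strand transcripts.

```python
import sys
from collections import defaultdict

# Complement table used to reverse-complement minus-strand sequences.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_fasta(lines):
    """Return {sequence_name: sequence} from an iterable of FASTA lines."""
    genome, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                genome[name] = "".join(chunks)
            # Keep only the first word of the header as the sequence name.
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        genome[name] = "".join(chunks)
    return genome


def transcript_id(attributes):
    """Assumption: the transcript is named by Parent=, falling back to ID=."""
    fields = dict(
        item.split("=", 1) for item in attributes.strip().split(";") if "=" in item
    )
    return fields.get("Parent") or fields.get("ID") or attributes.strip()


def joined_cds(gff_lines, genome):
    """Yield (transcript, chrom, strand, coords, sequence) per transcript."""
    segments = defaultdict(list)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        key = (transcript_id(cols[8]), cols[0], cols[6])
        segments[key].append((int(cols[3]), int(cols[4])))
    for (tid, chrom, strand), parts in segments.items():
        parts.sort()  # ascending genomic order
        # GFF3 is 1-based and end-inclusive, hence start - 1 for slicing.
        seq = "".join(genome[chrom][start - 1:end] for start, end in parts)
        if strand == "-":
            # Reverse-complement the whole joined sequence.
            seq = seq.translate(COMPLEMENT)[::-1]
        coords = ",".join(f"{start}-{end}" for start, end in parts)
        yield tid, chrom, strand, coords, seq


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python join_cds.py mygenome.fasta mygenome.gff3
    with open(sys.argv[1]) as fasta, open(sys.argv[2]) as gff:
        for row in joined_cds(gff, read_fasta(fasta)):
            print("\t".join(row))
```

The output is tab-separated with one row per transcript: transcript, chrom, strand, CDS coordinates, joined sequence. The whole genome is held in memory, which is fine for small genomes but may need rethinking for very large ones.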
I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:
Thanks. Can you also tell me how you inserted the picture of this page into your post?
Directly next to the button for code markup is a button for inserting images. You need to put the picture online somewhere first; I use TinyPic, but there are many alternatives.