I explain my problem.
I have a huge file in gff format such that:
scaffold_31 AUGUSTUS CDS 18857 19210 0.63 + 0 transcript_id "g56.t1"; gene_id "g56";
scaffold_32 AUGUSTUS CDS 8973 9290 0.82 - 0 transcript_id "g57.t1"; gene_id "g57";
scaffold_32 AUGUSTUS CDS 11374 11507 0.96 - 2 transcript_id "g57.t1"; gene_id "g57";
scaffold_32 AUGUSTUS CDS 11586 11733 0.39 - 0 transcript_id "g57.t1"; gene_id "g57";
scaffold_33 AUGUSTUS CDS 5303 5323 0.83 - 0 transcript_id "g58.t1"; gene_id "g58";
scaffold_33 AUGUSTUS CDS 5810 6034 0.97 - 0 transcript_id "g58.t1"; gene_id "g58";
scaffold_34 AUGUSTUS CDS 1390 1805 0.87 + 1 transcript_id "g59.t1"; gene_id "g59";
scaffold_37 AUGUSTUS CDS 15299 15390 0.91 - 2 transcript_id "g60.t1"; gene_id "g60";
scaffold_37 AUGUSTUS CDS 15622 15826 0.88 - 0 transcript_id "g60.t1"; gene_id "g60";
an so on... And I would like to find a command to extract in one side transcrit where their FIRST CDS starts with a a codon phase 0 (the 7 th column), and those from which their FIRST CDS starts with a 1 or a 2. Then, I would like to get 3 files and here it would be:
First file: with the first CDS of the transcript starting in phase 0.
scaffold_31 AUGUSTUS CDS 18857 19210 0.63 + 0 transcript_id "g56.t1"; gene_id "g56";
scaffold_32 AUGUSTUS CDS 8973 9290 0.82 - 0 transcript_id
scaffold_32 AUGUSTUS CDS 8973 9290 0.82 - 0 transcript_id "g57.t1"; gene_id "g57";
scaffold_33 AUGUSTUS CDS 5303 5323 0.83 - 0 transcript_id "g58.t1"; gene_id "g58";
scaffold_33 AUGUSTUS CDS 5810 6034 0.97 - 0 transcript_id "g58.t1"; gene_id "g58";
The second with with the first CDS of the transcript starting in phase 1:
scaffold_34 AUGUSTUS CDS 1390 1805 0.87 + 1 transcript_id "g59.t1"; gene_id "g59";
And the third with the first CDS of the transcript starting in phase 2:
scaffold_37 AUGUSTUS CDS 15299 15390 0.91 - 2 transcript_id "g60.t1"; gene_id "g60";
scaffold_37 AUGUSTUS CDS 15622 15826 0.88 - 0 transcript_id "g60.t1"; gene_id "g60";
As you can see, since the transcrit for exemple transcript_id "g60.t1 has its first CDS starting with the phase 2, all the folowwing CDS belonging to this transcript has to be transfered to the same file.
Thanks for you help, I hope someone will find a solution :)? I thought that awk could help ?
Your file looks like a gtf file, not gff. And you probably want to be careful with what you mean by "first" CDS. Do you mean "first" according to genomic coordinate or "first" according to the gene? If a gene is on the minus strand of the genome, the first CDS according to the gene is actually the last CDS in genomic coordinate.
Yes it is a gtf file but I can convert it in gff3 just in case it is needed. Yes according to the gene, because I need to translate all my fasta sequence into protein, then I would like to separate sequences according to their codon phase to delete one or two nucleotides to be in the right phase to translate the seq. You are right for the minus strand, it is much more difficult then.. Do you think it is possible? I guess it is one of standard gtf treatments is not it?
Why not use the
gffread
utility and extract the nucleotide sequences from your fasta file and GTF file? And then translate.I used it but by doing
gffread -g scaf0_test.fa merged.gff3 -w teste.fa
I still have the problem that if i extract the nucleotide sequence in fasta format, all CDS starting with a start codon so in phase 0 will be well translated but if there is no start codon or that the codon phase is 1 or 2, then if I make an translation of my fasta seq, the protein will be wrong and I will have to delete 2 bases if the phase is 2 or 1 if the phase is 1, but I do not want to do it manually for all my sequence...Have you tried the -y parameter in gffread:
Well, I indeed have the protein fasta outfile using the parameter -y but what I need is the dna sequence corresponding to this protein sequence. For exemple iI get the following aa seq:
and the dna seq:
And when I translate it I get a wrong translation because the CDS phase is not good:
So, if I delete 2 bases from the dna sequence in the fasta file I get the good codon phase and then the good amino acide sequence:
and translated:
So, I really need to get all the dna sequences in the good codon phase...
Maybe bedtools getfasta can do it you think?