Question

using awk to extract a specific pattern

0

Entering edit mode

7.1 years ago

Chvatil ▴ 140

I explain my problem.

I have a huge file in gff format such that:

scaffold_31 AUGUSTUS    CDS 18857   19210   0.63    +   0   transcript_id "g56.t1"; gene_id "g56";
scaffold_32 AUGUSTUS    CDS 8973    9290    0.82    -   0   transcript_id "g57.t1"; gene_id "g57";
scaffold_32 AUGUSTUS    CDS 11374   11507   0.96    -   2   transcript_id "g57.t1"; gene_id "g57";
scaffold_32 AUGUSTUS    CDS 11586   11733   0.39    -   0   transcript_id "g57.t1"; gene_id "g57";
scaffold_33 AUGUSTUS    CDS 5303    5323    0.83    -   0   transcript_id "g58.t1"; gene_id "g58";
scaffold_33 AUGUSTUS    CDS 5810    6034    0.97    -   0   transcript_id "g58.t1"; gene_id "g58";
scaffold_34 AUGUSTUS    CDS 1390    1805    0.87    +   1   transcript_id "g59.t1"; gene_id "g59";
scaffold_37 AUGUSTUS    CDS 15299   15390   0.91    -   2   transcript_id "g60.t1"; gene_id "g60";
scaffold_37 AUGUSTUS    CDS 15622   15826   0.88    -   0   transcript_id "g60.t1"; gene_id "g60";

an so on... And I would like to find a command to extract in one side transcrit where their FIRST CDS starts with a a codon phase 0 (the 7 th column), and those from which their FIRST CDS starts with a 1 or a 2. Then, I would like to get 3 files and here it would be:

First file: with the first CDS of the transcript starting in phase 0.

scaffold_31 AUGUSTUS    CDS 18857   19210   0.63    +   0   transcript_id "g56.t1"; gene_id "g56";
    scaffold_32 AUGUSTUS    CDS 8973    9290    0.82    -   0   transcript_id 
scaffold_32 AUGUSTUS    CDS 8973    9290    0.82    -   0   transcript_id "g57.t1"; gene_id "g57";
scaffold_33 AUGUSTUS    CDS 5303    5323    0.83    -   0   transcript_id "g58.t1"; gene_id "g58";
    scaffold_33 AUGUSTUS    CDS 5810    6034    0.97    -   0   transcript_id "g58.t1"; gene_id "g58";

The second with with the first CDS of the transcript starting in phase 1:

scaffold_34 AUGUSTUS    CDS 1390    1805    0.87    +   1   transcript_id "g59.t1"; gene_id "g59";

And the third with the first CDS of the transcript starting in phase 2:

scaffold_37 AUGUSTUS    CDS 15299   15390   0.91    -   2   transcript_id "g60.t1"; gene_id "g60";
    scaffold_37 AUGUSTUS    CDS 15622   15826   0.88    -   0   transcript_id "g60.t1"; gene_id "g60";

As you can see, since the transcrit for exemple transcript_id "g60.t1 has its first CDS starting with the phase 2, all the folowwing CDS belonging to this transcript has to be transfered to the same file.

Thanks for you help, I hope someone will find a solution :)? I thought that awk could help ?

gff extract awk parsing • 2.9k views

ADD COMMENT • link 7.1 years ago by Chvatil ▴ 140

1

Entering edit mode

Your file looks like a gtf file, not gff. And you probably want to be careful with what you mean by "first" CDS. Do you mean "first" according to genomic coordinate or "first" according to the gene? If a gene is on the minus strand of the genome, the first CDS according to the gene is actually the last CDS in genomic coordinate.

ADD REPLY • link 7.1 years ago by Damian Kao 16k

0

Entering edit mode

Yes it is a gtf file but I can convert it in gff3 just in case it is needed. Yes according to the gene, because I need to translate all my fasta sequence into protein, then I would like to separate sequences according to their codon phase to delete one or two nucleotides to be in the right phase to translate the seq. You are right for the minus strand, it is much more difficult then.. Do you think it is possible? I guess it is one of standard gtf treatments is not it?

ADD REPLY • link 7.1 years ago by Chvatil ▴ 140

2

Entering edit mode

Why not use the gffread utility and extract the nucleotide sequences from your fasta file and GTF file? And then translate.

ADD REPLY • link 7.1 years ago by GenoMax 151k

0

Entering edit mode

I used it but by doing gffread -g scaf0_test.fa merged.gff3 -w teste.fa I still have the problem that if i extract the nucleotide sequence in fasta format, all CDS starting with a start codon so in phase 0 will be well translated but if there is no start codon or that the codon phase is 1 or 2, then if I make an translation of my fasta seq, the protein will be wrong and I will have to delete 2 bases if the phase is 2 or 1 if the phase is 1, but I do not want to do it manually for all my sequence...

ADD REPLY • link 7.1 years ago by Chvatil ▴ 140

0

Entering edit mode

Have you tried the -y parameter in gffread:

  -y  write a protein fasta file with the translation of CDS for each record

ADD REPLY • link 7.1 years ago by Damian Kao 16k

0

Entering edit mode

Well, I indeed have the protein fasta outfile using the parameter -y but what I need is the dna sequence corresponding to this protein sequence. For exemple iI get the following aa seq:

>g37020.t1 gene=g37020
GTKVTEDKARKLFADVADASNVELVKFKEALTKLAKDQQKNVEELAKTLAE.

and the dna seq:

>g37020.t1 gene=g37020 CDS=3-158
ACGGCACGAAAGTAACAGAAGACAAGGCACGGAAGCTATTCGCTGACGTAGCAGACGCGAGTAACGTCGA
GCTCGTCAAGTTCAAGGAAGCGTTGACCAAATTGGCCAAGGATCAGCAGAAGAATGTGGAAGAGCTGGCC
AAAACCTTGGCGGAATGA

And when I translate it I get a wrong translation because the CDS phase is not good:

>seqq
TARK*QKTRHGSYSLT*QTRVTSSSSSSRKR*PNWPRISRRMWKSWPKPWRN

But as you can see in the gft format, the codon phase of the first CDS (and here the only one) is 2: 

# Constraints/Hints:
# (none)
# Predicted genes for sequence number 117602 on both strands
# start gene g37020
scaffold_117601 AUGUSTUS    gene    1   894 0.48    +   .   g37020
scaffold_117601 AUGUSTUS    transcript  1   894 0.48    +   .   g37020.t1
scaffold_117601 AUGUSTUS    intron  1   736 0.48    +   .   transcript_id "g37020.t1"; gene_id "g37020";
scaffold_117601 AUGUSTUS    CDS 737 894 0.54    +   2   transcript_id "g37020.t1"; gene_id "g37020";
scaffold_117601 AUGUSTUS    stop_codon  892 894 .   +   0   transcript_id "g37020.t1"; gene_id "g37020";
# protein sequence = [GTKVTEDKARKLFADVADASNVELVKFKEALTKLAKDQQKNVEELAKTLAE]
# end gene g37020
###

So, if I delete 2 bases from the dna sequence in the fasta file I get the good codon phase and then the good amino acide sequence:

>seq
GGCACGAAAGTAACAGAAGACAAGGCACGGAAGCTATTCGCTGACGTAGCAGACGCGAGTAACGTCGAGCTCGTCAAGTTCAAGGAAGCGTTGACCAAATTGGCCAAGGATCAGCAGAAGAATGTGGAAGAGCTGGCCAAAACCTTGGCGGAATGA-

and translated:

>seq
GTKVTEDKARKLFADVADASNVELVKFKEALTKLAKDQQKNVEELAKTLAE*

So, I really need to get all the dna sequences in the good codon phase...

ADD REPLY • link 7.1 years ago by Chvatil ▴ 140

0

Entering edit mode

Maybe bedtools getfasta can do it you think?

ADD REPLY • link 7.1 years ago by Chvatil ▴ 140