7.8 years ago
fufuyou
Hi, how can I remove short predicted genes from a GFF file? Or how can I set a minimum value for CDS or protein length? Thanks, Fuyou
Could you please provide part of your file and explain the reason for such filtration of your gff file. If you used a program to predict gene models then the gene length cutoff should be set in it because it affects your statistical model. If you are trying to have only high quality predicted gene models and you assumed that short genes are potential errors, then you have to look at GO terms and see if in other species these GO terms are enriched with short genes.
It would be good if you could add some more information to your question. Based on what would you filter? The distance between begin and end has to be a minimal value? Try to be as specific as possible!
Thanks. I don't think it is the distance between begin and end. I want to know how to set a minimum length for the protein sequence or CDS. Fuyou
A minimum what? And if you aren't sure, how should we know? Maybe it's best that you first figure out what you want before asking people to help you.
Thanks. I am sorry that my question was not clear. What I mean is that I have a GFF file produced by gene prediction software, but I find that some of the genes are very short, and I want to remove them. For example, I would like every gene in this GFF file to have a protein sequence of at least 50 aa, or a CDS of at least 150 bp, and to remove the predicted genes whose protein sequence is shorter than 50 aa. For example:
I want to remove the second predicted gene, mrna0002.
How about:
awk '{if (($5 - $4)> 150) print $0}' your_file > new_file
Adjust 150 to a value that will exclude things smaller than that length.
Thanks, but I think I should only apply this to the mRNA lines.
If you want the cutoff to apply only to the mRNA lines (every exon line is kept), then:
awk '{if (($5 - $4)> 150 || ($3 == "exon")) print $0}' your_file > new_file
If you want to keep every mRNA line (the cutoff applies only to the other lines), then:
awk '{if (($5 - $4)> 150 || ($3 == "mRNA")) print $0}' your_file > new_file
Thanks. What I mean is: if, for a gene such as mrna0001, $5-$4 > 150 on the mRNA line, I want to keep both the mRNA and the exon lines. If, for a gene such as mrna0002, $5-$4 < 150 on the mRNA line, I want to remove both the mRNA and the exon lines. The result I want is:
I think your code should be close to what I want. I really appreciate your help. Fuyou
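The request above (drop a short mRNA together with its exon lines) could be sketched with a two-pass awk script. This is only a sketch under assumptions, since the example GFF is not shown in the thread: it assumes a tab-delimited file with GFF3-style attributes in column 9 (ID=... on the mRNA line, a single Parent=... on each child line). The function name filter_short_mrna is hypothetical. Because GFF coordinates are 1-based and inclusive, the span is computed as $5 - $4 + 1 rather than $5 - $4.

```shell
# Hypothetical helper: drop every mRNA whose genomic span is shorter than
# min_len, together with the features that name it as Parent.
# Assumes GFF3-style attributes in column 9 (ID=... and a single Parent=...).
filter_short_mrna() {
    gff="$1"
    min_len="${2:-150}"
    awk -F'\t' -v min="$min_len" '
        # Return the value of a key such as "ID" or "Parent" from column 9.
        function attr(s, key,    n, parts, i) {
            n = split(s, parts, ";")
            for (i = 1; i <= n; i++) {
                sub(/^[ ]+/, "", parts[i])
                if (index(parts[i], key "=") == 1)
                    return substr(parts[i], length(key) + 2)
            }
            return ""
        }
        # Pass 1: remember the IDs of mRNAs that are too short.
        NR == FNR {
            if ($3 == "mRNA" && ($5 - $4 + 1) < min) {
                id = attr($9, "ID")
                if (id != "") short[id] = 1
            }
            next
        }
        # Pass 2: keep comment lines, drop short mRNAs and their children.
        /^#/ { print; next }
        $3 == "mRNA" && (attr($9, "ID") in short) { next }
        (attr($9, "Parent") in short) { next }
        { print }
    ' "$gff" "$gff"
}
```

It would be called as filter_short_mrna your_file 150 > new_file. Note that this still filters on the genomic span of the mRNA line, not on CDS or protein length.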
Careful with the $5 - $4 thing: that is the length of the mRNA in genomic coordinates ($end - $start), which is not the same as the length of the transcript, let alone the CDS. There is no annotation of CDS or UTRs in your example GFF. Without this information, the length of the CDS cannot be determined. In addition, there is something else that is odd:
These exons overlap but have the same parent transcript. If they have different starts, they cannot both belong to the same mRNA, can they? Even if they can, it shows that you cannot just sum the lengths of the exons for each transcript.
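To illustrate the point about overlapping exons, here is a rough sketch that reports, for each parent transcript, the exonic length after merging overlapping exon intervals, so that shared bases are counted once. It assumes tab-delimited input with a single Parent=... attribute on each exon line and parent IDs without spaces; the function name exon_length_per_parent is hypothetical. The result is a merged exon length, which is still not a CDS length, because UTRs cannot be separated out without CDS annotation.

```shell
# Hypothetical helper: print "<parent> <merged exon length>" per transcript.
# Overlapping exons of the same parent are merged before summing.
exon_length_per_parent() {
    # Step 1: emit "parent start end" for every exon line.
    awk -F'\t' '
        $3 == "exon" {
            n = split($9, parts, ";")
            for (i = 1; i <= n; i++) {
                sub(/^[ ]+/, "", parts[i])
                if (index(parts[i], "Parent=") == 1)
                    print substr(parts[i], 8), $4, $5
            }
        }
    ' "$1" |
    # Step 2: sort by parent, then numerically by start coordinate.
    sort -k1,1 -k2,2n |
    # Step 3: merge overlapping intervals and sum their lengths.
    awk '
        # Add the interval currently being built to its parent total.
        function add_interval() { if (cur != "") total[cur] += hi - lo + 1 }
        $1 != cur || ($2 + 0) > hi {
            add_interval()
            cur = $1; lo = $2 + 0; hi = $3 + 0
            next
        }
        ($3 + 0) > hi { hi = $3 + 0 }
        END { add_interval(); for (p in total) print p, total[p] }
    '
}
```

For a transcript with exons at 100-200, 150-300 and 400-450, a naive sum gives 303 bp, whereas the merged length is 252 bp. The order of the output lines is not guaranteed.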