Hi everyone,
I need to re-analyze some old data using an earlier protocol that is hard to follow as a beginner. It says: "We used the EnsEMBL mouse genome assembly GRCm38.p6, where all non-coding regions were excluded, and all fully contained shorter coding sequences were collapsed (gffread -C -M -K)".
Since it says that the genome assembly was used, I am a little unsure whether this refers to the GTF file or the FASTA sequence.
Am I right that it refers to the GTF file and that I can simply run:
gffread GRCm38.p6File.gtf -C -M -K -o Modified_GRCm38.p6File.gtf
and thereby correctly exclude the non-coding regions and collapse all fully contained shorter coding sequences?
Thank you
Not knowing the aims of the study in question, it's hard to say for sure. Presumably they were only interested in analyzing coding regions. As for the reason for collapsing, perhaps it was to avoid processing duplicate and/or spurious annotations that are fully encompassed by other annotations.
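On the GTF vs. FASTA question: gffread works on annotation files (GFF/GTF), so the quoted options apply to the annotation rather than to the genome sequence. Below is a minimal sketch of the command from the question, not necessarily what the original authors ran. Two things in it are my assumptions: the `-T` flag, which I added because gffread writes GFF3 by default and the output file in the question is named `.gtf`, and the helper names `filter_coding` and `count_features`, which are made up for illustration. Also note that, as far as I understand it, `-C` drops transcripts that have no CDS rather than trimming non-coding parts such as UTR exons out of coding transcripts, so it is worth checking the output against what the protocol describes.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and file names are illustrative placeholders.

# Keep coding transcripts and collapse contained ones, writing GTF.
#   -C  keep only transcripts that have CDS features
#   -M  cluster transcripts into loci and discard redundant ones
#   -K  with -M, also discard shorter, fully contained transcripts
#   -T  write GTF instead of gffread's default GFF3 (my addition, an assumption)
filter_coding() {
    local in_gtf="$1" out_gtf="$2"
    gffread "$in_gtf" -C -M -K -T -o "$out_gtf"
}

# Count rows per feature type (column 3), skipping comment lines.
# Handy for comparing the annotation before and after filtering.
count_features() {
    awk -F'\t' '!/^#/ && NF >= 9 { n[$3]++ } END { for (f in n) print f "\t" n[f] }' "$1" | sort
}
```

With gffread installed, usage would look like `filter_coding GRCm38.p6File.gtf Modified_GRCm38.p6File.gtf`, followed by `count_features` on both files to see how the feature counts changed.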
Please use ADD REPLY/ADD COMMENT when responding to existing posts to keep threads logically organized. This comment should go under @Beginnners below.