Question

Probably very simple question for gffread

0

5.9 years ago

Beginner ▴ 90

Hi everyone,

I need to re-analyze some old data with a previous protocol that is not easy to understand as a beginner. It says: "We used the EnsEMBL mouse genome assembly GRCm38.p6, where all non-coding regions were excluded, and all fully contained shorter coding sequences were collapsed (gffread -C - M -K)".

Since it says that the genome assembly was used, I am a bit confused about whether this refers to the GTF file or the FASTA sequence.

Am I right that it applies to the GTF file and that I can simply run:

gffread GRCm38.p6File.gtf -C -M -K -o Modified_GRCm38.p6File.gtf

and that this correctly excludes the non-coding regions and collapses all fully contained shorter coding sequences?

Thank you

gffread • 1.2k views

ADD COMMENT • link updated 5.9 years ago by Dave Carlson ★ 2.1k • written 5.9 years ago by Beginner ▴ 90

0

Do you know why this was done? Why would you collapse the "fully contained shorter coding sequences" of the GTF file and exclude the non-coding regions?

Not knowing the aims of the study in question, it's hard to say for sure. Presumably they were only interested in analyzing coding regions. As for the reasons for collapsing, perhaps it was to avoid processing duplicate and/or spurious annotations that are fully encompassed by other annotations.

ADD REPLY • link 5.9 years ago by Dave Carlson ★ 2.1k

0

Please use ADD REPLY/ADD COMMENT when responding to existing posts to keep threads logically organized. This comment should go under @Beginner's comment below.

ADD REPLY • link 5.9 years ago by GenoMax 152k

score 1 · Answer 1 · 2019-09-10

1

5.9 years ago

Dave Carlson ★ 2.1k

Usually, each FASTA assembly will be released with its own set of annotations (in GTF or GFF format), so I suspect that the protocol is simply specifying which annotation/assembly version they're talking about.

Your command looks fine, except there is an extra space between "-" and "M" (should be "-M" without a space).

Edit: Just noticed a second typo. There shouldn't be a space after the "_" in your output filename.
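
For reference, here is a minimal sketch of the corrected call, using the file names from the question. As far as I understand gffread's usage text, -C keeps only transcripts that have CDS features, -M clusters transcripts into loci and discards redundant ones, and -K (used with -M) also discards shorter transcripts that are fully contained in another. One thing I have added that is not in the original command: gffread writes GFF3 by default, so -T is included here on the assumption that you actually want GTF output to match the .gtf file name. Drop it if GFF3 is fine.

```shell
#!/bin/sh
set -eu

# File names taken from the question; adjust to your own paths.
in_gtf="GRCm38.p6File.gtf"
out_gtf="Modified_GRCm38.p6File.gtf"

# Only run if gffread is installed and the input file is present.
if command -v gffread >/dev/null 2>&1 && [ -f "$in_gtf" ]; then
  # -C: coding transcripts only; -M: cluster into loci, drop redundant transcripts
  # -K: with -M, also drop shorter, fully contained transcripts
  # -T: write GTF instead of the default GFF3 (my addition, see note above)
  gffread "$in_gtf" -C -M -K -T -o "$out_gtf"
  status="ran"
else
  echo "gffread or $in_gtf not found; nothing was run" >&2
  status="skipped"
fi
```

It is worth comparing the number of transcript records before and after to make sure the filtering did what you expect.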

ADD COMMENT • link 5.9 years ago by Dave Carlson ★ 2.1k

0

Thank you for the fast answer! I think the spaces arose from a copy-paste issue, and I have corrected them.

Do you know why this was done? Why would you collapse the "fully contained shorter coding sequences" of the GTF file and exclude the non-coding regions?

ADD REPLY • link 5.9 years ago by Beginner ▴ 90