Getting Annotation Data From A Gff File
1
0
Entering edit mode
11.1 years ago
robjohn7000 ▴ 110

Hi,

I' m trying to get annotation data from a gff file using rtracklayer, and got some errors that I can't figure out:

library(rtracklayer)
library(GenomicRanges)
library(Rsamtools)

egff <- import.gff("./data/NC_016810.gff", asRangedData=FALSE)
seqlengths(gff) <- end(ranges(gff[which(elementMetadata(gff)[,"type"]=="region"),]))

Errors:

 Error in .normargSeqlengths(value, seqnames(x)) : 
 length of supplied 'seqlengths' must equal the number of sequences

Can anyone please help me understand the reason for the errors?

r bioconductor annotation • 5.3k views
ADD COMMENT
0
Entering edit mode

what's the output of length(which(elementMetadata(gff)$type=="region")) and length(seqlengths(gff))? They're apparently not the same.

ADD REPLY
0
Entering edit mode

typo? s/egff/gff/

ADD REPLY
0
Entering edit mode
11.1 years ago
Michael 55k

You are trying to set the seqlengths attribute of a GRanges object, however from the documentation:

seqlengths

an integer vector named with the sequence names and containing the lengths (or NA) for each level(seqnames).

This attributes refers to the "landmark" sequences (e.g. chromosomes, scaffolds, contigs), not all sequence features in the GFF file ("region"). There is only one chromosome for NC_01681 according to the GenBank file that is:

LOCUS       NC_016810            4878012 bp    DNA     linear   CON 20-AUG-2013
DEFINITION  Salmonella enterica subsp. enterica serovar Typhimurium str.
            SL1344, complete genome.

So, the correct call should be seqlengths(gff) <- c(4878012).

To access or set the length of each feature, use width(gff) or width(gff)<-.

ADD COMMENT
0
Entering edit mode

problem solved! Thanks vvery much Michael for the code and the exaplanation. I'm surprised how c(4878012) did the trick; though I know 'c' combines values.

ADD REPLY
0
Entering edit mode

Hi,

I am having the same problem with very similar code (that I learnt in a BioC RNAseq workshop). My code is:

gff <- import.gff("dmel-all-r6.05.gff.gz", asRangedData=F)
seqlengths(gff) <- end(ranges(gff[which(elementMetadata(gff)[,"type"]=="chromosome"),]))

I then get the same error message as the previous questioner:

Error in .normargSeqlengths(value, seqnames(x)) :
  length of supplied 'seqlengths' must equal the number of sequences

The problem I'm now having, after seeing the solution above, is that I don't understand where the '4878012' length is coming from, or how to work out what the length should be in my case. Would anyone be willing to give a little more explanation?

Thanks very much in advance.

ADD REPLY
0
Entering edit mode

I guess the error message is justified. Did you check what both sides of the assignment yield, just compare their length? Possible reasons: you need to look into the GFF file, there is not requirement that all or any chromosomes are defined in the GFF, possibly not all of the landmarks are of type chromosome (imagine mitochondrial genome, unplaced scaffolds, etc.) then your code would fail.

ADD REPLY
0
Entering edit mode

Hi,

I compared the lengths of each side of the assignment:

length(seqlengths(gff))
[1] 1871
length(end(ranges(gff[which(elementMetadata(gff)[,"type"]=="chromosome"),])))
[1] 2

Obviously this makes the error message in itself very clear, but I'm not sure what information I need to figure out the correct code.

I also tried to get a handle on what the properties of the .gff file were:

seqnames(gff)

factor-Rle of length 12818678 with 1871 runs
  Lengths:                               10 ...
  Values :                  211000022278031 ...
Levels(1871): 211000022278031 211000022278032 ... Y_mapped_Scaffold_9_D1573
ranges(gff[which(elementMetadata(gff)[,"type"]=="chromosome"),])

IRanges of length 2
    start   end width
[1]     1 19517 19517
[2]     1 19524 19524

I'm sorry to say that I am still stuck. I have tried setting seqlengths(gff) to some of these lengths (2, 1871, 12818678) but none of these work. I just don't understand which length I should be using, and I can't see where to look to find the correct figure.

I am slightly concerned that there is something odd about the .gff file I was recommended to use, because:

seqinfo(gff)
Seqinfo object with 1871 sequences from an unspecified genome; no seqlengths:
  seqnames                        seqlengths isCircular genome
  211000022278031                       <NA>       <NA>   <NA>
  211000022278032                       <NA>       <NA>   <NA>
  211000022278033                       <NA>       <NA>   <NA>
  211000022278038                       <NA>       <NA>   <NA>
  211000022278047                       <NA>       <NA>   <NA>
  ..                                    ..        ..    ..
  Y_mapped_Scaffold_30_D1720            <NA>       <NA>   <NA>
  Y_mapped_Scaffold_34_D1584            <NA>       <NA>   <NA>
  Y_mapped_Scaffold_53_D1765            <NA>       <NA>   <NA>
  Y_mapped_Scaffold_5_D1748_D1610       <NA>       <NA>   <NA>
  Y_mapped_Scaffold_9_D1573             <NA>       <NA>   <NA>

So there doesn't appear to be any seqlengths? Can this be correct?

Thanks again!

ADD REPLY

Login before adding your answer.

Traffic: 2419 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6