Question

Follow-Up: Still Having Trouble Importing Annotation Into R

1

Entering edit mode

12.0 years ago

jobinv ★ 1.1k

I previously asked a question about this here: Follow-up question: how to import annotation file into R? and received very helpful tips from Olivier (to use AILUN) and David Westergaard (to use BioConductor). However, I'm still having trouble, and hope someone can help me a bit more with this.

Background: I am currently working on annotating features in our microarray study. I had previously referred to it as a human microarray, but apparently it was Agilent Rat Whole Genome 4131F all along, as I discovered when I uploaded the probe id's to AILUN (thanks for the tip, Olivier). Our group works with xenograft models, and uses both rat arrays and human arrays, so I had just misunderstood which array this particular experiment had been from.

Nevertheless, I'm having quite a bit of trouble getting the 43,018 probes mapped to gene symbols (although, granted, a large number of these probes are control probes like DarkCorner and GE_BrightCorner).

My first attempt was to use Agilent's own annotation files, but these proved to be very difficult for me to import into R. The reason is that some probes have empty entries in some columns (example: A_32_P68142), which confuses my read.table() command in R (I even tried providing na.strings="" and comment.char="" arguments, nothing helped), as I pointed out in my previous post. Olivier also suggested that I try cleaning up the table with UNIX commands, but I am not familiar enough with these to manage this.

My second attempt was to go via BioConductor's annotation package, but they only have a package for 4131A, not 4131F. I hoped that this would be good enough, but it only mapped 24,552 out of 45,018 probes. Examples of probes not mapped: A_44_P330643, A_44_P549509, A_43_P14989

My third attempt was to use to annotation file from AILUN itself. While they say that they have a 99.96% match, they still seem to map only 21,498 out of 45,018 probes. Examples of probes not mapped: A_43_P14989, A_44_P553249, A_44_P260580.

Does anyone have any tips for me?

annotation agilent • 4.7k views

ADD COMMENT • link 12.0 years ago by jobinv ★ 1.1k

score 1 · Answer 1 · 2013-08-04

1

Entering edit mode

12.0 years ago

Michael 56k

Try to get an overview of what is wrong in the file. This also sheds some light on the competences of the people making these files. The files have seemingly undergone manual editing without taking care of the correct number of columns. The following perl code lists the lines which do not have 14 items:

 cat 014850_D_AA_20070207.txt | perl -naF"\t" -e '$i++; s/\t/<tab>/g; print $c++." : line $i has ". scalar @F ." entries \n$_\n" if  @F != 14' | less

With the following result:

[...]    
559 : line 40908 has 2 entries 
A_24_P937084<tab>A_24_P937084

560 : line 40936 has 12 entries 
A_24_P938135<tab>Z25424<tab><tab>Z25424<tab>Hs.515032<tab><tab><tab><tab><tab><tab><tab>H.sapiens protein-serine/threonine kinase gene, complete CDS. [Z25424]

Now, you could complain to Agilent that for the money you pay them, they should provide you with correct files, or repair yourself, trusting them that the files do not contain more errors.

You can make a file with all correct lines like so:

 cat 014850_D_AA_20070207.txt | perl -naF"\t" -e"print if  @F == 14" > ok.txt

And one with the rest accordingly:

 cat 014850_D_AA_20070207.txt | perl -naF"\t" -e"print if  @F < 14" > notok.txt

and rescue information in excel manually.

The result ok.txt still contains ' - default quote character for read.table - like in 3'-UTR and # - default comment char - (exon # 5), so in order to read all columns into R, you should use the following settings:

Update:

any.biological.annotation = read.table("ok.txt", sep="\t", quote="", comment.char="", na.strings="", header=TRUE)

This might be a good preset for parsing most biological annotations in R. Never use the fill=T option (unless you are prepared to shoot yourself in the foot), because that will just cause errors to go unnoticed.

ADD COMMENT • link 12.0 years ago by Michael 56k

0

Entering edit mode

Thanks for the extensive suggestion. I might eventually make that complaint letter to Agilent like you suggest, but to begin with, I just need to make a bit of progress with this particular study. As such, I first tried your suggested command (I outputted to a differently named file, otherwise identical).

cat 014850_D_AA_20070207.txt | perl -naF"\t" -e"print if  @F == 14" > annotation_features.txt

However, when I run the following command in R

 annotation_features <- read.table("../annotation/annotation_features.txt",sep="\t",header=T)

I still seem to get the error message

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 267 did not have 14 elements

It seems I'm still stuck :(

ADD REPLY • link 12.0 years ago by jobinv ★ 1.1k

0

Entering edit mode

oh, sorry, I didn't check if that actually resolves things, I was sure it should. I will try and see what this is caused by. In the meantime, could you try runnning the original file through dos2unix and repeat, just in case?

ADD REPLY • link 12.0 years ago by Michael 56k

0

Entering edit mode

Update 1: I contacted Agilent, and they sent me a newer annotation file that should work better.

Update 2: I ran the following commands:

dos2unix 014879_D_AA_20130204.txt 
cat 014879_D_AA_20130204.txt | perl -naF"\t" -e"print if @F == 14" > annotation_features.txt

And in R:

annotation_features <- read.table("../annotation/annotation_features.txt",sep="\t",header=T)

Getting the error message

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 150 did not have 14 elements

Update 3: I opened the file annotation_features.txt and removed all columns except for ProbeID and GeneSymbol. This time, the command ran to completion with no error message, which seems to suggest that it might be the free text columns that might be the issue. Although, it seems a number of probes are still not being annotated (examples: A_44_P898256, A_44_P187524, A_43_P14989). It even seems like there is quite a large proportion that are not being annotated:

sumis.na(data$genes))/length(data$genes) 
[1] 0.3938869

Should I have any expectation that I should ultimately manage to annotate all probes that are not DarkCorner or GE_BrightCorner?

ADD REPLY • link 12.0 years ago by jobinv ★ 1.1k

0

Entering edit mode

I think I got the solution for you, please look at the update in the answer.

ADD REPLY • link 12.0 years ago by Michael 56k

0

Entering edit mode

It worked! Thank you so much :)

ADD REPLY • link 12.0 years ago by jobinv ★ 1.1k

score 0 · Answer 2 · 2013-08-08

I received an answer from Agilent when I emailed them:

Gene Name or Gene Symbol is not always available for all our probes, but the other columns in the file you sent are also annotations : accession numbers (from GeneBank, Ensembl, etc…) and description.
Some customers make the mistake of thinking that every transcript has its own GeneName or GeneSymbol, which is just not true. Also keep in mind that this Rat design was created in 2007, so some the probes that were selected back then based on public databases have since then been deemed inappropriate, or classified in most recent databases as irrelevant, etc… This explains the "unknown" status most of the time I believe. Even if we annotate probes regularly, we never take un-annotated probes out of a design, so if such a probe is in a given design, it just gets a different annotation in the next round, but it stays on the design. If customers want a more recent design, they should look into design ID 028282 or 028279." The arrays are often designed before all the genes are known for an organism. This can for example be done using sequencing data from mRNA samples, as you see for several probes on the rat array you use. Later versions of the design will have better and better annotations, as progress is made in the research community. For the version you work with, 22.884 probes has been assigned a genename (this is all the genenames that has been available from GeneBankl, Ensembl etc). How you choose to use the data from the rest of the probes is really up to you. By far most of the probes has the accession numbers from GeneBank etc listed, even if the genename is unknown. One way of annotating is then to use the accession number instead of the genename for those probes. The probes labeled just "unknown"/no info, I would probably skip in further analysis.