Hi All,
I am new to R and am struggling with the following.
I generated a list with the call MusTr <- read.fasta(file = "Mus_musculus.GRCm38.cdna.all.fa", as.string = TRUE),
which looks like this:
$ENSMUST00000177564.1
[1] "atcggagggatacgag"
attr(,"name")
[1] "ENSMUST00000177564.1"
attr(,"Annot")
[1] ">ENSMUST00000177564.1 cdna chromosome:GRCm38:14:54122226:54122241:1 gene:ENSMUSG00000096176.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:Trdd2 description:T cell receptor delta diversity 2 [Source:MGI Symbol;Acc:MGI:4439546]"
attr(,"class")
[1] "SeqFastadna"
$ENSMUST00000196221.1
[1] "atggcatat"
attr(,"name")
[1] "ENSMUST00000196221.1"
attr(,"Annot")
[1] ">ENSMUST00000196221.1 cdna chromosome:GRCm38:14:54113468:54113476:1 gene:ENSMUSG00000096749.2 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:Trdd1 description:T cell receptor delta diversity 1 [Source:MGI Symbol;Acc:MGI:4439547]"
attr(,"class")
[1] "SeqFastadna"
$ENSMUST00000179664.1
[1] "atggcatatca"
attr(,"name")
[1] "ENSMUST00000179664.1"
attr(,"Annot")
[1] ">ENSMUST00000179664.1 cdna chromosome:GRCm38:14:54113468:54113478:1 gene:ENSMUSG00000096749.2 gene_biotype:TR_D_gene transcript_biotype:processed_transcript gene_symbol:Trdd1 description:T cell receptor delta diversity 1 [Source:MGI Symbol;Acc:MGI:4439547]"
attr(,"class")
[1] "SeqFastadna"
$ENSMUST00000178537.1
[1] "gggacagggggc"
attr(,"name")
[1] "ENSMUST00000178537.1"
attr(,"Annot")
[1] ">ENSMUST00000178537.1 cdna chromosome:GRCm38:6:41533201:41533212:1 gene:ENSMUSG00000095668.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:Trbd1 description:T cell receptor beta, D region 1 [Source:MGI Symbol;Acc:MGI:4439571]"
attr(,"class")
[1] "SeqFastadna"
I would like to retrieve the attributes of each element (e.g. as a vector). For a single element of the list, attr works (here retrieving the "Annot" attribute):
attr(MusTr$ENSMUST00000196221.1, "Annot", exact = FALSE)
[1] ">ENSMUST00000196221.1 cdna chromosome:GRCm38:14:54113468:54113476:1 gene:ENSMUSG00000096749.2 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:Trdd1 description:T cell receptor delta diversity 1 [Source:MGI Symbol;Acc:MGI:4439547]"
How do I achieve the same for multiple/all elements of the list? Thanks in advance for any suggestions.
The Ensembl Perl API is the tool to use for this kind of job. Is there any particular reason you need to use R for this? If you have to, check the biomaRt package or the mygene package. They give you access to some annotations, but they are not as flexible or comprehensive as the Ensembl API.
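If you do stay in R, one way to get an attribute from every element is to apply attr over the list. This is only a minimal sketch: the two-element MusTr below is a hand-built stand-in shaped like the read.fasta output in the question (with shortened annotation strings), so that the example runs without the FASTA file. The gene_symbol extraction at the end is an illustrative extra, not something the original post asked for.

```r
# Hypothetical stand-in for the list returned by read.fasta(..., as.string = TRUE);
# the Annot strings are shortened for readability.
MusTr <- list(
  "ENSMUST00000196221.1" = structure(
    "atggcatat",
    name  = "ENSMUST00000196221.1",
    Annot = ">ENSMUST00000196221.1 cdna gene_symbol:Trdd1 description:T cell receptor delta diversity 1",
    class = "SeqFastadna"
  ),
  "ENSMUST00000178537.1" = structure(
    "gggacagggggc",
    name  = "ENSMUST00000178537.1",
    Annot = ">ENSMUST00000178537.1 cdna gene_symbol:Trbd1 description:T cell receptor beta, D region 1",
    class = "SeqFastadna"
  )
)

# Apply attr() to every element; "Annot" is passed on as the `which` argument.
# vapply() checks that each result is a single string and returns a named vector.
annots <- vapply(MusTr, attr, character(1), "Annot")

# The same pattern works for any other attribute, e.g. "name".
ids <- vapply(MusTr, attr, character(1), "name")

# Illustrative extra: pull the gene symbol out of each annotation line.
symbols <- sub(".*gene_symbol:([^ ]+).*", "\\1", annots)
```

To restrict this to some elements, subset the list first, e.g. vapply(MusTr[1:2], attr, character(1), "Annot"). The seqinr package also provides getAnnot() and getName(), which should do the same job on such a list, though they are not used here so that the sketch needs base R only.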
I would (potentially) find R helpful for manipulating FASTA files by converting them to data frames and then exporting them easily with write.fasta. Otherwise, no particular reason. Thanks for your suggestion.
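The data-frame round trip described above could look roughly like the following. It is a sketch under assumptions: MusTr is again a small hand-built stand-in for the read.fasta output, the column names (name, annot, seq) and the temporary output file are made up for illustration, and the export step only runs if seqinr is installed.

```r
# Hypothetical stand-in for the list returned by read.fasta(..., as.string = TRUE).
MusTr <- list(
  "ENSMUST00000196221.1" = structure(
    "atggcatat",
    name  = "ENSMUST00000196221.1",
    Annot = ">ENSMUST00000196221.1 cdna gene_symbol:Trdd1",
    class = "SeqFastadna"
  ),
  "ENSMUST00000179664.1" = structure(
    "atggcatatca",
    name  = "ENSMUST00000179664.1",
    Annot = ">ENSMUST00000179664.1 cdna gene_symbol:Trdd1",
    class = "SeqFastadna"
  )
)

# One row per sequence: identifier, full annotation line, and the sequence string.
fasta_df <- data.frame(
  name  = vapply(MusTr, attr, character(1), "name"),
  annot = vapply(MusTr, attr, character(1), "Annot"),
  seq   = vapply(MusTr, as.character, character(1)),  # as.character() drops the attributes
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Ordinary data-frame manipulation, e.g. keep sequences longer than 9 bases.
long_df <- fasta_df[nchar(fasta_df$seq) > 9, ]

# Export back to FASTA with seqinr, if it is available.
if (requireNamespace("seqinr", quietly = TRUE)) {
  out_file <- tempfile(fileext = ".fa")  # illustrative output path
  seqinr::write.fasta(
    sequences = as.list(long_df$seq),
    names     = long_df$name,
    file.out  = out_file,
    as.string = TRUE
  )
}
```

Note that write.fasta builds the header lines from the names argument, so the full annotation is not written back unless you pass it in place of the identifier (without the leading ">").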