Question

How Do I Access And Query Entire Genome Sequences With R

6

Entering edit mode

14.7 years ago

John ▴ 790

Is there a package in R where I can load in a whole genome, and then say e.g. get.gene.sequence (input_genome, "gene_symbol"), or another tool that you'd use for this job?

gene r • 24k views

ADD COMMENT • link updated 13 months ago by Ram 44k • written 14.7 years ago by John ▴ 790

Ram · Answer 1 · 2010-03-19

15

Entering edit mode

14.7 years ago

Neilfws 49k

seqinR is a good option for R.

If you don't need the whole genome locally, you can fetch sequences from Ensembl using biomaRt:

library(biomaRt)
ensembl <- useMart("ensembl",dataset="hsapiens_gene_ensembl")
seq     <- getSequence(id = "A2M", type="hgnc_symbol", mart = ensembl, seqType = "transcript_exon_intron")

I would also consider one of the Bio* libraries for sequence retrieval and manipulation.

ADD COMMENT • link updated 13 months ago by Ram 44k • written 14.7 years ago by Neilfws 49k

0

Entering edit mode

Does seqinR work for newly-sequenced genomes? I mean genomes for non-model organisms which are not available in biomaRt. Thanks.

ADD REPLY • link 13.7 years ago by Dejian ★ 1.3k

0

Entering edit mode

If you have sequence, it will load it.

ADD REPLY • link 13.7 years ago by Neilfws 49k

0

Entering edit mode

I need to query hg19 reference genome and retrieve several 50 nucleotides bases from various positions (exonic, intronic, etc). For testing that if it works I use the following command but it returns empty result :

ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
result = getSequence(chromosome=2, start=86374868, end=86374877,type="entrezgene",seqType="gene_exon_intron", mart=ensembl)

I expect to see acatctcgca as a result. I think the problem is come from miss configuration in function parameters. Am I using the right tool for this problem?

ADD REPLY • link updated 13 months ago by Ram 44k • written 9.1 years ago by amir.jebelli ▴ 50

0

Entering edit mode

People are unlikely to see your query here buried in the comments of an old question. You'll probably have better luck adding it as a new question.

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.1 years ago by Chris Miller 22k

0

Entering edit mode

Sorry, I am new to Biostars, I post my question as a new post here.

ADD REPLY • link 9.1 years ago by amir.jebelli ▴ 50

Ram · Answer 2 · 2010-03-18

5

Entering edit mode

14.7 years ago

Chris Miller 22k

Fasta files are pretty easy to manipulate using the seqinR package, if you've got the memory to handle it. (~4GB for chr1)

library(seqinr)
seq = read.fasta("chr1.fa",seqtype="DNA")
geneSeq = seq[[1]][geneStart:geneStop]

There's probably a better way to do it, but this may get you started.

ADD COMMENT • link updated 13 months ago by Ram 44k • written 14.7 years ago by Chris Miller 22k

1

Entering edit mode

thanks Chris, but that doesn't give me any of the annotations right? I want to say gene symbol "Rbcl" and if it is present then it returns the coding sequence for that gene.

ADD REPLY • link 14.7 years ago by John ▴ 790

0

Entering edit mode

1) It's pretty easy to slurp down all 18k gene names and locations from someplace like Santa Cruz, and store it in a file locally. This gets trickier if you're worried about aliases and such, in which case you'll want to do something like Neil or Pierre suggest and query a web service.

2) Credit where credit is due - Neil pointed me to the seqinr package the other day

ADD REPLY • link 14.7 years ago by Chris Miller 22k

0

Entering edit mode

is there a way to get sequences lengths as a data frame without reading in the whole file?

ADD REPLY • link 14.7 years ago by Jeremy Leipzig 22k

0

Entering edit mode

Jeremy - are you talking about the gene lengths? It's pretty easy to parse through the gene list from UCSC line by line and find just the length between first and last exon. I'd probably parse it with Ruby or Perl first, then read the results into R.

If you actually need the sequence, I'd probably change my recommendation to the BSGenome package (http://bioconductor.org/packages/2.3/bioc/html/BSgenome.html) in combination with the BioStrings package. It'll hold the whole genome in about 500mb which makes it easier to handle.

ADD REPLY • link 14.7 years ago by Chris Miller 22k

Ram · Answer 3 · 2010-03-18

Ok, here is my java solution. It only uses some remote resources (the anonymous mysql server and the DAS server of the UCSC). My piece of code takes a list of refGene as its arguments and only prints the genes/geneomic-sequences to stdout but one could imagine it would store a pair(name,sequence) in memory.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class Gene2Seq
    {
    private static class DASHandler
        extends DefaultHandler
        {
        private boolean inDNA=false;
        @Override
        public void startElement(String uri, String localName, String qName,
                Attributes attributes) throws SAXException
            {
            inDNA=(qName.equals("DNA"));
            }
        @Override
        public void endElement(String uri, String localName, String qName)
                throws SAXException
            {
            inDNA=false;
            }
        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException
            {
            if(inDNA) System.out.print(new String(ch, start, length).replace("\n", ""));
            }
        }

    public static void main(String[] args) throws Throwable
        {
        //put the JDBC driver for mysql in the $CLASSPATH
        Class.forName("com.mysql.jdbc.Driver");
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://genome-mysql.cse.ucsc.edu/hg19",
                "genome", ""
                );
        SAXParserFactory f=SAXParserFactory.newInstance();
        SAXParser parser=f.newSAXParser();
        PreparedStatement pstmt=con.prepareStatement(
                "select name,chrom,txStart,txEnd from refGene where name2=?"
                );
        for(String name: args)
            {
            pstmt.setString(1, name);
            ResultSet row=pstmt.executeQuery();
            while(row.next())
                {
                System.out.println(">"+row.getString(1)+"|"+row.getString(2)+":"+row.getInt(3)+"-"+row.getInt(4));
                parser.parse(
                        "http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment="+row.getString(2)+":"+(row.getInt(3)+1)+","+(row.getInt(4)+1),
                        new DASHandler());
                System.out.println();
                }
            row.close();
            }

        pstmt.close();
        con.close();
        }
    }

output with EIF4G1:

>NM_004953|chr3:184038262-184053145
agatgggctgaaagtggaactcaaggggtttctggcacctacctacctgcttcccgctggggggtggggagttggcccag....
>NM_182917|chr3:184032970-184053145
tcctcgacggccgccgcccgcctggccttttagggcctgactcccgcccttcctggcctacactcctgggcggcggcagg....
>NM_198241|chr3:184032355-184053145
gaagcggtggccgccgagcgggatctgtgcggggagccggaaatggt....

Ram · Answer 4 · 2010-03-19

R has gotten a lot of sequence handling and searching routines recently that make it a good choice also for sequence analysis in bioinformatics.

Have a look at the BSGenome package in BioConductor. It is meant to hold the genome sequence and allow fast sequence searches in the genome sequence. It does not contain real genome annotations though. There are readymade packages for a bunch of eukaryote genomes you can download, but of course your organism has to be in the list. This can be used together with the BioStrings package that allows for fast sequence searches and manipulation.

It could be a good team with biomaRt (already mentioned, I really recommend this) if you want further local sequence processing or if you already have the gene start/end coordinates, or alternatively transfer the coordinates only and cut these out of the BSGenome.

Ram · Answer 5 · 2010-03-24

Hello,

Concerning the seqinR package I would also consider the "query" function which allows you to query various databases as EMBL, GenBank, Uniprot, some Ensembl genomes and others databases structured under ACNUC.

library(seqinr)

To check which databases are available, type:

choosebank(infobank=TRUE)

For example if you want to query the ensembl human genome for sequences in which the field hgnc is a2m :

choosebank("human")
query("mylist","k=hgnc:a2m")

Note that in the case of Ensembl, cross-references may be used as keywords:

For example in the selected sequence, the annotations are

S12_1.PE362                
Location/Qualifiers    (length=4425
bp) FT   CDS          
join(complement(9268360..9268445),complement(9265956..9266139),
FT                
complement(9264973..9265132),complement(9264755..9264807),
FT                
complement(9262910..9262930),complement(9262463..9262631),
FT                
complement(9261917..9262001),complement(9260120..9260240),
FT                
complement(9259087..9259201),complement(9258832..9258941),
FT                
complement(9256835..9256996),complement(9254043..9254270),
FT                
complement(9253740..9253803),complement(9251977..9252119),
FT                
complement(9251203..9251352),complement(9248135..9248296),
FT                
complement(9247569..9247680),complement(9246061..9246175),
FT                
complement(9243797..9244025),complement(9242952..9243078),
FT                
complement(9242498..9242619),complement(9241796..9241847),
FT                
complement(9232690..9232773),complement(9232235..9232411),
FT                
complement(9231840..9231927),complement(9230297..9230453),
FT                
complement(9229942..9230016),complement(9229352..9229532),
FT                
complement(9227156..9227379),complement(9225249..9225467),
FT                
complement(9224955..9225082),complement(9223084..9223174),
FT                
complement(9222341..9222409),complement(9221336..9221438),
FT                
complement(9220779..9220820),complement(9220419..9220435))
FT                
/gene="ENSG00000175899" FT          
/protein_id="ENSP00000323929" FT    
/note="transcript_id=ENST00000318602"
FT                
/db_xref="HGNC:A2M" FT              
/db_xref="UCSC:uc001qvk.1" FT       
/db_xref="CCDS:CCDS44827.1" FT      
/db_xref="HPA:CAB017621" FT         
/db_xref="HPA:CAB017621" FT         
/db_xref="HPA:HPA002265" FT         
/db_xref="HPA:HPA002265" FT         
/db_xref="WikiGene:A2M" FT          
/db_xref="Uniprot/SWISSPROT:A2MG_HUMAN"
FT                
/db_xref="RefSeq_peptide:NP_000005.2"
FT                
/db_xref="RefSeq_dna:NM_000014.4" FT
/db_xref="Uniprot/SPTREMBL:C9J773_HUMAN"
FT                
/db_xref="Uniprot/SPTREMBL:Q9BQ22_HUMAN"
FT                
/db_xref="EntrezGene:A2M" FT        
/db_xref="EMBL:AB209614" FT         
/db_xref="EMBL:AC007436" FT         
/db_xref="EMBL:AF109189" FT         
/db_xref="EMBL:AF349032" FT         
/db_xref="EMBL:AF349033" FT         
/db_xref="EMBL:AY591530" FT         
/db_xref="EMBL:BC026246" FT         
/db_xref="EMBL:BC040071" FT         
/db_xref="EMBL:CR749334" FT         
/db_xref="EMBL:M11313"

HGNC:A2M, UCSC:uc001qvk.1,CCDS:CCDS44827.1,HPA:CAB017621,etc. are keywords which may be used to retrieve the sequence

Now to check the sequence list ( in this case there is only 1 sequence in the list) :

mylist$req
[[1]]
          name         length          frame         ncbicg 
"HS12_1.PE362"         "4425"            "0"            "1"

to get the sequence data:

seq<-sapply(mylist$req[1:1],getSequence, as.string = FALSE)

to save data in fasta format:

write.fasta(sequences=seq,names="my_sequence" , file.out = "myseq.fasta")

You can get as well whole chromsome sequences, extract data in several formats, extract fragments of sequences, translate into protein.

You may find more information on the seqinR page here

I hope this may help you

Simon