How Do I Access And Query Entire Genome Sequences With R
5
6
Entering edit mode
14.7 years ago
John ▴ 790

Is there a package in R where I can load in a whole genome, and then say e.g. get.gene.sequence (input_genome, "gene_symbol"), or another tool that you'd use for this job?

gene r • 24k views
ADD COMMENT
15
Entering edit mode
14.7 years ago
Neilfws 49k

seqinR is a good option for R.

If you don't need the whole genome locally, you can fetch sequences from Ensembl using biomaRt:

library(biomaRt)
ensembl <- useMart("ensembl",dataset="hsapiens_gene_ensembl")
seq     <- getSequence(id = "A2M", type="hgnc_symbol", mart = ensembl, seqType = "transcript_exon_intron")

I would also consider one of the Bio* libraries for sequence retrieval and manipulation.

ADD COMMENT
0
Entering edit mode

Does seqinR work for newly-sequenced genomes? I mean genomes for non-model organisms which are not available in biomaRt. Thanks.

ADD REPLY
0
Entering edit mode

If you have sequence, it will load it.

ADD REPLY
0
Entering edit mode

I need to query hg19 reference genome and retrieve several 50 nucleotides bases from various positions (exonic, intronic, etc). For testing that if it works I use the following command but it returns empty result :

ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
result = getSequence(chromosome=2, start=86374868, end=86374877,type="entrezgene",seqType="gene_exon_intron", mart=ensembl)

I expect to see acatctcgca as a result. I think the problem is come from miss configuration in function parameters. Am I using the right tool for this problem?

ADD REPLY
0
Entering edit mode

People are unlikely to see your query here buried in the comments of an old question. You'll probably have better luck adding it as a new question.

ADD REPLY
0
Entering edit mode

Sorry, I am new to Biostars, I post my question as a new post here.

ADD REPLY
5
Entering edit mode
14.7 years ago

Fasta files are pretty easy to manipulate using the seqinR package, if you've got the memory to handle it. (~4GB for chr1)

library(seqinr)
seq = read.fasta("chr1.fa",seqtype="DNA")
geneSeq = seq[[1]][geneStart:geneStop]

There's probably a better way to do it, but this may get you started.

ADD COMMENT
1
Entering edit mode

thanks Chris, but that doesn't give me any of the annotations right? I want to say gene symbol "Rbcl" and if it is present then it returns the coding sequence for that gene.

ADD REPLY
0
Entering edit mode

1) It's pretty easy to slurp down all 18k gene names and locations from someplace like Santa Cruz, and store it in a file locally. This gets trickier if you're worried about aliases and such, in which case you'll want to do something like Neil or Pierre suggest and query a web service.

2) Credit where credit is due - Neil pointed me to the seqinr package the other day

ADD REPLY
0
Entering edit mode

is there a way to get sequences lengths as a data frame without reading in the whole file?

ADD REPLY
0
Entering edit mode

Jeremy - are you talking about the gene lengths? It's pretty easy to parse through the gene list from UCSC line by line and find just the length between first and last exon. I'd probably parse it with Ruby or Perl first, then read the results into R.

If you actually need the sequence, I'd probably change my recommendation to the BSGenome package (http://bioconductor.org/packages/2.3/bioc/html/BSgenome.html) in combination with the BioStrings package. It'll hold the whole genome in about 500mb which makes it easier to handle.

ADD REPLY
5
Entering edit mode
14.7 years ago

Ok, here is my java solution. It only uses some remote resources (the anonymous mysql server and the DAS server of the UCSC). My piece of code takes a list of refGene as its arguments and only prints the genes/geneomic-sequences to stdout but one could imagine it would store a pair(name,sequence) in memory.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class Gene2Seq
    {
    private static class DASHandler
        extends DefaultHandler
        {
        private boolean inDNA=false;
        @Override
        public void startElement(String uri, String localName, String qName,
                Attributes attributes) throws SAXException
            {
            inDNA=(qName.equals("DNA"));
            }
        @Override
        public void endElement(String uri, String localName, String qName)
                throws SAXException
            {
            inDNA=false;
            }
        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException
            {
            if(inDNA) System.out.print(new String(ch, start, length).replace("\n", ""));
            }
        }

    public static void main(String[] args) throws Throwable
        {
        //put the JDBC driver for mysql in the $CLASSPATH
        Class.forName("com.mysql.jdbc.Driver");
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://genome-mysql.cse.ucsc.edu/hg19",
                "genome", ""
                );
        SAXParserFactory f=SAXParserFactory.newInstance();
        SAXParser parser=f.newSAXParser();
        PreparedStatement pstmt=con.prepareStatement(
                "select name,chrom,txStart,txEnd from refGene where name2=?"
                );
        for(String name: args)
            {
            pstmt.setString(1, name);
            ResultSet row=pstmt.executeQuery();
            while(row.next())
                {
                System.out.println(">"+row.getString(1)+"|"+row.getString(2)+":"+row.getInt(3)+"-"+row.getInt(4));
                parser.parse(
                        "http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment="+row.getString(2)+":"+(row.getInt(3)+1)+","+(row.getInt(4)+1),
                        new DASHandler());
                System.out.println();
                }
            row.close();
            }

        pstmt.close();
        con.close();
        }
    }

output with EIF4G1:

>NM_004953|chr3:184038262-184053145
agatgggctgaaagtggaactcaaggggtttctggcacctacctacctgcttcccgctggggggtggggagttggcccag....
>NM_182917|chr3:184032970-184053145
tcctcgacggccgccgcccgcctggccttttagggcctgactcccgcccttcctggcctacactcctgggcggcggcagg....
>NM_198241|chr3:184032355-184053145
gaagcggtggccgccgagcgggatctgtgcggggagccggaaatggt....
ADD COMMENT
5
Entering edit mode
14.7 years ago
Michael 55k

R has gotten a lot of sequence handling and searching routines recently that make it a good choice also for sequence analysis in bioinformatics.

Have a look at the BSGenome package in BioConductor. It is meant to hold the genome sequence and allow fast sequence searches in the genome sequence. It does not contain real genome annotations though. There are readymade packages for a bunch of eukaryote genomes you can download, but of course your organism has to be in the list. This can be used together with the BioStrings package that allows for fast sequence searches and manipulation.

It could be a good team with biomaRt (already mentioned, I really recommend this) if you want further local sequence processing or if you already have the gene start/end coordinates, or alternatively transfer the coordinates only and cut these out of the BSGenome.

ADD COMMENT
2
Entering edit mode
14.7 years ago
Simon Penel ▴ 20

Hello,

Concerning the seqinR package I would also consider the "query" function which allows you to query various databases as EMBL, GenBank, Uniprot, some Ensembl genomes and others databases structured under ACNUC.

library(seqinr)

To check which databases are available, type:

choosebank(infobank=TRUE)

For example if you want to query the ensembl human genome for sequences in which the field hgnc is a2m :

choosebank("human")
query("mylist","k=hgnc:a2m")

Note that in the case of Ensembl, cross-references may be used as keywords:

For example in the selected sequence, the annotations are

S12_1.PE362                
Location/Qualifiers    (length=4425
bp) FT   CDS          
join(complement(9268360..9268445),complement(9265956..9266139),
FT                
complement(9264973..9265132),complement(9264755..9264807),
FT                
complement(9262910..9262930),complement(9262463..9262631),
FT                
complement(9261917..9262001),complement(9260120..9260240),
FT                
complement(9259087..9259201),complement(9258832..9258941),
FT                
complement(9256835..9256996),complement(9254043..9254270),
FT                
complement(9253740..9253803),complement(9251977..9252119),
FT                
complement(9251203..9251352),complement(9248135..9248296),
FT                
complement(9247569..9247680),complement(9246061..9246175),
FT                
complement(9243797..9244025),complement(9242952..9243078),
FT                
complement(9242498..9242619),complement(9241796..9241847),
FT                
complement(9232690..9232773),complement(9232235..9232411),
FT                
complement(9231840..9231927),complement(9230297..9230453),
FT                
complement(9229942..9230016),complement(9229352..9229532),
FT                
complement(9227156..9227379),complement(9225249..9225467),
FT                
complement(9224955..9225082),complement(9223084..9223174),
FT                
complement(9222341..9222409),complement(9221336..9221438),
FT                
complement(9220779..9220820),complement(9220419..9220435))
FT                
/gene="ENSG00000175899" FT          
/protein_id="ENSP00000323929" FT    
/note="transcript_id=ENST00000318602"
FT                
/db_xref="HGNC:A2M" FT              
/db_xref="UCSC:uc001qvk.1" FT       
/db_xref="CCDS:CCDS44827.1" FT      
/db_xref="HPA:CAB017621" FT         
/db_xref="HPA:CAB017621" FT         
/db_xref="HPA:HPA002265" FT         
/db_xref="HPA:HPA002265" FT         
/db_xref="WikiGene:A2M" FT          
/db_xref="Uniprot/SWISSPROT:A2MG_HUMAN"
FT                
/db_xref="RefSeq_peptide:NP_000005.2"
FT                
/db_xref="RefSeq_dna:NM_000014.4" FT
/db_xref="Uniprot/SPTREMBL:C9J773_HUMAN"
FT                
/db_xref="Uniprot/SPTREMBL:Q9BQ22_HUMAN"
FT                
/db_xref="EntrezGene:A2M" FT        
/db_xref="EMBL:AB209614" FT         
/db_xref="EMBL:AC007436" FT         
/db_xref="EMBL:AF109189" FT         
/db_xref="EMBL:AF349032" FT         
/db_xref="EMBL:AF349033" FT         
/db_xref="EMBL:AY591530" FT         
/db_xref="EMBL:BC026246" FT         
/db_xref="EMBL:BC040071" FT         
/db_xref="EMBL:CR749334" FT         
/db_xref="EMBL:M11313"

HGNC:A2M, UCSC:uc001qvk.1,CCDS:CCDS44827.1,HPA:CAB017621,etc. are keywords which may be used to retrieve the sequence

Now to check the sequence list ( in this case there is only 1 sequence in the list) :

mylist$req
[[1]]
          name         length          frame         ncbicg 
"HS12_1.PE362"         "4425"            "0"            "1"

to get the sequence data:

seq<-sapply(mylist$req[1:1],getSequence, as.string = FALSE)

to save data in fasta format:

write.fasta(sequences=seq,names="my_sequence" , file.out = "myseq.fasta")

You can get as well whole chromsome sequences, extract data in several formats, extract fragments of sequences, translate into protein.

You may find more information on the seqinR page here

I hope this may help you

Simon

ADD COMMENT
0
Entering edit mode

Using seqinr, can I give let's say ensembl gene id or just gene symbol and extract the gene as well as 20kb-50 kb upstream as well as downstream of that gene in multiple species such as chicken, human, mouse?

ADD REPLY

Login before adding your answer.

Traffic: 2357 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6