Is there a package in R where I can load in a whole genome, and then say e.g. get.gene.sequence (input_genome, "gene_symbol")
, or another tool that you'd use for this job?
Is there a package in R where I can load in a whole genome, and then say e.g. get.gene.sequence (input_genome, "gene_symbol")
, or another tool that you'd use for this job?
seqinR is a good option for R.
If you don't need the whole genome locally, you can fetch sequences from Ensembl using biomaRt:
library(biomaRt)
ensembl <- useMart("ensembl",dataset="hsapiens_gene_ensembl")
seq <- getSequence(id = "A2M", type="hgnc_symbol", mart = ensembl, seqType = "transcript_exon_intron")
I would also consider one of the Bio* libraries for sequence retrieval and manipulation.
Fasta files are pretty easy to manipulate using the seqinR package, if you've got the memory to handle it. (~4GB for chr1)
library(seqinr)
seq = read.fasta("chr1.fa",seqtype="DNA")
geneSeq = seq[[1]][geneStart:geneStop]
There's probably a better way to do it, but this may get you started.
1) It's pretty easy to slurp down all 18k gene names and locations from someplace like Santa Cruz, and store it in a file locally. This gets trickier if you're worried about aliases and such, in which case you'll want to do something like Neil or Pierre suggest and query a web service.
2) Credit where credit is due - Neil pointed me to the seqinr package the other day
Jeremy - are you talking about the gene lengths? It's pretty easy to parse through the gene list from UCSC line by line and find just the length between first and last exon. I'd probably parse it with Ruby or Perl first, then read the results into R.
If you actually need the sequence, I'd probably change my recommendation to the BSGenome package (http://bioconductor.org/packages/2.3/bioc/html/BSgenome.html) in combination with the BioStrings package. It'll hold the whole genome in about 500mb which makes it easier to handle.
Ok, here is my java solution. It only uses some remote resources (the anonymous mysql server and the DAS server of the UCSC). My piece of code takes a list of refGene as its arguments and only prints the genes/geneomic-sequences to stdout but one could imagine it would store a pair(name,sequence) in memory.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
public class Gene2Seq
{
private static class DASHandler
extends DefaultHandler
{
private boolean inDNA=false;
@Override
public void startElement(String uri, String localName, String qName,
Attributes attributes) throws SAXException
{
inDNA=(qName.equals("DNA"));
}
@Override
public void endElement(String uri, String localName, String qName)
throws SAXException
{
inDNA=false;
}
@Override
public void characters(char[] ch, int start, int length)
throws SAXException
{
if(inDNA) System.out.print(new String(ch, start, length).replace("\n", ""));
}
}
public static void main(String[] args) throws Throwable
{
//put the JDBC driver for mysql in the $CLASSPATH
Class.forName("com.mysql.jdbc.Driver");
Connection con = DriverManager.getConnection(
"jdbc:mysql://genome-mysql.cse.ucsc.edu/hg19",
"genome", ""
);
SAXParserFactory f=SAXParserFactory.newInstance();
SAXParser parser=f.newSAXParser();
PreparedStatement pstmt=con.prepareStatement(
"select name,chrom,txStart,txEnd from refGene where name2=?"
);
for(String name: args)
{
pstmt.setString(1, name);
ResultSet row=pstmt.executeQuery();
while(row.next())
{
System.out.println(">"+row.getString(1)+"|"+row.getString(2)+":"+row.getInt(3)+"-"+row.getInt(4));
parser.parse(
"http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment="+row.getString(2)+":"+(row.getInt(3)+1)+","+(row.getInt(4)+1),
new DASHandler());
System.out.println();
}
row.close();
}
pstmt.close();
con.close();
}
}
output with EIF4G1:
>NM_004953|chr3:184038262-184053145
agatgggctgaaagtggaactcaaggggtttctggcacctacctacctgcttcccgctggggggtggggagttggcccag....
>NM_182917|chr3:184032970-184053145
tcctcgacggccgccgcccgcctggccttttagggcctgactcccgcccttcctggcctacactcctgggcggcggcagg....
>NM_198241|chr3:184032355-184053145
gaagcggtggccgccgagcgggatctgtgcggggagccggaaatggt....
R has gotten a lot of sequence handling and searching routines recently that make it a good choice also for sequence analysis in bioinformatics.
Have a look at the BSGenome package in BioConductor. It is meant to hold the genome sequence and allow fast sequence searches in the genome sequence. It does not contain real genome annotations though. There are readymade packages for a bunch of eukaryote genomes you can download, but of course your organism has to be in the list. This can be used together with the BioStrings package that allows for fast sequence searches and manipulation.
It could be a good team with biomaRt (already mentioned, I really recommend this) if you want further local sequence processing or if you already have the gene start/end coordinates, or alternatively transfer the coordinates only and cut these out of the BSGenome.
Hello,
Concerning the seqinR package I would also consider the "query" function which allows you to query various databases as EMBL, GenBank, Uniprot, some Ensembl genomes and others databases structured under ACNUC.
library(seqinr)
To check which databases are available, type:
choosebank(infobank=TRUE)
For example if you want to query the ensembl human genome for sequences in which the field hgnc is a2m :
choosebank("human")
query("mylist","k=hgnc:a2m")
Note that in the case of Ensembl, cross-references may be used as keywords:
For example in the selected sequence, the annotations are
S12_1.PE362
Location/Qualifiers (length=4425
bp) FT CDS
join(complement(9268360..9268445),complement(9265956..9266139),
FT
complement(9264973..9265132),complement(9264755..9264807),
FT
complement(9262910..9262930),complement(9262463..9262631),
FT
complement(9261917..9262001),complement(9260120..9260240),
FT
complement(9259087..9259201),complement(9258832..9258941),
FT
complement(9256835..9256996),complement(9254043..9254270),
FT
complement(9253740..9253803),complement(9251977..9252119),
FT
complement(9251203..9251352),complement(9248135..9248296),
FT
complement(9247569..9247680),complement(9246061..9246175),
FT
complement(9243797..9244025),complement(9242952..9243078),
FT
complement(9242498..9242619),complement(9241796..9241847),
FT
complement(9232690..9232773),complement(9232235..9232411),
FT
complement(9231840..9231927),complement(9230297..9230453),
FT
complement(9229942..9230016),complement(9229352..9229532),
FT
complement(9227156..9227379),complement(9225249..9225467),
FT
complement(9224955..9225082),complement(9223084..9223174),
FT
complement(9222341..9222409),complement(9221336..9221438),
FT
complement(9220779..9220820),complement(9220419..9220435))
FT
/gene="ENSG00000175899" FT
/protein_id="ENSP00000323929" FT
/note="transcript_id=ENST00000318602"
FT
/db_xref="HGNC:A2M" FT
/db_xref="UCSC:uc001qvk.1" FT
/db_xref="CCDS:CCDS44827.1" FT
/db_xref="HPA:CAB017621" FT
/db_xref="HPA:CAB017621" FT
/db_xref="HPA:HPA002265" FT
/db_xref="HPA:HPA002265" FT
/db_xref="WikiGene:A2M" FT
/db_xref="Uniprot/SWISSPROT:A2MG_HUMAN"
FT
/db_xref="RefSeq_peptide:NP_000005.2"
FT
/db_xref="RefSeq_dna:NM_000014.4" FT
/db_xref="Uniprot/SPTREMBL:C9J773_HUMAN"
FT
/db_xref="Uniprot/SPTREMBL:Q9BQ22_HUMAN"
FT
/db_xref="EntrezGene:A2M" FT
/db_xref="EMBL:AB209614" FT
/db_xref="EMBL:AC007436" FT
/db_xref="EMBL:AF109189" FT
/db_xref="EMBL:AF349032" FT
/db_xref="EMBL:AF349033" FT
/db_xref="EMBL:AY591530" FT
/db_xref="EMBL:BC026246" FT
/db_xref="EMBL:BC040071" FT
/db_xref="EMBL:CR749334" FT
/db_xref="EMBL:M11313"
HGNC:A2M
, UCSC:uc001qvk.1
,CCDS:CCDS44827.1
,HPA:CAB017621
,etc. are keywords which may be used to retrieve the sequence
Now to check the sequence list ( in this case there is only 1 sequence in the list) :
mylist$req
[[1]]
name length frame ncbicg
"HS12_1.PE362" "4425" "0" "1"
to get the sequence data:
seq<-sapply(mylist$req[1:1],getSequence, as.string = FALSE)
to save data in fasta format:
write.fasta(sequences=seq,names="my_sequence" , file.out = "myseq.fasta")
You can get as well whole chromsome sequences, extract data in several formats, extract fragments of sequences, translate into protein.
You may find more information on the seqinR page here
I hope this may help you
Simon
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Does seqinR work for newly-sequenced genomes? I mean genomes for non-model organisms which are not available in biomaRt. Thanks.
If you have sequence, it will load it.
I need to query hg19 reference genome and retrieve several 50 nucleotides bases from various positions (exonic, intronic, etc). For testing that if it works I use the following command but it returns empty result :
I expect to see
acatctcgca
as a result. I think the problem is come from miss configuration in function parameters. Am I using the right tool for this problem?People are unlikely to see your query here buried in the comments of an old question. You'll probably have better luck adding it as a new question.
Sorry, I am new to Biostars, I post my question as a new post here.