Using the UCSC Table Browser, select the following options:
genome: mouse
assembly: Dec. 2011 (GRCm38/mm10)
group: Genes and Gene predictions
track: RefSeq Genes
table: refGene
region: genome
Identifiers (names/accessions): paste list {then paste the list of gene names}
output format: selected fields from primary and related tables
output file: results.txt
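Then click 'get output' and, in the following window, select these fields (leave 'name', the RefSeq accession, unchecked so that the columns match the output below):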
chrom
strand
txStart
txEnd
name2
This gives the following output in results.txt:
chrom strand txStart txEnd name2
chr3 + 34649994 34652460 Sox2
chr4 + 147021849 147060799 Rex2
chr6 + 122707564 122714633 Nanog
chr6 + 122707564 122714633 Nanog
chr6 + 122707488 122714633 Nanog
chr6 + 122707564 122714633 Nanog
chr10 + 78042286 78063622 Dnmt3l
chr10 + 78049958 78063622 Dnmt3l
chr10 + 78055334 78063622 Dnmt3l
chr10 + 78055334 78063622 Dnmt3l
chr10 + 78055334 78063622 Dnmt3l
chr10 + 78049841 78063622 Dnmt3l
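Change to the required format (skipping the header line) with:
awk 'NR > 1 {print $5 "," $1 ":" $3 "-" $4 "," $2}' results.txt | uniq
This gives the following. Note that uniq only removes adjacent duplicates, so one Nanog region still appears twice: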
Sox2,chr3:34649994-34652460,+
Rex2,chr4:147021849-147060799,+
Nanog,chr6:122707564-122714633,+
Nanog,chr6:122707488-122714633,+
Nanog,chr6:122707564-122714633,+
Dnmt3l,chr10:78042286-78063622,+
Dnmt3l,chr10:78049958-78063622,+
Dnmt3l,chr10:78055334-78063622,+
Dnmt3l,chr10:78049841-78063622,+
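If the leftover duplicate bothers you, here is a minimal Python sketch of the same reformatting that drops repeated regions even when they are not adjacent. It assumes results.txt is whitespace-separated with the five columns shown above; the function name format_regions is just for illustration.

```python
def format_regions(lines):
    """Turn 'chrom strand txStart txEnd name2' rows into 'name2,chrom:start-end,strand'."""
    seen = set()
    regions = []
    for line in lines:
        fields = line.split()
        # Skip blank lines and the header row (with or without a leading '#').
        if len(fields) < 5 or line.startswith("#") or fields[0] == "chrom":
            continue
        chrom, strand, tx_start, tx_end, name2 = fields[:5]
        region = f"{name2},{chrom}:{tx_start}-{tx_end},{strand}"
        if region not in seen:  # drop duplicates even when they are not adjacent
            seen.add(region)
            regions.append(region)
    return regions


# A few rows copied from the output above.
sample = """chrom strand txStart txEnd name2
chr3 + 34649994 34652460 Sox2
chr6 + 122707564 122714633 Nanog
chr6 + 122707488 122714633 Nanog
chr6 + 122707564 122714633 Nanog
""".splitlines()

for region in format_regions(sample):
    print(region)
```

To run it on the real file, pass open("results.txt") instead of the sample lines.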
Why are there multiple rows for some genes? Because these genes have more than one transcript, and each transcript gets its own row.
The BioMart online tool is probably the easiest way to do it. There's a video tutorial to get you started. Just filter by your list of gene names (ID list: Associated gene name) and get the coordinates as attributes.
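If you would rather script it than click through the web form, BioMart also accepts XML queries. The sketch below only builds the query and the request URL; it does not contact the server. The endpoint, the dataset name (mmusculus_gene_ensembl) and the filter/attribute names are my assumptions based on the usual Ensembl BioMart naming, so check them in the web interface first. Also note that the main Ensembl site may serve a newer mouse assembly than GRCm38/mm10, in which case an archive site would be needed to match the UCSC coordinates above.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import quoteattr

# Assumed endpoint; an Ensembl archive host may be needed for GRCm38/mm10.
MART_URL = "https://www.ensembl.org/biomart/martservice"

# Assumed attribute names (gene symbol, chromosome, gene start/end, strand).
ATTRIBUTES = [
    "external_gene_name",
    "chromosome_name",
    "start_position",
    "end_position",
    "strand",
]


def build_query(gene_names, dataset="mmusculus_gene_ensembl"):
    """Build a BioMart XML query that filters on a list of gene symbols."""
    values = quoteattr(",".join(gene_names))  # quoted and escaped attribute value
    attr_xml = "".join(f'<Attribute name="{a}"/>' for a in ATTRIBUTES)
    return (
        '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'
        '<Query virtualSchemaName="default" formatter="TSV" header="0" uniqueRows="1">'
        f'<Dataset name="{dataset}" interface="default">'
        f'<Filter name="external_gene_name" value={values}/>'
        f"{attr_xml}</Dataset></Query>"
    )


def query_url(gene_names):
    """Return the full GET URL for the query (nothing is sent here)."""
    return MART_URL + "?" + urlencode({"query": build_query(gene_names)})


print(build_query(["Sox2", "Nanog", "Dnmt3l"]))
```

The URL from query_url(...) can then be fetched with urllib.request.urlopen or pasted into a browser; the response should be tab-separated text with one row per gene.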
Thanks Emily, that is indeed very simple. Do you know how the gene start and end positions are defined? Transcription start/end, or perhaps coding start/end?
The gene start is the 5' transcription start of the most 5' transcript, and the gene end is the 3' transcription end of the most 3' transcript.
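In other words, the gene span is the union of its transcripts. As a rough illustration, the same collapsing can be applied to per-transcript rows like the UCSC ones earlier in the thread: take the smallest start and the largest end per gene. This is only a sketch; gene_spans is a made-up helper name, and it assumes all transcripts of a gene sit on the same chromosome and strand.

```python
def gene_spans(rows):
    """Collapse per-transcript rows into one (start, end) span per gene.

    rows: iterable of (chrom, strand, tx_start, tx_end, name) tuples.
    """
    spans = {}
    for chrom, strand, tx_start, tx_end, name in rows:
        key = (name, chrom, strand)
        start, end = spans.get(key, (tx_start, tx_end))
        # Widen the span to cover this transcript as well.
        spans[key] = (min(start, tx_start), max(end, tx_end))
    return spans


# Transcript rows copied from the UCSC output earlier in the thread.
transcripts = [
    ("chr10", "+", 78042286, 78063622, "Dnmt3l"),
    ("chr10", "+", 78049958, 78063622, "Dnmt3l"),
    ("chr10", "+", 78055334, 78063622, "Dnmt3l"),
    ("chr10", "+", 78049841, 78063622, "Dnmt3l"),
    ("chr6", "+", 122707564, 122714633, "Nanog"),
    ("chr6", "+", 122707488, 122714633, "Nanog"),
]

for (name, chrom, strand), (start, end) in gene_spans(transcripts).items():
    print(f"{name},{chrom}:{start}-{end},{strand}")
# Dnmt3l,chr10:78042286-78063622,+
# Nanog,chr6:122707488-122714633,+
```

Taking the minimum and maximum genomic coordinates works for either strand, since on the minus strand the 5' end is simply the larger coordinate.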
Thanks venu. I solved my problem with the biomaRt package, so thanks! The other approaches you suggested also look very useful.