hg19 exon, intron, UTR region
4
1
5.2 years ago
9521ljh ▴ 50

Where can I get a file of hg19 exon, intron, and UTR regions?

I have read many posts on Biostars about this, but they are all quite old.

sequencing next-gen • 2.7k views
3
5.2 years ago
ATpoint 86k

Get the GTF (annotation) file for your genome; GTFs can be found at NCBI, Ensembl or GENCODE. Once you have it, you can follow the answer "A: how to get intronic and intergenic sequences based on gff file?" to extract the respective features.
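As a rough illustration of how introns can be derived once you have the GTF, here is a minimal Python sketch. It is not the method from the linked answer; it simply assumes that introns are the gaps between consecutive exons of the same transcript. The sample records, the transcript name "tx1" and the function names are made up for the example.

```python
import re
from collections import defaultdict

# Tiny made-up GTF excerpt (hypothetical transcript "tx1").
# A real file would be downloaded from GENCODE, Ensembl or NCBI.
SAMPLE_GTF = """\
chr1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "tx1";
chr1\tdemo\texon\t300\t400\t.\t+\t.\tgene_id "g1"; transcript_id "tx1";
chr1\tdemo\texon\t500\t600\t.\t+\t.\tgene_id "g1"; transcript_id "tx1";
"""


def exons_by_transcript(gtf_lines):
    """Group exon intervals (1-based, inclusive) by (transcript_id, contig, strand)."""
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match:
            key = (match.group(1), fields[0], fields[6])
            exons[key].append((int(fields[3]), int(fields[4])))
    return exons


def introns(exon_list):
    """Return the gaps between sorted exons as 1-based, inclusive intervals."""
    ordered = sorted(exon_list)
    return [
        (a_end + 1, b_start - 1)
        for (_, a_end), (b_start, _) in zip(ordered, ordered[1:])
        if b_start - a_end > 1  # adjacent exons leave no intron
    ]


if __name__ == "__main__":
    for (tx, contig, strand), ex in exons_by_transcript(SAMPLE_GTF.splitlines()).items():
        for start, end in introns(ex):
            # BED-style output: 0-based start, end unchanged
            print(contig, start - 1, end, strand, tx + ".intron", sep="\t")
```

For a real annotation you would pass the lines of the decompressed GTF instead of `SAMPLE_GTF.splitlines()`. Overlapping transcripts are handled independently here, so introns of one transcript may overlap exons of another.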

2
5.2 years ago

Be aware that UTRs are part of exons. So when you get UTR regions from a GTF/GFF, they will overlap with the regions annotated as exons in the GTF. Perhaps what you want is not things annotated as exons, but things annotated as CDS?
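To make the overlap concrete, here is a small Python sketch that treats the UTRs of one transcript as the exon portions not covered by its CDS. The coordinates and the function names are invented for illustration, and the sketch assumes a coding transcript with at least one CDS interval.

```python
def subtract(exons, cds):
    """Return the parts of each exon (1-based, inclusive) not covered by any CDS interval."""
    remaining = []
    for e_start, e_end in sorted(exons):
        pos = e_start  # first exon position not yet accounted for
        for c_start, c_end in sorted(cds):
            if c_end < pos or c_start > e_end:
                continue  # this CDS interval does not touch the rest of the exon
            if c_start > pos:
                remaining.append((pos, c_start - 1))
            pos = max(pos, c_end + 1)
        if pos <= e_end:
            remaining.append((pos, e_end))
    return remaining


def label_utrs(utrs, cds, strand):
    """Label each UTR interval as 5' or 3' depending on the transcript strand."""
    cds_start = min(start for start, _ in cds)
    labelled = []
    for start, end in utrs:
        left_of_cds = end < cds_start  # position in genome coordinates
        five_prime = left_of_cds if strand == "+" else not left_of_cds
        labelled.append((start, end, "5' UTR" if five_prime else "3' UTR"))
    return labelled


# Hypothetical transcript: three exons, coding region from 150 to 550.
exons = [(100, 200), (300, 400), (500, 600)]
cds = [(150, 200), (300, 400), (500, 550)]

utrs = subtract(exons, cds)
for interval in label_utrs(utrs, cds, "+"):
    print(interval)
```

Note that the first and last exons appear both in the exon list and, in part, in the UTR list, which is exactly the overlap described above.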

2
5.2 years ago

Using bioalcidaejdk (http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html):

$ wget  -q -O - "ftp://ftp.ensemblgenomes.org/pub/release-45/metazoa/gtf/apis_mellifera/Apis_mellifera.Amel_4.5.45.gtf.gz" |\
gunzip -c |\
java -jar dist/bioalcidaejdk.jar -F GTF -f biostar.code  

(...)
6   4682136 4682216 +   GB52198-RA.Intron16
6   4682387 4682473 +   GB52198-RA.Intron17
6   4682696 4682760 +   GB52198-RA.Intron18
6   4682905 4682967 +   GB52198-RA.Intron19
6   4676837 4677042 +   5' UTR of GB52198-RA
6   4683076 4683853 +   3' UTR of GB52198-RA
6   4691339 4691384 +   GB52199-RA.Exon1
6   4692448 4692491 +   GB52199-RA.Exon2
6   4693914 4694249 +   GB52199-RA.Exon3
6   4691384 4692448 +   GB52199-RA.Intron1
6   4692491 4693914 +   GB52199-RA.Intron2
6   4691339 4691339 +   5' UTR of GB52199-RA
(...)

with biostar.code:

stream().
    flatMap(GENE->GENE.getTranscripts().stream()).
    flatMap(TRANSCRIPT->{
        final List<Interval> L = new ArrayList<>();
        TRANSCRIPT.getExons().stream().forEach(E->L.add(E.toInterval()));
        TRANSCRIPT.getIntrons().stream().forEach(I->L.add(I.toInterval()));
        TRANSCRIPT.getUTRs().stream().forEach(U->L.add(U.toInterval()));
        return L.stream();
        }).forEach(R->println(R.getContig()+"\t"+(R.getStart()-1)+"\t"+R.getEnd()+"\t"+R.getStrand()+"\t"+R.getName()));
1
5.2 years ago

Search for GTF/GFF files; these should contain the information you are looking for. Make sure the chromosome notation matches that of your reference FASTA genome (for example, "chr1" versus "1").
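One quick way to check the chromosome notation is to compare the sequence names used in the annotation with those in the FASTA headers. The following Python sketch does that on in-memory lines; the sample records and function names are hypothetical, and for real files you would iterate over the opened files instead.

```python
def gtf_contigs(gtf_lines):
    """Collect the sequence names from column 1 of a GTF/GFF, skipping comments."""
    return {
        line.split("\t", 1)[0]
        for line in gtf_lines
        if line.strip() and not line.startswith("#")
    }


def fasta_contigs(fasta_lines):
    """Collect sequence names: the first whitespace-delimited token after '>'."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    }


def missing_contigs(gtf_lines, fasta_lines):
    """Return annotation contigs that have no matching sequence in the FASTA."""
    return sorted(gtf_contigs(gtf_lines) - fasta_contigs(fasta_lines))


# Made-up example: the annotation uses "1" while the reference uses "chr1".
gtf = ['1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "tx1";']
fasta = [">chr1 some description", "ACGTACGT"]

print(missing_contigs(gtf, fasta))
```

An empty result means every contig named in the annotation exists in the reference; a non-empty one usually points to a naming mismatch or a different assembly.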
