Where can I get a file of hg19 exon, intron, and UTR regions?
I have read many posts on Biostars, but they are all quite old.
Get the GTF (annotation) file for your genome. Once you have it, you can follow "A: how to get intronic and intergenic sequences based on gff file?" to extract the respective features. GTF files are available from NCBI, Ensembl, or GENCODE.
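As a rough illustration of what "getting the features" from a GTF looks like, here is a minimal Python sketch that pulls one feature type out of GTF lines and converts it to BED-style coordinates. The function name `gtf_features_to_bed` and the toy records are made up for this example. It assumes a standard 9-column tab-separated GTF with 1-based inclusive coordinates and a `transcript_id` attribute.

```python
import re

def gtf_features_to_bed(lines, feature="exon"):
    """Yield (chrom, start0, end, name, strand) for one GTF feature type.

    GTF is 1-based inclusive; BED is 0-based half-open, so only the
    start coordinate is shifted by one.
    """
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature:
            continue
        match = re.search(r'transcript_id "([^"]+)"', cols[8])
        name = match.group(1) if match else "."
        yield (cols[0], int(cols[3]) - 1, int(cols[4]), name, cols[6])

# Toy records (hypothetical, not taken from a real annotation).
example = [
    '#!genome-build GRCh37\n',
    'chr1\tdemo\tgene\t100\t500\t.\t+\t.\tgene_id "G1";\n',
    'chr1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'chr1\tdemo\tCDS\t150\t200\t.\t+\t0\tgene_id "G1"; transcript_id "T1";\n',
    'chr1\tdemo\texon\t400\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
]
exon_bed = list(gtf_features_to_bed(example, "exon"))
```

For a real file you would pass an open file handle (for example from `gzip.open(path, "rt")`) instead of the toy list.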
Be aware that UTRs are part of exons. So when you get UTR regions from a GTF/GFF, they will overlap with the regions annotated as exons in the GTF. Perhaps what you want is not things annotated as exons, but things annotated as CDS?
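To make the exon/UTR/CDS relationship concrete, here is a small Python sketch for a single transcript. It is only an illustration under simplifying assumptions: coordinates are 1-based inclusive as in GTF, introns are taken as the gaps between consecutive exons, and UTRs are taken as the exon pieces that fall outside the overall CDS span. Both function names are hypothetical. Which end is the 5' UTR and which is the 3' UTR depends on the strand, and since the GTF convention keeps the stop codon out of the CDS, this simple subtraction would count it as part of the UTR.

```python
def introns_from_exons(exons):
    """Return the gaps between consecutive exons of one transcript."""
    exons = sorted(exons)
    introns = []
    for (_, end1), (start2, _) in zip(exons, exons[1:]):
        if start2 > end1 + 1:  # skip abutting or overlapping exons
            introns.append((end1 + 1, start2 - 1))
    return introns

def utrs_from_exons_and_cds(exons, cds):
    """Return exon pieces lying outside the CDS span of one transcript."""
    if not cds:
        return []  # no CDS annotated: treated here as having no UTR
    cds_start = min(start for start, _ in cds)
    cds_end = max(end for _, end in cds)
    utrs = []
    for start, end in sorted(exons):
        if start < cds_start:  # piece to the left of the CDS
            utrs.append((start, min(end, cds_start - 1)))
        if end > cds_end:      # piece to the right of the CDS
            utrs.append((max(start, cds_end + 1), end))
    return utrs

# Toy transcript (hypothetical coordinates).
exons = [(100, 200), (300, 400), (500, 600)]
cds = [(150, 200), (300, 400), (500, 550)]
introns = introns_from_exons(exons)
utrs = utrs_from_exons_and_cds(exons, cds)
```

Note that both UTR intervals lie inside the first and last exon, which is exactly the overlap described above.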
Using bioalcidaejdk: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html (the example below reads an Apis mellifera GTF from Ensembl Genomes; substitute the GTF for hg19):
$ wget -q -O - "ftp://ftp.ensemblgenomes.org/pub/release-45/metazoa/gtf/apis_mellifera/Apis_mellifera.Amel_4.5.45.gtf.gz" |\
gunzip -c |\
java -jar dist/bioalcidaejdk.jar -F GTF -f biostar.code
(...)
6 4682136 4682216 + GB52198-RA.Intron16
6 4682387 4682473 + GB52198-RA.Intron17
6 4682696 4682760 + GB52198-RA.Intron18
6 4682905 4682967 + GB52198-RA.Intron19
6 4676837 4677042 + 5' UTR of GB52198-RA
6 4683076 4683853 + 3' UTR of GB52198-RA
6 4691339 4691384 + GB52199-RA.Exon1
6 4692448 4692491 + GB52199-RA.Exon2
6 4693914 4694249 + GB52199-RA.Exon3
6 4691384 4692448 + GB52199-RA.Intron1
6 4692491 4693914 + GB52199-RA.Intron2
6 4691339 4691339 + 5' UTR of GB52199-RA
(...)
with biostar.code:
stream().
// expand each gene into its transcripts
flatMap(GENE->GENE.getTranscripts().stream()).
flatMap(TRANSCRIPT->{
// collect the exons, introns and UTRs of this transcript as intervals
final List<Interval> L = new ArrayList<>();
TRANSCRIPT.getExons().stream().forEach(E->L.add(E.toInterval()));
TRANSCRIPT.getIntrons().stream().forEach(I->L.add(I.toInterval()));
TRANSCRIPT.getUTRs().stream().forEach(U->L.add(U.toInterval()));
return L.stream();
// print tab-delimited: contig, start-1, end, strand, name
}).forEach(R->println(R.getContig()+"\t"+(R.getStart()-1)+"\t"+R.getEnd()+"\t"+R.getStrand()+"\t"+R.getName()));
Search for GTF/GFF files; these should contain the information you are looking for. Make sure the chromosome notation in the annotation matches that of your reference FASTA genome.
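A quick way to check the chromosome notation is to compare the sequence names in the FASTA headers with the first column of the GTF. The Python sketch below is a minimal illustration; the helper names and toy inputs are hypothetical. It only flags the common "chr1" (UCSC style) versus "1" (Ensembl style) mismatch, so other differences, such as the mitochondrial contig being named chrM in one file and MT in the other, would need separate handling.

```python
def fasta_contigs(lines):
    """Collect sequence names from FASTA header lines (text after '>')."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and len(line.strip()) > 1
    }

def gtf_contigs(lines):
    """Collect the first-column names from GTF/GFF record lines."""
    return {
        line.split("\t", 1)[0]
        for line in lines
        if line.strip() and not line.startswith("#")
    }

def naming_report(fasta_names, gtf_names):
    """Return (shared, only_in_gtf, fixable_by_chr_prefix)."""
    shared = fasta_names & gtf_names
    only_gtf = gtf_names - fasta_names
    # Names that would match after adding or removing a "chr" prefix.
    fixable = {
        name for name in only_gtf
        if ("chr" + name) in fasta_names
        or (name.startswith("chr") and name[3:] in fasta_names)
    }
    return shared, only_gtf, fixable

# Toy inputs (hypothetical).
fasta_lines = [">chr1 some description\n", "ACGT\n", ">chr2\n", "ACGT\n"]
gtf_lines = [
    "#comment\n",
    '1\tdemo\texon\t1\t2\t.\t+\t.\tgene_id "g1";\n',
    'chr2\tdemo\texon\t1\t2\t.\t+\t.\tgene_id "g2";\n',
]
shared, only_gtf, fixable = naming_report(
    fasta_contigs(fasta_lines), gtf_contigs(gtf_lines)
)
```

If `only_gtf` is not empty, features on those contigs will not line up with the reference until the names are harmonized.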