Question

Basic understanding of genome sequences

10.8 years ago

maria.kesa ▴ 30

Hello,

My name is Maria. I'm a master's student from Estonia. I want to ask some basic (okay, maybe a little bit silly) questions. I'm starting to work with 1000 Genomes data and I've never worked with genome sequences before.

I want to download sub-sequences of a genome. The instruction says to indicate it like 1:1-50000. I understand that 1 in front of : refers to chromosome number, is that correct? And 1-50000 would be the first 50000 nucleotides?

Are the genomes of different people of different lengths due to copy number variations? Does the sequencing according to a reference genome take account of these differences or would all the genomes in 1000Genomes be of the same length as they are aligned to a reference genome?

What if I wanted to obtain specific genes from the sequences? Is there any tool to do that?

Thank you!

alignment genome • 2.4k views

updated 3.6 years ago by Ram 45k • written 10.8 years ago by maria.kesa ▴ 30

Ram · Accepted Answer · 2015-02-17

I want to download sub-sequences of a genome. The instruction says to indicate it like 1:1-50000. I understand that 1 in front of : refers to chromosome number, is that correct? And 1-50000 would be the first 50000 nucleotides?

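Yes, you are correct. The part before the colon is the chromosome name, and 1-50000 is a 1-based, inclusive range, so 1:1-50000 covers the first 50,000 nucleotides of chromosome 1.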
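To make the region notation concrete, here is a minimal Python sketch that splits a string such as 1:1-50000 into its parts and counts the bases it covers. The function names `parse_region` and `region_length` are made up for this illustration and do not come from any 1000 Genomes tool; the sketch assumes the usual 1-based, inclusive convention for this kind of region string.

```python
def parse_region(region):
    """Split 'chrom:start-end' into (chrom, start, end).

    Assumes 1-based, inclusive coordinates, as in samtools-style region strings.
    """
    # Split on the last colon, so a chromosome name containing ':' still works.
    chrom, _, span = region.rpartition(":")
    start_text, _, end_text = span.partition("-")
    # Allow thousands separators such as 1:1-50,000.
    start = int(start_text.replace(",", ""))
    end = int(end_text.replace(",", ""))
    if not chrom or start < 1 or end < start:
        raise ValueError(f"invalid region: {region!r}")
    return chrom, start, end


def region_length(region):
    """Number of bases covered; both endpoints are included."""
    _, start, end = parse_region(region)
    return end - start + 1


print(parse_region("1:1-50000"))   # ('1', 1, 50000)
print(region_length("1:1-50000"))  # 50000
```

Because both endpoints are included, the length is end - start + 1 rather than end - start.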

Are the genomes of different people of different lengths due to copy number variations?

Yes. Indels will also contribute to differences in genome lengths.

Does the sequencing according to a reference genome take account of these differences or would all the genomes in 1000Genomes be of the same length as they are aligned to a reference genome?

Reads from different samples are aligned to the same reference genome so that multiple genomes can easily be compared for the presence or absence of a variant; otherwise any comparison would be difficult. In short, all the genomes in 1000 Genomes share the same coordinate system, so in that sense they are the same length with respect to the position of a variant or gene. The BAM files nevertheless contain enough information to predict copy number variants and to identify insertions and deletions that differ between individuals.
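One place where this insertion and deletion information lives is the CIGAR string of each aligned read. The sketch below is only an illustration, with a made-up function name `cigar_lengths`: it counts how many reference bases and how many read bases an alignment spans, using the operation meanings from the SAM format (M, D, N, = and X consume the reference; M, I, S, = and X consume the read). The CIGAR strings in the example are invented, not taken from 1000 Genomes data.

```python
import re

# Operations that advance along the reference and along the read, per the SAM format.
REF_OPS = set("MDN=X")
QUERY_OPS = set("MIS=X")
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_lengths(cigar):
    """Return (reference_bases, read_bases) spanned by one CIGAR string."""
    ref_len = 0
    query_len = 0
    for count, op in CIGAR_RE.findall(cigar):
        n = int(count)
        if op in REF_OPS:
            ref_len += n
        if op in QUERY_OPS:
            query_len += n
    return ref_len, query_len


# A 100-base read with a 2-base insertion spans only 98 reference bases.
print(cigar_lengths("50M2I48M"))  # (98, 100)
# A 100-base read with a 3-base deletion spans 103 reference bases.
print(cigar_lengths("50M3D50M"))  # (103, 100)
```

The reference coordinates stay fixed for every sample; the individual's extra or missing bases show up as a mismatch between these two lengths.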

What if I wanted to obtain specific genes from the sequences? Is there any tool to do that?

You can download the coordinates of your gene of interest from Ensembl (GTF file) or the UCSC Genome Browser (GTF/BED) and then use those coordinates (e.g. chr2:100000-1020000) to fetch the reads overlapping that region from the BAM file with the samtools view command. Make sure the chromosome name is written the same way as in the BAM header (for example 2 versus chr2).
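As a rough sketch of that workflow, the Python below looks up a gene in GTF-formatted lines, turns its coordinates into a region string, and builds the matching samtools view command. Everything named here is an assumption for illustration: the helper names `gene_region` and `samtools_view_command`, the gene GENE1, its coordinates, and the file names are all made up. It also assumes an Ensembl-style GTF with `gene` feature rows and a `gene_name` attribute, and that the BAM file is indexed, which samtools needs for region queries.

```python
import re

# Made-up GTF line for illustration; real files come from Ensembl or UCSC.
EXAMPLE_GTF = [
    '2\tensembl\tgene\t100000\t102000\t.\t+\t.\tgene_id "G0001"; gene_name "GENE1";',
]


def gene_region(gtf_lines, gene_name):
    """Return 'chrom:start-end' for the first matching gene row, or None."""
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # only whole-gene rows, not exons or transcripts
        match = re.search(r'gene_name "([^"]+)"', fields[8])
        if match and match.group(1) == gene_name:
            # GTF start and end are 1-based and inclusive, like the region string.
            return f"{fields[0]}:{fields[3]}-{fields[4]}"
    return None


def samtools_view_command(bam_path, region, out_path):
    """Build (but do not run) a samtools command that writes overlapping reads as BAM."""
    return ["samtools", "view", "-b", "-o", out_path, bam_path, region]


region = gene_region(EXAMPLE_GTF, "GENE1")
print(region)
print(" ".join(samtools_view_command("sample.bam", region, "GENE1.bam")))
```

The command list can be passed to `subprocess.run` once samtools is installed. If you start from a BED file instead, remember that BED start positions are 0-based, so add 1 to the start before building the region string.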