Question

Finding Mutations In A Gene From 69 Genomes

0

12.1 years ago

Pappu ★ 2.1k

First I downloaded files such as vcfBeta-NA10851-200-37-ASM.vcf.bz2 from the Complete Genomics website. I then had to recompress each one as .gz and index it with tabix for fast extraction of variants by chromosome position. Doing the same for all 69 genomes is tedious.

Then I downloaded CompletePublicGenomes69genomesall_testvariants.tsv.bz2. After decompression, the file is 7.2 GB. How can I work with this file? The format is unfamiliar to me. Thank you.

mutation • 2.0k views

updated 12.1 years ago by Sean Davis 27k • written 12.1 years ago by Pappu ★ 2.1k

score 1 · Answer 1 · 2012-10-15

1

12.1 years ago

Sean Davis 27k

This is a tab-separated values (TSV) file. Many tools can read .tsv files, but some (Excel is a prime example) will probably not open a file this large unless you split it into smaller pieces. A scripting language or Linux/Unix command-line utilities such as grep and awk are probably necessary. For your problem, I would consider using grep to pull out the lines of the original file that contain variants in your gene of interest, then opening that subset in Excel.
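If a scripting language is preferred over grep, the same subsetting idea can be sketched in Python. This is only an illustration under stated assumptions: it assumes the chromosome, begin, and end fields are the second, third, and fourth tab-separated columns (check the header line of your file), and the function names `filter_region` and `extract_to_file` are hypothetical. It streams the file line by line, so the 7.2 GB never has to fit in memory, and it can read the .bz2 directly.

```python
import bz2


def filter_region(lines, chrom, start, end, chrom_col=1, begin_col=2, end_col=3):
    """Yield rows (lists of fields) whose interval overlaps [start, end) on chrom.

    Column indices are 0-based and are ASSUMPTIONS about the file layout.
    """
    for line in lines:
        row = line.rstrip("\n").split("\t")
        # Skip blank lines and rows too short to hold the coordinate columns.
        if len(row) <= end_col:
            continue
        # Skip header/comment rows: their coordinate fields are not numeric.
        if not (row[begin_col].isdigit() and row[end_col].isdigit()):
            continue
        if row[chrom_col] != chrom:
            continue
        row_begin, row_end = int(row[begin_col]), int(row[end_col])
        # Standard interval-overlap test.
        if row_begin < end and row_end > start:
            yield row


def extract_to_file(src_path, dest_path, chrom, start, end):
    """Stream a .bz2-compressed TSV and write the matching rows to dest_path."""
    with bz2.open(src_path, "rt") as src, open(dest_path, "w") as dest:
        for row in filter_region(src, chrom, start, end):
            dest.write("\t".join(row) + "\n")
```

Calling `extract_to_file` with the coordinates of your gene (looked up separately for the matching genome build) should give a subset small enough for Excel. Note that the header row is not copied into the output.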

0

Thank you for your suggestion. I actually tried sed, which seemed too slow for such a huge file. I am wondering whether it would be possible to convert the file to VCF and index it with tabix for fast access to chromosome regions.

12.1 years ago by Pappu ★ 2.1k

1

You do not need to convert to VCF. Tabix will happily index any sorted, bgzipped file given the correct arguments for chromosome, start, and end columns.
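As a rough sketch of what "the correct arguments" might look like, the helper below builds the bgzip and tabix command lines for a generic TSV. The helper name `tabix_commands` and the example file name are hypothetical, and the default column numbers (2, 3, 4, counted from 1) and the zero-based setting are assumptions about the testvariants layout that you should check against the file header. The flags themselves (`-s`, `-b`, `-e`, `-0`, `-S`) are standard tabix options. The file must already be sorted by chromosome and then by start position.

```python
def tabix_commands(tsv_path, seq_col=2, begin_col=3, end_col=4,
                   zero_based=True, skip_lines=0):
    """Return [bgzip_cmd, tabix_cmd] as argument lists for a sorted TSV.

    Column numbers are 1-based, as tabix expects, and are ASSUMPTIONS
    about the file layout.
    """
    # bgzip compresses in place, producing tsv_path + ".gz".
    bgzip_cmd = ["bgzip", tsv_path]
    tabix_cmd = [
        "tabix",
        "-s", str(seq_col),    # column holding the chromosome name
        "-b", str(begin_col),  # column holding the start coordinate
        "-e", str(end_col),    # column holding the end coordinate
    ]
    if zero_based:
        tabix_cmd.append("-0")  # coordinates are 0-based, half-open
    if skip_lines:
        tabix_cmd += ["-S", str(skip_lines)]  # skip leading header lines
    tabix_cmd.append(tsv_path + ".gz")
    return [bgzip_cmd, tabix_cmd]


if __name__ == "__main__":
    # Print the commands for inspection before running them.
    for cmd in tabix_commands("testvariants.sorted.tsv"):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once bgzip and tabix are installed. After indexing, a region query takes the form `tabix file.gz chr:start-end`.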

12.1 years ago by Sean Davis 27k