Finding Mutations In A Gene From 69 Genomes
12.1 years ago
Pappu ★ 2.1k

First I downloaded files like vcfBeta-NA10851-200-37-ASM.vcf.bz2 from the Complete Genomics website. I then had to convert each one to gz and index it with tabix for fast extraction of variants by chromosome position. I realized that I would have to do the same for all 69 genomes, which is tedious.

Then I downloaded CompletePublicGenomes69genomesall_testvariants.tsv.bz2. After decompression, the file is 7.2 GB. How can I work with this file? The format is unfamiliar to me. Thank you.

12.1 years ago

This is a tab-separated values (TSV) file. Many tools can work with .tsv files, but some (Excel is a prime example) will probably not open a file this large unless you split it into smaller pieces. You will probably need a scripting language or Linux/Unix command-line utilities such as grep and awk. For your problem, I would consider using grep to pull out the lines of the original file that contain variants in your gene of interest, then opening that subset in Excel.
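If you would rather use a scripting language than grep, here is a minimal Python sketch of the same idea: stream the file once and keep only the rows that fall inside your gene's coordinates. The function names are made up for illustration, and the column positions (chromosome, begin, end as the 2nd, 3rd and 4th fields), the `#` metadata lines, the `>` column-header line and the half-open coordinates are all assumptions about the file layout, so check them against the first few lines of your file.

```python
import bz2


def open_text(path):
    """Open a plain or .bz2-compressed text file for line-by-line reading."""
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    return open(path, "r")


def rows_in_region(path, chrom, start, end, chrom_col=1, begin_col=2, end_col=3):
    """Yield rows (as lists of fields) that overlap [start, end) on chrom.

    Column indices are 0-based and are ASSUMPTIONS about the file layout.
    """
    with open_text(path) as handle:
        for line in handle:
            row = line.rstrip("\r\n").split("\t")
            # Skip '#' metadata lines and the '>' column-header line (assumed).
            if row[0].startswith(("#", ">")):
                continue
            # Skip blank or short lines that lack the coordinate columns.
            if len(row) <= max(chrom_col, begin_col, end_col):
                continue
            if row[chrom_col] != chrom:
                continue
            row_begin, row_end = int(row[begin_col]), int(row[end_col])
            # Keep the row if its interval overlaps the requested region.
            if row_begin < end and row_end > start:
                yield row
```

You would call it as `rows_in_region("your_file.tsv.bz2", "chr7", gene_start, gene_end)` with your gene's coordinates and write the rows out to a small .tsv that Excel can open. It reads line by line, so memory use stays small, but it still scans the whole file on every run.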


Thank you for your suggestion. I actually tried sed, which seemed too slow for such a huge file. I am wondering whether it would be possible to convert it to VCF and index it with tabix for the fastest access to chromosome regions.


You do not need to convert to VCF. Tabix will happily index any sorted, bgzipped, tab-delimited file given the correct arguments for the chromosome, start, and end columns.
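As a rough sketch of what "the correct arguments" look like, the Python helpers below only build the tabix command lines; they do not run anything. The helper names are hypothetical. The flags are tabix's own: `-s` (sequence column), `-b` (begin column), `-e` (end column), `-c` (comment character), `-S` (number of header lines to skip) and `-0` (zero-based, half-open coordinates). Which column numbers and header settings apply to your file is an assumption you need to check yourself.

```python
def tabix_index_command(bgzipped_path, seq_col, begin_col, end_col,
                        zero_based=False, comment_char="#", skip_lines=0):
    """Build the argv for indexing a sorted, bgzipped TSV with tabix.

    Column numbers are 1-based, as tabix expects. The values to pass for a
    given file are assumptions to verify against that file's header.
    """
    cmd = ["tabix",
           "-s", str(seq_col),    # column holding the chromosome name
           "-b", str(begin_col),  # column holding the start coordinate
           "-e", str(end_col),    # column holding the end coordinate
           "-c", comment_char]    # lines starting with this are skipped
    if skip_lines:
        cmd += ["-S", str(skip_lines)]  # skip this many leading lines
    if zero_based:
        cmd.append("-0")  # coordinates are zero-based, half-open
    cmd.append(bgzipped_path)
    return cmd


def tabix_query_command(bgzipped_path, region):
    """Build the argv for pulling one region, e.g. 'chr1:1000-2000'."""
    return ["tabix", bgzipped_path, region]
```

Once tabix is installed, you can pass either list to `subprocess.run(cmd, check=True)`. The file has to be sorted by chromosome and then by start position, and compressed with bgzip rather than plain gzip or bzip2, before the index step will work.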
