How to get 1000 Genomes data in bulk?


7.0 years ago

kynnjo ▴ 70

I am looking for an efficient way to get 1000 Genomes data for ~70k dbSNP IDs. (I am primarily interested in putative impact and allele frequencies.)

Is there a convenient way to do this?

A good solution would be some way to query 1000 Genomes programmatically and in bulk (as opposed to one dbSNP ID at a time), but I have not found one yet.

Another possibility would be to download files from 1000 Genomes and process them locally, but I have not been able to locate a reasonably sized download that has the information I'm looking for. I could download all of ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502, but that could take a long time and would pretty much fill up my hard disk, with no guarantee that what I'm looking for is in that massive download.

next-gen 1000-genomes databases api • 4.8k views

updated 7.0 years ago by Kevin Blighe 89k • written 7.0 years ago by kynnjo ▴ 70


If you are able to use Amazon AWS then the data is available there and won't require a download.

7.0 years ago by GenoMax 153k


How to get 1000 Genomes data in bulk?

The title of this post could be refined to indicate your exact requirement.

You know how to get the data in bulk, but you are looking for an efficient way to get only the data for the 70k dbSNP IDs that you want. Is that correct?

7.0 years ago by GenoMax 153k


7.0 years ago

Pierre Lindenbaum 166k

The following Java program should do the trick:

Compilation:

    javac Biostar332826.java

Usage:

    wget -O - "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz" |\
      gunzip -c |\
      java Biostar332826 file_containing_rs_id_one_per_line.txt

  

      

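	import java.io.BufferedReader;
	import java.io.InputStreamReader;
	import java.nio.file.Files;
	import java.nio.file.Paths;
	import java.util.HashSet;
	import java.util.Set;
	import java.util.regex.Pattern;

	public class Biostar332826
	{
	    public static void main(final String[] args) throws Exception {
	        // Load the rs IDs of interest (one per line) into a set for fast lookups.
	        final Set<String> rs = new HashSet<>(
	            Files.readAllLines(Paths.get(args[0]))
	        );
	        final Pattern tab = Pattern.compile("[\t]");
	        // The VCF is streamed on standard input, so it is never held in memory.
	        try (BufferedReader r = new BufferedReader(new InputStreamReader(System.in))) {
	            String line;
	            while ((line = r.readLine()) != null) {
	                // Pass header lines through so the output remains a valid VCF.
	                if (line.startsWith("#")) {
	                    System.out.println(line);
	                    continue;
	                }
	                // Only CHROM, POS and ID are needed: limit the split to 4 tokens
	                // so the genotype columns are not split.
	                final String[] tokens = tab.split(line, 4);
	                // Column 3 (index 2) is the ID; skip lines with fewer columns.
	                if (tokens.length > 2 && rs.contains(tokens[2])) {
	                    System.out.println(line);
	                }
	            }
	        }
	    }
	}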
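One caveat with exact matching on the ID column: the VCF specification allows several identifiers in that column, separated by semicolons, so a record carrying more than one rs ID may be missed. Below is a minimal sketch of a variant that matches any of the listed identifiers and also pulls the global allele frequency out of the INFO column. It assumes the frequency is stored under the `AF` key, as in the phase 3 release files; the class name `VcfIdFilter` and its methods are hypothetical and not part of the original answer.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper (not from the original answer). */
final class VcfIdFilter {

    /** Returns true if any identifier in the VCF ID column is in the wanted set. */
    static boolean matchesAnyId(final String idColumn, final Set<String> wanted) {
        for (final String id : idColumn.split(";")) {
            if (wanted.contains(id)) {
                return true;
            }
        }
        return false;
    }

    /** Returns the value of an INFO key such as "AF", or null if the key is absent. */
    static String infoValue(final String infoColumn, final String key) {
        final String prefix = key + "=";
        for (final String field : infoColumn.split(";")) {
            if (field.startsWith(prefix)) {
                return field.substring(prefix.length());
            }
        }
        return null;
    }

    /** Returns "ID, CHROM, POS, REF, ALT, AF" joined by tabs for a wanted record, else null. */
    static String summarize(final String line, final Set<String> wanted) {
        // Split off the 8 fixed columns only; a 9th token keeps FORMAT and genotypes unsplit.
        final String[] t = line.split("\t", 9);
        if (t.length < 8 || !matchesAnyId(t[2], wanted)) {
            return null;
        }
        // Assumption: the global allele frequency is stored under the INFO key "AF".
        final String af = infoValue(t[7], "AF");
        return String.join("\t", t[2], t[0], t[1], t[3], t[4], af == null ? "." : af);
    }

    public static void main(final String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java VcfIdFilter rs_ids_one_per_line.txt < input.vcf");
            return;
        }
        final Set<String> wanted = new HashSet<>(Files.readAllLines(Paths.get(args[0])));
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                if (line.startsWith("#")) {
                    continue; // header lines are dropped: the output is a table, not a VCF
                }
                final String row = summarize(line, wanted);
                if (row != null) {
                    System.out.println(row);
                }
            }
        }
    }
}
```

It can be run in the same pipeline as the program above, with `java VcfIdFilter` in place of `java Biostar332826`, and prints one tab-separated row per matching record. For multi-allelic sites the `AF` value is a comma-separated list with one entry per ALT allele, and the sketch prints it unchanged.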




7.0 years ago

Kevin Blighe 89k

I thought I'd give a quick answer, as this thread will likely get a fair bit of traffic in the future.

If you follow steps 1-3 of my tutorial, Produce PCA bi-plot for 1000 Genomes Phase III in VCF format, you'll have the data in BCF format. On my disk, the entire 1000 Genomes dataset (phased genotypes) in a single BCF file occupies 8.3 gigabytes. I interrogate it frequently for diverse projects.

Kevin

