I am currently working on a research project that requires analyzing metagenomic data. I want to download metagenomic sequence data and its associated metadata from the Linux command line, using a public repository such as the National Center for Biotechnology Information (NCBI) or the European Bioinformatics Institute (EBI). Could anyone provide a step-by-step guide or recommend tools and methods for downloading this data efficiently?
I have tried downloading SRA projects from NCBI on a PC. However, I did it manually, and downloading 200 samples took a long time. Afterwards, I could not work out where to get the metadata for the samples I had downloaded.
For these two reasons, I would like guidance on how to batch-download such large files. Additionally, I would like to know how to access metagenomic datasets from NCBI's Sequence Read Archive (SRA) and EBI's European Nucleotide Archive (ENA).
Thank you in advance for your assistance!
You can do this in multiple ways.
Use NCBI's SRA Run Selector to get the metadata (you can also have the data delivered to Galaxy or to a cloud storage instance of your own): https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA111397&o=acc_s%3Aa
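If you would rather script the metadata step than click through a web page, here is a minimal Python sketch. It does not use the Run Selector itself; it assumes the ENA Portal API `filereport` endpoint, which returns one tab-separated row per run for a project accession. The function names (`filereport_url`, `parse_report`, `fetch_run_table`) and the chosen field list are my own, and the API's parameters may change, so check the ENA documentation before relying on it.

```python
import csv
import io
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: ENA Portal API endpoint for per-run file reports.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"

# Assumption: a small selection of read_run fields; ENA offers many more.
DEFAULT_FIELDS = (
    "run_accession",
    "sample_accession",
    "library_strategy",
    "fastq_ftp",
    "fastq_md5",
)


def filereport_url(accession, fields=DEFAULT_FIELDS):
    """Build the query URL for all runs under a project or study accession."""
    query = urlencode(
        {
            "accession": accession,
            "result": "read_run",
            "fields": ",".join(fields),
            "format": "tsv",
        }
    )
    return f"{ENA_FILEREPORT}?{query}"


def parse_report(tsv_text):
    """Turn the tab-separated report into a list of dicts, one per run."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def fetch_run_table(accession):
    """Download and parse the report (needs network access)."""
    with urlopen(filereport_url(accession)) as response:
        return parse_report(response.read().decode("utf-8"))
```

Calling `fetch_run_table("PRJNA111397")` should then give you a list of rows that you can write to a file with `csv.DictWriter` and keep next to the downloaded reads.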
You could use sra-explorer to identify and access download links for the data. Here is a guide to using the program: sra-explorer : find SRA and FastQ download URLs in a couple of clicks
SRA datasets can be very big. If a PC is all you have access to, you are going to be limited by its resources: storage as well as the network connection.
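Once you have a list of download links, the batch download itself can be scripted. The following is a rough Python sketch, not what sra-explorer generates. It assumes the links are in the form ENA reports in its `fastq_ftp` field (host and path without a scheme, with paired files separated by `;`) and that the same paths are reachable over HTTPS. The function names are hypothetical. Files already present are skipped, so the script can be re-run after a dropped connection.

```python
import shutil
from pathlib import Path
from urllib.request import urlopen


def fastq_urls(fastq_ftp, scheme="https"):
    """Split an ENA-style 'fastq_ftp' value into full URLs.

    Assumption: the value looks like 'host/path_1.fastq.gz;host/path_2.fastq.gz'.
    """
    return [f"{scheme}://{part}" for part in fastq_ftp.split(";") if part]


def download(url, out_dir):
    """Stream one file to out_dir, skipping it if it is already there."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / url.rsplit("/", 1)[-1]
    if target.exists():
        return target  # finished in an earlier run
    # Write to a temporary name so an interrupted download is never
    # mistaken for a complete file.
    partial = target.with_name(target.name + ".part")
    with urlopen(url) as response, open(partial, "wb") as handle:
        shutil.copyfileobj(response, handle)
    partial.replace(target)
    return target


def download_all(fastq_ftp_values, out_dir):
    """Download every file listed in a sequence of 'fastq_ftp' values."""
    finished = []
    for value in fastq_ftp_values:
        for url in fastq_urls(value):
            finished.append(download(url, out_dir))
    return finished
```

This sketch does not verify checksums or check free disk space. With 200 samples it is worth comparing each file against its published MD5 and confirming you have enough storage before starting.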
In theory, the three main DNA sequence databases (NCBI, ENA, and DDBJ) are synced overnight, so identical information should be available in all three. However, I recently came across some examples where the data was only available in ENA (perhaps because of some restrictions, especially with human samples).