Question

Retrieve chromosome sequence as FASTA or FASTQ file from SRA whole-genome data in NCBI / Galaxy

0

2.6 years ago

audreyrosemary11 ▴ 10

Hello there! I am trying to do variant calling on sequenced genomes from different individuals of the same chimpanzee species. The data come from the Great Ape Genome Project, via the NCBI BioProject https://www.ncbi.nlm.nih.gov/bioproject/189439, which lists the sequencing data for each individual of the species. Each individual's data can be accessed through its SRA run and downloaded via the NCBI SRA download links (for instance: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR748138). I have been running into a lot of issues because each genome is very large (6-7 GB) and I cannot download them properly on my laptop. Eventually, I decided to use Galaxy as an online tool: download the chimpanzee reads directly from the SRR number with the fastq-dump command, map them to my own reference genome to generate a SAM file, then use Medaka for variant calling.

The issue is that this is still computationally expensive. Initially, I ran snippy for variant calling on a SAM file obtained from the SRR number with the sam-dump feature on Galaxy, but it could not be processed because of a lack of memory. Right now I am mapping one chimp genome to the reference with BWA; the job started one hour ago and is still running. Due to time constraints, I want to reduce the size of the ape data and use either a small sample from each genome or only the chromosome 1 information from each genome. So my question is: is there any way to access the chromosome 1 sequence of a genome from the ape project datasets and download it as a FASTA or FASTQ file directly from the SRR number via Galaxy? Or, at least, a way to view the chromosome 1 sequence and download it from the NCBI datasets? I can't find a way to do it; it only lets me download the whole genome, which is too large to work with, not individual chromosomes. I was also thinking of splitting each genome after downloading it into, say, 20 batch files, but on Galaxy I can only find a tool that does this for FASTA files, not FASTQ.
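For the splitting and sampling idea, a minimal Python sketch is given here. It assumes plain four-line FASTQ records (no wrapped sequence lines); the function names `read_fastq`, `split_fastq` and `subsample_fastq` are made up for illustration and are not part of Galaxy or the SRA Toolkit.

```python
import gzip
import itertools
import random
from pathlib import Path


def read_fastq(path):
    """Yield one FASTQ record at a time as a tuple of its four lines."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            record = tuple(itertools.islice(handle, 4))
            if len(record) < 4:  # end of file (or a truncated last record)
                break
            yield record


def split_fastq(path, out_dir, reads_per_chunk):
    """Write consecutive chunks of `reads_per_chunk` records; return chunk paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    records = read_fastq(path)
    chunk_paths = []
    while True:
        chunk = list(itertools.islice(records, reads_per_chunk))
        if not chunk:
            break
        chunk_path = out_dir / f"chunk_{len(chunk_paths):04d}.fastq"
        with open(chunk_path, "w") as out:
            for record in chunk:
                out.writelines(record)
        chunk_paths.append(chunk_path)
    return chunk_paths


def subsample_fastq(path, out_path, fraction, seed=0):
    """Keep each record with probability `fraction`; return the number kept."""
    rng = random.Random(seed)
    kept = 0
    with open(out_path, "w") as out:
        for record in read_fastq(path):
            if rng.random() < fraction:
                out.writelines(record)
                kept += 1
    return kept
```

For paired-end data, running `subsample_fastq` on both mate files with the same `seed` and `fraction` should keep the same records in each, provided both files list the reads in the same order. Note that a random sample lowers coverage across the whole genome rather than restricting the data to chromosome 1.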

Thank you very much! I would deeply appreciate any input.

chromosome sra fasta fastq genome • 2.0k views

ADD COMMENT • link 2.6 years ago by audreyrosemary11 ▴ 10

score 1 · Answer 1 · 2022-05-05

1

2.6 years ago

Matthias Zepper 5.0k

May I ask what the goal of your research is and if you are affiliated with an academic institution?

As you have already realized, analyzing 79 genomes is a computationally heavy task for which neither a private laptop nor a public Galaxy instance is really suitable. I would advise against reducing your reference genome to just one chromosome, because it may result in a significant number of falsely mapped reads, e.g. if your reference contains a pseudogene of a gene from a different chromosome that is no longer part of the search space. Furthermore, all reads would still have to be processed (you could use bloomfilter.sh from BBTools to eliminate non-matching reads prior to mapping, but this tool also requires a lot of memory...)
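One way to respect this caveat is to map against the full reference and only afterwards keep the alignments that landed on chromosome 1. A rough Python sketch of that filtering step on SAM text follows; the function name `filter_sam_by_chromosome` and the chromosome label `chr1` are assumptions for illustration, since the actual name depends on how the sequences are labelled in your reference.

```python
def filter_sam_by_chromosome(lines, chromosome, min_mapq=0):
    """Yield SAM header lines plus alignments to `chromosome` with MAPQ >= min_mapq."""
    for line in lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # skip records without the 11 mandatory SAM columns
        rname, mapq = fields[2], int(fields[4])  # columns 3 (RNAME) and 5 (MAPQ)
        if rname == chromosome and mapq >= min_mapq:
            yield line


if __name__ == "__main__":
    import sys

    # Example: python filter_sam.py chr1 20 < all.sam > chr1.sam
    target, threshold = sys.argv[1], int(sys.argv[2])
    sys.stdout.writelines(filter_sam_by_chromosome(sys.stdin, target, threshold))
```

On a sorted and indexed BAM file, `samtools view` with a region argument does the same job far more efficiently; the sketch only shows the logic. It does not reduce the mapping cost, because every read still has to be aligned first.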

Since the data you are analyzing is from the Great Ape Genome Project: what is the reason that you are not using the published variant calls? You can download the whole set or only the variants from chromosome 1 directly from dbVar. Unless you have a good reason to repeat the analysis from scratch, you might want to try this first.

In case you wish to redo the analysis, getting access to better computational resources is unavoidable: the raw data alone is 4.3 TB, so you will temporarily need >25 TB of disk space. Maybe your university has a high-performance computing facility, or you can collaborate with researchers elsewhere who could provide you with compute resources? (Sometimes the big cloud providers also give some free credits to academic institutions / work groups, but certainly not enough to run this analysis.)

Good luck!

ADD COMMENT • link 2.6 years ago by Matthias Zepper 5.0k

1

Thank you very much for your reply! I am an undergrad student, and this is only the work for a final project in my bioinformatics class. The group's initial idea was to call SNPs using different chimpanzee genome sequences (about 5) and compare the results using the taxonomy browser data. However, several of the variant calling programs we downloaded ran into issues. That includes bcftools, which I'm pretty sure cannot be supported on Windows: I spent multiple days unsuccessfully troubleshooting an error about recognizing the plugins, and other students experienced a similar situation. So, as a last resort, I decided at the last minute to use the Great Ape Genome Project datasets and to work on Galaxy instead, due to the limitations of my laptop's memory. It took me some time to realize that the taxonomy data I had initially retrieved and parsed all came from one individual, sequenced in different years. Because of the computational expense and the large size of the data, I wanted to switch to looking only at the first chromosome in circa 5 genomes and perform the analysis on that. But again, I cannot find a way to download only the chromosome 1 sequence from the published data of the ape project, which was the only place I could find genomes sequenced from different individuals of the same species.

I am aware that the Great Ape Genome Project has already published processed data and called SNPs, but then I could only analyze their data myself, which I feel might be too little work for the final project, even though I also find it pointless to repeat their variant calling work. At this point it is unfortunately probably too late to change the purpose of the project - if I had decided on the main research purpose, I would have started with a different scope, for instance looking at shorter sequences from a specific tissue, so that I would not need to work with large genomic data. I would appreciate it, though, if you could mention whether there is a way to download .VCF files for each species from dbVar. Maybe I am too tired at this moment, but I see you can only download the whole data for each chromosome and not individual files, unless I am wrong. I found a .VCF file on their original website, but for the entirety of each species and not for individual ones.

Thank you very much again for your help!

ADD REPLY • link 2.6 years ago by audreyrosemary11 ▴ 10

1

is there any way I can access the chromosome 1 sequence of a genome sequence from the ape project datasets

You can download one chromosome of perhaps all ape genomes. Here is chromosome 1 for Chimpanzee. Use the drop down menu to select Fasta and then send to a file to download the sequence file.

ADD REPLY • link 2.6 years ago by GenoMax 147k

0

Thank you very much! Sorry for the late reply. I had already found and downloaded the reference chromosome 1 of Clint the chimpanzee from NCBI, but my issue was finding chromosomes for other individual chimpanzees without having to download the entire genome for each individual. From my search, NCBI only shows individual chromosome files for one chimpanzee.

ADD REPLY • link 2.6 years ago by audreyrosemary11 ▴ 10

1

I would appreciate though if you could mention whether there is a way to download .VCF files for each species from dbVar?

Since NCBI now only hosts human SNP data you will need to look elsewhere for Apes.

Ensembl has chimp genome variation files here (note that this path is the GVF directory): http://ftp.ensembl.org/pub/release-106/variation/gvf/pan_troglodytes/
Orangutan: http://ftp.ensembl.org/pub/release-106/variation/gvf/pongo_abelii/
Rhesus macaque: http://ftp.ensembl.org/pub/release-106/variation/gvf/macaca_mulatta/

ADD REPLY • link 2.6 years ago by GenoMax 147k

0

Maybe I am too tired at this moment, but I see you can only download the whole data for each chromosome and not individual files, unless I am wrong. I found a .VCF file from their original website, but for the entirety of each species and not individual ones.

Admittedly, I was unaware that dbSNP now hosts only human data. I had just noticed the link "Download 1329329 Variant Calls" on this page and assumed it would be what you are looking for. Unfortunately, this file indeed doesn't contain individual SNPs, only deletions and duplications.

On the plus side, this makes the file a lot smaller than a full SNP VCF would be, and thus easily manageable on a personal computer - one could read it with R or Python and then create some nice visualizations such as genomic trees. It is clearly resolved to the individual (e.g. "Pan_troglodytes_ellioti-Kopongo", "Gorilla_gorilla_gorilla-9752_Suzie") and also partly cross-mapped to other reference genomes.
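As a rough starting point for reading such a file in Python, the sketch below tallies records per chromosome and variant type. It assumes a plain-text, tab-separated VCF whose INFO column carries an `SVTYPE=...` entry (e.g. DEL or DUP); I have not checked the layout of the actual download, so treat that as an assumption, and note that `count_variant_types` is just an illustrative name.

```python
from collections import Counter


def count_variant_types(lines):
    """Count VCF records per (chromosome, SVTYPE) pair."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, the column header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # a VCF data line has at least 8 fixed columns
        chrom, info = fields[0], fields[7]
        # INFO is a ';'-separated list; only key=value entries are kept here.
        info_map = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
        counts[(chrom, info_map.get("SVTYPE", "unknown"))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Example: python count_variants.py calls.vcf
    with open(sys.argv[1]) as handle:
        for (chrom, svtype), n in sorted(count_variant_types(handle).items()):
            print(chrom, svtype, n, sep="\t")
```

Splitting the counts by individual would need whatever field the file uses to name the sample, which is not covered here.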

I'm pretty sure cannot be supported on Windows since I have been trying to troubleshoot unsuccessfully an error regarding recognizing the plugins for multiple days and other students experienced a similar situation)

Possible. I think most bioinformaticians use either macOS or Linux for their work. In the long run, you should then probably install Docker on your computer and run the software containerized. It takes a while to get started, but once you have learned the ropes, it works like a charm: pull an image from Biocontainers, Dockerhub, Quay.io, Galaxy Depot etc. and just run it - version control and reproducibility included.

Good luck with the analysis and your class assignment!

ADD REPLY • link 2.6 years ago by Matthias Zepper 5.0k