Hello there! I am trying to do variant calling on sorted genome sequences from different individuals of the same chimpanzee species. The data come from the Great Ape Genome Project, via the NCBI BioProject at https://www.ncbi.nlm.nih.gov/bioproject/189439, which lists the sequencing data for each individual of the species. Each genome sequence can be accessed through its SRA run and downloaded via the NCBI SRA links (for instance: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR748138). I have run into a lot of issues because each genome sequence is very large (6-7 GB) and I cannot download them properly on my laptop. Eventually, I decided to use Galaxy as an online tool: download the chimpanzee genome sequences directly from the SRR number with the fastq-dump command, map them to my own reference genome to generate a SAM file, then use Medaka for variant calling.
The issue is that this is still computationally expensive. Initially, I ran snippy for variant calling on a SAM file downloaded from the SRR number with the sam-dump feature on Galaxy, but it couldn't be processed because of a lack of memory. Right now I am mapping a chimp genome sequence to the reference with BWA; the job started one hour ago and is still running. Due to time constraints, I want to reduce the size of the ape sequences and use either a small sample from each genome or only the chromosome 1 information from each genome sequence. So my question is: is there any way to access the chromosome 1 sequence of a genome from the ape project datasets and download it as a FASTA or FASTQ file directly from the SRR number via Galaxy? Or, at least, a way to view the chromosome 1 sequence and download it from the NCBI datasets? I can't find a way to do it; it only lets me download the whole genome sequence, which is too large to work with, not individual chromosomes. I was also thinking of splitting each genome sequence after downloading it into, let's say, 20 batch files, but on Galaxy I can only find a command to do that for FASTA files, not FASTQ.
Thank you very much! I would deeply appreciate any input.
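On the idea of splitting a FASTQ file into batches: a minimal sketch in plain Python (run outside Galaxy), assuming the common unwrapped layout of exactly four lines per read. The function names and the output naming scheme are made up for illustration. Note that the reads in a FASTQ file carry no chromosome label until they are mapped, so each batch is a thinner sample of the whole genome rather than a single chromosome.

```python
import gzip
from itertools import islice
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed file for reading in text mode."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def read_fastq(handle):
    """Yield one record (a list of 4 lines) at a time; assumes unwrapped FASTQ."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) != 4 or not record[0].startswith("@"):
            raise ValueError("malformed FASTQ record")
        yield record


def split_fastq(in_path, n_parts, out_prefix):
    """Distribute records round-robin over n_parts files; return the output paths."""
    # Hypothetical naming scheme: <out_prefix>.part01.fastq, .part02.fastq, ...
    out_paths = [Path(f"{out_prefix}.part{i + 1:02d}.fastq") for i in range(n_parts)]
    outputs = [open(p, "w") for p in out_paths]
    try:
        with open_text(in_path) as handle:
            for index, record in enumerate(read_fastq(handle)):
                outputs[index % n_parts].writelines(record)
    finally:
        for out in outputs:
            out.close()
    return out_paths
```

For example, `split_fastq("SRR748138.fastq", 20, "SRR748138")` would write 20 files, where the input file name is only a placeholder. For paired-end data, running the same call on both mate files should keep the pairs together, since read number k always lands in batch k mod 20.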
Thank you very much for your reply! I am an undergrad student, and this is only the work for a final project in my bioinformatics class. The group's initial idea was to call SNPs using different chimpanzee genome sequences (about 5) and compare the results using the taxonomy browser data. However, several of the variant calling programs I downloaded gave me problems. That includes bcftools, which I'm pretty sure is not supported on Windows: I spent multiple days unsuccessfully troubleshooting an error about recognizing the plugins, and other students had a similar experience. So, at the last minute, I decided to access the Great Ape Genome Project datasets as a last resort and use Galaxy instead, due to the limited memory of my laptop. It took me some time to realize that the taxonomy data I had initially retrieved and parsed came from only one individual of the species, sequenced in different years. Because of the computational expense and the large size of the data, I wanted to switch to looking only at the first chromosome in circa 5 genome sequences and then perform the analysis on that. But again, I cannot find a way to download only the chromosome 1 sequence from the published data of the ape project, which was the only place I could find genomes sequenced from different individuals of the same species.
I am aware that the Great Ape Genome Project has already published processed data and called SNPs, but then I could only analyze their data myself, which I feel might be too little work for the final project, even though I also find it pointless to repeat their variant calling work. At this point it is unfortunately probably too late to change the purpose of the project. If it had been me who decided on the main research purpose, I would have started with a different scope, for instance looking at shorter sequences from a specific tissue, so that I wouldn't need to work with large genomic data. I would appreciate it, though, if you could mention whether there is a way to download .VCF files for each species from dbVar. Maybe I am too tired at this moment, but as far as I can see you can only download the whole data for each chromosome and not individual files, unless I am wrong. I found a .VCF file on their original website, but it covers the entirety of each species and not individual ones.
Thank you very much again for your help!
You can download one chromosome of perhaps all ape genomes. Here is chromosome 1 for Chimpanzee. Use the drop-down menu to select Fasta and then send to a file to download the sequence file.
Thank you very much! Sorry for the late reply. I had already found and downloaded the reference chromosome 1 of Clint the chimpanzee from NCBI, but my issue was finding chromosomes for different individual chimpanzees without having to download the entire genome for each individual. From my search, NCBI only shows individual chromosome files for one chimpanzee.
Since NCBI now only hosts human SNP data, you will need to look elsewhere for apes.
Ensembl has chimp genome variation files (in GVF format) here: http://ftp.ensembl.org/pub/release-106/variation/gvf/pan_troglodytes/
Orangutan : http://ftp.ensembl.org/pub/release-106/variation/gvf/pongo_abelii/
Rhesus macaque : http://ftp.ensembl.org/pub/release-106/variation/gvf/macaca_mulatta/
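Going by the gvf/ path in the links above, those files should be in GVF format, which is laid out like GFF3: pragma and comment lines start with "#", and each record has nine tab-separated columns. A rough sketch for keeping only the records on one chromosome follows. It assumes the chromosome is named "1" in the first column, which you should check against the actual file, and the function names are invented for illustration.

```python
def parse_gvf_line(line):
    """Split one GVF record into a dict; attributes become a nested dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    seqid, source, type_, start, end, score, strand, phase, attrs = fields
    # Attributes look like "ID=1;Variant_seq=A;Reference_seq=G"
    attributes = dict(item.split("=", 1) for item in attrs.split(";") if "=" in item)
    return {
        "seqid": seqid,
        "source": source,
        "type": type_,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }


def variants_on(lines, seqid):
    """Yield parsed records whose first column equals seqid."""
    for line in lines:
        # Skip pragmas ("##..."), comments ("#...") and blank lines
        if line.startswith("#") or not line.strip():
            continue
        record = parse_gvf_line(line)
        if record["seqid"] == seqid:
            yield record
```

With a gzip-compressed download, something like `with gzip.open(path, "rt") as fh: chr1 = list(variants_on(fh, "1"))` would collect the chromosome 1 records, where `path` is whatever file you fetched. Because the file is read line by line, memory use stays small until you collect the results.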
Admittedly, I was unaware that dbSNP now hosts only human data. I had just noticed the link "Download 1329329 Variant Calls" on this page and assumed it would be what you are looking for. Unfortunately, this file indeed doesn't contain individual SNPs, only deletions and duplications.
On the plus side, this at least makes the file a lot smaller than a full SNP VCF would be and thus easily manageable on a personal computer - one could read it with R or Python and then create some nice visualizations like genomic trees. It is clearly resolved to the individual (e.g. "Pan_troglodytes_ellioti-Kopongo", "Gorilla_gorilla_gorilla-9752_Suzie") and also partly cross-mapped to other reference genomes.
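As a starting point for reading such a file with Python, here is a small sketch that counts, for each individual, how many records carry a non-reference genotype. It assumes the standard VCF layout with one column per sample after FORMAT and a GT entry in FORMAT. I have not verified that the dbVar download is structured this way, so treat it as a template, and note that the function name is made up.

```python
from collections import Counter


def count_nonref_calls(lines):
    """Count, per sample column, the records with a non-reference genotype."""
    samples = []
    counts = Counter()
    for line in lines:
        if line.startswith("##") or not line.strip():
            continue  # meta-information or blank line
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        if not samples:
            raise ValueError("no #CHROM header with sample columns before the first record")
        if len(fields) < 10:
            continue  # record without genotype columns
        format_keys = fields[8].split(":")
        if "GT" not in format_keys:
            continue
        gt_index = format_keys.index("GT")
        for sample, value in zip(samples, fields[9:]):
            parts = value.split(":")
            genotype = parts[gt_index] if gt_index < len(parts) else "."
            alleles = genotype.replace("|", "/").split("/")
            # "0" is the reference allele and "." a missing call
            if any(allele not in ("0", ".") for allele in alleles):
                counts[sample] += 1
    return counts
```

Calling `count_nonref_calls(open(path))` on the downloaded file (with `path` being wherever you saved it) returns a Counter keyed by sample name, which can go straight into a bar plot or serve as a sanity check before building anything more elaborate.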
Possible. I think most bioinformaticians use either macOS or Linux for their work. In the long run, you should then probably install Docker on your computer and run the software containerized. It takes a while to get started, but once you have learned the ropes, it works like a charm: pull an image from Biocontainers, Dockerhub, Quay.io, Galaxy Depot etc. and just run it - version control and reproducibility included.
Good luck with the analysis and your class assignment!