Hello
I am interested in specific loci of the genome in TARGET WGS data, so I am trying to download only those regions rather than the whole SRA files, which take forever to download. So far, I have figured out that I can download the SRA files, convert them to BAM, take the slice of interest and delete the rest of the data. The problem is that converting SRA to BAM takes a lot of time, and I have quite a number of files to process.
I know GDC has a built-in BAM slicing function, but GDC does not contain the complete TARGET data yet. I am looking for something similar that works on other platforms.
https://docs.gdc.cancer.gov/API/Users_Guide/BAM_Slicing/
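For reference, the kind of region request I mean looks roughly like this. It is a minimal sketch based on the slicing guide linked above; the function name and the UUID placeholder are hypothetical, and a real request for controlled-access data would also need a GDC token sent in the X-Auth-Token header.

```python
from urllib.parse import urlencode

# Slicing endpoint as described in the GDC BAM Slicing guide.
GDC_SLICING_ENDPOINT = "https://api.gdc.cancer.gov/slicing/view"


def build_slice_url(file_uuid, regions):
    """Build a GDC BAM slicing URL for one BAM file and one or more regions.

    file_uuid is the GDC UUID of the BAM file; regions are strings such as
    "chr1" or "chr3:1000-2000". This helper is illustrative only.
    """
    # One "region" parameter per locus; keep ":" unescaped for readability.
    query = urlencode([("region", region) for region in regions], safe=":")
    return f"{GDC_SLICING_ENDPOINT}/{file_uuid}?{query}"


if __name__ == "__main__":
    # "BAM-FILE-UUID" is a placeholder, not a real file identifier.
    print(build_slice_url("BAM-FILE-UUID", ["chr1", "chr3:1000-2000"]))
```

The response to such a request is a BAM file containing only the reads that overlap the requested regions, which is what I would like to get from SRA as well.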
Is there any API on the dbGaP or SRA website that lets me slice the SRA files, so that I do not have to download the whole WGS files?
Thanks
FASTQ files contain unaligned reads. They carry no positional information, so they cannot be subset by region before alignment. What you can do is check whether your data of interest are mirrored at the European Nucleotide Archive (ENA). ENA mirrors most NCBI data as FASTQ instead of this terrible SRA junk, and offers download via Aspera (allowing downloads at up to 100 Mb/s), which speeds up at least the data acquisition step. The alignment will still take time.
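To check whether a run is mirrored, you can query the ENA Portal API file report. Below is a minimal sketch, assuming the filereport endpoint with the read_run result and the fastq_ftp and fastq_aspera fields; the helper names and the accessions in the usage note are hypothetical. A run with an empty FASTQ list has no FASTQ mirror.

```python
import csv
import io
from urllib.parse import urlencode

# ENA Portal API file report endpoint (assumed here, check the ENA docs).
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(accession):
    """Build the file report URL for a run, experiment or study accession."""
    params = {
        "accession": accession,
        "result": "read_run",
        "fields": "run_accession,fastq_ftp,fastq_aspera",
        "format": "tsv",
    }
    return f"{ENA_FILEREPORT}?{urlencode(params)}"


def parse_filereport(tsv_text):
    """Map each run accession to its list of FASTQ paths.

    ENA separates multiple files (e.g. paired-end mates) with ";".
    An empty list means no FASTQ files are listed for that run.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {
        row["run_accession"]: [p for p in row["fastq_ftp"].split(";") if p]
        for row in reader
    }
```

To use it, fetch the report text with something like urllib.request.urlopen(filereport_url("SRR0000000")).read().decode() and pass the result to parse_filereport. The paths in the fastq_aspera column can then be handed to the Aspera client for the fast download.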