I'm trying to download whole genome sequencing data from the 1000 genomes project to pad out some GVCF files I have for use in variant recalibration. There seems to be a problem somewhere between converting cram to bam, and extracting my region of interest, as the resulting bam seems corrupted. Here's the workflow I'm using.
Download 1000 genomes files from here: https://www.internationalgenome.org/data-portal/data-collection/30x-grch38. I've downloaded the first 100 cram files, along with the reference sequence they used (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa)
Convert cram to bam
samtools view -b -T "$REF" -o "${DIR}/bam/${SN}.bam" "${DIR}/${SN}.cram"
Index the bam file
samtools index "${DIR}/bam/${SN}.bam"
Extract the region of interest (it's in MSH3 on chromosome 5)
samtools view -h "${DIR}/bam/${SN}.bam" "chr5:80654720-80655181" > "${DIR}/MSH3/${SN}_MSH3.bam"
Sort the bam file
picard SortSam I="${DIR}/MSH3/${SN}_MSH3.bam" O="${DIR}/MSH3/${SN}_MSH3.sorted.bam" SORT_ORDER=coordinate
Index the bam file
samtools index "${DIR}/MSH3/${SN}_MSH3.sorted.bam"
But I get this result from attempted indexing:
samtools index: "/mnt/gpfs/live/rd01__/ritd-ag-project-rd018o-mdflo13/refs/1000G/cram/MSH3/HG00118.final_MSH3.sorted.bam" is in a format that cannot be usefully indexed
Ok thanks, I’ll try when back at the computer. It sounds like you’re suggesting calling
samtools view -h "${DIR}/${SN}.cram" "chr5:80654720-80655181" > "${DIR}/MSH3/${SN}_MSH3.sam"
on the original cram file to generate a sam for the region of interest?Hallelujah that works!! Thank you
Could this be adapted to extract a list of regions of interest from crams? Thank you
read the manual. see options
-M
and-L
Thanks, I've got that going I think with this:
But it's really slow with -L! Is there a faster option??
read the manual. see options
-M
and-L
Thank you. The manual suggests -M may speed things up, and it looks like "its path has to be preceded by -L option". I'm sorry if I'm being dense, but I've tried several ways of adding in the -M option, but it keeps erroring. It's only running when I use -L alone, but I've still not got that to run to completion yet - still running! I've only got 7 regions of interest.
Ah, cracked it with this. Thank you!