Extract a gene region from a whole genome cram file from 1000 genomes
1
0
Entering edit mode
3.0 years ago

I'm trying to download whole genome sequencing data from the 1000 genomes project to pad out some GVCF files I have for use in variant recalibration. There seems to be a problem somewhere between converting cram to bam, and extracting my region of interest, as the resulting bam seems corrupted. Here's the workflow I'm using.

  1. Download 1000 genomes files from here: https://www.internationalgenome.org/data-portal/data-collection/30x-grch38. I've downloaded the first 100 cram files, along with the reference sequence they used (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa)

  2. Convert cram to bam

    samtools view -b -T "$REF" -o "${DIR}/bam/${SN}.bam" "${DIR}/${SN}.cram"

  3. Index the bam file

    samtools index "${DIR}/bam/${SN}.bam"

  4. Extract the region of interest (it's in MSH3 on chromosome 5)

    samtools view -h "${DIR}/bam/${SN}.bam" "chr5:80654720-80655181" > "${DIR}/MSH3/${SN}_MSH3.bam"

  5. Sort the bam file

    picard SortSam I="${DIR}/MSH3/${SN}_MSH3.bam" O="${DIR}/MSH3/${SN}_MSH3.sorted.bam" SORT_ORDER=coordinate

  6. Index the bam file

    samtools index "${DIR}/MSH3/${SN}_MSH3.sorted.bam"

But I get this result from attempted indexing:

samtools index: "/mnt/gpfs/live/rd01__/ritd-ag-project-rd018o-mdflo13/refs/1000G/cram/MSH3/HG00118.final_MSH3.sorted.bam" is in a format that cannot be usefully indexed
samtools cram bam • 3.0k views
ADD COMMENT
3
Entering edit mode
3.0 years ago

you don't need to convert CRAM to BAM

you don't need to sort after extracting the region

samtools view -h "${DIR}/bam/${SN}.bam" "chr5:80654720-80655181" > "${DIR}/MSH3/${SN}_MSH3.bam" generates a SAM file, not a BAM file.

ADD COMMENT
0
Entering edit mode

Ok thanks, I’ll try when back at the computer. It sounds like you’re suggesting calling samtools view -h "${DIR}/${SN}.cram" "chr5:80654720-80655181" > "${DIR}/MSH3/${SN}_MSH3.sam" on the original cram file to generate a sam for the region of interest?

ADD REPLY
2
Entering edit mode
samtools index in.cram #if needed
samtools view -O bam -o sub.bam -T ref.fa in.cram "chr1:234-567"
samtools index sub.bam
ADD REPLY
0
Entering edit mode

Hallelujah that works!! Thank you

ADD REPLY
0
Entering edit mode

Could this be adapted to extract a list of regions of interest from crams? Thank you

ADD REPLY
0
Entering edit mode

read the manual. see options -M and -L

ADD REPLY
0
Entering edit mode

Thanks, I've got that going I think with this:

samtools view -L "${CRAMREF}/ROI.bed" -O bam -o "${DIR}/ROI/${SN}_ROI.bam" -T "${CRAMREF}/GRCh38_full_analysis_set_plus_decoy_hla.fa" "${DIR}/${SN}.cram"

But it's really slow with -L! Is there a faster option??

ADD REPLY
1
Entering edit mode

read the manual. see options -M and -L

ADD REPLY
0
Entering edit mode

Thank you. The manual suggests -M may speed things up, and it looks like "its path has to be preceded by -L option". I'm sorry if I'm being dense, but I've tried several ways of adding in the -M option, but it keeps erroring. It's only running when I use -L alone, but I've still not got that to run to completion yet - still running! I've only got 7 regions of interest.

(bioinfo) skgtmdf@rds-gw-003:/mnt/gpfs/live/rd01__/ritd-ag-project-rd018o-mdflo13/refs/1000G/cram $ samtools view -LM "${CRAMREF}/ROI.bed" -O bam -o "${DIR}/ROI/${SN}_ROI.bam" -T "${CRAMREF}/GRCh38_full_analysis_set_plus_decoy_hla.fa" "${DIR}/${SN}.cram"
samtools view: Could not read file "M": No such file or directory
(bioinfo) skgtmdf@rds-gw-003:/mnt/gpfs/live/rd01__/ritd-ag-project-rd018o-mdflo13/refs/1000G/cram $ samtools view -L "${CRAMREF}/ROI.bed" -M "${CRAMREF}/ROI.bed" -O bam -o "${DIR}/ROI/${SN}_ROI.bam" -T "${CRAMREF}/GRCh38_full_analysis_set_plus_decoy_hla.fa" "${DIR}/${SN}.cram"
[main_samview] fail to read the header from "/mnt/gpfs/live/rd01__/ritd-ag-project-rd018o-mdflo13/refs/1000G/cram/refs/ROI.bed".
(bioinfo) skgtmdf@rds-gw-003:/mnt/gpfs/live/rd01__/ritd-ag-project-rd018o-mdflo13/refs/1000G/cram $ samtools view -M "${CRAMREF}/ROI.bed" -O bam -o "${DIR}/ROI/${SN}_ROI.bam" -T "${CRAMREF}/GRCh38_full_analysis_set_plus_decoy_hla.fa" "${DIR}/${SN}.cram"
[main_samview] fail to read the header from "/mnt/gpfs/live/rd01__/ritd-ag-project-rd018o-mdflo13/refs/1000G/cram/refs/ROI.bed".
(bioinfo) skgtmdf@rds-gw-003:/mnt/gpfs/live/rd01__/ritd-ag-project-rd018o-mdflo13/refs/1000G/cram $ samtools view -L "${CRAMREF}/ROI.bed" -O bam -o "${DIR}/ROI/${SN}_ROI.bam" -T "${CRAMREF}/GRCh38_full_analysis_set_plus_decoy_hla.fa" "${DIR}/${SN}.cram"
ADD REPLY
0
Entering edit mode

Ah, cracked it with this. Thank you!

samtools view -M -L "${CRAMREF}/ROI.bed" -O bam -o "${DIR}/ROI/${SN}_ROI.bam" -T "${CRAMREF}/GRCh38_full_analysis_set_plus_decoy_hla.fa" "${DIR}/${SN}.cram"
ADD REPLY

Login before adding your answer.

Traffic: 1727 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6