Getting information on CRAM files from headers inside the files
24 months ago
langziv ▴ 70

Hello.

I would like to know whether the following information can be found in the headers of CRAM files:

1) Is the sequencing data in a CRAM file from WGS or WES, and if that is recorded, where?

and

2) If one file can contain data from multiple genomes (for instance, from multiple patients), can information about those genomes be found in the CRAM headers? For example, in lines such as

   @PG  ID:bwa  PN:bwa  VN:0.7.15-r1140 CL:/opt/bin/bwa-0.7.15/bwa mem -t 16 -M -Y -R @RG\tID:180213_I006_CL100063425_L1_PL1802120047-517\tPL:ILLUMINA\tPU:CL100063425_L1\tLB:PL1802120047\tSM:1000000047 Homo_sapiens_assembly38/Homo_sapiens_assembly38.fa /l3bioinfo/CL100063425_L01_517_1.00_1.fq.gz /l3bioinfo/CL100063425_L01_517_1.00_2.fq.gz
   @PG  ID:bwa.1    PN:bwa  VN:0.7.15-r1140 CL:/opt/bin/bwa-0.7.15/bwa mem -t 16 -M -Y -R @RG\tID:180213_I006_CL100063425_L1_PL1802120047-518\tPL:ILLUMINA\tPU:CL100063425_L1\tLB:PL1802120047\tSM:1000000047 Homo_sapiens_assembly38/Homo_sapiens_assembly38.fa /l3bioinfo/CL100063425_L01_518_1.00_1.fq.gz /l3bioinfo/CL100063425_L01_518_1.00_2.fq.gz
   @PG  ID:bwa.2    PN:bwa  VN:0.7.15-r1140 CL:/opt/bin/bwa-0.7.15/bwa mem -t 16 -M -Y -R @RG\tID:180213_I006_CL100063425_L1_PL1802120047-519\tPL:ILLUMINA\tPU:CL100063425_L1\tLB:PL1802120047\tSM:1000000047 Homo_sapiens_assembly38/Homo_sapiens_assembly38.fa /l3bioinfo/CL100063425_L01_519_1.00_1.fq.gz /l3bioinfo/CL100063425_L01_519_1.00_2.fq.gz
   @PG  ID:bwa.3    PN:bwa  VN:0.7.15-r1140 CL:/opt/bin/bwa-0.7.15/bwa mem -t 16 -M -Y -R @RG\tID:180213_I006_CL100063425_L1_PL1802120047-520\tPL:ILLUMINA\tPU:CL100063425_L1\tLB:PL1802120047\tSM:1000000047 Homo_sapiens_assembly38/Homo_sapiens_assembly38.fa /l3bioinfo/CL100063425_L01_520_1.00_1.fq.gz /l3bioinfo/CL100063425_L01_520_1.00_2.fq.gz
   @PG  ID:bwa.4    PN:bwa  VN:0.7.15-r1140 CL:/opt/bin/bwa-0.7.15/bwa mem -t 16 -M -Y -R @RG\tID:180213_I006_CL100063425_L1_PL1802120047-521\tPL:ILLUMINA\tPU:CL100063425_L1\tLB:PL1802120047\tSM:1000000047 Homo_sapiens_assembly38/Homo_sapiens_assembly38.fa /l3bioinfo/CL100063425_L01_521_1.00_1.fq.gz /l3bioinfo/CL100063425_L01_521_1.00_2.fq.gz
   @PG  ID:bwa.5    PN:bwa  VN:0.7.15-r1140 CL:/opt/bin/bwa-0.7.15/bwa mem -t 16 -M -Y -R @RG\tID:180213_I006_CL100063425_L1_PL1802120047-522\tPL:ILLUMINA\tPU:CL100063425_L1\tLB:PL1802120047\tSM:1000000047 Homo_sapiens_assembly38/Homo_sapiens_assembly38.fa /l3bioinfo/CL100063425_L01_522_1.00_1.fq.gz /l3bioinfo/CL100063425_L01_522_1.00_2.fq.gz
   @PG  ID:bwa.6    PN:bwa  VN:0.7.15-r1140 CL:/opt/bin/bwa-0.7.15/bwa mem -t 16 -M -Y -R @RG\tID:180213_I006_CL100063425_L1_PL1802120047-523\tPL:ILLUMINA\tPU:CL100063425_L1\tLB:PL1802120047\tSM:1000000047 Homo_sapiens_assembly38/Homo_sapiens_assembly38.fa /l3bioinfo/CL100063425_L01_523_1.00_1.fq.gz /l3bioinfo/CL100063425_L01_523_1.00_2.fq.gz
   @PG  ID:bwa.7    PN:bwa  VN:0.7.15-r1140 CL:/opt/bin/bwa-0.7.15/bwa mem -t 16 -M -Y -R @RG\tID:180213_I006_CL100063425_L1_PL1802120047-524\tPL:ILLUMINA\tPU:CL100063425_L1\tLB:PL1802120047\tSM:1000000047 Homo_sapiens_assembly38/Homo_sapiens_assembly38.fa /l3bioinfo/CL100063425_L01_524_1.00_1.fq.gz /l3bioinfo/CL100063425_L01_524_1.00_2.fq.gz

I'll provide more headers if needed.
Thanks.

cram-file • 1.2k views
24 months ago
GenoMax 147k

The only way to know that from the header is if the file names contain some indicative information and the complete aligner command line, including the input file and index names, is embedded in the CRAM file.

Otherwise you may need to examine the files in a genome browser, or use mosdepth to check whether the coverage is "peaky". Peaky coverage that matches gene models would be indicative of exome sequencing.
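As a rough illustration of the coverage check, the sketch below summarizes per-window mean depths, for example the fourth column of the `regions.bed.gz` file that mosdepth writes when run with `--by` and a fixed window size. The function names, the 1x cutoff, and the 50% threshold are assumptions made for illustration, not established values. It also does not compare the peaks against gene models, which would still have to be checked separately.

```python
import gzip
from statistics import median


def load_window_depths(path):
    """Read the mean depth per window (4th column) from a gzipped BED file."""
    with gzip.open(path, "rt") as handle:
        return [float(line.rstrip("\n").split("\t")[3]) for line in handle if line.strip()]


def coverage_summary(depths, low_depth=1.0):
    """Summarize how many windows are essentially uncovered.

    low_depth is an assumed cutoff (in x coverage) for an "empty" window.
    """
    depths = list(depths)
    if not depths:
        raise ValueError("no windows supplied")
    low = sum(1 for d in depths if d < low_depth)
    return {
        "windows": len(depths),
        "fraction_low": low / len(depths),
        "median_depth": median(depths),
    }


def looks_peaky(summary, min_fraction_low=0.5):
    """Heuristic only: many near-empty windows suggests targeted (e.g. exome) data."""
    return summary["fraction_low"] >= min_fraction_low


# Synthetic depths, invented for illustration (not real measurements).
even_depths = [30.0] * 100                # roughly uniform coverage
peaky_depths = [0.0] * 95 + [80.0] * 5    # a few covered windows, the rest empty

print(looks_peaky(coverage_summary(even_depths)))
print(looks_peaky(coverage_summary(peaky_depths)))
```

With real data, `load_window_depths` would supply the depths, and the thresholds would need to be chosen by looking at the distribution rather than taken from this sketch.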


Thanks @GenoMax.
Would providing the relevant headers from a CRAM file allow a more specific answer?


Thinking about this again, it may be safer to look at coverage than to guess based on file names.

Based on the @PG lines you included above, the alignment appears to have been done against the entire genome, but that does not tell us whether the data is WGS or WES. The sample file names don't mean anything to us (unless they do to you).
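To review the input file names recorded in the @PG lines without reading the long command lines by eye, a small sketch like the following could be run on the header text (for example, the output of `samtools view -H`). The function name and the list of FASTQ suffixes are assumptions for illustration, and the example line is copied from the question.

```python
def fastq_inputs(header_text, suffixes=(".fq.gz", ".fastq.gz", ".fq", ".fastq")):
    """Collect tokens that look like FASTQ paths from the CL: part of @PG lines."""
    inputs = []
    for line in header_text.splitlines():
        if not line.lstrip().startswith("@PG"):
            continue
        # Everything after "CL:" is the recorded command line.
        _, _, command = line.partition("CL:")
        inputs.extend(tok for tok in command.split() if tok.endswith(suffixes))
    return inputs


# One @PG line from the question, kept as a raw string so "\t" stays literal.
PG_EXAMPLE = r"""@PG  ID:bwa  PN:bwa  VN:0.7.15-r1140 CL:/opt/bin/bwa-0.7.15/bwa mem -t 16 -M -Y -R @RG\tID:180213_I006_CL100063425_L1_PL1802120047-517\tPL:ILLUMINA\tPU:CL100063425_L1\tLB:PL1802120047\tSM:1000000047 Homo_sapiens_assembly38/Homo_sapiens_assembly38.fa /l3bioinfo/CL100063425_L01_517_1.00_1.fq.gz /l3bioinfo/CL100063425_L01_517_1.00_2.fq.gz"""

for path in fastq_inputs(PG_EXAMPLE):
    print(path)
```

This only lists what the aligner was given. Whether the names say anything about the assay still depends on the naming convention of whoever produced the files.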


I added these @PG lines to show that the CRAM file includes results from 8 separate BWA runs, which made me think that the file might contain data for 8 separate genomes.


The only part that differs in the names is the 5xx number, e.g. CL100063425_L01_**524**. If you think that indicates a different sample, then perhaps it does. Otherwise these may simply be data from multiple lanes for the same library.

Otherwise there is no way to draw a conclusion from the information you have at hand.
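One thing that can be checked mechanically is the read group string that each bwa run was given via `-R`, since it carries the sample (SM) and library (LB) tags. The sketch below pulls those fields out of the @PG lines. The function names are made up for illustration, the two example lines are copied from the question, and the parsing assumes the read group appears as a literal `@RG\t...` string as it does there. The file's actual @RG header lines, if present, would be the more authoritative place to look.

```python
import re

# Matches the argument following -R, with or without surrounding quotes.
RG_ARGUMENT = re.compile(r"""-R\s+["']?(@RG[^\s"']+)""")


def read_groups_from_pg(header_text):
    """Extract read-group fields from the -R argument recorded in @PG CL: entries."""
    groups = []
    for line in header_text.splitlines():
        if not line.lstrip().startswith("@PG"):
            continue
        match = RG_ARGUMENT.search(line)
        if not match:
            continue
        fields = {}
        # The separators are the two characters backslash + "t", not real tabs.
        for part in match.group(1).split("\\t")[1:]:
            key, _, value = part.partition(":")
            fields[key] = value
        groups.append(fields)
    return groups


def distinct_values(groups, tag):
    """Return the set of values seen for one tag (e.g. "SM") across read groups."""
    return {g[tag] for g in groups if tag in g}


# Two @PG lines from the question, kept raw so "\t" stays literal.
HEADER_EXAMPLE = r"""@PG  ID:bwa  PN:bwa  VN:0.7.15-r1140 CL:/opt/bin/bwa-0.7.15/bwa mem -t 16 -M -Y -R @RG\tID:180213_I006_CL100063425_L1_PL1802120047-517\tPL:ILLUMINA\tPU:CL100063425_L1\tLB:PL1802120047\tSM:1000000047 Homo_sapiens_assembly38/Homo_sapiens_assembly38.fa /l3bioinfo/CL100063425_L01_517_1.00_1.fq.gz /l3bioinfo/CL100063425_L01_517_1.00_2.fq.gz
@PG  ID:bwa.1    PN:bwa  VN:0.7.15-r1140 CL:/opt/bin/bwa-0.7.15/bwa mem -t 16 -M -Y -R @RG\tID:180213_I006_CL100063425_L1_PL1802120047-518\tPL:ILLUMINA\tPU:CL100063425_L1\tLB:PL1802120047\tSM:1000000047 Homo_sapiens_assembly38/Homo_sapiens_assembly38.fa /l3bioinfo/CL100063425_L01_518_1.00_1.fq.gz /l3bioinfo/CL100063425_L01_518_1.00_2.fq.gz"""

groups = read_groups_from_pg(HEADER_EXAMPLE)
print(len(groups), "read groups")
print("SM values:", distinct_values(groups, "SM"))
print("LB values:", distinct_values(groups, "LB"))
```

In the eight @PG lines quoted in the question, the SM and LB values are the same in every line (SM:1000000047, LB:PL1802120047), and only the ID suffix and the FASTQ names change. That would be consistent with one sample and library split across several input files, although it only reflects what was passed to bwa and says nothing about WGS versus WES.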
