I have received some CRAM files with paired-end WGS data. I would like to remap the reads in a specific region to another reference.
However, most of the template/read names have the format filename.cram:21937611
, so I suspect that the CRAM files have been stored with samtools ... lossy_names=1
.
When converting the CRAM file into FASTQ files (samtools fastq), I loose most of the reads, and from the CRAM file, I can see that <10% of the read names are unique.
Is it possible to generate new read names based on the information present in the CRAM file? Of course it would not be possible to recreate the old names, but is there a solution that will allow me to generate fastq files and perform realignment?
I've inserted one example for a read read name used for multiple reads below. Note that the same sample/library has been analysed in 8 lanes.
From samtools documentation:
lossy_names=0|1 CRAM output only; defaults to 0 (off). If 1, templates with all members within the same CRAM slice will have their read names removed. New names will be automatically generated during decoding. Also see the name_prefix option.
An example where filename.cram:21937611
is found as read name for multiple reads (note that the the reads may come from different read groups: RG:Z:AHH7F7CCXX.sample_name.#
, but RG:Z:AHH7F7CCXX.sample_name.6
is found 4 times)
filename.cram:21937611 99 6 28999893 60 150M = 29000058 315 AACATACCCATGTTCCCATTATGACGTTTCTTCATTTAATTAATAAATTAGAAAAAAATTCTTGTTGGATGAGTTACTAATGCCCTGAAGAATTGGATTAACCACTGGTCATACTGACACTACAGTGCCATTCACACTTAAATGCAACAG AAFFFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK MC:Z:150M MD:Z:150 PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.3 NM:i:0 MQ:i:60 AS:i:150 XS:i:0
filename.cram:21937611 147 6 29000058 60 150M = 28999893 -315 ATAGAAGTCTCTATTTAATGTGGATATTGGAAGTAAACTAAATGTGGGACTGGTGAAAATCCTTAATTAGGTTTGGTTAAATATATTTTCGGTTGGCTATTTGATGTCTTTTTAATGTATACTCTTGTTATTCATATTTACAGCTAGATT KKKKKFA<FFKKKFFFKKKKKKKKKKKKKKFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFFFAA MC:Z:150M MD:Z:150 PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.3 NM:i:0 MQ:i:60 AS:i:150 XS:i:20
filename.cram:21937611 1187 6 29028730 60 150M = 29028863 185 ACATCCCAGGGATGAAGCTGACTAGCTCATGGTGAATAAGCTTTTTGATGTGCTGTTGAATTGTTTGCTAGTATTTTATTGAAGATTTTCACATCAATGTCCATCAGGGATATTGGCCTGAAATTTTCTTTTTTGTTGTTGTTGTGTCTC AAA<FKKFKFK,,AAKKKFK<FFKKFKKKAFKKFKKKFAF,,77A,77,F7<KKK7FFA,K7FAAAFA7AFKFK<F7<FKKKKKF7FFKKKKKAKK,,AFAA<FFKKK,7,7F,A,,,A7AA,,<<FKKF,7,,<AF,,AFA,,,,<,7< MC:Z:18S52M80S MD:Z:150 PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.4 NM:i:0 MQ:i:40 AS:i:150 XS:i:66
filename.cram:21937611 1107 6 29028863 40 18S52M80S = 29028730 -185 GTCTTTTATTTGTCTTTGTGTTGTTGTTGTGTCTCTGCCCTGTTTTGTTAACAATATGATGCTGTCTTCAAACTTAGAGTTATGTAGGAAATCCCCTCTTACTTTATTGTGGATTAGTTTCTGAAGGAAGGATACCAGTTCATCTTAGTA ,,A,,,,,,,,,,,<,,,A<A,,K<<,,7,AA7A,,(AA<,,,7,7,,,,,7<,7,,<,7KFA77,KF<7,7,,,,,7,FFA7,,,,,7,,,(,,,,,,,,KK7F,KF,7<FA,7,AF,,A,<<,,7,,,FF7KFA<,,A,F,A<FAA,, XA:Z:2,-117731944,14S36M100S,1;4,+75904852,111S26M13S,0; MC:Z:150M MD:Z:29G2T3G9A5 PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.4 NM:i:4 MQ:i:60 AS:i:32 XS:i:31
filename.cram:21937611 99 6 29086353 60 119M31S = 29086353 119 AGACATATATGTTGTCTTGGGTGGCCACAACATAGTTAGATTTTGCACTGCAGTCCTCCAGATCTAGGGCTTATCTTTTGAGGAAAGTATACTTGCTCTGAAAGAAACATGTAGGTAGCAGATCGGAAGAGCGTCGTGTAGGGAAAGAGT AAFFFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFKKAKAFKKKKKKKKKA MC:Z:31S119M MD:Z:119 PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.7 NM:i:0 MQ:i:60 AS:i:119 XS:i:20
filename.cram:21937611 147 6 29086353 60 31S119M = 29086353 -119 ACTGGAGTTCAGACGTGTGCTCTTCCGATCTAGACATATATGTTGTCTTGGGTGGCCACAACATAGTTAGATTTTGCACTGCAGTCCTCCAGATCTAGGGCTTATCTTTTGAGGAAAGTATACTTGCTCTGAAAGAAACATGTAGGTAGC KKKKKKKKKKKKKKKKKKKKKKKFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFFFAA MC:Z:119M31S MD:Z:119 PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.7 NM:i:0 MQ:i:60 AS:i:119 XS:i:20
filename.cram:21937611 163 6 29115837 60 150M = 29116033 346 GTAACCCCTTTTACTACCTACCCCAGAGTTTCTTTTATTATTAACAACTTGCATTCATGTGGTACATTTGTTATAATTGATAAGCCAATATTAATACATTATTAGTAACCAAATTCCATAGTTTACATTAGGGTTGATGGTGTGTGTTTT AAFFFAKKKKKFKKA,F<F,A,AFKKKKAFKAKKKK7KKKKKKFKF<<AFK<AFK,AAKKKKKKKKKKKKKFKKKKFAAKK<FFF,7<FKKKFKKF,AFAKKK7AKKF<<,A7,<A<FA7<,AK77,<FAFAF<AFKKKKKKKKKKAKKF MC:Z:150M MD:Z:15C3C130 PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.8 NM:i:2 MQ:i:60 AS:i:140 XS:i:34
filename.cram:21937611 83 6 29116033 60 150M = 29115837 -346 TTCAGTACCATAAAGAATGGTTTCACTGCCTTAAAAATTCCCTGTTCTCCTTCCATTCATCATTTCCTCCCCTCCTCCCCGGGAGCCCCTGACAACCACTGATTTTTTATTGTTTCCATAACTGTGCTTTTTCCAGAATATCATACAATT <,,,7<,,FKAA,,,KFF<,AFA7FF7KKKAFF,7KAF,AKKFFF,KKF,FFKF7F,KKKF,KAF,FFFA,FA(F(KFA77A7<(FFKKFAFFF<FKAKK,AKKKKKKKKKKKKFK<,AFFFKKKKKKKFF,KKKKKKAKFKKFKFFAAA MC:Z:150M MD:Z:7T142 PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.8 NM:i:1 MQ:i:60 AS:i:145 XS:i:23
filename.cram:21937611 99 6 29144947 40 150M = 29145234 437 NACACTGACTTCCACAATGGTTGAACTAGTTTACAGTCACACCAACAGTGTAAAAGTGTTCCTATTTCTCCACATCCTGTCCAGCACCTGTTGTTTCGTGACTTTTTAATGATTGCCATTCTAACTGGTGTGAGGTGATATCTCCTTGTG #AAAFF<AFKKKKAKKF7FKAKFK<FK,FKKKKKKFFA,7F,FKKKFKKF,,<F77FKKKKKKFFFKKKK,AFFKKKAKKK7FKKKKFF<KFKK7<<7<<7AFKKKKFF,,7F,<,A<,,,7,,,7,,,,,,,,A7A,,,,,,,,,,,,, MC:Z:150M MD:Z:0C37C58C39G6A5PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.5 NM:i:5 MQ:i:60 AS:i:129 XS:i:128
filename.cram:21937611 147 6 29145234 60 150M = 29144947 -437 CTTGTAAATTTGTTTGAGTTCATTGTAGATTCTGGATATTAGCCCTTTGTAAGATGAGTAGCTTGCAAAAATTTTCTCCCATTCTGTAGGTTGCCTATTCACTCTGATGGTAGTTTCTTTTGCTGTGCAGAAACTCTTTAGTTTAATTAG FF7<<777FKKKFKKA,7A7,,,<A7A,FFKKKKKKFF<AFKKKFFAF7F7A,F,,KKFA7AFF,,FF7FF,,FKKFKKKKKKKKKKKKKFFFAKKKKKAA7<AKKKAFFKFFKKKKKKFKAKKKKAFAAKKKKKKKKKFKKKKKF<F<< MC:Z:150M MD:Z:50C99 PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.5 NM:i:1 MQ:i:40 AS:i:145 XS:i:135
filename.cram:21937611 163 6 29174463 60 150M = 29174749 436 TGTATTGAGATATAATTTATTTACAATACAGTTCACCCATTTAAAGTGTACAAGTCAATGATTTTTGGTATATTTCATTACTAATTTATAACATTGTGGTAATATATAACATAAAATTTGCCATTTTAACTATTTTTAAGTGTACAATTT AAFFFKKKKKKKKKKKKKKKKKKKKKFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFAAFKKKKKKKKKKKKKKFKAFFKKAAFFAFKKKKFKKK7,A,AA,AAFK< MC:Z:150M MD:Z:150 PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.6 NM:i:0 MQ:i:60 AS:i:150 XS:i:26
filename.cram:21937611 83 6 29174749 60 150M = 29174463 -436 CTGCCTTATGTCTGTATGAATTTGCATGTTCCAGATATTTCATATTATTGGAAACATACAATTTTTGCCCTTTTAGGTCTGGCTTATTCATTTAGCATAAGATGTACATATTGCTTTATAGCCTGCTTTTTCACTTAGCAGCATATTGTG ,,,,,<A,<<7<A,AFFA<,<FA,,<A<AFFFA<,F<KKKAAKKKKKKFFKKKKKKKKKKFKKKKA7AFAAA<F,FFKKF<7FFFKKKKFF7KKKKKKKKKKKKKKKFKKKKKKKKKKKFKKKKKKKKKKKKFKKKKKKKKKKKKFFFAA MC:Z:150M MD:Z:2A147 PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.6 NM:i:1 MQ:i:60 AS:i:147 XS:i:24
filename.cram:21937611 99 6 29202989 60 150M = 29203110 271 CATGAAATATTTAAAGAGTTTTATTCTGAGCCAAATGTGAGTAATTGACGGCCTGAGGCGCAGTCTCAAGACATCCTGAGAACAAGTGCCCAAGGTGATTGAATTACAACTTGATTTACATTTTAGGAGCATGTAAAAAATCAGTCAATA AAFFFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKA<FKKKKKKKKKKKKKKKKKKKAFAFKKKKKKKKKFFKKFKFKFKKFKKKAK MC:Z:150M MD:Z:150 PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.6 NM:i:0 MQ:i:60 AS:i:150 XS:i:30
filename.cram:21937611 147 6 29203110 60 150M = 29202989 -271 TTTAGGAGCATGTAAAAAATCAGTCAATACATGTGGGGTCTGTGTTGGTTCAGTCATGAAAGGTGGAACAACTCAAAGTGGGGCCTTCCACATCAAAGGTAAATTCAAAGATTTTCTTATTGGCAATTGGTTGAAAAACTTGTTACTATT FKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFFFAA MC:Z:150M MD:Z:150 PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.6 NM:i:0 MQ:i:60 AS:i:150 XS:i:26
filename.cram:21937611 163 6 29292208 60 150M = 29292473 415 AAGAAAAAATATTAATAGAGCTATTTAACCTGGTTTGTTGAATGTGATAATGATTACAGAAAACATTGACTAATGCTGCTTATAATAATGAAAAACTGGAAAAGTGACTGGTATACAATAGGCAATCAATAAATATTTGTTAAATAAATA AAAFAKAKFFKKKKKKKKFKKKFFAAKAKKFKK7FKFFKKKKAKAKKKK<FKKAKFFKFFFFFKKKK7KKKFKFFFKKKAAFFFF<KAKFK7FA<FA7FAAFAKKFFKKFKKKKKAK7<FFKKKK<AAFAAFFFFF7,,A7,<AA<,AF7 MC:Z:150M MD:Z:150 PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.5 NM:i:0 MQ:i:60 AS:i:150 XS:i:20
filename.cram:21937611 83 6 29292473 60 150M = 29292208 -415 TATTAAAAATACTAAAATGTGTATACAAGATGGTGAGACAACTCCAAAATATAAAACAACATTATGGACTTACCTTTATTTCTAAGTACTTACAAAGTTGTTGCAAAAATTTTAATATACATATTAATGGATTTAGCTATCATATGAGAT FA7A,77<A,<<AA,A<A<F<<7<FA,7A,,,A<7AF7<,7KKKKKKKF7FAFKA7FKKKKFFKFF7AKFF,<FFFKKKKKKKKKKKKFKKKKKFKKKFKFFKKKKFAKKFFAK<7AFAKKKKKKKKKKKKKKKFAAFAKKKKKKFFFA, MC:Z:150M MD:Z:150 PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.5 NM:i:0 MQ:i:60 AS:i:150 XS:i:20
And read group information (one sample, one library, 8 flow cells):
@RG ID:AHH7F7CCXX.sample_name.1 PU:AHH7F7CCXX.sample_name.1 LB:SX441_sample_name.v1 SM:sample_name CN:U PL:Illumina
@RG ID:AHH7F7CCXX.sample_name.2 PU:AHH7F7CCXX.sample_name.2 LB:SX441_sample_name.v1 SM:sample_name CN:U PL:Illumina
@RG ID:AHH7F7CCXX.sample_name.3 PU:AHH7F7CCXX.sample_name.3 LB:SX441_sample_name.v1 SM:sample_name CN:U PL:Illumina
@RG ID:AHH7F7CCXX.sample_name.4 PU:AHH7F7CCXX.sample_name.4 LB:SX441_sample_name.v1 SM:sample_name CN:U PL:Illumina
@RG ID:AHH7F7CCXX.sample_name.5 PU:AHH7F7CCXX.sample_name.5 LB:SX441_sample_name.v1 SM:sample_name CN:U PL:Illumina
@RG ID:AHH7F7CCXX.sample_name.6 PU:AHH7F7CCXX.sample_name.6 LB:SX441_sample_name.v1 SM:sample_name CN:U PL:Illumina
@RG ID:AHH7F7CCXX.sample_name.7 PU:AHH7F7CCXX.sample_name.7 LB:SX441_sample_name.v1 SM:sample_name CN:U PL:Illumina
@RG ID:AHH7F7CCXX.sample_name.8 PU:AHH7F7CCXX.sample_name.8 LB:SX441_sample_name.v1 SM:sample_name CN:U PL:Illumina
Is this part not working? Or the names you show above turn out to be identical?
New names have been automatically generated during decoding, but the names are identical for many reads. To give one example: After converting a 5 mb region from a CRAM file into a BAM file, I have 2,797,216 reads in the BAM file. However, in this BAM file, I only have 177,983 unique read names. Most of the read names have been converted into a format such as
filename.cram:21937611
, and multiple read pairs are assigned identical read names.Sounds like only way would be to write some code to rename these so they become unique. Scan this file and each time you see a new
filename.cram:NNNNNN
start numbering themfilename.cram:NNNNNN_1
,filename.cram:NNNNNN_2
until you hit next number.What is
Also see the name_prefix option.
? Does that allow you to make the read names unique?That would of course be a possibility, but then each read would become a single-end read, and I would of course like to retain the paired-end read (i.e. one uniqe read name for a pair of reads). I wonder if it is possible to match paired reads from the other information in the alignment section of the BAM/CRAM file? Column 4 has information about a the current read and column 7-8 the position of the paired read. On top of that, the RG information in column ~15 will give information about the lane, so to some extent I should be able to identify paired reads and give them the same name. But for reads mapping to multiple positions, split reads etc., this may become very complex.
Absolutely. This is going to be a big task and there is nothing off-the-shelf you can use.
I am surprised that CRAM group did not think of someone like you wanting to convert the data back to fastq.