I want to save my fastq files as CRAM files to save space. However, I do want to be able to get the reads back that I have in my fastq, in case I have to reallign in a different way in the future. Below I have the code that I use to do the conversions to BAM and back again. However, when I then check the number of lines in the original fastq with the converted fastq, the original is much smaller:
original: 17460203 (lines)
converted: 147610328 (lines)
The flagstat of the BAM file gives 71038660 + 2766504 in total (QC-passed reads + QC-failed reads) reads. I don't understand why the newly made has more reads, unless the conversion from BAM to fastq also adds reads that map on multiple locations as multiple reads. Should I remove all duplicated reads and then convert?
# align with STAR
STAR --readFilesIn fastq1.fq fastq2.fq \
--genomeDir Gencode_human/ \
--outFileNamePrefix example.
# fix mates with picard
java -jar $EBROOTPICARD/FixMateInformation.jar \
INPUT=example.bam \
OUTPUT=example.fixmates.bam \
VALIDATION_STRINGENCY=LENIENT \
CREATE_INDEX=true \
SORT_ORDER=coordinate
# convert to cram with scramble from io_lib
scramble \
-I bam \
-O cram \
-r Gencode_human/GRCh38.p5.genome.fa \
example.fixmates.bam \
example.cram
# convert back to bam
scramble \
-I cram \
-O bam \
-m \
-r Gencode_human/GRCh38.p5.genome.fa \
example.cram \
example.bam
#sort by name
samtools sort \
-n \
-o example.sorted.bam \
example.bam
#convert to fastq
samtools fastq \
-@ 4 \
-1 converted.fastq1.fq \
-2 converted.fastq2.fq \
example.sorted.bam
I like my solution above because I can be (quite) sure that my reads aren't processed in an unexpected way.
Nevertheless AFAIK the compression algorithm for CRAM depends on the given reference and the alignment information. As all of our reads are unmapped the compression isn't optimal.
What you could try is to filter out secondary aligned reads when converting to
cram
:I haven't test it. So I'm not sure whether there is an unexpected behavior.
fin swimmer
Thanks, I'm testing this now and seeing how it compares to removing secondary aligned reads
Fine, please report your result then.
I made a stupid mistake, was counting lines of the gzipped fastq file and comparing to an unzipped fastq file. After fixing that there was still a small difference, as STAR by default doesn't put unmapped reads in the output file, so had to --outSAMunmapped Within option.
After that I got the correct number of lines both with the method I posted above as well as your method. The CRAM file after alignment is 1/2 the size of the CRAM file from the unalinged BAM.
Hello,
Apologies but what does the -@4 means in your last line please?
Many thanks
The -@ 4 option means to use 4 additional processing threads when decoding the CRAM file. It won't change the result, just speed it up on multi-core systems.
Thank you very much !
You could add
SORT_ORDER=unsorted
to the picard command since fastq files (if they come right from the sequencer) are already name-sorted, and by default picard would sort it again. Picard can also write tostdout
if you useO=/dev/stdout
so one could pipe it directly tosamtools
.