I tried to use BWA MEM to map reads from an interleaved FASTQ.
fastq="all.fastq"
fasta="/share/PI/apps/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa"
bwa="/share/PI/apps/bcbio/anaconda/bin/bwa"
nThreads="12"
#Run BWA MEM
#IMPORTANT: NEED -p since "$fastq" is an interleaved fastq
readGroup="@RG\tID:CHM1\tSM:CHM1\tPL:Illumina"
sam="CHM1.sam"
"$bwa" mem -R "$readGroup" -t "$nThreads" -p "$fasta" "$fastq" -o "$sam"
(The FASTQs are CHM1; I used prefetch
to fetch .sra
files from three different runs from NCBI, then used fastq-dump
to convert the SRAs to FASTQs, then cat
ed them all together into one FASTQ.)
The SAM is 515Gb but has no obvious problems. samtools quickcheck
says it's valid. But when I run GATK4's FixMateInformation
or ValidateSamFile
, I get output like this
ERROR: Record 1, Read name ######################################################################################################################################################################################################, Zero-length read without FZ, CS or CQ tag
WARNING: Record 1, Read name ######################################################################################################################################################################################################, QUAL field is set to * (unspecified quality scores), this is allowed by the SAM specification but many tools expect reads to include qualities
ERROR: Record 421522661, Read name ######################################################################################################################################################################################################, Zero-length read without FZ, CS or CQ tag
There may be even more errors, but this is what I got after two hours.
I can, in fact, see that the first line in the SAM file is
###################################################################################################################################################################################################### 4 * 0 0 * * 0 0 * * AS:i:0 XS:i:0 RG:Z:CHM1
Is this SAM really invalid? Or is there something I need to do so GATK4 will accept it?
Your sam file is invalid.
How was your
fastq-dump
command-line?How large are the concatenated fastqs? What is the output of:
all.fastq
is 342G. For each of three.sra
files, I didfastq-dump SRR1514950.sra > SRR1514950.fastq
head all.fastq
:tail all.fastq
Your fastq is corrupted, it should start with a header (like
@SRR1514950.2 131213_SN711_0409_AC32B3ACXX:6:1101:1191:2090:0 length=202
), but it starts with:And the first sequence is missing its header, which should be
@SRR1514950.1 131213_SN711_0409_AC32B3ACXX:6:1101:1206:2058:0 length=202
.Did you use
nohup
withfastq-dump
?Huh, you're right. I guess I'm somehow using
fastq-dump
wrong. All my FASTQs start like that.head *.fastq
:Do you think there's a way to salvage the SAM I have? Or do I need to fix the FASTQs and then realign them?
I would play it safe and fix the fastq and re-aling them. You don't know how the broken headers altered bwa interpretation of the file.