I downloaded a file of Illumina paired reads from SRA. When split into _1 and _2 using the sratools fastq_dump --split-files, the fastq record IDs looks like this (I'm showing just the identifier lines of the first record in each file)
_1.fastq
@SRR4734558.1 HWI-ST1117:138:C1HR1ACXX:5:1101:1451:1979 length=100
_2.fastq
@SRR4734558.1 HWI-ST1117:138:C1HR1ACXX:5:1101:1451:1979 length=100
i.e., they're exactly the same in files _1 and _2. BWA-mem (v 0.7.15) is giving me error messages with these files, saying it can't find any FR pairs (and soon after, a core dump). This seems to be because there isn't any indication of '1' and '2' in the IDs. I tried adding '1:' and '2:' before the 'HWI' (using sed)
_1.fastq
@SRR4734558.1 1:HWI-ST1117:138:C1HR1ACXX:5:1101:1451:1979 length=100
_2.fastq
@SRR4734558.1 2:HWI-ST1117:138:C1HR1ACXX:5:1101:1451:1979 length=100
but BWA still didn't find FR pairs.
I also tried reducing the ID to a string with no space, and adding /1 and /2 to the end of the ID lines
_1.fastaq
@HWI-ST1117:138:C1HR1ACXX:5:1101:1451:1979/1
_2.fastq
@HWI-ST1117:138:C1HR1ACXX:5:1101:1451:1979/2
But BWA still does not see them as a FR pair. (It only sees reads as FF and RR in all of these cases)
So, what is the proper way to indicate '1' and '2' so that BWA still them as FR? Or is there something wrong with this version of BWA? (I am requesting our HPC IT to install the latest) . NB I confirmed that there are no duplicate IDs within either file.
I've tried (almost) everything you suggested (I reduced the fastq sizes down to 20000 lines, i.e., just the first 5000 fastq records, just to make testing faster). Nothing was recognized as FR. Here for example are three different ID header versions of the first record. The first one is the header generated by fastq-dump -F. The second and third versions are created from that by BBmap's reformat.sh by manipulating addslash and spaceslash parameters (There was no way I could find to generate "1:N" format.)
here is my typical command line
I'm using the latest version of BWA (0.7.17)
run output regardless of ID formatting:
(I presume the 10,000 reads refers to the 5000 * 2?)
An output.sam file was indeed created, it's 4MB in size, beyond that I don't know whether it's 'correct' or not.
I confess I don't understand what BWA is looking for re: 'candidate unique pairs'.
You missed this
reformat.sh
option :Have you checked to make sure those reads are in sync in R1/R2 files?
Also try downloading the fastq files directly from ENA to see if they don't have this problem.
Note: Just to be sure. You are downloading a public dataset to align against your assembly of same (or similar genome)?
I also tried a completely different pair of fastq files, generated by our sequencing core (not downloaded), with headers like this
run:
So I'm stumped.
Please use
ADD REPLY/ADD COMMENT
to keep threads logically organized.SUBMIT ANSWER
is for new answers to original question.Since you have BBMap suite downloaded I am going to suggest that you try
bbmap.sh
the aligner (http://seqanswers.com/forums/showthread.php?t=41057 ).