Hi all - I'm basically trying to replicate an alignment procedure from a paper. My data is in an SRR file at www.ncbi.nlm.nih.gov/sra. I've followed the pipeline:
(downloads and unzips human genome file GRch38.fa.gz...)
bwa index -a bwtsw GRch38.fa
fastq-dump --split-3 SRR123
(at this point, I only end up with the file SRR123.fastq... one file, not three.)
bwa aln GRch38.fa SRR123.fastq > SRR123.sai
At this point, the next step is (it seems) to build a SAM file:
bwa sampe <ref> <sai1> <sai2> <fq1> <fq2> > ~/glob/gatk/<sample>.sam
The difficulty is that I only have one fastq file and one .sai file, while "sampe" expects two of each. Reading up on the matter, I think this means I have single-end data - which might simply mean that my data is incompatible with the "sampe" option in BWA. However, the paper I'm following clearly states that we should use the "sampe" option, so that the mate pair information will be correct.
If someone could spare some thoughts about what appears (to them) to be going on here, I'd really appreciate it. I apologize if I'm missing something obvious, but I've done lots and lots of Googling and I'm somewhat stuck-ish.
If you know that this data is paired-end, then you should have used the --split-files option when you ran fastq-dump; that would yield the paired files. Alternatively, you can search EBI-ENA with the accession number. ENA allows you to download the fastq files directly without having to worry about sratoolkit.
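A quick way to see what fastq-dump actually produced is to look at the file names it wrote. The sketch below assumes the usual naming (SRR123_1.fastq and SRR123_2.fastq for the two mates, SRR123.fastq or a lone SRR123_1.fastq for single-end reads); detect_layout is a hypothetical helper, not something that ships with sratoolkit.

```shell
# Sketch: report the layout implied by the FASTQ files found for a run.
# Assumption: mates are named <run>_1.fastq and <run>_2.fastq, while
# single-end output is <run>.fastq (--split-3) or <run>_1.fastq alone
# (--split-files). detect_layout is a hypothetical helper name.
detect_layout() {
    local run="$1" dir="${2:-.}"
    if [ -f "$dir/${run}_1.fastq" ] && [ -f "$dir/${run}_2.fastq" ]; then
        echo "paired"
    elif [ -f "$dir/${run}.fastq" ] || [ -f "$dir/${run}_1.fastq" ]; then
        echo "single"
    else
        echo "missing"
    fi
}

# Example (commented out so the sketch has no side effects):
# fastq-dump --split-files SRR123
# detect_layout SRR123
```

If this reports "single" after running with --split-files, the run most likely contains no second mate at all.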
Thank you. The thing is, I DON'T think this data is paired-end - I'm only using the "sampe" option because they specifically mentioned in the paper that they used it.
I'm pretty sure I tried running fastq-dump using the "--split-3" option, and only got one fastq file. But I'll try with --split-files. (For some reason I'm having trouble accessing EBI-ENA.)
Is it possible that sampe is a typo in the paper?
Did you check on ENA to see whether there are one or two data files? They should be clearly marked on the ENA record.
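If the ENA website is hard to reach, the same check can be scripted. This is only a rough sketch: the two helper functions are hypothetical, and the URL in the usage comment reflects my understanding of the ENA Portal API file report, so treat it as an assumption and check it against the ENA documentation.

```shell
# Sketch only. Both helpers are hypothetical, and the URL in the usage
# comment is an assumption about the ENA Portal API file report.

# Print the fastq_ftp column of a file-report TSV read from stdin,
# locating the column by its header name.
fastq_ftp_from_tsv() {
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "fastq_ftp") col = i }
        NR == 2 && col { print $col }
    '
}

# Count the semicolon-separated entries in a fastq_ftp value.
count_fastq_files() {
    if [ -z "$1" ]; then
        echo 0
        return
    fi
    printf '%s\n' "$1" | tr ';' '\n' | grep -c .
}

# Usage (needs network access, so left commented out):
# url='https://www.ebi.ac.uk/ena/portal/api/filereport?accession=SRR123&result=read_run&fields=run_accession,fastq_ftp'
# count_fastq_files "$(curl -s "$url" | fastq_ftp_from_tsv)"
```

A count of one suggests a single-end run; two (or occasionally three, when leftover unpaired reads get their own file) suggests paired data.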
Thanks for the suggestion. I checked ENA - there's only one file for this run.
I know the distinction between "mate pair" and "paired end" is a common question - but they do definitely mention "mate pair." I'm not sure if this amounts to a claim that their data should have two files, though.
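For what it's worth, BWA has a single-end counterpart to sampe called samse, which takes one .sai and one fastq file. Here is a minimal sketch of picking between the two based on how many fastq files exist. build_sam_cmd is a hypothetical helper that only prints the command, and it assumes each .sai file is named after its fastq file, as in the "bwa aln" step above.

```shell
# Sketch: print (do not run) the BWA command that matches the number of
# FASTQ files supplied. build_sam_cmd is a hypothetical helper, and the
# .sai names are assumed to mirror the FASTQ names.
build_sam_cmd() {
    local ref="$1"
    shift
    case "$#" in
        1) echo "bwa samse $ref ${1%.fastq}.sai $1" ;;
        2) echo "bwa sampe $ref ${1%.fastq}.sai ${2%.fastq}.sai $1 $2" ;;
        *) echo "expected one or two FASTQ files, got $#" >&2; return 1 ;;
    esac
}

build_sam_cmd GRch38.fa SRR123.fastq
# prints: bwa samse GRch38.fa SRR123.sai SRR123.fastq
```

Redirect the output of the printed command to a .sam file, as in the sampe line from the paper. Whether samse reproduces the paper's results depends on whether their reads really were paired, which the single file on ENA calls into question.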