Running BWA on single-end data (i.e. only one fastq file) with "sampe" option
8.4 years ago

Hi all - I'm basically trying to replicate an alignment procedure from a paper. My data is in an SRR file at www.ncbi.nlm.nih.gov/sra. I've followed this pipeline:

(download and unzip the human genome file GRCh38.fa.gz...)
bwa index -a bwtsw GRCh38.fa
fastq-dump --split-3 SRR123
(at this point, I only end up with the file SRR123.fastq... one file, not three.)
bwa aln GRCh38.fa SRR123.fastq > SRR123.sai

At this point, the next step is (it seems) to build a SAM file:
bwa sampe <ref> <sai1> <sai2> <fq1> <fq2> > ~/glob/gatk/<sample>.sam

The difficulty is that I only have one fastq file and one .sai file, while "sampe" expects two of each. Reading up on the matter, I think this means I have single-end data - which might simply mean that my data is incompatible with the "sampe" option in BWA. However, the paper I'm following clearly states that we should use the "sampe" option, so that the mate pair information will be correct.

If someone could spare some thoughts about what appears (to them) to be going on here, I'd really appreciate it. I apologize if I'm missing something obvious, but I've done lots and lots of Googling and I'm somewhat stuck-ish.

alignment • 4.6k views

If you know that this data is paired-end, then you should have used the --split-files option when you ran fastq-dump; that would yield the paired files.

Alternatively you can search at EBI-ENA with the accession number. ENA allows you to download the fastq files directly without having to worry about sratoolkit.


Thank you. The thing is, I DON'T think this data is paired-end - I'm only using the "sampe" option because they specifically mentioned in the paper that they used it.

I'm pretty sure I tried running fastq-dump using the "--split-3" option, and only got one fastq file. But I'll try with --split-files. (For some reason I'm having trouble accessing EBI-ENA.)


Is it possible that sampe is a typo in the paper?

Did you check on ENA to see if there are one or two data files? They should be clearly marked on the ENA record.


Thanks for the suggestion. I checked ENA - there's only one file for this run.

I know the distinction between "mate pair" and "paired end" is a common question - but they do definitely mention "mate pair." I'm not sure if this amounts to a claim that their data should have two files, though.

8.4 years ago

sampe is for sam-paired-end.

Use the other algorithm, samse: sam-single-end.
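
To make the samse/sampe difference concrete, here is a minimal dry-run sketch. It only prints the commands that the BWA backtrack workflow (bwa aln followed by samse or sampe) would need for one or two FASTQ files; it does not run BWA. The function name `bwa_backtrack_cmds` and the file names `ref.fa`, `reads.sai` and `reads.sam` are hypothetical placeholders, not something from the paper or the thread.

```shell
#!/bin/sh
# Dry-run sketch: print the BWA backtrack commands for one or two FASTQ files.
# bwa_backtrack_cmds and the output file names are hypothetical placeholders.
bwa_backtrack_cmds() {
    ref=$1; shift
    case $# in
        1)  # single-end: one aln call, then samse
            echo "bwa aln $ref $1 > reads.sai"
            echo "bwa samse $ref reads.sai $1 > reads.sam"
            ;;
        2)  # paired-end: one aln call per mate, then sampe
            echo "bwa aln $ref $1 > reads_1.sai"
            echo "bwa aln $ref $2 > reads_2.sai"
            echo "bwa sampe $ref reads_1.sai reads_2.sai $1 $2 > reads.sam"
            ;;
        *)
            echo "expected one or two FASTQ files, got $#" >&2
            return 1
            ;;
    esac
}

# With a single FASTQ file, the samse commands are printed.
bwa_backtrack_cmds ref.fa SRR123.fastq
```

With one .sai file and one FASTQ file, samse is the only variant whose arguments can all be supplied; sampe needs a second .sai and a second FASTQ file.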


Thank you. I don't think this data IS paired-end - I'm only using the "sampe" option because they specifically mentioned in the paper that they used it.

I've tried using the "--split-3" option, and only got one fastq file. But I'll try with --split-files and see if I still get just one fastq file.


If the data isn't paired-end and they said they used sampe, then they're full of it. It'd be nice to know what dataset this is.


So, it's not paired-end - I tried a fastq-dump with the "--split-files" option, and just got the file SRR123_1.fastq.

Devon Ryan, I'm really sorry, but I don't feel comfortable identifying the dataset (especially since there are others in the research group), though I can certainly sympathize with your curiosity. But the paper most definitely claims to have used the "sampe" option.

Thank you to everyone for your help.
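
The check described above (run fastq-dump with --split-files and see which files appear) can be sketched as a small helper. This is only an illustration: `detect_layout` is a hypothetical function, not part of sratoolkit, and it assumes the default output names seen in this thread (SRR123_1.fastq, SRR123_2.fastq, or SRR123.fastq from --split-3). Runs that contain extra technical reads may produce additional files that this sketch ignores.

```shell
#!/bin/sh
# Sketch: infer the library layout from the files that fastq-dump wrote.
# detect_layout is a hypothetical helper, not part of sratoolkit.
detect_layout() {
    acc=$1        # run accession, e.g. SRR123
    dir=${2:-.}   # directory holding the FASTQ files (default: current directory)
    if [ -f "$dir/${acc}_1.fastq" ] && [ -f "$dir/${acc}_2.fastq" ]; then
        echo paired   # both mates present: sampe applies
    elif [ -f "$dir/${acc}_1.fastq" ] || [ -f "$dir/${acc}.fastq" ]; then
        echo single   # only one read file: samse applies
    else
        echo unknown  # no FASTQ files found for this accession
    fi
}

detect_layout SRR123
```

A result of "single" matches what was seen here: only SRR123_1.fastq, so samse rather than sampe.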
