I would like to threat the forward and reverse fragments as independent reads in a BWA alignment.
I'm using bwa aln in order to generate data consistent with those generated years earlier by the lab
Given that I have file_R1 and file_R2 with matched forward and reverse fragments,
bwa aln ref.fasta file_R1.fastq > file_R1.sai
bwa aln ref.fasta file_R2.fastq > file_R2.sai
bwa sampe - P ref.fasta file_R1.sai file_R2.sai file_R1.fastq file_R2_fastq > file.sam
will give a mapping that uses R1 and R2 as matched fragments. Given that I have the R1 and R2 fragments in separate files, is there a straightforward way to treat the reads as independent with bwa aln?
yes, I also wondered what is this about, especially since using
bwa aln
is so outdated.the poster should use
bwa mem
and notbwa aln
even if they are trying to reproduce a previous finding,there would be very few valid reasons for using
bwa aln
(for example checking the discrepancy with the older method)The only reason I'm using bwa aln is that I'm in a lab that generated a lot of sequence data over the past several years using a bwa aln pipeline. If I now use bwa mem, comparing recent sequence data to processed data that was generated 6-7 years ago would lead to inconsistencies. It's either continue to use bwa aln for what limited raw data we still need to process or go back and rerun everything using bwa mem.
Yes, this is correct.
One question I have is this: if I create two sam files based on alignments using R1 and R2 and then use samtools to merge them, will there be any problems downstream due to the fact that the matched reads from R1 and R2 have the same heading? In other words, is it necessary for me to change the read headings in R2 before performing the alignment and merging?
you probably should rename the reads, just in case
but I will say treating a paired end sequencing run as if it were twice as abundant single end run may introduce biases, it is important to take into account just what exactly is being measured
Two questions: first, what are the alternatives to seqkit (other than writing a bash script usign sed or a python script) to rename the reads?
I'm not sure I understand the syntax for seqkit, either for your example or the (limited) documentation provided in its website, i.e. what does it even take as input arguments?
Second, I generated two bam files, one from R1, the other from R2. When I try to merge them using samtools merge, I get a "No @HD tag found" error. Perhaps a header tag is automatically generated by bwa sampe for paired end data but isn't for bwa samse? In any case, what do I need to do to correct this error so that I may merge the two bam files?
I see that your seqkit question is answered, in general look at the website here:
https://bioinf.shenwei.me/seqkit/usage/
second, the error indicates that your input files are not quite right, check that you are using the tool correctly, and that the outputs are SAM etc
What exactly are you trying to do with this data?
I'm comparing read depth with long vs. short reads
In that case you should simply align the data as you showed in the original post since using paired-end reads would allow them to be properly treated as sequenced fragment. Then calculate the depth using
mosdepth
orsamtools depth
. I don't see the point of using reads individually.