Hi! I'm trying to do the alignment for an ATAC-seq experiment using bowtie2. We did paired-end 100 pb sequencing. By checking with FastQC, of course I have adapter contamination. So I started trimming the reads as follows:
TrimmomaticPE $(DATAPATH)$(FILE1) $(DATAPATH)$(FILE2) $(SCRATCH_JOBDIR)/$(NAME)_output1_paired_trimmed.fastq $(SCRATCH_JOBDIR)/$(NAME)_output1_unpaired_trimmed.fastq $(SCRATCH_JOBDIR)/$(NAME)_output2_paired_trimmed.fastq $(SCRATCH_JOBDIR)/$(NAME)_output2_unpaired_trimmed.fastq ILLUMINACLIP:/home/ecorrales/Programs/Trimmomatic-0.36/adapters/NexteraPE-PE.fa:2:30:10:8:TRUE TRAILING:10 MINLEN:25
Afterwards I don't see any longer adapter contamination, and only few (0.11%) of the reads are removed. Then I proceeded to do the mapping with bowtie2:
bowtie2 -X 2000 --very-sensitive -x $(GENOMEDIR) -1 $(SCRATCH_JOBDIR)/$(NAME)_output1_paired_trimmed.fastq -2 $(SCRATCH_JOBDIR)/$(NAME)_output2_paired_trimmed.fastq -S $(SCRATCH_JOBDIR)/$(NAME)_output.sam
I then checked the insert size distribution and there were no inserts smaller than 100 bp. I assumed that this could be related to those fragments where the insert size was smaller than the read length. So I tried trimming the reads to different lengths before mapping. Indeed all the inserts smaller than the length that I trimmed to were not mapped.
In the plot below I show the insert size distribution after trimming the reads to 50 bp or 75 bp (or 100 bp for no trimming):
I thought that adding the --dovetail parameter to the mapping would do the trick, but nothing changed. Is there any other parameter that I could be missing?
How are you calculating fragment size? You also may need to add
--soft-clipped-unmapped-tlen
in addition to--dovetail
.Thank you for your reply. I tried with
--soft-clipped-unmapped-tlen
but doesn't seem to change. For calculating the fragment size I usedHave you checked what happens to the reads with small insert sizes when you don't trim?
I guess the problem might be indeed be in the insert size calculation rather than on the mapping. I initially thought that the small inserts were somehow considered discordant, although I get the following metrics:
And the insert distribution looks like this (below is log transformed):
However, I accidentally re-run the mapping after cropping the first three bases of the reads using
HEADCROP:3
parameter from Trimmomatic. Afterwards the insert size distribution looked much better (still I see a gap at 100 bp):The mapping efficiency looks very similar, so this is why I guess the mapping was not the problem:
Still I'm a bit confused why this happens or what I'm doing wrong. I don't think is a matter of the quality of the first bases since it looks good, and the mapping efficiency looks similar regardless of removing or not these three bases.