Weird insert size distribution plot
4 months ago
Ankit ▴ 500

Hi everyone,

While analysing my sample, I noticed a strange insert size distribution.

The run is paired-end (PE), 2 x 150 bp.

The pipeline is as follows:

1 UNTRIMMED

Alignment (bwa mem) --> samtools stats --> grep only "IS" rows --> Plot insert size (R)

[Figure: insert size distribution for the untrimmed FASTQ file]

or

2 TRIMMED

Trimming (cutadapt) --> Alignment (bwa mem) --> samtools stats --> grep only "IS" rows --> Plot insert size (R)

[Figure: insert size distribution for the trimmed FASTQ file]
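For anyone who wants to reproduce the "grep only IS rows" step without R, here is a minimal Python sketch. It assumes the usual `samtools stats` layout for `IS` rows (insert size, pairs total, inward oriented pairs, outward oriented pairs, other pairs, tab-separated). The function names `parse_insert_sizes` and `largest_drop` and the counts in `EXAMPLE_LINES` are made up for illustration; they are not taken from the data in this thread.

```python
from typing import Iterable, List, Optional, Tuple

# (insert_size, pairs_total, inward, outward, other)
Row = Tuple[int, int, int, int, int]


def parse_insert_sizes(lines: Iterable[str]) -> List[Row]:
    """Collect the 'IS' rows from samtools stats output, skipping everything else."""
    rows: List[Row] = []
    for line in lines:
        if not line.startswith("IS\t"):
            continue
        fields = line.rstrip("\n").split("\t")
        size, total, inward, outward, other = (int(x) for x in fields[1:6])
        rows.append((size, total, inward, outward, other))
    return rows


def largest_drop(rows: List[Row]) -> Optional[Tuple[int, int]]:
    """Return (insert_size, drop) for the biggest fall in total pairs
    between two consecutive rows, or None if there are fewer than two rows."""
    best: Optional[Tuple[int, int]] = None
    for prev, cur in zip(rows, rows[1:]):
        drop = prev[1] - cur[1]
        if best is None or drop > best[1]:
            best = (cur[0], drop)
    return best


# Invented counts, only to show the shape of the input.
EXAMPLE_LINES = [
    "SN\traw total sequences:\t10",
    "IS\t148\t900\t880\t15\t5",
    "IS\t149\t950\t930\t15\t5",
    "IS\t150\t1000\t980\t15\t5",
    "IS\t151\t400\t390\t8\t2",
    "IS\t152\t420\t410\t8\t2",
]

if __name__ == "__main__":
    rows = parse_insert_sizes(EXAMPLE_LINES)
    print(largest_drop(rows))
```

Keeping the inward and outward columns separate, rather than plotting only the total, makes it easier to see which pair orientation contributes to a step in the curve.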

Could anyone explain why the insert size distribution looks like this?

1. Why is there a sharp drop exactly at 150 bp in both the trimmed and untrimmed data?

2. Why does the gap appear after trimming with cutadapt?

Please suggest. I would appreciate any help.

Thanks

adapter-trimming reads fastq insert-size • 808 views

Re: "Why does the gap appear after trimming with cutadapt?"

Have you checked the reads (or parts of reads) that get trimmed, to get an idea of what is being removed? Obvious candidates are primer dimers, since they should be about 150 bp.

What is the underlying motivation for doing this analysis?


Thank you for the response. Actually, it was just an observation I made while plotting the insert size metrics from the samtools stats output. I Googled what the plot normally looks like, and mine looked different, so I wondered what had gone wrong. I asked the lab, but they say adapter dimers cannot be the issue because they checked the Bioanalyzer result. I became more curious when the untrimmed data (which I checked only after seeing the drop in the trimmed data) also showed the drop. Any idea what could be the cause apart from primer dimers? Could the insert size itself be the issue?


Difficult to say. Check what is being removed and see if you can get a clue from the reads that are dropped or trimmed. I am not a cutadapt user, so I don't know how you would do that (with bbduk.sh I can collect such reads in separate files).
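One tool-agnostic way to "check what is getting removed" is to compare each read before and after trimming and tally the removed tails. The sketch below assumes 3' trimming only (the trimmed read is a prefix of the original) and that reads have already been matched up by name. The helper names `removed_tail` and `summarise_removed` and the toy reads are hypothetical, not output from any trimmer.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def removed_tail(original: str, trimmed: str) -> str:
    """Return the 3' part of `original` that is missing from `trimmed`.

    Assumes 3' trimming only, i.e. `trimmed` must be a prefix of `original`.
    """
    if not original.startswith(trimmed):
        raise ValueError("trimmed read is not a prefix of the original read")
    return original[len(trimmed):]


def summarise_removed(
    pairs: Iterable[Tuple[str, str]], top: int = 5
) -> List[Tuple[str, int]]:
    """Count the most frequent removed tails over (original, trimmed) pairs.

    Reads that were left untouched contribute nothing to the tally.
    """
    tails: Counter = Counter()
    for original, trimmed in pairs:
        tail = removed_tail(original, trimmed)
        if tail:
            tails[tail] += 1
    return tails.most_common(top)


# Toy reads, invented for illustration; the tail is the common
# Illumina adapter prefix AGATCGGAAGAGC.
ADAPTER = "AGATCGGAAGAGC"
TOY_PAIRS = [
    ("ACGTACGT" + ADAPTER, "ACGTACGT"),
    ("TTGACCA" + ADAPTER, "TTGACCA"),
    ("GGGCCCAAATTT", "GGGCCCAAATTT"),  # untrimmed read
]

if __name__ == "__main__":
    print(summarise_removed(TOY_PAIRS))
```

If one sequence dominates the tally, that points at a specific adapter or primer; a flat tally with many different tails would suggest quality trimming or something else.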


Thank you for the suggestion. I am unable to run bbduk.sh because BBMap is not installed on our server. I tried Trimmomatic. Are there any other trimming tools you would suggest? Maybe I can compare different tools.


No installation is required for BBTools. As long as you have Java available, you just need to download and uncompress the distribution.


You were right, no installation is required. Thanks.


You can use cutadapt to demultiplex based on identified trimming sequences.


I think that when I trim with cutadapt and use the -o and -p options to capture the output files, both the trimmed and untrimmed reads end up in the same files, which in turn produces peaks both below and above ~150 bp. With tools that write trimmed and untrimmed reads to separate files, the peak below 150 bp is not observed. This is just an observation from one sample, and it may be more complicated than such a simple explanation. I tried to capture the untrimmed and trimmed reads separately from the cutadapt output, but the commands keep failing with an error.


It seems this might be related to your read length. It may come down to how the aligner handles overlapping reads, since an insert size of ~150 bp is where your two reads could overlap completely. Soft-clipping may affect this behaviour as well.

I did a quick test with my own data and see the same drop-off at my read length, 100 bp. With ATAC-seq data, this depends on whether the pairs are "inward" or "outward" oriented.

[Figure: plots from the quick test; images not preserved]
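To make the read-length argument concrete, here is a small arithmetic sketch. It treats the insert size as the full fragment length spanned by both mates, which is an assumption about how the number is defined, and the function name `mate_geometry` is made up. It only describes the geometry of the pair; it says nothing about how bwa mem or samtools actually report such pairs.

```python
from typing import Dict


def mate_geometry(insert_size: int, read_length: int) -> Dict[str, int]:
    """Describe how two mates of equal length sit on a fragment.

    insert_size is taken to be the full fragment length (assumption).
    """
    if insert_size <= 0 or read_length <= 0:
        raise ValueError("insert_size and read_length must be positive")
    # Bases of each read that fall inside the fragment.
    covered = min(read_length, insert_size)
    # Bases of each read that run past the fragment end into adapter.
    read_through = read_length - covered
    # Bases of the fragment sequenced by both mates.
    overlap = max(0, 2 * covered - insert_size)
    return {"covered": covered, "read_through": read_through, "overlap": overlap}


if __name__ == "__main__":
    for size in (100, 150, 200, 300, 400):
        print(size, mate_geometry(size, 150))
```

With 150 bp reads, a 150 bp fragment is the point where the mates overlap completely, and anything shorter means each read runs into adapter sequence, so pairs on either side of 150 bp go through trimming and alignment differently.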


Hi, thanks for the response.

Yes, it seems like an adapter removal issue. I also ran different tools, including bbduk.sh, fastp, AdapterRemoval v2, Trimmomatic and of course cutadapt. Some of them completely remove the peak below 150 bp.

My data is whole-exome sequencing. For your ATAC-seq data, I'm curious whether the pattern is more related to mono-, di- and tri-nucleosome peaks or to insert size. In the first figure I can see three peaks, which could reflect the nucleosome organisation pattern. What do you think?
