weird insert size post trimming
4
3
Entering edit mode
7.3 years ago
badribio ▴ 290

I am analyzing rna seq data from illumina stranded protocol 126bp -PE , sequences have nextera adapters using cutadapt I trimmed all the sequences but now I have reads with different lengths, also my insert size form picard has weird peaks, find below has any one experienced this before? enter image description here

EDIT

Apologies I did little more digging around the pipeline and found out that flexbar was used to remove adapters

and the command used was

flexbar --adapters adapters.file --adapter-trim-end RIGHT --length-dist --threads 12 --adapter-min-overlap 7 --max-uncalled 250 --min-read-length 25

adapter used

Read_1_Sequencing_Primer_3_to_5 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT Read_2_Sequencing_Primer_3_to_5 AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC

Read1

enter image description here

Read2enter image description here

Using leeHom tool.

enter image description here

log file leeHom

Total reads :231670526  100%
Merged (trimming) 22238730      9.59929%
Merged (overlap) 0      0%
Kept PE/SR 93263449     40.2569%
Trimmed SR 0    0%
Adapter dimers/chimeras 533995  0.230498%
Failed Key 0    0%
RNA-Seq • 7.3k views
ADD COMMENT
0
Entering edit mode

Please give the full command for cutadapt. I saw this kind of odd shape before, but it turned out that adapters were improperly trimmed. What are the Nextera sequences that you trimmed for?

ADD REPLY
0
Entering edit mode

Agreed, this graph indicates 2 things:

1) incomplete adapter trimming 2) incorrect insert size calculation

The gap is present because reads with only a few (say, under 10 or so) adapter bases are not getting trimmed. Then when they map and appear to overlap the insert size is not calculated correctly. I suggest using a different method for both trimming and insert-size calculation.

ADD REPLY
0
Entering edit mode

updated post, with more details

ADD REPLY
0
Entering edit mode

Also be sure that the adapter file is correct. CTGTCTCTTATACACATCTCCGAGCCCACGAGAC is what you have to trim for in case that the Nextera adapter was used for library prep (on both strands, same adapter sequence).

ADD REPLY
0
Entering edit mode

I have edited my post, looks like i have two diffrent adapters and not the same one

ADD REPLY
0
Entering edit mode

They have been trimmed using these adapters and flux bar command

ADD REPLY
0
Entering edit mode

Any chance you might post part of the data somewhere on an ftp or dropbox?

ADD REPLY
0
Entering edit mode

Sorry I wish I could..

ADD REPLY
0
Entering edit mode

could you try to run leeHom on it and post the plot for size of the insert?

ADD REPLY
0
Entering edit mode

I am not sure what leeHOM means? is that a tool?

ADD REPLY
0
Entering edit mode

leeHom is a read merging tool. Insert sizes can be calculated by mapping or merging.

ADD REPLY
0
Entering edit mode

it trims and merges using a Bayesian algorithm. Could you try it and plot the results?

ADD REPLY
0
Entering edit mode

Well installing it on our cluster would be pain... any other way out?

ADD REPLY
0
Entering edit mode

running it on your local account?

ADD REPLY
0
Entering edit mode

looks like its not an easy tool to install, not able to install it on cluster or on my local account make: *** [leeHom.o] Error 1

ADD REPLY
1
Entering edit mode

I just ran:

git clone --recursive https://github.com/grenaud/leeHom.git
cd leeHom/
cd bamtools/
mkdir build/ && cd build/ && cmake .. && make -j 5
cd ../..
make
src/leeHomMulti

the last command ran fine.

ADD REPLY
1
Entering edit mode

wow...I have no idea why when I pulled git repository this did not work! now it works...

ADD REPLY
0
Entering edit mode

could you give it a try and redo the plot you showed?

ADD REPLY
0
Entering edit mode

sure i will , would like to double check what I used , command please let me know if i need to add something else src/leeHomMulti -t 10 -f AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -s AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -fq1sample_R1.fq -fq2 sample_R2.fq --log sample_log.txt

ADD REPLY
1
Entering edit mode

use -fqo [put output file prefix],

I suggest you run once with --ancientdna and once without, just for fun.

ADD REPLY
0
Entering edit mode

I have uploaded the figure. What is leeHom capturing that other tools are not? could you please help me understand

ADD REPLY
0
Entering edit mode

but what about short ones? did you truncate the figure? like 100bp.

ADD REPLY
0
Entering edit mode

No i did not truncate any reads, just used the command i posted > new fq reads> hisat> picard insert size metrics

ADD REPLY
0
Entering edit mode

did you map the single end and paired end? there should be a .fq.gz file. https://github.com/grenaud/leeHom#paired-end-mode

ADD REPLY
0
Entering edit mode

I just mapped paired end.._r1.fq.gz and _r2.fq.gz

ADD REPLY
0
Entering edit mode

could you map the .fq.gz as single end and merge the BAM files?

ADD REPLY
1
Entering edit mode

Yes will upload in sometime..

ADD REPLY
0
Entering edit mode

because the reads with less than 126 bp are found there.

ADD REPLY
0
Entering edit mode

I don't think picard can handle single end reads for insert size for plots

ADD REPLY
0
Entering edit mode

how about:

samtools view mapped.bam |awk 'function abs(x){return ((x < 0.0) ? -x : x)} {if(and($2,5)==0){print length($10);}else{ if(and($2,66)==66){ print abs($9); }} }'
ADD REPLY
0
Entering edit mode

awk: line 2: function and never defined awk: line 2: function and never defined

ADD REPLY
0
Entering edit mode

get pysam and run: python insertSize.py in.bam

#!/usr/bin/python

import pysam
import sys; 


bamInputFile  = pysam.Samfile(sys.argv[1], "rb");


for read in bamInputFile:
    if read.is_unmapped :
        continue;
    if read.is_paired :
        if read.is_proper_pair and read.is_read1 :
            print abs(read.template_length);       
    else:
        print len(read.query_alignment_sequence);

bamInputFile.close();
ADD REPLY
0
Entering edit mode

would you like me to email it, its a pretty huge file.

ADD REPLY
0
Entering edit mode

sure my gmail: gabriel [dot here] reno

ADD REPLY
2
Entering edit mode
7.2 years ago
Gabriel R. ★ 2.9k

If you try leeHom (https://grenaud.github.io/leeHom/), it can trim adapters and merge overlapping using a Bayesian algorithm so no cutoffs are required. It will produce 3 types of files to map:

  1. [PREFIX].fq.gz Sequences that were trimmed and merged confidently by leeHom
  2. [PREFIX]_r1.fq.gz Forward reads that were neither trimmed nor merged confidently by leeHom
  3. [PREFIX]_r2.fq.gz Reverse reads that were neither trimmed nor merged confidently by leeHo

Map the .fq.gz as single reads and r1/r2 as pairs. To plot the insert size, get pysam and run the following script as such: python insertSize.py in.bam

#!/usr/bin/python

import pysam
import sys; 


bamInputFile  = pysam.Samfile(sys.argv[1], "rb");


for read in bamInputFile:
    if read.is_unmapped :
        continue;
    if read.is_paired :
        if read.is_proper_pair and read.is_read1 :
            print abs(read.template_length);       
    else:
        print len(read.query_alignment_sequence);

bamInputFile.close();
ADD COMMENT
1
Entering edit mode
7.3 years ago
Len Trigg ★ 1.6k

My guess is that you have read-through occurring, so that while the adaptors are being trimmed off one read, the other read in the pair is reading through into the adaptor. When only a small amount of read-through has occurred (i.e fragment length just under 129bp), this is not being identified by the trimming tool, and so during alignment, the aligner will put some mismatches at the end of the alignments. When enough read through has occurred, this is either being trimmed, or the aligner is soft-clipping the mismatching end of the read. Try using a trimming tool that identifies read-through.

ADD COMMENT
0
Entering edit mode

any tools that you would reckon for this?

ADD REPLY
0
Entering edit mode

You can try AfterQC, it trims reads based on overlapping analysis and may address your problem.

ADD REPLY
0
Entering edit mode
7.3 years ago
chen ★ 2.5k

This looks like a bug.

Since your data is PE, you can use AfterQC to trim your adapters, which is based on a different algorithm and doesn't require you to input adapter sequences.

AferQC is available at: https://github.com/OpenGene/AfterQC

ADD COMMENT
0
Entering edit mode
7.3 years ago
ATpoint 85k

Ok, it seems that you have the TruSeq Ribo adapter. You can doublecheck with this file from Illumina, containing the common adapters. In any case, it is not Nextera. Retrim with the proper adapter sequence and the issue should be solved.

ADD COMMENT

Login before adding your answer.

Traffic: 1928 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6