I know this topic has come up many times before but I wanted to get feedback on what people are doing now and to consolidate things into one post.
Originally I was using allprep which is a simple python script but I found that I was still getting some adapter contamination afterwords.
I then found out about tagdust which did a great job of removing the adapters but split things into two files: a 'clean' file and an 'artifact' file which leads me to my 1st set of questions: Do people usually work off the 'clean' file or mask the adapter in the 'artifact' file and merge it back in? I am using 50bp reads so even with the adapters removed I should still have quite a bit of sequence left. If I mask the adapters and merge the files, should I set a length cutoff in case a read is too short for some reason?
Next I have to de-barcode the files, I've heard many good things about fastx toolkit since it can match barcodes up to one mismatch but for pair-end reads the read-pairing is off afterwards (I actually tried to put it into BWA and it gave many "fail to infer insert size" messages. brentp has code posted to fix this and I also found another program to re-pair but seems overly complicated so now I am slowly reading the code to figure out the differences between the two methods.
For future reference, I found a couple more discussions on barcodes/adapters here here here
While we are on the subject of preprocessing, I noticed that fastx toolkit has a remove duplicates feature which I thought might be nice to do before the alignment process since after aligning I would run markDuplicates anyways. I found some code (thanks again brentp) to output fastq instead of fasta but ideally you would want to only remove the read if both fwd and reverse reads were identical right? since that way the whole fragment would be identical (PCR duplicate)
I was also wondering if it was appropriate to do the quality filtering at this step using something like this or this since I usually use a -q20 option in bwa. I am currently using reads generated from Illumina 1.5 pipeline and I run bwa with -I command, does anyone know if the -q20 or -I command would be sufficient to remove the poor quality reads mentioned here (biostar Q)?
Here is the final bwa lines
bwa aln -t 4 -q 20 -I -B 6 ../human-assembly/human_g1k_v37.fasta ./run1_0h-1.fq > ./tmp1a.sai
bwa aln -t 4 -q 20 -I -B 6 ../human-assembly/human_g1k_v37.fasta ./run1_0h-2.fq > ./tmp1b.sai
bwa sampe ../human-assembly/human_g1k_v37.fasta tmp1a.sai tmp1b.sai ./run1_0h-1.fq ./run1_0h-2.fq | samtools view -Shu - | samtools sort - ./run1_0h_q20
tl;dr
remove adapter (tagdust)
de-multiplex (fastx barcode splitter)
remove duplicates? (fastx collapser)
change quality value/ filter on quality?
redo pairing (brentp's code)
run bwa
am I missing anything?
Update
Since I posted this question a couple years ago, I figure I should share some insight into what I am doing now. Sequences with adaptors and barcodes included are assumed to not map well so instead of removing them a-priori or trimming I set a mapping quality filter. Duplicates are removed after mapping using either MarkDuplicates from picard or quality filter for non-uniquely mapping reads from BWA (using something like samtools view -F 1548
)
@Manu, that link you posted (for FLASH) is for merging paired-end reads where you have a short insert, the re-pairing reads discussed above has to do with discarding orphaned reads (where one of the pair is removed from the quality trimming step) and creating the correct order of pairs in both files. This step is necessary so you can safely use programs like Velvet.
For the re-pairing step, take a look at FLASH: http://www.cbcb.umd.edu/software/flash/
I have downloaded illumina paired end data from DDBJ. Should there be any barcodes present in the files?? Or are the barcodes removed before a person gets the sequenced data?? Who is responsible for removing these barcodes? Should there be any barcodes or other adapter parts present in 5' of forward or reverse reads in the files? If contamination is present, should the whole read be deleted?? (Specifically in case of 5' contamination)