I'd like to introduce a new member of the BBMap package, FilterByTile. It's intended to increase the quality of libraries without incurring bias, or to help salvage libraries with major positional quality problems (like flow-cell bubbles).
Overview
The quality of Illumina reads is dependent on location in a flowcell. Some areas are in poor optical focus, or have weaker circulation, or air bubbles, and thus can have very low-quality reads that nonetheless pass Illumina’s filter criteria. While these reads usually have below-average quality scores, it requires very aggressive quality-filtering to remove all of the reads with positionally-related low quality. Aggressive quality filtering and trimming can, in turn, cause detrimental impacts on analysis because sequence quality is also sequence-dependent; thus, aggressive filtering can incur bias against extreme-GC portions of the genome, or specific motifs. This may yield poor assemblies, incorrect ploidy calls, bad expression quantification, and similar problems.
FilterByTile is designed to filter low-quality reads based on positional information. By removing only a small fraction of reads (those in the lowest-quality areas of the flowcell), the overall quality of the data can be increased without incurring sequence-specific bias. The default settings of FilterByTile typically remove on the order of 2% of the reads, while reducing the overall error rate by substantially more than 2% (on the order of 10%). Essentially, it gets rid of the worst of the worst.
FilterByTile was originally developed after observing spikes in the kmer-uniqueness plot used to calculate library complexity, in what should have been a monotonically-declining exponential decay curve (generated by bbcalcunique.sh); these spikes corresponded to low-quality locations on the flow-cell. Interestingly, the spikes often have a regular period, indicating a structured pattern such as flow-cell edges, tile edges, or a “streak”. The initial goal of FilterByTile was simply to eliminate these spikes to allow better estimation of library size and complexity, but it can be useful for generally improving library quality as well.
Notes
How it works:
Illumina read names contain information about each cluster’s lane, tile, and X,Y coordinates. FilterByTile scans all reads in the file and calculates the average quality score for a given position. Additionally, the average kmer-uniqueness rate is calculated by position; for data with sufficient depth, this can be used as a proxy for error-rate, allowing filtering of data with inaccurate quality scores.
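To illustrate where the positional information comes from, the lane, tile, and X,Y fields can be parsed directly out of a standard Illumina read name (the same colon-separated format quoted later in this thread). This is a hedged sketch; the parse_position helper is hypothetical and not part of BBMap:

```python
# Sketch: extracting flowcell coordinates from an Illumina read header.
# Standard header fields are colon-separated:
#   @instrument:run:flowcell:lane:tile:x:y
def parse_position(header):
    """Return (lane, tile, x, y) from an Illumina-style read name."""
    # Strip the leading '@' and any comment after whitespace.
    name = header.lstrip("@").split()[0]
    # Some pipelines append "_1:N:0:1"-style suffixes; keep the coordinate part.
    name = name.split("_")[0]
    fields = name.split(":")
    lane, tile, x, y = (int(f) for f in fields[3:7])
    return lane, tile, x, y

print(parse_position("@ST-E00204:114:HHKTJALXX:4:1101:22962:1538_1:N:0:1/1"))
# (4, 1101, 22962, 1538)
```

Renamed reads (e.g. from the SRA) often lack these fields, which is why they cannot be used.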
To calculate a useful average quality for a position, sufficient reads are needed. So, reads are aggregated by position into rectangular “micro-tiles”; these micro-tiles are iteratively expanded until the average micro-tile contains at least X reads (default 800). Then, the averages are calculated on a per-micro-tile basis, standard deviations are calculated, and for tiles at least Y standard deviations worse than normal, all reads are discarded together. Thus, smaller micro-tiles allow more precise positional filtering, but larger micro-tiles yield more accurate quality-score averages. Arbitrary shapes such as circles outlining bubbles would be optimal, but there are no plans for this.
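FilterByTile itself is Java code inside BBMap; the following is only a toy Python sketch of the micro-tile idea described above (the cell size, function name, and 2-standard-deviation cutoff are illustrative assumptions, not the tool's actual defaults):

```python
# Toy sketch of micro-tile filtering (NOT FilterByTile's actual code):
# bin reads into rectangular cells by coordinate, average the quality
# per cell, then flag cells whose average is more than `nsd` standard
# deviations below the mean of all cells.
from collections import defaultdict
from statistics import mean, pstdev

def flag_bad_microtiles(reads, cell=2000, nsd=2.0):
    """reads: iterable of (x, y, avg_quality) tuples.
    Returns the set of (cell_x, cell_y) micro-tiles to discard."""
    cells = defaultdict(list)
    for x, y, q in reads:
        cells[(x // cell, y // cell)].append(q)
    cell_avgs = {c: mean(qs) for c, qs in cells.items()}
    mu = mean(cell_avgs.values())
    sd = pstdev(cell_avgs.values())
    if sd == 0:
        return set()
    return {c for c, a in cell_avgs.items() if a < mu - nsd * sd}
```

All reads falling inside a flagged micro-tile would then be discarded together, which is why smaller cells give more precise filtering but noisier averages.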
How and when to use, or not:
FilterByTile is applicable to any Illumina HiSeq, MiSeq, or NextSeq sequence. However, it depends on large volumes of data for statistics; it’s useless to run it on a set of 4000 reads demultiplexed from a much larger run. In that case, it would be better to use the “dump” flag to dump the statistics from all libraries in the run together, then use the “indump” flag to filter the libraries individually. That way, quality statistics gathered from all reads will be applied to each individual library.
The filtering should be beneficial in most cases: particularly when you want to salvage a library that obviously had bubbles or low-flow streaks in the lane, but also for libraries with no dramatic positional quality issues. However, in any situation where very low coverage is expected (such as single cells and complex metagenomes, where coverage is highly variable), all coverage is useful, even from low-quality reads; when you are trying to assemble regions with 2x coverage, any program that removes reads tends to be detrimental, and in my tests on metagenomes, filtering yielded inferior results. With high coverage, FilterByTile should be strictly beneficial, so I recommend it for isolates, or basically any situation where you expect >5x average coverage.
Read names:
FilterByTile depends on read headers to identify flowcell location. It has been validated with HiSeq, MiSeq, and NextSeq data, but different Illumina demultiplexing/base-calling software versions have different naming conventions, so please contact me if you see Illumina names that it can’t parse. Renamed reads (such as those in the SRA) probably won’t work.
Memory:
FilterByTile should not need too much memory, but if it runs out of memory, it will generally be due to calculating kmer uniqueness for a large genome. In that case, the "usekmers=f" flag will disable kmer tracking and use only quality scores, which avoids the memory problem.
Usage Examples
Single-ended or paired/interleaved files:
filterbytile.sh in=reads.fq.gz out=filtered.fq.gz
Paired reads in twin files:
filterbytile.sh in1=r1.fq in2=r2.fq out1=filtered1.fq out2=filtered2.fq
Filtering using a statistical profile from multiple libraries:
cat *.fastq.gz > all.fq.gz
filterbytile.sh in=all.fq.gz dump=dump.flowcell
filterbytile.sh in=sample1.fastq.gz out=filtered_sample1.fq.gz indump=dump.flowcell
Filtering aggressively (when you know there’s a serious problem):
filterbytile.sh in=x.fq out=y.fq ud=0.75 qd=1 ed=1 ua=.5 qa=.5 ea=.5
Disabling kmer uniqueness to increase speed and decrease memory usage:
filterbytile.sh in=x.fq out=y.fq usekmers=f
Hello, thanks for the nice explanation. I have exactly this problem: in my reads sequenced on an Illumina machine I have noticed (after performing quality control with FastQC) that specific tiles (or flowcell areas) had problems during sequencing (bubbles during loading, etc.). From the quality control I know exactly where these tiles are and which reads correspond to each tile (the fifth value in the Illumina read name below):
@ST-E00204:114:HHKTJALXX:4:1101:22962:1538_1:N:0:1/1
Is there a way this tool is able to remove reads according to a specific range of "tile" numbers? Like, for example, from 1101 to 1400?
Thanks in advance, Cheers
If you have access to the original data folder and the bcl2fastq software (or MiSeq Reporter), you can add an option called --ExcludeTiles 1101-1400 (to the samplesheet, in the case of a MiSeq Reporter reanalysis) to exclude tiles between those numbers during the analysis. --ExcludeTilesLaneX (replace X with the lane number) should be used for a HiSeq flowcell.
Thanks for the reply. No, I do not have access to all the files (at the moment), so I was trying to do it directly from my fastq files (already merged into a single one)...
Despite the name, FilterByTile does not actually filter out reads from specific tile numbers that you give it. It's all automatic and it filters areas much smaller than tiles. I suggest you try it; even though you can't tell it which tiles to filter, it should automatically detect which areas need to be filtered. I'd be interested in hearing if it does a worse job than you can do manually.
let's see I will try.. thanks Brian
Hi Brian, I have tried 'filterbytile' from BBMap and am now doing the quality check with FastQC. Meanwhile I made a Unix command to remove all the reads that are in the tiles marked red in the heat map.
Before removing: [FastQC per-tile quality heatmap]
After removing: [FastQC per-tile quality heatmap]
As you can see, there is already an improvement from removing those tiles. Now I will perform the mapping to see if there is an improvement in the mappability of the reads as well. Let's wait for the FastQC output based on the BBMap filterbytile.sh script. Cheers
It's odd that the results would be identical, if FilterByTile was operating correctly. Would you mind posting the stderr output when running it?
Oh wait, I did the quality check with the same file :(( Let me modify the previous report, and let's wait for the quality check on the output of the filterbytile.sh script. I think I'll have the output in less than an hour; in the meantime, here is the output of the script:
I have formatted your code correctly. In future use the icon shown below (after highlighting the text you want to format as code) when editing.
yes thanks lets wait for the output also...
Dear Brian, dear fusion.slope, could you add the FastQC pictures showing how FilterByTile affects the quality? Do I understand correctly that the pictures above show the results of a script you wrote yourself, and that you removed complete tiles? I believe your picture after FilterByTile is still missing.
I can provide my input fastqc and result fastqc with default settings:
input at https://ibb.co/i2JPzF and result at https://ibb.co/eA3Dta
Is this correct behavior by BBTools and/or FastQC? It looks almost worse after filtering. Or is this just the way FastQC calculates averages for this image?
Any feedback or help appreciated. Best!
It's hard to say; I don't personally use FastQC and I don't know how it chooses colors. But I think the red tiles probably had 100% of data removed. It might be helpful if you could post the screen output of FilterByTile.
If you want to evaluate the quality of the data, and you have a reference, I would recommend mapping the data (before and after filtering) to the reference with BBMap and posting the resulting histograms. BBMap also prints the mapping rate and error rates to the screen; the mapping rate should be higher and the error rate lower post-filtering.
Dear Brian, thanks a lot for your answer. In the meantime I wrote an awk one-liner and manually filtered the data according to tile and position. I also checked the FastQC output, and I can confirm that the red lines are generated because there is no data left.
For completeness I added the screen output of your tool, in case this helps. I will map the data and compare pre and post filtering.
Hi br.m,
sorry for the late reply. Basically I wrote a bash script removing the tiles colored red. Unfortunately, the mappability of my samples after removing those tiles remains the same. As I understand from your pictures, you have even worse "per tile sequence quality". I am sorry, but I do not know the reason.
To be sure, try making your own script, remove the tiles from 2100 to 2116, and check if this improves your mappability. In my case it did not.
I can't help more. Best
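For readers who want to try the manual approach described in this thread (removing whole tiles by number), a sketch along those lines might look like the following. This is a hypothetical example, not the actual script used above; the filter_tiles helper and the tile range are illustrative:

```python
# Hedged sketch (not the actual script from this thread): drop FASTQ
# records whose Illumina tile number falls within a given range.
def filter_tiles(lines, lo=2100, hi=2116):
    """lines: iterable of FASTQ lines (4 per record);
    yields the lines of records whose tile is outside [lo, hi]."""
    rec = []
    for line in lines:
        rec.append(line.rstrip("\n"))
        if len(rec) == 4:
            # The tile is the 5th colon-separated field of the header.
            tile = int(rec[0].split(":")[4])
            if not (lo <= tile <= hi):
                yield from rec
            rec = []

lines = [
    "@I:1:FC:1:2105:10:20", "ACGT", "+", "IIII",
    "@I:1:FC:1:1101:10:20", "ACGT", "+", "IIII",
]
print(list(filter_tiles(lines)))
# ['@I:1:FC:1:1101:10:20', 'ACGT', '+', 'IIII']
```

Note that, as discussed above, removing whole tiles is much coarser than FilterByTile's automatic micro-tile filtering.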
Mappability issues may be unrelated to tiles. Are you sure that only certain tiles are exhibiting mappability problems? What fraction of reads is not mapping, and what exactly is this experiment trying to achieve? Have you taken some of the unmapped reads and blasted them at NCBI? You may find that you simply have unexpected contamination.
I took the unmapped reads and blasted them without any success. I also mapped the unmapped reads to the genomes of lambda phage and mycoplasma without success (in both cases very few reads mapped). I suspected the mappability was related to the tiles because the quality of my tiles had a roughly linear relationship with the mappability (i.e., the worse the quality of the tiles, the lower the mappability of the reads). But given that removing those tiles did not improve the mappability, I now think it is not related to that.
That is odd. What fraction of reads are unmapped? Are the reads that are mapping doing so with good quality (are you using BBMap for this)?
If you are certain that you are doing everything right then I would suggest contacting the service provider and checking if the run this sample was on ran into some kind of software/hardware issue.
In two samples we had ~60% of the reads unmapped; in the other two we have 35% (each sample had ~450 million reads in paired-end, so ~900 million). Why hardware and software issues? I have done all the checksums and the files were transferred correctly (the checksum output was OK). Could you please tell me why you think it is a software issue?
Also, I have used BBMap but did not get any interesting information.
Thanks in advance
Hardware/software/reagent issues on the sequencer itself. Generally sequencing providers should not release data that had such problems (we don't) but one never knows.
That is a fairly large % of unmapped reads that do not seem to belong to your genome of interest (or to anything interesting after blast) correct? I suggest that you contact the sequence provider and see if they can shed any light on what is going on.
Hi there,
I am a masters student and this is my first time assembling a genome.
I have paired-end Illumina sequencing data from a plant genome (approximately 2.5Gb, with a high percentage of repetitive sequences).
The per-tile quality heatmaps for the data from the HiSeq 2500 run all fail. I ran FilterByTile on each file, and visually the quality improved, but the quality check still failed in FastQC (after adapter removal, trimming, and quality filtering were performed).
My question is whether I should rather use the data without running FilterByTile (and have higher coverage), or use the data on which I ran FilterByTile (even though the quality check still fails).
When using the aggressive parameter with FilterByTile, I lose 11-14% of my data. When using the normal setting, I lose 5% of data.
Kind regards, Allison
@Allison,
Can you post the initial QC plot of your data (How to add images to a Biostars post )?
Failing a test in FastQC is to be taken with a grain of salt. It does not immediately mean that you have bad data. You need to consider those results in the context of the experiment.
Thank you for the reply.
Here are screenshots from one of my datasets.
[FastQC images: per-base quality, per-sequence quality, per-tile quality]
Are you sure there was no hardware/software problem with this run? I would start by asking the sequence provider. If the per-tile plot is accurate, my hunch is there was a problem with this run.
So I contacted the sequence provider, and this is the reply:
"It does not appear that we resequenced that library. I pulled the data and it does appear that we have lower clusters passing filter (50-60%) than I would expect from a good run, and a slightly higher error rate. The only other thing that pops out as questionable is the second index (read 3) has pretty low quality. On the off chance, we did look to see if we still had libraries stored, and we do not. We generally only store for ~6 months."
How usable is the data? Should the sequence provider not have re-sequenced the run?
Kind regards, Allison
Since you are de novo assembling a genome, using this data is probably not advisable, but if you have no other option then you could proceed, with a big lump of salt.
They should have, but it looks like you (or someone else) waited 6+ months before looking at this data set. If you still have some of the sample stored, you can always make new libraries and sequence them. If you have some libraries left, you could ask the sequence provider to sequence them (at least at a reduced cost, if not for free).
Thank you for the advice!
Hi,
I've downloaded some SE RNA-Seq samples and ran FastQC on the files. After inspecting the output, I decided that I would filter out the worst reads from the bad tiles, if FastQC reports a "fail" for both "per base sequence quality" and "per tile sequence quality", before mapping the reads. I hope in this way to improve the phred scores of the reads to an acceptable level. The pictures below are an example of a sample for which both the per-base and per-tile quality resulted in a "fail" in the FastQC output.
That said, I have a problem with filterbytile.sh. I tried to use the tool in a loop, but I can't seem to generate any output files, which prevents me from checking whether redoing FastQC improves the results. This is because I receive an AssertionError, and unfortunately I don't understand how to solve it. Could someone explain to me what the error is and how I could solve it? I have posted my code and the output that I got.
Code that I used:
Output:
I would suggest just doing a regular bbduk.sh scan and trimming on this data. While it does look like the final few cycles have bad Q-scores, the median value is still high (so it must be a few bad reads from that block of tiles). If you are aligning to a good reference then things should work fine.
BTW: It looks like you must have fastq-dump'ed this data from SRA. It looks like the loop is failing on the first file, so you could start the loop at 1121293 and see if the jobs go through. You may have a corrupt file.
Hi, I tried to run filterbytile.sh and it returned a parsing error. I report the error here:
Can you please help me.
Thanks.
Try to extract the reads using the -F option for fastq-dump; that should generate reads in the original Illumina format.