Hello everybody,
May I ask for your opinion on the importance of having the read number correctly embedded in the FastQ headers?
For example:
@A00689:468:H2CKFDSX3:1:1101:25437:1016/1
and @A00689:468:H2CKFDSX3:1:1101:25437:1016/2
or
@A00689:468:H2CKFDSX3:1:1101:25437:1016 1:N:0
and @A00689:468:H2CKFDSX3:1:1101:25437:1016 2:N:0
Are there tools that rely on those patterns, or do you take a look when receiving new FastQs from the sequencing facility?
I am asking because we started to use kits that have a UMI embedded in the sequencing adapter, which is read before the second read and output into a separate FastQ file. Because we output the UMI before the second read, bcl-convert will embed the read number 2 into the UMI reads and 3 into the headers of the mate reads.
Therefore, we wonder how big an issue this will be: could it cause downstream tools to malfunction or confuse bioinformaticians? In your opinion, should the read number be changed back to e.g. /1 and /2, or would /1 and /3 be fine as well?
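To make the two header styles concrete, here is a minimal Python sketch of how the read number could be pulled out of either notation and, if desired, changed from 3 back to 2. The function names `read_number` and `renumber` and the regular expressions are my own illustration, not part of bcl-convert or any downstream tool, and the patterns only cover the two notations shown above.

```python
import re

# Space-separated style: "@<id> <read>:<filter>:<control>[:<index>]"
CASAVA = re.compile(r"^@(?P<id>\S+) (?P<read>\d+):(?P<rest>[YN]:\d+(?::\S*)?)$")
# Slash style: "@<id>/<read>"
LEGACY = re.compile(r"^@(?P<id>\S+)/(?P<read>\d+)$")


def read_number(header):
    """Return (read_id, read_number); read_number is None if not recognised."""
    for pattern in (CASAVA, LEGACY):
        m = pattern.match(header)
        if m:
            return m.group("id"), int(m.group("read"))
    return header[1:].split()[0], None


def renumber(header, old, new):
    """Rewrite the read number from `old` to `new`; other headers pass through."""
    m = CASAVA.match(header)
    if m and int(m.group("read")) == old:
        return f"@{m.group('id')} {new}:{m.group('rest')}"
    m = LEGACY.match(header)
    if m and int(m.group("read")) == old:
        return f"@{m.group('id')}/{new}"
    return header


mate = "@A00689:468:H2CKFDSX3:1:1101:25437:1016 3:N:0"
print(read_number(mate))
print(renumber(mate, old=3, new=2))
```

Applied line by line to the header lines of the mate file, something like this would turn 3:N:0 into 2:N:0 (or /3 into /2) while leaving sequence and quality lines untouched.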
Sharing your opinion on this would be greatly appreciated! Thanks a lot
Matthias
Thank you very much for your insightful and quick response!
I guess I should just run a few tests with those tools... thanks for pointing out which ones might be affected. But don't they rather rely on the lane:tile:x_pos:y_pos part of the read ID matching to verify pairs? In this case, it might be acceptable if the read number of the mate is 3:N:0 instead of 2:N:0?
I was just aware that there are different notations (sometimes even using an underscore), but didn't know which one is the current standard. Thanks!
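As a rough illustration of a pair check that looks only at the read ID and ignores the read number, here is a small Python sketch. Whether any particular tool actually verifies pairs this way is an assumption on my part; `core_id` and `mates_in_sync` are hypothetical helper names.

```python
from itertools import zip_longest


def core_id(header):
    """Strip '@', any space-separated comment field and a trailing '/<n>'."""
    name = header[1:].split()[0]
    base, sep, suffix = name.rpartition("/")
    return base if sep and suffix.isdigit() else name


def mates_in_sync(headers_r1, headers_r2):
    """True if both header streams are equally long and the core IDs match."""
    for h1, h2 in zip_longest(headers_r1, headers_r2):
        if h1 is None or h2 is None or core_id(h1) != core_id(h2):
            return False
    return True


r1 = ["@A00689:468:H2CKFDSX3:1:1101:25437:1016 1:N:0"]
r3 = ["@A00689:468:H2CKFDSX3:1:1101:25437:1016 3:N:0"]
print(mates_in_sync(r1, r3))
```

A check of this kind would accept a mate numbered 3; a tool that instead compares the full header, or expects exactly /1 and /2, would not.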
We discussed whether we should deliver the files with the UMIs already embedded, but eventually felt that delivering three FastQ files would be more flexible. It would then still be possible to embed the UMIs as required for the tool of choice, whereas users not interested in using UMIs in their analysis could just ignore the third file.
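For anyone who wants to embed the UMIs afterwards, a minimal Python sketch of the idea follows. It assumes the space-separated header style and appends the UMI to the read name with a configurable separator; which separator and position a given tool expects differs, so `embed_umi` and its default `sep="_"` are illustrative assumptions only.

```python
def embed_umi(read_header, umi_sequence, sep="_"):
    """Append the UMI to the read name, keeping any comment field intact."""
    name, _, comment = read_header.partition(" ")
    tagged = f"{name}{sep}{umi_sequence}"
    return f"{tagged} {comment}" if comment else tagged


header = "@A00689:468:H2CKFDSX3:1:1101:25437:1016 1:N:0"
print(embed_umi(header, "ACGTACGT"))
```

In practice the UMI sequence would come from the matching record of the separate UMI FastQ file, read in lockstep with the read files.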
Do files with UMI reads get the name I1 or do the files get the name R2? If so, most software would probably key off those file names rather than the FastQ header. You are surely not running UMIs for every run you are doing?
To know for sure, I would need to ask, since I am unfortunately one of those bioinformaticians for whom sequencing data comes into existence as FastQ ;-), but to the best of my knowledge the indexes used for demultiplexing are read separately from the UMI.
When I get the files, they are for example called Sample_L001_R1_001.fastq.gz, Sample_L001_R2_001.fastq.gz, Sample_L001_R3_001.fastq.gz and the R2 is the UMI.
Indeed, we do not read UMIs for every run, but since we are using the IDT adapters, they are often in there anyway. Therefore, the decision was to start sequencing them when appropriate, basically for all quantitative experiments. Apparently, this is now possible since Illumina upgraded their kits to contain enough reagents to run the regular cycle number plus UMIs. Before that, I was told, we often had to snatch a few cycles away from the reads if we wanted to have UMIs, too.
IDT has a nice tech note available that details how xGen Prism data needs to be processed. I assume this is the kit you are referencing. It does require somewhat non-standard processing.
Thank you very much!