Forum: Please, please, define what a 'read' is.
4
2
8.2 years ago

Half a year into bioinformatics and I can't find a decent glossary. It seems like those who work in bioinformatics like to confuse everyone outside the field. :(

The words "read" and "reads" are simply jargon; I am unable to find any definition whatsoever. All indexing or quantifying pursuits are completely biased towards mapping data, which hinders non-experimental experiments. No one talks about how many false annotations there are out there.

Bioinformatics tools are brutally dependent on one another, but nobody seems to try and build a "suite" of tools. You have to download and download packages here and there, just to fail to use them or to test which one fits better, because I get quickly criticized by a teacher and colleagues for not approaching the vague "standards".

From what I understand, a read can be:

  • All of the fragments mapped: The sequence a machine managed to detect, with all its hardware and methodical quirks.
  • The sum of small sequences that a program such as velvet or bowtie2 managed to compile into a .fasta file.
  • A section of a file that I want to interact with.

I must ask out there, shouldn't there be a different, alternative, technical name for a read? I mean, when you google it, it gets confused with the verb 'to read' and the indicative, the imperative, past participle and whatnot, all of them called 'read'.

I understand why the name is a thing and that it suits well, and perhaps I have not done enough bioinformatics to fully understand it, but I believe scientific knowledge should avoid these language issues if it can. Or am I criticizing something uncriticizable?

read • 6.7k views
ADD COMMENT
2

I think there is a lot of misunderstanding here. For starters:

  1. Not all quantifying tools are based on mapping. Consider a distribution of per-base quality scores.
  2. What is a non-experimental experiment?
  3. Annotations can be derived experimentally or informatically. The latter is known to not always be precise.
  4. Generally, a read is simply a sequence fragment produced by a machine. They can be long, short, paired, single, etc. Sequences do not have to be mapped to be reads.
  5. Velvet and Bowtie2 do very different things; one is an assembler, one is a mapper. Mappers place reads in a genomic context and do not produce FASTA files.
  6. A section of a file that you want to interact with is defined by a region, generally in a BED or interval format.
  7. I bet if you Google 'read bioinformatics,' 'high-throughput sequencing read,' or 'illumina read' you'll get better results. Our most basic units are reads; chemists' most basic units are atoms.
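To make points 1 and 4 concrete, here is a minimal Python sketch (not part of the original reply). It treats each four-line FASTQ record as one read and computes the mean quality at each read position without any mapping. The two reads are made up for illustration, and the Phred+33 quality encoding is an assumption; older Illumina files used a different offset.

```python
import io
from collections import defaultdict

# Two made-up reads in FASTQ format (four lines per read); not real data.
FASTQ_TEXT = """@read1
ACGT
+
IIII
@read2
ACG
+
I5+
"""

def parse_fastq(handle):
    """Yield (name, sequence, quality_string) for each four-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality

def mean_quality_per_position(records, offset=33):
    """Average Phred score at each read position (assumes Phred+33 encoding)."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for _name, _sequence, quality in records:
        for position, char in enumerate(quality):
            totals[position] += ord(char) - offset
            counts[position] += 1
    return [totals[i] / counts[i] for i in sorted(totals)]

reads = list(parse_fastq(io.StringIO(FASTQ_TEXT)))
print(len(reads), "reads")
print(mean_quality_per_position(reads))  # [40.0, 30.0, 25.0, 40.0]
```

Nothing here knows about a reference genome: the reads and their per-base qualities exist before any mapper or assembler is run.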
ADD REPLY
3
8.2 years ago
d-cameron ★ 2.9k

The SAM specifications do define what a read is, as well as a number of other terms. Unfortunately, not all tools that use SAM/BAM/CRAM files follow the terminology as defined in the specification, and even the specifications themselves later refer to a 'fragment' without formally defining what that is.

1.2 Terminologies and Concepts

Template A DNA/RNA sequence part of which is sequenced on a sequencing machine or assembled from raw sequences.

Segment A contiguous sequence or subsequence.

Read A raw sequence that comes off a sequencing machine. A read may consist of multiple segments. For sequencing data, reads are indexed by the order in which they are sequenced.

Linear alignment An alignment of a read to a single reference sequence that may include insertions, deletions, skips and clipping, but may not include direction changes (i.e. one portion of the alignment on forward strand and another portion of alignment on reverse strand). A linear alignment can be represented in a single SAM record.

Chimeric alignment An alignment of a read that cannot be represented as a linear alignment. A chimeric alignment is represented as a set of linear alignments that do not have large overlaps. Typically, one of the linear alignments in a chimeric alignment is considered the “representative” alignment, and the others are called “supplementary” and are distinguished by the supplementary alignment flag. All the SAM records in a chimeric alignment have the same QNAME and the same values for 0x40 and 0x80 flags (see Section 1.4). The decision regarding which linear alignment is representative is arbitrary.

Read alignment A linear alignment or a chimeric alignment that is the complete representation of the alignment of the read.

Multiple mapping The correct placement of a read may be ambiguous, e.g. due to repeats. In this case, there may be multiple read alignments for the same read. One of these alignments is considered primary. All the other alignments have the secondary alignment flag set in the SAM records that represent them. All the SAM records have the same QNAME and the same values for 0x40 and 0x80 flags. Typically the alignment designated primary is the best alignment, but the decision may be arbitrary.
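As a rough illustration of how the definitions quoted above show up in practice, here is a small Python sketch that decodes the relevant FLAG bits. The bit values (0x40, 0x80, 0x100 for secondary, 0x800 for supplementary) come from the SAM specification; the function names and the example records are made up for this sketch and are not part of any library.

```python
from collections import defaultdict

# FLAG bits as defined in the SAM specification.
FIRST_SEGMENT = 0x40
LAST_SEGMENT = 0x80
SECONDARY = 0x100
SUPPLEMENTARY = 0x800

def alignment_kind(flag):
    """Classify one SAM record as primary, secondary or supplementary."""
    if flag & SUPPLEMENTARY:
        return "supplementary"
    if flag & SECONDARY:
        return "secondary"
    return "primary"

def segment_of(flag):
    """Report which segment of the template a record belongs to."""
    first = bool(flag & FIRST_SEGMENT)
    last = bool(flag & LAST_SEGMENT)
    if first and last:
        return "middle"   # part of a template, neither first nor last
    if first:
        return "first"
    if last:
        return "last"
    return "unknown"      # neither bit set: index in template unknown

# Hypothetical (QNAME, FLAG) pairs, as they might appear in a SAM file.
records = [("readA", 65), ("readA", 2113), ("readA", 129),
           ("readB", 0), ("readB", 256)]

# Records sharing a QNAME and segment are alignments of the same read.
by_read = defaultdict(list)
for qname, flag in records:
    by_read[(qname, segment_of(flag))].append(alignment_kind(flag))

for key, kinds in sorted(by_read.items()):
    print(key, kinds)
```

In this toy input, the first segment of `readA` has a chimeric alignment (one primary and one supplementary record), while `readB` maps to more than one place (one primary and one secondary record).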

ADD COMMENT
1

This definition of the "read" leaves a lot to the imagination. It first talks about "a sequencing machine", which makes it really complicated. Then it talks about indexing, which again is unrelated to the concept of reads. Finally, I really don't consider reads to be multiple segments.

IMHO a read is simply a sequence measurement. Just as when we take a temperature or measure a weight, we write down some numbers that indicate that measurement. When we measure sequences, we call the thing that we jot down a read.

ADD REPLY
0

Wouldn't the measuring unit be a base then?

ADD REPLY
0

A base is not an independent measurement, it's part of the read.

ADD REPLY
0

Why wouldn't it be an independent measurement? You get a quality score for each base, not for the read as a whole.

ADD REPLY
1

A base doesn't mean anything without the context in a read. The result of your sequencing is not that you sequenced 5496541564796461x A, 6546541313614351x T, 6543541354135135x C and 5465411361361361x G.

ADD REPLY
0

No more than a single digit is a measurement of anything (unless the number that expresses our measurement is just one digit long). You'd need the entire number to know what you got.

A single base is a read only when the read happens to be just one base, but not otherwise.

ADD REPLY
1
8.2 years ago
colin.kern ★ 1.1k

Reads are definitely not fragments. For most assays, you are going to have fragments much bigger than the size of your reads. If you did single-end sequencing, you will extend all aligned reads to your predominant fragment size (hopefully you did size selection), or if you did paired-end reads you can infer each fragment from the area between each pair of aligned reads.

I would define a read as a single nucleotide sequence output from a sequencing machine.
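The difference between a read and a fragment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: coordinates are 0-based and half-open, both mates of a pair are aligned to the same chromosome, and the function names and numbers are invented for the example.

```python
def fragment_from_pair(start1, end1, start2, end2):
    """Infer the fragment as the span covered by two aligned mates."""
    return min(start1, start2), max(end1, end2)

def extend_single_end(start, end, strand, fragment_size):
    """Extend a single-end read to an assumed fragment size.

    The read is extended from its 5' end in the direction it was sequenced:
    towards higher coordinates on '+', towards lower coordinates on '-'.
    """
    if strand == "+":
        return start, max(end, start + fragment_size)
    return max(0, min(start, end - fragment_size)), end

# Two 50 bp mates from one fragment: the fragment is much longer than a read.
print(fragment_from_pair(100, 150, 300, 350))    # (100, 350)

# Single-end reads extended to an assumed 200 bp fragment size.
print(extend_single_end(100, 150, "+", 200))     # (100, 300)
print(extend_single_end(300, 350, "-", 200))     # (150, 350)
```

In both cases the reads are 50 bases long, while the fragments they came from span 200 to 250 bases.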

ADD COMMENT
1
8.2 years ago
GenoMax 148k

The process of analyzing a DNA sequence (akin to "reading" a sentence) has been termed "reading" since the discovery of the commonly practiced "sequencing" method by Dr. Fred Sanger. The end result is a "read", a stretch of a single strand of nucleotides (a sequence). The same basic principle described by Dr. Sanger is still used by many extant sequencing technologies and results in a "read".

Out of curiosity, what background are you coming to bioinformatics from?

ADD COMMENT
0
8.2 years ago
John 13k

I am not well read in these matters, but I've read that if you read a read's quality and get a reading of less than 1, you can read between the lines and assume the data is not worth reading into.

ADD COMMENT
1

That sounds readtarded to me :-/

ADD REPLY
