I'm QCing some raw RNA-seq data and I have a question about the meaning of 'columns' in the readout of the fastx_quality_stats program.
The website's definition ("column number (1 to 36 for a 36-cycles read solexa file)"; mine has 51) is confusing to me. I would greatly appreciate it if someone could explain what these columns are, ideally in relation to the actual read/sequence fragment.
In addition, all of the reads seem to be exactly 51 bp long. Is this normal? I'm used to variability and longer read lengths, but I've only worked with DNA seqs. These data come from a GEO study. The description implies the files are raw, but could they have been trimmed (of adapters, for example)?
It sounds like those data were generated by a sequencer which employs a fixed read length for each sequencing run. This is the case for Illumina HiSeq and MiSeq instruments. A 51-cycle read is definitely not uncommon for Illumina sequencing.
What you're seeing in the fastx_quality_stats output is one output row (confusingly called a column) for each cycle in the read, i.e. for each base position: column 1 describes the first base of every read, column 2 the second, and so on. Ergo, 51 cycles per read means you'll have 51 columns. Each column contains quality stats for that cycle. If this is indeed a fixed-length read, then the value of the count field should be identical, or at least very close, for all 51 sets of stats. If this is the case, then the data are likely not quality- or adapter-trimmed, though someone could have trimmed a fixed number of bases off, or even trimmed every read down to a fixed size.
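If you'd rather not eyeball 51 rows, the count check can be scripted. This is only a rough sketch: it assumes the stats output is tab-delimited with a header row naming the `column` and `count` fields (check your own file), and the function names and the three-cycle example data are made up for illustration.

```python
import csv
import io


def count_by_cycle(stats_text):
    """Map each cycle ('column') to its read count.

    Assumes tab-delimited text with a header row that includes
    'column' and 'count' fields; adjust if your file differs.
    """
    reader = csv.DictReader(io.StringIO(stats_text), delimiter="\t")
    return {int(row["column"]): int(row["count"]) for row in reader}


def looks_fixed_length(counts):
    """True if every cycle was seen in the same number of reads."""
    return len(set(counts.values())) == 1


# Made-up example: three cycles, each present in all 1000 reads.
example = (
    "column\tcount\tmin\tmax\n"
    "1\t1000\t2\t40\n"
    "2\t1000\t2\t40\n"
    "3\t1000\t2\t40\n"
)
counts = count_by_cycle(example)
print(looks_fixed_length(counts))
```

For real data you would read the stats file into a string and pass it to `count_by_cycle`. Counts that drop off toward the later cycles would suggest variable-length (for example quality- or adapter-trimmed) reads.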
If you can post an example header line (starts with @) from the FASTQ file I can probably tell you more about what instrument generated the data. You can get this data sample by typing, on the Linux command line:
If the data are gzipped:
zcat [your fastq file name] | head
or, if the data are uncompressed:
head [your fastq file name]
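If you're not on a Linux command line, roughly the same peek can be done from Python. This is a minimal sketch, assuming standard four-line FASTQ records (no wrapped sequence lines) and that gzipped files end in `.gz`; `first_headers` is just an illustrative name.

```python
import gzip


def first_headers(path, n_reads=3):
    """Return the header lines (the ones starting with @) of the first few reads.

    Assumes each record is exactly four lines, so the header is
    every fourth line starting from the first.
    """
    opener = gzip.open if path.endswith(".gz") else open
    headers = []
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 0:
                headers.append(line.rstrip("\n"))
                if len(headers) == n_reads:
                    break
    return headers
```

Call it as `first_headers("your_file.fastq.gz")`, with your own file name in place of that placeholder, and post one of the returned lines.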