identify sequences deep
10.0 years ago
hana ▴ 190

Hi all

I received 8 RNA-seq datasets from a company. I would like to check the sequencing deep and make sure that the company sequenced to the exact deep I requested. Would you please let me know how I can do it?

thanks in advance

RNA-Seq • 2.0k views
10.0 years ago

I assume that "deep" means "depth" here.

Assuming that you have gzipped fastq files, a simple zcat foo.fq.gz | wc -l will let you know. Just divide the resulting number by 4 and that will tell you whether they met the minimum read number they guaranteed. Note that "depth", as typically defined, has no real meaning in RNA-seq and should almost* never be used.
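
If you have several files to check, the same line count can be wrapped in a small function and run over each file. This is only a rough sketch: count_reads is a made-up helper name, the *.fq.gz pattern is an assumption about how your files are named, and it assumes the usual 4 lines per read. It uses gzip -dc in place of zcat because on some systems zcat expects a .Z suffix.

```shell
# Hypothetical helper: count reads in a gzipped FASTQ file,
# assuming every read occupies exactly 4 lines.
count_reads() {
    local lines
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Print "file<TAB>read count" for every gzipped FASTQ file in the
# current directory (the *.fq.gz naming is an assumption).
for f in *.fq.gz; do
    [ -e "$f" ] || continue   # nothing matched the pattern
    printf '%s\t%s\n' "$f" "$(count_reads "$f")"
done
```

You can then compare each number against the read count the company guaranteed. For paired-end data, each of the two files of a pair should report the same number.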


I figured out a one-liner for that!

echo "$(( $(zcat foo.fq.gz | wc -l) / 4 ))"

Thanks for your reply. Would you please let me know why we should divide the resulting number by 4?

Thank you


There are 4 lines per read.


Thank you for your comment. I am very new to RNA-seq data analysis. Would you please let me know why we should divide the number by 4?


Because in a fastq file each read takes up four lines. The read sequences are on every fourth line starting from line 2 (lines 2, 6, 10, 14, 18, 22, 26 and so on); the other lines serve different purposes:

  1. The 1st line starts with '@' and is the sequence ID of your read
  2. The 2nd line is your read sequence
  3. The 3rd line starts with '+' and is something something
  4. The 4th line holds the quality values for the read sequence in the 2nd line
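
The four-line layout above can also be checked before trusting a line count divided by 4. Below is a minimal sketch, not a full validator: check_fastq is a made-up name, and it only tests that line 1 of each record starts with '@', line 3 starts with '+', the quality line is as long as the sequence line, and the total line count is a multiple of 4.

```shell
# Hypothetical helper: sanity-check the 4-line record layout of a
# gzipped FASTQ file and print the read count if it looks fine.
check_fastq() {
    gzip -dc "$1" | awk '
        NR % 4 == 1 && !/^@/  { bad = 1 }            # ID line must start with @
        NR % 4 == 2           { seqlen = length($0) } # remember sequence length
        NR % 4 == 3 && !/^\+/ { bad = 1 }            # separator must start with +
        NR % 4 == 0 && length($0) != seqlen { bad = 1 } # one quality value per base
        END {
            if (bad || NR % 4 != 0) {
                print "does not look like 4-line FASTQ" > "/dev/stderr"
                exit 1
            }
            printf "%d\n", NR / 4
        }
    '
}
```

Calling check_fastq foo.fq.gz prints the number of reads, or exits with a non-zero status if the file does not follow the layout.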

The 3rd line is usually left empty after the '+' to conserve space. It used to hold the ID of the read (again). Had a hearty chuckle at "something something" though :)


Ha ha ha! I was not sure what to write there, because I never understood the importance of the third line :)


The purpose of the + line was to indicate that the sequence lines were finished (the sequence can be multiline, even if that will break most tools since it's almost never done). These days, however, the + line is just an extra useless 2 bytes.


The third line seems to have a rich glorious heritage!


Thank you for sharing the information
