Question

Inconsistent number Pandaseq vs grep

0

Entering edit mode

6.0 years ago

oscar.nvergara • 0

Hi, I have a doubt about the number of sequences in a paired end merging using PANDASEQ and the sequencing counting. Let me explain, I merged a forward and reverse set of sequences using pandaseq, before this I checked the number using the command

EDIT: I forgot to mention I converted the fastq to fasta with fastq_to_fasta script.
    grep -c ">" sample_R1.fasta

It gives me around 200.000 sequences, when I execute PANDASEQ the last report shows this.

0x7fffc5dabb50:2        STAT    TIME    Tue Mar 12 02:53:23 2019
0x7fffc5dabb50:2        STAT    ELAPSED 47
0x7fffc5dabb50:2        STAT    READS   63600
0x7fffc5dabb50:2        STAT    NOALGN  4286
0x7fffc5dabb50:2        STAT    LOWQ    156
0x7fffc5dabb50:2        STAT    BADR    0
0x7fffc5dabb50:2        STAT    SLOW    8547
0x7fffc5dabb50:2        STAT    OK      59158
0x7fffc5dabb50:2        STAT    OVERLAPS        0 0 0 0 0 648 3666 3069 224 201 57 1076 101 57 231 47 170 4537 2147 4845 393 56 193 1327 334 57 1460 121 278 859 3108 26842 1541 75 16 7 7 3 10 1 5 3 7 2 0 0 0 7 98 833 68 3 4 1 5 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 2 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0 6 1 0 2 22 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 2 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 4 27 2 3 0 1 0 0 0 1 0 2 0 0 0 0 0 0 2 0 0 0 0 7 16 0 0 0 39 4 0 0 1 0 1 0 2 0 0 0 0 0 11 153 0 1 1 0 28

For that reason I recounted the merged pairs and I obtain again 200.000 sequences. I also checked the merged paired end fastq with fastQC and it showed me 200.000 sequences, and the size of the sequences and their Q shows the behaivour of merged sequences.

So, my questions is. Why pandaseq shows that last report showing around 63000 READS and other softwares shows the merged sequences are properly consistent with what I expected?

I will be really grateful with any suggestion,

Oscar :)

sequence assembly • 1.3k views

ADD COMMENT • link 6.0 years ago by oscar.nvergara • 0

score 3 · Answer 1 · 2019-03-12

3

Entering edit mode

6.0 years ago

michael.ante ★ 4.0k

Fastq read ids should not start with a > but rather with a @.

You are maybe counting the occurrences of the quality score encoded by >.

[Edit] Fastq files can be counted by wc -l my.fastq | awk '{print $1/4} , since fastq files have a defined structure. Each read is described by 4 lines.

ADD COMMENT • link 6.0 years ago by michael.ante ★ 4.0k

0

Entering edit mode

Hi, sorry for my mistake. I forgot to mention that I converted the fastq to fasta and counted. Them with that command. In the other hand fasQC also red the fastq and counted around 200.000 sequences.

I tried to use your suggestion replacing my.fastq with my file names, but it doesn´t work.

edit: I could run the fastq count, there was just an ' missing at the end of the awk command

wc -l my.fastq | awk '{print $1/4}'

ADD REPLY • link 6.0 years ago by oscar.nvergara • 0

score 0 · Answer 2 · 2019-03-12

Hi! So sorry for taking your time. I could find by my self the answer.

First of all, I got confused about the file I was counting, it didin´t have 200.000 but 370.000. Second and more important, I haven't figured out that PANDASEQ does me merging using all your cores, so it does a parallel merging. My system is a six-core platform and when I used htop to check the running process I saw how it behaves.

As you can see, pandaseq starts and distributes a 6 threads run to perform the merging (63.000 * 6 = (around) 370.000)

Thanks for your help, I hope if somebody has the same question could find it here.