Question

Corrupted FASTQ files with missing "+" under some sequences

1

3.6 years ago

akh22 ▴ 120

Hi,

I have been trying to recover corrupted FASTQ files. Decompression failed with this error:

invalid compressed data--crc error.

I got around the CRC error by using gzrecover, then used seqkit sana to fix sequence inconsistencies. Now, when I run FastQC, it complains that some sequences lack a "+" line under the sequence. I thought about using sed but am not sure how to add the missing "+" where it should be.

Any help will be appreciated.

Update:

I ran ValidateFastq and found an issue:

INFO  [2021-05-21 16:13:40,878] [ValidateFastq$$anonfun$main$1] - 107300000 reads processed
Exception in thread "main" htsjdk.samtools.SAMException: Quality header must start with +: GCCCTGAAAAACAACAGTAATGATATTGTAAATGCTATTATGGAATTAACAATGTAACTATTTGACAGCGAAGACAACTCCCCCTTTCCCC at line 429343625 in fastq /Volumes/Aura/rec.test.fastq

Should I be able to add "+" right below this line with sed?

fastq RNAseq corruption recover • 3.7k views

ADD COMMENT • link updated 15 months ago by Tommaso • 0 • written 3.6 years ago by akh22 ▴ 120

3

This sounds like a lost cause. Trying to fix corrupt data is not a good strategy. You can't be certain of the results you will generate by doing this. Please go back and re-download the data.

If this was your only copy and it is now corrupt then you learned a valuable lesson. Always keep backup copies of all data.
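
A re-downloaded copy can be checked for integrity before any further processing. The following is a minimal Python sketch of such a check, not something from the original reply; the function name `gzip_is_intact` is hypothetical. It relies only on the standard `gzip` module, which verifies the stored CRC once the stream has been read to the end.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream decompresses and passes its CRC check."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read to the end: the CRC is only verified once the stream is exhausted.
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError covers gzip.BadGzipFile (bad header or CRC mismatch) and also
        # a missing or unreadable file; EOFError is a truncated stream;
        # zlib.error is invalid compressed data.
        return False
    return True
```

A file that fails this check is in the same state as the one that triggered the "crc error" above, so it is better replaced than repaired.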

ADD REPLY • link 3.6 years ago by GenoMax 148k

0

Can you post a small extract of some of those corrupted lines?

ADD REPLY • link 3.6 years ago by lieven.sterck 15k

0

This is the output around the problem line:

gsed -n '429343624,429343626p' rec.test.fastq
FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
P�;8���>-�T��T
              L_�:q����J{/�bh[�li3�=c�/>8�7���/w8Zd�7n�`P`P`P`P`P`�XQ@A00165:69:HKJ3YDMXX:1:1127:28203:31297 2:N:0:GTCCTTCT
GCCCTGAAAAACAACAGTAATGATATTGTAAATGCTATTATGGAATTAACAATGTAACTATTTGACAGCGAAGACAACTCCCCCTTTCCCC

I have to delete this garbage

P�;8���>-�T��T
                  L_�:q����J{/�bh[�li3�=c�/>8�7���/w8Zd�7n�`P`P`P`P`P`�XQ

and add "@A00165:69:HKJ3YDMXX:1:1127:28203:31297 2:N:0:GTCCTTCT".

gsed -n '429343623,429343628p' rec.test.fastq
+
FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00165:69:HKJ3YDMXX:1:1127:28203:31297 2:N:0:GTCCTTCT
GCCCTGAAAAACAACAGTAATGATATTGTAAATGCTATTATGGAATTAACAATGTAACTATTTGACAGCGAAGACAACTCCCCCTTTCCCC
+


FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFF

ADD REPLY • link 3.6 years ago by akh22 ▴ 120

0

Yep, sorry, seeing this I have to go with GenoMax's point of view, I'm afraid.

Moreover, you don't need to change it to a '+'; you need to change it to the read header line, which starts with @ and contains crucial info for correct processing of your FASTQ file.

Better not to waste any more time on this. Those files are lost!

ADD REPLY • link 3.6 years ago by lieven.sterck 15k

0

I noticed you changed your post, omitting the replacement with '+'.

How do you know the line should be @A00165:69:HKJ3YDMXX:1:1127:28203:31297 2:N:0:GTCCTTCT?

ADD REPLY • link 3.6 years ago by lieven.sterck 15k

0

This is not garbage; these are chunks of binary data that somehow got mixed with the uncompressed text. It looks like the data are lost. Seconding GenoMax and lieven.sterck here: give it an rm *, there is not much you can reliably do about it.
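
To gauge how widespread such binary chunks are, one option is to scan the recovered file for lines containing bytes that plain FASTQ text should never have. The Python below is only a sketch under that assumption; `find_binary_lines` and `TEXT_BYTES` are hypothetical names, and the printable-ASCII rule is a heuristic, not a FASTQ validator.

```python
# Bytes expected in plain FASTQ text: printable ASCII plus tab and line endings.
TEXT_BYTES = bytes(range(0x20, 0x7F)) + b"\t\r\n"


def find_binary_lines(path, limit=20):
    """Return up to `limit` 1-based numbers of lines holding non-text bytes."""
    hits = []
    with open(path, "rb") as handle:
        for number, line in enumerate(handle, start=1):
            # Delete every expected byte; anything left over is suspicious.
            if line.translate(None, TEXT_BYTES):
                hits.append(number)
                if len(hits) >= limit:
                    break
    return hits
```

Lines are split on newline bytes, as sed does, so the reported numbers should line up with `gsed -n 'N,Mp'` output. Many hits spread across the file would support the view that the data are not reliably recoverable.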

ADD REPLY • link 3.6 years ago by ATpoint 86k

score 0 · Answer 1 · 2023-09-09

0

15 months ago

Tommaso • 0

This package won't add missing characters and won't heal your files, but it will at least clean your corrupted FASTQ files by recovering the remaining "healthy" reads.

Give it a try: https://github.com/mazzalab/fastqwiper
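
The general idea of keeping only the "healthy" reads can be sketched in a few lines of Python. This is an illustration of the approach under simple assumptions (four-line records, sequences made of A/C/G/T/N only), not fastqwiper's actual implementation; the names `is_record` and `keep_wellformed` are hypothetical.

```python
from collections import deque

HEADER_BYTES = bytes(range(0x20, 0x7F)) + b"\t"  # printable ASCII plus tab
SEQ_BYTES = b"ACGTNacgtn"                         # assumed sequence alphabet
QUAL_BYTES = bytes(range(0x21, 0x7F))             # printable ASCII quality codes


def is_record(header, seq, plus, qual):
    """Check the four-line layout: @header, sequence, +, quality of equal length."""
    return (
        header.startswith(b"@")
        and not header.translate(None, HEADER_BYTES)
        and len(seq) > 0
        and not seq.translate(None, SEQ_BYTES)
        and plus.startswith(b"+")
        and len(qual) == len(seq)
        and not qual.translate(None, QUAL_BYTES)
    )


def keep_wellformed(in_path, out_path):
    """Copy only well-formed records; return (kept_records, dropped_lines)."""
    kept = dropped = 0
    window = deque()
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        for raw in src:
            window.append(raw.rstrip(b"\r\n"))
            if len(window) < 4:
                continue
            if is_record(*window):
                dst.write(b"\n".join(window) + b"\n")
                kept += 1
                window.clear()
            else:
                # Slide forward by one line to resynchronise on the next record.
                window.popleft()
                dropped += 1
        dropped += len(window)  # trailing lines that never formed a record
    return kept, dropped
```

A read whose header is fused to binary data, like the one shown earlier in this thread, is dropped rather than repaired. The header in this thread carries `2:N:0`, which in Illumina headers normally marks read 2 of a pair, so the mate file would need the same reads removed to keep the pairs in sync.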

ADD COMMENT • link 15 months ago by Tommaso • 0