Hi,
I have been trying to recover corrupted FASTQ files. Decompression failed with this error:
invalid compressed data--crc error.
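For anyone who wants to check whether a given .gz file triggers this kind of error before running a pipeline on it, here is a minimal Python sketch. The function name gzip_is_intact is made up for illustration, and the sketch only reports whether the file decompresses cleanly; it does not repair anything.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Decompress the whole file, discard the output, and report (ok, message).

    Hypothetical helper: the name and the return format are assumptions
    made for this example only.
    """
    try:
        with gzip.open(path, "rb") as handle:
            # Reading to the end forces Python to verify the stored CRC32.
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error) as exc:
        # gzip.BadGzipFile (CRC or header problems) is a subclass of OSError;
        # EOFError signals a truncated file; zlib.error a damaged deflate stream.
        return False, f"{type(exc).__name__}: {exc}"
    return True, "ok"
```

Calling gzip_is_intact("reads.fastq.gz") (the file name is a placeholder) returns (True, "ok") for a healthy archive and (False, "<error>") otherwise.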
I got around the CRC error by using gzrecover and then used seqkit sana to fix sequence inconsistencies. Now the issue is that when I run FastQC, it complains that some sequences lack a "+" line under the sequence. I thought about using sed but am not sure how to add the missing "+" where it belongs.
Any help will be appreciated.
Update:
I ran ValidateFastq and found an issue:
INFO [2021-05-21 16:13:40,878] [ValidateFastq$$anonfun$main$1] - 107300000 reads processed
Exception in thread "main" htsjdk.samtools.SAMException: Quality header must start with +: GCCCTGAAAAACAACAGTAATGATATTGTAAATGCTATTATGGAATTAACAATGTAACTATTTGACAGCGAAGACAACTCCCCCTTTCCCC at line 429343625 in fastq /Volumes/Aura/rec.test.fastq
Should I be able to add "+" right below this line with sed?
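Before editing anything with sed, it may help to see exactly where the four-line record structure first breaks. The Python sketch below is one way to do that. It assumes strict four-line records (no wrapped sequences), reads the file as bytes so stray binary data cannot cause a decoding error, and uses a made-up function name, first_bad_record. Its line numbers need not match the ones ValidateFastq reports.

```python
def first_bad_record(path):
    """Return (line_number, reason) for the first malformed record, or None.

    Hypothetical helper; assumes every record is exactly four lines:
    @header, sequence, +, quality.
    """
    with open(path, "rb") as handle:
        line_no = 0  # number of lines consumed before the current record
        while True:
            record = [handle.readline() for _ in range(4)]
            if not record[0]:
                return None  # reached end of file without finding a problem
            header, seq, plus, qual = (line.rstrip(b"\r\n") for line in record)
            if not header.startswith(b"@"):
                return line_no + 1, "header does not start with @"
            if not plus.startswith(b"+"):
                return line_no + 3, "quality header does not start with +"
            if len(seq) != len(qual):
                return line_no + 4, "sequence and quality lengths differ"
            line_no += 4
```

Note that once a single line is missing or extra, every later record is shifted, so only the first reported position is meaningful.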
This sounds like a lost cause. Trying to fix corrupt data is not a good strategy: you can't be certain of the results you generate from it. Please go back and re-download the data.
If this was your only copy and it is now corrupt then you learned a valuable lesson. Always keep backup copies of all data.
Can you post a small extract of some of those corrupted lines?
This is the output around the problem line:
I have to delete this garbage and add "@A00165:69:HKJ3YDMXX:1:1127:28203:31297 2:N:0:GTCCTTCT".
Yep, sorry, but seeing this I'm afraid I have to go with GenoMax's point of view.
Moreover, you don't need to change it to a '+'; you need to change it to the read header line, which starts with @ and contains information crucial for correct processing of your FASTQ file.
Better not to waste any more time on this. Those files are lost!
I noticed you changed your post, omitting the replacement with '+'.
How do you know the line should be: @A00165:69:HKJ3YDMXX:1:1127:28203:31297 2:N:0:GTCCTTCT ?
This is not garbage; these are chunks of binary data that somehow got mixed in with the uncompressed text. It looks like the data are lost. Seconding genomax and lieven.sterck here: give it a rm *, as there is not much you can reliably do about it.
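To get a feel for how much binary data ended up in the recovered text, a rough Python sketch like the one below can list the first few affected lines. It is a diagnostic only, not a repair. The name lines_with_binary is made up for this example, and the idea that a FASTQ line should contain only printable ASCII (plus tab) is an assumption of the sketch.

```python
# Bytes expected in a plain-text FASTQ line: printable ASCII plus tab.
# (Assumption for this sketch, not something taken from a specification.)
TEXT_BYTES = bytes(range(0x20, 0x7F)) + b"\t"


def lines_with_binary(path, limit=10):
    """Return up to `limit` 1-based line numbers that contain non-text bytes.

    Hypothetical helper; lines are split on newline bytes only.
    """
    hits = []
    with open(path, "rb") as handle:
        for line_no, line in enumerate(handle, start=1):
            # translate() deletes every allowed byte; anything left is suspicious.
            if line.rstrip(b"\r\n").translate(None, TEXT_BYTES):
                hits.append(line_no)
                if len(hits) >= limit:
                    break
    return hits
```

If this returns more than a handful of hits, that supports the view above that the file cannot be repaired reliably.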