I'm aligning paired-end RNA-seq reads from the ENCODE project, but some of the FASTQ files contain reads whose Phred quality line is longer than the sequence line. This makes my pipeline fail with a FASTQ format error. Here is an example of the problematic reads:
@30DY0AAXX_HWI-EAS229_75:7:1:1:1661/1
TTCTACTACCAGCCCTTGGGGCACTCACCCCTGTGATCAAGCAATCATTGTCAATGACAAAGTGACTATTGAAGTT
+
IIIIIIIIIIIIIIIIIIIIII<IIIIIIIIII3I*II:IIIIGIIIIIIII/5IIIIB9&F+IIII*I<I/+>9F
@30DY0AAXX_HWI-EAS229_75:7:1:1:1818/1
TTAACCACGATTATGTGCACGTACCTATGTATTATTTCTATGCGTGTCGTTGTCGAGTGCAACAACTAACTGTGCG
+
+I$##I%%'5/I#($(&"$%$-+/)%+(,(%%*/%$$4@$%*8&%+/F+.%(6#%%#2(I&#%$&%'$%;&#%*+%1II@GI+(&&%ID$#&I#&$.#11'IIGI%&+'5%&&&?%$.&+%#($%'&5#+%%)'%%&&-#%%*0%)&$'&$
@30DY0AAXX_HWI-EAS229_75:7:1:1:1976/1
TTTGTGTCTTGTTCACGTTTCTGGTCTCGTAGCTTCTCCTCTCATCTCTTTGCATTTCTGTCCTTCCATGTCTGTG
+
@30DY0AAXX_IIIIIIIII&I;0&IIIII%III1,I#I0II'II,BI=$48IIB&II33&III5I3C+H?+6,'"&I-.#4./$6IIIIIIIIGIII&III/IIGII'I28I%-GAI(II123#G*$,8III3B-/,B221(3%?$;((#+2<$*.'%C"/1
@30DY0AAXX_HWI-EAS229_75:7:1:1:791/1
CTCAAGATGACATCAGTCCCATTTGTCTTAAGTCCTGGTGTTGTGTGGATGACAAGCAGAAGCCAGTTATGATGAC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII/IIIIIIIIIIIIIII?IIIIIIICIII7III@IIIII
The frequency of these "bad reads" isn't very high, but it isn't low enough to remove them manually from the big FASTQ files. Do you know of a tool that identifies reads with invalid FASTQ format, so that I can remove them from both paired-end FASTQ files?
I tried to write a script in Python, but I usually rely on the SeqIO module from Biopython, and it also fails on the problematic reads.
Any advice is welcome. Thanks for your time.
I ran into this problem before. Be careful: you need to record the position (line number) of each malformed record, because with paired-end FASTQ files you must remove the same record from the mate file as well.
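Here is a minimal sketch of that idea in plain Python, without Biopython. It assumes every record occupies exactly four lines (as in the example above) and that both mate files list the reads in the same order. The function names `read_fastq`, `is_valid` and `filter_pairs` are made up for illustration. Records are read strictly four lines at a time, since a quality line can itself start with `@`, and a pair is dropped when either mate has a quality string whose length differs from its sequence.

```python
from itertools import zip_longest


def read_fastq(handle):
    """Yield (header, seq, plus, qual) tuples, assuming strict 4-line records."""
    while True:
        lines = [handle.readline() for _ in range(4)]
        if not lines[0]:  # end of file
            return
        yield tuple(line.rstrip("\r\n") for line in lines)


def is_valid(record):
    """A record is kept only if its sequence and quality lengths match."""
    header, seq, plus, qual = record
    return (header.startswith("@")
            and plus.startswith("+")
            and len(seq) == len(qual))


def filter_pairs(in1, in2, out1, out2):
    """Write only pairs where both mates are valid; return (kept, dropped)."""
    kept = dropped = 0
    for rec1, rec2 in zip_longest(read_fastq(in1), read_fastq(in2)):
        if rec1 is None or rec2 is None:
            raise ValueError("mate files have different numbers of records")
        if is_valid(rec1) and is_valid(rec2):
            out1.write("\n".join(rec1) + "\n")
            out2.write("\n".join(rec2) + "\n")
            kept += 1
        else:
            dropped += 1
    return kept, dropped
```

The arguments are open text-mode file handles. For compressed input you could pass `gzip.open(path, "rt")` handles instead of `open(path)`. If your files wrap sequences over several lines, the four-line assumption does not hold and this sketch will not work as is.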