Hi,
Whatever language you choose, the files will remain big anyway, and I do not think it is worth writing C code just for this.
I do not know which operating system you are working on, but assuming you are on Linux (or an equivalent), you can have a look at the sed command.
You must pay attention to the fact that the characters '/', '1' and '2' are in the range of those used for encoding quality. So when you do a substitution, you have to make sure that the line is indeed a read-name line and not a quality string.
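To make the point about the quality range concrete, here is a small sketch that decodes those three characters. It assumes Phred+33 encoding (the Sanger / Illumina 1.8+ convention, which the '#' and '+' characters in the example quality string suggest); the names PHRED_OFFSET and phred_score are only illustrative.

```python
# Assumption: Phred+33, where a quality score Q is stored as chr(Q + 33).
PHRED_OFFSET = 33

def phred_score(char):
    """Return the quality score encoded by a single quality character."""
    return ord(char) - PHRED_OFFSET

for char in "/12":
    # Each of these decodes to an ordinary, plausible score.
    print(repr(char), "-> Q" + str(phred_score(char)))
# '/' -> Q14
# '1' -> Q16
# '2' -> Q17
```

So a '/1' or '/2' can legitimately appear anywhere in a quality string, including at its very end.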
Take this example (file example.fastq):
@HWI-ST1019:196:/121WACXX:5:1101:1538:2300/1
CCGCGACCTCTGTTCTGCAGCCCCTTCCCTTCCCCGCCTCCTGCTCTGCCGGGACTACGCACCGGCCTGATTGGTTACCCCCGGGGTGTCCTCGGTCACCA
+
1+++4)<@<A<+2A9A2++:3C8:)1?BDBDBBDC@::6@(.8..7)777:<?@@######################################/1######
There are three occurrences of '/1': two in the read name and one in the quality string. You only want the one at the end of the read name to be modified.
Then execute (assuming all your read-name lines begin with @HW):
sed '/^@HW/ s/\/1$/-1/' example.fastq
This produces:
@HWI-ST1019:196:/121WACXX:5:1101:1538:2300-1
CCGCGACCTCTGTTCTGCAGCCCCTTCCCTTCCCCGCCTCCTGCTCTGCCGGGACTACGCACCGGCCTGATTGGTTACCCCCGGGGTGTCCTCGGTCACCA
+
1+++4)<@<A<+2A9A2++:3C8:)1?BDBDBBDC@::6@(.8..7)777:<?@@######################################/1######
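If your read names do not share a common prefix, another option is to pick the header by its position in the record instead of by its content. Below is a minimal sketch of that idea, assuming strict four-line records and that both '/1' and '/2' suffixes should be rewritten; the function name rename_mates and the shortened test record are made up for illustration. It streams line by line, so the file is never held in memory.

```python
import io

def rename_mates(lines):
    """Yield FASTQ lines with a trailing '/1' or '/2' on each header
    rewritten as '-1' or '-2'.

    Assumption: every record is exactly four lines, so headers are
    lines 0, 4, 8, ... Quality lines are never touched.
    """
    for index, line in enumerate(lines):
        if index % 4 == 0:
            name = line.rstrip("\r\n")
            ending = line[len(name):]  # keep the original line ending
            if name.endswith(("/1", "/2")):
                name = name[:-2] + "-" + name[-1]
            line = name + ending
        yield line

# Shortened record whose quality string also ends in '/1'.
record = (
    "@HWI-ST1019:196:/121WACXX:5:1101:1538:2300/1\n"
    "CCGCGACC\n"
    "+\n"
    "1+++4)/1\n"
)
print("".join(rename_mates(io.StringIO(record))), end="")

# For a real file (hypothetical file names):
# with open("example.fastq") as src, open("renamed.fastq", "w") as dst:
#     dst.writelines(rename_mates(src))
```

Only the header changes; the quality line keeps its '/1'. The trade-off is that this breaks on wrapped (multi-line) FASTQ, where the sed pattern on the read-name prefix would still work.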
Is it REALLY necessary to read such a huge file just to replace a '/' with a '-'?
Well, files are never too big to parse with Perl if you don't keep the file in RAM.
Note: as far as I can see, all the examples below assume that one FASTQ record takes four lines.
This is generally the case with FASTQ parsers. I'd be curious whether you have an alternative. One of the flaws of FASTQ, in my opinion, is that each record is spread over multiple lines with no guaranteed record separator.
See: How common are multi-line fastq files?
I made a tool to convert the nonstandard multi-line fastq files into 4-line entries. Fastq files should be 4 lines per entry because they are too difficult to parse otherwise. https://sourceforge.net/p/cg-pipeline/code/425/tree/cg_pipeline/branches/lkatz/scripts/run_assembly_convertMultiFastqToStandard.pl
While I agree that multi-line FASTQ is inconvenient, it is not nonstandard, as no official documentation forbids multi-line FASTQ. Seqtk is probably the most efficient way so far to convert between multi-line and 4-line FASTQ.
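For anyone wondering why multi-line FASTQ is awkward to parse: a quality line may itself start with '@' or '+', so those characters cannot be trusted as record markers. One common workaround is to collect sequence lines up to the '+' line and then read quality lines until as many quality characters as bases have been seen. The sketch below illustrates that idea only; it is not the code of the Perl script linked above or of seqtk, and the name read_fastq is made up.

```python
def read_fastq(lines):
    """Yield (name, sequence, quality) from FASTQ text whose sequence
    and quality may be wrapped over several lines."""
    lines = (line.rstrip("\r\n") for line in lines)
    for line in lines:
        if not line:
            continue  # tolerate blank lines between records
        if not line.startswith("@"):
            raise ValueError("expected '@' header, got: " + line)
        name = line[1:]
        seq_parts = []
        for line in lines:
            if line.startswith("+"):
                break  # separator line ends the sequence
            seq_parts.append(line)
        else:
            raise ValueError("no '+' separator for record " + name)
        sequence = "".join(seq_parts)
        quality = ""
        # Count characters instead of looking for markers, because a
        # quality line can begin with '@' or '+'.
        while len(quality) < len(sequence):
            try:
                quality += next(lines)
            except StopIteration:
                raise ValueError("truncated quality for record " + name) from None
        if len(quality) != len(sequence):
            raise ValueError("quality longer than sequence for record " + name)
        yield name, sequence, quality

# A wrapped record whose quality lines start with '@' and '+'.
wrapped = ["@read1\n", "ACGT\n", "ACGT\n", "+\n", "@III\n", "+III\n"]
for name, sequence, quality in read_fastq(wrapped):
    # Write the record back out as a standard four-line entry.
    print("@" + name, sequence, "+", quality, sep="\n")
```

Once records come out as (name, sequence, quality) tuples, any renaming can be done on the name alone, with no risk of touching a quality string.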
Are there cases where there are fewer lines for a FASTQ record? (I write quickie .fq parsers every now and again, so I'm genuinely curious.)