Hi all,
I have a couple of FASTQ files containing reads whose names start with different prefixes, like:
@HWI-ST865:463:C7C8KACXX:2:2316:21016:100943 1:N:0:TAAGGCGA
@HWI-ST1178:227:C7C95ACXX:7:1101:1581:2125 1:N:0:TAAGGCGA
My question is: how can I split them into two parts? I tried some tools such as fastx_toolkit, but I could not create a proper barcode file. Is there an easy way to do this, such as a grep command? I also tried grep, but the output contained only the first line of each read and missed the other three.
Thank you in advance!
So you want to split on the "HWI-STXXX" bit? Or should every unique ID go to a different output file?
Probably a mixed data set. Of late some submitters have been merging data from multiple flowcells/machines into one file for SRA submission (beats me why they do it) and this could be a case of that sort.
Yes, this is exactly the case, but it was done accidentally. Two different people sequenced the same sample on two different sequencers without being aware of it, and then they decided to merge the outputs.
Hahahah :D Awesome. Was it the exact same biological sample? If so, is the data publicly available? If it is, I'd be interested in looking at the QC data, to see how much of an effect the sequencing machine etc. really has on the downstream statistics.
Yes, exactly the same sample, but different capture processes with the same kit (exome sequencing). Unfortunately the data are not publicly available... my boss will kill me if I do that! Sorry... hahaha... Anyway, I suppose this will affect the analysis anyhow, because the capture process was different even though they used the same kit and protocol. You know, sometimes things work almost 100% and sometimes not.
No worries man - there's more than enough data to go around :) And yeah, maybe a different capture process will highlight different exons better, who knows, it might not be a waste at all!
Split them by "HWI-STXXX"
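In case it helps, here is a minimal Python sketch of splitting on the "HWI-STXXX" bit. It is not taken from any existing tool; it assumes plain (uncompressed) FASTQ with exactly four lines per read and no wrapped sequences, and the function names and the output naming scheme (`<prefix>.<instrument>.fastq`) are made up for illustration. It takes everything between the leading "@" and the first ":" of each read name as the key and writes the whole four-line record to the matching file, so the sequence and quality lines are not lost the way they are with a plain grep on the read name.

```python
import itertools


def instrument_id(header):
    """Return the text between the leading '@' and the first ':' (e.g. 'HWI-ST865')."""
    return header[1:].split(":", 1)[0]


def split_fastq_by_instrument(in_path, out_prefix):
    """Write each 4-line record to <out_prefix>.<instrument>.fastq.

    Assumes strict 4-line FASTQ records. Returns a dict of read counts
    per instrument ID.
    """
    handles = {}
    counts = {}
    try:
        with open(in_path) as fin:
            while True:
                # Pull the next four lines: name, sequence, '+', qualities.
                record = list(itertools.islice(fin, 4))
                if not record:
                    break
                if len(record) != 4 or not record[0].startswith("@"):
                    raise ValueError("input does not look like 4-line FASTQ")
                inst = instrument_id(record[0])
                if inst not in handles:
                    # Open one output file per instrument ID, on first sight.
                    handles[inst] = open(f"{out_prefix}.{inst}.fastq", "w")
                    counts[inst] = 0
                handles[inst].writelines(record)
                counts[inst] += 1
    finally:
        for handle in handles.values():
            handle.close()
    return counts
```

Calling something like `split_fastq_by_instrument("merged.fastq", "split")` (file names are placeholders) should give one output file per sequencer. For paired-end data you would run it on the R1 and R2 files separately with different prefixes. If you prefer to stay with grep, GNU grep's `-A 3` option prints the three lines following each match, but it also inserts `--` separator lines between non-adjacent matches, so the output needs cleaning afterwards.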