split fastq by @SEQID
8.7 years ago
2nelly ▴ 350

Hi all,

I have a couple of FASTQ files containing reads whose names start with different instrument IDs, for example:

@HWI-ST865:463:C7C8KACXX:2:2316:21016:100943 1:N:0:TAAGGCGA
@HWI-ST1178:227:C7C95ACXX:7:1101:1581:2125 1:N:0:TAAGGCGA

My question is: how can I split them into two parts? I tried some tools such as fastx_toolkit, but I could not create a proper barcode file. Is there an easy way to do this, such as a grep command? I also tried grep, but the output contained only the first line of each read and missed the other three.

Thank you in advance!

sequencing next-gen • 3.0k views

So you want to split on the "HWI-STXXX" bit? Or should every unique ID go to a different output file?


Probably a mixed data set. Of late some submitters have been merging data from multiple flowcells/machines into one file for SRA submission (beats me why they do it) and this could be a case of that sort.


Yes, this is exactly the case, but it was done accidentally. Two different people sequenced the same sample on two different sequencers without being aware of each other, and then they decided to merge the outputs.


Hahahah :D Awesome. Was it the exact same biological sample? If so, is the data publicly available? If it is, I'd be interested in looking at the QC data, to see how much of an effect the sequencing machine etc. really has on the downstream statistics.


Yes, exactly the same sample, but with different capture runs using the same kit (exome sequencing). Unfortunately the data are not publicly available... my boss will kill me if I do that! Sorry... hahaha. Anyway, I suppose this will affect the analysis, because the capture process was different even though they used the same kit and protocol. You know, sometimes things work almost 100% and sometimes they don't.


No worries man - there's more than enough data to go around :) And yeah, maybe a different capture process will highlight different exons better, who knows, it might not be a waste at all!


Split them by "HWI-STXXX"

8.7 years ago
Ram 44k

You can use either Heng Li's bioawk or grep -A 3. The former is an extension of awk that understands common biological data formats such as FASTQ, and the latter makes grep print each matching line plus the 3 lines that follow it.
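
A minimal sketch of the grep -A 3 approach, assuming GNU grep and strict four-line FASTQ records (no wrapped sequence lines). Two details are worth knowing: the pattern is anchored with ^ and includes the trailing colon, because a quality line can also begin with @; and GNU grep prints a "--" separator between non-adjacent groups of context lines, which --no-group-separator suppresses. The file names, the third read name, and all sequences and quality strings are made up for illustration; only the first two read names come from the question. The awk line is plain awk (not bioawk), shown as a possible alternative where --no-group-separator is not available.

```shell
# Build a tiny mixed FASTQ for demonstration.
# Sequences, qualities, and the third read name are invented.
cat > demo.fastq <<'EOF'
@HWI-ST865:463:C7C8KACXX:2:2316:21016:100943 1:N:0:TAAGGCGA
ACGTACGT
+
IIIIIIII
@HWI-ST1178:227:C7C95ACXX:7:1101:1581:2125 1:N:0:TAAGGCGA
TTGCAAGC
+
@HWIIIII
@HWI-ST865:463:C7C8KACXX:2:2316:21020:100950 1:N:0:TAAGGCGA
GGCCTTAA
+
IIIIIIII
EOF

# Header line plus the 3 lines after it, one output file per instrument.
# --no-group-separator (GNU grep) keeps "--" lines out of the FASTQ.
grep --no-group-separator -A 3 '^@HWI-ST865:'  demo.fastq > HWI-ST865.fastq
grep --no-group-separator -A 3 '^@HWI-ST1178:' demo.fastq > HWI-ST1178.fastq

# Plain-awk alternative: decide on every 4th line (the header) whether
# to keep the record, then print lines while the flag is set.
awk 'NR % 4 == 1 { keep = ($0 ~ /^@HWI-ST865:/) } keep' demo.fastq > HWI-ST865.awk.fastq

# Line counts should be multiples of 4 if the split kept whole records.
wc -l HWI-ST865.fastq HWI-ST1178.fastq
```

Note that the second record's quality string starts with "@HWI" on purpose: a loose pattern such as '@HWI' would match it and pull in lines from the next record, which is why the full instrument ID with its colon is used.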


I did not know about the -A flag, awesome! Thank you Ram :)


You're welcome. There are also the -B (before) and -C (around) flags.
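
A quick way to see the three context flags side by side, using a throwaway five-line file (the file name and contents are made up for illustration):

```shell
# Five lines; "line3" is the only match.
printf 'line1\nline2\nline3\nline4\nline5\n' > context_demo.txt

grep -A 1 'line3' context_demo.txt   # match and 1 line after:  line3, line4
grep -B 1 'line3' context_demo.txt   # 1 line before and match: line2, line3
grep -C 1 'line3' context_demo.txt   # 1 line on each side:     line2, line3, line4
```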


Ram, you're the best!!!! grep -A 3 worked, quickly and accurately.

It was such a simple addition of the -A parameter to my grep command.

Sequencing God bless you!


You're welcome. It is good that you were on the right track with grep. You may benefit from reading man grep and other such manuals when you have time - UNIX commands have a ton of features that are not evident at the outset.
