Split a sequence in a fastq file
6.1 years ago
ste.lu ▴ 80

Hi All,

Could you suggest a way to split a read in a fastq file (on a particular motif) and keep the 2 resulting sequences as 2 independent reads?

I'll give an example of what I want to do:

@K00252:388:H2LM2BBXY:3:1101:1397:1138 1:N:0:ATCACG

TGTGACCTTCAGGACAGTCCTAAGGCTGTGGGAAAAACACTNAAAACATGAGTTCAAAAATATATATATATTTTCCCAACTATGCAAAAATATAAGGATGCAATATGGATTGTATAATGAGCTTCACAGATATAAAGGAACAGNGGCAT

+

AAAAJJ77<7JJJ7FAJJJJJJJFFFJF< FFF7AFJJJJFA#JFJJFJJJJ< AA-F-< JJFJAJFAAJ< JJJJJ--<<< -FFFF7AJJJJFFJJAFFFFA<<-7< FFJA< JJJJAJF< AAFF7-F< AF-A7A-< -< J-FFJ< F#AJAA<

Then search for a motif, e.g. TATATATATA, cut the read at that motif, and keep the two resulting fragments as two reads:

@K00252:388:H2LM2BBXY:3:1101:1397:1138 1:N:0:ATCACG

TGTGACCTTCAGGACAGTCCTAAGGCTGTGGGAAAAACACTNAAAACATGAGTTCAAAAATATATATAT

+

AAAAJJ77<7JJJ7FAJJJJJJJFFFJF< FFF7AFJJJJFA#JFJJFJJJJ< AA-F-< JJFJAJFAAJ< JJJJJ

@K00252:388:H2LM2BBXY:3:1101:1397:1138 1:N:0:ATCACG

TTTTCCCAACTATGCAAAAATATAAGGATGCAATATGGATTGTATAATGAGCTTCACAGATATAAAGGAACAGNGGCAT

+

--<<< -FFFF7AJJJJFFJJAFFFFA<<-7< FFJA< JJJJAJF< AAFF7-F< AF-A7A-< -< J-FFJ< F#AJAA<

Thank you

fastq sequencing sequence next-gen • 2.7k views

I'd suggest writing a Biopython script for something like that. Do you have any programming experience?

Thanks for your answer. I've coded a bit, but my background is different. What would you suggest? A link to put me on the right track would be more than enough.

I'd recommend going through some sections of the Biopython cookbook and tutorial. That would put you on track on how to solve this and further questions about handling common file formats.

While one-liners like Pierre's are pretty (and efficient), it would probably take me less time to write it in Python, especially if I have scripts saved from earlier, similar applications that I only have to adapt a bit.

6.1 years ago

Linearize, use awk to detect the position of the pattern, print the two sequences, convert back to fastq:

cat input.fastq |\
paste - - - - |\
awk -F '\t' 'BEGIN{S="TATATATATA";N=length(S);}{i=index($2,S);if(i==0) {print} else {printf("%s\t%s\t+\t%s\n%s\t%s\t+\t%s\n",$1,substr($2,1,i),substr($4,1,i),$1,substr($2,i+N),substr($4,i+N));}}' |\
tr "\t" "\n"
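
For readers who prefer a script, the same steps (read four lines per record, locate the motif, write two records) could be sketched in plain Python roughly as follows. This is only a sketch under assumptions: the function names `read_fastq`, `split_on_motif` and `split_fastq` and the toy reads are made up for illustration, the input is assumed to be strict four-line FASTQ, and the cut is made immediately before the motif with the motif itself dropped, which is just one reading of the question.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, quality) from a strict 4-line-per-record FASTQ stream."""
    while True:
        chunk = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(chunk) < 4:  # end of input (a trailing partial record is ignored)
            return
        header, seq, _plus, qual = chunk
        yield header, seq, qual


def split_on_motif(header, seq, qual, motif):
    """Split one read at the first occurrence of motif, dropping the motif itself."""
    i = seq.find(motif)  # 0-based start of the motif, -1 if absent
    if i == -1:
        return [(header, seq, qual)]  # no motif: keep the read unchanged
    j = i + len(motif)  # index of the first base after the motif
    return [(header, seq[:i], qual[:i]), (header, seq[j:], qual[j:])]


def split_fastq(in_handle, out_handle, motif):
    """Write every input read, split where the motif occurs, as 4-line FASTQ records."""
    for header, seq, qual in read_fastq(in_handle):
        for h, s, q in split_on_motif(header, seq, qual, motif):
            out_handle.write(f"{h}\n{s}\n+\n{q}\n")


# Toy input (hypothetical reads): read1 contains the motif, read2 does not.
demo = io.StringIO(
    "@read1\nACGTTATATATATAGGCC\n+\nIIIIJJJJJJJJJJFFFF\n"
    "@read2\nACGT\n+\nIIII\n"
)
out = io.StringIO()
split_fastq(demo, out, "TATATATATA")
print(out.getvalue(), end="")
```

With real files, `split_fastq(open("input.fastq"), open("output.fastq", "w"), "TATATATATA")` would be the equivalent call. Note that both fragments keep the original header, as in the question, and that a motif at the very start or end of a read produces an empty fragment.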

Hi Pierre,

Thanks for your script! This way I keep all the reads, the original one and the two derived ones, right?

no, you will only get the two substrings as output. But that's what you asked for, no?

Yeah, definitely. Thanks!

Lovely one-liner, Pierre Lindenbaum!

Some remarks though: I think the motif is missing from your output (at least that's what I understood from the OP's example, which still includes the motif), and there might be an off-by-one mistake in it as well?

"an off-by-one mistake in it as well?"

Maybe :-D
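
To make the two remarks concrete: awk's `index()` returns the 1-based position of the first motif base, so `substr($2,1,i)` returns `i` characters, the last of which is already part of the motif. The small Python sketch below mimics the two awk functions to compare the possible cut points. The helpers `awk_index` and `awk_substr` and the toy sequence are made up for illustration, and `awk_substr` only covers the `start >= 1` case.

```python
def awk_index(s, t):
    """Mimic awk's index(): 1-based position of t in s, 0 if t is absent."""
    return s.find(t) + 1


def awk_substr(s, start, length=None):
    """Mimic awk's substr() for start >= 1: 1-based start, optional length."""
    begin = start - 1
    if length is None:
        return s[begin:]
    return s[begin:begin + max(length, 0)]


seq = "ACGTTATATATATAGGCC"  # toy read: ACGT + motif + GGCC
motif = "TATATATATA"
i = awk_index(seq, motif)  # 5: the motif starts at the 5th base
n = len(motif)

as_posted = awk_substr(seq, 1, i)            # left part as in the one-liner
without_motif = awk_substr(seq, 1, i - 1)    # stops just before the motif
with_motif = awk_substr(seq, 1, i + n - 1)   # keeps the whole motif on the left
right = awk_substr(seq, i + n)               # everything after the motif

print(as_posted)      # ACGTT  (one motif base too many)
print(without_motif)  # ACGT
print(with_motif)     # ACGTTATATATATA
print(right)          # GGCC
```

So, assuming the motif should be dropped, the left fragment would use length `i-1`; if it should stay attached to the left read, as one commenter reads the original example, length `i+N-1` would do it. The quality string has to be cut with the same numbers in either case.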
