BBTools randomreads.sh simulator creates headers too long for samtools?
4.1 years ago
dp ▴ 50

It seems that the read simulator from BBTools (randomreads.sh) creates reads with names that are too long for samtools. Has anyone else run into this? How can I get around it?

bbtools bbsuite samtools • 1.1k views

Can you post your command and an example of the read headers?

Simple read headers should look like this:

@0 1:
TGGTTACAGAGAAGGCTATTTGAACTCTACTAATGTCACTATTGCAACCTACTGTACTGGTTCTATACCTTGTAGTGTTTGTCTTAGTGGTTTAGATTCGTTAGACACCTATCCTTCTTTAGAAACTATACAAATTACCATTTCATCTTT
+
8;9=898:;@>B9DAD89AB;C?EDEH<?B8AE9AD>?ACBC8>:@98=CFB?A;A=?9>:D;@BAAE>>@8A<>@=<AE=G>:C@@8;;8B=>>D@E?88=;CA89::;?ABD>=@8:><8=C=E?@888?@A8D<8C@9=8?888888
@1 1:
TACTGGCGATAGTTGTAATAACTATATGCTCACCTATAACAAAGTTGAAAACATGACACCCCGTGACCTTGGTGCTTGTATTGACTGTAGTGCGCGTCATATTAATGCGCAGGTAGCAAAAAGTCACAACATTGCTTTGATATGGAACGT
+
565;7:;77?@><DBEADBC@B?DAAEAD@?CA@E@>?:B>?CCD>>A@AACC?C@BA=A?DBC?C@<BBB?????>7DD>A;B<AB?;>A>A?<=@@>C>?@?9??>E>@>C?<?>?6<=>C=@=A>D<9?=8:<::5;6B57555555

Here's an example:

@SYN_0_24445_24594_23974_+339279274_1.NZ{CP029979.1$Escherichia$coli$strain$99-3165$plasmid$unnamed1&0_621_770_23974-339255450_1._NZ{CP029979.1$Escherichia$coli$strain$99-3165$plasmid$unnamed1 1:

This one isn't over the length limit (254 characters, I believe), but there seems to be at least one that is, and it causes samtools to error out partway through.

I'm not sure what command was used. I'm trying to help out a user and realised that this was the problem. Is there an option in the simulator to suppress these long names?
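To find the offending reads, one option is to scan the FASTQ file for names above the limit. The sketch below is only an illustration: it assumes plain 4-line FASTQ records and a limit of 254 characters (the maximum QNAME length in the SAM specification), and the function name `long_read_names` is made up for this example.

```python
import sys

# Assumption: 254 is the maximum QNAME length allowed by the SAM specification.
MAX_QNAME_LEN = 254


def long_read_names(fastq_lines, max_len=MAX_QNAME_LEN):
    """Yield (record_index, name) for records whose read name exceeds max_len.

    The name is the header up to the first whitespace, without the leading '@'.
    Assumes strict 4-line FASTQ records (header, sequence, '+', qualities).
    """
    for i, line in enumerate(fastq_lines):
        if i % 4 != 0:  # only the first line of each record is a header
            continue
        fields = line.rstrip("\n")[1:].split(None, 1)
        name = fields[0] if fields else ""
        if len(name) > max_len:
            yield i // 4, name


if __name__ == "__main__":
    # Usage (hypothetical file name): python find_long_names.py reads.fq
    with open(sys.argv[1]) as handle:
        for index, name in long_read_names(handle):
            print(f"record {index}: {len(name)} characters: {name[:60]}...")
```

If this reports nothing, the samtools error probably has a different cause than name length.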


This is likely because the fasta input file had an overly long header. An additional issue may be the $ and { characters in the names (not sure why they are there), which samtools likely does not like. I think you are best off regenerating these reads after modifying the fasta header to something acceptable.

If that is not possible, then you could chop off the remainder of the fastq header after @SYN_0_24445_24594_23974_+339279274_1. I think the first part should be unique for all reads, but you can confirm.
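As a rough sketch of the header clean-up, the following cuts each fasta header at the first whitespace before the reads are regenerated. It assumes the reference headers consist of an accession followed by a free-text description, and that the accession alone is unique; the function name and the file arguments are hypothetical.

```python
import sys


def shorten_fasta_headers(lines):
    """Yield fasta lines with each header cut at the first whitespace.

    Sequence lines are passed through unchanged.
    """
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split(None, 1)
            name = fields[0] if fields else ""
            yield ">" + name + "\n"
        else:
            yield line


if __name__ == "__main__":
    # Usage (hypothetical file names): python shorten.py ref.fa ref.short.fa
    with open(sys.argv[1]) as src, open(sys.argv[2], "w") as dst:
        dst.writelines(shorten_fasta_headers(src))
```

The shortened file can then be given to the simulator in place of the original reference.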
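A minimal sketch of that truncation, which also checks the uniqueness assumption, might look like the following. It assumes 4-line fastq records and that the part of the name before the first `.` identifies the read; the trailing dot itself is dropped, and anything after the first space (such as ` 1:`) is kept. The function names are invented for illustration.

```python
import sys


def truncate_header(header):
    """Cut the read name at the first '.', keeping any comment after a space."""
    name, sep, comment = header.rstrip("\n").partition(" ")
    short = name.split(".", 1)[0]
    return short + sep + comment + "\n"


def truncate_fastq(lines):
    """Yield fastq lines with shortened headers.

    Raises ValueError if two records end up with the same shortened name.
    Assumes strict 4-line records, so headers are lines 0, 4, 8, ...
    """
    seen = set()
    for i, line in enumerate(lines):
        if i % 4 == 0:
            line = truncate_header(line)
            name = line.split(None, 1)[0]
            if name in seen:
                raise ValueError(f"duplicate read name after truncation: {name}")
            seen.add(name)
        yield line


if __name__ == "__main__":
    # Usage (hypothetical file names): python truncate.py reads.fq reads.short.fq
    with open(sys.argv[1]) as src, open(sys.argv[2], "w") as dst:
        dst.writelines(truncate_fastq(src))
```

For paired reads stored in two files, each file would be processed separately, since mates are expected to share a name.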


OK, thanks. If I understood correctly, this comes from the header of the reference sequence that the reads are simulated from? Do you know what the part after the & sign is?

How would you suggest cutting the headers - just cut every header after the first . ?

Is there a way to get the header to just be a number, as in your example above? Is there a specific flag to pass to the simulator to get this behaviour? I assume the current format, with all the numbers etc., is there to keep track of where each read came from?

Thanks again


If you just want them to be numbers you can simply use illuminanames=t.
