bbMAP repair.sh killed line #: random number
1
0
Entering edit mode
3.2 years ago
efountain ▴ 20

Hi,

I am trying to repair a .fq file using bbMap repair.sh command and I am running into an error. My .fq file is pair-end but is not sorted or interleaved, hence, needing to repair it before moving on to my next analysis. When I run repair.sh it starts and then always errors out at the same spot, Line 87: 23354 Killed I have changed the amount of memory allocated (-Xmx) as I thought I might be running into memory issues but this is not fixing the problem.

I am not sure what this error is pertaining to. I looked at my .fq on line 87 and saw no errors. Pasted below is lines 85-88 from my .fq.

@38_1_1101_31566_1016/1
CTGCAGGATCCTTCTCTGGGTTTCCCACCCCGTCCTCCTGGAATTTCACCACTTTCCTCCTGCCCAGCTGATGAGCTGATCCCACAGCAGATTCGGGGCTCCCTGATCCTGGAATTTTCTCCTTGTTTTTCTCAGGTTTT
+
FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF:FFFFFFF:FFFFFFFFFFFFFFFFFFFFF::FFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFF:FFFFFFFF,FFF,,FFFFFFFFFFFF

I do not see any errors with the file, so I am thinking this may be an error somewhere with how I am running the command.

Any help would be appreciated in figuring out what the error means and how I can get bbmap to repair this file.

Thank you!

bbMAP • 4.7k views
ADD COMMENT
0
Entering edit mode

I want to see how you are running repair.sh. Can you post the entire command?

ADD REPLY
0
Entering edit mode

Hi, I am running on local machine. Here is the command I am entering:

./repair.sh -Xmx16G in=4-B.fq out=4-B_sorted.fq outs=4-B_singletons.fq
ADD REPLY
0
Entering edit mode

My .fq file is pair-end but is not sorted or interleaved,

I suggest you add fint=t to the command and try. You may need more than 16G of RAM depending on how large your input file is.

fint=f          (fixinterleaving) Fixes corrupted interleaved files using read
                names.  Only use on files with broken interleaving - correctly
                interleaved files from which some reads were removed.

If you reads don't have corresponding /2 designation indicating read 2 then you may also need to use ain=t.

ain=f           (allowidenticalnames) When detecting pair names, allows
                identical names, instead of requiring /1 and /2 or 1: and 2:

Where did you get this file from? Show us four reads from the file. If this file does not contain interleaved data then you can't use repair.sh.

ADD REPLY
0
Entering edit mode

Update to my problem: interleaving was broken I think. I now have bbMap working, or so it appears, I use this command instead:

./bbmap/repair.sh -Xmx64G in=4-B.1.fastq in2=4-B.2.fastq out=4-B_sorted.1.fq out2=4-B_sorted.2.fq outs=4-B_singletons.fq

Our outputs files looked like they were repaired, here are a few lines from the sorted R1 and R2.

R1 file:

 @38_1_1101_1416_1000/1
CTGCAGGAATGATTTTGTAGCTTAAAGTGTAGCTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTCCTTCTATCTCGTATGCCGTCTTCTGCTTGAAACGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FF,FFFFF,FFF::,FF,F,,:F,F,,,,F:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@38_1_1101_4453_1000/1
CTGCAGAACGTGGCAGTGCAGCTGCTGCTGCCCCAAAACACACGGAAGAAAACTTCAGTGCAGGGACACCTCCCACTGTCCCAGGCTGCTCCAAGCCCCAATGTCCAGCCTGGCCTTGGGCACTGCCAGGGATCCAGGGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@38_1_1101_8793_1000/1
CTGCAGTGCTTAGTTACAAAAAATGAACTCTGTGCTGCTTGGGCAGTCTGTTGATATTTGTATTGCCATGAAATTGCAGACTCAGTGACACTCAGTGGGCTGCTGCCGACCACAAGCTCTGAACTACTGTAGTGACCTTG

R2 file:

@38_1_1101_1416_1000/2
TTAAGCTACAAAATCATTCCTGCAGTGACCATGAAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTTGCAAACCGTGTAGATCTCGGTGGTCGCCGTATCATTAAATAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFF:F,FFFF,:FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF:FFFFF::,F:FF:,,,,:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@38_1_1101_4453_1000/2
TTAACATCTTGCAGTGGTGGGATTCCAAAGCAGGTCAGGTGTTTTAGGTTTTCCCTTGCTGTTGAGTACTTTGGTTGTACTGCTCAGAAATTAGGAATAAATCCTTTACTTCAGGAAGGAATTGTTCCCTGTGAGGGTGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@38_1_1101_8793_1000/2
TTAAAAAAGTCAACAGAAGTGTTGCAAAATGTTCAGTTCTGTTCAAAAATACCTACTAGCTATACATTAGGCAACTCTCCGAATACATGATAACTTCTGGGAATGTCAACCAAAGAGTGTACAGACTGGTTGGTAAATTG

Our new problem is when trying to align against the reference genome using bwa mem none of the reads are recognized as pair-end. I am not sure if this is an issue with the bbMap output or something new with running bwa. Do you think this is a bbMap output issue and that something did not work when repairing the .fastq?

ADD REPLY
0
Entering edit mode

none of the reads are recognized as pair-end. I am not sure if this is an issue with the bbMap output or something new with running bwa.

This likely has nothing to do bbmap. You reads have old format illumina fastq identifiers (note the /1 and /2 in headers, that was the reason I asked you where you got this data from) which bwa mem may not be recognizing.

ADD REPLY
0
Entering edit mode

I think the OP needs to post the exact error, I believe bwa mem does recognize and ignore the /1 and /2 at the ends. Even the simulator wgsim used to evaluate bwa has those markers.

What is the exact error that you get?

ADD REPLY
0
Entering edit mode

Let me get the error from bwa and will repost for you. Thank you

ADD REPLY
0
Entering edit mode

Here is a subset of the output from bwa. I only copied several of the lines because there are over 9000 lines of the same thing. When I use samtools to check the stats of the .bam it shows there are no paired reads.

ADD REPLY
0
Entering edit mode

and I did not add the error.. so sorry.

[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 26148, 1, 1)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (129, 288, 331)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 735)
[M::mem_pestat] mean and std.dev: (233.22, 108.66)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 937)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 71430 reads in 48.594 CPU sec, 48.536 real sec
[M::process] read 71430 sequences (10000200 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (1, 26006, 1, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (130, 288, 331)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 733)
[M::mem_pestat] mean and std.dev: (233.85, 108.75)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 934)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 71430 reads in 47.258 CPU sec, 47.043 real sec
[M::process] read 71430 sequences (10000200 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (1, 25881, 0, 1)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (129, 288, 331)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 735)
[M::mem_pestat] mean and std.dev: (232.94, 108.87)
ADD REPLY
0
Entering edit mode

Those are not errors. It is normal to see those messages.

ADD REPLY
0
Entering edit mode

exactly these are progress reports on how the alignment is progressing, it even gives you estimates on the fragment length mean and standard deviations, thus evidently, it recognizes the data as paired

ADD REPLY
0
Entering edit mode

I have never had every read come back as skipping FF, RR, and RF orientation as not enough pairs being found and then having zero matched pairs being found by the end of the alignment. I realize that output is just the log of the alignment progressing, but generally my log does not show not enough pairs being found for everything. That to me would be an error if my input .fqs all seem to be matched and the stats of the .bam has no paired reads.

ADD REPLY
0
Entering edit mode

if the input were not paried bwa would stop and print an error explicitly stating which read pair did not match.

The messages that state something about pairs being skipped are not errors, those are also a log messages that tell you that specific orientation is missing, not that the reads are not paired. Usually, you would not want FF or RF orientation to be present unless the data was of that type.

it is true that the messages are highly confusing, as they seem to suggest there aren't enough pairs, where what it says that there are no FF pairs. In fact there shouldn't be any FF, RF pairs if the data is FR

if it does not stop with that error it means the data is all paired.

What does your samtools flagstat bamfile say about the alignments?

ADD REPLY
0
Entering edit mode

Thank you. This is NovaSeq data that we just got back from the sequencing facility and we had no problem with older HiSeq data so I just used the same methods as we used with the HiSeq data and now we have all these issues. At least I now have bbmap figured out and will move on to trying to figure out my bwa problem.

ADD REPLY
0
Entering edit mode

That is interesting. It looks like the facility is actively changing the headers produced by illumina software.

ADD REPLY
0
Entering edit mode
3.2 years ago

I suspect the error comes from line 87 of the script itself:

cat -n `which repair.sh` | tail -7

prints:

84  repair() {
85      local CMD="java $EA $EOOM $z -cp $CP jgi.SplitPairsAndSingles rp $@"
86      echo $CMD >&2
87      eval $CMD
88  }

the task of repairing reads is somewhat simple, you can achieve that in a few lines of programming a script like this would do it

import sys

def parse(fname):
    store = {}
    stream = open(fname)
    for name in stream:
        seq, tmp, qual = next(stream), next(stream), next(stream)
        store[name] = (seq, qual)

    return store

s1 = parse(sys.argv[1])
s2 = parse(sys.argv[2])

o1 = open("fixed1.fq", "wt")
o2 = open("fixed2.fq", "wt")
for key in s1:
    if key in s2:
        seq1, qual1 = s1[key]
        seq2, qual2 = s2[key]
        o1.write(f"{key}{seq1}+\n{qual1}")
        o2.write(f"{key}{seq2}+\n{qual2}")

run it as:

python repair.py  file1.fq  file2.fq 

creates the outputs fixed1.fq and fixed2.fq (some bugs may be present as I just dumped this out with minimal checking :-) )

ADD COMMENT
0
Entering edit mode

I didn't even think about the fact that it could be the script itself. I was just auto assuming it was my file causing problems and I couldn't figure out what it was.

I will try your code and let you know how I get on.

Thanks!

ADD REPLY

Login before adding your answer.

Traffic: 1953 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6