Question

Remove duplicates reads Ids

2

Entering edit mode

7.6 years ago

BioGeek ▴ 170

I mapped reads with

bwa mem -M -t 40 allCombinedFinalSet.fa Seq.R1.fastq Seq.R2.fastq > aln.sam

Extracted the mapped reads

samtools view -f 0x2 -b aln.bam > output.bam

Extracted the fastq

bamToFastq -i output.bam -fq R1.fq -fq2 R2.fq 

grep @HISEQ578:1035:HJ2KCBCXX:1:1104:14672:39678/1 R1.fq             []
@HISEQ578:1035:HJ2KCBCXX:1:1104:14672:39678/1
@HISEQ578:1035:HJ2KCBCXX:1:1104:14672:39678/1
@HISEQ578:1035:HJ2KCBCXX:1:1104:14672:39678/1

I notice it has duplicated ....

I think this because read was mapped twice (i.e. BWAmem).

I tried fastuniq but it does not remove the duplicated reads.

Can you please help me to remove duplicated reads from fastq files.

reads ids duplicates mapping fastq • 6.1k views

ADD COMMENT • link 7.6 years ago by BioGeek ▴ 170

0

Entering edit mode

What exactly are you trying to do?

ADD REPLY • link 7.6 years ago by GenoMax 147k

0

Entering edit mode

I am trying to mapped the allReads against "filtered" genome and extract mapped fastq for re-assembly.

ADD REPLY • link 7.6 years ago by BioGeek ▴ 170

score 1 · Answer 1 · 2017-05-02

1

Entering edit mode

7.6 years ago

Santosh Anand 5.8k

Those duplicates are secondary/supplementary alignments. You may try Picard SamToFastq for a better control over what to convert in Fastq from BAM. See the discussion here: http://bedtools.readthedocs.io/en/latest/content/tools/bamtofastq.html

ADD COMMENT • link 7.6 years ago by Santosh Anand 5.8k

score 0 · Answer 2 · 2017-05-02

0

Entering edit mode

7.6 years ago

k.balaji93 ▴ 10

Read Samtools man for flag descriptions. You want to filter out secondary and chimeric alignments, using samtools in the bam file. Check out Picard explain flag to get the exact flag to keep only primary alignments.

ADD COMMENT • link 7.6 years ago by k.balaji93 ▴ 10

score 0 · Answer 3 · 2017-05-02

0

Entering edit mode

7.6 years ago

Pierre Lindenbaum 164k

(assuming that the reads in R1.fq are in the same order + same number than R2.fq)

paste <(cat R1.fq | paste - - - - )  <(cat R2.fq | paste - - - - ) | LC_ALL=C sort -t $'\t' -k1,1  | tr "\t" "\n" > interleaved.fq

ADD COMMENT • link 7.6 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

bio@bio214b[biogeek] paste <(cat see.R1.fq | paste - - - - )  <(cat see.R2.fq | paste - - - - ) | LC_ALL=C sort -t $'\t' -k1,1  | tr "\t" "\n" > interleaved.fq
bio@bio214b[biogeek] grep HJ2KCBCXX:1:1104:14672:39678/1 interleaved.fq
@HISEQ578:1035:HJ2KCBCXX:1:1104:14672:39678/1
@HISEQ578:1035:HJ2KCBCXX:1:1104:14672:39678/1
@HISEQ578:1035:HJ2KCBCXX:1:1104:14672:39678/1
bio@bio214b[biogeek] grep HJ2KCBCXX:1:1104:14672:39678/2 interleaved.fq
@HISEQ578:1035:HJ2KCBCXX:1:1104:14672:39678/2
@HISEQ578:1035:HJ2KCBCXX:1:1104:14672:39678/2
@HISEQ578:1035:HJ2KCBCXX:1:1104:14672:39678/2

The duplicated still exists.

ADD REPLY • link 7.6 years ago by BioGeek ▴ 170

0

Entering edit mode

If all you need to do is remove duplicates then use dedupe.sh from BBMap suite. dedupe.sh in=orig.fq out=dedupe.fq Should work with in1= in2= out1= out2= for PE files.

ADD REPLY • link 7.6 years ago by GenoMax 147k

0

Entering edit mode

$ dedupe.sh in1=see.R1.fq in2=see.R2.fq out1=tt.R1.fq out2=tt.R2.fq          []
java -Djava.library.path=/home/Tools/bbmap/jni/ -ea -Xmx200291m -Xms200291m -cp /home/Tools/bbmap/current/ jgi.Dedupe in1=see.R1.fq in2=see.R2.fq out1=tt.R1.fq out2=tt.R2.fq
Executing jgi.Dedupe [in1=see.R1.fq, in2=see.R2.fq, out1=tt.R1.fq, out2=tt.R2.fq]

Exception in thread "main" java.lang.RuntimeException: Unknown parameter out1=tt.R1.fq
    at jgi.Dedupe.<init>(Dedupe.java:392)
    at jgi.Dedupe.main(Dedupe.java:82)

It gave an error !!!

ADD REPLY • link 7.6 years ago by BioGeek ▴ 170

0

Entering edit mode

While dedupe.sh will work to remove duplicate read ID's I realized that it will have an additional unintended consequence. It will actually de-duplicate your input dataset based on sequence. If you are able to accept that caveat then read on.

dedupe.sh requires input to be interleaved in case of paired-end reads so you would be able to use it like this:

Interleave reads
de-dupe reads
Split the deduped reads into separate R1/R2 files

reformat.sh in1=R1.fq in2=R2.fq out=stdout.fq | dedupe.sh in=stdin.fq out=stdout.fq | reformat.sh in=stdin.fq out1=newR1.fq out2=newR2.fq

I am going to tag Brian Bushnell to see if he has any other suggestion.

My apologies for not thinking this through completely.

ADD REPLY • link 7.6 years ago by GenoMax 147k

0

Entering edit mode

I don't really have a tool for this purpose, once you already have a bam file. Instead, I'd recommend using BBMap, like this:

bbmap.sh t=40 ref=allCombinedFinalSet.fa in1=Seq.R1.fastq in2=Seq.R2.fastq outm=mapped.fq outu=unmapped.fq

The mapped reads will go to mapped.fq, and the unmapped reads will go to unmapped.fq. Both files will be interleaved, and pairs will be kept together, so if only one read in a pair maps to the reference, both reads will still go to mapped.fq.

ADD REPLY • link 7.6 years ago by Brian Bushnell 20k

score 0 · Answer 4 · 2017-05-02

    #!/usr/bin/perl-w
use strict;
use warnings;

my %id2seq=();
my $key = '';
$/ = '@HISEQ';
my $add='@HISEQ';
while(<>){
    chomp;
    my ($defLine, @seqLines) = split /\n/, $_;
    my $sequence = join("\n",@seqLines);
        $id2seq{$defLine} = $sequence;
}

foreach(keys %id2seq){
    my $name="$add"."$_";  
    print join("\n",$name,$id2seq{$_}),"\n";
}

I tried my above mentioned perl script to remove, and in first look it seems to be working.

bio@bio214b[biogeek] grep HJ2KCBCXX:1:1104:14672:39678/2 test.R2.fq 
@HISEQ578:1035:HJ2KCBCXX:1:1104:14672:39678/2
bio@bio214b[biogeek] grep HJ2KCBCXX:1:1104:14672:39678/1 test.R1.fq
@HISEQ578:1035:HJ2KCBCXX:1:1104:14672:39678/1

Is that safe to use? It seems distorting the format.