Question

Filtering out incorrect index combinations in fastq file

0

Entering edit mode

7.1 years ago

craigdj91 • 0

I have a combined paired-end FASTQ file where each sequence begins with a 14 bp unique index, allowing me to group reads together that come from the same original template molecule--very similar concept to sample barcoding where reads are grouped together based on what sample they belong to; however, these are randomly-generated indexes, so I don't know the sequences, just that they are the first 14 of every read. The issue is that sometimes the R1 index hops to a different R2 index (again, very similar problem occurs in sample barcoding).

@Sequence1:Read1
CTGGCGCTCAGTATGATCAGGCGTCTGTCGTGCTCGCCTT
+
CCCCCCCCCCFFGGGGGGGGGGGGGGGHHGGGGGHGGGGG
@Sequence1:Read2
TCGTCGTATACAGAGATCAGGCGTCTGTCGTCGAGCCGCG
+
CCCCCCCFFFFFGGGGGGGGGGGGGGGHHGGGGEGGGGGG

I want to filter out aberrant combinations by looking at each R1 index individually and finding what combinations of R2 indexes are present. Then, I would take the combination with the most reads as "truth"

A--B has 100 reads
A--C has 6 reads
A--D has 11 reads

In the above example, I would take A--B as the "true" combination, and I can eliminate (or export to "trash" file) all of the false combinations. Then I want to look at the R2 indexes and see if those correspond to what I found in the R1 indexes.

B--A has 100 reads
B--X has 2 reads
B--Y has 7 reads

Therefore, I've confirmed that A--B is the true combination. The tricky part about this is ensuring that quality scores and sequence headers are carried through, and that the fastq paired-end format (R1 followed by R2 of same header) is followed.

The current code (below) is taking days to run (on smaller 200 MB files). I'm a biologist with limited bioinformatics (perl) experience, so if anyone has some insights, it would be much appreciated!

    my $inputfile = $ARGV[0];

##  Input and Output File handles
#####################################################################################
open (IN1, $inputfile                          ||die "Cannot open $inputfile $!\n");
open (OUT1, ">Filtered_$inputfile.fastq"       ||die "Cannot open selected out file      $!\n");
open (OUT2, ">Discarded_$inputfile.fastq"      ||die "Cannot open selected out file      $!\n");
while ($in1=<IN1>) {push (@InArray,$in1);}   #Loads R1 file into @R1 array
$MinimumFamilySize = 0;                      #Minimum family for forward R1UMI. Set to 0 for it to do nothing


##  Loads the Header, Quality score, and Reads into an array
#####################################################################################
$i=0;                                   #Sets counter to 0
while ($i<scalar(@InArray)){            #Goes through, line by line, the Read1 file

    $f = substr($InArray[$i],0,-1);     #Grabs read header and deletes the new line character
    push (@R1Headers,$f);               #Pushes the read header into an array

    push (@R1Reads  ,$InArray[$i+1]);   #Pushes the actual read into an array
    push (@R1UMI, substr($InArray[$i+1],0,14));

    chomp($InArray[$i+3]);              #Removes the new line character from the quality string
    push (@R1Quality,$InArray[$i+3]);   #Pushes the quality string into an array

    $f = substr($InArray[$i+4],0,-1);   #Grabs read header and deletes the new line character
    push (@R2Headers,$f);               #Pushes the read header into an array

    push (@R2Reads  ,$InArray[$i+5]);   #Pushes the actual read into an array
    push (@R2UMI, substr($InArray[$i+5],0,14));


    chomp($InArray[$i+7]);              #Removes the new line character from the quality string
    push (@R2Quality,$InArray[$i+7]);   #Pushes the quality string into an array
    $i=$i+8;                            #Increments counter by 4 to get to next read sequences
}

##Removes duplicate UMIs and UMIs that dont have large enough families
#####################################################################################
$counts{$_}++ for @R1UMI;                               #Creates a hash of R1UMI counts
while (($string1,$value1) = each(%counts)){

$value1 = $counts{$string1};
if($value1 >= $MinimumFamilySize){push (@R1UMIcomp,$string1);}
}
undef %counts;

## Pairs the forward UMI with its most common reverse UMI
#####################################################################################
$i=$j=0;
while ($i<scalar(@R1UMIcomp)){
    $j=0;

    while ($j<scalar(@R1UMI)){   
        if ($R1UMIcomp[$i] == $R1UMI[$j])
            {push(@TempArray,$R2UMI[$j]);} $j++;}

    $counts{$_}++ for @TempArray;                               
    while (($string1,$value1) = each(%counts)){
    $value1 = $counts{$string1};
    }

    @Occurences = values %counts;
    $max = max(@Occurences);
    %TempHash = reverse %counts;
    $winner = $TempHash{$max};

    %MasterList =();
    $MasterList{$R1UMIcomp[$i]} = $winner;

    $i++;
    undef %counts;
    undef %TempHash;
}

$i=$j=0;
while ($i<scalar(@R1UMI)){
    if ($R2UMI[$i] =~ $MasterList{$R1UMI[$i]}){

        $NewRead1 = $R2UMI[$i].''.$R1Reads[$i];
        $NewRead2 = substr($R2Reads[$i],14);

        $R2UMIQ = substr($R2Quality[$i],0,14);
        $NewR1Q = $R2UMIQ.''.$R1Quality[$i];
        $NewR2Q = substr($R2Quality[$i],14);

        print OUT1 "$R1Headers[$i]\n";
        print OUT1 "$NewRead1";
        print OUT1 "+\n";
        print OUT1 "$NewR1Q\n";
        print OUT1 "$R2Headers[$i]\n";
        print OUT1 "$NewRead2";
        print OUT1 "+\n";
        print OUT1 "$NewR2Q\n";
    }
    else{

        $NewRead1 = $R2UMI[$i].''.$R1Reads[$i];
        $NewRead2 = substr($R2Reads[$i],14);

        $R2UMIQ = substr($R2Quality[$i],0,14);
        $NewR1Q = $R2UMIQ.''.$R1Quality[$i];
        $NewR2Q = substr($R2Quality[$i],14);

        print OUT2 "$R1Headers[$i]\n";
        print OUT2 "$NewRead1";
        print OUT2 "+\n";
        print OUT2 "$NewR1Q\n";
        print OUT2 "$R2Headers[$i]\n";
        print OUT2 "$NewRead2";
        print OUT2 "+\n";
        print OUT2 "$NewR2Q\n";

    }

    $i++;
}
return -1;} #For Sub button

next-gen sequencing software error perl • 2.3k views

ADD COMMENT • link updated 6.9 years ago by 5heikki 11k • written 7.1 years ago by craigdj91 • 0

score 0 · Answer 1 · 2018-01-09

One way to do this would be to estimate the barcode combinations present in sequence file using awk code in this post: Demultiplexing reads with index present in the labels . Once you have the barcodes you want to extract, you can do that by using demuxbyname.sh from BBMap suite.

demuxbyname.sh in=r#.fq out=out_%_#.fq prefixmode=f names=GGACTCCT+GCGATCTA,TAAGGCGA+TCTACTCT,...
outu=filename

"Names" can also be a text file with one barcode per line (in exactly the format found in the read header). You do have to include all of the expected barcodes, though.

In the output filename, the "%" symbol gets replaced by the barcode; in both the input and output names, the "#" symbol gets replaced by 1 or 2 for read 1 or read 2. It's optional, though; you can leave it out for interleaved input/output, or specify in1=/in2=/out1=/out2= if you want custom naming.

score 0 · Answer 2 · 2018-01-09

0

Entering edit mode

6.9 years ago

5heikki 11k

I'm not sure if I understood correctly:

awk 'NR%4==2{print substr($0,1,14)}' file.fq | paste - - | sort | uniq -c

ADD COMMENT • link 6.9 years ago by 5heikki 11k