I have a combined paired-end FASTQ file where each sequence begins with a 14 bp unique index, allowing me to group reads together that come from the same original template molecule--very similar concept to sample barcoding where reads are grouped together based on what sample they belong to; however, these are randomly-generated indexes, so I don't know the sequences, just that they are the first 14 of every read. The issue is that sometimes the R1 index hops to a different R2 index (again, very similar problem occurs in sample barcoding).
@Sequence1:Read1
CTGGCGCTCAGTATGATCAGGCGTCTGTCGTGCTCGCCTT
+
CCCCCCCCCCFFGGGGGGGGGGGGGGGHHGGGGGHGGGGG
@Sequence1:Read2
TCGTCGTATACAGAGATCAGGCGTCTGTCGTCGAGCCGCG
+
CCCCCCCFFFFFGGGGGGGGGGGGGGGHHGGGGEGGGGGG
I want to filter out aberrant combinations by looking at each R1 index individually and finding what combinations of R2 indexes are present. Then, I would take the combination with the most reads as "truth"
A--B has 100 reads
A--C has 6 reads
A--D has 11 reads
In the above example, I would take A--B as the "true" combination, and I can eliminate (or export to "trash" file) all of the false combinations. Then I want to look at the R2 indexes and see if those correspond to what I found in the R1 indexes.
B--A has 100 reads
B--X has 2 reads
B--Y has 7 reads
Therefore, I've confirmed that A--B is the true combination. The tricky part about this is ensuring that quality scores and sequence headers are carried through, and that the fastq paired-end format (R1 followed by R2 of same header) is followed.
The current code (below) is taking days to run (on smaller 200 MB files). I'm a biologist with limited bioinformatics (perl) experience, so if anyone has some insights, it would be much appreciated!
my $inputfile = $ARGV[0];
## Input and Output File handles
#####################################################################################
open (IN1, $inputfile ||die "Cannot open $inputfile $!\n");
open (OUT1, ">Filtered_$inputfile.fastq" ||die "Cannot open selected out file $!\n");
open (OUT2, ">Discarded_$inputfile.fastq" ||die "Cannot open selected out file $!\n");
while ($in1=<IN1>) {push (@InArray,$in1);} #Loads R1 file into @R1 array
$MinimumFamilySize = 0; #Minimum family for forward R1UMI. Set to 0 for it to do nothing
## Loads the Header, Quality score, and Reads into an array
#####################################################################################
$i=0; #Sets counter to 0
while ($i<scalar(@InArray)){ #Goes through, line by line, the Read1 file
$f = substr($InArray[$i],0,-1); #Grabs read header and deletes the new line character
push (@R1Headers,$f); #Pushes the read header into an array
push (@R1Reads ,$InArray[$i+1]); #Pushes the actual read into an array
push (@R1UMI, substr($InArray[$i+1],0,14));
chomp($InArray[$i+3]); #Removes the new line character from the quality string
push (@R1Quality,$InArray[$i+3]); #Pushes the quality string into an array
$f = substr($InArray[$i+4],0,-1); #Grabs read header and deletes the new line character
push (@R2Headers,$f); #Pushes the read header into an array
push (@R2Reads ,$InArray[$i+5]); #Pushes the actual read into an array
push (@R2UMI, substr($InArray[$i+5],0,14));
chomp($InArray[$i+7]); #Removes the new line character from the quality string
push (@R2Quality,$InArray[$i+7]); #Pushes the quality string into an array
$i=$i+8; #Increments counter by 4 to get to next read sequences
}
##Removes duplicate UMIs and UMIs that dont have large enough families
#####################################################################################
$counts{$_}++ for @R1UMI; #Creates a hash of R1UMI counts
while (($string1,$value1) = each(%counts)){
$value1 = $counts{$string1};
if($value1 >= $MinimumFamilySize){push (@R1UMIcomp,$string1);}
}
undef %counts;
## Pairs the forward UMI with its most common reverse UMI
#####################################################################################
$i=$j=0;
while ($i<scalar(@R1UMIcomp)){
$j=0;
while ($j<scalar(@R1UMI)){
if ($R1UMIcomp[$i] == $R1UMI[$j])
{push(@TempArray,$R2UMI[$j]);} $j++;}
$counts{$_}++ for @TempArray;
while (($string1,$value1) = each(%counts)){
$value1 = $counts{$string1};
}
@Occurences = values %counts;
$max = max(@Occurences);
%TempHash = reverse %counts;
$winner = $TempHash{$max};
%MasterList =();
$MasterList{$R1UMIcomp[$i]} = $winner;
$i++;
undef %counts;
undef %TempHash;
}
$i=$j=0;
while ($i<scalar(@R1UMI)){
if ($R2UMI[$i] =~ $MasterList{$R1UMI[$i]}){
$NewRead1 = $R2UMI[$i].''.$R1Reads[$i];
$NewRead2 = substr($R2Reads[$i],14);
$R2UMIQ = substr($R2Quality[$i],0,14);
$NewR1Q = $R2UMIQ.''.$R1Quality[$i];
$NewR2Q = substr($R2Quality[$i],14);
print OUT1 "$R1Headers[$i]\n";
print OUT1 "$NewRead1";
print OUT1 "+\n";
print OUT1 "$NewR1Q\n";
print OUT1 "$R2Headers[$i]\n";
print OUT1 "$NewRead2";
print OUT1 "+\n";
print OUT1 "$NewR2Q\n";
}
else{
$NewRead1 = $R2UMI[$i].''.$R1Reads[$i];
$NewRead2 = substr($R2Reads[$i],14);
$R2UMIQ = substr($R2Quality[$i],0,14);
$NewR1Q = $R2UMIQ.''.$R1Quality[$i];
$NewR2Q = substr($R2Quality[$i],14);
print OUT2 "$R1Headers[$i]\n";
print OUT2 "$NewRead1";
print OUT2 "+\n";
print OUT2 "$NewR1Q\n";
print OUT2 "$R2Headers[$i]\n";
print OUT2 "$NewRead2";
print OUT2 "+\n";
print OUT2 "$NewR2Q\n";
}
$i++;
}
return -1;} #For Sub button