I'm trying to identify regions of a targeted NGS panel that lack 20x coverage. I've figured out how to get my desired output (described below) for a single sequencing run. Where I'm stuck is handling sample replicates: I only want to output regions that are under 20x in both replicates.
Example: I have run1 and run2, and each run has sample1, sample2, and sample3. I want to output regions for sample1 that are under 20x in both run1 and run2.
My working process for a single sequencing run:
I used the LRG as described in my other question to create a ROI bed file with annotations. I used mosdepth to get per-base output per sample, then bedtools intersect against my ROI.bed to get a file with the columns
Chr Start End Region Size Strand Chr2 Start2 End2 Coverage Overlap
Where Chr Start End are from my ROI.bed and Chr2 Start2 End2 are from the per-base.bed produced by mosdepth.
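For reference, a sketch of those two steps - file names like sample1.bam are illustrative:

# mosdepth emits per-base depth as sample1.per-base.bed.gz by default
mosdepth sample1 sample1.bam
# -wo writes the full ROI record, the full per-base record, and the overlap
# length, which yields the 11 columns listed above
bedtools intersect -a ROI.bed -b sample1.per-base.bed.gz -wo > sample1_gene_final_roi.txt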
I create a file that combines all samples and all regions under 20x using awk:

awk -F $'\t' '$10 < 20 {print FILENAME "\t" $0}' /coverage/*_gene_final_roi.txt > /run1/all_samples_under_20x.txt
Which has columns
Sample Chr Start End Region Size Strand Chr2 Start2 End2 Coverage Overlap
Finally, I use that file and pandas to generate the tables I need (count of regions, sum of bases, and recurrently missed regions); my script is in my SO answer if it's helpful.
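Two of those tables can also be pulled straight from that file with awk; a rough sketch (not my actual pandas script), with field 5 being Region and field 12 Overlap:

awk -F '\t' -v OFS='\t' '{
    if (!(($1, $5) in seen)) { seen[$1, $5] = 1; regions[$1]++ }  # count distinct regions per sample
    bases[$1] += $12                                              # sum under-20x bases per sample
} END {
    for (s in regions) print s, regions[s], bases[s]
}' /run1/all_samples_under_20x.txt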
To handle the replicates I tried:
file1=/run1/all_samples_under_20x.txt
file2=/run2/all_samples_under_20x.txt
file_collapsed=/coverage/20x_5bp_collapsed_replicates_r1_r2.txt
cat $file1 $file2 | awk -v OFS="\t" 'NR==1{print $0,"NEW";next}{print $0,$1"_"$5}' | sort -k13,13 | uniq -D | cut --complement -f 13 > $file_collapsed
python mosdepth_count.py --input /coverage/20x_5bp_collapsed_replicates_r1_r2.txt --output /coverage/Illumina/Illumina_r1_r2_20x_5bp
My thought was to create a new field combining sample ($1) and region ($5), sort on it, and use uniq -D to keep only duplicated values (regions by sample present in both file1 and file2). However, I'm missing regions that are definitely under 20x in both replicates.
I've searched SO for pandas combine functions, but the per-base output can vary in depth across a single ROI, leading to multiple rows in /run1/all_samples_under_20x.txt where fields $1 and $5 are duplicated, which screws things up when I try to use the $1/$5 combination as a key for combining across replicates.
I'm not sure I follow the logic of the approach - apologies. If I am correct in assuming what you want to do, I would compare the per-base outputs of the two replicates directly and keep only the regions that are under 20x in both, rather than trying to match rows between the two combined files.
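Something along these lines, for example - a sketch only, with illustrative file names (run1_sample1.per-base.bed.gz and run2_sample1.per-base.bed.gz being the mosdepth per-base outputs for the same sample in each run):

# 1) keep only the bases under 20x in each replicate
zcat run1_sample1.per-base.bed.gz | awk -F '\t' '$4 < 20' > run1_under20.bed
zcat run2_sample1.per-base.bed.gz | awk -F '\t' '$4 < 20' > run2_under20.bed
# 2) intersect the two: only the portions under 20x in BOTH replicates survive
bedtools intersect -a run1_under20.bed -b run2_under20.bed > both_under20.bed
# 3) re-annotate against the ROI exactly as in your single-run workflow
bedtools intersect -a ROI.bed -b both_under20.bed -wo > sample1_r1_r2_under20_roi.txt

Because bedtools intersect reports only the overlapping portion of each interval by default, a stretch that is under 20x in one replicate but well covered in the other drops out automatically, even when the interval boundaries differ between runs.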
thanks, I figured I was overcomplicating things. Comparing the per-base output to find the common regions under 20x does seem like a better approach. I'll have to experiment with finding the regions in common in the per-base output that are under 20x. My initial thought is to combine the replicate per-base files, sort, then:
bedtools merge -i combined.bed -c 4 -o max
However, since the intervals can differ in size between replicates, I'm thinking I might miss some regions under 20x that are partially overlapped by a region in the replicate that is over 20x. I'll have to test this approach out.

Okay, well, I think that bedtools merge has quite a few command-line parameters that should help you to get what you need. If not, you could use an awk script. I have a very complex awk script that does something along the same lines of logic (i.e., does one bp coordinate fall within a BED region, and it then also checks the end bp), but it could take a while to adapt to your situation. In fact, here is that script: separating bed file into separate files for every 100000 base pairs