Entering edit mode
6 months ago
michael.flower.14
▴
200
I have a vector (pFRT_lacZeo) that randomly integrated at a single site in the human genome of a cell line. I would now like to find that location in the human genome. I have short read whole genome sequencing data. Below are the steps I've taken so far. I can see there are reads that map to multiple locations, but that includes reads with multiple mappings to the vector only, or to the human reference only. Is there a cleaner way to find reads that map to both the vector and human sequence, and ultimately to find where my insertion site is?
# Prepare the two reference sequences
FLPREF="/Users/michaelflower/refs/flpin_trex/pFRT_lacZeo.fa"
HUMANREF="/Users/michaelflower/refs/hg38/hg38.fa"
# Make a combined human and vector reference sequence
COMBINEDREF="/Users/michaelflower/refs/flpin_trex/hg38_pFRTlacZeo.fa"
cat "$HUMANREF" "$FLPREF" > "$COMBINEDREF"
bwa index "$COMBINEDREF"
samtools faidx "$COMBINEDREF"
# Align to combined reference
R1="U20Swt_R1_001.fastq.gz"
R2="U20Swt_R2_001.fastq.gz"
FN=$(basename "$R1" | sed 's/_R1_.*//')
HEADER=$(gzcat "$R1" | head -n 1)
ID=$(echo "$HEADER" | sed 's/@//' | cut -d ':' -f 1-4)
SM=$(echo "$HEADER" | cut -d ':' -f 2)
LB=$(echo "$HEADER" | cut -d ':' -f 3,4)
PU=$ID
RG="@RG\tID:$ID\tSM:$SM\tLB:$LB\tPL:$PL\tPU:$PU\tCN:$CN"
bwa mem -M -t 4 -R "$RG" "$COMBINEDREF" "$R1" "$R2" > "${OUT}/${FN}_combined.sam"
samtools view -Sb "${OUT}/${FN}_combined.sam" | samtools sort -o "${OUT}/${FN}_combined.sorted.bam"
samtools index "${OUT}/${FN}_combined.sorted.bam"
# Identify split reads (reads with supplementary alignments)
# Extract reads with the SA tag
samtools view -h "${OUT}/${FN}_combined.sorted.bam" | awk '$0 ~ /^@/ || $0 ~ /SA:Z:/' > "${OUT}/${FN}_supplementary_reads.sam"
samtools view -Sb "${OUT}/${FN}_supplementary_reads.sam" > "${OUT}/${FN}_supplementary_reads.bam"
samtools index "${OUT}/${FN}_supplementary_reads.bam"
see related : Extracting chimeric reads from mapping
Thanks Pierre. I'm working in a micromamba environment. I've got
jvarkit
installed. I updated your filter script as follows.However, I'm having trouble with samjdk using java.
I get this error:
I do have java installed:
I don't suppose you can help?
Hi , sorry I didn't write that jvarkit conda /env. i usually call jvarkit the following way;
You can use
bbsplit.sh
from BBMap suite to map reads to multiple references and bin them accordingly. Take a look at the in-line help (ambiguous=
ambiguous2=
options in particular).Also see Identification of the sequence insertion site in the genome for a different BBMap option.
I've had a similar question on the forum and there's some additional advice here Speeding up WGS analysis