Question

Filtering reads that map to both references

0

Entering edit mode

5 months ago

michael.flower.14 ▴ 200

I have a vector (pFRT_lacZeo) that randomly integrated at a single site in the human genome of a cell line. I would now like to find that location in the human genome. I have short read whole genome sequencing data. Below are the steps I've taken so far. I can see there are reads that map to multiple locations, but that includes reads with multiple mappings to the vector only, or to the human reference only. Is there a cleaner way to find reads that map to both the vector and human sequence, and ultimately to find where my insertion site is?

# Prepare the two reference sequences
FLPREF="/Users/michaelflower/refs/flpin_trex/pFRT_lacZeo.fa"
HUMANREF="/Users/michaelflower/refs/hg38/hg38.fa"

# Make a combined human and vector reference sequence
COMBINEDREF="/Users/michaelflower/refs/flpin_trex/hg38_pFRTlacZeo.fa"
cat "$HUMANREF" "$FLPREF" > "$COMBINEDREF"
bwa index "$COMBINEDREF"
samtools faidx "$COMBINEDREF"

# Align to combined reference
R1="U20Swt_R1_001.fastq.gz"
R2="U20Swt_R2_001.fastq.gz"
FN=$(basename "$R1" | sed 's/_R1_.*//')
HEADER=$(gzcat "$R1" | head -n 1)
ID=$(echo "$HEADER" | sed 's/@//' | cut -d ':' -f 1-4)
SM=$(echo "$HEADER" | cut -d ':' -f 2)
LB=$(echo "$HEADER" | cut -d ':' -f 3,4)
PU=$ID
RG="@RG\tID:$ID\tSM:$SM\tLB:$LB\tPL:$PL\tPU:$PU\tCN:$CN"
bwa mem -M -t 4 -R "$RG" "$COMBINEDREF" "$R1" "$R2" > "${OUT}/${FN}_combined.sam"
samtools view -Sb "${OUT}/${FN}_combined.sam" | samtools sort -o "${OUT}/${FN}_combined.sorted.bam"
samtools index "${OUT}/${FN}_combined.sorted.bam"

# Identify split reads (reads with supplementary alignments)
# Extract reads with the SA tag
samtools view -h "${OUT}/${FN}_combined.sorted.bam" | awk '$0 ~ /^@/ || $0 ~ /SA:Z:/' > "${OUT}/${FN}_supplementary_reads.sam"
samtools view -Sb "${OUT}/${FN}_supplementary_reads.sam" > "${OUT}/${FN}_supplementary_reads.bam"
samtools index "${OUT}/${FN}_supplementary_reads.bam"

bwa insertion alignment reference • 485 views

ADD COMMENT • link updated 5 months ago by Trivas ★ 1.8k • written 5 months ago by michael.flower.14 ▴ 200

0

Entering edit mode

see related : Extracting chimeric reads from mapping

ADD REPLY • link 5 months ago by Pierre Lindenbaum 164k

0

Entering edit mode

Thanks Pierre. I'm working in a micromamba environment. I've got jvarkit installed. I updated your filter script as follows.

# Create filter script
echo 'final Set<String> vector_chrom = new HashSet<>(Arrays.asList("pFRT_lacZeo"));
if (record.getReadUnmappedFlag()) return false;
if (record.getReadPairedFlag() && !record.getMateUnmappedFlag()) {
    if (vector_chrom.contains(record.getReferenceName()) != 
        vector_chrom.contains(record.getMateReferenceName())) {
        return true;
    }
}
for (final SAMRecord other : SAMUtils.getOtherCanonicalAlignments(record)) {
    if (vector_chrom.contains(record.getReferenceName()) != 
        vector_chrom.contains(other.getReferenceName())) {
        return true;
    }
}
return false;' > /Users/michaelflower/my_bin/insertion_site/filter_split_reads.js

However, I'm having trouble with samjdk using java.

# Run samjdk with the JavaScript filter
samjdk -e "$(cat /Users/michaelflower/my_bin/insertion_site/filter_split_reads.js)" "$SUPPLEMENTARY_BAM" > "${OUT}/${FN}_split_reads.sam"

I get this error:

$ samjdk -e "$(cat /Users/michaelflower/my_bin/insertion_site/filter_split_reads.js)" "$SUPPLEMENTARY_BAM" > "${OUT}/${FN}_split_reads.sam"
/Users/michaelflower/micromamba/envs/bioinfo/bin/jvarkit: line 80: /Users/michaelflower/micromamba/envs/bioinfo/bin/java: No such file or directory
(bioinfo)

I do have java installed:

$ which java
/usr/bin/java
(bioinfo) 
$ java -version
openjdk version "17.0.10" 2024-01-16 LTS
OpenJDK Runtime Environment Zulu17.48+15-CA (build 17.0.10+7-LTS)
OpenJDK 64-Bit Server VM Zulu17.48+15-CA (build 17.0.10+7-LTS, mixed mode, sharing)
(bioinfo)

I don't suppose you can help?

ADD REPLY • link 5 months ago by michael.flower.14 ▴ 200

0

Entering edit mode

I'm working in a micromamba environment. I've got jvarkit installed

Hi , sorry I didn't write that jvarkit conda /env. i usually call jvarkit the following way;

java -jar /path/to/jvarkit.jar samjdk etc...etc...

ADD REPLY • link 5 months ago by Pierre Lindenbaum 164k

0

Entering edit mode

You can use bbsplit.sh from BBMap suite to map reads to multiple references and bin them accordingly. Take a look at the in-line help (ambiguous= ambiguous2= options in particular).

Also see Identification of the sequence insertion site in the genome for a different BBMap option.

ADD REPLY • link 5 months ago by GenoMax 147k

0

Entering edit mode

I've had a similar question on the forum and there's some additional advice here Speeding up WGS analysis

ADD REPLY • link 5 months ago by Trivas ★ 1.8k