Filtering reads that map to both references
0
0
Entering edit mode
5 months ago

I have a vector (pFRT_lacZeo) that randomly integrated at a single site in the human genome of a cell line. I would now like to find that location in the human genome. I have short read whole genome sequencing data. Below are the steps I've taken so far. I can see there are reads that map to multiple locations, but that includes reads with multiple mappings to the vector only, or to the human reference only. Is there a cleaner way to find reads that map to both the vector and human sequence, and ultimately to find where my insertion site is?

# Prepare the two reference sequences
FLPREF="/Users/michaelflower/refs/flpin_trex/pFRT_lacZeo.fa"
HUMANREF="/Users/michaelflower/refs/hg38/hg38.fa"

# Make a combined human and vector reference sequence
COMBINEDREF="/Users/michaelflower/refs/flpin_trex/hg38_pFRTlacZeo.fa"
cat "$HUMANREF" "$FLPREF" > "$COMBINEDREF"
bwa index "$COMBINEDREF"
samtools faidx "$COMBINEDREF"

# Align to combined reference
R1="U20Swt_R1_001.fastq.gz"
R2="U20Swt_R2_001.fastq.gz"
FN=$(basename "$R1" | sed 's/_R1_.*//')
HEADER=$(gzcat "$R1" | head -n 1)
ID=$(echo "$HEADER" | sed 's/@//' | cut -d ':' -f 1-4)
SM=$(echo "$HEADER" | cut -d ':' -f 2)
LB=$(echo "$HEADER" | cut -d ':' -f 3,4)
PU=$ID
RG="@RG\tID:$ID\tSM:$SM\tLB:$LB\tPL:$PL\tPU:$PU\tCN:$CN"
bwa mem -M -t 4 -R "$RG" "$COMBINEDREF" "$R1" "$R2" > "${OUT}/${FN}_combined.sam"
samtools view -Sb "${OUT}/${FN}_combined.sam" | samtools sort -o "${OUT}/${FN}_combined.sorted.bam"
samtools index "${OUT}/${FN}_combined.sorted.bam"

# Identify split reads (reads with supplementary alignments)
# Extract reads with the SA tag
samtools view -h "${OUT}/${FN}_combined.sorted.bam" | awk '$0 ~ /^@/ || $0 ~ /SA:Z:/' > "${OUT}/${FN}_supplementary_reads.sam"
samtools view -Sb "${OUT}/${FN}_supplementary_reads.sam" > "${OUT}/${FN}_supplementary_reads.bam"
samtools index "${OUT}/${FN}_supplementary_reads.bam"
bwa insertion alignment reference • 485 views
ADD COMMENT
0
Entering edit mode
ADD REPLY
0
Entering edit mode

Thanks Pierre. I'm working in a micromamba environment. I've got jvarkit installed. I updated your filter script as follows.

# Create filter script
echo 'final Set<String> vector_chrom = new HashSet<>(Arrays.asList("pFRT_lacZeo"));
if (record.getReadUnmappedFlag()) return false;
if (record.getReadPairedFlag() && !record.getMateUnmappedFlag()) {
    if (vector_chrom.contains(record.getReferenceName()) != 
        vector_chrom.contains(record.getMateReferenceName())) {
        return true;
    }
}
for (final SAMRecord other : SAMUtils.getOtherCanonicalAlignments(record)) {
    if (vector_chrom.contains(record.getReferenceName()) != 
        vector_chrom.contains(other.getReferenceName())) {
        return true;
    }
}
return false;' > /Users/michaelflower/my_bin/insertion_site/filter_split_reads.js

However, I'm having trouble with samjdk using java.

# Run samjdk with the JavaScript filter
samjdk -e "$(cat /Users/michaelflower/my_bin/insertion_site/filter_split_reads.js)" "$SUPPLEMENTARY_BAM" > "${OUT}/${FN}_split_reads.sam"

I get this error:

$ samjdk -e "$(cat /Users/michaelflower/my_bin/insertion_site/filter_split_reads.js)" "$SUPPLEMENTARY_BAM" > "${OUT}/${FN}_split_reads.sam"
/Users/michaelflower/micromamba/envs/bioinfo/bin/jvarkit: line 80: /Users/michaelflower/micromamba/envs/bioinfo/bin/java: No such file or directory
(bioinfo) 

I do have java installed:

$ which java
/usr/bin/java
(bioinfo) 
$ java -version
openjdk version "17.0.10" 2024-01-16 LTS
OpenJDK Runtime Environment Zulu17.48+15-CA (build 17.0.10+7-LTS)
OpenJDK 64-Bit Server VM Zulu17.48+15-CA (build 17.0.10+7-LTS, mixed mode, sharing)
(bioinfo) 

I don't suppose you can help?

ADD REPLY
0
Entering edit mode

I'm working in a micromamba environment. I've got jvarkit installed

Hi , sorry I didn't write that jvarkit conda /env. i usually call jvarkit the following way;

java -jar /path/to/jvarkit.jar samjdk etc...etc...
ADD REPLY
0
Entering edit mode

You can use bbsplit.sh from BBMap suite to map reads to multiple references and bin them accordingly. Take a look at the in-line help (ambiguous= ambiguous2= options in particular).

Also see Identification of the sequence insertion site in the genome for a different BBMap option.

ADD REPLY
0
Entering edit mode

I've had a similar question on the forum and there's some additional advice here Speeding up WGS analysis

ADD REPLY

Login before adding your answer.

Traffic: 1628 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6