I want to use this reference genome Camarhynchus parvulus but it lacks information about the sex chromosomes (it only has Z chromosome, but not W). So I'm interested in using the reference genome from Taeniopygia guttata (zebra finch) where the Z and W sex chromosomes are present.
# Get the reference genome
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/902/806/625/GCA_902806625.1_Camarhynchus_parvulus_V1.1/GCA_902806625.1_Camarhynchus_parvulus_V1.1_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/957/565/GCF_003957565.2_bTaeGut1.4.pri/GCF_003957565.2_bTaeGut1.4.pri_genomic.fna.gz
# Index
samtools faidx GCF_003957565.2_bTaeGut1.4.pri_genomic.fna
# extract 2 sex chromosomes
samtools faidx GCF_003957565.2_bTaeGut1.4.pri_genomic.fna NC_044241.2 > tae.gut.Z.fna.fa # NC_044241.2 ### Z
samtools faidx GCF_003957565.2_bTaeGut1.4.pri_genomic.fna NC_045028.1 > tae.gut.W.fna.fa # NC_045028.1 ### W
# index ref
bwa index GCA_902806625.1_Camarhynchus_parvulus_V1.1_genomic.fna.gz
# Align Z and W car to reference
bwa mem -t $SLURM_CPUS_PER_TASK \
GCA_902806625.1_Camarhynchus_parvulus_V1.1_genomic.fna.gz \
tae.gut.Z.fna.fa > Z_aligned.sam
bwa mem -t $SLURM_CPUS_PER_TASK \
GCA_902806625.1_Camarhynchus_parvulus_V1.1_genomic.fna.gz \
tae.gut.W.fna.fa > W_aligned.sam
# Make BAM fles and filter
cat Z_aligned.sam | samtools view -bSh -q $MQ - > Z_aligned.bam
cat W_aligned.sam | samtools view -bSh -q $MQ - > W_aligned.bam
# Get chromosomes where sex read were mapped from filtered BAM file
samtools view W_aligned.bam | grep -v @ | cut -f 3 | sort -u > W_chr.parv.gut_filt.txt
samtools view Z_aligned.bam | grep -v @ | cut -f 3 | sort -u > Z_chr.parv.gut_filt.txt
Then I want to use the bam file to 'remove' all the reads that are from the sex chromosomes from my ID bam files.
I tried this:
bedtools intersect -v \
-a BIRD1.bam \
-b W_aligned.bam Z_aligned.bam > BIRD1_filtered.bam
or
bedtools bamtobed -i W_aligned.bam > Walg.bed
bedtools bamtobed -i Z_aligned.bam > Zalg.bed
samtools view BIRD1_filtered.bam \
-b -h \
-L Walg.bed \
-U BIRD1_outRegions.bam \
-o BIRD1_in.Regions.bam
But wasn't able to extract the sex chromosome information. How could this be done. Is this an approach that could work using BAM files or another file type should be used? The end product should be similar to a 'left' or 'right join' between 2 BAM files.
Let me confirm if I understand the aim. You are trying to use the sex chromosome fasta from a related organism to remove contigs/scaffolds from an assembly that you are interested in? What kind of alignments did you get from this?
Using
bwa mem
for this purpose does not seem like a good idea unless you create a fake set of short reads from the two sex chromosomes and then use then for bwa search.My samples (like 'BIRD1' in the example above) are aligned using
GCA_902806625.1_Camarhynchus_parvulus_V1.1_genomic.fna.gz
. But the sex chromosomes in this reference are not well described (there is only the Z chromosome, but I'd like to make sure that all the sex chromosomes are found).So my idea was to use another reference genome (
GCF_003957565.2_bTaeGut1.4.pri_genomic.fna.gz
) where the sex chromosomes are found and try to match these sex chromosomes to the reference ofGCA_902806625.1_Camarhynchus_parvulus_V1.1_genomic.fna.gz
. Then 'remove' the sex chromosome information that I would have intersected by doing a 'left join' (so that in the end, each bam file for all my individuals could be filtered in such a way as to remove the sex chromosomes). So the aim is to search within all my individual BAM files and remove the 'sex DNA' so that I can continue my downstream analyses without the 'sex' DNA.I want to do that because for some reason, my Admixture plots (using NGSAdmix) are partly structured based on sex.