Hi,
For the purpose of an experiment on a panel of amplicons, I have to work on bam files that I have already filtered using samtools on the positions concerned by the amplicons(with a bed file), for a total of 1840 bam files (10 bams x 184 amplicons).
I need to replace the chromosome names by the amplicon names. I already have the lists of chromosomes to replace and the lists of amplicon names (1840 entries each) in the right order. But unfortunately my algorithm does not work. It create the new files but it's not the good corresponding amplicon names and the chromosome names aren't modified. That's why I'm asking for your help.
Here is my program :
INPUTFILE=amplicon_list.txt
OLDNAMEFILE=name_to_be_replaced.txt
TEMPFILE=/tmp/$$.tmp
echo 1 > $TEMPFILE
for BAM in *.bam; do
# Create an iterator to browse the lists
i=$[$(cat $TEMPFILE)]
echo $i
# get the n-th line from $INPUTFILE and $OLDNAMEFILE
OLDNAME=$(sed "${i}q;d" $OLDNAMEFILE)
INPUT=$(sed "${i}q;d" $INPUTFILE)
# get the filename to use only a part for the output
filename=`echo $file | cut -d "_" -f 2`
samtools view -H $BAM | sed "s/${OLDNAME}/${INPUT}/" | samtools reheader - $BAM > ${INPUT}.${filename}.bam
i=$[$(cat $TEMPFILE) + 1]
# Store the new iterator value
echo $i > $TEMPFILE
done
I think it is possible to optimize the algorithm because my knowledge of bash is a bit limited. Thanks for your help.
It would be best if you post example file names and changes desired. It is unclear where the data is coming from and what it should look like unless there is an example of that.
Yes, you're right.
For example, the chromosome name I want to replace is named 2R_RaGOO (there is 5 different chromosome names). And I need to replace it by the amplicon name 2R_1194129. Unfortunaterly, there is 184 different amplicons, and I can't apply a samtools reheader easily. So I made two lists to facilitate the replacement of one name by the other.
The two lists are as follows : The amplicon_list.txt contain 1840 amplicon names in the right order; each amplicon name is repeated 10 times (because I have 10 bams files already filtered by amplicon positions for each amplicons : 10 bams x 184 amplicons) :
The second list (name_to_be_replaced.txt) contain the chromosome names to replace, in the right order of the bam files stored in the directory :
Hope it will help you to understand the problem. It's just about how to apply the samtools reheader command to multiples bams files containing different chromosome names to replace by different amplicon names. But I have some basic algorithm problems.
Do all your bam files have the same header (
samtools view -H
) ? They should if they were mapped to the same genome. If they have the same header indeed, then you are really over-complicating it because you can just create the new header once (outside the loop), and use it for every bam.If you want to keep a record of each amplicon separate (treat it as a separate reference) then you will have to create a unique name when you re-header based on co-ordinate interval.
OK, your proposal is very interesting. In fact, I need to go back to the previous step, modify the header and, after that, extract reads that fall in the given region, based on coordinate interval. Right ?
If you want to be able to look at the aligned data with reference to where it came from then yes. Otherwise why go through the trouble of reheadering. You already have aligned data. Simply search and go to the region (or extract the reads using samtools view) for each amplicon co-ordinates.
This is to do a simulation of a GT-seq experiment. The amplicon sequencing data are missing, because the sequecing didn't work. The aim of this simulation is just to obtain amplicon sequencing data from WGS, re-do a variant calling, and do an haplotype calling with the Microhaplot R-package.
I know this is completely insane... We are trying check if these regions have great informations and we already have the WGS data.
Yes. Here the header :
@HD VN:1.6 SO:coordinate
@SQ SN:2L_RaGOO LN:53691007
@SQ SN:2R_RaGOO LN:60320594
@SQ SN:3L_RaGOO LN:40972002
@SQ SN:3R_RaGOO LN:53056611
@SQ SN:Mt_RaGOO LN:15365
@SQ SN:UNKN_RaGOO LN:18066711
@SQ SN:X_RaGOO LN:25311795