Question

BAM header edit with a list of amplicons

0

Entering edit mode

3.5 years ago

ltalignani • 0

Hi,

For the purpose of an experiment on a panel of amplicons, I have to work on bam files that I have already filtered using samtools on the positions concerned by the amplicons(with a bed file), for a total of 1840 bam files (10 bams x 184 amplicons).

I need to replace the chromosome names by the amplicon names. I already have the lists of chromosomes to replace and the lists of amplicon names (1840 entries each) in the right order. But unfortunately my algorithm does not work. It create the new files but it's not the good corresponding amplicon names and the chromosome names aren't modified. That's why I'm asking for your help.

Here is my program :

INPUTFILE=amplicon_list.txt
OLDNAMEFILE=name_to_be_replaced.txt

TEMPFILE=/tmp/$$.tmp
echo 1 > $TEMPFILE

for BAM in *.bam; do
   # Create an iterator to browse the lists
   i=$[$(cat $TEMPFILE)]
   echo $i

  # get the n-th line from $INPUTFILE and $OLDNAMEFILE
  OLDNAME=$(sed "${i}q;d" $OLDNAMEFILE)
  INPUT=$(sed "${i}q;d" $INPUTFILE)

  # get the filename to use only a part for the output
  filename=`echo $file | cut -d "_" -f 2`

  samtools view -H $BAM | sed "s/${OLDNAME}/${INPUT}/" | samtools reheader - $BAM > ${INPUT}.${filename}.bam

  i=$[$(cat $TEMPFILE) + 1]

  # Store the new iterator value 
  echo $i > $TEMPFILE

  done

I think it is possible to optimize the algorithm because my knowledge of bash is a bit limited. Thanks for your help.

samtools BAM • 1.7k views

ADD COMMENT • link updated 2.1 years ago by Ram 45k • written 3.5 years ago by ltalignani • 0

0

Entering edit mode

It would be best if you post example file names and changes desired. It is unclear where the data is coming from and what it should look like unless there is an example of that.

ADD REPLY • link 3.5 years ago by GenoMax 152k

0

Entering edit mode

Yes, you're right.
For example, the chromosome name I want to replace is named 2R_RaGOO (there is 5 different chromosome names). And I need to replace it by the amplicon name 2R_1194129. Unfortunaterly, there is 184 different amplicons, and I can't apply a samtools reheader easily. So I made two lists to facilitate the replacement of one name by the other.

The two lists are as follows : The amplicon_list.txt contain 1840 amplicon names in the right order; each amplicon name is repeated 10 times (because I have 10 bams files already filtered by amplicon positions for each amplicons : 10 bams x 184 amplicons) :

2R_10478941
2R_10478941
2R_10478941
2R_10478941
2R_10478941
2R_10478941
2R_10478941
2R_10478941
2R_10478941
2R_10478941
2R_1065567
2R_1065567
2R_1065567
2R_1065567
2R_1065567
2R_1065567
2R_1065567
...

The second list (name_to_be_replaced.txt) contain the chromosome names to replace, in the right order of the bam files stored in the directory :

2R_RaGOO  
2R_RaGOO  
...  
2L_RaGOO  
2L_RaGOO  
...
3L_RaGOO
...
3R_RaGOO
...
X_RaGOO

Hope it will help you to understand the problem. It's just about how to apply the samtools reheader command to multiples bams files containing different chromosome names to replace by different amplicon names. But I have some basic algorithm problems.

ADD REPLY • link 3.5 years ago by ltalignani • 0

0

Entering edit mode

Do all your bam files have the same header (samtools view -H) ? They should if they were mapped to the same genome. If they have the same header indeed, then you are really over-complicating it because you can just create the new header once (outside the loop), and use it for every bam.

ADD REPLY • link 3.5 years ago by Carlo Yague 9.0k

0

Entering edit mode

If you want to keep a record of each amplicon separate (treat it as a separate reference) then you will have to create a unique name when you re-header based on co-ordinate interval.

2R_10478941_1
2R_10478941_2
2R_10478941_3
....

ADD REPLY • link 3.5 years ago by GenoMax 152k

0

Entering edit mode

OK, your proposal is very interesting. In fact, I need to go back to the previous step, modify the header and, after that, extract reads that fall in the given region, based on coordinate interval. Right ?

ADD REPLY • link 3.5 years ago by ltalignani • 0

0

Entering edit mode

If you want to be able to look at the aligned data with reference to where it came from then yes. Otherwise why go through the trouble of reheadering. You already have aligned data. Simply search and go to the region (or extract the reads using samtools view) for each amplicon co-ordinates.

ADD REPLY • link 3.5 years ago by GenoMax 152k

0

Entering edit mode

This is to do a simulation of a GT-seq experiment. The amplicon sequencing data are missing, because the sequecing didn't work. The aim of this simulation is just to obtain amplicon sequencing data from WGS, re-do a variant calling, and do an haplotype calling with the Microhaplot R-package.

I know this is completely insane... We are trying check if these regions have great informations and we already have the WGS data.

ADD REPLY • link 3.5 years ago by ltalignani • 0

0

Entering edit mode

Yes. Here the header :

@HD VN:1.6 SO:coordinate
@SQ SN:2L_RaGOO LN:53691007
@SQ SN:2R_RaGOO LN:60320594
@SQ SN:3L_RaGOO LN:40972002
@SQ SN:3R_RaGOO LN:53056611
@SQ SN:Mt_RaGOO LN:15365
@SQ SN:UNKN_RaGOO LN:18066711
@SQ SN:X_RaGOO LN:25311795

ADD REPLY • link 3.5 years ago by ltalignani • 0