How to highlight multiple references mapped reads?
1
1
Entering edit mode
7.2 years ago
vmicrobio ▴ 290

Hi all,

A very basic question but I can't solve it easily. After a raw-read mapping against a list of references, how to make a list of reads that match with more than one reference (and which ones) to highlight them?

Thank you very much!

alignment • 1.5k views
ADD COMMENT
0
Entering edit mode

define highlight please.

ADD REPLY
0
Entering edit mode

I just meant to work with them after. Sorry if my question is not clear : I'm trying to get two different things:

  • make a file containing reads (matching with 2 references or more) and with which references associated
  • a count that tell me that x reads match with references 'a' and 'b', y reads with 'b', 'c', 'd',...
ADD REPLY
1
Entering edit mode
7.2 years ago

extract the names of the first-in-pair + mapped reads and their reference, sort+unique to get the unique (name+REF), cut to get the uniq names, sort+uniq again to the duplicated names (sames read-name but different reference):

 samtools view -f 64 -F 4 in.bam | cut -f1,3 | sort | uniq | cut -f1 | sort | uniq -d > namesF.txt

same for 2nd in pair

 samtools view -f 128 -F 4 in.bam | cut -f1,3 | sort | uniq | cut -f1 | sort | uniq -d > namesR.txt

and with which references associated

re-use those files with grep -f or ( joinif the list is large) to extract those reads

count that tell me that x reads match with references

it should be a simple sort | uniq -c

ADD COMMENT
0
Entering edit mode

Thanks a lot for your answer! What about single reads dataset?

ADD REPLY
0
Entering edit mode

it's even faster.

 samtools view -F 4 in.bam | cut -f1,3 | sort | uniq | cut -f1 | sort | uniq -d > all.txt
ADD REPLY
0
Entering edit mode

Hi Pierre,

Thank you for your answer, that's what I've tried but I wanted to be sure that I miss anything

I tried this:

samtools view -F 4 file.bam | cut -f1,3 | sort | uniq | cut -f1 | sort | uniq -d > names.txt

samtools view file.bam | grep -f names.txt > sequences.sam

cat sequences.sam | sort | uniq -c > interm

output:

1 ABDC8:10138:09707 2048    ref1    1   60  23H77M185H  *   0   0   ACATGGATGTTACGCAGCAGGGCAGTCGCCCTAAAACAAAGCACGATGCACTAAGCACATACCTGCTCACAGCCTAC   :9858///484668::::::;4;:::666;3:<<<0;;;3;=5;;;::;=///6:99977717178899::994:=1   NM:i:0  MD:Z:77 AS:i:77 XS:i:0  SA:Z:ref2,1,+,64S36M1I18M1I19M1I97M1D13M35S,60,4;

to format the ouput I did

awk '{print $2,$4,$17}' interm >final

and I got:

ABDC8:10138:09707 ref1 SA:Z:ref2,1,+,64S36M1I18M1I19M1I97M1D13M35S,60,4;

question: how to put tabs between my selected columns and to select only ref2 instead of all column 17 (I have a lot of reference names with a lot of different character sizes)

Thanks again!

ADD REPLY

Login before adding your answer.

Traffic: 2808 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6