Annotating reads with the name of the feature to which they map
1
0
Entering edit mode
3.7 years ago
lechu ▴ 20

Hi,

I have a long list of chromosome positions and strand information (all produced by sam2tsv (http://lindenb.github.io/jvarkit/Sam2Tsv.html). I am looking for a way to annotate this list with feature names from a bed file (or a gtf file, whatever better). The desired minimal output is a table like the one below, but including e.g., a transcript name column:

#Read-Name  Flag  MAPQ  CHROM  READ-POS0  READ-BASE  READ-QUAL  REF-POS1  REF-BASE  CIGAR-OP
r001        163   30    ref    0          T          .          7         T         M
r001        163   30    ref    1          T          .          8         T         M
r001        163   30    ref    2          A          .          9         A         M

Essentially it boils down to looking up if a position falls within a bed range and if the strand matches, so maybe something like awk with two files looks up (?) (never done so far, so I am not sure if that's anywhere near to be good for that). What would be the most efficient way to do it for a very long table? I'm hoping for some tips to get me started so that I don't get stuck with an inefficient approach.

An alternative could be to use bedtools intersect, and annotate reads with a feature name, but in this case, I need the information about QNAME::feature match for each read individually. Is that possible to get this with bedtools?

Cheers, Lech

RNA-Seq • 1.2k views
ADD COMMENT
2
Entering edit mode
3.7 years ago
A. Domingues ★ 2.7k

If I understood teh goal correctly, I have use two different strategies in the past:

bedtools interesect

bamToBed -i sample.bam | bedtools intersect -a - -b features.bed -s -wa -wb -bed -f 1.0 > $output1

You can play with the flags to make the intersections, unstranded/same/opposite strand, and define the level of overlap. The output will a text file with the read information plus the feature info (start, end, name, ...) . Using command line tools such as awk / cut, it can be cleaned up a bit. The features can also be a GFF Th caveat with this approach that there will be some reads will have more than entry when they overlap more than on feature. How to deal with these will depend on the goal of the analysis.

HTseq (> v0.9, i think)

samtools view -h ${aln} | htseq-count --samout=${exp}.test1.sam -s yes -f sam -m intersection-nonempty - ${ESSENTIAL_GENES} > ${exp}.counts

This will generate a classical table of counts, which you can discard, but also a new sam file with an extra tag with gene id (or whatever ID you have):

NB501946:201:H3HMGAFXY:1:21112:24049:15895 16 I 3115
0 22M * 0 0TGATGTTCTACGCTTAAATTTT EEEEEEEEEEEEEEEEEEEEEA XA:i:0 MD:Z:22 NM:i:0 XM:i:2 XF:Z:__too_low_aQual NB501946:201:H3HMGAFXY:2:11202:12920:18180
16 I 3685 255 21M * 0
0TATCTACTAGGAATAACTCGA EEEEEEEEEEEEEEEEEEEEA XA:i:0 MD:Z:21 NM:i:0 XF:Z:__no_feature NB501946:201:H3HMGAFXY:2:21105:7328:11769
0 I 3738 255 21M * 0
0TGTAAAATAGAGGATCAGACC AAEEEEEEEAAEE6EEEEEEE XA:i:0 MD:Z:21 NM:i:0 XF:Z:__no_feature NB501946:201:H3HMGAFXY:2:11112:5811:6494
16 I 3746 255 21M * 0
0AGAGGATCAGACCCAAAATTC EEEEEEEEEEEEEEEEEEEEA XA:i:0 MD:Z:21 NM:i:0 XF:Z:WBGene00023193 NB501946:201:H3HMGAFXY:2:21204:5691:10822 16 I 3746 255 22M * 0
0AGAGGATCAGACCCAAAATTCA EEEEEEEEEEEEEAEEEEEEEA XA:i:0 MD:Z:22 NM:i:0 XF:Z:WBGene00023193 NB501946:201:H3HMGAFXY:1:21209:12787:3406 0 I 3747 255 21M * 0
0GAGGATCAGACCCAAAATTCA AEEEEEEEEEEEEEEEEEEEE XA:i:0 MD:Z:21 NM:i:0 XF:Z:__no_feature NB501946:201:H3HMGAFXY:4:11512:6158:17507
16 I 3747 255 21M * 0
0GAGGATCAGACCCAAAATTCA EEEEEEEEEEEEEEEEEEEEA XA:i:0 MD:Z:21 NM:i:0 XF:Z:WBGene00023193 NB501946:201:H3HMGAFXY:3:11503:17322:16736 16 I 3747 255 20M * 0
0GAGGATCAGACCCAAAATTC EEEEEEEEEEEEEEEEEEEA XA:i:0 MD:Z:20 NM:i:0 XF:Z:WBGene00023193

This can then be converted just like you have in your example, but with that added column. See also https://github.com/simon-anders/htseq/issues/65

ADD COMMENT
0
Entering edit mode

Thank you, this was extremely helpful, and gave me some more thinking. Actually, I may use mplileup file instead of TSV (TSV sizes become prohibitive). Is there a way to get an extra column in mpileup file with a feature name? Essentially just getting a feature name (if such exists) for consecutive positions in mpileup file. Thanks again!

ADD REPLY
0
Entering edit mode

I am not familiar with mpileup, but you could possible do (i) convert mpileup to bed, (ii) intersect with genomic features, (iii) use cut to removed the superfluous columns. This should give the results you are hoping for.

ADD REPLY
0
Entering edit mode

Thanks a lot for clarification @A. Dominguez!

ADD REPLY
0
Entering edit mode

No problem. If you think this solves the original question, please accept my answer or up-vote the relevant messages :)

ADD REPLY

Login before adding your answer.

Traffic: 2415 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6