Hi,
I have a long list of chromosome positions and strand information (all produced by sam2tsv (http://lindenb.github.io/jvarkit/Sam2Tsv.html). I am looking for a way to annotate this list with feature names from a bed file (or a gtf file, whatever better). The desired minimal output is a table like the one below, but including e.g., a transcript name column:
#Read-Name Flag MAPQ CHROM READ-POS0 READ-BASE READ-QUAL REF-POS1 REF-BASE CIGAR-OP
r001 163 30 ref 0 T . 7 T M
r001 163 30 ref 1 T . 8 T M
r001 163 30 ref 2 A . 9 A M
Essentially it boils down to looking up if a position falls within a bed range and if the strand matches, so maybe something like awk with two files looks up (?) (never done so far, so I am not sure if that's anywhere near to be good for that). What would be the most efficient way to do it for a very long table? I'm hoping for some tips to get me started so that I don't get stuck with an inefficient approach.
An alternative could be to use bedtools intersect, and annotate reads with a feature name, but in this case, I need the information about QNAME::feature match for each read individually. Is that possible to get this with bedtools?
Cheers, Lech
Thank you, this was extremely helpful, and gave me some more thinking. Actually, I may use mplileup file instead of TSV (TSV sizes become prohibitive). Is there a way to get an extra column in mpileup file with a feature name? Essentially just getting a feature name (if such exists) for consecutive positions in mpileup file. Thanks again!
I am not familiar with mpileup, but you could possible do (i) convert mpileup to bed, (ii) intersect with genomic features, (iii) use cut to removed the superfluous columns. This should give the results you are hoping for.
Thanks a lot for clarification @A. Dominguez!
No problem. If you think this solves the original question, please accept my answer or up-vote the relevant messages :)