Hello all, I have a list of un-mapped reads, is there a way to generate a ranked list of transcripts, meaning count the most frequent ones?
Thanks!
Hello all, I have a list of un-mapped reads, is there a way to generate a ranked list of transcripts, meaning count the most frequent ones?
Thanks!
I guess you could try to assemble the unmapped reads, and blast the best looking contigs to nr, see if you can identify where they came from. Or just blast the most common unmapped reads.
If you have reads in a .sam
file and you want to count their frequency you could make a quick python
script:
inFile = open('unmapped.sam','r')
outFile = open('unmapped_read_count.txt','w')
countDict = {}
line = inFile.readline()
#Remove header starting with @ symbol
while line.startswith('@'):
line = inFile.readline()
#Read through file
while line != '':
lineList = line.rstrip().split('\t')
if lineList[9] not in countDict: #If this is the first time this sequence is seen
countDict[lineList[9]] = 1
else:
countDict[lineList[9]] = countDict[lineList[9]] + 1 #If this sequence has been seen before add one to the number of times it's been seen
line = inFile.readline()
#Loop through dictionary and print tab separated read and number of counts
for read in countDict.keys():
outFile.write(read+'\t'+str(countDict[read])+'\n')
inFile.close()
outFile.close()
It's worth noting that this will count the number of occurrences for identical reads, if you have a sequence that you expect these reads to map to it's a much better idea to map these to that template like I suggested in the above comment.
Do you have a lot of unmapped reads?
Sometimes, there may really be something else contributing to true RNA-Seq sequencing in your sample (where there may actually be benefit to using a joint reference). For example, if you have a virus infection and there are a lot of virus reads (where you may want to use a human+virus reference).
However, I think there is always some amount of cross-contamination and other factors that are going to keep all 100% of your reads from being important to understand. For example, if you have filtered things like adapters and other non-biological sequences, and you still have <2% unexplained reads, I think limits to the genome assembly, purity of de-muliplexed samples, etc. could be a factor.
Not sure if this is relevant to you, but I think there are currently top PhiX BLAST hits that aren't accurate (because someone with a 16S study didn't understand that you can get PhiX sequence in your de-multiplexed samples, even though it wasn't supposed to be there, probably with a single-barcode):
C: Calling Single-Barcode Samples from Mixed Runs as Dual-Barcode Samples | Possibl
In other words, I think it makes a difference whether you have 30% unaligned reads or 5% unaligned reads.
That does seem pretty high.
I would probably recommend converting those reads to FASTQ (if not done already) and:
1) Look for Over-Represented Sequences using FastQC
2) Perform a de novo assembly and check the largest and most highly expressed unaligned contigs. While you may need to test multiple methods, one possibility is to use Oases for assembly, Bowtie2 + eXpress for alignment / quantification, and BLAST for annotation.
Yes - good point. Thank You. FastQC will also give you other statistics as well, including read lengths (assuming you have trimmed them before your 1st alignment).
If that is the problem, your read trimmer probably has a parameter to require a minimum length of read (which is probably a good idea).
roy.granit : Have you taken a few of these and blasted them at NCBI? I would start there before you invest a lot of time. If these are contaminants then you would want to trace back to see where they originated.
30% reads mapping to an external gene seems like a lot ,unless you have insertions of the gene everywhere and all of them are being expressed. You have already received the suggestion of aligning to this "external gene" to see if they are indeed originating there.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Well, if it is unmapped this suggests those reads do not originate from the (annotated) transcriptome. What makes you think they are meaningful?
Indeed, I'm aware that I cannot align these sequences to the transcriptome/genome, but these reads are derived from a certain cell line which might express some 'external' genes that were transfected and I wish to know how many of the reads map to these genes (before I generate a hybrid genome)
It sounds like you know where these "external" genes have come from. If you have a
fasta
file with the sequence of these "external" genes then use that for mapping and see how many of the unmapped reads map to this second fasta/genome file.Depending on your mapping algorithm you might have an
unmapped.bam
file. You can runbamToFastq
and then remap those unmapped reads to your "external" genome. From there you can determine if you want to remap all genes to a hybrid genome.