I have 3 RNA-seq files and I would like to compare them to find the overlapping and unique reads between them. I think File1 is the merge of File2 and File3, but I am not sure, so I decided to compare the 3 files to find out whether there are any unique reads between them.
I have the .fastq files (raw data), the .bam files (after aligning) and the count tables derived from them. At which step is it better to do the comparison, and how can I compare them?
I have also checked the number of reads before and after alignment, as well as the number of mapped reads, and I found that File2 and File3 combined are a bit bigger than File1.
        Number of reads   Number of mapped reads   File size
File1   10403419          10294966                 1.8 GB
File2   5539406           5487472                  944.4 MB
File3   5517327           5466102                  940.7 MB
You should compare them after aligning. Have a look at bedtools intersect and bedtools subtract.

File 1 reads =/= File 2 reads + File 3 reads, if those numbers above are correct (5539406 + 5517327 = 11056733, versus 10403419 in File 1). So at a minimum, simple addition does not explain the counts. If you feel that somehow the reads in File 2 and File 3 have been combined into File 1, then you can extract a subset of read headers from File 2 and File 3 and see if they are present in File 1 (raw data). Comparing sequence/count data does not make a lot of sense, since at that level it is not assignable to a particular file.
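
The header check could look roughly like the Python sketch below. It is only an illustration, under some assumptions: the FASTQ files use the usual 4 lines per record (no wrapped sequences), the read ID is the first whitespace-separated token after the leading "@", and the function names `read_ids` and `fraction_found` are made up for this example. It samples the first `limit` read IDs from one file and streams the other file to count how many of them turn up there.

```python
import gzip
from itertools import islice


def read_ids(path, limit=None):
    """Yield read IDs from a 4-line-per-record FASTQ file (plain or .gz)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    stop = None if limit is None else limit * 4
    with opener(path, "rt") as handle:
        # The header is the first line of every 4-line record.
        for header in islice(handle, 0, stop, 4):
            # Drop the leading '@' and keep the ID up to the first whitespace.
            fields = header[1:].split()
            if fields:  # skip blank trailing lines
                yield fields[0]


def fraction_found(part_path, merged_path, limit=100_000):
    """Return (found, sampled): how many of the first `limit` read IDs
    in part_path also occur anywhere in merged_path."""
    sample = set(read_ids(part_path, limit))
    # Stream the merged file so only the sample is held in memory.
    found = {rid for rid in read_ids(merged_path) if rid in sample}
    return len(found), len(sample)
```

For example, `fraction_found("File2.fastq", "File1.fastq")` (file names assumed here) returns how many of the sampled File 2 headers occur in File 1. If nearly all sampled headers from both File 2 and File 3 are found, that supports the idea that File 1 contains their reads. A low fraction suggests it is not a simple merge. For paired-end data, run the check on one mate file at a time.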