Do I have to align the whole data on the genome and then filter those regions I will ? (costs much time !!)
I guess you have to, if you want to retain only the reads that map on chromosome 1. If we knew the positions already, why would we even map them?
Is the following command right ? samtools view -b ¨chr1¨ > ERR188044_1_chr1 (version 0.01.19) -Does the same go for the reverse read file too ?
This command selects the bam entries that map on chromosome 1. Beware that -b
will generate a bam output (compressed) without the header (that requires the -h
option); the header is important for following usages of bam files, so keep it.
To get the reads, you need to extract them from the bam file. Since I see you're not very expert in this, I advice you against doing this yourself, try instead with bedtools bamtofastq. There you can also specify to separate the output in read_1 and read_2 files. Beware that the bamtofastq
module requires the reads to be sorted by name (samtools sort -n
).
Do I have to determine the start and end position too ? or just writing chr1 is enough ? (in manual I see only the name of the chromosome is also enough, http://www.htslib.org/doc/samtools-0.1.19.html)
Just the chromosome is enough, but I suppose you have to index your bam file (samtools index
) and you will only be available to index it if it is sorted by genome coordinate (samtools sort
). This is not the same sort as before.
In combination of two or more regions how may I write the command ? for example combination of chr1 and chr2 ?
Why not making a bed file with the coordinates you want to retain, to use with the -L
options of samtools view
instead?
How may I say how much percent of the whole data is filtered ?
samtools stats
will tell you, among a lot of things, the number of reads (and of pairs) that you have in your bam file. If you run samtools stats
for both the unfiltered and the filtered files, simply dividing those two numbers will tell you this bit.
Where is the link to your last post? Your questions are not very clear to me. But if you are interested in downsampling paired-end fasta, you may try
seqtk sample
command https://github.com/lh3/seqtkHow to down-sample a full data please have a look at the link
Please upgrade to latest samtools (if you are still using v. 0.1.19). There is no reason to continue using that ancient version.
I am not sure why you are sub-sampling aligned data when you have the original dataset available? As @Devon indicated in the last thread using raw reads (and adding more of them step-wise) should be the way to go.
Dear Genomax,
The reason why I am using the old version is that I have to work with the same version of software packages used in New Tuxedo paper (Hisat, StringTie,Ballgown).
Please have a look at theis link https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5032908/
in the Overview of the protocol section, look up please the chromosome X, I have to do the same but in a bit larger than chromosome X (maybe X and 1, and the next time X and 1 and 2 chromosomes). I have to repeat it a couple of times and enlarge the files. Then the mapping is done on the larger files too. in the end results will be analyzed.
Please let me know if you have any advice.
Thanks.
From tuxedo paper:
You can use the latest version (see the
or later
part). By the time you are done with alignments samtools is not going to make any changes to the alignment results.Reads that they have provided in the download from tuxedo paper are only a subset selected that map to
chrX
. What do you hope to achieve by doing this for additional chromosomes? These example reads should not map significantly to any new chromosomes you add.How are you going to do that. Align the full original data and extract reads from chr 1 then 2 etc? It is not clear what you are trying to do, at least to me.
computational time is what I need to observe. To see how aligner x performs in comparison with another aligner.
This would not be the way to do it then.
You could sub-sample original full fastq data files and then align those subsets (n, 2n, 3n etc) against entire genome/a chromosome or whatever you choose as the reference. That said,
computational time
is not a good criteria for comparison since you are assuming that every aligner is going to work equally well (not likely) and the defaults are identical across all of them (again not the case for sure).See this for more.
Thanks Genomax.
Your comment looks very helpful.