[ensembl-dev] which version of genome to use for RNA-seq mapping

Wed Nov 14 21:08:39 GMT 2012

Dear Ensembl team,
I would like to map RNA-seq reads to the human genome using TopHat, 
but I am unsure which version of the genome I should use to do this.

After some searching, I have seen that people tend to use the non-masked 
"toplevel" version of the genome 
(ftp://ftp.ensembl.org/pub/release-69/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.69.dna.toplevel.fa.gz), 
but I am wondering if this is a good idea: because of the redundant 
sequences found in the patches, the aligner will conclude that some 
reads map at multiple locations and these reads will be discarded.
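
To get a feel for how many extra sequences the toplevel file carries, a 
minimal sketch like the one below could tally the FASTA records by the type 
token on each header line. It assumes headers of the form 
">NAME dna:TYPE LOCATION" (for example "dna:chromosome" or 
"dna:supercontig"); that layout is my assumption and should be checked 
against the actual file. The function names are hypothetical.

```python
import gzip
from collections import Counter


def sequence_types(fasta_lines):
    """Count FASTA records by the second token of each header line.

    Assumes headers look like ">NAME dna:TYPE LOCATION"; records whose
    header has no second token are counted as "unknown".
    """
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            seq_type = fields[1] if len(fields) > 1 else "unknown"
            counts[seq_type] += 1
    return counts


def summarize(path):
    """Tally record types in a gzip-compressed FASTA file."""
    with gzip.open(path, "rt") as handle:
        return sequence_types(handle)
```

Running summarize() on both the toplevel and the primary_assembly files and 
comparing the counts would show how many additional records (patches and 
alternate sequences) the toplevel file adds. It does not say which of them 
are redundant with the primary assembly.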

Another option would be to map to the "primary assembly" 
(ftp://ftp.ensembl.org/pub/release-69/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.69.dna.primary_assembly.fa.gz), 
but this ignores some of the recent improvements to the genome (fix patches).

Ideally I would like to use a "golden path" assembly (sum of all 
top-level sequences, omitting any redundant regions). What would you 
suggest?
Thanks
Julien

-- 
Julien Roux, PhD
Gilad lab, Department of Human Genetics, University of Chicago
http://giladlab.uchicago.edu/
920 East 58th Street, CLSC 317, Chicago, IL 60637, USA
tel: +1-773-834-1984   fax: +1-773-834-8470