[ensembl-dev] difference between refseq_import and refseq_human_import

Daniel Barrell dbarrell at ebi.ac.uk
Fri Oct 31 10:37:22 GMT 2014


Hi Kiran,

You are correct, we have two sources of data from RefSeq for human (and 
mouse):

1) The logic_name 'refseq_human_import' is the version of RefSeq that we
use, in collaboration with the CCDS project
(http://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi), to create each new
CCDS release. It is therefore only a subset of the RefSeq annotation
set. It is supplied to us in GTF format by the CCDS team at NCBI.

2) The logic_name 'refseq_import' is our relatively new import of
annotation from the publicly available RefSeq GFF3 files (e.g.
ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/ref_GRCh38_top_level.gff3.gz).
This includes _all_ annotation from RefSeq, including annotation on the
alternate reference loci. So far we have imported the GFF3 annotation
for all merge species (human, mouse, rat, pig, zebrafish), and we will
import all other species in the next release.

To exclude the alternate loci, you can filter out the seq_regions that
carry the attribute attrib_type.code='non_ref', for example:

select g.analysis_id,a.logic_name,g.source,count(*)
   from gene g,analysis a
  where g.analysis_id=a.analysis_id
    and a.logic_name ='refseq_import'
    and g.seq_region_id NOT IN (select sra.seq_region_id
                                  from seq_region_attrib sra, attrib_type at
                                 where sra.attrib_type_id=at.attrib_type_id
                                   and at.code='non_ref')
  group by g.source,a.logic_name, g.analysis_id;
+-------------+---------------+-----------------+----------+
| analysis_id | logic_name    | source          | count(*) |
+-------------+---------------+-----------------+----------+
|        8360 | refseq_import | BestRefSeq      |    24792 |
|        8360 | refseq_import | Curated Genomic |       22 |
|        8360 | refseq_import | Gnomon          |     4945 |
+-------------+---------------+-----------------+----------+
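
As a cross-check, the complementary query can be sketched as below: it
counts the refseq_import genes that sit on the alternate loci, per
seq_region. This is only a minimal sketch. It assumes the usual core-schema
tables seq_region, seq_region_attrib and attrib_type, and that each
alternate-locus seq_region carries a single 'non_ref' attribute row; the
aliases (sr, att) and the gene_count column name are illustrative.

```sql
-- Sketch: genes from refseq_import located on alternate loci,
-- i.e. on seq_regions flagged with the 'non_ref' attribute.
SELECT sr.name, g.source, COUNT(*) AS gene_count
  FROM gene g
  JOIN analysis a            ON g.analysis_id = a.analysis_id
  JOIN seq_region sr         ON g.seq_region_id = sr.seq_region_id
  JOIN seq_region_attrib sra ON sr.seq_region_id = sra.seq_region_id
  JOIN attrib_type att       ON sra.attrib_type_id = att.attrib_type_id
 WHERE a.logic_name = 'refseq_import'
   AND att.code = 'non_ref'
 GROUP BY sr.name, g.source
 ORDER BY gene_count DESC;
```

Summed per source, these counts should account for the difference between
the unfiltered totals and the filtered ones shown above.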


Regards

Dan


On 31/10/14 06:00, Kiran Mukhyala wrote:
> Hello Ensembl!
>
> There are multiple gene sets imported from RefSeq in the human 
> other_features database in e77.
>
> select analysis_id,logic_name,source,count(*) from gene join
> analysis using (analysis_id) group by source
>
> analysis_id  logic_name           source           count(*)
> 8359         ccds_import          ccds             30493
> 8358         estgene              ensembl          30287
> 8166         refseq_human_import  refseq           26652
> 8360         refseq_import        BestRefSeq       27398
> 8360         refseq_import        Curated Genomic  26
> 8360         refseq_import        Gnomon           5363
>
> I am interested in the gene models imported from RefSeq using GRCh38 
> reference assembly excluding the alternate loci. It looks like genes 
> from 'BestRefSeq' include alternate loci while genes from 'refseq' don't.
>
> Could you please explain what the difference between them is?
>
> Thanks,
> -Kiran