[ensembl-dev] difference between refseq_import and refseq_human_import
Daniel Barrell
dbarrell at ebi.ac.uk
Fri Oct 31 10:37:22 GMT 2014
Hi Kiran,
You are correct. We have two sources of RefSeq data for human (and
mouse):
1) The logic_name 'refseq_human_import' is a version of RefSeq that we
use in collaboration with the CCDS project
(http://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi) to create a new CCDS
release. It is therefore only a subset of the RefSeq annotation set, and
it is supplied to us in GTF format by the CCDS team at NCBI.
2) The logic_name 'refseq_import' is our relatively new import of
annotation from the publicly available RefSeq GFF3 files (e.g.
ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/ref_GRCh38_top_level.gff3.gz).
This includes _all_ annotation from RefSeq, including annotation on the
alternate reference loci. So far we have imported the GFF3 annotation
for all merge species (human, mouse, rat, pig, zebrafish), and we will
import all other species in the next release.
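As a rough illustration of where values such as 'BestRefSeq', 'Gnomon' and 'Curated Genomic' come from: in GFF3 the second tab-separated column is the feature source, and the sketch below tallies it for gene features. It assumes that gene.source in the refseq_import set mirrors that column, which this message does not state explicitly. The function name count_gene_sources and the sample lines are made up for the example and are not taken from the real NCBI file.

```python
from collections import Counter


def count_gene_sources(lines):
    """Tally GFF3 column 2 (source) for rows whose column 3 (type) is 'gene'."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and GFF3 directives/comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed 9-column GFF3 feature line
        if cols[2] == "gene":
            counts[cols[1]] += 1
    return counts


# Hypothetical GFF3 lines, invented purely for illustration.
SAMPLE = [
    "##gff-version 3",
    "chrA\tBestRefSeq\tgene\t100\t900\t.\t+\t.\tID=gene0",
    "chrA\tBestRefSeq\tmRNA\t100\t900\t.\t+\t.\tID=rna0;Parent=gene0",
    "chrA\tGnomon\tgene\t2000\t3000\t.\t-\t.\tID=gene1",
    "chrB\tCurated Genomic\tgene\t50\t400\t.\t+\t.\tID=gene2",
]

source_counts = count_gene_sources(SAMPLE)
for source, n in sorted(source_counts.items()):
    print(source, n)
```

To try it on a real file, the same function can be fed the lines of the decompressed GFF3, for example via gzip.open(path, "rt").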
To exclude the alternate loci, filter out the seq_regions that carry the
'non_ref' attribute (attrib_type.code='non_ref'), for example:
select g.analysis_id, a.logic_name, g.source, count(*)
  from gene g, analysis a
 where g.analysis_id = a.analysis_id
   and a.logic_name = 'refseq_import'
   and g.seq_region_id NOT IN (select sra.seq_region_id
                                 from seq_region_attrib sra, attrib_type at
                                where sra.attrib_type_id = at.attrib_type_id
                                  and at.code = 'non_ref')
 group by g.source, a.logic_name, g.analysis_id;
+-------------+---------------+-----------------+----------+
| analysis_id | logic_name    | source          | count(*) |
+-------------+---------------+-----------------+----------+
|        8360 | refseq_import | BestRefSeq      |    24792 |
|        8360 | refseq_import | Curated Genomic |       22 |
|        8360 | refseq_import | Gnomon          |     4945 |
+-------------+---------------+-----------------+----------+
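For anyone who wants to see the filter work without access to the other_features database, here is a minimal sketch that runs the same kind of NOT IN subquery against an in-memory SQLite database. The table and column names follow the query above, but the schema is cut down to the columns the query touches, and every row (IDs, seq_regions, the extra 'toplevel' attribute) is invented for the example. SQLite stands in for MySQL here, so this shows the logic only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table analysis (analysis_id integer primary key, logic_name text);
create table gene (gene_id integer primary key, analysis_id integer,
                   seq_region_id integer, source text);
create table attrib_type (attrib_type_id integer primary key, code text);
create table seq_region_attrib (seq_region_id integer, attrib_type_id integer,
                                value text);

-- Toy data: seq_region 100 is a primary region, 200 is an alternate locus.
insert into analysis values (1, 'refseq_import'), (2, 'refseq_human_import');
insert into attrib_type values (10, 'non_ref'), (11, 'toplevel');
insert into seq_region_attrib values (100, 11, '1'), (200, 11, '1'),
                                     (200, 10, '1');
insert into gene values (1, 1, 100, 'BestRefSeq'),
                        (2, 1, 100, 'Gnomon'),
                        (3, 1, 200, 'BestRefSeq'),
                        (4, 2, 100, 'refseq');
""")

# Same shape as the query above: keep refseq_import genes whose seq_region
# does not carry the 'non_ref' attribute.
QUERY = """
select g.source, count(*)
  from gene g
  join analysis a on g.analysis_id = a.analysis_id
 where a.logic_name = 'refseq_import'
   and g.seq_region_id not in (
           select sra.seq_region_id
             from seq_region_attrib sra
             join attrib_type att on sra.attrib_type_id = att.attrib_type_id
            where att.code = 'non_ref')
 group by g.source
 order by g.source
"""

rows = conn.execute(QUERY).fetchall()
for source, n in rows:
    print(source, n)
```

In the toy data, gene 3 sits on the seq_region flagged 'non_ref', so it is dropped and only the two genes on seq_region 100 are counted.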
Regards
Dan
On 31/10/14 06:00, Kiran Mukhyala wrote:
> Hello Ensembl!
>
> There are multiple gene sets imported from RefSeq in the human
> other_features database in e77.
>
> select analysis_id,logic_name,source,count(*) from gene join
> analysis using (analysis_id) group by source
>
> analysis_id  logic_name           source           count(*)
> 8359         ccds_import          ccds             30493
> 8358         estgene              ensembl          30287
> 8166         refseq_human_import  refseq           26652
> 8360         refseq_import        BestRefSeq       27398
> 8360         refseq_import        Curated Genomic  26
> 8360         refseq_import        Gnomon           5363
>
> I am interested in the gene models imported from RefSeq using GRCh38
> reference assembly excluding the alternate loci. It looks like genes
> from 'BestRefSeq' include alternate loci while genes from 'refseq' don't.
>
> Could you please explain what the difference between them is?
>
> Thanks,
> -Kiran