[ensembl-dev] loading NCBI exon structures into Ensembl

Bronwen Aken ba1 at sanger.ac.uk
Thu Jun 2 21:08:42 BST 2011


Hi Reece,

On 2 Jun 2011, at 18:05, Reece Hart wrote:

> On Wed, Jun 1, 2011 at 1:39 AM, Bronwen Aken <ba1 at sanger.ac.uk> wrote:
> For the RefSeq models, RefSeq provides us with a flat file giving the genomic coordinates for all genes, transcripts and exons in their gene set. We load this up directly into the otherfeatures database and do not change any coordinates.
> 
> Hi Bronwen-
> 
> Does NCBI provide this to you directly, or is it buried in a ftp directory somewhere? Getting this data via eutils is slow and convoluted, so I'd be thrilled to have a simpler route.

NCBI provides us with a file directly (when we do a new round of CCDS comparisons). To my knowledge, it's not publicly available on an FTP site.

> 
> In v62 otherfeatures, meta shows the xref.timestamp as '2010-06-22 23:25:07'. Do I correctly infer that the otherfeatures data are about a year old?


The data in otherfeatures are added and updated piecemeal, so it's best to look at the created_date in the gene_stable_id table for the particular analysis you're interested in. We also mention data updates on the "What's New" page for each release, e.g. the RefSeqs first appeared in the human otherfeatures database in e57: http://www.ensembl.org/info/website/news/index.html?id=57&submit=Go

For e62, you'll see that the assembly patch genes were added to the database most recently:

mysql -uanonymous -hensembldb.ensembl.org -P5306 -Dhomo_sapiens_otherfeatures_62_37g -e "select created_date,logic_name from gene_stable_id gsi, gene g, analysis a  where g.gene_id=gsi.gene_id and g.analysis_id=a.analysis_id group by logic_name"
+---------------------+------------------------+
| created_date        | logic_name             |
+---------------------+------------------------+
| 2011-03-02 15:05:20 | assembly_patch_ensembl |
| 2010-09-23 17:28:12 | ccds_import            |
| 2010-08-20 16:49:12 | estgene                |
| 0000-00-00 00:00:00 | refseq_human_import    |
+---------------------+------------------------+
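
A side note on the query above: it selects created_date without aggregating it, so under MySQL's permissive GROUP BY handling the date shown for each logic_name comes from an arbitrary gene in that group (and a server running with ONLY_FULL_GROUP_BY enabled rejects the query). If the most recent date per analysis is what you're after, wrapping the column in MAX() makes that explicit. Here is a minimal sketch in Python that runs such a query against a throwaway in-memory SQLite database. The table and column names are taken from the query above; the toy rows, the toy_import_* logic names and the helper function names are made up for illustration and are not Ensembl data.

```python
import sqlite3

# Same join as the mysql one-liner, but with an explicit aggregate so the
# result does not depend on which row the server happens to pick per group.
LATEST_PER_ANALYSIS = """
SELECT MAX(gsi.created_date) AS latest_created, a.logic_name
FROM gene_stable_id gsi
JOIN gene g ON g.gene_id = gsi.gene_id
JOIN analysis a ON a.analysis_id = g.analysis_id
GROUP BY a.logic_name
ORDER BY a.logic_name
"""


def build_toy_db():
    """Create a tiny stand-in schema with hypothetical rows (not Ensembl data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE analysis (analysis_id INTEGER PRIMARY KEY, logic_name TEXT);
        CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, analysis_id INTEGER);
        CREATE TABLE gene_stable_id (gene_id INTEGER, created_date TEXT);
        INSERT INTO analysis VALUES (1, 'toy_import_a'), (2, 'toy_import_b');
        INSERT INTO gene VALUES (10, 1), (11, 1), (20, 2);
        INSERT INTO gene_stable_id VALUES
            (10, '2010-01-01 00:00:00'),
            (11, '2011-01-01 00:00:00'),
            (20, '2009-06-01 00:00:00');
        """
    )
    return conn


def latest_per_analysis(conn):
    """Return (latest created_date, logic_name) tuples, one per analysis."""
    return conn.execute(LATEST_PER_ANALYSIS).fetchall()


toy_rows = latest_per_analysis(build_toy_db())
for created, logic_name in toy_rows:
    print(created, logic_name)
```

The same SELECT text should also work when passed to mysql via -e against the public server, though I have only sketched it against the toy schema here.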

Oops, we seem to have left off the timestamp for the RefSeq models in the above case. They are from June 2009. Don't worry though, because a new RefSeq set will become available in e63:

mysql -Dhomo_sapiens_otherfeatures_63_37 -e "select created_date,logic_name from gene_stable_id gsi, gene g, analysis a  where g.gene_id=gsi.gene_id and g.analysis_id=a.analysis_id group by logic_name"
+---------------------+------------------------+
| created_date        | logic_name             |
+---------------------+------------------------+
| 2011-03-02 15:05:20 | assembly_patch_ensembl |
| 2011-02-09 22:37:24 | ccds_import            |
| 2010-08-20 16:49:12 | estgene                |
| 2010-12-15 00:00:00 | refseq_human_import    |
+---------------------+------------------------+
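
If you want to compare listings like the two above between releases without eyeballing them, the boxed mysql output is easy to parse. The following is a rough sketch in Python, assuming the output has been pasted into strings exactly as shown in this message; the function and variable names are my own for illustration. It treats the '0000-00-00 00:00:00' placeholder as a missing timestamp and reports which analyses changed between e62 and e63.

```python
from datetime import datetime

ZERO_DATE = "0000-00-00 00:00:00"  # MySQL placeholder for an unset datetime

E62_OUTPUT = """\
+---------------------+------------------------+
| created_date        | logic_name             |
+---------------------+------------------------+
| 2011-03-02 15:05:20 | assembly_patch_ensembl |
| 2010-09-23 17:28:12 | ccds_import            |
| 2010-08-20 16:49:12 | estgene                |
| 0000-00-00 00:00:00 | refseq_human_import    |
+---------------------+------------------------+
"""

E63_OUTPUT = """\
+---------------------+------------------------+
| created_date        | logic_name             |
+---------------------+------------------------+
| 2011-03-02 15:05:20 | assembly_patch_ensembl |
| 2011-02-09 22:37:24 | ccds_import            |
| 2010-08-20 16:49:12 | estgene                |
| 2010-12-15 00:00:00 | refseq_human_import    |
+---------------------+------------------------+
"""


def parse_mysql_table(text):
    """Map logic_name -> datetime (or None for the zero date)."""
    dates = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +----+ border lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if cells[0] == "created_date":
            continue  # skip the header row
        created, logic_name = cells
        if created == ZERO_DATE:
            dates[logic_name] = None
        else:
            dates[logic_name] = datetime.strptime(created, "%Y-%m-%d %H:%M:%S")
    return dates


e62 = parse_mysql_table(E62_OUTPUT)
e63 = parse_mysql_table(E63_OUTPUT)

# Analyses whose timestamp was never filled in for e62.
missing_in_e62 = sorted(name for name, date in e62.items() if date is None)

# Analyses whose created_date differs between the two releases.
changed = sorted(name for name in e63 if e62.get(name) != e63[name])

print("missing timestamp in e62:", missing_in_e62)
print("changed between e62 and e63:", changed)
```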

As you can see, there is quite a time lag between the date we receive the file from NCBI and the date these models become available on our website. This is because, after the RefSeq data were uploaded, we finalised our CCDS comparisons (GENCODE vs RefSeq in the case of human) before entering the RefSeq models into the otherfeatures database. After that, the otherfeatures database enters the Ensembl "release cycle", which is about 2.5 months long. We are actively working with NCBI to increase the frequency of these RefSeq imports.

> 
> I would love to be able to recreate exactly what Ensembl does for otherfeatures, but with newer data. Are there scripts and/or documentation somewhere?


The otherfeatures database holds a variety of data that don't fit into the core database, e.g. Exonerate alignments of all ESTs, imports of the RefSeq and CCDS sets, and genes on the new GRC patch regions. (There is a basic description of the database types here: ensembl-doc/pipeline_docs/database_types.txt.)

ESTs are briefly mentioned in ensembl-doc/pipeline_docs/the_genebuild_process.txt. cDNAs are aligned using the scripts in ensembl-pipeline/scripts/cDNA_update. There is no public documentation on importing the RefSeq and CCDS sets, as these data are received privately, but we can provide you with the import script if it helps.

Cheers,
Bronwen
 

> 
> Thanks,
> Reece
> 
