[ensembl-dev] mysql homo_sapiens_otherfeatures_97_38 content

Wed Aug 14 12:19:48 BST 2019

Hi Jerome,
It is possible that we did not properly advertise this change, because the data itself is, in a way, still present in the database.

homo_sapiens_otherfeatures_95_38 > SELECT COUNT(*), source, logic_name FROM gene JOIN analysis USING(analysis_id) WHERE logic_name LIKE "refseq%" GROUP BY source, logic_name ORDER BY logic_name;
+----------+-----------------+---------------------+
| COUNT(*) | source          | logic_name          |
+----------+-----------------+---------------------+
|    28125 | refseq          | refseq_human_import |
|      575 | tRNAscan-SE     | refseq_import       |
|    11999 | Gnomon          | refseq_import       |
|    16472 | Curated Genomic | refseq_import       |
|    31637 | BestRefSeq      | refseq_import       |
+----------+-----------------+---------------------+

As you can see in the table above, we have two analyses referring to RefSeq: refseq_human_import and refseq_import.
The source for refseq_human_import is refseq only. The gene models stored under the refseq_human_import logic_name corresponded to the set RefSeq uses for the CCDS project, which means that not all RefSeq models were in that set.
The gene models in the refseq_import analysis are loaded from the GFF file available on the RefSeq FTP site and include all models generated by RefSeq.
refseq_human_import was therefore a subset of refseq_import. The GFF file for refseq_import also contains more information than the file used for the refseq_human_import set.

There are some differences in how the data is loaded. If you want to retrieve the RefSeq accession for a transcript, you will need to use the display_xref instead of the stable id. The reason is that the same RefSeq cDNA can match multiple times in a genome, whereas a stable id must be unique in the Ensembl schema.
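
As a rough sketch of the display_xref lookup described above, something along these lines should work. It assumes the usual core-style layout, where transcript.display_xref_id points at xref.xref_id. I have selected both display_label and dbprimary_acc because I am assuming, not asserting, which of the two carries the accession you need.

```sql
-- Sketch only: run against homo_sapiens_otherfeatures_97_38.
-- Assumption: transcript.display_xref_id references xref.xref_id.
SELECT t.stable_id,
       x.display_label,   -- assumed to hold the RefSeq accession
       x.dbprimary_acc    -- shown as well in case the two differ
FROM transcript t
JOIN analysis a USING (analysis_id)
JOIN xref x ON x.xref_id = t.display_xref_id
WHERE a.logic_name = 'refseq_import'
LIMIT 10;
```

Because the same accession can appear on several rows, group on the accession if you need one line per RefSeq cDNA.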

If you were using the source to retrieve all RefSeq genes, it is better to filter on the logic_name refseq_import instead.
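
For example, a query of roughly this shape selects the RefSeq genes by logic_name rather than by source. It is a sketch that mirrors the query at the top of this message, with the LIKE pattern replaced by an exact match. The database name is the one from your mail.

```sql
-- Sketch only: count RefSeq genes per source, selected via the analysis
-- logic_name instead of gene.source.
SELECT g.source, COUNT(*) AS gene_count
FROM homo_sapiens_otherfeatures_97_38.gene g
JOIN homo_sapiens_otherfeatures_97_38.analysis a USING (analysis_id)
WHERE a.logic_name = 'refseq_import'
GROUP BY g.source
ORDER BY g.source;
```

Dropping the GROUP BY and selecting g.* gives the gene rows themselves.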

Hope this helps,
Thibaut 

> On 13 Aug 2019, at 15:35, Jerome Roy <jerome at wuxinextcode.com> wrote:
> 
> Hi,
> 
> Apologies if there's a simple explanation I missed; I was accessing the mysql db and noticed missing data that I was expecting in the homo_sapiens_otherfeatures_97_38.gene table:
> 
> (none)> select source,count(*) from homo_sapiens_otherfeatures_97_38.gene group by source;
> 
> +-----------------+----------+
> | source          | count(*) |
> +-----------------+----------+
> | BestRefSeq      | 31502    |
> | ccds            | 33367    |
> | Curated Genomic | 16356    |
> | ensembl         | 261624   |
> | Gnomon          | 11787    |
> | tRNAscan-SE     | 575      |
> +-----------------+----------+
> 6 rows in set
> Time: 0.499s
> 
> (none)> select source,count(*) from homo_sapiens_otherfeatures_95_38.gene group by source;
> +-----------------+----------+
> | source          | count(*) |
> +-----------------+----------+
> | BestRefSeq      | 31637    |
> | ccds            | 32471    |
> | Curated Genomic | 16472    |
> | ensembl         | 261624   |
> | Gnomon          | 11999    |
> | refseq          | 28125    |
> | tRNAscan-SE     | 575      |
> +-----------------+----------+
> 7 rows in set
> Time: 0.497s
> 
> i.e. the rows with source='refseq' have disappeared from the ensembl97 (and ensembl96) database.
> 
> Is this change documented somewhere?
> 
> Best regards,
> -- 
> Jerome Roy 
> WuXiNextCODE
> https://www.wuxinextcode.com/