[ensembl-dev] mysql homo_sapiens_otherfeatures_97_38 content
Thibaut Hourlier
thibaut at ebi.ac.uk
Wed Aug 14 12:19:48 BST 2019
Hi Jerome,
It is possible that we have not properly advertised this change because the data itself is, in a way, still present in the database.
homo_sapiens_otherfeatures_95_38 > SELECT COUNT(*), source, logic_name FROM gene JOIN analysis USING(analysis_id) WHERE logic_name LIKE "refseq%" GROUP BY source, logic_name ORDER BY logic_name;
+----------+-----------------+---------------------+
| COUNT(*) | source | logic_name |
+----------+-----------------+---------------------+
| 28125 | refseq | refseq_human_import |
| 575 | tRNAscan-SE | refseq_import |
| 11999 | Gnomon | refseq_import |
| 16472 | Curated Genomic | refseq_import |
| 31637 | BestRefSeq | refseq_import |
+----------+-----------------+---------------------+
As you can see in the table above, we have two analyses referring to RefSeq: refseq_human_import and refseq_import.
The source for refseq_human_import is refseq only. The gene models stored under the refseq_human_import logic_name corresponded to the set RefSeq uses for the CCDS project, which means that not all RefSeq models were in that set.
The gene models in the refseq_import analysis are loaded from the GFF file available on the RefSeq FTP site and contain all models generated by RefSeq.
refseq_human_import was a subset of refseq_import. Also, the GFF file for refseq_import contains more information than the file for the refseq_human_import set.
There are some differences in how the data is loaded. If you want to retrieve the RefSeq accession for a transcript, you will need to use display_xref instead of the stable id. The reason is that the same RefSeq cDNA can match multiple times in a genome, whereas a stable id must be unique in the Ensembl schema.
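As a rough illustration of the display_xref lookup, here is a minimal sketch that runs against a toy in-memory SQLite database rather than the real MySQL server. The table and column names (transcript.display_xref_id, xref.display_label, analysis.logic_name) follow the Ensembl core schema, but the tables are cut down to the few columns needed, and every row, stable id and accession is invented for the example.

```python
import sqlite3


def refseq_accessions(conn):
    """Return (stable_id, accession) pairs for refseq_import transcripts.

    The accession is read from the xref row that display_xref_id points to,
    not from the transcript stable id.
    """
    return conn.execute(
        """
        SELECT t.stable_id, x.display_label
        FROM transcript t
        JOIN analysis a USING (analysis_id)
        JOIN xref x ON x.xref_id = t.display_xref_id
        WHERE a.logic_name = 'refseq_import'
        ORDER BY t.stable_id
        """
    ).fetchall()


# Toy stand-in for the otherfeatures database; all values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE analysis (analysis_id INTEGER PRIMARY KEY, logic_name TEXT);
    CREATE TABLE xref (xref_id INTEGER PRIMARY KEY, display_label TEXT);
    CREATE TABLE transcript (
        transcript_id   INTEGER PRIMARY KEY,
        stable_id       TEXT UNIQUE,   -- must be unique per transcript
        display_xref_id INTEGER,
        analysis_id     INTEGER
    );
    INSERT INTO analysis VALUES (1, 'refseq_import');
    INSERT INTO xref VALUES (10, 'NM_TOY.1');
    -- The same (made-up) accession placed at two locations in the genome.
    INSERT INTO transcript VALUES (100, 'toy_id_1', 10, 1);
    INSERT INTO transcript VALUES (101, 'toy_id_2', 10, 1);
    """
)

rows = refseq_accessions(conn)
for stable_id, accession in rows:
    print(stable_id, accession)
```

Both toy transcripts share one accession through the xref table while keeping distinct stable ids, which is the situation described above. Against the real database the same SELECT should work in MySQL, but check the column names against the schema of the release you are using.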
If you were using the source to retrieve all RefSeq genes, it is better to use the logic_name refseq_import instead.
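A small sketch of selecting by logic_name rather than by source, again on a toy in-memory SQLite database. The gene and analysis tables are reduced to the columns used in the query earlier in this message, and the rows (including the toy_other analysis) are invented; the helper name count_by_source is only for illustration.

```python
import sqlite3


def count_by_source(conn, logic_name):
    """Count genes per source for one analysis, selected by logic_name."""
    return conn.execute(
        """
        SELECT g.source, COUNT(*)
        FROM gene g
        JOIN analysis a USING (analysis_id)
        WHERE a.logic_name = ?
        GROUP BY g.source
        ORDER BY g.source
        """,
        (logic_name,),
    ).fetchall()


# Toy data only; the real tables hold many more columns and rows.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE analysis (analysis_id INTEGER PRIMARY KEY, logic_name TEXT);
    CREATE TABLE gene (
        gene_id     INTEGER PRIMARY KEY,
        source      TEXT,
        analysis_id INTEGER
    );
    INSERT INTO analysis VALUES (1, 'refseq_import');
    INSERT INTO analysis VALUES (2, 'toy_other');
    INSERT INTO gene VALUES (1, 'BestRefSeq', 1);
    INSERT INTO gene VALUES (2, 'BestRefSeq', 1);
    INSERT INTO gene VALUES (3, 'Gnomon', 1);
    INSERT INTO gene VALUES (4, 'ensembl', 2);
    """
)

counts = count_by_source(conn, "refseq_import")
print(counts)
```

Filtering on the analysis picks up every RefSeq-derived gene whatever its source value is (BestRefSeq, Gnomon and so on), so the selection does not depend on a single source string such as 'refseq'.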
Hope this helps,
Thibaut
> On 13 Aug 2019, at 15:35, Jerome Roy <jerome at wuxinextcode.com> wrote:
>
> Hi,
>
> Apologies if there's a simple explanation I missed; I was accessing the mysql db and noticed missing data that I was expecting in the homo_sapiens_otherfeatures_97_38.gene table:
>
> (none)> select source,count(*) from homo_sapiens_otherfeatures_97_38.gene group by source;
>
> +-----------------+----------+
> | source | count(*) |
> +-----------------+----------+
> | BestRefSeq | 31502 |
> | ccds | 33367 |
> | Curated Genomic | 16356 |
> | ensembl | 261624 |
> | Gnomon | 11787 |
> | tRNAscan-SE | 575 |
> +-----------------+----------+
> 6 rows in set
> Time: 0.499s
>
> (none)> select source,count(*) from homo_sapiens_otherfeatures_95_38.gene group by source;
> +-----------------+----------+
> | source | count(*) |
> +-----------------+----------+
> | BestRefSeq | 31637 |
> | ccds | 32471 |
> | Curated Genomic | 16472 |
> | ensembl | 261624 |
> | Gnomon | 11999 |
> | refseq | 28125 |
> | tRNAscan-SE | 575 |
> +-----------------+----------+
> 7 rows in set
> Time: 0.497s
>
> i.e. the rows with source='refseq' have disappeared from the ensembl97 (and ensembl96) database.
>
> Is this change documented somewhere?
>
> Best regards,
> --
> Jerome Roy
> WuXiNextCODE
> https://www.wuxinextcode.com/