[ensembl-dev] superfamily ProteinFeature id inconsistency in human db

Allison Regier aregier at genome.wustl.edu
Tue Mar 27 18:28:51 BST 2012


Hello,
I am using Ensembl release 66 (homo_sapiens_core_66_37).
I am trying to pull all domain annotations for transcripts (see the code snippet below).
The db seems to be missing the IPR accession and description for the superfamily domain predictions.

Code snippet:
# $transcript_adaptor is a transcript adaptor obtained earlier (not shown)
my $stable_id = 'ENST00000561465';
my $transcript = $transcript_adaptor->fetch_by_stable_id($stable_id);
my $translation = $transcript->translation;
my $pfeatures = $translation->get_all_ProteinFeatures;

# Print start-end, analysis logic name, InterPro accession and description
foreach my $pf (@{$pfeatures}) {
	my $logic_name = $pf->analysis->logic_name;
	printf("%d-%d %s %s %s\n",$pf->start, $pf->end, $logic_name, $pf->interpro_ac, $pf->idesc);
}

Output:
55-138 pfam IPR003961 Fibronectin_type3
42-148 superfamily  
2-56 superfamily  
150-215 superfamily  
52-135 smart IPR003961 Fibronectin_type3
52-144 pfscan IPR003961 Fibronectin_type3

Looking directly in the db, I notice that the hit_name given to the superfamily hits does not match any id in the interpro table:

mysql> select protein_feature_id, protein_feature.seq_start, protein_feature.seq_end, hit_name, protein_feature.score
    -> from protein_feature, translation, transcript
    -> where protein_feature.translation_id = translation.translation_id
    ->   and translation.transcript_id = transcript.transcript_id
    ->   and transcript.stable_id = "ENST00000561465";
+--------------------+-----------+---------+----------+--------+
| protein_feature_id | seq_start | seq_end | hit_name | score  |
+--------------------+-----------+---------+----------+--------+
|             347408 |        55 |     138 | PF00041  |   45.8 |
|             473563 |        52 |     144 | PS50853  | 17.506 |
|             658004 |        52 |     135 | SM00060  |   43.3 |
|             865531 |        42 |     148 | SSF49265 |      0 |
|             865532 |         2 |      56 | SSF49265 |      0 |
|             865533 |       150 |     215 | SSF49265 |      0 |
+--------------------+-----------+---------+----------+--------+

mysql> select * from interpro where id="PF00041"
    -> ;
+-------------+---------+
| interpro_ac | id      |
+-------------+---------+
| IPR003961   | PF00041 |
+-------------+---------+

mysql> select * from interpro where id="SSF49265";
Empty set (0.01 sec)

mysql> select * from interpro where interpro_ac="IPR003961";
+-------------+---------+
| interpro_ac | id      |
+-------------+---------+
| IPR003961   | 49265   |
| IPR003961   | PF00041 |
| IPR003961   | PS50853 |
| IPR003961   | SM00060 |
+-------------+---------+

It seems likely that the hit_name SSF49265 in the protein_feature table should correspond to id 49265 in the interpro table. Because the join from protein_feature to interpro finds no row for the superfamily hits, the subsequent xref join cannot be made, so the IPR accession and domain description are missing.
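
To illustrate the mismatch, here is a minimal sketch of the lookup with a fallback that strips the "SSF" prefix. It does not touch the database or the Ensembl API: the hash simply copies the interpro rows shown above, and lookup_interpro_ac is a hypothetical helper name. It assumes the missing "SSF" prefix is the only difference between the two tables, which I have not verified beyond this one transcript.

```perl
use strict;
use warnings;

# Rows copied from the interpro table output above (id => interpro_ac).
my %interpro_ac_for = (
    '49265'   => 'IPR003961',
    'PF00041' => 'IPR003961',
    'PS50853' => 'IPR003961',
    'SM00060' => 'IPR003961',
);

# Hypothetical helper: try the hit_name as stored, then retry without a
# leading "SSF" (assumption: that prefix is the only difference).
sub lookup_interpro_ac {
    my ($hit_name) = @_;
    return $interpro_ac_for{$hit_name} if exists $interpro_ac_for{$hit_name};
    (my $stripped = $hit_name) =~ s/^SSF//;
    return $interpro_ac_for{$stripped};    # undef if there is still no match
}

# hit_name values taken from the protein_feature output above.
for my $hit_name (qw(PF00041 PS50853 SM00060 SSF49265)) {
    my $ac = lookup_interpro_ac($hit_name);
    printf "%s %s\n", $hit_name, defined $ac ? $ac : '(no match)';
}
```

With the fallback in place, SSF49265 resolves to the same accession as the pfam, smart and pfscan hits; without it, the lookup comes back empty, which matches the blank columns in my output.
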
Can you confirm this? Do you have any suggestions for a workaround?
Thanks,
Allison

Staff Scientist
The Genome Institute at Washington University St Louis


