[ensembl-dev] superfamily ProteinFeature id inconsistency in human db
Allison Regier
aregier at genome.wustl.edu
Tue Mar 27 18:28:51 BST 2012
Hello,
I am using Ensembl release 66 (homo_sapiens_core_66_37).
I am trying to pull all domain annotations for transcripts. (See code snippet below).
The db seems to be missing the IPR accession and description for the superfamily domain predictions.
Code snippet:
my $stable_id = 'ENST00000561465';
my $transcript = $transcript_adaptor->fetch_by_stable_id($stable_id);
my $translation = $transcript->translation;
my $pfeatures = $translation->get_all_ProteinFeatures;
# Print position, analysis, InterPro accession and description for each feature
foreach my $pf (@{$pfeatures}) {
    my $logic_name = $pf->analysis->logic_name;
    printf("%d-%d %s %s %s\n", $pf->start, $pf->end, $logic_name, $pf->interpro_ac, $pf->idesc);
}
Output:
55-138 pfam IPR003961 Fibronectin_type3
42-148 superfamily
2-56 superfamily
150-215 superfamily
52-135 smart IPR003961 Fibronectin_type3
52-144 pfscan IPR003961 Fibronectin_type3
Looking directly in the db, I notice that the hit_name given to the superfamily hits does not match any id in the interpro table:
mysql> select protein_feature_id,protein_feature.seq_start,protein_feature.seq_end,hit_name,protein_feature.score from protein_feature, translation, transcript where protein_feature.translation_id=translation.translation_id and translation.transcript_id=transcript.transcript_id and transcript.stable_id="ENST00000561465";
+--------------------+-----------+---------+----------+--------+
| protein_feature_id | seq_start | seq_end | hit_name | score |
+--------------------+-----------+---------+----------+--------+
| 347408 | 55 | 138 | PF00041 | 45.8 |
| 473563 | 52 | 144 | PS50853 | 17.506 |
| 658004 | 52 | 135 | SM00060 | 43.3 |
| 865531 | 42 | 148 | SSF49265 | 0 |
| 865532 | 2 | 56 | SSF49265 | 0 |
| 865533 | 150 | 215 | SSF49265 | 0 |
+--------------------+-----------+---------+----------+--------+
mysql> select * from interpro where id="PF00041"
-> ;
+-------------+---------+
| interpro_ac | id |
+-------------+---------+
| IPR003961 | PF00041 |
+-------------+---------+
mysql> select * from interpro where id="SSF49265";
Empty set (0.01 sec)
mysql> select * from interpro where interpro_ac="IPR003961";
+-------------+---------+
| interpro_ac | id |
+-------------+---------+
| IPR003961 | 49265 |
| IPR003961 | PF00041 |
| IPR003961 | PS50853 |
| IPR003961 | SM00060 |
+-------------+---------+
It seems likely that hit_name SSF49265 in the protein_feature table should correspond to id 49265 in the interpro table. Because the ids differ by the SSF prefix, the join from protein_feature to interpro finds no match, so the subsequent xref join cannot be made either, and both the IPR accession and the domain description are missing.
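To illustrate the suspected mismatch, here is a rough sketch of a query that joins the two tables while tolerating the prefix. It assumes the only difference is a leading "SSF" on the protein_feature side, which I have only checked for this one entry; the table aliases pf and i are just for readability.

```sql
-- Sketch only: assumes superfamily hit_names are the interpro ids with "SSF" prepended.
SELECT pf.protein_feature_id,
       pf.seq_start,
       pf.seq_end,
       pf.hit_name,
       i.interpro_ac
FROM protein_feature pf
LEFT JOIN interpro i
       ON i.id = pf.hit_name
       -- SUBSTRING is 1-indexed in MySQL, so position 4 drops the three-letter prefix
       OR (pf.hit_name LIKE 'SSF%' AND i.id = SUBSTRING(pf.hit_name, 4))
WHERE pf.protein_feature_id IN (865531, 865532, 865533);
```

The LEFT JOIN keeps the feature rows even when no interpro entry matches, so any remaining NULL in interpro_ac would point to a different cause than the prefix.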
Can you confirm this? Do you have any suggestions for a workaround?
thanks,
Allison
Staff Scientist
The Genome Institute at Washington University St Louis