[ensembl-dev] Incorrect HGVS nomenclature

Reece Hart reece at harts.net
Sun Feb 8 21:12:19 GMT 2015


On Wed, Feb 4, 2015 at 6:53 AM, Vasisht Tadigotla <
vasisht.tadigotla at courtagen.com> wrote:

> Thanks for the clarification. Is there a list of transcripts where the
> RefSeq sequence doesn’t match the reference?
>

There are several ways that transcripts may disagree with a reference
genome, such as:

   - Natural variation. Transcripts and reference genomes are from
   different individuals, so natural variation will show up as discrepancies.
   - Sequencing error. For example, NEFL (in GRCh37, at least) contains a
   poly-G repeat, almost certainly due to sequencing error, that leads to a
   frameshift.
   - Alignment ambiguity or other bioinformatics challenges. There are
   several flavors of this: genomic coordinates at NCBI may differ from
   genomic coordinates at UCSC (these groups align by different methods),
   and NCBI will occasionally report that a transcript has n transcript
   exons and m genomic exons (where n != m; see below).


At Invitae, several of us built a database called UTA (Universal Transcript
Archive) that tracks transcripts and transcript alignments for distinct
sources, transcript versions, alignment methods, and reference
assemblies/patches. The data are public and you can access some of the data
you seek like this:

snafu$ PGPASSWORD=uta_public psql -h uta.invitae.com  -U uta_public -d uta

uta_public@uta/uta=> select s_status, count(distinct tx_ac),
    array_to_string((array_agg(distinct tx_ac))[1:5], ' ') as examples
  from uta_20140210.bermuda
  where alt_ac ~ '^NC_0000' and s_status is not NULL
  group by s_status;
┌──────────┬───────┬────────────────────────────────────────────────────────────────────────────┐
│ s_status │ count │                                  examples                                  │
├──────────┼───────┼────────────────────────────────────────────────────────────────────────────┤
│ nlxdi    │    13 │ NM_001039127.3 NM_001170637.2 NM_001171904.1 NM_001195831.2 NM_001197224.2 │
│ nlxdI    │     7 │ NM_001039350.1 NM_001098212.1 NM_001936.3 NM_003585.4 NM_018906.2          │
│ nlxDi    │    19 │ NM_001080519.2 NM_001105553.1 NM_001137667.1 NM_001137668.1 NM_001142769.1 │
│ nlxDI    │     2 │ NM_001113239.2 NM_022740.4                                                 │
│ nlXdi    │     1 │ NM_003715.2                                                                │
│ nlXdI    │     2 │ NM_003585.3 NM_033487.1                                                    │
│ nlXDi    │     6 │ NM_001260492.1 NM_001260493.1 NM_001260495.1 NM_001260496.1 NM_001278267.1 │
│ nlXDI    │    35 │ NM_001008391.2 NM_001037675.2 NM_001077693.2 NM_001082575.1 NM_001144382.1 │
│ NlxdI    │   128 │ NM_000278.3 NM_000294.2 NM_001001520.1 NM_001001991.1 NM_001002027.1       │
│ NlxDi    │   188 │ NM_000348.3 NM_000597.2 NM_000682.5 NM_000694.2 NM_000804.2                │
│ NlxDI    │    16 │ NM_001005505.1 NM_001032291.2 NM_001039888.3 NM_001098515.1 NM_001135914.1 │
│ NlXdI    │    61 │ NM_000277.1 NM_000314.4 NM_000399.3 NM_001012709.1 NM_001033580.2          │
│ NlXDi    │    51 │ NM_000104.3 NM_000158.3 NM_000828.4 NM_001008239.2 NM_001012288.1          │
│ NlXDI    │    18 │ NM_000257.2 NM_001077527.2 NM_001204477.1 NM_001204478.1 NM_001271893.1    │
│ NLxdi    │ 31622 │ NM_000015.2 NM_000016.4 NM_000017.2 NM_000018.3 NM_000019.3                │
│ NLxDI    │     1 │ NM_014817.3                                                                │
│ NLXdi    │  3482 │ NM_000014.4 NM_000022.2 NM_000023.2 NM_000024.5 NM_000049.2                │
│ NLXDI    │     8 │ NM_000615.6 NM_001076682.3 NM_001242607.1 NM_001242608.1 NM_017495.5       │
└──────────┴───────┴────────────────────────────────────────────────────────────────────────────┘


s_status is the splign alignment status. Each letter is a flag: N means
the numbers of genomic and transcript exons are equal; L means the lengths
of those exons are equal; X, D, and I mean the alignment contains
substitutions, deletions, and insertions, respectively. A lower-case
letter negates the meaning. NLxdi is the ideal and most common case: the
numbers of exons are equal, the lengths of the transcript and reference
exons are equal, and there are no substitutions, deletions, or insertions.
The second most common case, NLXdi, covers transcripts whose only
differences are substitutions (presumably, nearly all of these are natural
variation and will correspond to dbSNP entries).
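
If you want to work with these codes programmatically, the flag scheme
described above can be unpacked with a few lines of Python. This is only a
sketch: decode_s_status, is_ideal, and the dictionary key names are
hypothetical helpers invented for illustration and are not part of UTA or
splign. It assumes every s_status value is exactly the five letters N, L,
X, D, I in that order, in either case.

```python
# Hypothetical helpers, not part of UTA. Flag meanings follow the
# description in this message; the key names are made up for illustration.
FLAGS = {
    "N": "exon_count_equal",
    "L": "exon_lengths_equal",
    "X": "has_substitutions",
    "D": "has_deletions",
    "I": "has_insertions",
}


def decode_s_status(status):
    """Map a five-letter s_status such as 'NlXdi' to a dict of booleans."""
    # Assumption: the letters always appear in the order N, L, X, D, I.
    if len(status) != 5 or status.upper() != "NLXDI":
        raise ValueError("unexpected s_status: %r" % (status,))
    # An upper-case letter asserts the property; lower case negates it.
    return {FLAGS[ch.upper()]: ch.isupper() for ch in status}


def is_ideal(status):
    """True only for NLxdi: same exon structure and no sub/del/ins."""
    return status == "NLxdi"


if __name__ == "__main__":
    for code in ("NLxdi", "NLXdi", "nlXDI"):
        print(code, decode_s_status(code), is_ideal(code))
```

Anything for which is_ideal() returns False is a candidate answer to the
original question, with the caveat that a substitution-only status (NLXdi)
is usually benign.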

The above query shows the number of distinct transcripts in each class and
up to 5 example transcripts per class.
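
For a rough sense of scale, the counts in the table can be tallied as
below. This is a sketch only: the numbers are copied by hand from the query
output above, and the sums assume each transcript falls into a single
class, which the query itself does not guarantee (a transcript with more
than one alignment could be counted under more than one s_status).

```python
# Counts copied from the query output in this message
# (distinct tx_ac per s_status, GRCh37 NC_0000* alignments only).
counts = {
    "nlxdi": 13, "nlxdI": 7, "nlxDi": 19, "nlxDI": 2,
    "nlXdi": 1, "nlXdI": 2, "nlXDi": 6, "nlXDI": 35,
    "NlxdI": 128, "NlxDi": 188, "NlxDI": 16,
    "NlXdI": 61, "NlXDi": 51, "NlXDI": 18,
    "NLxdi": 31622, "NLxDI": 1, "NLXdi": 3482, "NLXDI": 8,
}

# Sum over all classes; an upper bound if any transcript appears twice.
total = sum(counts.values())

# Everything that is not the ideal NLxdi class.
non_ideal = total - counts["NLxdi"]

# Classes whose first flag is lower-case n: exon counts differ.
exon_count_mismatch = sum(n for s, n in counts.items() if s[0] == "n")

print("total:", total)
print("non-ideal:", non_ideal)
print("substitutions only (NLXdi):", counts["NLXdi"])
print("exon count mismatch:", exon_count_mismatch)
```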

You're free to use the data as-is. All genomic coordinates are GRCh37, as
provided by source databases. I'm trying to carve out time for an
update/upgrade, but that's not a current priority.

-Reece