[ensembl-dev] FW: Download human body map 2.0 transcript coordinates

Thibaut Hourlier th3 at sanger.ac.uk
Mon Feb 13 11:53:10 GMT 2012


Dear Ying,
The easiest approach would be to use the Ensembl Human database, which you 
can download from the FTP site: 
http://www.ensembl.org/info/data/ftp/index.html. The mapping will then 
work because you will have the Ensembl coordinates.

If you are trying to map Ensembl data from the Human BodyMap directly 
to a UCSC database, it won't work because internal ids such as 
seq_region_id are different.
You will need to use the seq_region names, which you can find in the 
seq_region table (Ensembl schema). As we are also using GRCh37, the 
seq_region names will be the same, except for the chromosomes: 
chromosome 1 is "chr1" in UCSC and "1" in Ensembl.

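A minimal sketch of that name conversion in Perl, assuming the only 
differences that matter are the "chr" prefix on the chromosomes and the 
mitochondrial name (chrM in UCSC, MT in Ensembl). The function names are 
hypothetical, and unplaced scaffolds are named differently in the two 
resources, so they are passed through unchanged here:

```perl
use strict;
use warnings;

# Hypothetical helper: UCSC-style name ("chr1") to Ensembl-style name ("1").
sub ucsc_to_ensembl {
    my ($name) = @_;
    return 'MT' if $name eq 'chrM';
    return $1   if $name =~ /^chr(\d+|X|Y)$/;
    return $name;    # anything else is left as it is
}

# Hypothetical helper: Ensembl-style name ("1") to UCSC-style name ("chr1").
sub ensembl_to_ucsc {
    my ($name) = @_;
    return 'chrM'      if $name eq 'MT';
    return "chr$name"  if $name =~ /^(?:\d+|X|Y)$/;
    return $name;    # anything else is left as it is
}

print ucsc_to_ensembl('chr1'), "\n";
print ensembl_to_ucsc('X'),    "\n";
```

The converted name can then be compared with the name column of the 
seq_region table instead of the internal seq_region_id.
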
In the Perl code of my previous message, I was getting the coordinates 
of a known gene in the human database. Then, using the coordinates of the 
gene, I was fetching all the transcripts from the RNASeq database that 
overlap the region. This is why I used a stable id like 
"ENSG00....".

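The overlap step can also be done without the API once both sets of 
coordinates use the same seq_region names. A minimal sketch, assuming each 
feature is a hash with the hypothetical keys name, start and end (1-based, 
inclusive coordinates, as stored in the Ensembl tables). The gene region 
below is made up for illustration and strand is ignored:

```perl
use strict;
use warnings;

# Two features overlap when they are on the same seq_region and
# neither one ends before the other starts.
sub overlaps {
    my ($x, $y) = @_;
    return 0 if $x->{name} ne $y->{name};
    return ($x->{start} <= $y->{end} && $y->{start} <= $x->{end}) ? 1 : 0;
}

# Illustrative region only; not the coordinates of a real gene.
my $gene = { name => '8', start => 74_700_000, end => 74_750_000 };

# Two transcript models taken from the query output quoted below.
my @transcripts = (
    { id => 'ROUGHT00000241812', name => '8', start => 74_702_071, end => 74_742_711 },
    { id => 'ROUGHT00000241814', name => '8', start => 74_857_620, end => 74_885_297 },
);

# Print the ids of the models that overlap the region.
print "$_->{id}\n" for grep { overlaps($gene, $_) } @transcripts;
```
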
Regards,
Thibaut

On 10/02/12 18:36, Li, Ying L wrote:
>
> Dear Thibaut,
>
> Thanks for your reply.  I am not using the API at the moment.  The 
> stable id from the 'gene' table does not start with ENSG; it looks like 
> 'ROUGHG00000203265-v5.1-74-747-3-NC-0---1'. So I tried to map the 
> seq_region(_id/_start/_end/_strand) to the UCSC 'gene coord' of GRCh37 
> without any success. Did I use the wrong coordinates (UCSC)?
>
> Thanks a lot for your help,
>
> Best,
>
> Ying
>
> *From:*Thibaut Hourlier [mailto:th3 at sanger.ac.uk]
> *Sent:* Monday, January 30, 2012 9:44 AM
> *To:* Li, Ying L {PXTP~Nutley}
> *Cc:* dev at ensembl.org
> *Subject:* Re: [ensembl-dev] FW: Download human body map 2.0 
> transcript coordinates
>
> Dear Ying,
> As I said on the link you are quoting, we recommend using the 
> Perl API: http://www.ensembl.org/info/docs/Doxygen/core-api/index.html
>
> The only way you can map the reads to a gene is with the 
> seq_region(_id/_start/_end/_strand) information.
> If you know the stable id (ENSG000....) of the gene you are interested 
> in, it's quite simple with the API:
>
> use strict;
> use warnings;
> use Bio::EnsEMBL::DBSQL::DBAdaptor;
>
> # Core database: used to look up the known gene
> my $db = Bio::EnsEMBL::DBSQL::DBAdaptor->new(
>                  -host   => 'ensembldb.ensembl.org',
>                  -port   => 5306,
>                  -user   => 'anonymous',
>                  -dbname => 'homo_sapiens_core_65_37');
> my $ga = $db->get_GeneAdaptor();
> my $gene = $ga->fetch_by_stable_id("ENSG000XXXX");
> # feature_Slice covers only the gene; $gene->slice is usually the whole chromosome
> my $slice = $gene->feature_Slice;
> # RNASeq database: holds the Human BodyMap models
> my $rnaseqdb = Bio::EnsEMBL::DBSQL::DBAdaptor->new(
>                  -host   => 'ensembldb.ensembl.org',
>                  -port   => 5306,
>                  -user   => 'anonymous',
>                  -dbname => 'homo_sapiens_rnaseq_65_37');
> my $rnaseqsa = $rnaseqdb->get_SliceAdaptor();
> my $rnaseqslice = $rnaseqsa->fetch_by_name($slice->name);
> # The first argument is load_exons, the second is the analysis logic_name
> my @transcripts = @{$rnaseqslice->get_all_Transcripts(1, 'skeletal_rnaseq')};
> foreach my $transcript (@transcripts) {
>     foreach my $sf (@{$transcript->get_all_supporting_features()}) {
>          # Print the number of reads that span the intron
>          print STDOUT $sf->hit_name, ' :', $sf->score, "\n";
>     }
> }
>
> The number of reads that span the introns is the score you can find in 
> the dna_align_feature table of the rnaseq database.
>
> Regards,
> Thibaut
>
>
> On 27/01/12 19:09, Li, Ying L wrote:
>
> Dear Thibaut,
>
> I am trying to get the Ensembl Human BodyMap data with gene or transcript 
> information, and I followed your instructions at this link: 
> http://lists.ensembl.org/pipermail/dev/2011-August/001593.html
>
> I was able to set up an Oracle schema and run the following query:
>
> SELECT t.*, sr.name
> FROM rnaseq37_analysis a, rnaseq37_transcript t
> LEFT JOIN rnaseq37_seq_region sr
> ON sr.seq_region_id = t.seq_region_id
> WHERE t.analysis_id = a.analysis_id
> AND a.logic_name = 'skeletal_rnaseq'
> ;
>
> TRANSCRIPT_ID | GENE_ID | ANALYSIS_ID | SEQ_REGION_ID | SEQ_REGION_START | SEQ_REGION_END | SEQ_REGION_STRAND | DISPLAY_XREF_ID | BIOTYPE | STATUS | DESCRIPTION | IS_CURRENT | CANONICAL_TRANSLATION_ID | STABLE_ID | VERSION | CREATED_DATE | MODIFIED_DATE | NAME
> 840249 | 585754 | 8244 | 27517 | 184298181 | 184300196 | 1 | \N | protein_coding | PREDICTED | \N | 1 | 593641 | ROUGHT00000241809 | 1 | 2011-01-12 10:33:07 | 2011-01-12 10:33:07 | 3
> 840251 | 585756 | 8244 | 27523 | 74332013 | 74659111 | -1 | \N | protein_coding | PREDICTED | \N | 1 | 593643 | ROUGHT00000241811 | 1 | 2011-01-12 10:33:07 | 2011-01-12 10:33:07 | 8
> 840252 | 585758 | 8244 | 27523 | 74702071 | 74742711 | -1 | \N | protein_coding | PREDICTED | \N | 1 | 593644 | ROUGHT00000241812 | 1 | 2011-01-12 10:33:07 | 2011-01-12 10:33:07 | 8
> 840254 | 585759 | 8244 | 27523 | 74857620 | 74885297 | -1 | \N | protein_coding | PREDICTED | \N | 1 | 593646 | ROUGHT00000241814 | 1 | 2011-01-12 10:33:07 | 2011-01-12 10:33:07 | 8
> 840256 | 585761 | 8244 | 27523 | 74887723 | 74895618 | 1 | \N | protein_coding | PREDICTED | \N | 1 | 593648 | ROUGHT00000241816 | 1 | 2011-01-12 10:33:07 | 2011-01-12 10:33:07 | 8
> 840257 | 585762 | 8244 | 27523 | 74903474 | 74917165 | 1 | \N | protein_coding | PREDICTED | \N | 1 | 593649 | ROUGHT00000241817 | 1 | 2011-01-12 10:33:07 | 2011-01-12 10:33:07 | 8
> 840259 | 585764 | 8244 | 27523 | 74921628 | 74941367 | 1 | \N | protein_coding | PREDICTED | \N | 1 | 593651 | ROUGHT00000241819 | 1 | 2011-01-12 10:33:07 | 2011-01-12 10:33:07 | 8
> 840261 | 585766 | 8244 | 27523 | 75015772 | 75019126 | 1 | \N | protein_coding | PREDICTED | \N | 1 | 593653 | ROUGHT00000241820 | 1 | 2011-01-12 10:33:07 | 2011-01-12 10:33:07 | 8
> 840264 | 585768 | 8244 | 27519 | 15260701 | 15375468 | -1 | \N | protein_coding | PREDICTED | \N | 1 | 593656 | ROUGHT00000241821 | 1 | 2011-01-12 10:33:07 | 2011-01-12 10:33:07 | 12
> 840266 | 585770 | 8244 | 27519 | 15742384 | 15751506 | 1 | \N | protein_coding | PREDICTED | \N | 1 | 593658 | ROUGHT00000241822 | 1 | 2011-01-12 10:33:07 | 2011-01-12 10:33:07 | 12
>
> Now I need to map the gene_id or transcript_id to some kind of standard id 
> (e.g. ensg00000*****) so that I can tell which gene each gene_id refers to. 
> Do you know the best way to do this? In addition, do you know whether there 
> is a '# of reads' value for the rnaseq data?
>
> Thanks a lot for your help,
>
> Best regards,
>
> Ying
>
>
>
>   
> Hi there,
>   
> I am on your mailing list, so resubmitting this question -- see attached file.
>   
> Thanks a lot for your help.
>   
> Best,
> Ying
> -----Original Message-----
> From: dev-bounces at ensembl.org [mailto:dev-bounces at ensembl.org] On Behalf Of dev-owner at ensembl.org
> Sent: Wednesday, January 25, 2012 4:52 PM
> To: Li, Ying L {PXTP~Nutley}
> Subject: Re: [ensembl-dev] Download human body map 2.0 transcript coordinates
>   
> The Ensembl dev mailing list only accepts postings from people who are subscribed. You can subscribe or unsubscribe at http://lists.ensembl.org/mailman/listinfo/dev
>   