[ensembl-dev] FW: Download human body map 2.0 transcript coordinates
Li, Ying L
ying_l.li at roche.com
Mon Feb 13 16:15:43 GMT 2012
Dear Thibaut,
Thanks a lot for your reply.
I would like to try with your 'easiest' way by downloading the ensembl human database. I need more details on how to do the mapping between RNAseq and the core?
Here is what I have:
1) I have home_sapiens_rnaseq_65_37 (ftp://ftp.ensembl.org/pub/release-65/mysql/homo_sapiens_rnaseq_65_37/) data from ENSEMBL (all the mysql files), and I had the table created with all the data files, including gene, transcript tables. In gene and transcript tables, we have gene_id as one of their columns.
2) I downloaded the home_sapiens-core_65_37 (ftp://ftp.ensembl.org/pub/release-65/mysql/homo_sapiens_core_65_37/ ) gene/transcript/seq_region tables. I was hoping to find the stable_id like 'ROUGHG00000203265-v5.1-74-747-3-NC-0-1' in the core tables, but did not find any. I also tried to map seq_region_id/seq_region_start/seq_region_end/ to the core_gene seq_region_id/seq_region_start/seq_region_end witout success.
3) Here are my questions:
a. which tables should I use to map the RNASEQ_gene to the core_gene?
b. What is the best way to map core_gene "ENSG000....' To hugo gene?
Many thanks for your help.
Best,
Ying
From: Thibaut Hourlier [mailto:th3 at sanger.ac.uk]
Sent: Monday, February 13, 2012 6:53 AM
To: Li, Ying L {PXTP~Nutley}
Cc: dev at ensembl.org
Subject: Re: [ensembl-dev] FW: Download human body map 2.0 transcript coordinates
Dear Ying,
The easiest would be to use the Ensembl Human database which you can download from the ftp site: http://www.ensembl.org/info/data/ftp/index.html, then the mapping will work because you will have the Ensembl coordinates.
If you are trying to map directly Ensembl data from the Human body map to a UCSC database it won't work because the internal id like seq_region_id are different.
You will need to use the seq_region names which you can found in the seq_region table (Ensembl schema). As we are also using GRCh37, the seq_region names will be the same, but for the chromosome, it will be "chr1" in UCSC and "1" in Ensembl for the chromosome 1.
In the perl code of my previous message, I was getting the coordinates of a known gene in the human database. Then using the coordinates of the genes, I was fetching all the transcripts from the RNASeq database that are overlapping the region. This is why I used a stable id like "ENSG00....".
Regards,
Thibaut
On 10/02/12 18:36, Li, Ying L wrote:
Dear Thibaut,
Thanks for your reply. I am not using the API at this moment. The stable Id from the 'gene' table is more starting with ENSG, but 'ROUGHG00000203265-v5.1-74-747-3-NC-0-1'. So, I tried to map the seq_region(_id/_start/_end/_strand) USCS 'gene coord' of GRch37 without any success. Did I use the wrong coord (USCS)?
Thanks a lot for your help,
Best,
Ying
From: Thibaut Hourlier [mailto:th3 at sanger.ac.uk]
Sent: Monday, January 30, 2012 9:44 AM
To: Li, Ying L {PXTP~Nutley}
Cc: dev at ensembl.org<mailto:dev at ensembl.org>
Subject: Re: [ensembl-dev] FW: Download human body map 2.0 transcript coordinates
Dear Ying,
As I said on the link you are quoting, we recommend people to use the perl API : http://www.ensembl.org/info/docs/Doxygen/core-api/index.html
The only way you can map the reads with a gene is with the seq_region(_id/_start/_end/_strand) information.
If you know the stable id (ENSG000....) of the gene you are interested in, it's quite simple with the API:
$db = new Bio::EnsEMBL::DBAdaptor(
-host => 'ensembldb.ensembl.org',
-port => 5306,
-user => 'anonymous',
-dbname => 'homo_sapiens_core_65_37');
$ga = $db->get_GeneAdaptor();
$gene = $ga->fetch_by_stable_id("ENSG000XXXX");
$slice = $gene->slice;
$rnaseqdb = new Bio::EnsEMBL::DBAdaptor(
-host => 'ensembldb.ensembl.org',
-port => 5306,
-user => 'anonymous',
-dbname => 'homo_sapiens_rnaseq_65_37');
$rnaseqsa = $rnaseqdb->get_SliceAdaptor();
$rnaseqslice = $rnaseqsa->fetch_by_name($slice->name);
@transcripts = @{$rnaseqslice->get_all_Transcripts('skeletal_rnaseq')};
foreach my $transcript (@transcripts) {
foreach my $sf (@{$transcript->get_all_supporting_features()}) {
#We print the number of reads that spanned accross the intron
print STDOUT $sf->hit_name, ' :', $sf->score, "\n";
}
}
The number of reads that span the introns is the score you can find in the dna_align_feature table of the rnaseq database.
Regards,
Thibaut
On 27/01/12 19:09, Li, Ying L wrote:
Dear Thibaut,
I am trying to get the ensemble human bodymap with gene or transcript. And I followed your instruction at the this blog site: http://lists.ensembl.org/pipermail/dev/2011-August/001593.html
I am able to setup an oracle schema to do the following query:
SELECT t.* , sr.name
FROM rnaseq37_analysis a, rnaseq37_transcript t
LEFT JOIN rnaseq37_seq_region sr
ON sr.seq_region_id = t.seq_region_id
WHERE t.analysis_id = a.analysis_id
AND a.logic_name = 'skeletal_rnaseq'
;
TRANSCRIPT_ID
GENE_ID
ANALYSIS_ID
SEQ_REGION_ID
SEQ_REGION_START
SEQ_REGION_END
SEQ_REGION_STRAND
DISPLAY_XREF_ID
BIOTYPE
STATUS
DESCRIPTION
IS_CURRENT
CANONICAL_TRANSLATION_ID
STABLE_ID
VERSION
CREATED_DATE
MODIFIED_DATE
NAME
840249
585754
8244
27517
184298181
184300196
1
\N
protein_coding
PREDICTED
\N
1
593641
ROUGHT00000241809
1
2011-01-12 10:33:07
2011-01-12 10:33:07
3
840251
585756
8244
27523
74332013
74659111
-1
\N
protein_coding
PREDICTED
\N
1
593643
ROUGHT00000241811
1
2011-01-12 10:33:07
2011-01-12 10:33:07
8
840252
585758
8244
27523
74702071
74742711
-1
\N
protein_coding
PREDICTED
\N
1
593644
ROUGHT00000241812
1
2011-01-12 10:33:07
2011-01-12 10:33:07
8
840254
585759
8244
27523
74857620
74885297
-1
\N
protein_coding
PREDICTED
\N
1
593646
ROUGHT00000241814
1
2011-01-12 10:33:07
2011-01-12 10:33:07
8
840256
585761
8244
27523
74887723
74895618
1
\N
protein_coding
PREDICTED
\N
1
593648
ROUGHT00000241816
1
2011-01-12 10:33:07
2011-01-12 10:33:07
8
840257
585762
8244
27523
74903474
74917165
1
\N
protein_coding
PREDICTED
\N
1
593649
ROUGHT00000241817
1
2011-01-12 10:33:07
2011-01-12 10:33:07
8
840259
585764
8244
27523
74921628
74941367
1
\N
protein_coding
PREDICTED
\N
1
593651
ROUGHT00000241819
1
2011-01-12 10:33:07
2011-01-12 10:33:07
8
840261
585766
8244
27523
75015772
75019126
1
\N
protein_coding
PREDICTED
\N
1
593653
ROUGHT00000241820
1
2011-01-12 10:33:07
2011-01-12 10:33:07
8
840264
585768
8244
27519
15260701
15375468
-1
\N
protein_coding
PREDICTED
\N
1
593656
ROUGHT00000241821
1
2011-01-12 10:33:07
2011-01-12 10:33:07
12
840266
585770
8244
27519
15742384
15751506
1
\N
protein_coding
PREDICTED
\N
1
593658
ROUGHT00000241822
1
2011-01-12 10:33:07
2011-01-12 10:33:07
12
Now I need to map the gene_id or transcript_id to some kind of standard id (eg ensg00000*****) so that I can tell what gene is the gene_id regards to, do you know what is the best way to do so? if you can tell me how to map the gene_id? In additional, do you know if there is a '# of read" for the rnaseq data?
Thanks a lot for your help,
Best regards,
Ying
Hi there,
I am on your mailing list, so resubmitting this question -- see attached file.
Thanks a lot for your help.
Best,
Ying
-----Original Message-----
From: dev-bounces at ensembl.org<mailto:dev-bounces at ensembl.org> [mailto:dev-bounces at ensembl.org] On Behalf Of dev-owner at ensembl.org<mailto:dev-owner at ensembl.org>
Sent: Wednesday, January 25, 2012 4:52 PM
To: Li, Ying L {PXTP~Nutley}
Subject: Re: [ensembl-dev] Download human body map 2.0 transcript coordinates
The Ensembl dev mailing list only accepts postings from people who are subscribed. You can subscribe or unsubscribe at http://lists.ensembl.org/mailman/listinfo/dev
_______________________________________________
Dev mailing list Dev at ensembl.org<mailto:Dev at ensembl.org>
List admin (including subscribe/unsubscribe): http://lists.ensembl.org/mailman/listinfo/dev
Ensembl Blog: http://www.ensembl.info/
-- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
-- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20120213/2ea32470/attachment.html>
More information about the Dev
mailing list