[ensembl-dev] gene count anomaly

mag mr6 at ebi.ac.uk
Mon Dec 7 09:47:37 GMT 2015


Hi John,

The zebrafish geneset contains 10 genes of biotype TEC (To be 
Experimentally Confirmed).
As such, they do not fit into any of the biotype categories we display on 
the annotation page (they are not coding, not non-coding, and not pseudogenes).

The annotation page therefore displays a count of 31,943 genes (25,642 
coding, 6,008 non-coding and 293 pseudogenes).
The BioMart page displays a count of 31,953 genes, which is the same as 
the above plus the 10 TEC genes.
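
To make the arithmetic explicit, here is a small stand-alone check in Perl 
that uses only the figures quoted in this thread (the variable names are 
just for illustration):

```perl
use strict;
use warnings;

# Counts quoted in this thread for zebrafish (GRCz10).
my %annotation_page = (
    coding     => 25_642,
    non_coding => 6_008,
    pseudogene => 293,
);
my $tec_genes       = 10;        # biotype TEC, not shown on the annotation page
my $chromosome_only = 31_388;    # John's API total over chromosomes only

# Sum the three categories shown on the annotation page.
my $annotation_total = 0;
$annotation_total += $_ for values %annotation_page;

# BioMart counts every gene, including the TEC ones.
my $biomart_total = $annotation_total + $tec_genes;

# Genes that are not on an assembled chromosome.
my $off_chromosome = $biomart_total - $chromosome_only;

print "annotation page total:\t$annotation_total\n";    # 31943
print "including TEC genes:\t$biomart_total\n";         # 31953
print "not on chromosomes:\t$off_chromosome\n";         # 565
```

The 565 genes left over after subtracting the chromosome-only total match 
the count Dan found on the non-chromosome scaffolds in Mart.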

If you use the API with fetch_all('toplevel'), this will retrieve all 
chromosomes as well as all unassembled scaffolds.
Using this with the same snippet of code that you included in your email 
should return 31,953 genes, the same as BioMart.
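
As a rough sketch of that change (not run against the database for this 
reply), only the coordinate system name differs from your snippet. The 
registry connection is an assumption, since it was not in your email, and 
the per-biotype tally is added purely for illustration so that the TEC 
genes show up as their own line. The public server serves the current 
release by default, so the counts will only match the figures above when 
the script is pointed at the release discussed here.

```perl
use strict;
use warnings;
use Bio::EnsEMBL::Registry;

# Assumed setup: the public Ensembl MySQL server with anonymous access.
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous',
);

my $slice_adaptor = $registry->get_adaptor( 'danio_rerio', 'Core', 'Slice' );

# 'toplevel' covers the chromosomes plus the scaffolds not assembled into them.
my @slices = @{ $slice_adaptor->fetch_all('toplevel') };

my $total = 0;
my %by_biotype;    # biotype => gene count
foreach my $slice (@slices) {
    foreach my $gene ( @{ $slice->get_all_Genes() } ) {
        $by_biotype{ $gene->biotype() }++;
        $total++;
    }
}

print "$_\t$by_biotype{$_}\n" for sort keys %by_biotype;
print "gene total is\t$total\n";
```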


Hope that helps,
Magali

On 07/12/2015 02:41, john samuel wrote:
> Thanks Dan.
> I must admit I don't use biomart, so I don't know what it can do, but 
> now that you've shown me I see what you mean.
> That gives me a better idea about which total to use.
> Maybe someone else can shed some light on why the total on the 
> annotation page is different?
> John
>
> On 15-12-06 09:24 PM, Daniel Lawson wrote:
>> Hi John,
>>
>> You are missing 565 loci, correct? (31,953 - 31,388 = 565.)
>>
>> In Mart, I used the Region filter to select all non-chromosome 
>> scaffolds. The 'Gene' count for these is 565; see the image if that works 
>> on the email list, otherwise here is a URL for the Mart query.
>>
>> http://www.ensembl.org/biomart/martview/68b3c7a216540966e2b2f569b596e7e6?VIRTUALSCHEMANAME=default&ATTRIBUTES=drerio_gene_ensembl.default.feature_page.ensembl_gene_id|drerio_gene_ensembl.default.feature_page.ensembl_transcript_id&FILTERS=drerio_gene_ensembl.default.filters.chromosome_name."KN149679.1,KN149681.1,KN149682.1,KN149684.1,KN149686.1,KN149687.1,KN149688.1,KN149689.1,KN149690.1,KN149691.1,KN149694.1,KN149695.1,KN149696.1,KN149697.1,KN149698.1,KN149702.1,KN149704.1,KN149706.1,KN149707.1,KN149710.1,KN149711.1,KN149713.1,KN149715.1,KN149717.1,KN149719.1,KN149725.1,KN149727.1,KN149730.1,KN149731.1,KN149732.1,KN149734.1,KN149735.1,KN149739.1,KN149753.1,KN149755.1,KN149764.1,KN149765.1,KN149776.1,KN149779.1,KN149781.1,KN149782.1,KN149784.1,KN149787.1,KN149790.1,KN149795.1,KN149797.1,KN149798.1,KN149799.1,KN149803.1,KN149813.1,KN149816.1,KN149818.1,KN149829.1,KN149830.1,KN149831.1,KN149842.1,KN149843.1,KN149846.1,KN149847.1,KN149850.1,KN149855.1,KN149857.1,KN149858.1,KN149859.1,KN149861.1,KN149868.1,KN149874.1,KN149878.1,KN149880.1,KN149883.1,KN149884.1,KN149886.1,KN149894.1,KN149895.1,KN149896.1,KN149897.1,KN149900.1,KN149904.1,KN149906.1,KN149909.1,KN149910.1,KN149912.1,KN149914.1,KN149916.1,KN149917.1,KN149921.1,KN149923.1,KN149929.1,KN149930.1,KN149933.1,KN149934.1,KN149936.1,KN149939.1,KN149943.1,KN149945.1,KN149946.1,KN149947.1,KN149948.1,KN149951.1,KN149955.1,KN149959.1,KN149962.1,KN149964.1,KN149966.1,KN149968.1,KN149978.1,KN149986.1,KN149987.1,KN149989.1,KN149992.1,KN149995.1,KN149997.1,KN149998.1,KN150000.1,KN150001.1,KN150002.1,KN150003.1,KN150008.1,KN150013.1,KN150015.1,KN150027.1,KN150032.1,KN150038.1,KN150039.1,KN150040.1,KN150041.1,KN150042.1,KN150046.1,KN150051.1,KN150052.1,KN150056.1,KN150062.1,KN150064.1,KN150066.1,KN150067.1,KN150071.1,KN150072.1,KN150075.1,KN150079.1,KN150080.1,KN150084.1,KN150086.1,KN150088.1,KN150090.1,KN150096.1,KN150099.1,KN150102.1,KN150104.1,KN150108.1,KN150109.1,KN150112.1,KN150115.1,KN150120.1,KN150125.1,KN150127.1,KN150128.1,KN150131.1,KN150137.1,KN150141.1,KN150142.1,KN150148.1,KN150156.1,KN150158.1,KN150162.1,KN150164.1,KN150165.1,KN150168.1,KN150169.1,KN150170.1,KN150171.1,KN150172.1,KN150173.1,KN150176.1,KN150177.1,KN150178.1,KN150188.1,KN150189.1,KN150193.1,KN150196.1,KN150199.1,KN150205.1,KN150207.1,KN150208.1,KN150212.1,KN150213.1,KN150214.1,KN150216.1,KN150221.1,KN150229.1,KN150230.1,KN150232.1,KN150239.1,KN150240.1,KN150241.1,KN150251.1,KN150259.1,KN150262.1,KN150265.1,KN150267.1,KN150269.1,KN150272.1,KN150273.1,KN150277.1,KN150285.1,KN150305.1,KN150307.1,KN150311.1,KN150312.1,KN150314.1,KN150317.1,KN150320.1,KN150322.1,KN150324.1,KN150326.1,KN150328.1,KN150332.1,KN150334.1,KN150335.1,KN150336.1,KN150339.1,KN150342.1,KN150345.1,KN150346.1,KN150348.1,KN150350.1,KN150351.1,KN150353.1,KN150355.1,KN150359.1,KN150361.1,KN150362.1,KN150365.1,KN150366.1,KN150371.1,KN150372.1,KN150379.1,KN150380.1,KN150383.1,KN150387.1,KN150390.1,KN150399.1,KN150400.1,KN150401.1,KN150402.1,KN150403.1,KN150405.1,KN150407.1,KN150411.1,KN150412.1,KN150415.1,KN150416.1,KN150424.1,KN150425.1,KN150432.1,KN150433.1,KN150435.1,KN150442.1,KN150447.1,KN150449.1,KN150451.1,KN150456.1,KN150470.1,KN150474.1,KN150475.1,KN150482.1,KN150487.1,KN150490.1,KN150491.1,KN150492.1,KN150505.1,KN150506.1,KN150508.1,KN150516.1,KN150518.1,KN150521.1,KN150527.1,KN150530.1,KN150531.1,KN150532.1,KN150541.1,KN150543.1,KN150544.1,KN150545.1,KN150550.1,KN150552.1,KN150561.1,KN150562.1,KN150564.1,KN150566.1,KN150568.1,KN150570.1,KN150572.1,KN150574.1,KN150576.1,KN150578.1,KN150589.1,KN150590.1,KN150596.1,KN150597.1,KN150600.1,KN150603.1,KN150605.1,KN150608.1,KN150614.1,KN150616.1,KN150617.1,KN150620.1,KN150628.1,KN150630.1,KN150631.1,KN150635.1,KN150636.1,KN150637.1,KN150642.1,KN150647.1,KN150650.1,KN150653.1,KN150654.1,KN150663.1,KN150665.1,KN150666.1,KN150667.1,KN150670.1,KN150672.1,KN150674.1,KN150677.1,KN150680.1,KN150681.1,KN150683.1,KN150685.1,KN150691.1,KN150696.1,KN150698.1,KN150699.1,KN150700.1,KN150702.1,KN150703.1,KN150706.1,KN150708.1,KN150709.1"&VISIBLEPANEL=filterpanel
>>
>> Hope that helps, or at least goes some way towards explaining the 
>> difference between Mart and your API script. I can't comment on whether 
>> either of these is the definitive gene count for zebrafish.
>>
>> regards
>> Dan
>>
>>
>> On 7 December 2015 at 02:17, john samuel 
>> <john.samuel at senecacollege.ca> wrote:
>>
>>     Thanks Dan.
>>     I thought of that, and I tried the same code but looking for
>>     genes in all the scaffolds, thinking that there might be some
>>     unplaced scaffolds, but the total for all scaffolds adds up to
>>     31,501.  This could be, as you said, all the genes mapped to
>>     chromosomes, plus some unplaced scaffolds, but that doesn't match
>>     any of the other totals, so I'm no closer to knowing which total
>>     is correct.
>>     Any other thoughts?
>>     John
>>
>>
>>     On 15-12-06 09:07 PM, Daniel Lawson wrote:
>>>     Hi John,
>>>
>>>     There may be other sequences in the assembly that have not been
>>>     assigned to a chromosome. You can check this via the API or in
>>>     Mart. I expect you'll find a bunch of small sequences that
>>>     harbour some genes - maybe that will get your totals to balance.
>>>
>>>     cheers
>>>     Dan
>>>
>>>
>>>
>>>
>>>     On 7 December 2015 at 01:59, john samuel
>>>     <john.samuel at senecacollege.ca> wrote:
>>>
>>>         Hi,
>>>         I am trying to get an accurate count of all the ENSDARG
>>>         genes from the latest zebrafish data (GRCz10) in Ensembl.
>>>         If I use the Perl API to get all the genes on all the
>>>         chromosomes, I get a total of 31,388, i.e.:
>>>
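>>>         use strict;
>>>         use warnings;
>>>         use Bio::EnsEMBL::Registry;
>>>
>>>         # Registry setup was not shown in the original post; the public
>>>         # Ensembl server with anonymous access is assumed here.
>>>         my $registry = 'Bio::EnsEMBL::Registry';
>>>         $registry->load_registry_from_db(
>>>             -host => 'ensembldb.ensembl.org', -user => 'anonymous' );
>>>
>>>         my $slice_adaptor = $registry->get_adaptor( 'danio_rerio',
>>>         'Core', 'Slice' );
>>>         # 'chromosome' restricts the slices to assembled chromosomes.
>>>         my @slices = @{ $slice_adaptor->fetch_all('chromosome') };
>>>         my $total = 0;
>>>         my %all;    # chromosome name => gene count
>>>         foreach my $slice (@slices) {
>>>             my @genes = @{ $slice->get_all_Genes() };
>>>             my $count = scalar @genes;
>>>             $all{ $slice->seq_region_name() } = $count;
>>>             $total += $count;
>>>         }
>>>         # Non-numeric names such as MT rank as 0 and so sort first.
>>>         my $rank = sub { $_[0] =~ /^\d+$/ ? $_[0] : 0 };
>>>         foreach my $sorted (sort { $rank->($a) <=> $rank->($b) } keys %all) {
>>>             print "chromosome: $sorted\t$all{$sorted}\n";
>>>         }
>>>         print "gene total is\t$total\n";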
>>>
>>>         chromosome: MT    37
>>>         chromosome: 1    1386
>>>         chromosome: 2    1587
>>>         chromosome: 3    1611
>>>         chromosome: 4    3103
>>>         chromosome: 5    1704
>>>         chromosome: 6    1280
>>>         chromosome: 7    1507
>>>         chromosome: 8    1216
>>>         chromosome: 9    1108
>>>         chromosome: 10    1108
>>>         chromosome: 11    1039
>>>         chromosome: 12    952
>>>         chromosome: 13    1013
>>>         chromosome: 14    953
>>>         chromosome: 15    1146
>>>         chromosome: 16    1241
>>>         chromosome: 17    1048
>>>         chromosome: 18    942
>>>         chromosome: 19    1123
>>>         chromosome: 20    1253
>>>         chromosome: 21    1092
>>>         chromosome: 22    1174
>>>         chromosome: 23    1031
>>>         chromosome: 24    800
>>>         chromosome: 25    934
>>>         gene total is    31388
>>>
>>>         Anyone see anything wrong with how I get the total?  I
>>>         don't, but then when I go to biomart (see below), I get a
>>>         total of 31953
>>>
>>>
>>>
>>>         and if I go to the info page for the genome at
>>>         http://useast.ensembl.org/Danio_rerio/Info/Annotation I see
>>>         a different total there too (31,650 not counting pseudogenes).
>>>
>>>
>>>
>>>         Does anyone have any idea why the totals differ, which one
>>>         to believe, and whether there's anything wrong with using the
>>>         one that my code calculated as the definitive one?  I need
>>>         to compare the total number of genes with the number that we
>>>         are finding under certain conditions, to do some stats.
>>>         John
>>>
>>>
>>>
>>>
>>>         _______________________________________________
>>>         Dev mailing list    Dev at ensembl.org
>>>         Posting guidelines and subscribe/unsubscribe info:
>>>         http://lists.ensembl.org/mailman/listinfo/dev
>>>         Ensembl Blog: http://www.ensembl.info/
>>>
>>>
>>>
>>>
>>>     -- 
>>>     VectorBase | i5K insect genome initiative
>>>
>>>
>>
