[ensembl-dev] Transcripts missing in Mus_musculus.GRCm38.102.gtf.gz and Mus_musculus.GRCm38.102.gff3.gz

Hervé Pagès hpages.on.github at gmail.com
Tue Nov 9 21:02:44 GMT 2021


Hi Marc,

On 03/11/2021 01:12, mchakiachvili wrote:
> Good morning Herve,
> 
> Sorry for the late response, I had to dig a little into the code and ask 
> older members of the team to get an answer (thanks Mag!)
> 
> The main difference between GTF and GFF3 dumping is that for GTF, we get 
> the transcripts from the gene ($gene->get_all_Transcripts),
> while for GFF3, we get the transcripts from the underlying slice 
> ($transcript_adaptor->fetch_all_by_Slice):
> https://github.com/Ensembl/ensembl-production/blob/release/104/modules/Bio/EnsEMBL/Production/Pipeline/GFF3/DumpFile.pm#L199
> https://github.com/Ensembl/ensembl-io/blob/release/104/modules/Bio/EnsEMBL/Utils/IO/GTFSerializer.pm#L112
> This means that, in theory, if the transcript goes over the boundaries of 
> the slice, we might not dump it although we dump the genes.
> I can’t see anything in the missing transcript that would explain that, 
> but maybe the data in the actual database is faulty.
> Do we know if it is also missing in previous releases, or is this 
> specific to 102?
> 
> This means that if the transcript goes over the boundaries of the slice, we 
> might not dump it although we dump the genes. We are now considering a 
> fix (one way or another, to make the two dumps consistent).
> 
> Thanks for your patience.

Thanks for looking into this.
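
To make the hypothesis above concrete, here is a toy model in Python. It is 
not the Ensembl Perl API: the Transcript class, both functions, and the 
rule that a slice-driven dump keeps only transcripts lying entirely inside 
the slice are assumptions made for illustration. Whether the real 
fetch_all_by_Slice call behaves that way is exactly the open question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    """Hypothetical stand-in for a transcript record (not an Ensembl class)."""
    stable_id: str
    gene_id: str
    start: int
    end: int


def dump_via_genes(transcripts, slice_start, slice_end):
    """Gene-driven dump: pick genes overlapping the slice, emit all their transcripts."""
    extents = {}
    for t in transcripts:
        lo, hi = extents.get(t.gene_id, (t.start, t.end))
        extents[t.gene_id] = (min(lo, t.start), max(hi, t.end))
    kept = {g for g, (lo, hi) in extents.items()
            if lo <= slice_end and hi >= slice_start}
    return {t.stable_id for t in transcripts if t.gene_id in kept}


def dump_via_slice(transcripts, slice_start, slice_end):
    """Slice-driven dump: ASSUMED to keep only transcripts fully inside the slice."""
    return {t.stable_id for t in transcripts
            if t.start >= slice_start and t.end <= slice_end}
```

Under these assumptions, a gene with one transcript inside the slice and 
one extending past its end appears with both transcripts in the gene-driven 
output but with only one in the slice-driven output.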

> 
> BTW, did you notice the same for other releases or is it just for 102?

I notice the same problem for other releases, e.g. in release 101 there 
seem to be 52 transcripts missing from 
Mus_musculus.GRCm38.101.chr_patch_hapl_scaff.gff3.gz compared to the GTF 
file.

This also affects Homo sapiens, e.g. in release 104 there seem to be 154 
transcripts missing from 
Homo_sapiens.GRCh38.104.chr_patch_hapl_scaff.gff3.gz compared to the GTF 
file. For example, gene ENSG00000283658 has 9 transcripts in the GTF file 
but only 1 transcript in the GFF3 file.
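
For anyone who wants to reproduce this kind of comparison, here is a 
minimal Python sketch of one way to list the transcripts present in a GTF 
file but absent from the matching GFF3 file. It assumes that GTF lines 
carry the stable ID as transcript_id "..." and that GFF3 transcript lines 
carry it as ID=transcript:... in column 9. The function names are made up 
for this example.

```python
import gzip
import re

# Assumed attribute layouts (column 9) in the Ensembl dumps:
#   GTF : ... transcript_id "ENSMUST00000206994"; ...
#   GFF3: ID=transcript:ENSMUST00000206994;Parent=gene:...
GTF_ID = re.compile(r'transcript_id "([^"]+)"')
GFF3_ID = re.compile(r'(?:^|;)ID=transcript:([^;]+)')


def transcript_ids(lines, pattern):
    """Collect the transcript stable IDs matched by `pattern` in column 9."""
    ids = set()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header and directive lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a feature line
        match = pattern.search(fields[8])
        if match:
            ids.add(match.group(1))
    return ids


def missing_from_gff3(gtf_path, gff3_path):
    """Return the sorted transcript IDs found in the GTF but not in the GFF3."""
    with gzip.open(gtf_path, "rt") as gtf, gzip.open(gff3_path, "rt") as gff3:
        return sorted(transcript_ids(gtf, GTF_ID) - transcript_ids(gff3, GFF3_ID))
```

Calling missing_from_gff3() on the two *.chr_patch_hapl_scaff.* files of a 
release should give the list of affected transcripts, provided the 
attribute layouts assumed above hold for that release.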

Cheers,
H.

> 
> Thanks for your help.
> 
> Kind regards,
> 
> Marc
> 
> On Tue, 2021-11-02 at 19:19 -0700, Hervé Pagès wrote:
>> Hi,
>>
>> Can someone on this list explain why the GFF3 files are missing
>> transcripts compared to the GTF files or should I ask somewhere else?
>>
>> See below for the details.
>>
>> Thanks again,
>> H.
>>
>>
>> On 26/10/2021 16:54, Hervé Pagès wrote:
>>> Hmm.. not quite. GFF3 file
>>> Mus_musculus.GRCm38.102.chr_patch_hapl_scaff.gff3.gz still seems to be
>>> missing some transcripts.
>>>
>>> The mus_musculus_core_102_38 db and
>>> Mus_musculus.GRCm38.102.chr_patch_hapl_scaff.gtf.gz both contain 144778
>>> transcripts but Mus_musculus.GRCm38.102.chr_patch_hapl_scaff.gff3.gz
>>> contains only 144726. So 52 transcripts are missing. For example
>>> ENSMUST00000206994 is missing. This transcript belongs to gene
>>> ENSMUSG00000108408. So this gene has 5 transcripts in the GTF file but
>>> only 4 in the GFF3 file.
>>>
>>> What could be the reason why some transcripts are excluded from the GFF3
>>> file?
>>>
>>> Thanks,
>>> H.
>>>
>>>
>>> On 26/10/2021 16:37, Hervé Pagès wrote:
>>>> That's it. These *.chr_patch_hapl_scaff.* files seem indeed to contain
>>>> the full db dump. Thanks!
>>>>
>>>> Cheers,
>>>> H.
>>>>
>>>>
>>>> On 26/10/2021 16:22, Thomas Danhorn wrote:
>>>>> As far as I know, these GTFs/GFFs only contain genes and transcripts
>>>>> from the primary assembly, i.e. not from patches.  I suspect
>>>>> http://ftp.ensembl.org/pub/release-102/gtf/mus_musculus/Mus_musculus.GRCm38.102.chr_patch_hapl_scaff.gtf.gz
>>>>> might contain such transcripts.
>>>>>
>>>>> Best wishes,
>>>>>
>>>>> Thomas
>>>>>
>>>>>
>>>>> On Tue, 26 Oct 2021, Hervé Pagès wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Does anybody know why transcript ENSMUST00000230762 is missing from
>>>>>> the GTF and GFF3 files for Mus musculus in Ensembl release 102?
>>>>>>
>>>>>> ENSMUST00000230762 is a transcript present in the
>>>>>> mus_musculus_core_102_38 db. It's located on novel-patch sequence
>>>>>> CHR_WSB_EIJ_MMCHR11_CTG3 from GRCm38.p6. But for some reason it's
>>>>>> not in the Mus_musculus.GRCm38.102.gtf.gz or
>>>>>> Mus_musculus.GRCm38.102.gff3.gz files found here
>>>>>> http://ftp.ensembl.org/pub/release-102/gtf/mus_musculus/ and here
>>>>>> http://ftp.ensembl.org/pub/release-102/gff3/mus_musculus/
>>>>>>
>>>>>> Furthermore, it seems that the GTF and GFF3 files are missing 2079
>>>>>> transcripts compared to the mus_musculus_core_102_38 db. Does
>>>>>> anybody know what's going on?
>>>>>>
>>>>>> Thanks,
>>>>>> H.
>>>>>>
>>>>>> -- 
>>>>>> Hervé Pagès
>>>>>>
>>>>>> Bioconductor Core Team
>>>>>> hpages.on.github at gmail.com
>>>>>>
>>>>>> _______________________________________________
>>>>>> Dev mailing list Dev at ensembl.org
>>>>>> Posting guidelines and subscribe/unsubscribe info:
>>>>>> https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org
>>>>>> Ensembl Blog: http://www.ensembl.info/
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
> 
> -- 
> 
> Marc Chakiachvili
> 
> Ensembl Production Project Leader - Genomics Technology Infrastructure
> 
> European Bioinformatics Institute (EMBL-EBI)
> European Molecular Biology Laboratory
> Wellcome Trust Genome Campus
> Hinxton
> Cambridge CB10 1SD
> United Kingdom
> 
> 

-- 
Hervé Pagès

Bioconductor Core Team
hpages.on.github at gmail.com


