[ensembl-dev] Transcripts missing in Mus_musculus.GRCm38.102.gtf.gz and Mus_musculus.GRCm38.102.gff3.gz
Hervé Pagès
hpages.on.github at gmail.com
Tue Nov 9 21:02:44 GMT 2021
Hi Marc,
On 03/11/2021 01:12, mchakiachvili wrote:
> Good morning Herve,
>
> Sorry for the late response, I had to dig a little into the code and ask
> older members of the team to get an answer (thanks Mag!)
>
> The main difference between GTF and GFF3 dumping is that for GTF, we get
> the transcripts from the gene ($gene->get_all_Transcripts),
> while for GFF3, we get the transcripts from the underlying slice
> ($transcript_adaptor->fetch_all_by_Slice):
>
> https://github.com/Ensembl/ensembl-production/blob/release/104/modules/Bio/EnsEMBL/Production/Pipeline/GFF3/DumpFile.pm#L199
> https://github.com/Ensembl/ensembl-io/blob/release/104/modules/Bio/EnsEMBL/Utils/IO/GTFSerializer.pm#L112
>
> This means that, in theory, if a transcript goes over the boundaries of
> the slice, we might not dump it even though we dump its gene. I can't see
> anything in the missing transcript that would explain that, but maybe the
> data in the actual database is faulty. We are now considering a fix (one
> way or another, to make the two dumps consistent).
>
> Thanks for your patience.
Thanks for looking into this.
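
To make sure I read the explanation correctly, here is a toy Python sketch of how a gene-driven traversal and a region-driven query can disagree. It does not use the Ensembl Perl API: the `Transcript` class, both function names, and in particular the "fully contained in the region" rule are assumptions made up for illustration, and the real fetch_all_by_Slice may well behave differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    # Hypothetical minimal record; not an Ensembl object.
    stable_id: str
    gene_id: str
    start: int
    end: int


def transcripts_via_genes(gene_ids, transcripts):
    """Gene-driven traversal: emit every transcript attached to a dumped gene."""
    return [t for t in transcripts if t.gene_id in gene_ids]


def transcripts_via_region(transcripts, region_start, region_end):
    """Region-driven query under the ASSUMED rule that a transcript is
    returned only if it lies entirely inside the region."""
    return [t for t in transcripts
            if region_start <= t.start and t.end <= region_end]


# Two transcripts of the same gene; the second one runs past position 1000.
toy = [
    Transcript("T1", "G1", 100, 500),
    Transcript("T2", "G1", 400, 1200),
]

by_gene = {t.stable_id for t in transcripts_via_genes({"G1"}, toy)}
by_region = {t.stable_id for t in transcripts_via_region(toy, 1, 1000)}

# Transcripts that the gene-driven path emits but the region-driven path drops.
print(sorted(by_gene - by_region))
```

Under that assumed rule, the gene is still dumped by both paths but one of its transcripts is lost in the region-driven one, which is the pattern I see in the files.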
>
> BTW, did you notice the same for other releases or is it just for 102?
I notice the same problem for other releases, e.g. in release 101 there
seem to be 52 transcripts missing from
Mus_musculus.GRCm38.101.chr_patch_hapl_scaff.gff3.gz compared to the GTF
file.
This also affects Homo sapiens, e.g. in release 104 there seem to be 154
transcripts missing from
Homo_sapiens.GRCh38.104.chr_patch_hapl_scaff.gff3.gz compared to the GTF
file. For example, gene ENSG00000283658 has 9 transcripts in the GTF file
but only 1 transcript in the GFF3 file.
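
For anyone who wants to reproduce this kind of comparison, below is a minimal Python sketch, not the exact code I ran. It assumes that transcripts appear in the GTF as lines with feature type "transcript" carrying `gene_id "..."` and `transcript_id "..."` attributes, and in the GFF3 as lines whose attributes contain `ID=transcript:...` and `Parent=gene:...`. All function names are made up for this example.

```python
import gzip
import re
from collections import defaultdict

# Attribute patterns (assumed layout of the Ensembl GTF and GFF3 dumps).
GTF_TX = re.compile(r'transcript_id "([^"]+)"')
GTF_GENE = re.compile(r'gene_id "([^"]+)"')
GFF3_TX = re.compile(r'(?:^|;)ID=transcript:([^;]+)')
GFF3_GENE = re.compile(r'(?:^|;)Parent=gene:([^;]+)')


def open_text(path):
    """Open a plain or gzip-compressed file in text mode."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path)


def _attribute_columns(lines, feature_type=None):
    """Yield the 9th column of each data line, optionally filtered on column 3."""
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        if feature_type is None or cols[2] == feature_type:
            yield cols[8]


def _tx_to_gene(attr_columns, tx_re, gene_re):
    """Map transcript ID -> gene ID (None if the gene attribute is absent)."""
    tx2gene = {}
    for attrs in attr_columns:
        tx = tx_re.search(attrs)
        if tx:
            gene = gene_re.search(attrs)
            tx2gene[tx.group(1)] = gene.group(1) if gene else None
    return tx2gene


def gtf_transcripts(lines):
    """Transcripts declared by GTF lines of feature type 'transcript'."""
    return _tx_to_gene(_attribute_columns(lines, "transcript"), GTF_TX, GTF_GENE)


def gff3_transcripts(lines):
    """Transcripts declared by any GFF3 line with an ID=transcript:... attribute."""
    return _tx_to_gene(_attribute_columns(lines), GFF3_TX, GFF3_GENE)


def missing_by_gene(gtf_tx2gene, gff3_tx2gene):
    """Group the transcripts present in the GTF but absent from the GFF3 by gene."""
    missing = defaultdict(list)
    for tx in sorted(set(gtf_tx2gene) - set(gff3_tx2gene)):
        missing[gtf_tx2gene[tx]].append(tx)
    return dict(missing)
```

Usage would be along the lines of `missing_by_gene(gtf_transcripts(open_text(gtf_path)), gff3_transcripts(open_text(gff3_path)))`, where `gtf_path` and `gff3_path` point at the two downloaded files. The result maps each affected gene ID to the transcript IDs that only the GTF contains.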
Cheers,
H.
>
> Thanks for your help.
>
> Kind regards,
>
> Marc
>
> On Tue, 2021-11-02 at 19:19 -0700, Hervé Pagès wrote:
>> Hi,
>>
>> Can someone on this list explain why the GFF3 files are missing
>> transcripts compared to the GTF files or should I ask somewhere else?
>>
>> See below for the details.
>>
>> Thanks again,
>> H.
>>
>>
>> On 26/10/2021 16:54, Hervé Pagès wrote:
>>> Hmm.. not quite. GFF3 file
>>> Mus_musculus.GRCm38.102.chr_patch_hapl_scaff.gff3.gz still seems to be
>>> missing some transcripts.
>>>
>>> The mus_musculus_core_102_38 db and
>>> Mus_musculus.GRCm38.102.chr_patch_hapl_scaff.gtf.gz both contain 144778
>>> transcripts but Mus_musculus.GRCm38.102.chr_patch_hapl_scaff.gff3.gz
>>> contains only 144726. So 52 transcripts are missing. For example
>>> ENSMUST00000206994 is missing. This transcript belongs to gene
>>> ENSMUSG00000108408. So this gene has 5 transcripts in the GTF file but
>>> only 4 in the GFF3 file.
>>>
>>> What could be the reason why some transcripts are excluded from the GFF3
>>> file?
>>>
>>> Thanks,
>>> H.
>>>
>>>
>>> On 26/10/2021 16:37, Hervé Pagès wrote:
>>>> That's it. These *.chr_patch_hapl_scaff.* files seem indeed to contain
>>>> the full db dump. Thanks!
>>>>
>>>> Cheers,
>>>> H.
>>>>
>>>>
>>>> On 26/10/2021 16:22, Thomas Danhorn wrote:
>>>>> As far as I know, these GTFs/GFFs only contain genes and transcripts
>>>>> from the primary assembly, i.e. not from patches. I suspect
>>>>> http://ftp.ensembl.org/pub/release-102/gtf/mus_musculus/Mus_musculus.GRCm38.102.chr_patch_hapl_scaff.gtf.gz
>>>>> <http://ftp.ensembl.org/pub/release-102/gtf/mus_musculus/Mus_musculus.GRCm38.102.chr_patch_hapl_scaff.gtf.gz>
>>>>>
>>>>>
>>>>> might contain such transcripts.
>>>>>
>>>>> Best wishes,
>>>>>
>>>>> Thomas
>>>>>
>>>>>
>>>>> On Tue, 26 Oct 2021, Hervé Pagès wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Does anybody know why transcript ENSMUST00000230762 is missing from
>>>>>> the GTF and GFF3 files for Mus musculus in Ensembl release 102?
>>>>>>
>>>>>> ENSMUST00000230762 is a transcript present in the
>>>>>> mus_musculus_core_102_38 db. It's located on novel-patch sequence
>>>>>> CHR_WSB_EIJ_MMCHR11_CTG3 from GRCm38.p6. But for some reason it's
>>>>>> not in the Mus_musculus.GRCm38.102.gtf.gz or
>>>>>> Mus_musculus.GRCm38.102.gff3.gz files found here
>>>>>> http://ftp.ensembl.org/pub/release-102/gtf/mus_musculus/
>>>>>> <http://ftp.ensembl.org/pub/release-102/gtf/mus_musculus/> and here
>>>>>> http://ftp.ensembl.org/pub/release-102/gff3/mus_musculus/
>>>>>> <http://ftp.ensembl.org/pub/release-102/gff3/mus_musculus/>
>>>>>>
>>>>>> Furthermore, it seems that the GTF and GFF3 files are missing 2079
>>>>>> transcripts compared to the mus_musculus_core_102_38 db. Does anybody
>>>>>> know what's going on?
>>>>>>
>>>>>> Thanks,
>>>>>> H.
>>>>>>
>>>>>> --
>>>>>> Hervé Pagès
>>>>>>
>>>>>> Bioconductor Core Team
>>>>>> hpages.on.github at gmail.com <mailto:hpages.on.github at gmail.com>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Dev mailing list Dev at ensembl.org <mailto:Dev at ensembl.org>
>>>>>> Posting guidelines and subscribe/unsubscribe info:
>>>>>> https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org
>>>>>> <https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org>
>>>>>> Ensembl Blog: http://www.ensembl.info/ <http://www.ensembl.info/>
>>>>>>
>>>>>
>>>>
>>>
>>
>
> --
>
> Marc Chakiachvili
>
> Ensembl Production Project Leader - Genomics Technology Infrastructure
>
> European Bioinformatics Institute (EMBL-EBI)
> European Molecular Biology Laboratory
> Wellcome Trust Genome Campus
> Hinxton
> Cambridge CB10 1SD
> United Kingdom
>
--
Hervé Pagès
Bioconductor Core Team
hpages.on.github at gmail.com