[ensembl-dev] Import Homo Sapiens GFF3 file from NCBI

Herzig, David david.herzig at roche.com
Thu Jul 5 15:37:03 BST 2018


Thanks for the answer.

Indeed, I had some memory issues. But after increasing the memory, I still
get an error:

Loading the file
ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/ref_GRCh38.p12_top_level.gff3.gz

I use the following perl script to do that:

ensembl-pipeline/scripts/refseq_import/parse_ncbi_gff3.pl


Error Message:

No translation for transcript NR_038231.1

No translation for transcript XR_002956503.1

No translation for transcript NR_135168.1

Can't call method "add_Transcript" on unblessed reference at
/home/ensembl/release-92/ensembl-pipeline/scripts/refseq_import/parse_ncbi_gff3.pl
line 965, <__ANONIO__> line 3695780.

The error message points to the last line of the GFF file, which is ### (the
end of the file). The error itself occurs in the following code, while
adding the transcript. $genes{$parent_id} seems to have a different type:
it is always a Bio::EnsEMBL::Gene, except in the case of the error, where it
is a HASH. Any ideas?

    if (exists $genes{$parent_id}) {
      my $stabid = $transcript->stable_id();
      say("Adding transcript " . $stabid . " to gene " . $parent_id) if ($verbose);
      $genes{$parent_id}->add_Transcript($transcript);
    } else {
      # this should never happen (edit: so why isn't it thrown?)
      say("Parent Gene not found for transcript: " . $transcripts{$k}{stable_id});
      next TRANSCRIPT;
      $tcnt--;
    }
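
To narrow this down, here is a minimal, self-contained sketch of how an
"exists" check can pass while the stored value is still a plain HASH. It is
only an illustration: My::Gene is a hypothetical stand-in for
Bio::EnsEMBL::Gene, and describe_entry and add_transcript_checked are helper
names made up for this example. Whether autovivification is what actually
happens in parse_ncbi_gff3.pl is an assumption, not something confirmed from
the script.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed reftype);

# Hypothetical stand-in for Bio::EnsEMBL::Gene, so the sketch runs without
# an Ensembl installation.
package My::Gene {
    sub new { my ($class) = @_; return bless { transcripts => [] }, $class }
    sub add_Transcript {
        my ($self, $transcript) = @_;
        push @{ $self->{transcripts} }, $transcript;
        return;
    }
}

# Report what a %genes entry really is: a class name for objects,
# the reference type (e.g. HASH) for plain references.
sub describe_entry {
    my ($entry) = @_;
    return 'missing' unless defined $entry;
    return blessed($entry) // (reftype($entry) // 'plain scalar');
}

# Only call add_Transcript when the entry is an object that supports it.
sub add_transcript_checked {
    my ($genes, $parent_id, $transcript) = @_;
    my $gene = $genes->{$parent_id};
    if (blessed($gene) && $gene->can('add_Transcript')) {
        $gene->add_Transcript($transcript);
        return 1;
    }
    warn sprintf("Entry for %s is %s, not a gene object\n",
                 $parent_id, describe_entry($gene));
    return 0;
}

my %genes = (gene1 => My::Gene->new);

# Assumed failure mode: writing a sub-key for an ID that has no gene object
# yet autovivifies a plain hash reference, so "exists" becomes true.
$genes{gene2}{biotype} = 'lncRNA';

for my $id (qw(gene1 gene2 gene3)) {
    my $ok    = add_transcript_checked(\%genes, $id, 'NR_038231.1');
    my $entry = $genes{$id};
    printf "%s: %s (%s)\n", $id, $ok ? 'added' : 'skipped',
           describe_entry($entry);
}
```

Replacing the bare "exists" test with a blessed/can check like this would
turn the fatal error into a warning that names the offending parent ID,
which should help locate the GFF3 record that creates the unblessed entry.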


On Wed, Jul 4, 2018 at 3:23 PM, Thomas Danhorn <danhornt at njhealth.org>
wrote:

> I doubt this is a problem with the script per se, but rather with the
> configuration of the machine it runs on (and therefore more a job for your
> sysadmin than for the Ensembl team).
>
> I am just speculating, but here is my best guess:
>
> - The script is growing a hash whenever it finds a slice (see the code).
> - With that, the amount of memory required to hold the data grows as well.
> - If the import is sorted, it appears that the issue happens toward the
> end (chromosome X), i.e. when most of the data has already been read and
> stored in memory.
> - It is likely that your machine has a memory limit in place, which
> prevents processes from using more RAM than is physically available.  (One
> could use more, but that would necessitate swapping to disk, which in turn
> essentially freezes the machine until done, which might be a long time;
> therefore it is typically preferable on shared machines to just kill the
> process.)
> - The reason why your other organisms worked is most likely that they had
> less data to import (the human genome is large and well studied) and stayed
> under the memory limit.
> - A solution to this is to find a machine/node with more memory and a
> higher limit.
>
> Hope this helps,
>
> Thomas
>
>
>
> On Wed, 4 Jul 2018, Herzig, David wrote:
>
>> Hi Ensembl Dev Team
>>
>> I have set up a mysql db containing homo_sapiens_core_92_38.
>>
>> After that I tried to import the NCBI file:
>>
>> ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/ref_GRCh38.p12_top_level.gff3.gz
>>
>> I use the following perl script to do that:
>>
>> ensembl-pipeline/scripts/refseq_import/parse_ncbi_gff3.pl
>>
>> The process is killed during loading:
>>
>> Last output lines:
>> ***
>> Slice NT_187386.1 found (scaffold:GRCh38:KI270731.1:1:150754:1)
>> Slice NT_187388.1 found (scaffold:GRCh38:KI270733.1:1:179772:1)
>> Slice NT_187389.1 found (scaffold:GRCh38:KI270734.1:1:165050:1)
>> Slice NC_000023.11 found (chromosome:GRCh38:X:1:156040895:1)
>> Killed
>> ***
>>
>> Any ideas from your side?
>>
>> I did the same for other species and it all works fine.
>>
>> regards,
>> David
>>
>> --
>> David Herzig
>> Senior Scientist
>> Roche Pharma Research and Early Development
>> Roche Innovation Center Basel
>>
>> F. Hoffmann-La Roche Ltd
>> Grenzacherstrasse 124
>> 4070 Basel
>> Switzerland
>> Phone +41 61 687 31 70
>>
>> Learn more about pRED Informatics at http://go.roche.com/pREDi
>>
>>

