[ensembl-dev] Import Homo Sapiens GFF3 file from NCBI

Thomas Danhorn danhornt at njhealth.org
Wed Jul 4 14:23:58 BST 2018


I doubt this is a problem with the script per se, but rather with the 
configuration of the machine it runs on (and therefore more a job for 
your sysadmin than for the Ensembl team).

I am just speculating, but here is my best guess:

- The script is growing a hash whenever it finds a slice (see the code).
- With that, the amount of memory required to hold the data grows as well.
- If the import is sorted, it appears that the issue happens toward the 
end (chromosome X), i.e. when most of the data has already been read and 
stored in memory.
- It is likely that your machine has a memory limit in place, which 
prevents processes from using more RAM than is physically available.  (One 
could use more, but that would necessitate swapping to disk, which in turn 
essentially freezes the machine until done, which might be a long time; 
therefore it is typically preferable on shared machines to just kill the 
process.)
- The reason why your other organisms worked is most likely that they had 
less data to import (the human genome is large and well studied) and 
stayed under the memory limit.
- A solution to this is to find a machine/node with more memory and a 
higher limit.
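
To check the guess above before asking for a bigger node, something like 
the following sketch may help. It is not part of the Ensembl scripts; it 
is a small standalone Python script (the function names are my own) that 
prints the per-process address-space limit and the physical RAM of the 
machine. It assumes a Linux system, where the SC_PHYS_PAGES sysconf name 
is available. Limits imposed by a job scheduler or a container would not 
show up here, so an "unlimited" result does not rule out a memory limit.

```python
import os
import resource


def format_limit(value):
    """Render a byte count in GiB, or 'unlimited' for RLIM_INFINITY."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value / 2**30:.1f} GiB"


def memory_report():
    """Collect the address-space limits of this process and the physical RAM."""
    # RLIMIT_AS is the limit that `ulimit -v` sets in the shell.
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    # Physical RAM = page size * number of physical pages (Linux assumption).
    page_size = os.sysconf("SC_PAGE_SIZE")
    phys_pages = os.sysconf("SC_PHYS_PAGES")
    return {
        "address_space_soft": format_limit(soft),
        "address_space_hard": format_limit(hard),
        "physical_ram": format_limit(page_size * phys_pages),
    }


if __name__ == "__main__":
    for name, value in memory_report().items():
        print(f"{name}: {value}")
```

Run it in the same shell or batch environment that you use for the import, 
since limits are inherited from the parent process and may differ between 
a login shell and a cluster job.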

Hope this helps,

Thomas


On Wed, 4 Jul 2018, Herzig, David wrote:

> Hi Ensembl Dev Team
>
> I have set up a mysql db containing homo_sapiens_core_92_38.
>
> After that I tried to import the NCBI file:
>
> ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/ref_GRCh38.p12_top_level.gff3.gz
>
> I use the following perl script to do that:
>
> ensembl-pipeline/scripts/refseq_import/parse_ncbi_gff3.pl
>
> The process will be killed during the loading:
>
> Last output lines:
> ***
> Slice NT_187386.1 found (scaffold:GRCh38:KI270731.1:1:150754:1)
> Slice NT_187388.1 found (scaffold:GRCh38:KI270733.1:1:179772:1)
> Slice NT_187389.1 found (scaffold:GRCh38:KI270734.1:1:165050:1)
> Slice NC_000023.11 found (chromosome:GRCh38:X:1:156040895:1)
> Killed
> ***
>
> Any ideas from your side?
>
> I did the same for other species and this works all fine.
>
> regards,
> David
>
> -- 
> David Herzig
> Senior Scientist
> Roche Pharma Research and Early Development
> Roche Innovation Center Basel
>
> F. Hoffmann-La Roche Ltd
> Grenzacherstrasse 124
> 4070 Basel
> Switzerland
> Phone +41 61 687 31 70
>
> Learn more about pRED Informatics at http://go.roche.com/*pREDi*
> <http://go.roche.com/pREDi>
>



