[ensembl-dev] Import Homo Sapiens GFF3 file from NCBI

Tue Jul 10 10:49:48 BST 2018

Hi David,

I think I know what's happening.

In parse_ncbi_gff3.pl, after line 192, the script does not check for metadata (or comment lines, if you prefer), so it tries to parse the trailing ### as if it were a feature line. This code was retrofitted here, and I think a bug has crept in.
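
As background on the error text itself: Perl reports "Can't call method ... on unblessed reference" when a method is called on a plain reference (for example a HASH ref) rather than an object. One way a plain hash can end up in a lookup table is autovivification, although it has not been confirmed that this is what happens in parse_ncbi_gff3.pl. The sketch below is a standalone illustration only; Demo::Gene, can_add_transcript and the gene IDs are hypothetical and are not taken from the Ensembl code.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical stand-in for a gene object; NOT Bio::EnsEMBL::Gene.
package Demo::Gene;
sub new { my ($class) = @_; return bless { transcripts => [] }, $class; }
sub add_Transcript {
    my ($self, $transcript) = @_;
    push @{ $self->{transcripts} }, $transcript;
    return;
}

package main;

my %genes = (gene0 => Demo::Gene->new);

# Assigning through a missing key autovivifies a plain (unblessed) hash ref.
$genes{gene1}{biotype} = 'misc';

# Hypothetical guard: true only if the entry is an object with add_Transcript.
sub can_add_transcript {
    my ($genes, $parent_id) = @_;
    my $gene = $genes->{$parent_id};
    return (blessed($gene) && $gene->can('add_Transcript')) ? 1 : 0;
}

for my $id (sort keys %genes) {
    if (can_add_transcript(\%genes, $id)) {
        $genes{$id}->add_Transcript("tx_for_$id");
        print "added transcript to $id\n";
    }
    else {
        # Without the guard, the method call here would die with
        # "Can't call method ... on unblessed reference".
        print "skipping $id: ref type is ", (ref($genes{$id}) || 'none'), "\n";
    }
}
```

A guard like this only turns the crash into a reportable condition; it does not explain where the non-object entry comes from.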

Try inserting the following at 193:

next if $gff_file->is_metadata;

That should allow the script to skip the final line and terminate cleanly. Alternatively you can remove the trailing ### from the file and the problem should go away. In the meantime I will look into getting a more systematic fix made.
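
To illustrate the idea of skipping metadata lines independently of the Ensembl parser, here is a minimal standalone sketch in plain Perl. It assumes that every line starting with "#" (including the ### marker and ## directives) can be ignored for the purposes of the import; is_directive_line and feature_lines are hypothetical helper names and do not exist in parse_ncbi_gff3.pl, and the sample feature line is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: true for GFF3 directive or comment lines,
# e.g. "##gff-version 3", "###" or "# free text".
sub is_directive_line {
    my ($line) = @_;
    return $line =~ /^\s*#/ ? 1 : 0;
}

# Hypothetical helper: return only the lines that should be parsed as features.
sub feature_lines {
    my (@lines) = @_;    # copy, so the caller's data is not modified
    my @features;
    for my $line (@lines) {
        chomp $line;
        next if $line eq '';                 # skip blank lines
        next if is_directive_line($line);    # skip "###" and other metadata
        push @features, $line;
    }
    return @features;
}

# Made-up sample input ending in the "###" marker.
my @sample = (
    "##gff-version 3\n",
    "NC_000023.11\tBestRefSeq\tgene\t100\t200\t.\t+\t.\tID=gene0\n",
    "###\n",
);

my @kept = feature_lines(@sample);
print scalar(@kept), " feature line(s) kept\n";
```

The same filter could be applied to a copy of the file before running the import, which has the same effect as removing the trailing ### by hand.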

I would be interested to know how much memory the script was allocated when it was failing for you originally.

Regards,

Kieron

> On 5 Jul 2018, at 15:37, Herzig, David <david.herzig at roche.com> wrote:
> 
> Thanks for the answer.
> 
> Indeed, I had some memory issues. But after increasing the memory, I still get an error:
> 
> Loading the file
> ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/ref_GRCh38.p12_top_level.gff3.gz
> 
> I use the following perl script to do that:
> 
> ensembl-pipeline/scripts/refseq_import/parse_ncbi_gff3.pl
> 
> 
> Error Message:
> No translation for transcript NR_038231.1
> No translation for transcript XR_002956503.1
> No translation for transcript NR_135168.1
> Can't call method "add_Transcript" on unblessed reference at /home/ensembl/release-92/ensembl-pipeline/scripts/refseq_import/parse_ncbi_gff3.pl line 965, <__ANONIO__> line 3695780.
> 
> The error message points to the last line of the GFF file, which is ### (the end of the file). The error itself occurs in the following code, while adding the transcript. $genes{$parent_id} seems to have a different type. It is always Bio::EnsEMBL::Gene except in the case of the error, where it is HASH. Any ideas?
> 
> if (exists $genes{$parent_id}) {
>   my $stabid = $transcript->stable_id();
>   say("Adding transcript " . $stabid . " to gene " . $parent_id) if ($verbose);
>   $genes{$parent_id}->add_Transcript($transcript);
> } else {
>   # this should never happen (edit: so why isn't it thrown?)
>   say("Parent Gene not found for transcript: " . $transcripts{$k}{stable_id});
>   next TRANSCRIPT;
>   $tcnt--;
> }
> 
> 
> On Wed, Jul 4, 2018 at 3:23 PM, Thomas Danhorn <danhornt at njhealth.org> wrote:
> I doubt this is a problem with the script per se, but rather with the configuration of the machine it runs on (and therefore more a job for your sysadmin than for the Ensembl team).
> 
> I am just speculating, but here is my best guess:
> 
> - The script is growing a hash whenever it finds a slice (see the code).
> - With that, the amount of memory required to hold the data grows as well.
> - If the import is sorted, it appears that the issue happens toward the end (chromosome X), i.e. when most of the data has already been read and stored in memory.
> - It is likely that your machine has a memory limit in place, which prevents processes from using more RAM than is physically available.  (One could use more, but that would necessitate swapping to disk, which in turn essentially freezes the machine until done, which might take a long time; therefore it is typically preferable on shared machines to just kill the process.)
> - The reason why your other organisms worked is most likely that they had less data to import (the human genome is large and well studied) and stayed under the memory limit.
> - A solution to this is to find a machine/node with more memory and a higher limit.
> 
> Hope this helps,
> 
> Thomas
> 
> 
> 
> On Wed, 4 Jul 2018, Herzig, David wrote:
> 
> Hi Ensembl Dev Team
> 
> I have set up a mysql db containing homo_sapiens_core_92_38.
> 
> After that I tried to import the NCBI file:
> 
> ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/ref_GRCh38.p12_top_level.gff3.gz
> 
> I use the following perl script to do that:
> 
> ensembl-pipeline/scripts/refseq_import/parse_ncbi_gff3.pl
> 
> The process will be killed during the loading:
> 
> Last output lines:
> ***
> Slice NT_187386.1 found (scaffold:GRCh38:KI270731.1:1:150754:1)
> Slice NT_187388.1 found (scaffold:GRCh38:KI270733.1:1:179772:1)
> Slice NT_187389.1 found (scaffold:GRCh38:KI270734.1:1:165050:1)
> Slice NC_000023.11 found (chromosome:GRCh38:X:1:156040895:1)
> Killed
> ***
> 
> Any ideas from your side?
> 
> I did the same for other species and it all works fine.
> 
> regards,
> David
> 
> -- 
> David Herzig
> Senior Scientist
> Roche Pharma Research and Early Development
> Roche Innovation Center Basel
> 
> F. Hoffmann-La Roche Ltd
> Grenzacherstrasse 124
> 4070 Basel
> Switzerland
> Phone +41 61 687 31 70
> 
> Learn more about pRED Informatics at http://go.roche.com/pREDi
> 
> 
> NOTICE: This email message is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/