Issue on importing of data #1

Open
andyhegedus opened this issue Sep 1, 2020 · 9 comments

Comments

@andyhegedus

Hi,
Very interested in this data set. My first attempt was to use the provided Cypher import file, but the process never completed.
My second attempt was to copy and paste snippets from the script one at a time. That generally worked until this point (all other steps have worked and completed in short order):

// Add Alternative titles for Occupations and Workrole
:auto USING PERIODIC COMMIT
LOAD CSV WITH HEADERS
FROM 'file:///AlternateTitles.txt' AS line FIELDTERMINATOR '\t'
MATCH (a:Occupation {onet_soc_code: line.`O*NET-SOC Code`})
MERGE (t:AlternateTitles {title: line.`Alternate Title`,
shorttitle: line.`Short Title`, source: line.`Source(s)`})
WITH a, t, line
CALL apoc.create.relationship(a, 'Equivalent_To', {}, t) YIELD rel
RETURN count(rel)

This step never completes, and after repeated tries I get a memory error. Any suggestions for fixes or workarounds?
Andy

@davidmeza1
Owner

First, at this point, any code that starts with ":auto" has to be run by itself. I am currently working on automating the script.
Some sections will require up to an hour to run, depending on your resources.
You should increase the heap size in the config file to a minimum of 1G and a maximum of 3G. If you have sufficient resources, you can set it higher.
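
For reference, the heap settings I mean live in neo4j.conf and look roughly like this (a sketch; the exact values are up to your machine, the 1G/3G figures just match the guidance above):

# Heap settings in neo4j.conf (example values)
dbms.memory.heap.initial_size=1G
dbms.memory.heap.max_size=3G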

@andyhegedus
Author

Hi,
I have sufficient available RAM (40G) on this machine, so I set the heap to 5G. It still does not seem to complete. Somehow I just find it very odd that this particular block is giving issues. All previous blocks executed very quickly and without complaint, especially the block just before it, which is very similar in its commands and imports abilities.txt, which is 4X the size. That completed in <5 sec.
Andy

@davidmeza1
Owner

That is odd. Others have used this as recently as yesterday with no issues. I will review this evening, if I can.

@andyhegedus
Author

One thing I did notice was that there was a node with the label Alternate_Titles (note the underscore) that had been created earlier, while this block was trying to merge to AlternateTitles (no underscore), though I could not figure out where it was created.


@andyhegedus
Author

Hi,

I just ran the next block of your script and it ran fine:
Added 2292 labels, created 2292 nodes, set 6876 properties, started streaming 1 records after 2 ms and completed after 4605 ms. Note: it has created nodes with the label AlternateTitles, which are distinct from the earlier-created Alternate_Titles nodes.

LOAD CSV WITH HEADERS
FROM 'file:///AlternateTitles.txt' AS line FIELDTERMINATOR '\t'
MATCH (a:Workrole {onet_soc_code: line.`O*NET-SOC Code`})
MERGE (t:AlternateTitles {title: line.`Alternate Title`,
shorttitle: line.`Short Title`, source: line.`Source(s)`})
WITH a, t, line
CALL apoc.create.relationship(a, 'Equivalent_To', {}, t) YIELD rel
RETURN count(rel)

I then tried running this block, without success.

LOAD CSV WITH HEADERS
FROM 'file:///AlternateTitles.txt' AS line FIELDTERMINATOR '\t'
MATCH (a:Occupation {onet_soc_code: line.`O*NET-SOC Code`})
MERGE (t:AlternateTitles {title: line.`Alternate Title`,
shorttitle: line.`Short Title`, source: line.`Source(s)`})
WITH a, t, line
CALL apoc.create.relationship(a, 'Equivalent_To', {}, t) YIELD rel
RETURN count(rel)
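
A read-only check I may try next, just to confirm the MATCH side is not the problem (a sketch reusing the same file, tab delimiter, and backticked header; it only counts rows, no writes):

LOAD CSV WITH HEADERS
FROM 'file:///AlternateTitles.txt' AS line FIELDTERMINATOR '\t'
MATCH (a:Occupation {onet_soc_code: line.`O*NET-SOC Code`})
RETURN count(*)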

andyhegedus changed the title from "Issue on importing off data" to "Issue on importing of data" on Sep 2, 2020
@davidmeza1
Owner

I ran all the queries last night with no issues. The name difference between Alternate_Titles and AlternateTitles should not cause an issue: the first is an element in the taxonomy; the second is an actual alternate title for an occupation or workrole. I'll keep looking.
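
If you want to double-check what each label holds, a couple of quick read-only queries (a sketch; run them separately in the Browser):

// List the labels currently in the database
CALL db.labels();

// Count the nodes under each of the two similarly named labels
MATCH (n:Alternate_Titles) RETURN count(n);
MATCH (n:AlternateTitles) RETURN count(n);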

@andyhegedus
Author

Hi David,

I deleted the database and tried a second time with the same result. That step does not complete. I do notice in Activity Monitor that Java swells to 1.5G of memory while it is processing.

One thing I am curious about is the constraints. Your Cypher file lists only these three. Did you create other constraints, by chance?

// TODO need to add constraints, this is example only
CREATE CONSTRAINT ON (occupation:Occupation) ASSERT occupation.onet_soc_code IS UNIQUE;
CREATE CONSTRAINT ON (element:Element) ASSERT element.ElementID IS UNIQUE;
CREATE CONSTRAINT ON (occupation:MajorGroup) ASSERT occupation.onet_soc_code IS UNIQUE;
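
One thing I was wondering is whether an index on the MERGE key would help this step, something along these lines (just a sketch, assuming Neo4j 4.x syntax; the index name is my own):

// Speculative index to speed up the repeated MERGE on AlternateTitles
CREATE INDEX alternate_title_idx FOR (t:AlternateTitles) ON (t.title);

The MERGE matches on three properties, but even an index on title alone should avoid a full label scan on every row.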

@davidmeza1
Owner

I have not been able to recreate your error. I have run the scripts a couple of times, and while it does take some time, it does not fail.
The increase in memory makes sense, as this particular query creates many relationships. I am fine-tuning the model I use, which should, in theory, reduce the number of relationships.
That is all the constraints I have at this time.
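
If you want to see the scale on your side once the import finishes, a quick count (a sketch):

// Count the Equivalent_To relationships created by the alternate-title queries
MATCH ()-[r:Equivalent_To]->() RETURN count(r);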

@andyhegedus
Author

I tried again after doing the update to 4.1.1 that was pushed out today. The problematic step was able to load, albeit that single step took 20+ minutes to complete. The next block (which I actually ran first) took 1 minute to complete.
Also, as a side note, the last part of your script has imports that I think are specific to your organization, and the associated data files are not in the download (which is the correct thing). You may want to delete those steps from the import script.
