-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added random split for non-SMILES identifiers #452
Added random split for non-SMILES identifiers #452
Conversation
data/tabular/train_test_split.py
Outdated
# filtered_paths.append(path) | ||
# elif "peptide" in path: | ||
# filtered_paths.append(path) | ||
# elif "bicerano_dataset" in path: | ||
# filtered_paths.append(path) | ||
# paths_to_data = filtered_paths | ||
|
||
REPRESENTATION_LIST = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In LN33 and here you use the same variable name but I guess for a different use case, or am I wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that is a global variable in LN33, but I think it is a good idea to change this local one to proper convention with lower-case names 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MicPie now should be better
data/tabular/train_test_split.py
Outdated
REPR_DF = pd.DataFrame() | ||
REPR_DF["SMILES"] = list(set(REPRESENTATION_LIST)) | ||
REPR_DF[repr_col] = list(set(REPRESENTATION_LIST)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in general, there should not need to be a need to allocate a dataframe for this (which can cost more memory). But it should work for our needs.
) | ||
|
||
|
||
def cli( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, this should do what we want it to do. I'll walk through it once more tomorrow with a fresh mind
data/tabular/scaffold_split.py
Outdated
|
||
all_smiles = set() | ||
for file in transformed_files: | ||
df = pd.read_csv(file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add low_memory=False
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point
if not os.path.exists( | ||
os.path.join(os.path.dirname(file), "data_clean.csv") | ||
): | ||
subprocess.run( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you are running "transform.py" you are not ensuring that the additional mol. reprs. are present. But we can keep it as it is, but just to mention that here.
"Val smiles:", | ||
len(val_smiles), | ||
"Test smiles:", | ||
len(splits["test"]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why no test_smiles
here and above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cause we do not allocate a list of smiles for test (we only have the indices in the dict)
src/chemnlp/data/split.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No changes here as only moved, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm, there is new content. Let me ensure i pushed latest version
No description provided.