Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added random split for non-SMILES identifiers #452

Merged
merged 18 commits into from
Nov 17, 2023

Conversation

AdrianM0
Copy link
Contributor

No description provided.

# filtered_paths.append(path)
# elif "peptide" in path:
# filtered_paths.append(path)
# elif "bicerano_dataset" in path:
# filtered_paths.append(path)
# paths_to_data = filtered_paths

REPRESENTATION_LIST = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In LN33 and here you use the same variable name but I guess for a different use case, or am I wrong?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that is a global variable in LN33, but I think it is a good idea to change this local one to proper convention with lower-case names 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MicPie now should be better

Comment on lines 235 to 236
REPR_DF = pd.DataFrame()
REPR_DF["SMILES"] = list(set(REPRESENTATION_LIST))
REPR_DF[repr_col] = list(set(REPRESENTATION_LIST))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in general, there should not need to be a need to allocate a dataframe for this (which can cost more memory). But it should work for our needs.

)


def cli(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, this should do what we want it to do. I'll walk through it once more tomorrow with a fresh mind


all_smiles = set()
for file in transformed_files:
df = pd.read_csv(file)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add low_memory=False?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point

if not os.path.exists(
os.path.join(os.path.dirname(file), "data_clean.csv")
):
subprocess.run(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you are running "transform.py" you are not ensuring that the additional mol. reprs. are present. But we can keep it as it is, but just to mention that here.

"Val smiles:",
len(val_smiles),
"Test smiles:",
len(splits["test"]),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why no test_smiles here and above?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cause we do not allocate a list of smiles for test (we only have the indices in the dict)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No changes here as only moved, right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, there is new content. Let me ensure i pushed latest version

@kjappelbaum kjappelbaum merged commit 09e6721 into OpenBioML:main Nov 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants