Added random split for non-SMILES identifiers #452

AdrianM0 · 2023-10-23T16:29:41Z

No description provided.

MicPie · 2023-10-24T12:19:25Z

data/tabular/train_test_split.py

    #         filtered_paths.append(path)
    #     elif "peptide" in path:
    #         filtered_paths.append(path)
+    #     elif "bicerano_dataset" in path:
+    #         filtered_paths.append(path)
    # paths_to_data = filtered_paths

    REPRESENTATION_LIST = []


In LN33 and here you use the same variable name, but I guess for a different use case, or am I wrong?

Yes, LN33 uses it as a global variable, but I think it is a good idea to rename this local one following the lower-case naming convention 🤔

@MicPie it should be better now.

kjappelbaum · 2023-10-24T21:08:22Z

data/tabular/train_test_split.py

    REPR_DF = pd.DataFrame()
-    REPR_DF["SMILES"] = list(set(REPRESENTATION_LIST))
+    REPR_DF[repr_col] = list(set(REPRESENTATION_LIST))


In general, there should be no need to allocate a DataFrame for this (it can cost more memory), but it should work for our needs.
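To illustrate the point about avoiding the DataFrame, here is a minimal sketch of a random split that works directly on a deduplicated set of identifiers. The function name `random_split_identifiers`, the split fractions, and the split labels are assumptions made for illustration; they are not taken from the PR.

```python
import random


def random_split_identifiers(identifiers, train_frac=0.7, val_frac=0.15, seed=42):
    """Assign each unique identifier to 'train', 'valid', or 'test'.

    Hypothetical helper: it only uses a set, a list, and a dict,
    so no DataFrame is allocated for the deduplication step.
    """
    # Sort the unique values so the shuffle is reproducible for a given seed,
    # independent of the input order.
    unique = sorted(set(identifiers))
    rng = random.Random(seed)
    rng.shuffle(unique)

    n_train = int(len(unique) * train_frac)
    n_val = int(len(unique) * val_frac)

    assignment = {}
    for position, identifier in enumerate(unique):
        if position < n_train:
            assignment[identifier] = "train"
        elif position < n_train + n_val:
            assignment[identifier] = "valid"
        else:
            assignment[identifier] = "test"
    return assignment
```

The returned dict can then be used to look up the split of each row by its identifier, whatever representation column that identifier comes from.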

kjappelbaum · 2023-10-24T21:09:21Z

data/tabular/train_test_split.py

    )


+def cli(


nice, this should do what we want it to do. I'll walk through it once more tomorrow with a fresh mind

MicPie · 2023-11-17T11:44:47Z

data/tabular/scaffold_split.py

+
+    all_smiles = set()
+    for file in transformed_files:
+        df = pd.read_csv(file)


add low_memory=False?

MicPie · 2023-11-17T11:45:22Z

data/tabular/scaffold_split.py

+            if not os.path.exists(
+                os.path.join(os.path.dirname(file), "data_clean.csv")
+            ):
+                subprocess.run(


When running "transform.py", you do not ensure that the additional molecular representations are present. We can keep it as it is; I just wanted to mention it here.

MicPie · 2023-11-17T11:46:09Z

data/tabular/scaffold_split.py

+        "Val smiles:",
+        len(val_smiles),
+        "Test smiles:",
+        len(splits["test"]),


Why is there no test_smiles here and above?

Because we do not allocate a list of SMILES for the test set (we only have the indices in the dict).
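If a materialized list were ever wanted for the test set, it could be derived from the stored indices on demand. The sketch below assumes the split dict maps split names to positional indices into an ordered list of SMILES; the helper name and the toy data are illustrative, not from the PR.

```python
def smiles_for_split(all_smiles, indices):
    """Look up the SMILES strings at the given positional indices."""
    return [all_smiles[i] for i in indices]


# Toy data for illustration only.
all_smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "CO"]
splits = {"train": [0, 1, 2], "valid": [3], "test": [4]}

test_smiles = smiles_for_split(all_smiles, splits["test"])
# The count is the same either way, so printing len(splits["test"]) is
# equivalent to printing len(test_smiles) without building the list.
print("Test smiles:", len(test_smiles))
```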

MicPie · 2023-11-17T11:48:14Z

src/chemnlp/data/split.py

No changes here since the code was only moved, right?

Hm, there is new content. Let me make sure I pushed the latest version.

AdrianM0 added 9 commits October 23, 2023 17:14

feat: added random split for non-SMILES identifiers

7173116

fix: commented out debug lines

089073c

feat: add non-SMILES for debug. add checks for non-SMILES + data-saver

1b6dfc7

feat: generalized train-test split

517a084

lint

32e70c4

fix: error

0b88457

lint

06b3d07

fix: double reading and prioritize SMILES

912f6b8

chore: pre-commit

0d407c9

MicPie reviewed Oct 24, 2023


kjappelbaum reviewed Oct 24, 2023


AdrianM0 and others added 7 commits October 25, 2023 11:57

chore: var names more pythonic

30881de

update composition type

1cd82fc

add AS_SEQUENCE as type

2edaccf

update representation list

973d08e

factor out scaffold split

27dac3e

update scaffold split code

abe158e

Merge branch 'main' into train_test_split_addition

3b8edda

MicPie approved these changes Nov 17, 2023


Kevin Maik Jablonka added 2 commits November 17, 2023 16:32

lint

4e39a73

lint and low_memory=False

1130991

kjappelbaum merged commit 09e6721 into OpenBioML:main Nov 17, 2023
