Question about Training result #3

Open
qunshanxingyun opened this issue Nov 28, 2024 · 4 comments
Comments

@qunshanxingyun

Dear Zhao,
I am a graduate student currently attempting to reproduce the training results from your paper. Following the guidance in README.md, I have set up the environment and downloaded the BindingDB dataset. However, when I run the command you provided:

python main_token.py --dataset BindingDB

I find that the training process is not proceeding well. Here is the training log recorded in wandb:

prot_tokenizer 446
drug_tokenizer 428
Some weights of EsmForMaskedLM were not initialized from the model checkpoint at westlake-repl/SaProt_650M_AF2 and are newly initialized: ['esm.contact_head.regression.bias', 'esm.contact_head.regression.weight', 'esm.embeddings.position_embeddings.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of RobertaModel were not initialized from the model checkpoint at HUBioDataLab/SELFormer and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
dataset loaded: train-34439; valid-4920; test-9840
539it [36:28,  4.06s/it]
77it [05:12,  4.06s/it]
154it [10:26,  4.07s/it]
Epoch 1: Validation AUC: 0.5381
Epoch 1, Loss: 0.6890225624993913
Epoch 2: Validation AUC: 0.5007
Epoch 2, Loss: 0.6797261296707535
Epoch 3: Validation AUC: 0.5566
Epoch 3, Loss: 0.6756483885927855
Epoch 4: Validation AUC: 0.5371
Epoch 4, Loss: 0.6739429008761673
Epoch 5: Validation AUC: 0.5372
Epoch 5, Loss: 0.6732677729983498
Epoch 6: Validation AUC: 0.5141
Epoch 6, Loss: 0.6707289610369086
Epoch 7: Validation AUC: 0.5585
Epoch 7, Loss: 0.6690432626152746
Epoch 8: Validation AUC: 0.5501
Epoch 8, Loss: 0.6656342917337931
Epoch 9: Validation AUC: 0.5572
Epoch 9, Loss: 0.6675316541004712
Epoch 10: Validation AUC: 0.5575
Epoch 10, Loss: 0.6681337383107484
Epoch 11: Validation AUC: 0.5577
Epoch 11, Loss: 0.666059407099722
Epoch 12: Validation AUC: 0.5595
Epoch 12, Loss: 0.6617658937795706
Epoch 13: Validation AUC: 0.5511
Epoch 13, Loss: 0.6615839151592998
Epoch 14: Validation AUC: 0.5391
Epoch 14, Loss: 0.6584334625605086
Epoch 15: Validation AUC: 0.5633
Epoch 15, Loss: 0.6583778830538876
Epoch 16: Validation AUC: 0.5355
Epoch 16, Loss: 0.6562116607221912
Epoch 17: Validation AUC: 0.5436
Epoch 17, Loss: 0.6541746810548602
Epoch 18: Validation AUC: 0.5576
Epoch 18, Loss: 0.6524081914677027
Epoch 19: Validation AUC: 0.5602
Epoch 19, Loss: 0.6519535347570515
Epoch 20: Validation AUC: 0.5561
Epoch 20, Loss: 0.6498159312803803
Epoch 21: Validation AUC: 0.5592
Epoch 21, Loss: 0.6495243820719462
Epoch 22: Validation AUC: 0.5227
Epoch 22, Loss: 0.6509300642862833
Epoch 23: Validation AUC: 0.5298
Epoch 23, Loss: 0.648390460655729
Epoch 24: Validation AUC: 0.5602
Epoch 24, Loss: 0.646911659245146
Epoch 25: Validation AUC: 0.5288
Early stopping triggered after 25 epochs.
Test AUC: 0.5419539081961172, AUPR: 0.45798490049710316, Accuracy: 0.5660569105691057

I am currently using the BindingDB data from /bindingdb/random in the DrugBAN repository.

I'm not sure whether this setup is correct or whether I missed some important steps.
I would greatly appreciate it if you could provide some guidance.

@qunshanxingyun
Author

Before this, I tried the Human dataset with the same hyperparameter settings, and the results looked quite good:

prot_tokenizer 446
drug_tokenizer 428
Some weights of EsmForMaskedLM were not initialized from the model checkpoint at westlake-repl/SaProt_650M_AF2 and are newly initialized: ['esm.contact_head.regression.bias', 'esm.contact_head.regression.weight', 'esm.embeddings.position_embeddings.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of RobertaModel were not initialized from the model checkpoint at HUBioDataLab/SELFormer and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
dataset loaded: train-4197; valid-600; test-1200
66it [07:09,  6.51s/it]
10it [01:10,  7.05s/it]
19it [02:05,  6.59s/it]
Epoch 1: Validation AUC: 0.8563
Epoch 1, Loss: 0.488140295400764
Epoch 2: Validation AUC: 0.8681
Epoch 2, Loss: 0.4101908211455201
Epoch 3: Validation AUC: 0.8897
Epoch 3, Loss: 0.3980258427786105
Epoch 4: Validation AUC: 0.9066
Epoch 4, Loss: 0.38441175561059604
Epoch 5: Validation AUC: 0.9151
Epoch 5, Loss: 0.3696805624799295
Epoch 6: Validation AUC: 0.9182
Epoch 6, Loss: 0.3614708785757874
Epoch 7: Validation AUC: 0.8955
Epoch 7, Loss: 0.3388005508617921
Epoch 8: Validation AUC: 0.8524
Epoch 8, Loss: 0.33054223588921805
Epoch 9: Validation AUC: 0.8889
Epoch 9, Loss: 0.3339718331893285
Epoch 10: Validation AUC: 0.9072
Epoch 10, Loss: 0.32357954414504947
Epoch 11: Validation AUC: 0.9108
Epoch 11, Loss: 0.31077620364499814
Epoch 12: Validation AUC: 0.9320
Epoch 12, Loss: 0.29455175363656244
Epoch 13: Validation AUC: 0.8513
Epoch 13, Loss: 0.28147737794753275
Epoch 14: Validation AUC: 0.9185
Epoch 14, Loss: 0.2656678603679845
Epoch 15: Validation AUC: 0.8562
Epoch 15, Loss: 0.2479517622427507
Epoch 16: Validation AUC: 0.9400
Epoch 16, Loss: 0.24507965734510712
Epoch 17: Validation AUC: 0.9395
Epoch 17, Loss: 0.22614611267591966
Epoch 18: Validation AUC: 0.9102
Epoch 18, Loss: 0.21601766314018855
Epoch 19: Validation AUC: 0.9051
Epoch 19, Loss: 0.19639081451477428
Epoch 20: Validation AUC: 0.9048
Epoch 20, Loss: 0.19496720701907622
Epoch 21: Validation AUC: 0.9266
Epoch 21, Loss: 0.1839833426656145
Epoch 22: Validation AUC: 0.9528
Epoch 22, Loss: 0.16109456669426325
Epoch 23: Validation AUC: 0.9343
Epoch 23, Loss: 0.16240013447223287
Epoch 24: Validation AUC: 0.9469
Epoch 24, Loss: 0.14385056783529845
Epoch 25: Validation AUC: 0.9657
Epoch 25, Loss: 0.13903420384634624
Epoch 26: Validation AUC: 0.9364
Epoch 26, Loss: 0.12498962105900953
Epoch 27: Validation AUC: 0.9612
Epoch 27, Loss: 0.11473344768764394
Epoch 28: Validation AUC: 0.9194
Epoch 28, Loss: 0.1073367449765404
Epoch 29: Validation AUC: 0.9660
Epoch 29, Loss: 0.10698797641265573
Epoch 30: Validation AUC: 0.9585
Epoch 30, Loss: 0.10357639622507674
Epoch 31: Validation AUC: 0.9530
Epoch 31, Loss: 0.08866109775209968
Epoch 32: Validation AUC: 0.9476
Epoch 32, Loss: 0.08555335598066449
Epoch 33: Validation AUC: 0.9553
Epoch 33, Loss: 0.07548184774703148
Epoch 34: Validation AUC: 0.9631
Epoch 34, Loss: 0.07361776761315537
Epoch 35: Validation AUC: 0.9613
Epoch 35, Loss: 0.06897221593129815
Epoch 36: Validation AUC: 0.9563
Epoch 36, Loss: 0.05704509153623472
Epoch 37: Validation AUC: 0.9569
Epoch 37, Loss: 0.06027515013843323
Epoch 38: Validation AUC: 0.9551
Epoch 38, Loss: 0.057640916145773546
Epoch 39: Validation AUC: 0.9399
Early stopping triggered after 39 epochs.
Test AUC: 0.9517959899298688, AUPR: 0.9309388830423682, Accuracy: 0.8891666666666667

So is this a problem with the dataset itself?

@ZhaohanM
Owner

Thank you for your interest in our work.

The inputs to our proposed model are SELFIES representations of drugs and Structure-Aware Sequences of proteins. The original dataset must therefore be preprocessed according to the instructions provided in README.md. The following is a sample for guidance:

[screenshot: sample of the preprocessed dataset]

Additionally, if you wish to use SMILES for drugs and amino acid sequences for proteins as inputs, you can use the original dataset directly. In that case, you only need to replace the current drug encoder (SELFormer) and protein encoder (SaProt) accordingly.

@qunshanxingyun
Author

Thank you for your guidance!
I successfully built the dataset, obtained the AlphaFold-predicted structures through the script, and encoded the drugs and targets into SELFIES and structure-aware sequences. I then trained the model on this data; however, the performance still seems to be lacking:
[screenshot: training metrics]

I noticed that my dataset (before splitting) has about 32,000 data points, and that some UniProt IDs could not be retrieved:
[screenshot: entries with missing UniProt IDs]

How can I reproduce the performance reported in the paper?

@ZhaohanM
Owner

ZhaohanM commented Dec 2, 2024

Congratulations on obtaining the SELFIES dataset and the Structure-Aware (SA) sequences of proteins!

The dataset splitting strategy affects the prediction results. To reproduce the performance in the paper, the next step is to integrate the SELFIES and SA datasets you acquired with established benchmarks such as BindingDB, BioSNAP, or Human. Furthermore, I recommend checking your processed dataset against the data in the figure below. This should help bring your experimental results in line with those reported in our paper.

[figure: reference layout and statistics of the processed dataset]
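A quick way to check a processed split before training is a missing-value audit with pandas; rows whose SELFIES or SA-sequence conversion failed show up as NaN and can skew results. The column names below ("SELFIES", "SA_seq", "label") are assumptions standing in for whatever schema your preprocessing produces.

```python
# Hedged sketch of a sanity check on one processed split. The column
# names are hypothetical; in practice you would read your own CSV
# (e.g. pd.read_csv(...)) instead of building a toy frame.
import pandas as pd

df = pd.DataFrame({
    "SELFIES": ["[C][C][O]", None],   # None = failed SMILES->SELFIES conversion
    "SA_seq": ["MdEvVpQ", "MkLvAa"],  # dummy structure-aware sequences
    "label": [1, 0],
})

missing = df[["SELFIES", "SA_seq"]].isna().sum().to_dict()
print("rows:", len(df), "missing per column:", missing)

# Drop (or better, reprocess) rows whose conversion failed before training.
clean = df.dropna(subset=["SELFIES", "SA_seq"])
```

Comparing the per-split row counts from this audit against the benchmark's published split sizes (e.g. the 34439/4920/9840 counts in the BindingDB log above) is a fast way to spot silent data loss.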

In addition, if the UniProt ID or the corresponding 3D structure file (.cif) is not available, the 3D structure should be predicted directly from the amino acid sequence with the AlphaFold model.
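For entries that do have a UniProt ID, the predicted structure can typically be downloaded from the public AlphaFold database rather than re-predicted. The sketch below only builds the download URL; the "v4" model version is an assumption that may change over time, and the example ID is illustrative.

```python
# Hedged sketch: construct the AlphaFold DB download URL for a given
# UniProt ID. The model version ("v4") is an assumption and may change.
def alphafold_cif_url(uniprot_id: str, version: int = 4) -> str:
    return f"https://alphafold.ebi.ac.uk/files/AF-{uniprot_id}-F1-model_v{version}.cif"

print(alphafold_cif_url("P00533"))  # P00533 (EGFR) used purely as an example

# For entries with no UniProt ID at all, predict the structure from the
# amino acid sequence directly (e.g. with AlphaFold) and save the .cif,
# then continue with the structure-aware sequence extraction as before.
```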
