@@ -2,9 +2,10 @@ Documenting Experiments
=======================

.. todo::
+
    This chapter needs to be rewritten for :ref:`2025.1`.

- When publishing results — either formally, through a venue such as ACM Recsys,
+ When publishing results — either formally, through a venue such as ACM RecSys,
or informally in your organization, it's important to clearly and completely
specify how the evaluation and algorithms were run.
@@ -19,19 +20,10 @@ Common Evaluation Problems Checklist
This checklist is to help you make sure that your evaluation and results are
accurately reported.

- * Pass `include_missing=True` to :py:meth:`~lenskit.topn.RecListAnalysis.compute`. This
-   operation defaults to `False` for compatiability reasons, but the default will
-   change in the future.
-
- * Correctly fill missing values from the evaluation metric results. They are
-   reported as `NaN` (Pandas NA) so you can distinguish between empty lists and
-   lists with no relevant items, but should be appropraitely filled before
-   computing aggregates.
-
- * Pass `k` to :py:meth:`~lenskit.topn.RecListAnalysis.add_metric` with the
-   target list length for your experiment. LensKit cannot reliably detect how
-   long you intended to make the recommendation lists, so you need to specify the
-   intended length to the metrics in order to correctly account for it.
+ * Pass `k` to your ranking metrics with the target list length for your
+   experiment. LensKit cannot reliably detect how long you intended to make the
+   recommendation lists, so you need to specify the intended length to the
+   metrics in order to correctly account for it.
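+
+   For example, with an analysis object like the ``rla`` constructed later in
+   this chapter (a minimal sketch, where 10 stands in for your experiment's
+   target list length)::
+
+       rla.add_metric(NDCG(k=10))        # score NDCG over length-10 lists
+       rla.add_metric(RecipRank(k=10))   # use the same target length for reciprocal rank
+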
Reporting Algorithms
~~~~~~~~~~~~~~~~~~~~
@@ -50,17 +42,17 @@ algorithm performance but not behavior.
For example:

- +------------+--------------------------------------------------------------------------------+
- | Algorithm  | Hyperparameters                                                                |
- +============+================================================================================+
- | ItemItem   | :math:`k_\mathrm{max}=20, k_\mathrm{min}=2, s_\mathrm{min}=1.0 \times 10^{-3}` |
- +------------+--------------------------------------------------------------------------------+
- | ImplicitMF | :math:`k=50, \lambda_u=0.1, \lambda_i=0.1, w=40`                               |
- +------------+--------------------------------------------------------------------------------+
+ +------------------+--------------------------------------------------------------------------------+
+ | Algorithm        | Hyperparameters                                                                |
+ +==================+================================================================================+
+ | ItemKNNScorer    | :math:`k_\mathrm{max}=20, k_\mathrm{min}=2, s_\mathrm{min}=1.0 \times 10^{-3}` |
+ +------------------+--------------------------------------------------------------------------------+
+ | ImplicitMFScorer | :math:`k=50, \lambda_u=0.1, \lambda_i=0.1, w=40`                               |
+ +------------------+--------------------------------------------------------------------------------+

If you use a top-N implementation other than the default
- :py:class:`~lenskit.basic.TopNRanker`, or reconfigure its candidate
- selector, also clearly document that.
+ :py:class:`~lenskit.basic.TopNRanker`, or reconfigure its candidate selector,
+ also clearly document that.
Reporting Experimental Setup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -74,12 +66,13 @@ without modification, report:
- The splitting function used.
- The number of partitions or test samples.
+ - The timestamp or fraction used for temporal splitting.
- The number of users per sample (when using
-   :py:class:`~lenskit.splitting.sample_users`) or records per sample (when using
-   :py:class:`~lenskit.splitting.sample_records`).
+   :py:func:`~lenskit.splitting.sample_users`) or records per sample (when using
+   :py:func:`~lenskit.splitting.sample_records`).
- When using a user-based strategy (either
-   :py:class:`~lenskit.splitting.crossfold_users` or
-   :py:class:`~lenskit.splitting.sample_users`), the test rating selection
+   :py:func:`~lenskit.splitting.crossfold_users` or
+   :py:func:`~lenskit.splitting.sample_users`), the test rating selection
  strategy (class and parameters), e.g. ``SampleN(5)``.
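+
+ For instance, a user-based cross-validation setup might be produced roughly
+ like this (a sketch only, with ``data`` standing in for your loaded dataset;
+ see :py:mod:`lenskit.splitting` for the exact function signatures)::
+
+     from lenskit.splitting import SampleN, crossfold_users
+
+     # 5 partitions of users, holding out 5 ratings per test user
+     splits = crossfold_users(data, 5, SampleN(5))
+
+ In a paper, this would be reported as 5-fold user-based cross-validation,
+ holding out 5 ratings per test user with ``SampleN(5)``.
+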
Any additional pre-processing (e.g. filtering ratings) should also be clearly
@@ -92,29 +85,23 @@ automated reporting is not practical.
Reporting Metrics
~~~~~~~~~~~~~~~~~
- Reporting the metrics themelves is relatively straightforward. The
- :py:meth:`lenskit.topn.RecListAnalysis.compute` method will return a data frame
- with a metric score for each list. Group those by algorithm and report the
- resulting scores (typically with a mean).
+ Reporting the metrics themselves is relatively straightforward. The
+ :py:meth:`lenskit.bulk.RunAnalysis.measure` method returns a results object
+ containing the metrics for individual lists, the global metrics, and easy access
+ (through :meth:`~lenskit.bulk.RunAnalysis.list_summary`) to summary statistics
+ of per-list metrics, optionally grouped by keys such as model name.

- The following code will produce a table of algorithm scores for hit rate, nDCG
- and MRR, assuming that your algorithm identifier is in a column named ``algo``
+ The following code will produce a table of algorithm scores for hit rate, NDCG
+ and MRR, assuming that your algorithm identifier is in a column named ``model``
and the target list length is in ``N``::

-     rla = RecListAnalysis()
-     rla.add_metric(topn.hit, k=N)
-     rla.add_metric(topn.ndcg, k=N)
-     rla.add_metric(topn.recip_rank, k=N)
-     scores = rla.compute(recs, test, include_missing=True)
-     # empty lists will have na scores
-     scores.fillna(0, inplace=True)
+     rla = RunAnalysis()
+     rla.add_metric(Hit(k=N))
+     rla.add_metric(NDCG(k=N))
+     rla.add_metric(RecipRank(k=N))
+     results = rla.measure(recs, test)
    # group by algorithm
-     algo_scores = scores.groupby('algorithm')[['hit', 'ndcg', 'recip_rank']].mean()
-     algo_scores = algo_scores.rename(columns={
-         'hit': 'HR',
-         'ndcg': 'nDCG',
-         'recip_rank': 'MRR'
-     })
+     model_metrics = results.list_summary('model')

You can then use :py:meth:`pandas.DataFrame.to_latex` to convert ``model_metrics``
to a LaTeX table to include in your paper.
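+
+ For instance, a minimal sketch (treating ``model_metrics`` as a Pandas data
+ frame and using an illustrative output path)::
+
+     # round to 3 decimals and save for \input{} in the paper
+     with open('lenskit-metrics.tex', 'wt') as tf:
+         tf.write(model_metrics.to_latex(float_format='%.3f'))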