Simplified clustering for Agora use-case #7
Conversation
As of right now, I'm thinking that more unit test coverage is the bulk of the remaining work before I'd feel good about merging this (obviously pending your feedback!)
Thank you, I will take a look soon! (Sorry, I clicked "ready for review" by mistake.)
Would also like to add code coverage reports in PRs, and for this to show full coverage of
Oh! One more thing I'm thinking on. I swapped the vote values over to this:

from enum import IntEnum

class VoteValueEnum(IntEnum):
    UP = 1
    NEUTRAL = 0
    DOWN = -1

We can keep the internal language neutral, and people can transform whatever data they like into votes in this format. For example:

from enum import Enum

## DEMDIS votes could map like this before passing in
# See: https://github.com/Demdis/Clustering-types/blob/main/types.py
class DemdisVoteValueEnum(str, Enum):
    AGREE = "agree"
    DISAGREE = "disagree"
    SKIP = "skip"

DEMDIS_VOTE_MAPPING = {
    DemdisVoteValueEnum.AGREE: VoteValueEnum.UP,
    DemdisVoteValueEnum.SKIP: VoteValueEnum.NEUTRAL,
    DemdisVoteValueEnum.DISAGREE: VoteValueEnum.DOWN,
}

votes_from_demdis = [{k: DEMDIS_VOTE_MAPPING[val] if k == "vote" else val for k, val in v.items()} for v in votes]

## AGORA could map like this
class AgoraVoteEnum(Enum):
    AGREE = "agree"
    DISAGREE = "disagree"

AGORA_VOTE_MAPPING = {
    AgoraVoteEnum.AGREE: VoteValueEnum.UP,
    AgoraVoteEnum.DISAGREE: VoteValueEnum.DOWN,
}

votes_from_agora = [{k: AGORA_VOTE_MAPPING[val] if k == "vote" else val for k, val in v.items()} for v in votes]

## AGORA could test Polis data without pass values like this
SIMULATED_AGORA_VOTE_MAPPING = {
    VoteValueEnum.UP: VoteValueEnum.UP,
    VoteValueEnum.NEUTRAL: None,
    VoteValueEnum.DOWN: VoteValueEnum.DOWN,
}

votes_from_polis_agora = [{k: SIMULATED_AGORA_VOTE_MAPPING[val] if k == "vote" else val for k, val in v.items()} for v in votes]

## Someone could map LIKERT scale data to the algos one of these ways
# (like was done for JapanChoice.jp/polis recently)
class LikertScale(IntEnum):
    STRONGLY_AGREE = 1
    AGREE = 2
    NEUTRAL = 3
    DISAGREE = 4
    STRONGLY_DISAGREE = 5

# One option
LIKERT_LIBERAL_MAPPING = {
    LikertScale.STRONGLY_AGREE: VoteValueEnum.UP,
    LikertScale.AGREE: VoteValueEnum.UP,
    LikertScale.NEUTRAL: VoteValueEnum.NEUTRAL,
    LikertScale.DISAGREE: VoteValueEnum.DOWN,
    LikertScale.STRONGLY_DISAGREE: VoteValueEnum.DOWN,
}

# Another option
LIKERT_CONSERVATIVE_MAPPING = {
    LikertScale.STRONGLY_AGREE: VoteValueEnum.UP,
    LikertScale.AGREE: VoteValueEnum.NEUTRAL,
    LikertScale.NEUTRAL: VoteValueEnum.NEUTRAL,
    LikertScale.DISAGREE: VoteValueEnum.NEUTRAL,
    LikertScale.STRONGLY_DISAGREE: VoteValueEnum.DOWN,
}

votes_from_likert = [{k: LIKERT_CONSERVATIVE_MAPPING[val] if k == "vote" else val for k, val in v.items()} for v in votes]
Makes sense!
Makes sense!
Makes sense.
Agreed.
Up to you. Not sure what
Up to you!
👍
👍
👍
👍
Amazing work! Is it already usable?
Yeah, go ahead with merging. I think you can work on a subsequent PR for unit tests later! :)
You're right,
It should be, though reddwarf (and I believe polis) rely on an assumption of [incrementing?] numeric statement_id and participant_id for now. Shouldn't be hard to fix though, so I'll take a look. I agree that strings are ideal, so this makes total sense.
I'll just write unit tests for getting (To reassure anyone watching: I have been way out of character for my normal development style by pushing to mainline so far, but once I go into PR mode, it's easy for me to commit to it.)
Amazing!
Mind if we keep this as a versioned agora provider for a while longer? What if we revisit in a month, before namespacing the true
An Agora provider also allows me to just favour a "yes" response to your needs, without feeling like I'm committing to larger considerations that I may wish for more time to consider. Basically, it helps us move quicker by letting the interface essentially be owned by you for now. If you're alright with that, there would be 4 options (in order of personal pref, top to bottom):
Sounds good to me :).
I get your point.
Whatever you prefer suits me!
Thanks nico! I'll go with (1) |
Ready for review @nicobao |
This does the following:
1. builds a vote matrix (includes any statement with at least 1 participant vote),
2. filters out any participants with fewer than 7 votes,
Can we make this "2. filtering" optional via a config passed in options?
Stock polis provides clustering even with 2 statements and 2 votes on each statement. It's just not displayed in the app. So I think it should be left to the discretion of the user.
I understand you may have other requirements, so at least provide this as an option.
Another question: are "PASS" votes taken into account in the 7 votes? In other words, will someone with 6 passes and one agree be placed in a cluster?
Can we make this "2. filtering" optional via a config passed in options?

Ah, it actually is -- it's just awkwardly documented in the ClusteringOptions section of reddwarf.types.agora below. I've posted a comment there to call it out. It will be more obvious in the built docs site, because it's linked.
Stock polis has a more complicated heuristic where any participant that either (1) hits the min vote threshold or (2) votes on all active statements (even if below that minimum early on) gets "activated" as a user, and is then never taken out.
This leads to some edge-cases in Polis where participants who only voted on 4 things are technically still in the PCA calculation a month later, when there are 50 statements and way more participants.
As I recall, there's no strong reason to keep these participants in after they've been included; Team Polis just felt it more consistent to not remove ppl once they're part of the calculation.
Maybe the simpler stateless solution for now is to include people who either (a) vote on 7 statements or (b) vote on all statements (for when there are fewer than 7 statements).
In the simplest stateless algorithm, that means there's a weird UX edge-case: if run x has 6 statements and y participants, but then participant y+1 arrives (and votes) and then adds a new statement, everyone will disappear from the next run x+1, at least until they each come back to vote on that final 7th statement. I think that strange edge-case is why Team Polis tracks the state of "who was in this before".
This all doesn't matter very much if every conversation starts with 7 seed statements, but it smooths out the user experience when allowing a convo that starts with fewer than 7 statements.
So what's your preference for the Agora implementation? I'm pretty relaxed on what we start with.
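A minimal sketch of that stateless rule, using a hypothetical helper name (not the current reddwarf code):

```python
VOTE_THRESHOLD = 7

def is_participant_included(vote_count: int, total_statement_count: int,
                            threshold: int = VOTE_THRESHOLD) -> bool:
    # (a) hit the vote threshold, or (b) voted on every statement
    # (covers conversations with fewer than `threshold` statements).
    return vote_count >= threshold or vote_count == total_statement_count

assert is_participant_included(5, 5)      # voted on all 5 of 5 statements -> included
assert not is_participant_included(5, 8)  # 5 votes of 8 statements, below threshold -> excluded
```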
For our use-case, we currently focus on being as "inclusive" as possible, even if it means compromising on "purity". That could change down the line if/when we introduce "wikisurvey" a la Polis-the-product.
It must "feel" good and fun. So the more participants are included, and the earlier, the better.
I think this option is very dependent on the library consumer's use-case, so I would leave it entirely at the discretion of the library consumer product.
But I also understand the constraints: if a user hasn't participated much, the "clustering" may not be very precise, or mean a lot.
I'd say "7 votes needed" feels a bit arbitrary. Instead, in theory, we should cluster anyone who voted on "enough" "decisive" statements, which is, I understand, difficult to measure, and it evolves over time.
So for now I'd like to keep it simple and modular, keeping our options open by adding these three options:
- number_vote_threshold: your "7" in "filters out any participants with less than 7 votes"
- percentage_vote_threshold: the "100%" in "the participant voted on 100% of the statements"
- condition: possible values "or" | "and". Corresponds to the "or" in: "any participant that either (1) hits the min vote threshold OR (2) votes on {percentage_vote_threshold} active statements gets 'activated' as a user"
Then I'd tinker with it in unit tests, and in our own product. And adapt over time with user feedback.
(Feel free to rename the options as you see fit.)
EDIT: actually, even better than these arbitrary three options, we should just allow users to pass an optional predicate in the options, which is called by the underlying library to identify whether the participant is activated or not. The predicate could look like so, taking inspiration from filter in Python:

# Should return True if the participant is "activated", else False.
# If the predicate is not passed as a param to `run_clustering`, then all participants are activated.
def filter_participants(participant_id: int | str, vote_count: int, vote_coverage_percentage: int) -> bool: ...

Something like this (the specific API could change, but this is a good start).
If you want, you could later program and expose to library consumers a few pre-made "filter_participants" predicate implementations containing the default config that you think is "relevant" for various use-cases:
- stock_polis_filter_participants: Predicate // default existing Polis behavior
- def only_vote_threshold(threshold: int = 7) -> Predicate // implementation you suggested
- ...etc
Then it's up to the library consumer to use them or not.
But thanks to this, you can easily provide a sensible "batteries-included" starting point, e.g.:

from reddwarf import only_vote_threshold, run_clustering, stock_polis_filter_participants

db = init_db()
votes = select_votes(db)

options_0 = {
    "filter_participants": only_vote_threshold(),  # defaults to 7
}
clusters_0 = run_clustering(votes, options_0)

options_1 = {
    "filter_participants": only_vote_threshold(6),
}
clusters_1 = run_clustering(votes, options_1)

options_2 = {
    "filter_participants": stock_polis_filter_participants,
}
clusters_2 = run_clustering(votes, options_2)

clusters_3 = run_clustering(votes)  # no participants are filtered

def custom_filter_participants(participant_id: int | str, vote_count: int, vote_coverage_percentage: int) -> bool:
    if participant_id == 3:
        return False
    else:
        return True

custom_filter = custom_filter_participants
options_4 = {
    "filter_participants": custom_filter,
}
clusters_4 = run_clustering(votes, options_4)
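The example above assumes pre-made predicate factories exist; a minimal sketch of how one might be written under this hypothetical API (not the current reddwarf code) could look like this:

```python
from typing import Callable

# A predicate takes (participant_id, vote_count, vote_coverage_percentage) and
# returns True when the participant should be "activated".
Predicate = Callable[[int | str, int, int], bool]

def only_vote_threshold(threshold: int = 7) -> Predicate:
    # Factory returning a predicate that activates a participant once they
    # have cast at least `threshold` votes.
    def predicate(participant_id: int | str, vote_count: int,
                  vote_coverage_percentage: int) -> bool:
        return vote_count >= threshold
    return predicate
```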
I like the predicate idea, though I'm on the fence about whether it's the right time to implement it just yet.
I'd say, "7 votes needed" feels a bit arbitrary.
Strongly agree. So many of these choices are arbitrary. The 2021 paper even admits:
Participants who voted on fewer than seven comments are removed from the conversation to avoid the “clumping up” of participants around the center of the conversation. This number is somewhat arbitrary but tuned as a hyperparameter based on experience with the domain. We will leave discussion of better metrics and functions for deciding whether to keep or drop a participant to a future paper dealing with missing data.
https://hyp.is/JbNMus5gEe-cQpfc6eVIlg/gwern.net/doc/sociology/2021-small.pdf
Having said this, I'm just now fully realizing how much you're working in a whole other domain with your reddit-style UI. Shared defaults won't always be possible between these domains. So the above point I perhaps started out trying to make (about Polis encoding domain knowledge) is kinda inappropriate.
For example, "domain knowledge" in polis/wikisurveys might say that if someone only votes on 3 of 6 comments in a new Polis convo, it makes total sense to exclude them, because essentially no reasonable participant would do that. That participant's votes are almost certainly junk data -- someone just cruising through and straight-voting agree 3 times to see how it works and whether the statements are interesting, which will have unreasonable influence on early conversation. But in Agora's list, votes on 3 of 6 statements would be very common, and perhaps even the main way participants will show up in the data, given the affordances of the reddit-style UI :)
I've just now realized that we're prob going to need (not real name suggestions) redditlike.run_clustering() and wikisurvey.run_clustering(), bc the sensible defaults just aren't the same between them. We have a wealth of recommendations/guardrails for the latter via Polis, but it's kinda terra nova for the former.
I would like this library to be highly supportive of someone operating in the wikisurvey scenario (overridable guardrails etc), since that's a very rigorously understood domain, and we already know what holds pretty well.
I wonder if maybe the answer for these two domains is to have two implementations (eventually predicate-like? or now?). We'd have one for wikisurveys/polis, and another for agora (explicitly named for now). agora.run_clustering() would allow tune-able defaults suiting your domain.
For clarity, going to spec out what's needed for wikisurvey.run_clustering(), though it need not be implemented in this PR:

wikisurvey.run_clustering() (⭐ = needs to be added)
1. below the min_user_vote_threshold (default: 7), include any users who've voted on every statement. ⭐
2. beyond the min_user_vote_threshold, only include users who meet the threshold.
3. add a keep_participant_ids field to ClusteringOptions that takes a list of participant_ids to be kept regardless of (1) and (2). ⭐
This allows a sensible default for processing wikisurveys, and if implementers wish to smooth the edge-cases (i.e. not have users jump in and out of the calculation when it's near the threshold), then they can track participants who were included already and pass them into keep_participant_ids to smooth things out. Polis behavior (as I currently understand it) could easily be reproduced with this function definition.
from typing import List, Optional, TypedDict

class ClusteringOptions(TypedDict):
    min_user_vote_threshold: Optional[int]
    keep_participant_ids: Optional[List[int]]  # new
    max_clusters: Optional[int]
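A hypothetical usage sketch of the keep_participant_ids smoothing described above (the wikisurvey function name, option keys, and result shape are assumptions, not the final API):

```python
# Caller-side state, persisted between clustering runs by the implementer.
previously_included: set[int] = set()

options = {
    "min_user_vote_threshold": 7,
    "keep_participant_ids": list(previously_included),
}
# result = wikisurvey.run_clustering(votes, options)  # assumed signature
# previously_included.update(p.participant_id for p in result.participants)  # assumed result shape
```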
We could later (beyond MVP?) move min_user_vote_threshold and keep_participant_ids into a predicate. (I suspect the eventual predicate will need access to not just participants, but statement data, because it's not totally clear to me whether filtering of statements vs participants should happen first.)

Back to Agora

To be clear, I love the predicate idea, but it still feels like more than MVP here. Can we bank that for later, and add it as we see the shared shape of needs between implementations? For now, can we just keep adding options keys to agora.run_clustering() for your specific needs? If you need a predicate, I'm happy to do that now; I just always prefer to wait for the second implementation to appear.

What does Agora specifically want, nevermind the abstraction of predicates? Is it essentially min_user_vote_threshold=0?
- below the min_user_vote_threshold (default: 7), include any users who've voted on every statement.

For now, I'm realizing the current internal utils.filter_matrix() needs the above change at minimum to even make sense, so going to make a PR for that regardless of the needs of the agora implementation.
Isn't that what we have already? agora.run_clustering() is already distinct from the other one.
Despite its package name (which I already mentioned I think we should change to something more neutral in the long term, or unify!), I don't think this function should explicitly optimize for any specific product.
The way I see it:
(1): core reddwarf clustering library, configurable with low-level stuff => used by (2): specialized-wikisurvey-library-that-configures-the-core-library-with-sensible-defaults-such-as-filtering-participants-or-statements => used by (3): visualization library => ...etc
run_clustering is part of (1), so product-specific stuff shouldn't be present. Product-specific stuff should be in (2), in other words, in how (1) is configured before being run.
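A minimal sketch of that layering with hypothetical stand-in names (the real reddwarf functions may differ): layer (2) is just layer (1) pre-configured with opinionated defaults.

```python
# Layer (1): core clustering, no product opinions (hypothetical stand-in).
def core_run_clustering(votes, options=None):
    options = options or {}
    ...  # build matrix, filter participants, run PCA + k-means per low-level options

# Layer (2): wikisurvey-flavoured wrapper that only supplies sensible defaults.
def wikisurvey_run_clustering(votes, options=None):
    defaults = {"min_user_vote_threshold": 7}
    return core_run_clustering(votes, {**defaults, **(options or {})})
```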
What does Agora specifically want, nevermind the abstraction of predicates? Is it essentially min_user_vote_threshold=0?

=> The short answer is: I don't know. I would need to do some testing. Instinctively, with how it works now, I think I wouldn't put any restrictions on participants whatsoever.
If you do add some options, I think it's actually more work to do it with multiple params than to do it via a predicate. And the added benefit of a predicate is allowing library users to experiment on their own and provide you with feedback.
Afaik, filter_statements is unnecessary because we don't have the incremental API yet. So the library consumer will already know what statements they don't want to use for clustering (moderation, etc), and can filter them out before passing them to the library's votes param.
I thought filter_participants was more than just blindly ignoring the said participants. I thought it was meant to decide which participants get "activated" (so more like filter_activated_participants), which doesn't mean they are not used at all? If inactivated participants aren't used at all, then since the library consumer knows all the data already, it's straightforward for the library consumer to simply filter the participants in the votes input of run_clustering however they see fit, and filter_participants is unnecessary.
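For example, the consumer-side filtering described above might look something like this (a sketch only; the vote dict fields are illustrative):

```python
# Drop votes on moderated statements before calling the library, so no
# filter_statements option is needed in run_clustering itself.
moderated_statement_ids = {12, 47}  # illustrative IDs
filtered_votes = [v for v in votes if v["statement_id"] not in moderated_statement_ids]
# clusters = run_clustering(filtered_votes, options)
```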
On a side note, trying to confine product use cases into a box (reddit-like vs wikisurvey-like) at the lowest-level library is a slippery slope. Right now, we're discussing Polis and Agora, but the reality is we can’t predict how library consumers will use this. There may be countless use cases beyond what we can imagine, so it's crucial to keep our options open. This means avoiding arbitrary decisions whenever possible for the lowest-level functions and instead prioritizing maximum flexibility.
Need to digest, but reticketed the naming conversation here (sorry to miss out on responding earlier): #10
For the test coverage, what do you have in mind?
Ah, wasn't the feedback to leave this for another PR? Re: #7 (comment)
Sure 💯. Do you think I can already replace stock polis with this library in Agora? I'd like to transition as soon as possible so:
It sounds like you just want to set
I hope to not draw us into this approach too much in the future, but I'd take you up on just merging and refining! (lengthy PRs are a slog)
If everything in this comment feels correct, then yes! If you approve this PR, I'll quickly draft an issue to cut a release, so you can pin things 🎉
Nice, merged this then. Feel free to merge yourself too!
Addresses #6
cc: @nicobao
Things to note:
- run_clustering takes a single Conversation instead of List[Conversation]. This allows calculations on a more minimal interface that doesn't need anything beyond a list of votes. Running the method for each conversation seemed reasonable to leave to the implementer. (And there is no performance benefit to batching.)
- statement is the term I prefer over opinion, because in theory these algorithms can be run on metadata and factual beliefs.
- Conversation.votes_by_participants and votes_by_opinion are less ideal keys on the Conversation object than simply votes. If users arrive with data of that shape, we can provide helpers to deconstruct it back into vote records.
- projection in Cluster: I prefer the demdis approach of treating the output ClusteredParticipant (with projection coords attached to the returned participant) as different from any input Participant. My reasoning for this is that participants can't even be grouped without projected coords, so projected data are basic and should be returned on the participant, not in a separate key. (The way polis deconstructs all data into separate keys is not something I'd like to emulate, as it's complicated to make sense of when interpreting the data -- counting lengths of arrays etc.)
- a ClusteringResult object instead of Clusters leaves more room for expanded metadata
- participants over members (less disjoint terminology seems better)
- reddwarf.utils.
- reddwarf.utils are fully documented on docs website
- reddwarf.utils methods (see "Todos" section)

Todos

- reddwarf.utils methods used:
  - generate_raw_matrix()
  - get_unvotes_statement_ids()
  - filter_matrix()
  - run_pca()
  - run_kmeans() #wontdo
  - find_optimal_k() #wontdo
  - scale_projected_data() #wontdo
- run code coverage report for any PRs #wontdo
- ensure full coverage for all reddwarf.utils methods #wontdo
- reddwarf.agora.run_clustering_v1() and reddwarf.types.agora (run make docs-serve to view) 0df8409
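For orientation, a rough, hypothetical usage sketch of the interface described in this PR; the shapes of the conversation input, the options, and the result are only inferred from the discussion above and may not match the actual code:

```python
# Hypothetical sketch only -- import paths and signatures are assumptions.
from reddwarf import agora
from reddwarf.types.agora import VoteValueEnum

conversation = {
    "votes": [
        {"participant_id": 1, "statement_id": 1, "vote": VoteValueEnum.UP},
        {"participant_id": 2, "statement_id": 1, "vote": VoteValueEnum.DOWN},
        {"participant_id": 2, "statement_id": 2, "vote": VoteValueEnum.NEUTRAL},
    ],
}
options = {"min_user_vote_threshold": 7, "max_clusters": 2}  # keys from ClusteringOptions above

# result = agora.run_clustering_v1(conversation, options)  # assumed signature
# for participant in result.participants:
#     print(participant)  # ClusteredParticipant with projection coords attached
```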