24
votes
How can I combine several ranked lists into one mega list?
Hello smart ~comp people! I have a very basic, layman question. The kind of question I'm scared to make on Reddit and gettting flamed. Tildes is usually more patient ;)
Suppose that I get get a bunch of "best of" lists for several videogames. Like "the best RPGs on the Nintendo DS" for example. The lists have varying lenghts. Is there an easy way for me to combine those lists into one that doesn't require (really) learning to program?
I can follow instructions! Thanks!
This is a more subtle question than you might expect. It's equivalent to asking the "best" way to rank candidates in an election, but Arrow's impossibility theorem shows that there really is no "best" way to do this (more precisely, it is impossible to prevent irrelevant candidates from spoiling the results).
So the best we can do is acknowledge the drawbacks of whatever algorithm we decide to use.
The most straightforward approach would be to use the Borda method. In this approach, we first tabulate the N unique candidates from all the ballots (lists). Then for each ballot, we compute a Borda score for each candidate in that ballot, where the Borda score is given by
If a candidate is unranked in a given ballot, then its rank is assumed to be N.
For example, given two ballots, we would have the following Borda scores
which would produce the following ranked list (by summing the Borda scores for each candidate)
Unfortunately, the Borda method gives poorer predictions when some ballots don't rank every candidate, as a ballot can spoil the final result by ranking their favorite candidate first and leaving the other candidates unranked. For example, imagine you want to combine two video game lists. One list was compiled by an IGN critic who meticulously tabulated their hundred favorite games on the Nintendo DS. The other ballot was written by a 7-year-old and contains only My Little Pony: Pinkie Pie's Party. Then the Borda method would rank My Little Pony: Pinkie Pie's Party above every game in the critic's list except the critic's top game, which would be rated equal.
Since lists of favorite media are unlikely to have complete overlap, you find yourself vulnerable to this shortcoming. However, we can attempt to correct for this failure mode! In this paper, the authors consider some other formulations of the Borda method. The simplest alternative to implement would be the average Borda count method.
The average Borda count method is equivalent to the Borda method except when a ballot contains unranked candidates. In this case, the unranked candidates on the Ballot receive a score equal to the average of the unused points that the ballot could have allotted. With a little bit of algebra, you can show that the average Borda count is given by
where K is the number of items on a given ballot. For the example from before, all but one point was allotted, so the average Borda count for the first list would be
(5-3-1)/2=0.5whereas the second list would not have an average Borda count since all the candidates are ranked. Now we havewhich becomes
In this case, there is no change in the final ordering. But in the critic vs 7-year-old example, MLP would be ranked in the middle of the list instead of at the front (for better or worse).
There are a few tools available online to compute the Borda scores for you. Here is one such calculator, though it only allows for five candidates (almost certainly too few for your use case). A few people have also written python scripts, for example here. Unfortunately, neither of these implementations correct for the average Borda count, so you would have to do that yourself.
I don't know the exact words. I want to weigh the rankings of several best of lists to create a master list representing the the rankings of all the lists combined.
It'd be immensely helpful if you can post a short sample of the data and the end result you want then. For example with lists like:
rpg.txt:
class_based.txt:
The UNIX command
cat rpg.txt class_based.txt | sortwould result in:But this makes a few assumptions:
I haven't decided which lists to use yet. But I use Emacs, which allows me to standardize in text format any lists I encounter with relative ease. These are "best of" lists, so they're going to be "1, 2, 3" and so on.
What I don't know is how to weigh them into a single list.
Let's say I have these lists:
List 1:
List 2:
Can I add them to make a list 3 that represents both rankings combined?
I agree - for all the times I’ve seen spreadsheets used for things they shouldn’t be, this one seems like they’re the right tool for the job.
I’d be thinking average of the item’s position across lists as the score to sort by, either divided by the number of lists a given item appears in, or ranking an item
len + 1for any list it doesn’t appear in, depending whether you want to treat non-appearance as neutral (i.e. these are categorised collections and you don’t expect Pokémon to come up on the top ten racing games in the first place) or negative (if it didn’t come up on the top ten best games of all time, it’s presumably lower ranked than all ten).psi's comment is a nice and informative comment on the tradeoffs you'll have to make depending on how you want to combine them. As far as I understood you don't have anything in your mind currently? If I were in your position I'd go for something simple like TaylorSwiftsPickles's average method, and write myself a Python script like this. Example result:
Google Sheets w/ a column to clean up / weigh your ranks and
QUERYcould handle this with ease.demo sheet
that sheet will be automatically deleted in 30 days -- but its pretty straightforward. The layout is important for this sort of thing. If you're doing a lot of scraping, I'd definitely have an input sheet to bring the titles in, standardize them, then push them to the main dataset.
This was basically my thought. I'm sure it's more work in some ways but I don't program and I'd be able to make excel (or sheets) do this fairly easily
i think the clean up with this would be faster than having to mess with a bunch of scripts.
I can't speak to the scripts obviously but to me most of the work is deciding how the rankings would work together, so it's the thinking. The rest is formulae
I don't really have a good answer especially compared to all the already good suggestions posted, but this reminded me of a YouTube video where Josh Strife Hayes does exactly this.
What's your use case/what's the end result you're going for? While I'm not at all trying to discourage you from learning, this may be a task that the LLM of your choice could tackle with relative ease. Assuming you just want a one off, you could feed it the input data and tell it out to structure your output.
I wish to combine several best of lists into a single list that represents their overall ranking for each game.
I know I could just ask GPT but due to my own ignorance I would have no way to determine if the reasoning was sound :P
GPT could easily be both wrong and totally convincing to me.
Thanks!
I second that wisdom!
I am a software engineer and use LLMs regularly to solve data problems like this. When I give an LLM a problem and a solution, I get adequate results. When I give it a problem and ask for a solution, I rarely get good results and often get simply incorrect ones. Since software is my area, I'm able to review and reject bad results but even here I would struggle because (as @psi points out) the challenge here is in the algorithm and ranking philosophy more than writing the code, both places where I think an LLM would do terribly.
If the files are already sorted and similar length you could just interleave them:
Also, here's a weighted version which is kind of similar to the ideas above:
https://github.com/chapmanjacobd/computer/blob/main/bin/interleave_weighted.py
Then you could deduplicate afterwards with something like this: dedupe.py
Combining lists is (generally) pretty straightforward, so you should be able to do it without getting your hands too dirty.
The important question is what do you mean by “combine”? Do you want to combine the lists sequentially, or interleave them, or something else?
By combining sequentially, I mean the result is: all of List A followed by all of List B followed by all of List C, etc.
By interleaving I mean: each 1st item followed by each 2nd item followed by each 3rd item, etc.
Also, do you need to programmatically ingest the lists to be sorted? If so, it would be useful to know what format the lists are in at the moment.
Could you clarify in what form/format the lists you got is in?
I'm in the bandcamp where all good list should be in a spreadsheet format (librecalc or google sheets).
For converting them into spreadsheets, there are some tools available depending on the original list format, but some work is of course still required.
This is one of those cases where LLMs do shine. Here you have an overview of the various rank aggregation algorithms that you could use + their implementation in code:
LLM output
I admit I haven't read all comments ultra carefully and/or understanding them fully, with that said I sense there are a few things missing:
Purpose! It is not clear what your purpose is, and I do believe the why might impact how to go about it.
What lists? How many lists? Are all lists of equal length and weight? Length is fairly straight forward but weight can be a little bit tricky. I mean are all lists equally important? Maybe you don't put the same weight on "valued-critics-specific-list" as "generic-clickbait-sites-vaguely-related-list"? Do all or most lists contain the same objects?