word-distances
A tool that calculates, for each possible first guess in the game of Wordle, the average number of solution words that remain after that guess.
Yes, the name isn't great -- I initially had a different idea in mind, but wound up going with a brute-force approach that gives a much better result.
The code currently in the repo can be compiled with javac Main.java. It's only three source files. It takes no input parameters.
How it Works
The point of this tool is to calculate the best first-guess word for the game Wordle. It does this by taking each word in the list of all possible guess words and comparing it against every possible solution word, generating the result for that guess as the game would (i.e., GREEN, YELLOW, or BLACK for each letter).
It then applies this result as a filter against the solution word list and counts the words that remain. This is repeated for the same guess word against every possible solution word, and each count is added to a running total for that guess word.
Currently, the output is only this total score. Lower scores are best: a low score means that, across the solution set, the guess word tends to leave a small number of possible solution words. It's a brute-force search, so the result is exact. You can average the result for each guess word by dividing this total by the number of solution words (currently 2315) -- the word with the lowest average is the best word to start the game with.
Every possible guess word and total score are output to stdout, in CSV format. The best scoring word along with its score is written to stderr at the end of the run.
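The scoring loop described in this section can be sketched roughly as follows. This is not the code from the repo: the class and method names (FeedbackSketch, feedback, remaining, totalScore) are made up for illustration, the duplicate-letter handling follows the usual Wordle rule, and filtering by "same pattern" is an assumption about how the filter behaves.

```java
import java.util.List;

// Hypothetical sketch; none of these names come from the repo.
// Assumes all words are lowercase a-z and the same length.
class FeedbackSketch {

    // Returns one character per letter: 'G' = GREEN, 'Y' = YELLOW, 'B' = BLACK.
    static String feedback(String guess, String solution) {
        char[] result = new char[guess.length()];
        int[] unmatched = new int[26]; // solution letters not used up by a GREEN

        // First pass: mark exact matches, tally the other solution letters.
        for (int i = 0; i < guess.length(); i++) {
            if (guess.charAt(i) == solution.charAt(i)) {
                result[i] = 'G';
            } else {
                result[i] = 'B';
                unmatched[solution.charAt(i) - 'a']++;
            }
        }
        // Second pass: a non-green letter is YELLOW only while the solution
        // still has an unused copy of it.
        for (int i = 0; i < guess.length(); i++) {
            int letter = guess.charAt(i) - 'a';
            if (result[i] == 'B' && unmatched[letter] > 0) {
                result[i] = 'Y';
                unmatched[letter]--;
            }
        }
        return new String(result);
    }

    // Counts the solution words that would produce the same pattern for this guess.
    static int remaining(String guess, String pattern, List<String> solutions) {
        int count = 0;
        for (String candidate : solutions) {
            if (feedback(guess, candidate).equals(pattern)) {
                count++;
            }
        }
        return count;
    }

    // Total score for one guess word: the remaining-word count summed over
    // every possible solution. Lower is better.
    static long totalScore(String guess, List<String> solutions) {
        long total = 0;
        for (String solution : solutions) {
            total += remaining(guess, feedback(guess, solution), solutions);
        }
        return total;
    }
}
```

Calling totalScore once per guess word and dividing by the size of the solution list gives the average described above.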
Expected Runtime
Running through the 30+ million possibilities took around 4h on my 2021 Intel MacBook Pro. Sorry, I forgot to run it against time to get a more exact result, and the computer went to sleep at some point while I was out for the evening, so YMMV.
Results
Full results can be found in results.txt, which is in CSV format and is sorted by word. Import into your favorite spreadsheet program to sort by score, calculate the average, or do whatever other analysis you want to do.
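If you'd rather not use a spreadsheet, the same sort-and-average step can be sketched in Java. This assumes each line of results.txt looks like word,score with no header row, which is a guess about the file layout; ResultsSketch, Row, and sortByScore are hypothetical names, not part of the repo.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; assumes lines of the form "word,score" and no header row.
class ResultsSketch {
    static final int SOLUTION_COUNT = 2315; // number of solution words, per this README

    // One parsed line of the results file.
    static class Row {
        final String word;
        final long score;

        Row(String word, long score) {
            this.word = word;
            this.score = score;
        }

        // Average number of remaining solution words for this guess.
        double average() {
            return (double) score / SOLUTION_COUNT;
        }
    }

    // Parses the CSV lines and returns the rows sorted by score, best (lowest) first.
    static List<Row> sortByScore(List<String> csvLines) {
        List<Row> rows = new ArrayList<>();
        for (String line : csvLines) {
            String[] fields = line.trim().split(",");
            if (fields.length < 2) {
                continue; // skip blank or malformed lines
            }
            rows.add(new Row(fields[0].trim(), Long.parseLong(fields[1].trim())));
        }
        rows.sort(Comparator.comparingLong(r -> r.score));
        return rows;
    }
}
```

The lines can be read with Files.readAllLines(Paths.get("results.txt")) from java.nio.file and passed straight to sortByScore.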
TL;DR results are:
Top 5 Words:
Word | Score | Average |
---|---|---|
roate | 185067 | 79.9425485961123 |
soare | 190696 | 82.3740820734341 |
raile | 197271 | 85.2142548596112 |
raise | 197781 | 85.4345572354212 |
orate | 199294 | 86.088120950324 |
Bottom 5 Words:
Word | Score | Average |
---|---|---|
immix | 2197269 | 949.144276457883 |
xylyl | 2218252 | 958.208207343413 |
gyppy | 2242976 | 968.888120950324 |
jugum | 2420774 | 1045.69071274298 |
jujus | 2504428 | 1081.82634989201 |
Basic Analysis
Remember, the average will be the average number of words the solution could possibly be out of the solution word set. So the ideal word would be the one that results in the smallest possible number of words for any possible solution.
The best word is ROATE, which when played averages 79.9 remaining possible words out of 2315. The worst is JUJUS, which will leave you with an average of 1081.8 remaining words out of 2315.
Further Work
With the results calculated here, my next step is going to be to expand the code so that we can concentrate on just the top 10 and bottom 10, and dump the raw results for all guessword-solution combinations. We can then graph their frequency, to get a better idea of their overall effectiveness. Is the average fairly evenly distributed? Are some of the words super-effective against part of the solution set, but super-ineffective against the rest? Lots of questions to be answered here, but it's not worth the computation time to run this for every guessword.
Can this be used to create an optimal solver?
I think so.
Step 1 would be to use the guessword with the lowest overall average ('ROATE'). Step 2 would be to figure out the GREEN, YELLOW, BLACK "score" for this guess, and then use that as a filter against the words in the results currently generated by this code (word-score or word-average).
Step 3 would be to iterate: pick the remaining word with the lowest total after the above filtering, and use it as the next guess. This would continue until a solution is found.
Which leads to...
Can we calculate the maximum search depth?
I think so. Using the above optimal solver, we can then run it against all the possible solution words, and calculate how many guesses were required to get the result. Are there any words where this is greater than 6 (the maximum number of guesses in the game)? Stay tuned to find out!
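A rough sketch of that solver loop and the depth check, under a few assumptions: the precomputed totals are available as a map from word to score, the feedback rule is the standard Wordle one, and every name here (SolverSketch, guessesNeeded, maxDepth) is hypothetical rather than taken from the repo. It is a greedy loop, so whether it is truly optimal is exactly the open question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; assumes lowercase a-z words of equal length.
class SolverSketch {

    // 'G' = GREEN, 'Y' = YELLOW, 'B' = BLACK, using the standard duplicate-letter rule.
    static String feedback(String guess, String solution) {
        char[] result = new char[guess.length()];
        int[] unmatched = new int[26];
        for (int i = 0; i < guess.length(); i++) {
            if (guess.charAt(i) == solution.charAt(i)) {
                result[i] = 'G';
            } else {
                result[i] = 'B';
                unmatched[solution.charAt(i) - 'a']++;
            }
        }
        for (int i = 0; i < guess.length(); i++) {
            int letter = guess.charAt(i) - 'a';
            if (result[i] == 'B' && unmatched[letter] > 0) {
                result[i] = 'Y';
                unmatched[letter]--;
            }
        }
        return new String(result);
    }

    // Plays one game: start with firstGuess, filter the scored words by the
    // feedback, then guess the remaining word with the lowest total score.
    static int guessesNeeded(String firstGuess, String solution, Map<String, Long> scores) {
        List<String> candidates = new ArrayList<>(scores.keySet());
        String guess = firstGuess;
        int guesses = 1;
        while (!guess.equals(solution)) {
            final String current = guess;
            final String pattern = feedback(current, solution);
            // Keep only words that would have produced the same pattern.
            candidates.removeIf(word -> !feedback(current, word).equals(pattern));
            if (candidates.isEmpty()) {
                throw new IllegalArgumentException("solution is not in the scored word list");
            }
            String best = candidates.get(0);
            for (String word : candidates) {
                if (scores.get(word) < scores.get(best)) {
                    best = word;
                }
            }
            guess = best;
            guesses++;
        }
        return guesses;
    }

    // Largest number of guesses needed across all solution words.
    static int maxDepth(String firstGuess, List<String> solutions, Map<String, Long> scores) {
        int worst = 0;
        for (String solution : solutions) {
            worst = Math.max(worst, guessesNeeded(firstGuess, solution, scores));
        }
        return worst;
    }
}
```

If maxDepth comes back greater than 6 for the full solution list, at least one word cannot be solved within the game's limit by this particular strategy.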
Thanks
Big thanks go out to one of my former professors, WM, with whom I've been enjoying a Facebook correspondence this week to discuss different strategies for scoring the guess words. Her computationally simpler scoring mechanism came up with the words OATER, ORATE, and ROATE, which I've calculated as the 22nd, 5th, and 1st best first guesses. Considering hers ran in seconds versus the hours it took to do my calculation, I'd say she's found some great starting words!