New tool – WordFreq
by Martin Jansson
A disclaimer… I am no developer, but I have developed a tool. As I develop, I have the mindset of a developer, not the tester. I have made lots of mistakes, intentionally left out good/needed things, and considered what parts I could get away with in the first release. This tool might not seem big and useful, but I have used it and it has produced many interesting results in the past. As I developed it I tried a new method of implementation… every idea I had on what functions the tool should have, what was supposed to work, what was not supposed to work and so on, I wrote down in a test-ideas document. I then had one column that identified whether it worked or not in a specific release. All good feedback I added to that list.
This is the first tool we create at thetesteye that is open to the public. At thetesteye we have chosen to publish our material under the license Attribution No Derivatives. My personal aim with this tool was to increase my knowledge of coding. I have used Python, with Tkinter as the graphical interface. In the Publications section you will find the link to the tool and the currently released version.
General discussion
The general idea is to use the frequency of words as a way to find errors. The more text you analyze, the higher the statistical significance, and thus the easier it becomes to spot the erroneous words. This kind of script is very often found as a code example. When I first created a script for this I did not know that. I ran it on a quite large text corpus and found that the company name had been spelled incorrectly 7 times in the copyright text. I also found lots and lots of spelling mistakes, as well as some strange API functions that were incorrect.
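The core idea can be sketched in a few lines of Python. This is a generic illustration of frequency-based error spotting, not WordFreq's actual code; the function name and sample text are made up:

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count how often each word occurs in the given text."""
    words = re.findall(r"[A-Za-z']+", text)
    return Counter(words)

text = "teh quick brown fox jumps over the quick brown fox"
freq = word_frequencies(text)

# Words that occur only once are candidates for closer inspection:
# a rare word is often a misspelling of a frequent one (here, "teh").
rare = [w for w, n in freq.items() if n == 1]
```

The statistics only become useful on a large corpus: with a handful of sentences almost every word is rare, but over hundreds of pages the genuinely rare words stand out.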
Use cases
- Run on documentation to find infrequent words (which often contain spelling errors)
- Run on code to find variables that are similar but not the same, and are used incorrectly
- Run on code to find unused variables, i.e. variables that occur only once
- Run on code + API documentation to find things that should not be there, or code that is not covered anywhere
- Localization specific: when doing translations you might be allowed a certain amount of errors; this is one way of finding a few extra faults that you can remove
How I use it
I run the tool on a tree structure. I open the result file in Excel or OpenOffice Calc, sort on frequency… and start deleting uninteresting records. You can open the list in MS Word or something similar to filter out the words that are in fact spelled correctly. After a few rounds of cleanup you might have a list that is worth investigating.
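That workflow — walk a tree, count words, write a file a spreadsheet can sort — can be sketched roughly as below. This is an assumption-laden illustration, not WordFreq's implementation: the file filter (`.txt`), function names, and CSV output format are all my own choices here.

```python
import csv
import os
import re
from collections import Counter

def scan_tree(root):
    """Walk a directory tree and count word frequencies across all .txt files."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".txt"):
                path = os.path.join(dirpath, name)
                with open(path, errors="ignore") as f:
                    counts.update(re.findall(r"[A-Za-z']+", f.read()))
    return counts

def write_csv(counts, out_path):
    """Write word/frequency pairs sorted by ascending frequency,
    so the rarest (most suspicious) words come first in Excel or Calc."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "frequency"])
        for word, n in sorted(counts.items(), key=lambda kv: kv[1]):
            writer.writerow([word, n])
```

Writing CSV rather than a plain text dump is what makes the sort-and-delete pass in a spreadsheet convenient.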
Bugs and Enhancements
The testideas.xls contains the current tests and some of the enhancement suggestions that I've gotten so far. If you have any suggestions, feel free to mail me at martin.jansson@thetesteye.com.
I think it is great that you publish your tool here for anyone to use! Although it is also useful for developers, I will discuss with the people writing our customer product information what tools they use today and see if this tool is something they are interested in introducing. A tip – on the download page you might want to specify what system requirements there are for your tool.
Good call. I had some problems using it on Vista, but I guess that is a normal case. I updated the Publications file and will include more details in the next release.
Thanks for this tool!
Today I had great use of it.
I have proof-read a 240-page dissertation, and after corrections, WordFreq helped me find 12 more types of problems.
I used alphabetical sorting on the words, which helped me see inconsistencies quite easily.
An “ignore case” option would have made the tool even more powerful.
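For what it is worth, the suggested "ignore case" option essentially amounts to normalizing case before counting, so that "Word" and "word" are tallied together. A minimal sketch in Python — the function name is hypothetical, not part of WordFreq:

```python
from collections import Counter

def count_ignore_case(words):
    """Count words after lowercasing, merging e.g. 'The', 'the' and 'THE'."""
    return Counter(w.lower() for w in words)

merged = count_ignore_case(["The", "the", "THE", "tool"])
```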