A principal source of incorrectness in a bridge hand evaluator is incompleteness of the alpha-beta search. That is to say, if the evaluator fails to consider a sequence of card plays that has not been logically ruled out (say, by an equivalence or superiority argument), then the evaluation may be incorrect.
During the development of my system, I have found errors some time after a version of the software had been released. These discoveries were made by happenstance: I noticed a pattern in the values of internal variables that did not seem right, even though the answers returned by the system appeared correct. It is still possible, though I hope it is not the case, that some bugs remain.
One sure way to detect the existence of an error is to run two different evaluators on the same problem and obtain two different answers. In that case, it is certain that at least one of the evaluators has given a wrong answer. Of course, it may not be a simple matter to determine which one was wrong!
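The cross-check described above can be sketched in a few lines. This is only an illustration under stated assumptions: the names `Deal`, `Evaluator`, and `find_disagreements` are hypothetical, a problem is represented as an opaque string, and each evaluator is assumed to be a function returning a trick count. It does not reflect the interface of any actual evaluator.

```python
from typing import Callable, Iterable, List, Tuple

# Assumption: a problem (deal, contract, leader) is encoded as an opaque string,
# and an evaluator is any function mapping that string to a number of tricks.
Deal = str
Evaluator = Callable[[Deal], int]


def find_disagreements(
    eval_a: Evaluator, eval_b: Evaluator, deals: Iterable[Deal]
) -> List[Tuple[Deal, int, int]]:
    """Run both evaluators on every deal and collect the deals where they differ."""
    disagreements = []
    for deal in deals:
        tricks_a = eval_a(deal)
        tricks_b = eval_b(deal)
        if tricks_a != tricks_b:
            # At least one evaluator is wrong here; this does not say which one.
            disagreements.append((deal, tricks_a, tricks_b))
    return disagreements
```

An empty result shows only that the two evaluators agree on the problems tried. They could still share the same error, so agreement is weaker evidence than a disagreement is.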
To help address this vexing problem, I suggest the establishment of a library of problems and answers, which can be used to test bridge hand evaluators and so increase confidence in their reliability.
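As a rough sketch of how such a library might be used, the following checks one evaluator against a table of problems with agreed answers. The representation is an assumption made for illustration: the library is a plain dictionary from a hypothetical problem encoding to the expected number of tricks, and `check_against_library` is a hypothetical name.

```python
from typing import Callable, Dict, List, Tuple


def check_against_library(
    evaluator: Callable[[str], int], library: Dict[str, int]
) -> List[Tuple[str, int, int]]:
    """Return (problem, expected, actual) for every library entry the evaluator gets wrong."""
    failures = []
    for problem, expected in library.items():
        actual = evaluator(problem)
        if actual != expected:
            failures.append((problem, expected, actual))
    return failures
```

An evaluator that produces no failures has not been proved correct. It has only survived the problems the library happens to contain, so the library is more useful the more varied those problems are.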