How exactly do you choose your favorite model?
It matters, because that is the model you fall back on when uncertainty is large. The model you are most familiar with. The model you use to recalibrate yourself when a big event is forecast.
Have you chosen wisely? What standards and metrics have you used to evaluate your "favorite" model? How long was the data set you used for this verification? Or did you validate your model on a few cases using a few parameters?
Most managers, when deciding which model is superior, ask for a number: one they can use to justify an upgrade or the removal of a modeling system. They ask which metrics are most relevant, then ask for that number along with the same numbers for the competing modeling systems.
Yet every metric has strengths and weaknesses, and each applies only to certain variables over certain time intervals. Sometimes they are informative, sometimes not. We have been exploring some of the more popular metrics, such as the Fractions Skill Score, the Gilbert Skill Score, and the Critical Success Index (CSI). We have been doing this by pairing these scores with subjective measures of skill.
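To make the "number" concrete, here is a minimal sketch of how two of these scores, CSI and the Gilbert Skill Score (also known as the Equitable Threat Score), fall out of a 2x2 contingency table of hits, misses, false alarms, and correct negatives. The function names, the sample values, and the threshold are all hypothetical and chosen only for illustration; this is not the code behind any particular verification system.

```python
def contingency_counts(forecast, observed, threshold):
    """Count hits, misses, false alarms, and correct negatives.

    An "event" is any value at or above the threshold (an assumption
    made for this sketch; real systems may define events differently).
    """
    hits = misses = false_alarms = correct_negatives = 0
    for f, o in zip(forecast, observed):
        f_event = f >= threshold
        o_event = o >= threshold
        if f_event and o_event:
            hits += 1
        elif o_event:
            misses += 1
        elif f_event:
            false_alarms += 1
        else:
            correct_negatives += 1
    return hits, misses, false_alarms, correct_negatives


def csi(hits, misses, false_alarms):
    """Critical Success Index: hits / (hits + misses + false alarms)."""
    denom = hits + misses + false_alarms
    return hits / denom if denom else float("nan")


def gilbert_skill_score(hits, misses, false_alarms, correct_negatives):
    """Gilbert Skill Score: CSI adjusted for hits expected by chance."""
    total = hits + misses + false_alarms + correct_negatives
    if total == 0:
        return float("nan")
    # Hits expected from a random forecast with the same event frequencies.
    hits_random = (hits + misses) * (hits + false_alarms) / total
    denom = hits + misses + false_alarms - hits_random
    return (hits - hits_random) / denom if denom else float("nan")


# Made-up forecast and observed values at eight grid points.
forecast = [0, 5, 12, 20, 3, 15, 0, 8]
observed = [0, 11, 14, 2, 0, 18, 12, 1]

counts = contingency_counts(forecast, observed, threshold=10)
print("CSI:", csi(*counts[:3]))           # 0.4 for this sample
print("GSS:", gilbert_skill_score(*counts))  # about 0.143 for this sample
```

Notice that both scores reduce the whole forecast to point-by-point matches at one threshold. Nothing in them knows whether the storms were lines or cells, or whether they were displaced by one grid point or fifty, which is part of why these numbers are only sometimes informative.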
But exactly how do you quantify skill in forecasting the mode or evolution of convection?
Can you compare two different models and rank them when one is purposefully different in appearance from the other? And when these (let's call them) obvious biases are present, what metric can account for them? How can you be objective in that setting?
Many forecasters, researchers, and modelers struggle with these practical considerations because only conditional metrics are available for evaluating a model. Most people prefer to have "eyes on the output" (i.e., subjective impressions) to identify the strengths and weaknesses of a system so that it can be developed further and improved.
As it stands, we settle for models with well-known biases. But in this climate of change we really need to develop a framework for evaluation that, at its core, can distinguish skill and reliability across models and capture the subjective impressions of the people who use them.