Monday, May 09, 2016

Evaluating Nulls

Today was quite an active severe weather day locally. Our thoughts are with the people and families affected by today's tornadoes. However, today's SFE forecasts won't be evaluated until tomorrow, and today's evaluations focused on the relatively lackluster severe weather across Utah on Friday. Our participants were faced with a problem that has been discussed by the forecast evaluation community for years - how do you rate and compare correct null forecasts?
Several times today, participants were asked to compare forecasts that looked quite similar, such as forecasts of hourly maximum updraft helicity (UH) from three subsets of the CLUE: an ensemble of 10 ARW members, an ensemble of 10 NMMB members, and a mixed ensemble of 5 ARW and 5 NMMB members. Throughout the period of interest, only a handful of reports occurred. There were also relatively few areas with probabilities of UH greater than 100 m2/s2 from any of the ensembles, suggesting that intense UH activity within the ensembles was limited. Probabilities of UH greater than 25 m2/s2 were more widespread, but at 3-km grid spacing this level of UH seems to indicate little more than general thunderstorm activity; UH is grid-scale dependent, with magnitudes increasing as grid spacing decreases. So, when considering strong UH, our forecasters were essentially faced with a correct null. Participants noted that none of the highlighted regions captured the few reports that did occur, but overall still considered this a correct null.
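For readers unfamiliar with how these ensemble UH probabilities are typically derived, here is a minimal sketch of the basic idea; it is not the CLUE post-processing itself, and the synthetic UH field, array shapes, and thresholds below are illustrative assumptions only.

```python
import numpy as np

# Minimal sketch: point exceedance probabilities of hourly maximum UH from an
# ensemble. uh has shape (n_members, ny, nx), values in m2/s2 for one valid hour.
# The gamma-distributed field is a synthetic stand-in, not real CLUE output.
rng = np.random.default_rng(0)
uh = rng.gamma(shape=2.0, scale=8.0, size=(10, 120, 120))

def exceedance_probability(uh_members, threshold):
    """Fraction of members whose hourly-max UH exceeds `threshold` at each grid point."""
    return (uh_members > threshold).mean(axis=0)

p25 = exceedance_probability(uh, 25.0)    # roughly a general-thunderstorm proxy at 3 km
p100 = exceedance_probability(uh, 100.0)  # proxy for intense rotating storms at 3 km

print(f"Grid fraction with P(UH > 25)  >= 0.5: {(p25 >= 0.5).mean():.3f}")
print(f"Grid fraction with P(UH > 100) >= 0.1: {(p100 >= 0.1).mean():.3f}")
```

In practice these point probabilities are usually smoothed over a neighborhood before being shown to participants, but the simple member-exceedance fraction above is enough to see why a quiet day yields sparse probabilities at the 100 m2/s2 threshold and broader coverage at 25 m2/s2.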

Some participants rated these forecasts fairly highly, with scores of 7 or 8 out of 10. Participants assigning ratings of this magnitude considered the lack of high UH and the widespread low UH "reasonable given the large coverage of mostly non-severe convection". Others rated these ensembles a 4 or 5, feeling either that there was no basis for choosing one model over another, or that it was hard to rate the models fairly given how few reports there were.

These two perspectives are common to the correct null problem. How much credit do you give to the numerical guidance, or to the forecasters, for correctly indicating low probabilities of severe weather in an unfavorable environment? Do you give them a high score, perhaps even a perfect 10/10? A middling score because the forecast wasn't that difficult? Many statistical methods treat these correct nulls the same as more difficult correct nulls, such as when the guidance or forecaster correctly declines to forecast severe weather in what looks to be a favorable environment. A good example is the pair of MPAS cases written up in the summary document from SFE 2015 - two similar setups, but one produced severe convection while the other did not because abundant morning convection and subsequent cloud cover got in the way. Ideally, the correct forecast of little to no severe convection over central Oklahoma would be given more weight than last Friday's correct null over Utah, because it was much harder to discern that there would be less severe weather in the Oklahoma case.
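To make the point about statistical methods concrete, here is a minimal sketch using the standard 2x2 contingency table and the Heidke skill score (HSS); the counts and helper functions are illustrative assumptions, not part of any SFE verification code.

```python
import numpy as np

# Sketch of 2x2 contingency-table verification. The key point: every correct
# null adds the same amount to d, whether the null was "easy" (unfavorable
# environment) or "hard" (favorable-looking environment that failed to produce
# severe weather). The yes/no arrays below are purely illustrative.
def contingency_counts(forecast_yes, observed_yes):
    a = np.sum(forecast_yes & observed_yes)    # hits
    b = np.sum(forecast_yes & ~observed_yes)   # false alarms
    c = np.sum(~forecast_yes & observed_yes)   # misses
    d = np.sum(~forecast_yes & ~observed_yes)  # correct nulls
    return a, b, c, d

def heidke_skill_score(a, b, c, d):
    n = a + b + c + d
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n  # chance-expected correct
    return (a + d - expected) / (n - expected)

# A mostly-null day: 100 grid points, a few forecast and observed events.
forecast = np.array([False] * 95 + [True] * 5)
observed = np.array([False] * 96 + [True] * 4)
a, b, c, d = contingency_counts(forecast, observed)
print("hits, false alarms, misses, correct nulls:", a, b, c, d)
print("HSS:", round(heidke_skill_score(a, b, c, d), 3))
```

Because every correct null simply increments d, an easy null over Utah and a hard null over central Oklahoma contribute identically to the score, which is exactly the limitation the subjective ratings are trying to get at.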

Having forecasters individually produce ratings (rather than group consensus ratings) and comment on why they assigned particular scores allows us to determine how different forecasters approach these problems. By looking closely at these evaluations in addition to statistical analyses of the forecasts, we can approach the problem from complementary directions. Combined efforts such as these may eventually lead to better forecast evaluation methods by blending the best analysis capabilities of both humans and machines. 
