Last August AccuWeather extended their long-range forecasts from the previous 30 days out to 45 days. They are now claiming they can predict the weather a month and a half into the future with a degree of accuracy that makes the forecasts worthy of publication.
I had serious doubts about just how good those forecasts were, as I had watched the 30-day forecasts while wondering what the weather would be like on the days my daughter would make her trips home from Penn State (the home of AccuWeather) during the Christmas holiday season. I looked at the 30-day forecasts they were publishing at the time and found them lacking.
In early October the crew at the Capital Weather Gang at the Washington Post published a rather simple analysis of the accuracy of those new forecasts. They were not impressed.
A group of meteorology students at Penn State did a more in-depth evaluation. They concluded that the climatological normal actually made for a better forecast of the daily high temperature than the AccuWeather forecasts did. They also concluded that this was true not just 45 days into the future but even just 10 to 15 days into the future.
AccuWeather unsurprisingly defended the forecasts. But I think the defense was well summed up by the reporter for the Washington Post when he wrote: “The thrust of Myers’ rationale for providing 45-day forecasts is customer demand and satisfaction.” Read the comments. The arguments mostly run along the lines of: there is a demand for the data, people are reading the data, they are coming back to our website, we have the data, and therefore we have an obligation to provide the data. The arguments are based solely on demand for the data. They are not based on the accuracy or quality of the data.
AccuWeather did also make a couple of relevant points. They argued that the students only looked at the forecast of the high temperature. That is a very good point. As the spokesman said, looking at precipitation and the forecast for the low temperature really should have been included. The spokesman also said that looking at only a 93-day window was too short. That point can be debated. He also thought that looking at only 15 cities was too small a sample. Given the consistent results across the 15 cities, this does not seem to me to be a very serious criticism of the students’ work. Interestingly, the original piece in the Washington Post looked at the high and low temperatures and precipitation in its analysis and still found the forecasts lacking.
So how should one evaluate a long-range weather forecast? How do we conclude that it is good or bad?
The students at Penn State used the average absolute deviation of the forecasted high temperature relative to the actual high. The original Washington Post evaluation used the same absolute deviation, but extended it to both the high and low temperatures and added a component that incorporated precipitation into their measure of accuracy. The Washington Post graphics for the three cities they looked at are informative. They looked at forecasts for San Francisco, CA, Denver, CO, and Mobile, AL. Denver was the city with the worst performance.

The differences illustrate the problems with using a single measure for all three cities. First, precipitation rates and frequency vary by city. In the desert southwest I can make a fairly accurate forecast of precipitation by just claiming it will never rain. I cannot do anything similar in Florida. Also important is how much temperatures vary. The less variability there is in the temperatures of a given city, the more accurate my forecasts are likely to be when I use temperature deviation as my evaluation criterion. I simply have to have a model that gives forecasts within a smaller temperature range for those cities with the smaller variability. In the Washington, DC metropolitan area summer high temperatures vary from about 60 to 100. In the middle of winter they can vary from 10 to 70. The range is much larger in the winter than in the summer. So if, like the Penn State students, I use as my benchmark the average absolute deviation from the climatological normal, my summer measurement is very likely to look better than my winter measurement. This is simply because the average absolute deviation in the winter will most likely be larger than it is in the summer.
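To make that kind of comparison concrete, here is a minimal sketch in Python of scoring a forecast against a climatology baseline using the average absolute deviation. The temperature values and variable names are hypothetical placeholders, not data from either evaluation.

```python
import numpy as np

def mean_abs_error(predicted, actual):
    """Average absolute deviation between predicted and observed values."""
    return float(np.mean(np.abs(np.asarray(predicted) - np.asarray(actual))))

# Hypothetical daily high temperatures (deg F) for one city over a week.
actual_highs      = [41, 38, 45, 52, 47, 40, 36]
forecast_highs    = [44, 42, 39, 50, 55, 38, 41]   # long-lead forecasts
climatology_highs = [43, 43, 43, 43, 44, 44, 44]   # climatological normals

print("Forecast MAE:   ", mean_abs_error(forecast_highs, actual_highs))
print("Climatology MAE:", mean_abs_error(climatology_highs, actual_highs))
# If the climatology MAE is smaller, the normals beat the forecast, which is
# essentially what the students reported at the longer lead times.
```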
Another question to consider in developing an evaluation criterion is what the users’ expectations are. An error of a few degrees in the forecast is likely inconsequential. The failure to forecast a major snowstorm is a big deal. We all expect forecasts to change over time because we recognize that forecasting is an inexact science. Perhaps a better measure would incorporate how much the forecast varies around the actual weather that occurs. A forecast that is 20 degrees above normal one day and 20 degrees below normal a week later for the same date is a problem regardless of the final weather on the forecast day. So I would want a measure that incorporated how much the forecast changed over time and how close to the forecast date the prediction got to the final weather for that date and remained there.
So what evaluation criterion would I suggest? First off, I would not penalize the forecast for being off by a few degrees. That much error would be expected by the users. For errors of more than a few degrees I think I still prefer a quadratic loss function. Part of the reason for that is that I think these types of forecasts tend to avoid the extremes. I have not seen forecasts of 20 degrees above or below the norm. This is a conservative response by the forecaster, but missing those extreme weather events is, to me, a big error. Next, I would measure the error over time as the date approaches, so the final error measure would be based on the 45 forecasts for the given day. I would use a weighting system for the error measure. This is because the forecast claims to be good out to 45 days, so an error at 45 days should be considered worse than an error 5 days out.
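As a rough illustration only, here is how a tolerance-plus-quadratic loss with lead-time weights might look. The three-degree tolerance, the linear weights, and the sample numbers are my own assumptions for the sketch, not anything AccuWeather or the evaluations actually used.

```python
import numpy as np

def lead_weighted_loss(forecasts, actual_high, tolerance=3.0):
    """Score the successive forecasts issued for a single target date.

    forecasts: forecast highs ordered from the longest lead (45 days out)
               down to the shortest lead (1 day out).
    actual_high: the observed high temperature on the target date.
    tolerance: errors of this many degrees or fewer are not penalized.
    """
    forecasts = np.asarray(forecasts, dtype=float)
    leads = np.arange(len(forecasts), 0, -1)        # 45, 44, ..., 1 days out
    errors = np.abs(forecasts - actual_high)
    excess = np.maximum(errors - tolerance, 0.0)    # forgive small misses
    weights = leads / leads.sum()                   # longer lead -> heavier weight
    return float(np.sum(weights * excess ** 2))     # quadratic loss on the excess

# Hypothetical example: five forecasts for one date, observed high of 45.
print(lead_weighted_loss([55, 50, 48, 46, 44], actual_high=45))
```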
I would also penalize the forecaster for changing the forecast. I would do this because a good forecast should not need to be changed, so any changes are a self-admission that the original forecast came up short. As some changes in the forecast should be expected, I would make the penalty proportional to the size of the change, perhaps using a variance measure on the 45 high and low temperature forecasts. I would view greater variability in the forecasts for a given date as an indication of the poorer quality of the forecast. I would also want to incorporate a measure of the error in the precipitation forecast. That measure should also include components for the actual error as well as how much and how often the forecast changes over time.
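A minimal sketch of that change penalty, again with hypothetical inputs: the variance of the successive high and low forecasts for one date, plus a simple precipitation term combining the final error with the variability of the earlier calls. How these pieces should be scaled against the temperature loss above is exactly the open question.

```python
import numpy as np

def revision_penalty(high_forecasts, low_forecasts):
    """Penalty for how much the forecasts for one target date bounce around:
    the variance of the successive highs plus the variance of the lows."""
    return float(np.var(high_forecasts) + np.var(low_forecasts))

def precip_penalty(precip_forecasts, precip_actual):
    """Error of the final precipitation call (e.g., probability of precipitation
    on a 0-1 scale) plus the variability of the earlier calls for the same date."""
    precip_forecasts = np.asarray(precip_forecasts, dtype=float)
    final_error = abs(precip_forecasts[-1] - precip_actual)
    return float(final_error + np.var(precip_forecasts))

# Hypothetical forecasts for one date, longest lead first.
print(revision_penalty([55, 50, 48, 46, 44], [35, 33, 34, 30, 31]))
print(precip_penalty([0.7, 0.2, 0.5, 0.1, 0.1], precip_actual=0.0))
```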
I am not sure at this time how I would wrap all of these measures together. At this point, with everything I’d like to include, the evaluation criterion I am thinking of seems much too complex. I would want to do separate measures for each forecast area. There are enough climate differences across the country that developing an appropriate way to combine the error measures would be problematic.
With all that in mind, I think I’ll start tracking AccuWeather’s forecast for the Super Bowl. The game is 26 days out, and in their media hype they are already giving very specific forecasts. They claim:
It is important to note that long-range forecasts are not intended to give an exact forecast so far ahead of time. However, they can give an accurate representation of the kind of weather pattern that will play out. The trend for this year’s winter is based on a number of factors, including the potential for a split jet stream.
All the while they are saying it will be cloudy with a high of 45 and no snow on game day. I’ll be looking at just how much that forecast changes over the next three-plus weeks.