The Berkeley Earth Surface Temperature Study has created a preliminary merged data set by combining 1.6 billion temperature reports from 16 preexisting data archives. Whenever possible, we have used raw data rather than previously homogenized or edited data. After eliminating duplicate records, the current archive contains over 39,000 unique stations, roughly five times the 7,280 stations in the Global Historical Climatology Network Monthly data set (GHCN-M) that has served as the focus of many climate studies. The GHCN-M is limited by strict requirements on record length and completeness, and by the need for nearly complete reference intervals used to define baselines. We have developed new algorithms that reduce the need to impose these requirements (see methodology), and have therefore intentionally created a more expansive data set.
We performed a series of tests to identify dubious data and merge identical data coming from multiple archives. In general, our process was to flag dubious data rather than simply eliminating it. Flagged values were generally excluded from further analysis, but their content is preserved for future consideration.
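As a minimal illustration of this flag-and-preserve approach (the structure and names below are ours, not those of the production code), each observation can carry a set of quality flags, and analysis routines skip flagged values without discarding them:

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    """A single temperature report with quality flags (illustrative layout only)."""
    date: str            # e.g. "1953-07-14"
    value: float         # temperature in degrees Celsius
    flags: set = field(default_factory=set)

def usable(obs):
    """Flagged values stay in the archive but are excluded from analysis."""
    return not obs.flags

record = [Observation("1953-07-14", 21.3),
          Observation("1953-07-15", 999.0, {"bad_value"})]
analysis_values = [o.value for o in record if usable(o)]   # [21.3]
```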
We filtered and merged the data archives using the following steps:
- Duplicate filter: We first separately searched each archive for multiple copies of the same record and eliminated the duplicates.
- Data split: Each unique record was broken up into fragments having no gaps longer than 1 year. Each fragment was then treated as a separate record for filtering and merging (this splitting is sketched after the list). Note, however, that the number of stations is based on the number of unique locations, not the number of record fragments.
- Bad values filter: We flagged and excluded from further study values that had pre-existing indicators of data quality problems associated with instrumental error, in-filling of missing data, and/or post-hoc manipulations. We further removed values that exceeded global climate extremes (e.g., +5000 °F).
- Repetition filter: We tested for runs of repeated values, a common sign of in-filled missing days, and flagged repeated values exceeding an empirical 99.9% threshold for non-randomness (sketched below).
- Local outlier filter: We tested for and flagged values that exceeded a locally determined empirical 99.9% threshold for normal climate variation in each record (illustrated below).
- Temperature consistency filter: We required that the minimum temperature (Tmin) be strictly less than the maximum temperature (Tmax) for each measurement. We further required that any reported average or instantaneous temperature (Tavg and Tobs) lie between the reported maximum and minimum, inclusive (see the sketch below).
- Initial merge: Using nearby locations and matching station ID codes, we tested for the presence of identical data in multiple archives. Records that had identical content for at least 90% of values were then merged (an illustrative merge test is sketched below). Small segments of non-identical content within otherwise equivalent records were flagged and also carried forward.
- Regional filter: For each record, the 21 nearest neighbors having at least 5 years of record were located. These were used to estimate a normal pattern of seasonal climate variation. After adjusting for changes in latitude and altitude, each record was compared to its local normal pattern and 99.9% outliers were flagged. Simultaneously, a test was conducted to detect long runs of data that had apparently been miscoded as Fahrenheit when reporting Celsius. Such values, which might include entire records, would be expected to match regional norms after the appropriate unit conversion but not before (this unit test is sketched below).
- Second merge: Monthly time series were constructed from the daily values in two versions: one using all values and one using only non-flagged values. These monthly synthesis records were then compared to the values in data archives that reported only monthly data. Duplicates were found as before and merged (the monthly synthesis is sketched below).
- Site reduction: Though the majority of station repetitions were identified by the presence of duplicated data, in a significant number of cases pre-existing data manipulations inhibited our tests for data duplication. We designed several tests based on location, name, and ID codes to identify matching sites with somewhat dissimilar data. These were then consolidated as single stations having multiple data series.
- Best value series: “Best value” time series were formed by averaging across multiple records when they existed at the same site. In addition, flagged values were dropped and previously manipulated GHCN-M and Hadley Centre data was ignored in favor of other data sources when possible. These series are expected to be the primary records for most future studies, but the fully-flagged and multi-valued records will also be preserved and made available for more detailed analyses.
- Seasonality removed series: Finally, non-seasonal series were created by determining the mean seasonal cycle at each location and subtracting this from the best value data (both of these final steps are sketched below).
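The sketches that follow illustrate several of the steps above. All function and variable names are ours, the record layouts are simplified, and numeric parameters are placeholders rather than the values actually used in processing. First, the data split: a record is broken into fragments wherever consecutive observations are separated by more than one year.

```python
from datetime import date, timedelta

def split_record(observations, max_gap=timedelta(days=366)):
    """Split a list of (date, value) pairs into fragments wherever consecutive
    observations are separated by more than ~1 year (366 days as a placeholder)."""
    fragments, current, previous = [], [], None
    for day, value in sorted(observations):
        if previous is not None and day - previous > max_gap:
            fragments.append(current)
            current = []
        current.append((day, value))
        previous = day
    if current:
        fragments.append(current)
    return fragments

obs = [(date(1900, 1, 1), 3.2), (date(1900, 2, 1), 4.0), (date(1905, 1, 1), 2.8)]
print(len(split_record(obs)))   # 2 fragments: the 1900 data and the 1905 data
```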
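For the repetition filter, a run-length test of the following kind can flag suspiciously long runs of identical values; in practice the cutoff would be set from the empirical 99.9% non-randomness threshold rather than the fixed placeholder used here.

```python
def flag_repeated_runs(values, max_run=4):
    """Return indices of values belonging to runs of identical values longer
    than max_run (a placeholder for the empirical 99.9% threshold)."""
    flagged, start = set(), 0
    for i in range(1, len(values) + 1):
        if i == len(values) or values[i] != values[start]:
            if i - start > max_run:
                flagged.update(range(start, i))
            start = i
    return flagged

daily = [12.0, 12.0, 12.0, 12.0, 12.0, 12.0, 13.5, 14.1]
print(sorted(flag_repeated_runs(daily)))   # [0, 1, 2, 3, 4, 5]
```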
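The local outlier filter can be sketched as a two-sided empirical threshold computed from each record's own distribution; the use of simple percentiles here is our shorthand for that threshold.

```python
import numpy as np

def flag_local_outliers(values, coverage=99.9):
    """Flag values outside the central `coverage` percent of the record's own
    empirical distribution (two-sided)."""
    tail = (100.0 - coverage) / 2.0
    lo, hi = np.percentile(values, [tail, 100.0 - tail])
    return {i for i, v in enumerate(values) if v < lo or v > hi}

record = list(np.random.default_rng(0).normal(10.0, 5.0, 5000)) + [92.0]
print(5000 in flag_local_outliers(record))   # True: 92.0 lies outside the 99.9% band
```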
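The temperature consistency requirements translate almost directly into a check of this form:

```python
def consistent(tmin, tmax, tavg=None, tobs=None):
    """Tmin must be strictly below Tmax; any reported average or instantaneous
    temperature must lie within [Tmin, Tmax] (inclusive)."""
    if not tmin < tmax:
        return False
    return all(tmin <= t <= tmax for t in (tavg, tobs) if t is not None)

print(consistent(2.0, 11.5, tavg=6.8))    # True
print(consistent(8.0, 8.0))               # False: Tmin not strictly below Tmax
print(consistent(2.0, 11.5, tobs=14.0))   # False: Tobs outside [Tmin, Tmax]
```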
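For the initial merge, once two records have been matched by location and station ID, the 90% identical-content test and the carrying forward of flagged disagreements might look like this; exact equality of values on overlapping dates is our assumption about how "identical" is judged.

```python
def mostly_identical(rec_a, rec_b, threshold=0.90):
    """rec_a, rec_b map date -> value. True when at least `threshold` of the
    overlapping dates carry identical values."""
    overlap = rec_a.keys() & rec_b.keys()
    if not overlap:
        return False
    same = sum(1 for d in overlap if rec_a[d] == rec_b[d])
    return same / len(overlap) >= threshold

def merge(rec_a, rec_b):
    """Merge two near-identical records; dates where they disagree are kept
    (from rec_a) but flagged for later inspection."""
    merged, disagreements = dict(rec_a), set()
    for d, v in rec_b.items():
        if d in merged and merged[d] != v:
            disagreements.add(d)
        merged.setdefault(d, v)
    return merged, disagreements
```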
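The unit-miscoding test within the regional filter compares a record to the regional seasonal norm (already adjusted for latitude and altitude, an adjustment omitted here) before and after a Fahrenheit-to-Celsius conversion. The improvement criterion below is a placeholder, and the sketch assumes Fahrenheit values stored in a Celsius field; the opposite case would use the inverse conversion.

```python
import numpy as np

def f_to_c(values):
    return (np.asarray(values) - 32.0) * 5.0 / 9.0

def miscoded_units(record, regional_norm, improvement=4.0):
    """True if the record matches the regional norm far better after a
    Fahrenheit-to-Celsius conversion than before (placeholder criterion)."""
    record, regional_norm = np.asarray(record), np.asarray(regional_norm)
    rms_raw = np.sqrt(np.mean((record - regional_norm) ** 2))
    rms_converted = np.sqrt(np.mean((f_to_c(record) - regional_norm) ** 2))
    return rms_converted * improvement < rms_raw

norm = np.array([2.0, 4.0, 9.0, 14.0])        # regional norm, degrees C
suspect = np.array([35.0, 39.0, 48.0, 57.0])  # Fahrenheit values stored as Celsius
print(miscoded_units(suspect, norm))           # True
```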
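For the second merge, the monthly synthesis can be built in two versions, one from all daily values and one from non-flagged values only; the record layout and the simple monthly mean are assumptions.

```python
from collections import defaultdict

def monthly_means(daily, flagged=frozenset()):
    """daily maps datetime.date -> value; returns {(year, month): mean value},
    optionally skipping flagged dates."""
    sums, counts = defaultdict(float), defaultdict(int)
    for day, value in daily.items():
        if day in flagged:
            continue
        key = (day.year, day.month)
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Two versions for comparison against monthly-only archives (names hypothetical):
# all_values = monthly_means(daily_record)
# clean_only = monthly_means(daily_record, flagged=flagged_dates)
```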
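Finally, the best value and seasonality removed series reduce to simple averaging under the assumptions used here: multiple series at a site are averaged date by date, and the seasonal cycle is taken as the mean of each calendar month.

```python
from collections import defaultdict

def best_value(series_list):
    """Average several date -> value series at one site into a single series."""
    sums, counts = defaultdict(float), defaultdict(int)
    for series in series_list:
        for day, value in series.items():
            sums[day] += value
            counts[day] += 1
    return {day: sums[day] / counts[day] for day in sums}

def remove_seasonality(series):
    """Subtract the mean seasonal cycle, taken here as the mean by calendar month."""
    by_month = defaultdict(list)
    for day, value in series.items():
        by_month[day.month].append(value)
    cycle = {m: sum(v) / len(v) for m, v in by_month.items()}
    return {day: value - cycle[day.month] for day, value in series.items()}
```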
All of our original data files, along with several data packages prepared through the data filtration and merging process, are available on our data page.