Case Study: Processing Historical Weather Pattern Data

By Chris Moffitt

Mon 12 October 2020

Posted by Chris Moffitt in articles   

article header image

The main purpose of this blog is to show people how to use Python to solve real world problems. Over the years, I have been fortunate enough to hear from readers about how they have used tips and tricks from this site to solve their own problems. In this post, I am extremely delighted to present a real world case study. I hope it will give you some ideas about how you can apply these concepts to your own problems.

This example comes from Michael Biermann from Germany. He had the challenging task of trying to gather detailed historical weather data in order to do analysis on the relationship between air temperature and power consumption. This article will show how he used a pipeline of Python programs to automate the process of collecting, cleaning and processing gigabytes of weather data in order to perform his analysis.

I will turn it over to Michael to give the background for this problem.

Hi, I’m Michael, CEO of a company providing services to energy providers, especially focusing on electrical power and gas. I wanted to do an ex-post analysis to get deeper insights into the deviation of the power consumption of electrical heating systems in comparison to the air temperature. Since we provide power to other companies, we need to have a good grasp on the power consumption, which correlates to the air temperature. In short, I needed to know how well I can predict the long term temperatures and how much deviation is to be expected.

To be able to do this analysis, I needed historical temperatures, which are supplied by the German weather service, DWD. Since it would be really time consuming to download all the files and extract them by hand, I decided to give this a shot with Python. I know a few things about programming, but I am pretty far from a professional programmer. The process was a lot of trial and error, but this project turned out to be exactly the right fit for this approach. I use a lot of hardcore Excel analysis, fetching and munching data with Power Query M, but this was clearly over the limit to what can be done in Excel.

I am really happy with the results. There is hardly anything as satisfying as letting the computer do the hard work for the next 20 min, while grabbing a cup of coffee.

I am also really happy to have learned a few more things about web scraping, because I can use it in future projects to automate data fetching.

Here is a visual to help understand the process Michael created:

Data Processing Pipeline

If you are interested in following along, all of the code examples are available here.

The first notebook in the pipeline is 1-dwd_konverter_download . This notebook pulls historical temperature data from the German Weather Service (DWD) server and formats it for future use in other projects.

The data is delivered in hourly frequencies in a .zip file for each of the available weather stations. To use the data, we need everything in a single .csv file with all stations side-by-side. Also, we need the daily average.

To reduce computing time, we also crop all data earlier than 2007.

For the purposes of this article, I have limited the download to only 10 files but the full data set is over 600 files.

import requests
import re
from bs4 import BeautifulSoup
from pathlib import Path # Set base values
download_folder = Path.cwd() / 'download'
base_url = 'https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly/air_temperature/historical/' # Initiate Session and get the Index-Page
with requests.Session() as s: resp = s.get(base_url) # Parse the Index-Page for all relevant <a href>
soup = BeautifulSoup(resp.content)
links = soup.findAll("a", href=re.compile("stundenwerte_TU_.*_hist.zip")) # For testing, only download 10 files
file_max = 10
dl_count = 0 #Download the .zip files to the download_folder
for link in links: zip_response = requests.get(base_url + link['href'], stream=True) # Limit the downloads while testing dl_count += 1 if dl_count > file_max: break with open(Path(download_folder) / link['href'], 'wb') as file: for chunk in zip_response.iter_content(chunk_size=128): file.write(chunk) print('Done')

This portion of code will parse the download page and find all of the zip files with the name studenwerte_TU and save them in a download directory.

Now that we have isolated the data we need, we must format it for further analysis.

There are three steps in this notebook 3-dwd_konverter_build_df :

The files are imported into a single DataFrame, stripped of unnecessary columns and filtered by date. Then we set a DateTimeIndex and concatenate them into the main_df . Because the loop takes a long time, we output some status messages, to ensure the process is still running.

Then we display some info of the main_df so we can ensure that there are no errors, mainly to ensure all data-types are recognized correctly. Also, we drop duplicate entries, in case some of the .csv files were accidentally duplicated during the development process.

For the final step, we unstack the main_df and save it to a .csv and a .pkl file for the next step in the analysis process. Also, we display some output to get a grasp of what is going on.

Now let’s look at the code:

import numpy as np
import pandas as pd
from IPython.display import clear_output from pathlib import Path
import glob import_files = glob.glob('import/*')
out_file = Path.cwd() / "export_uncleaned" / "to_clean" obsolete_columns = [ 'QN_9', 'RF_TU', 'eor'
] main_df = pd.DataFrame()
i = 1 for file in import_files: # Read in the next file df = pd.read_csv(file, delimiter=";") # Prepare the df before merging (Drop obsolete, convert to datetime, filter to date, set index) df.drop(columns=obsolete_columns, inplace=True) df["MESS_DATUM"] = pd.to_datetime(df["MESS_DATUM"], format="%Y%m%d%H") df = df[df['MESS_DATUM']>= "2007-01-01"] df.set_index(['MESS_DATUM', 'STATIONS_ID'], inplace=True) # Merge to the main_df main_df = pd.concat([main_df, df]) # Display some status messages clear_output(wait=True) display('Finished file: {}'.format(file), 'This is file {}'.format(i)) display('Shape of the main_df is: {}'.format(main_df.shape)) i+=1 # Check if all types are correct
display(main_df['TT_TU'].apply(lambda x: type(x).__name__).value_counts()) # Make sure that to files or observations a duplicates, eg. scan the index for duplicate entries.
# The ~ is a bitwise operation, meaning it flips all bits.
main_df = main_df[~main_df.index.duplicated(keep='last')] # Unstack the main_df
main_df = main_df.unstack('STATIONS_ID')
display('Shape of the main_df is: {}'.format(main_df.shape)) # Save main_df to a .csv file and a pickle to continue working in the next step
main_df.to_pickle(Path(out_file).with_suffix('.pkl'))
main_df.to_csv(Path(out_file).with_suffix('.csv'), sep=";") display(main_df.head())
display(main_df.describe())

As this program runs, here is some of the progress output:

'Finished file: import/produkt_tu_stunde_20041101_20191231_00078.txt'
'This is file 10'
'Shape of the main_df is: (771356, 1)'
float 771356
Name: TT_TU, dtype: int64
'Shape of the main_df is: (113952, 9)'

Here is what the final DataFrame looks like:

TT_TU
STATIONS_ID 3 44 71 73 78 91 96 102 125
MESS_DATUM
2007-01-01 00:00:00 11.4 NaN NaN NaN 11.0 9.4 NaN 9.7 NaN
2007-01-01 01:00:00 12.0 NaN NaN NaN 11.4 9.6 NaN 10.4 NaN
2007-01-01 02:00:00 12.3 NaN NaN NaN 9.4 10.0 NaN 9.9 NaN
2007-01-01 03:00:00 11.5 NaN NaN NaN 9.3 9.7 NaN 9.5 NaN
2007-01-01 04:00:00 9.6 NaN NaN NaN 8.6 10.2 NaN 8.9 NaN

At the end of this step, we have the file in a condensed format we can use for analysis.

The data contains some errors, which need to be cleaned. You can see, by looking at the output of main_df.describe(), that the minimum temperature on some stations is -999. That means that there is no plausible measurement for this particular hour. We change this to np.nan, so that we can safely calculate the average daily value in the next step.

Once these values are corrected, we need to resample to daily measurements. Pandas resample makes this really simple.

import numpy as np
import pandas as pd
from pathlib import Path # Import and export paths
pkl_file = Path.cwd() / "export_uncleaned" / "to_clean.pkl"
cleaned_file = Path.cwd() / "export_cleaned" / "cleaned.csv" # Read in the pickle file from the last cell
cleaning_df = pd.read_pickle(pkl_file) # Replace all values with "-999", which indicate missing data
cleaning_df.replace(to_replace=-999, value=np.nan, inplace=True) # Resample to daily frequency
cleaning_df = cleaning_df.resample('D').mean().round(decimals=2) # Save as .csv
cleaning_df.to_csv(cleaned_file, sep=";", decimal=",") # Show some results for verification
display(cleaning_df.loc['2011-12-31':'2012-01-04'])
display(cleaning_df.describe())
display(cleaning_df)

Here is the final DataFrame with daily average values for the stations:

TT_TU
STATIONS_ID 3 44 71 73 78 91 96 102 125
MESS_DATUM
2011-12-31 NaN 3.88 2.76 1.19 4.30 2.43 NaN 3.80 NaN
2012-01-01 NaN 10.90 8.14 4.03 10.96 10.27 NaN 9.01 NaN
2012-01-02 NaN 7.41 6.18 4.77 7.57 7.77 NaN 6.48 4.66
2012-01-03 NaN 6.14 3.61 4.46 6.38 5.28 NaN 5.63 3.51
2012-01-04 NaN 5.80 2.48 4.45 5.46 4.57 NaN 5.85 1.94