1. Comparing Wages With Consumer Price Index Data

How many people do you know who have a favorite US government department? I’ve got a favorite, and I’m not even American. It’s the Bureau of Labor Statistics, and I’ve been enjoying the torrents of employment and economics-related data they produce for decades now.

But my admiration for BLS jumped to a new level when I discovered their well-supported application programming interface (API). This opens up all that delicious data to smart retrieval through our Python scripts. And it gives us some rich resources for discovery.

Let me illustrate with a relatively simple example. I’m going to use the API to request US consumer price index (CPI) and wage and salary statistics between 2002 and 2020.

The CPI is a measure of the price of a basket of essential consumer goods. It’s an important proxy for changes in the cost of living which, in turn, is an indicator of the general health of the economy.

Our wages data will come from the BLS Employment Cost Index covering “wages and salaries for private industry workers in all industries and occupations.” A growing employment cost index would, at first glance, suggest that things are getting better for most people.

However, seeing average wage trends in isolation isn’t all that useful. After all, the highest salary won’t do you much good if your basic expenses are higher still. I know what I’m talking about here, since I grew up in the 1970s. Back then, high inflation would often completely cancel out the value of a raise at work. (Although I can’t say I remember that having too much of an impact on my brief career as a before-school newspaper carrier earning $10-15 a week.)

So the goal is to pull both the CPI and wages data sets and then correlate them. This will show us how wages have been changing in relation to costs.

Working With the API

There are two and a half things you’ll need to successfully access the API:

  • BLS endpoint addresses matching the specific data series you need
  • Python code to launch the request
  • (If you end up needing one) A BLS API key to unlock higher request rates

Getting the series endpoint addresses you need may take some digging around on the BLS website. However, the most popular data sets are accessible through a single page. The following image shows you what that looks like, including the endpoint codes - like “LNS11000000” for the Civilian Labor Force set.

You can also search for data sets on this page. Searching for “computer,” for instance, will take you to a list that includes the deeply tempting “Average hourly wage for level 11 computer and mathematical occupations in Austin-Round Rock, TX.” The information you’ll discover by expanding that selection will include its series ID (endpoint) of “WMU00124201020000001500000011”.

Because I know you can barely contain your curiosity, I’ll tell you that it turns out that level 11 computer and mathematical professionals in Austin-Round Rock, Texas could expect to earn $51.76/hour in 2019.

How do you turn series IDs into Python-friendly data? Well, that’s what we’ll learn next.

Getting the GET and POST requests exactly right can be complicated. But because I enjoy a simple life, I decided to go with one of the available third-party Python libraries. The one I use is called, simply, bls and is available through Oliver Sherouse’s GitHub repo. You install the library on your host machine using:

pip install bls

While you’re at the command line, you might as well activate your BLS API key. You register for the API from this page. They’ll send you an email with your key and a validation URL that you’ll need to click.

Once you’ve got your key, you export it to your system environment. On Linux or macOS, that would mean running something like this (where your key is substituted for the invalid one I used):

export BLS_API_KEY=lk88af0f0d5fd1iw290s52a01b8q
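
If you’d rather not fiddle with shell exports - in a Jupyter notebook, say - you can set the variable from inside Python instead. Here’s a minimal sketch, assuming (as the export command above suggests) that the bls library reads its key from the BLS_API_KEY environment variable:

import os

# Set the key for this session only; the bls library is expected to
# pick it up from the BLS_API_KEY environment variable
os.environ['BLS_API_KEY'] = 'lk88af0f0d5fd1iw290s52a01b8q'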

Working with CPI and salary data

With that out of the way, we’re all set to start digging for CPI and employment gold. Importing these four libraries - including bls - will give us the tools we’re going to need:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import bls

Now I pass the BLS endpoint for the wages and salaries data series to the bls.get_series command from the bls library. I copied the endpoint from the popular data sets page on the BLS website. I’ll assign the data series that comes back to the variable wages and then take a look at a sample from the data set.

wages = bls.get_series('CIU2020000000000A')
wages
0	2002Q1	3.5
1	2002Q2	3.6
2	2002Q3	3.1
3	2002Q4	2.6
4	2003Q1	2.9

Assessing our two data series

For now, just note how the format of the first column of data consists of a year (2002, 2003, etc.) followed by the financial quarter (Q1, Q2, etc.) and the fact that the second column (3.5, 3.6) consists of percentages. We’ll get back to that data set later. But right now we’ll switch over to the CPI series.

We’ll pull the CPI data series into the variable cpi. I’ll export the data to a CSV file named cpi_data.csv so it’ll be available to me even if I accidentally overwrite the variable. I also happen to find it a bit easier to work with CSV files.

At any rate, I’ll then read the CSV into a dataframe called cpi_data and do a bit of formatting by adding two column headers (Date and CPI).

cpi = bls.get_series('CUUR0000SA0')
cpi.to_csv('cpi_data.csv')
cpi_data = pd.read_csv('cpi_data.csv')
cpi_data.columns = 'Date','CPI'

Here’s what my data will look like:

cpi_data
	Date	CPI
0	2002-01	177.100
1	2002-02	177.800
2	2002-03	178.800
3	2002-04	179.800
4	2002-05	179.800
...	...	...
222	2020-07	259.101
223	2020-08	259.918
224	2020-09	260.280
225	2020-10	260.388
226	2020-11	260.229
227 rows × 2 columns

If we want to correlate this CPI data with the wages series, the logical approach is to compare changes in both data sets as they occurred through time. But before that can happen, we’ll need to overcome two serious problems:

  • The CPI data is monthly while the wages data is quarterly (i.e., four increments per year). And the Date column in the CPI set is formatted differently (2002-01) than that of the wages set (2002Q1). Python won’t know how to compare those columns as they currently exist.
  • The CPI values are measured in points (177.10, etc.) while the wage changes are presented as percentages (3.5, etc). Those, too, won’t be comparable in any useful way unless we clean things up a bit.

Manipulating the CPI data series

Python is built to be helpful in fixing those problems. Sure, I could always open up the data in CSV files and manually edit each row. But that would be boring and time consuming for the hundreds of rows we’re working with. And it would be impossible for the millions of rows of data used by many other analytics projects.

So we’re going to automate our data cleanup.

We’ll focus first on the CPI series. I’m going to use the pandas str.replace method to search for any occurrence of -03 (i.e., “March”) in the Date column, and replace it with the string Q1. This will match the Q1 records in my wages data set. I’ll then do the same for the June, September, and December rows.

cpi_data['Date'] = cpi_data['Date'].str.replace('-03', 'Q1')
cpi_data['Date'] = cpi_data['Date'].str.replace('-06', 'Q2')
cpi_data['Date'] = cpi_data['Date'].str.replace('-09', 'Q3')
cpi_data['Date'] = cpi_data['Date'].str.replace('-12', 'Q4')

I’m sure there’s syntax that would allow me to do all that in one line, but the couple of seconds it took me to copy and paste those lines hardly compared to the time it would have taken me to figure out the “more efficient” alternative. If automation threatens to take me longer than the manual way, then I vote for manual every time.
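
For the curious, here’s what one such one-liner might look like. This is just a sketch using pandas’ dictionary-based replace with regex=True; it does exactly the same job as the four lines above:

# Map each quarter-ending month suffix to its quarter label in one pass
quarter_map = {'-03': 'Q1', '-06': 'Q2', '-09': 'Q3', '-12': 'Q4'}
cpi_data['Date'] = cpi_data['Date'].replace(quarter_map, regex=True)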

We can print out some data to show us what we’ve got so far:

cpi_data['Date']

0      2002-01
1      2002-02
2       2002Q1
3      2002-04
4      2002-05
        ...   
222    2020-07
223    2020-08
224     2020Q3
225    2020-10
226    2020-11
Name: Date, Length: 227, dtype: object

The quarterly records are exactly the way we want them, but the rest are obviously still a problem. We should look for a characteristic that’s unique to all the records we don’t want to keep and use that as a filter. Our best (and perhaps only) choice is the dash ("-"). Comparing the output of the str.contains method to False gives us a True/False mask which, when run against the Date column as it is here, drops every row whose date contains a dash.

newcpi_data = cpi_data[(cpi_data.Date.str.contains("-") == False)]
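
Incidentally, the same filter is often written with pandas’ ~ (logical not) operator, which saves the explicit comparison to False. Either version produces the same result:

# Keep only the rows whose Date does NOT contain a dash
newcpi_data = cpi_data[~cpi_data.Date.str.contains("-")]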

You’ll want to confirm everything went as planned by printing a sample of your data. Everything here looks great. Note how there are now only 75 rows of data, as opposed to the 227 we began with.

newcpi_data

	Date	CPI
2	2002Q1	178.800
5	2002Q2	179.900
8	2002Q3	181.000
11	2002Q4	180.900
14	2003Q1	184.200
...	...	...
212	2019Q3	256.759
215	2019Q4	256.974
218	2020Q1	258.115
221	2020Q2	257.797
224	2020Q3	260.280
75 rows × 2 columns

If you like, you can save the data series in its final state to a CSV file:

newcpi_data.to_csv('cpi-clean.csv')
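
One small caveat: by default, to_csv also writes out the dataframe’s index (the 2, 5, 8, and so on row labels you saw above). If you’d rather not keep those, the index parameter drops them:

# Write the CSV without the leftover row labels
newcpi_data.to_csv('cpi-clean.csv', index=False)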

Converting percentages to “CPI” values

Well, that’s our formatting problem out of the way: we’ve converted the CPI data to quarterly increments and cleaned up the date format to match the wages data. Now we’ll need to address the inconsistency between the absolute point values in the CPI set and the percentages that came with the wages data.

First, let’s go get that data. Technically, if you’ve been following along on your own, you’ll already have pulled the series back near the beginning, but it can’t hurt to run it again. Once again, I’ll save the data to a CSV file (which, again, isn’t necessary), push it to a dataframe I’ll call df, and give it column headers. Naming the date column Date to match the CPI set will make things easier.

wages = bls.get_series('CIU2020000000000A')
wages.to_csv('bls_wages_data.csv')
df = pd.read_csv('bls_wages_data.csv')
df.columns = 'Date','Wages'

The sample data we print looks just fine:

df.head()
	Date	Wages
0	2002Q1	3.5
1	2002Q2	3.6
2	2002Q3	3.1
3	2002Q4	2.6
4	2003Q1	2.9

I’ll also point a second name, newdf, at that dataframe; that’s the one I’ll run the conversion on:

newdf = df

Now, I should warn you that there’s going to be some Fake Math(tm) here.

What’s going on? Each of the percentages in the wages data series represents an annual rate. So when we’re told that the rate for the first quarter of 2002 was 3.5%, that means that if wages continued to rise at the current (first quarter) pace for a full 12 months, the annual growth would have been 3.5% - not 3.5% each quarter, which would compound to roughly 14% over the year.

Which means the numbers we’re going to work with will have to be adjusted. That’s because the actual growth during, say, the three months of 2002 Q1 wasn’t 3.5%, but only one quarter of that (or 0.875%). If I don’t make this adjustment, but continue to map quarterly growth numbers to quarterly CPI prices, then our calculated output will lead us to think that wages are growing so fast that they’ve become detached from reality.

And here’s where part of the fake math is going to rear its ugly head. I’m going to divide each quarterly growth rate by four. Or, in other words, I’ll pretend that the real changes to wages during those three months were exactly one quarter of the reported year-over-year rate. But I’m sure that’s almost certainly not true and is a gross simplification. However, for the big historical picture I’m trying to draw here, it’s probably close enough.

Now that will still leave us with a number that’s a percentage. But the corresponding CPI number we’re comparing it to is, again, a point figure. To “solve” this problem I’ll apply one more piece of fakery.

To convert those percentages to match the CPI values, I’m going to create a function. I’ll feed the function the starting (2002 Q1) CPI value of 177.10. That’ll be my baseline. I’ll give that variable the name newnum.

For each iteration the function makes through the rows of my wages data series, I’ll divide the current wage value (x) by 400. Dividing by 100 simply converts the percentage (3.5, etc.) to a decimal (0.035), and dividing by four reduces the annual rate (12 months) to a quarterly rate (3 months).

To convert that to a usable number, I’ll multiply it by the current value of newnum and then add newnum to the product. That should give us an approximation of the original CPI value adjusted by the related wage-growth percentage.

But, of course, this won’t be a number that has any direct equivalent in the real world. Instead, it is, as I said, an arbitrary approximation of what that number might have been. But, again, I think it’ll be close enough for our purposes.
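
As a quick sanity check, here’s the arithmetic for the very first row, starting from the 2002 Q1 baseline CPI of 177.10 and that quarter’s reported 3.5% rate:

newnum = 177.10
x = 3.5
newnum = (x / 400) * newnum + newnum    # 177.10 * 0.00875 + 177.10
print(newnum)                           # 178.649625

Keep that 178.649625 figure in mind; you’ll see it again in the first row of the output below.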

Take a couple of minutes and read through the function. global newnum declares the variable as global. This makes it possible for me to replace the original value of newnum with the function’s output so the percentage in the next row will be adjusted by the updated value. Note also how any strings (str) will be ignored. And, finally, note how the updated data series will populate the newwages variable.

newnum = 177.1                      # baseline: the 2002 Q1 CPI value
def process_wages(x):
    global newnum                   # carry the running value between rows
    if type(x) is str:
        return x                    # leave the Date strings untouched
    elif x:
        # quarterly rate = annual % / 4 / 100, compounded onto newnum
        newnum = (x / 400) * newnum + newnum
        return newnum
    else:
        return
newwages = newdf.applymap(process_wages)    # run the function on every cell

Let’s check that new data:

newwages

	Date	Wages
0	2002Q1	178.649625
1	2002Q2	180.257472
2	2002Q3	181.654467
3	2002Q4	182.835221
4	2003Q1	184.160776
...	...	...
70	2019Q3	273.092663
71	2019Q4	275.140858
72	2020Q1	277.410770
73	2020Q2	279.421999
74	2020Q3	281.308097
75 rows × 2 columns

Looks great.
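
Incidentally, if the global variable makes you uneasy, the same running calculation can be expressed in one pass using pandas’ cumulative product. This is just a sketch of an equivalent alternative, not a change to the approach above:

# Each quarter multiplies the previous value by (1 + rate/400), so the
# cumulative product of those factors, times the baseline, gives the series
newwages_alt = newdf.copy()
newwages_alt['Wages'] = 177.1 * (1 + newwages_alt['Wages'] / 400).cumprod()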

Merge and plot our dataframes

What’s left? We need to merge our two data series. But since we’ve already done all the cleaning up and manipulation, this will go smoothly. I’ll create a new dataframe called merged_data and feed it with the output of this pd.merge function. I simply supply the names of my two dataframes, and specify that the Date column should be used as the merge key.

merged_data = pd.merge(newcpi_data, newwages, on='Date')

Fast and friendly. No trouble at all. Go ahead and take a look:

merged_data
	Date	CPI		Wages
0	2002Q1	178.800	178.649625
1	2002Q2	179.900	180.257472
2	2002Q3	181.000	181.654467
3	2002Q4	180.900	182.835221
4	2003Q1	184.200	184.160776
...	...	...	...
70	2019Q3	256.759	273.092663
71	2019Q4	256.974	275.140858
72	2020Q1	258.115	277.410770
73	2020Q2	257.797	279.421999
74	2020Q3	260.280	281.308097
75 rows × 3 columns
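
All 75 quarters made it through the merge. If a merge like this ever comes back with fewer rows than you expected, the key columns probably don’t quite match; a quick check along these lines (using the dataframe names from above) will confirm that both frames cover the same quarters:

print(len(newcpi_data), len(newwages))              # both should be 75
print(set(newcpi_data.Date) == set(newwages.Date))  # True if they line up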

Our data is all there. We could visually scan through the CPI and Wages columns and look for any unusual relationships, but we didn’t come this far to just look at numbers. Let’s plot the thing.

Here we’ll tell plot to take our merged dataframe (merged_data) and create a bar chart. Because there’s an awful lot of data here, I’ll extend the size of the chart with a manual figsize value. I’ll set the x-axis labels to use the values in the Date column and, again because there are so many of them, I’ll rotate the labels by 45 degrees to make them more readable. Finally, I’ll set the labels for the y-axis and x-axis.

ax = merged_data.plot(kind='bar', figsize=(20, 7))
ax.set_xticklabels(merged_data.Date, rotation=45)
ax.set_ylabel('CPI vs. Wages and salaries - 12-month change')
ax.set_xlabel('Dates')
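
By the way, if you’d like to keep a copy of the chart as an image file, matplotlib’s savefig will do it (this is one place the matplotlib.pyplot import earns its keep):

plt.tight_layout()                     # keep the rotated labels from clipping
plt.savefig('cpi_vs_wages_bar.png')    # hypothetical filename - use whatever you like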

And this is what I got:

Because of the crowding, it’s not especially easy to read. But you can see that the orange “Wages” bars are, for the most part, higher than the blue “CPI” bars. We’ll have a stab at analyzing some of this in the next chapter.

Is there an easier way to display all this data? You bet there is. I can change the value of kind from bar to line and things will instantly improve. Here’s how the new code will work as a line plot and with a grid:

ax = merged_data.plot(kind='line', figsize=(20, 7))
ax.set_ylabel('CPI vs. Wages and salaries - 12-month change')
ax.set_xlabel('Dates')
ax.grid()

And here’s our new chart:

In the next chapter, we’ll try to better understand the data we’re using and then add historical S&P 500 index data as another metric.