2. Wages and CPI: Reality Check


We’re supposed to be doing data analytics here, so just staring at pretty graphs probably isn’t the whole point.

The CPI and wages data sets we plotted in the previous chapter, for instance, showed us a clear general correlation, but there were some visually recognizable anomalies. Unless we can connect those anomalies with historical events - and explain them in a historical context - we won’t be getting the full value from our data.

Confirm the integrity of your data

But even before going there, we should confirm that our plots actually make sense in the context of their data sources. Working with our BLS examples, let’s compare graphs of the CPI and wages data from before and after our manipulation. That way we can be sure that our math (and particularly our fake math) didn’t skew things too badly.

Here’s what our CPI data looked like when plotted using the raw data:

It’s certainly a busy graph, but you can clearly see the gentle upward slope, punctuated by a handful of sudden jumps. Next, we’ll see that same data after removing three out of every four months' data points. The same ups and downs are still visible. Given our overall goals, I’d categorize our transformation as a success.

Now, how about the wages data? Here, because we moved from percentages to currency, the transformation was more intrusive and the risks of misrepresentation were greater. We’ll also need to take into account the way a percentage will display differently from an absolute value. Here’s the original data:

Note how there’s no consistent curve - either upwards or downwards. That’s because we’re measuring the rate of growth as it took place within each individual quarter, not the growth itself. Now compare that with this line graph of that wage data, now converted to currency-based values:

The gentle curve you see does make sense - it’s about real growth, after all, not growth rates. It’s also possible to recognize a few spots where the curve steepens, and others where it smooths out. But why are the slopes so smooth in comparison with the percentage-based data? Look at the Y-axis labels: the index graph is measured in points between 180 and 280, while the percentage graph runs from 0 to 3.5. It’s the scale that’s different.

All in all, I believe we’re safe concluding that what we’ve produced is a good match with our source data.

Establish a historical context for your data

Now let’s look at those anomalies. Here, to refresh your memory, is our final line graph:

The most obvious point of interest is the accelerated wage growth starting some time in 2015 and noticeably picking up speed in 2017. This latter shift would seem to validate the claims that the Trump economy markedly improved the lives of workers. That would appear to be especially true considering how much faster wages rose than consumer costs, meaning that the pay raises made a real difference.

Of course, as can never be repeated enough, correlation is not the same as causation. It’s possible that those changes had little or nothing to do with government actions. But, by the same token, we can’t rule it out either. There’s certainly an argument to be made that easing regulatory overhead and reducing the corporate tax rate are healthy policy changes that can stimulate business activity and, by extension, hiring.

Ok. Back to the graph. What was behind that huge spike in CPI back in 2008? Well, the famous Great Recession is always associated with 2008, isn’t it?

The problem is that the subprime mortgage crisis that’s blamed for the disaster only kicked in during the summer of that year, while our data shows significant increases in the first two quarters. And anyway, why would an economic downturn cause price increases? You’d think a recession would spark a drop in consumer demand which, in turn, should lead to a drop in prices.

I’m going to take full advantage of my ignorance of the actual facts and venture a wild guess. If I’m wrong, I’m sure someone reading this will correct me. The subprime economy was fed by artificially low lending rates and too much access to easy capital. Those conditions drew millions of Americans into the housing markets, fueled by flaky mortgages.

Now, a hot housing market should always lead to higher house prices (a simple product of the law of supply and demand). And housing prices make up around 33% of the CPI basket. Perhaps home prices were still rising into the final months before the market was hit with the full force of the meltdown.

I’m not sure this nails the whole backstory, but I do see some supporting evidence. Pulling the BLS “Housing in U.S. city average” cost index will allow us at least a glimpse behind the curtain.

import bls  # third-party wrapper for the BLS data API

# Pull the "Housing in U.S. city average" index and save a local copy
housing = bls.get_series('SUUR0000SAH')
housing.to_csv('housing_index.csv')

If you look through the numbers for the years 2004-2007, you’ll see the index only rose around 2.5 points between January and July of each year. But between January 2008 and July 2008, it jumped more than four points before edging back down over the final months of the year. Perhaps that accounts for the spike.

Date		Index
2004-01	112.5
2004-07	115.0

2005-01	115.6
2005-07	118.1

2006-01	119.6
2006-07	122.0

2007-01	122.767
2007-07	125.416

2008-01	125.966
2008-07	130.131
2008-08	129.985
2008-09	129.584
2008-10	129.189
2008-11	128.667
2008-12	128.495
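
If you’d rather let Python do that arithmetic, here’s a minimal sketch that computes the January-to-July changes using the readings from the table above:

# January and July housing index readings, copied from the table above
readings = {
    2004: (112.5, 115.0),
    2005: (115.6, 118.1),
    2006: (119.6, 122.0),
    2007: (122.767, 125.416),
    2008: (125.966, 130.131),
}

for year, (jan, jul) in readings.items():
    print(f'{year}: January-to-July change of {jul - jan:.2f} points')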

The bottom line: if you were working in the US through those twenty years and you didn’t enjoy regular pay raises, it’s either your employer’s fault or, if you were self-employed, yours. But don’t blame the economy.

Incorporate S&P 500 quotes

Let’s add one more data source to our mix. We’ve already seen how the CPI and employment wages trended over the past twenty years. Perhaps it would also be helpful to compare those numbers to a major US stock market index. Specifically, I’m going to import historical quotes from the Standard and Poor’s (S&P) 500 index.

The S&P 500 is a measure of the performance of 500 of the largest companies currently trading on major US stock exchanges. The index doesn’t directly represent any real-world equities and has no direct value on its own. Instead, it’s designed as a useful indicator of the health of the market as a whole.

If you’re looking for a single number that neatly and reliably summarizes the state of the US equity investment environment over time, the S&P 500 is probably your best bet.

There are all kinds of ways to get the data we’re after. I downloaded it from the Wall Street Journal website as a CSV file.

Warning: Gotchas ahead!

Be careful, though, as not all CSV files are created equal. Perhaps because of the way I processed the file I got, I ran into a couple of frustrating problems within Python.

For one thing, some of the column header names somehow became “corrupted” with an extra space. That made it impossible for Pandas to successfully parse my data, returning a confusing error message:

KeyError: "['Low', 'High', 'Open', 'Close'] not in index"

“Not in index”? But of course they’re all in the index. What foolishness!

Well, in fact, “Low ” (with a trailing space) was in the index, but “Low” (without one) was nowhere to be found. I had to open the CSV file in a plain text editor to remove those stray spaces.
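
As an aside, you can also strip that stray whitespace from inside Python itself. Here’s a minimal pandas-side sketch, assuming the same trailing-space problem:

import pandas as pd

sp = pd.read_csv('new_s_p_500.csv')
# Remove stray leading/trailing whitespace from every column name
sp.columns = sp.columns.str.strip()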

My second problem was the result of the Date column in the CSV being read as plain strings, which prevented Pandas from properly sequencing the rows by date. That gave the resulting graphs the disturbing (and misleading) sense that we’d been inexorably heading into a major depression for the past two decades - with the S&P crashing from 3700 down to around 800!

Reformatting and then sorting the CSV by Date in my LibreOffice Calc software solved that problem.
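
Here, too, there’s a pandas-only alternative to the spreadsheet fix - a sketch that assumes the file’s original MM/DD/YY date format:

# Parse the Date strings as real dates, then sort the rows chronologically
sp['Date'] = pd.to_datetime(sp['Date'], format='%m/%d/%y')
sp = sp.sort_values('Date').reset_index(drop=True)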

You may not face those exact problems, but you can be sure that you’ll eventually confront data sources suffering from some kind of corruption. So keep your eyes open.

At any rate, here’s how I read the CSV file into a dataframe:

sp = pd.read_csv('new_s_p_500.csv')

And here’s what the raw data looked like:

sp

	Date	Open	High	Low	Close
0	01/02/02	1154.67	1154.67	1136.23	1154.67
1	01/03/02	1165.27	1165.27	1154.01	1165.27
2	01/04/02	1172.51	1176.55	1163.42	1172.51
3	01/07/02	1164.89	1176.97	1163.55	1164.89
4	01/08/02	1160.71	1167.60	1157.46	1160.71
...	...	...	...	...	...
4779	12/24/20	3694.03	3703.82	3689.32	3703.06
4780	12/28/20	3723.03	3740.51	3723.03	3735.36
4781	12/29/20	3750.01	3756.12	3723.31	3727.04
4782	12/30/20	3736.19	3744.63	3730.21	3732.04
4783	12/31/20	3733.27	3760.20	3726.88	3756.07
4784 rows × 5 columns

Note the “4784 rows” line. This dataframe is far larger than either of the other two I’ve worked with because it contains stock quotes for every trading day of the past 20 years. That’s obviously far more than we’ll need, and it’ll also make aligning the new data with my CPI and wages data series harder.

So I’m going to reformat the data in the Date column to look more like the raw data we got from our two BLS resources, and then remove all rows except a single quote per quarter. We’ll obviously lose an awful lot of the precision in this data set, but precision isn’t what we’re after here anyway.

I’ll start by making sure the column uses the datetime64 format. Then I’ll use strftime to reorder the dates so the year comes first, followed by month and day, with the dash character (-) as the delimiter.

# Parse the Date strings into pandas datetime values
sp['Date'] = sp['Date'].astype('datetime64[ns]')
# Render them back as strings in year-month-day order
sp['Date'] = sp['Date'].dt.strftime('%Y-%m-%d')

Now I’ll use str.replace in much the same way I did for the wages data to reformat all the quarterly quotes as the “year” plus, say, “Q1”.

# Swap quarter-end date suffixes for quarter labels (e.g., 2002-12-31 -> 2002Q4)
sp['Date'] = sp['Date'].str.replace('-03-30', 'Q1')
sp['Date'] = sp['Date'].str.replace('-06-30', 'Q2')
sp['Date'] = sp['Date'].str.replace('-09-30', 'Q3')
sp['Date'] = sp['Date'].str.replace('-12-31', 'Q4')

That shouldn’t be new to you at this point. Nor should the next step: filtering the entire data series and keeping only those rows that contain a “Q”.

# Keep only the rows whose Date now carries a quarter label
sp = sp[sp.Date.str.contains('Q')]
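
As an aside, pandas can also derive quarter labels on its own, without matching literal quarter-end dates. Here’s a hypothetical alternative sketch, starting again from the raw CSV:

# Label each row with its calendar quarter, then keep the last
# available row in each quarter
alt = pd.read_csv('new_s_p_500.csv')
alt['Date'] = pd.to_datetime(alt['Date'], format='%m/%d/%y')
alt['Quarter'] = alt['Date'].dt.to_period('Q').astype(str)
alt_q = alt.sort_values('Date').groupby('Quarter', as_index=False).last()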

I’ll wrap up our CSV manipulation by selecting only the Date and Close columns to populate the sp1 variable. Why Close and not Open or High, etc? No reason at all. As long as we’re consistent, I don’t believe it’ll make all that much difference either way.

sp1 = sp[['Date', 'Close']]
sp1.head()

	Date	Close
187	2002Q3	815.28
251	2002Q4	879.82
375	2003Q2	974.50
439	2003Q3	995.97
503	2003Q4	1111.92

Looking through the data will show you that some quarters are actually missing. I’ll let you figure out why that is (hint: you’ll need some very basic general domain knowledge to figure it out).

Now that we’ve got our S&P quotes all nice and comfy in a dataframe, what should we do with them? My first impulse was to throw all three of our data sets into a single plot so we can compare them. That was a terrible idea, but let’s see it through anyway so we can learn why.

The way we merged our first two data sets into a single dataframe earlier won’t work here. But getting it done the right way isn’t at all difficult. I’ll have to import a new tool called reduce to help. First, though, I’ll create a list containing our three existing dataframes.

from functools import reduce
dfs = [newcpi, newwages, sp1]

Now I can use reduce to merge the three sets within dfs with the common Date column acting as a single index.

# reduce() applies pd.merge pairwise: first newcpi with newwages,
# then that result with sp1, joining on the shared Date column
df_triple = reduce(lambda left, right:
                   pd.merge(left, right, on='Date'), dfs)

And with that, we’re all set to plot our triple-set monster:

ax = df_triple.plot(kind='line', figsize=(20, 7))
ax.set_ylabel('S&P, CPI, and Wages')
ax.set_xlabel('Dates')
ax.grid()

Ouch. The first - and smaller - problem is that the x-axis labels are out of sync with the actual data. That’s because line plots are normally used to represent sequential, numeric data that won’t require step-by-step labelling. Our 2002Q4 labels just weren’t what Python was expecting.

But the bigger issue is that, as a way to visually compare our data sets, this is pretty much unusable. That’s because the S&P data is on a hugely different scale (ranging between roughly 800 and 3,750), making differences between the CPI and wages sets nearly invisible.

I suppose we could play with the scale of the S&P data, perhaps dividing all the numbers by, say, 10. But why bother? We can simply generate two graphs, one beneath the other. Even better: let’s stick to bars for the S&P graph. Here’s the code for an S&P visualization, followed by the graph itself:

ax = sp1.plot(kind='bar', figsize=(20, 7))
ax.set_xticklabels(sp1.Date, rotation=45)
ax.set_ylabel('S&P Index quotes by quarter')
ax.set_xlabel('Dates')

And here’s how the CPI/wages graph will go:

ax = merged_data.plot(kind='line', figsize=(20, 7))
ax.set_xticklabels(merged_data.Date, rotation=45)
ax.set_ylabel('CPI vs. Wages and salaries - 12-month change')
ax.set_xlabel('Dates')

Calculate rates of increase for the CPI, wages, and the S&P

I’ll let you look at those two graphs and identify the key overlaps - and the events that caused them. But there is one more insight I’d like to pull from all that data: how do the Compound Annual Growth Rates (CAGR) compare between our three data sets?

The CAGR is a useful number that can, at a glance, show us how long-term investments generated (or lost) value incrementally over many years.
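
Expressed as a formula, where first and last are the starting and ending values and periods is the number of years:

CAGR = (last / first)^(1 / periods) - 1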

Of course, technically, compounded growth rates don’t really make sense in the context of the CPI or even wages. After all, neither measures value that’s reinvested into a sustained pool. But it will allow us to make an apples-to-apples comparison to a market investment.

I’m going to take the simple route and, one at a time, manually insert the first and last values of our three data sets. The period (in years) we’re working with is 20.

first = 815.28    # first S&P close in our data set (2002Q3)
last = 3756.07    # last S&P close (2020Q4)
periods = 20

# Compound annual growth rate: (last/first)^(1/periods) - 1
def CAGR(first, last, periods):
    return (last/first)**(1/periods)-1

print('The S&P had a CAGR of {:.2%} '.format(CAGR(first, last, periods)))

The S&P had a CAGR of 7.94% 

first = 178.8     # first CPI value in our data set
last = 260.28     # last CPI value
periods = 20

# Reusing the CAGR function defined above
print('The Consumer Price Index had a CAGR of {:.2%} '
      .format(CAGR(first, last, periods)))

The Consumer Price Index had a CAGR of 1.90% 

first = 178.64    # first wages value in our data set
last = 281.31     # last wages value
periods = 20

print('Wages had a CAGR of {:.2%} '.format(CAGR(first, last, periods)))

Wages had a CAGR of 2.30% 
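
Incidentally, rather than hand-copying the first and last values, you could pull them straight out of the merged dataframe. Here’s a sketch - the column names are assumptions, since yours will depend on how the original dataframes were built:

# Hypothetical column names; substitute whatever df_triple actually contains
for label, col in [('CPI', 'CPI'), ('Wages', 'Wages'), ('S&P', 'Close')]:
    first, last = df_triple[col].iloc[0], df_triple[col].iloc[-1]
    print('{} had a CAGR of {:.2%}'.format(label, CAGR(first, last, 20)))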

So our (artificially inflated) consumer price index growth rate over the past 20 years was 1.9%, and the (similarly artificially inflated) employment wages rate was 2.3%. Meanwhile, over the same period, the S&P 500 market index increased at an average annual rate of 7.94%!

Am I suggesting that owning Exchange Traded Funds (ETFs) built to mirror the S&P 500 assets basket over the long term is a good investment strategy? Well, you should be aware that I’m entirely unqualified as a financial advisor and, in any case, this chapter is focused on data analytics, not investment guidance.

But in answer to the question: Yes.