0. Getting Started with Data Analytics and Python

This curriculum won’t turn you into a data scientist in the formal sense of the word: don’t think 30 days is enough to compete with someone holding a Ph.D. in statistics. But what this curriculum will do is teach you how to use Python and Jupyter Notebooks to find, process, manage, and interpret a wide range of data sources.

Is that enough for you? Do you need to become a data scientist? My guess is that you probably wouldn’t be here if that was what you were really after. But whatever you do now, you can probably do it better with some basic analytics tools. And have much more fun while you’re at it.

So why not quit talking about it and get good and started already?

Programming with Python

This curriculum assumes you’ve got at least a very basic understanding of programming in general and Python programming in particular. If you feel you’re not quite there yet, catching up shouldn’t take you too long.

Feel free to check out these valuable resources:

Installing Python

The good news is that many operating systems - most Linux distributions in particular - come with Python pre-installed; Windows generally doesn’t. Not sure your particular OS got the memo? Open a command prompt/terminal and type: python --version (or python3 --version).

If Python’s installed, you’ll probably see something like this:

$ python --version
Python 3.7.6

or this:

$ python3 --version
Python 3.8.5

At this point, by the way, the version number you get had better not begin with a 2 (as in 2.7). Python 2 reached its end of life in January 2020, so it no longer gets security fixes, and using such an old release will add some serious vulnerabilities to your software.
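If you’d rather run the same check from inside Python itself, here’s a minimal sketch. The is_supported function is just a name I made up for illustration; sys.version_info is the standard way to read the running interpreter’s version.

```python
import sys


def is_supported(version_info=sys.version_info):
    """Return True when the major version is 3 or later (hypothetical helper)."""
    return version_info[0] >= 3


# Show the version of the interpreter that is running this code.
print(sys.version.split()[0])
print("Supported:", is_supported())
```

Passing a tuple like (2, 7, 18) lets you see how the check treats an old release without actually having one installed.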

If you do need to install Python manually, you’re best off using Python’s official documentation (python.org/downloads) - that’ll be the most complete and up-to-date source available.

Virtual Python environments

It’s important to note that not all Python versions - even those from 3.x - will necessarily behave quite the way you expect. You may, for instance, find that you need a library written for version 3.9, but that there’s no way to get it working on your 3.7 system.

Upgrading your system version to 3.9 might work out well for you, but it could also cause some unexpected and unpleasant consequences. It’s hard to predict when a particular Python library may also be holding up bits of the OS. Pull the original version of the library and you might end up disabling the OS itself. Don’t laugh: I’ve done it to myself.

The solution is to run Python for your project within a special virtual environment that’s isolated from your larger OS. That way, you can install all the libraries and versions you like without having to worry about damaging your work system. You can do this using a full-scale virtual container running a Docker or (as I prefer) LXD image, or on a standalone AWS cloud instance. But you can also use Python’s own venv module. That can be as simple as running this single command:

python3 -m venv /path/to/new/virtual/environment

You’ll want to read the official documentation (docs.python.org/3/library/venv.html) for the virtual environment instructions specific to your host OS.
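The same venv module can also be driven from Python code. What follows is only a sketch: create_environment and the demo-env directory name are my own inventions, and I’ve assumed you don’t need pip bootstrapped into the new environment (with_pip=False keeps things quick).

```python
import tempfile
import venv
from pathlib import Path


def create_environment(target_dir, with_pip=False):
    """Create a virtual environment and return the path to its config file."""
    venv.create(target_dir, with_pip=with_pip)
    # Every environment made by venv contains a pyvenv.cfg marker file.
    return Path(target_dir) / "pyvenv.cfg"


if __name__ == "__main__":
    # Build a throwaway environment in a temporary directory.
    with tempfile.TemporaryDirectory() as scratch:
        config = create_environment(Path(scratch) / "demo-env")
        print(config.read_text())
```

Either way, you’ll still need to activate the environment (or call its own python binary directly) before anything you install lands inside it. The activation command differs between shells and operating systems.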

Working with Python modules

Not all Python functionality will be available out of the box. Sometimes you’ll need to tell Python to load a particular module through a declaration within your code:

import pandas as pd

But some modules will need to be installed manually from the host command line before they can even be imported. For such cases, Python recommends its own installer program, pip, or, in some cases, the conda tool from the Anaconda distribution.

pip3 install pandas matplotlib plotly

You can read more about using pip for the proper care and feeding of your Python system here: docs.python.org/3/installing.

All the software you’ll need to run the projects in this book should be noted where appropriate. But do pay attention to any unexpected error messages you might encounter in case your environment is somehow missing something.
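One way to find out ahead of time whether your environment is missing something is to ask Python which modules it can locate. This is a rough sketch, and missing_modules is a hypothetical helper of my own; importlib.util.find_spec is the standard-library call doing the real work.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The modules installed by the pip3 command shown earlier.
wanted = ["pandas", "matplotlib", "plotly"]
print("Still need to install:", missing_modules(wanted))
```

An empty list means everything can be found. Anything that’s listed is a candidate for pip or conda.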

Using Jupyter Notebooks

Once upon a time the lines of code you’d write to pull off the analytics we’re after would find themselves all snuggled up together in a single text file whose name ended with a .py suffix. When you wanted to run your code to see how things went, you’d do it from the command line (or a powerful and complicated source-code editor like Visual Studio Code).

Fun. But it meant that, for anything to work, it all had to work. That would make it harder to troubleshoot when something didn’t go according to spec. But it would also make it a lot harder to play around with specific details just to see what happens - which is where a lot of our most innovative ideas come from. And it also made it tough to share live versions of our code across the internet.

Jupyter Notebooks are JSON-based files (using .ipynb extensions) that, along with their various hosting environments, have gone a long way to solving all those problems. Notebooks move the processing environment for just about any data-oriented programming code using Python (or other languages) from your workstation to your web browser.
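Because a notebook is just JSON, you can peek inside one with nothing more than Python’s json module. Here’s a simplified sketch: make_notebook and code_sources are names I invented, and the dictionary includes only the handful of fields I’d expect in a bare-bones version 4 notebook. Real files carry much more metadata.

```python
import json


def make_notebook(sources):
    """Build a bare-bones notebook structure with one code cell per source."""
    return {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": None,
                "metadata": {},
                "outputs": [],
                "source": source,
            }
            for source in sources
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


def code_sources(notebook_text):
    """Return the source of every code cell in a notebook's JSON text."""
    notebook = json.loads(notebook_text)
    # "source" may be one string or a list of lines; join handles both.
    return [
        "".join(cell["source"])
        for cell in notebook["cells"]
        if cell["cell_type"] == "code"
    ]


text = json.dumps(make_notebook(["x = 1", "print(x)"]), indent=1)
print(code_sources(text))
```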

For me, a notebook’s most powerful feature is the way you can run subsets of your code within individual cells. This makes it easy to break down long and complex programs into easily readable - and executable - snippets. Whatever values were created will remain in the kernel’s memory until you overwrite them or restart the kernel. Clearing a cell’s output only tidies up what’s displayed; it doesn’t erase the underlying values.

This lets you re-run previous or subsequent cells to see what impact a change might have. It also means resetting your environment for a complete start-over is as easy as selecting Restart Kernel and Clear All Outputs.
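If it helps to picture what’s going on, here’s a toy model of cells sharing a single kernel. This is an analogy, not how Jupyter is actually implemented: ToyKernel is entirely made up, and it simply runs each "cell" against one shared dictionary of names.

```python
class ToyKernel:
    """A stand-in for a notebook kernel: cells share one namespace."""

    def __init__(self):
        self.namespace = {}

    def run_cell(self, source):
        # Each cell sees whatever earlier cells left behind.
        exec(source, self.namespace)

    def restart(self):
        # Restarting throws away every value created so far.
        self.namespace = {}


kernel = ToyKernel()
kernel.run_cell("x = 2")
kernel.run_cell("y = x * 10")
kernel.run_cell("x = 5")        # change an earlier value...
kernel.run_cell("y = x * 10")   # ...and re-run the cell that depends on it
print(kernel.namespace["y"])
kernel.restart()
print("x" in kernel.namespace)
```

Notice that changing x does nothing to y until the dependent cell is run again. That’s exactly the kind of thing to keep in mind when you run cells out of order.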

You can install Jupyter locally on your Python-enabled host and work with your notebooks from within your browser. Alternatively, you can run notebooks on third-party hosting services like Google’s Colaboratory or - for a cost - cloud providers like Amazon’s SageMaker Studio Notebooks and Microsoft’s Azure Notebooks.

If you do decide to make an old sysadmin happy and host your own notebooks, you’ll need to choose between classic notebooks and the newer JupyterLab. Both run nicely within your browser, but JupyterLab comes with more extensions and lets you work with multiple notebook files (and terminal access) within a single browser tab.

JupyterHub, just to be complete, is a server version built to provide authenticated notebook access to multiple users. You can serve up to 100 or so users from a single cloud server using The Littlest JupyterHub (tljh.jupyter.org). For larger deployments involving clusters of servers, you’d probably be better off with a Kubernetes version known as Zero to JupyterHub with Kubernetes.

Installing Jupyter Notebooks

Whichever version you choose, if you decide to install locally, the Jupyter project officially recommends doing it through the Python Anaconda distribution and its binary package manager, Conda. Guides to doing that are available for most OS hosts, but the official page (jupyterlab.io/install) is a good place to start.

At some point you’ll probably run into trouble. With JupyterLab in particular, extensions can be a bit fiddly to install. It’s useful, therefore, to be aware of the labextension tool. These simple commands illustrate how the tool can work:

jupyter labextension list
jupyter labextension install jupyterlab-plotly
jupyter labextension install plotlywidget

And always watch closely for error messages that can tell you important things about your environment.

Storing and protecting data

Please spare a few thoughts for your poor, unappreciated data. By which I mean the CSV or JSON files you might have generated while collecting and cleaning data sets. But I also mean your actual .ipynb notebook files. Remember: it’s true that Jupyter will regularly autosave your notebooks. But where does that saved file actually live? Wherever you left it on the host - which might be as ephemeral as a Docker container.

What happens when that host is shut down for the last time and decommissioned (or corrupted beyond repair)? Your CSV and .ipynb files go with it. What can you do to preserve all that data? Make sure up-to-date copies exist in reliable places.

After all, data doesn’t back itself up.
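Here’s one minimal way you might script that, assuming your notebooks and data files all sit in a single directory. The back_up function and its default list of suffixes are my own choices for illustration. A real backup plan would also push copies somewhere off the host.

```python
import shutil
from pathlib import Path


def back_up(source_dir, backup_dir, suffixes=(".ipynb", ".csv", ".json")):
    """Copy matching files from source_dir into backup_dir; return their names."""
    source_dir = Path(source_dir)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    for path in sorted(source_dir.iterdir()):
        if path.is_file() and path.suffix in suffixes:
            # copy2 keeps timestamps along with the file contents.
            shutil.copy2(path, backup_dir / path.name)
            copied.append(path.name)
    return copied
```

You’d call it with something like back_up("projects", "/mnt/backup/projects"), where both paths are placeholders for wherever your own files and backup storage actually live.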

Getting help

The internet is host to more knowledge than any one human being could possibly remember, or even organize. More often than not, we use search to access the tiny fragments of that knowledge that we need at a given time. Effective search, however, is far more than just typing a few related words into the search field and hitting Enter. There’s method to the madness. Here are some powerful tips that will work on any major search engine. My own favorite is DuckDuckGo.

Use your problem to find a solution

Considering that countless thousands of people have worked with the same technology you’re now learning, the odds are very high that at least some of them have run into the same problem you did. And at least a few of those will have posted their questions to an online user forum like Stack Overflow. The quickest way to get a look at the answers they received is to search using the exact language that you encountered.

Did your problem generate an error message? Paste exactly that text into your search engine. Were there any log messages? Find and post those, too.

Be precise

The internet has billions of pages, so vague searches are bound to return a whole lot of false positives. That’s why you want to be as precise as possible. One powerful trick is to enclose your error message in quotation marks, telling the search engine that you’re looking for that exact phrase, rather than for any page containing all or most of the words somewhere on it. However, you don’t want to be so specific that you end up narrowing your results down to zero.

Therefore, for an entry from the Apache error log like this:

[Dec 16 02:15:44 2020] [error] [client] Client sent malformed Host header

…you should leave out the date and client IP address, because there’s no way anyone else got those exact details. Instead, include only the “Client sent…” part in quotation marks:

"Client sent malformed Host header"

If that’s still too broad, consider adding the strings Apache and [error] outside the quotation marks.
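If you find yourself doing this a lot, the trimming can be scripted. This is a small sketch that assumes the log line opens with a run of bracketed fields like the Apache example above. exact_phrase_query is a hypothetical helper, not part of any library.

```python
import re


def exact_phrase_query(log_line, extra_terms=()):
    """Strip leading [bracketed] fields and quote what is left."""
    # Remove every "[...]" group at the start of the line (date, level, client).
    message = re.sub(r"^(\[[^\]]*\]\s*)+", "", log_line).strip()
    return " ".join([f'"{message}"', *extra_terms])


line = "[Dec 16 02:15:44 2020] [error] [client] Client sent malformed Host header"
print(exact_phrase_query(line))
print(exact_phrase_query(line, extra_terms=("Apache", "[error]")))
```

The second call adds the extra strings outside the quotation marks, which narrows things down without demanding an exact match on them.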

Search engines let you narrow down your search by time. If your problem is specific to a relatively recent release version, restrict your search to just the last week or month.

Sometimes an outside search engine will do a better job searching through a large web site than the site’s own internal tool (I’m looking at you: Government of Canada). If you feel the solution to your problem is likely to be somewhere on a particular site – like Stack Overflow’s admin cousin, Server Fault – but you can’t find it yourself, you can restrict results to just that one site:

"gss_accept_sec_context(2) failed:" site:serverfault.com

Finally, if you see that many or all of the false positives you’re getting seem to include a single word that is very unlikely to occur in the pages you’re looking for, exclude it with a dash. In this example you, of course, were looking for help learning how to write Bash scripts, but you kept seeing links with advice for aspiring Hollywood screenwriters. Here’s how to solve it:

writing scripts -movie
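The site: and dash operators are easy to assemble in code, too. Here’s a sketch of how that might look. build_query is a name I made up, and it does nothing more than glue the pieces of the query string together.

```python
def build_query(terms, site=None, exclude=()):
    """Combine search terms with optional site: and -word operators."""
    parts = [terms]
    if site:
        # Restrict results to a single web site.
        parts.append(f"site:{site}")
    # Prefix each unwanted word with a dash to exclude it.
    parts.extend(f"-{word}" for word in exclude)
    return " ".join(parts)


print(build_query('"gss_accept_sec_context(2) failed:"', site="serverfault.com"))
print(build_query("writing scripts", exclude=["movie"]))
```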