EuroPython 2018

Introduction to Pandas, Testing and Test-Driven Data Analysis

Speaker(s) Nick Radcliffe

This training replaces Sandrine’s Intro to Pandas and SciKit Learn, which she sadly won’t be able to present because she has recently broken her nose!

This training will introduce Pandas (and NumPy) for people who either haven’t used them at all or are just getting started with them. It will then introduce testing in Python in general, using unittest, and testing analytical code in particular, using the test-driven data analysis (TDDA) package.

Planned content:

Pandas and NumPy:
- what they are
- why you should use them
- how to use them
- basic NumPy concepts (arrays, dtypes, array operations, zeros, ones, creating new arrays, where, choose, sum, size etc.)
- pandas basics: DataFrames, creating DataFrames, loading data, indices
- creating new columns, appending DataFrames, merging DataFrames, groupby operations, plotting
- loading and saving data (pickle, csv, feather)
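As a rough taste of the NumPy and pandas topics listed above, here is a minimal sketch. The data (a tiny `sales` table and a `managers` lookup) is invented purely for illustration, and `pd.concat` is used for appending because it works across pandas versions.

```python
import io

import numpy as np
import pandas as pd

# --- NumPy: arrays, dtypes and element-wise operations ---
a = np.array([1, 2, 3, 4])                  # integer array built from a list
z = np.zeros(4)                             # float64 array of zeros
o = np.ones(4, dtype=np.int64)              # ones with an explicit dtype
doubled = a * 2                             # operations apply element by element
evens_or_zero = np.where(a % 2 == 0, a, 0)  # keep even values, replace odd ones with 0
total = a.sum()                             # sum of all elements
n = a.size                                  # number of elements

# --- pandas: creating a DataFrame and adding a column ---
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "units": [10, 20, 30, 40],
    "price": [2.0, 2.0, 3.0, 3.0],
})
sales["revenue"] = sales["units"] * sales["price"]

# --- merging with a lookup table, then a groupby aggregation ---
managers = pd.DataFrame({"region": ["north", "south"],
                         "manager": ["Ann", "Bob"]})
merged = sales.merge(managers, on="region", how="left")
revenue_by_region = merged.groupby("region")["revenue"].sum()

# --- appending rows from another DataFrame ---
extra = pd.DataFrame({"region": ["east"], "units": [5],
                      "price": [4.0], "revenue": [20.0]})
combined = pd.concat([sales, extra], ignore_index=True)

# --- saving and loading CSV (an in-memory buffer stands in for a file) ---
buf = io.StringIO()
sales.to_csv(buf, index=False)
buf.seek(0)
reloaded = pd.read_csv(buf)
```

In a real session you would pass a file path to `to_csv` and `read_csv` instead of the in-memory buffer.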

Testing:
- the idea of testing and test-driven development
- the unittest library basics: unittest.TestCase, unittest.main, verbosity, self.assertEqual and friends
- slightly more advanced unittest features: setUp, tearDown, setUpClass, tearDownClass
- unit tests vs. integration and system tests
- pytest as an alternative to unittest
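The unittest pieces named above fit together roughly as in the following sketch. The `mean` function and the `TestMean` class are hypothetical examples, not material from the training. The tests are run through an explicit runner so the code also works inside a notebook; in a script you would more commonly finish with `unittest.main(verbosity=2)`.

```python
import unittest


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


class TestMean(unittest.TestCase):
    def setUp(self):
        # Runs before every test method, giving each test fresh data.
        self.values = [1, 2, 3, 4]

    def test_mean_of_integers(self):
        self.assertEqual(mean(self.values), 2.5)

    def test_mean_of_floats(self):
        # Floating-point results are compared approximately.
        self.assertAlmostEqual(mean([0.1, 0.2]), 0.15)

    def test_empty_input_raises(self):
        with self.assertRaises(ValueError):
            mean([])


# Load the tests from the class and run them with verbose output.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMean)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

With pytest, the same checks could be written as plain functions using bare `assert` statements, for example `def test_mean(): assert mean([1, 2, 3, 4]) == 2.5`.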

Test-Driven Data Analysis:
- making the case for testing analytical processes
- extending the ideas from software testing to reference tests for analytical processes with ReferenceTestCase from the TDDA library
- using constraints to test data (input, output, intermediates)
- automatically generating constraints from known good data
- verifying data against generated (or hand-created) constraints
- detecting bad data (and anomalies)
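To make the constraint idea above concrete, here is a hand-rolled sketch using only pandas. It is not the TDDA library's API: the helpers `discover_constraints` and `verify` are hypothetical names invented for this illustration, and the constraints are limited to null checks and numeric ranges. The library itself provides its own, richer discovery and verification tools.

```python
import pandas as pd


def discover_constraints(df):
    """Derive simple per-column constraints from known good data.

    Hypothetical helper for illustration; not part of the TDDA library.
    """
    constraints = {}
    for name in df.columns:
        col = df[name]
        c = {"allow_nulls": bool(col.isnull().any())}
        if pd.api.types.is_numeric_dtype(col):
            c["min"] = col.min()
            c["max"] = col.max()
        constraints[name] = c
    return constraints


def verify(df, constraints):
    """Return a list of readable constraint failures (empty if all pass).

    Hypothetical helper for illustration; not part of the TDDA library.
    """
    failures = []
    for name, c in constraints.items():
        if name not in df.columns:
            failures.append(f"{name}: column missing")
            continue
        col = df[name]
        if not c["allow_nulls"] and col.isnull().any():
            failures.append(f"{name}: unexpected nulls")
        values = col.dropna()
        if "min" in c and (values < c["min"]).any():
            failures.append(f"{name}: value below {c['min']}")
        if "max" in c and (values > c["max"]).any():
            failures.append(f"{name}: value above {c['max']}")
    return failures


# Constraints are generated from data believed to be good...
good = pd.DataFrame({"age": [23, 35, 61], "name": ["a", "b", "c"]})
constraints = discover_constraints(good)

# ...and then used to check new data, here containing a null and a negative age.
new = pd.DataFrame({"age": [30, -4, None], "name": ["d", "e", "f"]})
problems = verify(new, constraints)
```

Generated constraints are only as general as the data they came from, so in practice they are often reviewed and relaxed by hand before being used to reject new data.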

PREPARATION:

As a bare minimum, you will need a working Python installation (preferably 3.6 or 3.7, though 2.7 could be used) with numpy and pandas installed, preferably with JupyterLab. Anaconda is a great way to get all this.

Ideally, you should also install the TDDA library, the feather library and the pmmif library beforehand, in case the network is overloaded during the training.

All dependencies can be installed with pip (the sources are also available on GitHub):

pip install numpy
pip install pandas
pip install tdda
pip install feather-format
pip install pmmif

Scheduled on Monday 23 July at 09:30 and at 11:15.
