How to Use converters in pandas.read_excel When Using MultiIndex at the Same Time

2 minute read

Background/Problem

Ok, I think I had this problem about two years ago (not quite sure). Then a little over a year ago, I saw that someone had pretty much the exact same problem on stackoverflow and I answered it there. Well, I was kind of late and someone else had answered before, but I felt that my solution got to the point of the question just a little better. Then I started to write a post for my website about it, but somehow never quite finished and published it. Today I finally got to do so 😃.

Enough bla bla bla, let’s get to the problem.

Imagen you have an excel file with measurement data. The first header row tells you what was measured, the second one what the measurement unit is. Every row beneath that represents a set of measurements taken at a specific time. Now you need to read in the data from the excel file into a pandas.DataFrame. While doing so, you also want to convert the measurement data into SI units. pandas.read_excel function has an optional argument converters. To use it, you have to pass in a dictionary. Each key in that dictionary is a column name and the corresponding value a lambda function that will be applied to that column. The problem is, the excel file at hand has two header rows represented as a MultiIndex in pandas. So what do you use for the keys in the converters dictionary?

Solution

To better demonstrate the problem and to have an example to work with, imagine you had an excel file with the following table:

width height
nano_meter milli_meter
1 4
2 5
3 6

Back when I had the problem, it took me quite some time to figure out how to set up the converters dictionary. When I look at it now, it’s actually pretty easy. But I guess that’s just how it always is. Once you know how to do something, it seems obvious and easy 🤓. Anyways, all you have to do is, use tuples as keys in the dictionary. Here we go:

import pandas as pd

converters = {
    ("width", "nano_meter"): lambda nm: nm / 1_000_000_000,
    ("height", "milli_meter"): lambda mm: mm / 1_000,
}

data = pd.read_excel("<PATH/TO/EXCEL/FILE>", header=[0, 1], converters=converters)
print(data)

This short script produces the following output:

          width      height
     nano_meter milli_meter
0  1.000000e-09       0.004
1  2.000000e-09       0.005
2  3.000000e-09       0.006

Of course after the conversion the units in the DataFrame are not correct anymore. So you should rename them or remove them. Even though this was not part of the exercise, here is how you can remove the units header:

data.columns = data.columns.droplevel(1)

Printing the data now yields the following:

          width  height
0  1.000000e-09   0.004
1  2.000000e-09   0.005
2  3.000000e-09   0.006



Take care,
Andreas

Updated:

Leave a comment