Channel: Pandas: KeyError when using column names which are included in an index - Stack Overflow

Pandas: KeyError when using column names which are included in an index


I'm parsing text files made up of fixed-width fields, with lines that look like this:

USC00142401201703TMAX  211  H  133  H  161  H  194  H  206  H  161  H  244  H  178  H-9999     250  H   78  H   44  H   67  H   50  H   39  H  106  H  239  H  239  H  217  H  317  H  311  H  178  H  139  H-9999     228  H-9999   -9999   -9999   -9999   -9999   -9999   

I'm parsing these into a pandas DataFrame like so:

from collections import OrderedDict
from pandas import DataFrame
import pandas as pd
import numpy as np

def read_into_dataframe(station_filepath):

    # specify the fixed-width fields
    column_specs = [(0, 11),   # ID
                    (11, 15),  # year
                    (15, 17),  # month
                    (17, 21),  # variable (referred to as element in the GHCND readme.txt)
                    (21, 26),  # day 1
                    (29, 34),  # day 2
                    (37, 42),  # day 3
                    (45, 50),  # day 4
                    (53, 58),  # day 5
                    (61, 66),  # day 6
                    (69, 74),  # day 7
                    (77, 82),  # day 8
                    (85, 90),  # day 9
                    (93, 98),  # day 10
                    (101, 106),  # day 11
                    (109, 114),  # day 12
                    (117, 122),  # day 13
                    (125, 130),  # day 14
                    (133, 138),  # day 15
                    (141, 146),  # day 16
                    (149, 154),  # day 17
                    (157, 162),  # day 18
                    (165, 170),  # day 19
                    (173, 178),  # day 20
                    (181, 186),  # day 21
                    (189, 194),  # day 22
                    (197, 202),  # day 23
                    (205, 210),  # day 24
                    (213, 218),  # day 25
                    (221, 226),  # day 26
                    (229, 234),  # day 27
                    (237, 242),  # day 28
                    (245, 250),  # day 29
                    (253, 258),  # day 30
                    (261, 266)]  # day 31

    # create column names to correspond with the fields specified above
    column_names = ['station_id', 'year', 'month', 'variable',
                    '01', '02', '03', '04', '05', '06', '07', '08', '09', '10',
                    '11', '12', '13', '14', '15', '16', '17', '18', '19', '20',
                    '21', '22', '23', '24', '25', '26', '27', '28', '29', '30',
                    '31']

    # read the fixed width file into a DataFrame, with columns of the widths and names specified above
    df = pd.read_fwf(station_filepath,
                     header=None,
                     colspecs=column_specs,
                     names=column_names,
                     na_values=-9999)

    # convert the variable column to string data type, all others as integer data type
    df.dropna()  # REVISIT do we really want to do this?
    df['variable'] = df['variable'].astype(str)

    # keep only the rows where the variable value is 'PRCP', 'TMIN', or 'TMAX'
    df = df[df['variable'].isin(['PRCP', 'TMAX', 'TMIN'])]

    # melt the individual day columns into a single day column
    df = pd.melt(df,
                 id_vars=['station_id', 'year', 'month', 'variable'],
                 value_vars=['01', '02', '03', '04', '05', '06', '07', '08', '09', '10',
                             '11', '12', '13', '14', '15', '16', '17', '18', '19', '20',
                             '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31'],
                 var_name='day',
                 value_name='value')

    # pivot the DataFrame on the variable type (PRCP, TMIN, TMAX), so each
    # type has a separate column with the day's value for the type
    df = df.pivot_table(index=['year', 'month', 'day'],
                        columns='variable',
                        values='value')

    return df

I now get the DataFrame in the shape I want, except that it contains rows for days that don't exist (e.g. February 31st), which I'd like to remove.

I've tried to do this using masks, but I get a KeyError when I use what I think are valid column names. For example, if I include the following code in the above function just before returning the DataFrame:

months_with_31days = [1, 3, 7, 8, 10, 12]
df = df[((df['day'] == 31) & (df['month'] in months_with_31days))
        |
        ((df['day'] == 30) & (df['month'] != 2))
        |
        ((df['day'] == 29) & (df['month'] != 2))
        |
        ((df['day'] == 29) & (df['month'] == 2) & calendar.isleap(df['year']))
        |
        df['day'] < 29]

The above will result in a KeyError:

KeyError: 'day'
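
Separately from the KeyError, the mask above would probably fail even once 'day' is reachable: `in` and `calendar.isleap` operate on scalars rather than on a whole Series, `|` binds more tightly than `<` (so the last term needs its own parentheses), and the day values produced by melt() are strings such as '01'. The following is a minimal sketch of a vectorized equivalent, assuming year, month and day are ordinary integer columns. The sample frame and the names `is_leap`, `valid` and `filtered` are illustrative and not part of the original code; the month list here also includes May, which the snippet above omits.

```python
import pandas as pd

# Hypothetical sample: year/month/day as ordinary integer columns.
df = pd.DataFrame({
    "year":  [1893, 1893, 1896, 1893, 1893, 1893],
    "month": [2,    2,    2,    4,    4,    5],
    "day":   [28,   29,   29,   30,   31,   31],
})

months_with_31_days = [1, 3, 5, 7, 8, 10, 12]

# Vectorized Gregorian leap-year test (calendar.isleap expects a scalar).
is_leap = (df["year"] % 4 == 0) & (
    (df["year"] % 100 != 0) | (df["year"] % 400 == 0)
)

# Each comparison is parenthesized because | and & bind tighter than < and ==.
valid = (
    (df["day"] < 29)
    | ((df["day"] == 29) & ((df["month"] != 2) | is_leap))
    | ((df["day"] == 30) & (df["month"] != 2))
    | ((df["day"] == 31) & df["month"].isin(months_with_31_days))
)

filtered = df[valid]
print(filtered)
```

In this sample the rows for 29 February 1893 and 31 April 1893 are dropped, while 29 February 1896 (a leap year) is kept.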

The day variable was created by the melt() call, then used within the index in the call to pivot_table(). How this affects the indexing of the DataFrame and why it messes up the ability to use the previous column names is not clear to me.
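
A small reproduction may help show what happens here. The sketch below assumes a hypothetical long-format frame shaped like the output of melt() (the names `long_df`, `wide` and `flat` are illustrative). After pivot_table(), the keys passed as `index` become levels of the row index rather than columns, so bracket lookup by column name fails; reset_index() is one way to turn them back into columns.

```python
import pandas as pd

# Hypothetical long-format frame, shaped like the output of melt().
long_df = pd.DataFrame({
    "year": [1893, 1893, 1893, 1893],
    "month": [1, 1, 1, 1],
    "day": ["01", "01", "02", "02"],
    "variable": ["TMAX", "TMIN", "TMAX", "TMIN"],
    "value": [61.0, 33.0, 33.0, 6.0],
})

wide = long_df.pivot_table(index=["year", "month", "day"],
                           columns="variable",
                           values="value")

# The index keys are now levels of the row index, not columns.
print(list(wide.columns))      # only the variable names remain as columns
print(list(wide.index.names))  # year, month and day live here

# Bracket lookup searches the columns, so 'day' is not found.
try:
    wide["day"]
    raised = False
except KeyError:
    raised = True
print("KeyError raised:", raised)

# reset_index() turns the index levels back into ordinary columns.
flat = wide.reset_index()
print(list(flat.columns))
```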

Edit

I assume that I now have a MultiIndex on the DataFrame, created by the index argument passed to pivot_table().
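
One way to check that assumption, and to filter on the index levels without flattening the frame, is sketched here. The frame is a hypothetical stand-in with the same index layout as the pivoted result; `is_multi`, `keep` and `trimmed` are illustrative names. `get_level_values` returns one level of the index aligned with the rows, so it can be used to build a boolean mask.

```python
import pandas as pd

# Hypothetical frame with the same index layout as the pivoted result.
index = pd.MultiIndex.from_tuples(
    [(1893, 1, "30"), (1893, 1, "31"), (1893, 2, "30"), (1893, 2, "31")],
    names=["year", "month", "day"],
)
df = pd.DataFrame({"TMAX": [61.0, 33.0, float("nan"), float("nan")]},
                  index=index)

# Confirm that the row index really is a MultiIndex.
is_multi = isinstance(df.index, pd.MultiIndex)
print(is_multi, list(df.index.names))

# Pull individual levels out by name; each one is aligned with the rows.
month = df.index.get_level_values("month")
day = df.index.get_level_values("day").astype(int)  # melt() left days as strings

# Example mask built from index levels: drop days 30 and 31 of February.
keep = ~((month == 2) & (day >= 30))
trimmed = df[keep]
print(trimmed)
```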

Initial lines displayed when printing the DataFrame:

variable         PRCP   TMAX   TMIN
year month day                     
1893 1     01     NaN   61.0   33.0
           02     NaN   33.0    6.0
           03     NaN   44.0   17.0
           04     NaN   78.0   22.0
           05     NaN   17.0  -94.0
           06     NaN   33.0    0.0
           07     NaN    0.0  -67.0

I've tried referencing the DataFrame's columns using dot notation instead of brackets with quoted column names, but I get similar errors. It seems like the year, month, and day columns have been merged into a single index column and can no longer be referenced individually. Or maybe something else is going on here? I'm stumped, and wonder if there is a better way to do it.
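
As one possible alternative to a hand-written calendar mask, the year, month and day parts can be assembled into real dates, letting pandas reject impossible combinations. This is a sketch under assumptions, not a definitive solution: the frame below is a hypothetical stand-in for the pivoted result, and `flat`, `parts` and `cleaned` are illustrative names. `pd.to_datetime` accepts a DataFrame with year, month and day columns, and `errors="coerce"` turns invalid dates into NaT.

```python
import pandas as pd

# Hypothetical pivoted frame containing two impossible dates.
index = pd.MultiIndex.from_tuples(
    [(1893, 2, "28"), (1893, 2, "29"), (1893, 2, "31"), (1893, 3, "31")],
    names=["year", "month", "day"],
)
df = pd.DataFrame({"TMAX": [61.0, float("nan"), float("nan"), 44.0]},
                  index=index)

# Move the index levels back into columns so they can be combined.
flat = df.reset_index()

# Assemble dates from the components; impossible combinations become NaT.
parts = pd.DataFrame({
    "year": flat["year"].astype(int),
    "month": flat["month"].astype(int),
    "day": flat["day"].astype(int),  # melt() left days as strings like '01'
})
flat["date"] = pd.to_datetime(parts, errors="coerce")

# Keep only rows whose date exists, then index by the real date.
cleaned = (flat.dropna(subset=["date"])
               .drop(columns=["year", "month", "day"])
               .set_index("date"))
print(cleaned)
```

This handles month lengths and leap years in one step, and the resulting DatetimeIndex can be used directly for time-based selection or resampling.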

