Pandas

In [1]:
import pandas as pd
import numpy as np

Let's work with Panda some more. We'll be using the data sheet form lecture, hopefully you still have it.

In [2]:
df = pd.read_excel('data.xlsx', 'Worksheet', index_col=0)
df.describe()
Out[2]:
Phone First Last
count 100 100 100
unique 100 94 95
top 353-651-7605 Todd Nunez
freq 1 2 2

This data frame is rather boring because it only has three columns. Let's modify it a bit and pretend it's a class roster.

First, "Member #" isn't a fitting name for a class roster. Let's give it a better name.

In [3]:
df.index.rename("Student ID", inplace=True)
df.head()
Out[3]:
Phone First Last
Student ID
A1001 125-111-4978 Austin Bell
A1002 763-303-7544 Declan Buck
A1003 695-919-3789 Carol Vazquez
A1004 276-570-7451 Fleur Dunlap
A1005 498-479-7074 Garth Potter

It also doesn't really make sense for a phone number to be in a class roster. Let's remove the phone numbers from this. In lecture, we learned that we can do this by selecting all the other columns. A simpler way is to just drop the column we don't want:

In [4]:
df.drop(columns=['Phone'], inplace=True)
df.head()
Out[4]:
First Last
Student ID
A1001 Austin Bell
A1002 Declan Buck
A1003 Carol Vazquez
A1004 Fleur Dunlap
A1005 Garth Potter

Then we can add some grades. We can do this by just creating some additional columns and assigning numbers at random. (Don't worry, we don't grade this way in this class.)

In [5]:
grades = pd.DataFrame(np.random.randint(65, 100, [100, 5]),
                      index = df.index,
                      columns=["HW1", "HW2", "HW3", "HW4", "Exam"])
df = df.join(grades)
del grades

df.head()
Out[5]:
First Last HW1 HW2 HW3 HW4 Exam
Student ID
A1001 Austin Bell 71 66 75 89 95
A1002 Declan Buck 65 65 77 67 97
A1003 Carol Vazquez 65 66 73 81 80
A1004 Fleur Dunlap 86 68 97 96 98
A1005 Garth Potter 77 78 91 75 93

Warm-up Exercise: Due to a grading mistake, all the grades for HW1 have been inserted one row too high (the first person did not submit anything and will get a 0). Let's fix this mistake together.

In [6]:
hw1_grades = list(df['HW1'])
hw1_grades = [0] + hw1_grades[:-1]
df['HW1'] = hw1_grades

df.head()
Out[6]:
First Last HW1 HW2 HW3 HW4 Exam
Student ID
A1001 Austin Bell 0 66 75 89 95
A1002 Declan Buck 71 65 77 67 97
A1003 Carol Vazquez 65 66 73 81 80
A1004 Fleur Dunlap 65 68 97 96 98
A1005 Garth Potter 86 78 91 75 93

Exercise 1: Get a list of the people in the class with ids between A1050 and A1060. Bonus points if you just get the names and not the rest of the information.

In [7]:
people = df.loc['A1050':'A1060','First':'Last']
people
Out[7]:
First Last
Student ID
A1050 Jena Garcia
A1051 Tasha Hickman
A1052 Chaney Meadows
A1053 Lila Whitehead
A1054 Raven Randolph
A1055 Amal Cooke
A1056 Leonard Green
A1057 Fredericka Shaw
A1058 Illiana Jennings
A1059 Lara Conner
A1060 Tatum Le

Exercise 2: Suppose the grade is 60% homework and 40% exam in this class. Create a Pandas data series final_grade that has the final grade for each student.

Hint: If you get the column of a data frame you get a series. You can do arithmetic on data series in the same way you can with numpy data.

In [8]:
final_grade = df['HW1']*0.15 + df['HW2']*0.15 + df['HW3']*0.15 + df['HW4']*0.15 + df['Exam']*0.4
final_grade
Out[8]:
Student ID
A1001    72.50
A1002    80.80
A1003    74.75
A1004    88.10
A1005    86.70
         ...  
A1096    80.85
A1097    84.70
A1098    85.25
A1099    81.65
A1100    83.15
Length: 100, dtype: float64

We can add this series as a column to the data frame in a similar way we combined the two data frames earlier. When we do so, it will help if we give the series a name because that will become the name of the column. (We can also give the column a name later, but it's easier to just do it first.)

In [9]:
final_grade.name = "Final"

Then we use .join() to add it to the data frame.

In [10]:
df = df.join(final_grade)
df.head()
Out[10]:
First Last HW1 HW2 HW3 HW4 Exam Final
Student ID
A1001 Austin Bell 0 66 75 89 95 72.50
A1002 Declan Buck 71 65 77 67 97 80.80
A1003 Carol Vazquez 65 66 73 81 80 74.75
A1004 Fleur Dunlap 65 68 97 96 98 88.10
A1005 Garth Potter 86 78 91 75 93 86.70

Exercise 3: Next, let's give the students letter grades. Create a column called grade that has either 'A', 'B', 'C', or 'F', depending on the students' final grades.

In [11]:
letter_grade = final_grade.copy()
As = letter_grade >= 90
Bs = (letter_grade >= 80) & (letter_grade < 90)
Cs = (letter_grade >= 70) & (letter_grade < 80)
Fs = letter_grade < 70

letter_grade[As] = 'A'
letter_grade[Bs] = 'B'
letter_grade[Cs] = 'C'
letter_grade[Fs] = 'F'

letter_grade.name = 'Grade'
df = df.join(letter_grade)
df.head()
Out[11]:
First Last HW1 HW2 HW3 HW4 Exam Final Grade
Student ID
A1001 Austin Bell 0 66 75 89 95 72.50 C
A1002 Declan Buck 71 65 77 67 97 80.80 B
A1003 Carol Vazquez 65 66 73 81 80 74.75 C
A1004 Fleur Dunlap 65 68 97 96 98 88.10 B
A1005 Garth Potter 86 78 91 75 93 86.70 B

Exercise 4: Get a list (a data frame is fine) of all the students with A's in the course.

In [12]:
As = df.loc[df['Grade'] == 'A','First':'Last']
display(As) # like print, but works better for data frames

# If we wanted to be literal and get an actual list:
As = As['First'] + ' ' +  As['Last']
list(As)
First Last
Student ID
A1021 Alexander Patrick
A1025 Anne Sanders
A1034 Farrah Rowe
A1041 Hiroko Reed
A1045 Quinlan Berger
A1080 Hilel Mcintyre
Out[12]:
['Alexander Patrick',
 'Anne Sanders',
 'Farrah Rowe',
 'Hiroko Reed',
 'Quinlan Berger',
 'Hilel Mcintyre']