Power of Python Pandas
Power of Python Pandas
The ease of extracting and summarizing large amounts of data using Python Pandas is powerful. Below is an example of using airline data to find out how many passengers went to an airport, accident rate based on reference codes, deaths and the causes of accidents. With a few lines of code, you can quickly see the power of Python Pandas. Enjoy!
Examine passenger volume data and incident and accident data for airlines and airports. See the prior posting on previously manipulated air transportation data.
The Data
Data is from the U.S. Dept. of Transportation (USDOT) and the U.S. Federal Aviation Administration (FAA). The USDOT data are for calendar year 2014.
- Passengers who traveled are in the csv file 2014 CY-YTD Passenger Raw Data_2.csv a2010_14
- Information on this data can be found at: http://www.transportation.gov/policy/aviation-policy/us-international-air-passenger-and-freight-statistics-report#TheData
- The FAA data are “AID” data: they’re about air transportation accidents and incidents. They can be found at: http://av-info.faa.gov/dd_sublevel.asp?Folder=\AID
- The FAA data used are in the file zip, which is a zip archived tab delimited file (It should be tab delimited, but it might be fixed field)
- The layout of this file is in txt Afilelayout
- The FAA airport codes are in txt Airport
- Descriptions of various codes are in DOC Airport
You can find additional documentation on the FAA site, above, if you find that it’s useful.
Note that in the USDOT data file there are commas in quantities that exceed one thousand. Where this is the case, the quantity is enclosed in double quotes.
Python Panda Tasks
Using Python Pandas and the data outlined above, there are four main tasks:
Task A: Determine for the six U.S. airports ABQ, BOI, BUF, CMH, SEA, and TPA, how many passenger departures and arrivals they each had in 2014, and also which airline was each airport’s largest departure carrier and the largest arrival carrier.
Task B: For each of these six airports, determine what airports the largest number of arrivals were from, and the airport that the largest number of departures went to.
Task C: For each of these six airports, determine the number of accidents or incidents that occurred at them between 2010 and 2014 inclusive, according to the FAA. Determine the number of deaths that occurred in each event for each of these five airports.
Task D: Determine what the top ten (primary) causes of 2010-2014 incidents and accidents are for all events resulting in deaths regardless of where they occurred. Provide descriptions (not codes) for the causes.
Air trips and bad things – winter 2017 CC by SA Lynd Bacon & Associates, Ltd. DBA Lorma Buena Associates is licensed under CC BY 4.0
Task A: Passenger Volume for Six Airports and Largest Carrier of Departures and Arrivals
import pandas as pd
import csv
filein=pd.read_csv(“d:/GrEx4/2014 CY-YTD Passenger Raw Data.csv”, encoding=’utf-8′, thousands=’,’,
na_values=”)
df=filein.filter(items=[‘OriginApt’, ‘DestApt’, ‘Total’, ‘Carrier’])
#choose the selected airports
selapts =[‘ABQ’, ‘BOI’, ‘BUF’, ‘CMH’, ‘SEA’, ‘TPA’]
#create a dataframe
dfselect=pd.DataFrame(df)
#reduce the dataframe to only selectedairports
#————Question 1A ———
paxdf = pd.DataFrame(filein, columns = [‘OriginApt’, ‘DestApt’, ‘Carrier’, ‘Total’ ])
selectoriginapt = paxdf[paxdf.OriginApt.isin([‘ABQ’, ‘BOI’, ‘BUF’, ‘CMH’, ‘SEA’, ‘TPA’])]
q1paxbyapt = selectoriginapt.pivot_table([‘Total’], index = [‘OriginApt’], aggfunc=sum)
print (‘\nQ1: Largest airports by number of passengers’)
print (q1paxbyapt)
#————–Questin 1B—————
#part b largest carriers by departrues or OriginApt
selairpots = dfselect.loc[dfselect[‘OriginApt’].isin(selapts),(‘OriginApt’, ‘DestApt’, ‘Carrier’, ‘Total’)]
q1carriersbyorigin = pd.DataFrame(selairpots)
carrbyorg=q1carriersbyorigin.loc[q1carriersbyorigin[‘OriginApt’].isin(selapts),([‘OriginApt’,
‘DestApt’,’Carrier’, ‘Total’])]
carrbyorg[‘Total’].convert_objects(convert_numeric=True)
AptsLgArrivalsQ1B = carrbyorg.groupby([‘OriginApt’])[‘Carrier’, ‘Total’].max()
print (‘Select Airports largest carriers by Origin’)
print (AptsLgArrivalsQ1B)
print (‘Select Airports largest carriers by Destination: NA – no flight based on selected airports in DestApt’)
selairpotsdest =dfselect.loc[dfselect[‘DestApt’].isin(selapts),( ‘DestApt’,’OriginApt’, ‘Carrier’, ‘Total’)]
q1carriersbydest = pd.DataFrame(selairpotsdest)
carrbydest=q1carriersbydest.loc[q1carriersbydest[‘DestApt’].isin(selairpotsdest),([‘OriginApt’, ‘DestApt’,’Carrier’, ‘Total’])]
AptsLgArrivalsQ1B = carrbydest.groupby([‘DestApt’])[‘Carrier’, ‘Total’].max()
[/code]
Output
Q1 Largest airports by number of passengers | ||
Total | ||
OriginApt | ||
ABQ | 264 | |
BOI | 98 | |
BUF | 3891 | |
CMH | 57213 | |
SEA | 3678369 | |
TPA | 615820 | |
Select airports largest carriers by Origin | ||
Carrier | Total | |
OriginApt | ||
ABQ | UA | 133 |
BOI | ZX | 59 |
BUF | YX | 272 |
CMH | XP | 4617 |
SEA | UA | 33414 |
TPA | WS | 25275 |
Task B: Largest Arrivals and Departures based on Six Selected Airports
[code language="css"]
import pandas as pd
import csv
#Read File
filein=pd.read_csv(“d:/GrEx4/2014 CY-YTD Passenger Raw Data.csv”, encoding
=’utf-8′, thousands=’,’, na_values=”)
df=filein.filter(items=[‘OriginApt’, ‘DestApt’, ‘Total’, ‘Carrier’])
#choose the selected airports
selapts =[‘ABQ’, ‘BOI’, ‘BUF’, ‘CMH’, ‘SEA’, ‘TPA’]
#create a dataframe
dfselect=pd.DataFrame(df)
#reduce the dataframe to only selectedairports
selairpots =dfselect.loc[dfselect[‘OriginApt’].isin(selapts),(‘OriginApt’,
‘DestApt’, ‘Total’)]
#part 2 – no airlines in the selapts are in the 6 selected airports
selairpots2 =dfselect.loc[dfselect[‘DestApt’].isin(selapts),(‘OriginApt’,
‘DestApt’, ‘Total’)]
#create subset of dataframe to use for this portion of the question
q1arrivalsbyorigin = pd.DataFrame(selairpots)
NumArrivalbyOrig3=q1arrivalsbyorigin.loc[q1arrivalsbyorigin[‘OriginApt’].isin(selapts),([‘OriginApt’, ‘DestApt’, ‘Total’])]
#Convert the column to numbers – two lines leave the below in otherwise will create an error
#ck on types of data – 2 lines
q1arrivalsbyorigin[‘Total’] = q1arrivalsbyorigin[‘Total’].apply(pd.to_numeric, errors=’coerce’) #messes up numbers don’t use
q1arrivalsbyorigin[‘Total’].convert_objects(convert_numeric=True)
#set index
print (‘Question 2 Part A: The number of arrivals from each origin are: ‘)
AptsLgArrivalsQ2A = q1arrivalsbyorigin.groupby([‘OriginApt’])[‘DestApt’, ‘Total’].max()
print (AptsLgArrivalsQ2A)
print (‘Question 2 Part A: There are no selected airports in the destination ‘)
[/code]
Output:
Question 2 Part A: The number of arrivals from each origin are: | ||||
OriginApt | DestApt | Total | ||
ABQ | YVR | 133 | ||
BOI | YYZ | 59 | ||
BUF | YYZ | 272 | ||
CMH | YYZ | 4617 | ||
SEA | YYZ | 33414 | ||
TPA | ZRH | 25275 | ||
Question 2 Part A: There are no selected airports in the destination |
Task C: Number of Accidents/Incidents at Selected Airports with the Associated Number of Deaths.
[code language="css"]
import pandas as pd
import zipfile
#read the zip file
zip_ref = zipfile.ZipFile(‘d:/GrE4/a2010_14.zip’, ‘r’)
zip_ref.extractall(‘d:/GrE4/a2010_14’)
zip_ref.close()
#assign na values
na_values = [”, ‘N/A’, ‘0’, ‘000000’]
filein = pd.read_table(‘d:/GrE4/a2010_14/A2010_14.txt’, na_values=’NULL’,)
df=filein.filter(items=[‘c143’, ‘c76’, ‘c1’])
#remove spaces from the end of the words
df[‘c143’] = df[‘c143’].str.strip()
df[‘c1’] = df[‘c1’].str.strip()
#put the cleaned version into a dataframe
dffaa = pd.DataFrame(df)
dffaa2 = pd.DataFrame(df)
dffaa2.to_csv(‘d:/20170223df2-dffaa2.csv’)
selapts = [‘ABQ’, ‘BOI’, ‘BUF’, ‘CMH’, ‘SEA’, ‘TPA’]
print (‘———-incidents ——————‘)
#The GOLDEN NUGGET on the lookup of the subset and return results
numincidents=dffaa2.loc[dffaa2[‘c143′].isin(selapts),’c1’]
selectairports=dffaa2.loc[dffaa2[‘c143’].isin(selapts),([‘c143’, ‘c1’]) ]
dffaa2.set_index([‘c143’, ‘c1’])
print (‘Question 3 Part A: The number of incident and accidents by airport are:’)
#Create a new dataframe
dfselectairport=pd.DataFrame(selectairports)
#print count incidents/accidents by airport
NumAIbyAirport=dfselectairport.groupby([‘c143’, ‘c1’])[‘c1’].count()
print NumAIbyAirport
#The GOLDEN NUGGET on the lookup of the subset and return results
numincidents=dffaa2.loc[dffaa2[‘c143′].isin(selapts),’c76’]
selectairports2=dffaa.loc[dffaa[‘c143’].isin(selapts),([‘c143’, ‘c76’]) ]
dffaa.set_index([‘c143’, ‘c76’])
#——————————
print (‘Question 3 Part B: The number of deaths by airport are:’)
#Create a new dataframe
dfselectairport2=pd.DataFrame(selectairports2)
#print count incidents/accidents by airport
deathsbyairport=dfselectairport2.groupby([‘c143’])[‘c76’].count()
print deathsbyairport
#boolean version to see what’s in the df
selaptmatch=dffaa[‘c76’].isin(selapts)
#The GOLDEN NUGGET on the lookup of the subset and return results
numdeaths=dffaa.loc[dffaa[‘c143′].isin(selapts),’c76’].count()
print (‘—The total number of deaths at all airports: ‘)
print numdeaths
[/code]
Output
Question 3 Part A: The number of incident and accidents by airport are: | |||||
ABQ | A | 5 | |||
I | 2 | ||||
BOI | A | 3 | |||
I | 3 | ||||
CMH | A | 1 | |||
I | 3 | ||||
SEA | A | 3 | |||
I | 3 | ||||
TPA | A | 2 | |||
I | 4 | ||||
Name: c1, dtype: int64 | |||||
Question 3 Part B: The number of deaths by airport are: | |||||
ABQ | 7 | ||||
BOI | 6 | ||||
CMH | 4 | ||||
SEA | 6 | ||||
TPA | 6 | ||||
Name: c76, dtype: int64 | |||||
The total number of deaths at all airports: | |||||
29 |
Task D: Top 10 Causes of Death with Descriptions
[code language="css"]
zip_ref = zipfile.ZipFile(‘d:/GrE4/a2010_14.zip’, ‘r’)
zip_ref.extractall(‘d:/GrE4/a2010_14’)
zip_ref.close()
#assign na values
na_values = [”, ‘N/A’, ‘0’, ‘000000’]
filein = pd.read_table(‘d:/GrE4/a2010_14/A2010_14.txt’, na_values=’NULL’,dtype={‘c78′:’str’})
df=filein.filter(items = [None, ‘c78’])
#fill na’s with blanks so they can be deleted in the next step – did so to only come up with accident causes
df = df.fillna({
‘c78’: ”})
#strip spaces
df[‘c78’]=df[‘c78’].str.strip()
#compress to only lines with values
filter=df[‘c78’] !=””
cldf2=df[filter]
cldf2.to_csv(‘d:/q4cldf.csv’) #check the file
#convert set index and create new df
cldf3 = cldf2.set_index(‘c78’)
cldf4 = pd.DataFrame(cldf3)
sublevel_map = {
’32’: ‘Pilot Fatigue/Windshield’, ’85’:’Loss of Tail/Gusty Winds’, ’73’:’Settling w/power / Pilot Fatigue’, ’72’:’Settling w/ Power / Windshield’}
#cldf4[‘c78’].map(lambda x: sublevel_map[x.lower()])
mapping = {32:’Pilot Fatigue/Windshield’, 85:’Loss of Tail/Gusty Winds’, 73:’Settling w/power / Pilot Fatigue’,72:’Settling w/ Power / Windshield’, 24:’Windshield/Rough Sod’, 27:’Windshield/Settling w/ Power’, 61:’Struck Animal/Vision obscured’, 29:’Windshield/invalid’, 21:’Windshield/Vision Obscured’}
cldf2.index.names = list(map(lambda name: mapping.get(name, name), cldf4.index.names))
cldf2.rename(columns=mapping, inplace=True)
#cldf2.replace([4],[‘jojojo’])
#cldf2 = cldf2′[c78′].replace({’32’:’Pilot Fatigue/Windshield’, ’85’:’Loss of Tail/Gusty Winds’, ’73’:’Settling w/power / Pilot Fatigue’, ’72’:’Settling w/ Power / Windshield’, ’24’:’Windshield/Rough Sod’, ’27’:’Windshield/Settling w/ Power’, ’61’:’Struck Animal/Vision obscured’, ’29’:’Windshield/invalid’, ’21’:’Windshield/Vision Obscured’})
df.rename(columns = {‘c78’: ‘Primary_Cause’}, inplace=True)
#replace the codes with reasons in text
cldf2.is_copy = False
cldf2.loc[cldf2[‘c78’] == “32”, ‘c78′] = ’32: Pilot Fatigue/Windshield’
cldf2.loc[cldf2[‘c78’] == “85”, ‘c78′] = ’85: Loss of Tail/Gusty Winds’
cldf2.loc[cldf2[‘c78’] == “73”, ‘c78′] = ’73: Settling w/power / Pilot Fatigue’
cldf2.loc[cldf2[‘c78’] == “72”, ‘c78′] = ’72: Settling w/ Power / Windshield’
cldf2.loc[cldf2[‘c78’] == “29”, ‘c78′] = ’29: Windshield/invalid’
cldf2.loc[cldf2[‘c78’] == “24”, ‘c78′] = ’24: Windshield/Rough Sod’
cldf2.loc[cldf2[‘c78’] == “27”, ‘c78′] = ’27: Windshield/Settling w/ Power’
cldf2.loc[cldf2[‘c78’] == “61”, ‘c78′] = ’61: Struck Animal/Vision obscured’
cldf2.loc[cldf2[‘c78’] == “21”, ‘c78′] = ’21: Windshield/Vision Obscured’
cldf2.loc[cldf2[‘c78’] == “71”, ‘c78’] = ‘**********************Other********************’
#show the top ten using groupby method
top10=cldf2.groupby([‘c78’]).size().sort_values(inplace=False, ascending=False)[:(10)]
print ‘The Top 10 Reasons are:’, top10
[/code]
Output
The Top 10 Reasons are: c78
32: Pilot Fatigue/Windshield 549
85: Loss of Tail/Gusty Winds 224
73: Settling w/power / Pilot Fatigue 87
72: Settling w/ Power / Windshield 85
24: Windshield/Rough Sod 62
27: Windshield/Settling w/ Power 52
61: Struck Animal/Vision obscured 50
29: Windshield/invalid 44
21: Windshield/Vision Obscured 38
**********************other******************** 35
As shown above, Python Pandas are a fabulous tool to extract information from a database, make inquiries on and summarize data.