Power of Python Pandas

Power of Python Pandas

The ease of extracting and summarizing large amounts of data using Python Pandas is powerful.  Below is an example of using airline data to find out how many passengers went to an airport, accident rate based on reference codes, deaths and the causes of accidents.  With a few lines of code, you can quickly see the power of Python Pandas.  Enjoy!

Examine passenger volume data and incident and accident data for airlines and airports. See the prior posting on previously manipulated air transportation data.

The Data

Data is from the U.S. Dept. of Transportation (USDOT) and the U.S. Federal Aviation Administration (FAA). The USDOT data are for calendar year 2014.

You can find additional documentation on the FAA site, above, if you find that it’s useful.

Note that in the USDOT data file there are commas in quantities that exceed one thousand. Where this is the case, the quantity is enclosed in double quotes.

Python Panda Tasks

Using Python Pandas and the data outlined above, there are four main tasks:

Task A:  Determine for the six U.S. airports ABQ, BOI, BUF, CMH, SEA, and TPA, how many passenger departures and arrivals they each had in 2014, and also which airline was each airport’s largest departure carrier and the largest arrival carrier.

Task B:  For each of these six airports, determine what airports the largest number of arrivals were from, and the airport that the largest number of departures went to.

Task C:  For each of these six airports, determine the number of accidents or incidents that occurred at them between 2010 and 2014 inclusive, according to the FAA. Determine the number of deaths that occurred in each event for each of these five airports.

Task D:  Determine what the top ten (primary) causes of 2010-2014 incidents and accidents are for all events resulting in deaths regardless of where they occurred. Provide descriptions (not codes) for the causes.

Air trips and bad things – winter 2017 CC by SA Lynd Bacon & Associates, Ltd. DBA Lorma Buena Associates is licensed under CC BY 4.0

Task A:  Passenger Volume for Six Airports and Largest Carrier of Departures and Arrivals

import pandas as pd

import csv

filein=pd.read_csv(“d:/GrEx4/2014 CY-YTD Passenger Raw Data.csv”, encoding=’utf-8′, thousands=’,’,

na_values=”)

df=filein.filter(items=[‘OriginApt’, ‘DestApt’, ‘Total’, ‘Carrier’])

#choose the selected airports

selapts =[‘ABQ’, ‘BOI’, ‘BUF’, ‘CMH’, ‘SEA’, ‘TPA’]

#create a dataframe

dfselect=pd.DataFrame(df)

#reduce the dataframe to only selectedairports

#————Question 1A ———

paxdf = pd.DataFrame(filein, columns = [‘OriginApt’, ‘DestApt’, ‘Carrier’, ‘Total’ ])

selectoriginapt = paxdf[paxdf.OriginApt.isin([‘ABQ’, ‘BOI’, ‘BUF’, ‘CMH’, ‘SEA’, ‘TPA’])]

q1paxbyapt = selectoriginapt.pivot_table([‘Total’], index = [‘OriginApt’], aggfunc=sum)

print (‘\nQ1: Largest airports by number of passengers’)

print (q1paxbyapt)

#————–Questin 1B—————

#part b largest carriers by departrues or OriginApt

selairpots = dfselect.loc[dfselect[‘OriginApt’].isin(selapts),(‘OriginApt’, ‘DestApt’, ‘Carrier’, ‘Total’)]

q1carriersbyorigin = pd.DataFrame(selairpots)

carrbyorg=q1carriersbyorigin.loc[q1carriersbyorigin[‘OriginApt’].isin(selapts),([‘OriginApt’,

‘DestApt’,’Carrier’, ‘Total’])]

carrbyorg[‘Total’].convert_objects(convert_numeric=True)

AptsLgArrivalsQ1B = carrbyorg.groupby([‘OriginApt’])[‘Carrier’, ‘Total’].max()

print (‘Select Airports largest carriers by Origin’)

print (AptsLgArrivalsQ1B)

print (‘Select Airports largest carriers by Destination: NA – no flight based on selected airports in DestApt’)

selairpotsdest =dfselect.loc[dfselect[‘DestApt’].isin(selapts),( ‘DestApt’,’OriginApt’, ‘Carrier’, ‘Total’)]

q1carriersbydest = pd.DataFrame(selairpotsdest)

carrbydest=q1carriersbydest.loc[q1carriersbydest[‘DestApt’].isin(selairpotsdest),([‘OriginApt’, ‘DestApt’,’Carrier’, ‘Total’])]

AptsLgArrivalsQ1B = carrbydest.groupby([‘DestApt’])[‘Carrier’, ‘Total’].max()

[/code]

Output

Q1  Largest airports by number of passengers
Total
OriginApt
ABQ 264
BOI 98
BUF 3891
CMH 57213
SEA 3678369
TPA 615820
Select airports largest carriers by Origin
Carrier Total
OriginApt
ABQ UA 133
BOI ZX 59
BUF YX 272
CMH XP 4617
SEA UA 33414
TPA WS 25275

Task B:  Largest Arrivals and Departures based on Six Selected Airports

[code language="css"]

import pandas as pd

import csv

#Read File

filein=pd.read_csv(“d:/GrEx4/2014 CY-YTD Passenger Raw Data.csv”, encoding

=’utf-8′, thousands=’,’, na_values=”)

df=filein.filter(items=[‘OriginApt’, ‘DestApt’, ‘Total’, ‘Carrier’])

#choose the selected airports

selapts =[‘ABQ’, ‘BOI’, ‘BUF’, ‘CMH’, ‘SEA’, ‘TPA’]

#create a dataframe

dfselect=pd.DataFrame(df)

#reduce the dataframe to only selectedairports

selairpots =dfselect.loc[dfselect[‘OriginApt’].isin(selapts),(‘OriginApt’,

‘DestApt’, ‘Total’)]

#part 2 – no airlines in the selapts are in the 6 selected airports

selairpots2 =dfselect.loc[dfselect[‘DestApt’].isin(selapts),(‘OriginApt’,

‘DestApt’, ‘Total’)]

#create subset of dataframe to use for this portion of the question

q1arrivalsbyorigin = pd.DataFrame(selairpots)

NumArrivalbyOrig3=q1arrivalsbyorigin.loc[q1arrivalsbyorigin[‘OriginApt’].isin(selapts),([‘OriginApt’, ‘DestApt’, ‘Total’])]

#Convert the column to numbers – two lines leave the below in otherwise will create an error

#ck on types of data – 2 lines

q1arrivalsbyorigin[‘Total’] = q1arrivalsbyorigin[‘Total’].apply(pd.to_numeric, errors=’coerce’) #messes up numbers don’t use

q1arrivalsbyorigin[‘Total’].convert_objects(convert_numeric=True)

#set index

print (‘Question 2 Part A: The number of arrivals from each origin are: ‘)

AptsLgArrivalsQ2A = q1arrivalsbyorigin.groupby([‘OriginApt’])[‘DestApt’, ‘Total’].max()

print (AptsLgArrivalsQ2A)

print (‘Question 2 Part A: There are no selected airports in the destination ‘)

[/code]

Output:

Question 2 Part A: The number of arrivals from each origin are:
OriginApt DestApt Total
ABQ YVR 133
BOI YYZ 59
BUF YYZ 272
CMH YYZ 4617
SEA YYZ 33414
TPA ZRH 25275
Question 2 Part A: There are no selected airports in the destination

Task C:  Number of Accidents/Incidents at Selected Airports with the Associated Number of Deaths.

[code language="css"]

import pandas as pd

import zipfile

#read the zip file

zip_ref = zipfile.ZipFile(‘d:/GrE4/a2010_14.zip’, ‘r’)

zip_ref.extractall(‘d:/GrE4/a2010_14’)

zip_ref.close()

#assign na values

na_values = [”, ‘N/A’, ‘0’, ‘000000’]

filein = pd.read_table(‘d:/GrE4/a2010_14/A2010_14.txt’, na_values=’NULL’,)

df=filein.filter(items=[‘c143’, ‘c76’, ‘c1’])

#remove spaces from the end of the words

df[‘c143’] = df[‘c143’].str.strip()

df[‘c1’] = df[‘c1’].str.strip()

#put the cleaned version into a dataframe

dffaa = pd.DataFrame(df)

dffaa2 = pd.DataFrame(df)

dffaa2.to_csv(‘d:/20170223df2-dffaa2.csv’)

selapts = [‘ABQ’, ‘BOI’, ‘BUF’, ‘CMH’, ‘SEA’, ‘TPA’]

print (‘———-incidents ——————‘)

#The GOLDEN NUGGET on the lookup of the subset and return results

numincidents=dffaa2.loc[dffaa2[‘c143′].isin(selapts),’c1’]

selectairports=dffaa2.loc[dffaa2[‘c143’].isin(selapts),([‘c143’, ‘c1’]) ]

dffaa2.set_index([‘c143’, ‘c1’])

print (‘Question 3 Part A: The number of incident and accidents by airport are:’)

#Create a new dataframe

dfselectairport=pd.DataFrame(selectairports)

#print count incidents/accidents by airport

NumAIbyAirport=dfselectairport.groupby([‘c143’, ‘c1’])[‘c1’].count()

print NumAIbyAirport

#The GOLDEN NUGGET on the lookup of the subset and return results

numincidents=dffaa2.loc[dffaa2[‘c143′].isin(selapts),’c76’]

selectairports2=dffaa.loc[dffaa[‘c143’].isin(selapts),([‘c143’, ‘c76’]) ]

dffaa.set_index([‘c143’, ‘c76’])

#——————————

print (‘Question 3 Part B: The number of deaths by airport are:’)

#Create a new dataframe

dfselectairport2=pd.DataFrame(selectairports2)

#print count incidents/accidents by airport

deathsbyairport=dfselectairport2.groupby([‘c143’])[‘c76’].count()

print deathsbyairport

#boolean version to see what’s in the df

selaptmatch=dffaa[‘c76’].isin(selapts)

#The GOLDEN NUGGET on the lookup of the subset and return results

numdeaths=dffaa.loc[dffaa[‘c143′].isin(selapts),’c76’].count()

print (‘—The total number of deaths at all airports: ‘)

print numdeaths

[/code]
Output

Question 3 Part A: The number of incident and accidents by airport are:
ABQ A 5
I 2
BOI A 3
I 3
CMH A 1
I 3
SEA A 3
I 3
TPA A 2
I 4
Name: c1, dtype: int64
Question 3 Part B: The number of deaths by airport are:
ABQ 7
BOI 6
CMH 4
SEA 6
TPA 6
Name: c76, dtype: int64
The total number of deaths at all airports:
29

Task D:  Top 10 Causes of Death with Descriptions

[code language="css"]

zip_ref = zipfile.ZipFile(‘d:/GrE4/a2010_14.zip’, ‘r’)

zip_ref.extractall(‘d:/GrE4/a2010_14’)

zip_ref.close()

#assign na values

na_values = [”, ‘N/A’, ‘0’, ‘000000’]

filein = pd.read_table(‘d:/GrE4/a2010_14/A2010_14.txt’, na_values=’NULL’,dtype={‘c78′:’str’})

df=filein.filter(items = [None, ‘c78’])

#fill na’s with blanks so they can be deleted in the next step – did so to only come up with accident causes

df = df.fillna({

‘c78’: ”})

#strip spaces

df[‘c78’]=df[‘c78’].str.strip()

#compress to only lines with values

filter=df[‘c78’] !=””

cldf2=df[filter]

cldf2.to_csv(‘d:/q4cldf.csv’) #check the file

#convert set index and create new df

cldf3 = cldf2.set_index(‘c78’)

cldf4 = pd.DataFrame(cldf3)

sublevel_map = {

’32’: ‘Pilot Fatigue/Windshield’, ’85’:’Loss of Tail/Gusty Winds’, ’73’:’Settling w/power / Pilot Fatigue’, ’72’:’Settling w/ Power / Windshield’}

#cldf4[‘c78’].map(lambda x: sublevel_map[x.lower()])

mapping = {32:’Pilot Fatigue/Windshield’, 85:’Loss of Tail/Gusty Winds’, 73:’Settling w/power / Pilot Fatigue’,72:’Settling w/ Power / Windshield’, 24:’Windshield/Rough Sod’, 27:’Windshield/Settling w/ Power’, 61:’Struck Animal/Vision obscured’, 29:’Windshield/invalid’, 21:’Windshield/Vision Obscured’}

cldf2.index.names = list(map(lambda name: mapping.get(name, name), cldf4.index.names))

cldf2.rename(columns=mapping, inplace=True)

#cldf2.replace([4],[‘jojojo’])

#cldf2 = cldf2′[c78′].replace({’32’:’Pilot Fatigue/Windshield’, ’85’:’Loss of Tail/Gusty Winds’, ’73’:’Settling w/power / Pilot Fatigue’, ’72’:’Settling w/ Power / Windshield’, ’24’:’Windshield/Rough Sod’, ’27’:’Windshield/Settling w/ Power’, ’61’:’Struck Animal/Vision obscured’, ’29’:’Windshield/invalid’, ’21’:’Windshield/Vision Obscured’})

df.rename(columns = {‘c78’: ‘Primary_Cause’}, inplace=True)

#replace the codes with reasons in text

cldf2.is_copy = False

cldf2.loc[cldf2[‘c78’] == “32”, ‘c78′] = ’32: Pilot Fatigue/Windshield’

cldf2.loc[cldf2[‘c78’] == “85”, ‘c78′] = ’85: Loss of Tail/Gusty Winds’

cldf2.loc[cldf2[‘c78’] == “73”, ‘c78′] = ’73: Settling w/power / Pilot Fatigue’

cldf2.loc[cldf2[‘c78’] == “72”, ‘c78′] = ’72: Settling w/ Power / Windshield’

cldf2.loc[cldf2[‘c78’] == “29”, ‘c78′] = ’29: Windshield/invalid’

cldf2.loc[cldf2[‘c78’] == “24”, ‘c78′] = ’24: Windshield/Rough Sod’

cldf2.loc[cldf2[‘c78’] == “27”, ‘c78′] = ’27: Windshield/Settling w/ Power’

cldf2.loc[cldf2[‘c78’] == “61”, ‘c78′] = ’61: Struck Animal/Vision obscured’

cldf2.loc[cldf2[‘c78’] == “21”, ‘c78′] = ’21: Windshield/Vision Obscured’

cldf2.loc[cldf2[‘c78’] == “71”, ‘c78’] = ‘**********************Other********************’

#show the top ten using groupby method

top10=cldf2.groupby([‘c78’]).size().sort_values(inplace=False, ascending=False)[:(10)]

print ‘The Top 10 Reasons are:’, top10

[/code]

Output

The Top 10 Reasons are: c78
32: Pilot Fatigue/Windshield 549
85: Loss of Tail/Gusty Winds 224
73: Settling w/power / Pilot Fatigue 87
72: Settling w/ Power / Windshield 85
24: Windshield/Rough Sod 62
27: Windshield/Settling w/ Power 52
61: Struck Animal/Vision obscured 50
29: Windshield/invalid 44
21: Windshield/Vision Obscured 38
**********************other******************** 35

As shown above, Python Pandas are a fabulous tool to extract information from a database, make inquiries on and summarize data.

Leave a Reply

%d bloggers like this: