Power of Python Pandas

February 25, 2017 admin

Power of Python Pandas

The ease of extracting and summarizing large amounts of data using Python Pandas is powerful. Below is an example of using airline data to find out how many passengers went to an airport, accident rate based on reference codes, deaths and the causes of accidents. With a few lines of code, you can quickly see the power of Python Pandas. Enjoy!

Examine passenger volume data and incident and accident data for airlines and airports. See the prior posting on previously manipulated air transportation data.

The Data

Data is from the U.S. Dept. of Transportation (USDOT) and the U.S. Federal Aviation Administration (FAA). The USDOT data are for calendar year 2014.

Passengers who traveled are in the csv file 2014 CY-YTD Passenger Raw Data_2.csv a2010_14
Information on this data can be found at: http://www.transportation.gov/policy/aviation-policy/us-international-air-passenger-and-freight-statistics-report#TheData
The FAA data are “AID” data: they’re about air transportation accidents and incidents. They can be found at: http://av-info.faa.gov/dd_sublevel.asp?Folder=\AID
The FAA data used are in the file zip, which is a zip archived tab delimited file (It should be tab delimited, but it might be fixed field)
The layout of this file is in txt Afilelayout
The FAA airport codes are in txt Airport
Descriptions of various codes are in DOC Airport

You can find additional documentation on the FAA site, above, if you find that it’s useful.

Note that in the USDOT data file there are commas in quantities that exceed one thousand. Where this is the case, the quantity is enclosed in double quotes.

Python Panda Tasks

Using Python Pandas and the data outlined above, there are four main tasks:

Task A: Determine for the six U.S. airports ABQ, BOI, BUF, CMH, SEA, and TPA, how many passenger departures and arrivals they each had in 2014, and also which airline was each airport’s largest departure carrier and the largest arrival carrier.

Task B: For each of these six airports, determine what airports the largest number of arrivals were from, and the airport that the largest number of departures went to.

Task C: For each of these six airports, determine the number of accidents or incidents that occurred at them between 2010 and 2014 inclusive, according to the FAA. Determine the number of deaths that occurred in each event for each of these five airports.

Task D: Determine what the top ten (primary) causes of 2010-2014 incidents and accidents are for all events resulting in deaths regardless of where they occurred. Provide descriptions (not codes) for the causes.

Air trips and bad things – winter 2017 CC by SA Lynd Bacon & Associates, Ltd. DBA Lorma Buena Associates is licensed under CC BY 4.0

Task A: Passenger Volume for Six Airports and Largest Carrier of Departures and Arrivals

import pandas as pd

import csv

filein=pd.read_csv(“d:/GrEx4/2014 CY-YTD Passenger Raw Data.csv”, encoding=’utf-8′, thousands=’,’,

na_values=”)

df=filein.filter(items=[‘OriginApt’, ‘DestApt’, ‘Total’, ‘Carrier’])

#choose the selected airports

selapts =[‘ABQ’, ‘BOI’, ‘BUF’, ‘CMH’, ‘SEA’, ‘TPA’]

#create a dataframe

dfselect=pd.DataFrame(df)

#reduce the dataframe to only selectedairports

#————Question 1A ———

paxdf = pd.DataFrame(filein, columns = [‘OriginApt’, ‘DestApt’, ‘Carrier’, ‘Total’ ])

selectoriginapt = paxdf[paxdf.OriginApt.isin([‘ABQ’, ‘BOI’, ‘BUF’, ‘CMH’, ‘SEA’, ‘TPA’])]

q1paxbyapt = selectoriginapt.pivot_table([‘Total’], index = [‘OriginApt’], aggfunc=sum)

print (‘\nQ1: Largest airports by number of passengers’)

print (q1paxbyapt)

#————–Questin 1B—————

#part b largest carriers by departrues or OriginApt

selairpots = dfselect.loc[dfselect[‘OriginApt’].isin(selapts),(‘OriginApt’, ‘DestApt’, ‘Carrier’, ‘Total’)]

q1carriersbyorigin = pd.DataFrame(selairpots)

carrbyorg=q1carriersbyorigin.loc[q1carriersbyorigin[‘OriginApt’].isin(selapts),([‘OriginApt’,

‘DestApt’,’Carrier’, ‘Total’])]

carrbyorg[‘Total’].convert_objects(convert_numeric=True)

AptsLgArrivalsQ1B = carrbyorg.groupby([‘OriginApt’])[‘Carrier’, ‘Total’].max()

print (‘Select Airports largest carriers by Origin’)

print (AptsLgArrivalsQ1B)

print (‘Select Airports largest carriers by Destination: NA – no flight based on selected airports in DestApt’)

selairpotsdest =dfselect.loc[dfselect[‘DestApt’].isin(selapts),( ‘DestApt’,’OriginApt’, ‘Carrier’, ‘Total’)]

q1carriersbydest = pd.DataFrame(selairpotsdest)

carrbydest=q1carriersbydest.loc[q1carriersbydest[‘DestApt’].isin(selairpotsdest),([‘OriginApt’, ‘DestApt’,’Carrier’, ‘Total’])]

AptsLgArrivalsQ1B = carrbydest.groupby([‘DestApt’])[‘Carrier’, ‘Total’].max()

[/code]

Output

Q1 Largest airports by number of passengers
	Total
OriginApt
ABQ	264
BOI	98
BUF	3891
CMH	57213
SEA	3678369
TPA	615820

Select airports largest carriers by Origin
	Carrier	Total
OriginApt
ABQ	UA	133
BOI	ZX	59
BUF	YX	272
CMH	XP	4617
SEA	UA	33414
TPA	WS	25275

Task B: Largest Arrivals and Departures based on Six Selected Airports

[code language="css"]

import pandas as pd

import csv

#Read File

filein=pd.read_csv(“d:/GrEx4/2014 CY-YTD Passenger Raw Data.csv”, encoding

=’utf-8′, thousands=’,’, na_values=”)

df=filein.filter(items=[‘OriginApt’, ‘DestApt’, ‘Total’, ‘Carrier’])

#choose the selected airports

selapts =[‘ABQ’, ‘BOI’, ‘BUF’, ‘CMH’, ‘SEA’, ‘TPA’]

#create a dataframe

dfselect=pd.DataFrame(df)

#reduce the dataframe to only selectedairports

selairpots =dfselect.loc[dfselect[‘OriginApt’].isin(selapts),(‘OriginApt’,

‘DestApt’, ‘Total’)]

#part 2 – no airlines in the selapts are in the 6 selected airports

selairpots2 =dfselect.loc[dfselect[‘DestApt’].isin(selapts),(‘OriginApt’,

‘DestApt’, ‘Total’)]

#create subset of dataframe to use for this portion of the question

q1arrivalsbyorigin = pd.DataFrame(selairpots)

NumArrivalbyOrig3=q1arrivalsbyorigin.loc[q1arrivalsbyorigin[‘OriginApt’].isin(selapts),([‘OriginApt’, ‘DestApt’, ‘Total’])]

#Convert the column to numbers – two lines leave the below in otherwise will create an error

#ck on types of data – 2 lines

q1arrivalsbyorigin[‘Total’] = q1arrivalsbyorigin[‘Total’].apply(pd.to_numeric, errors=’coerce’) #messes up numbers don’t use

q1arrivalsbyorigin[‘Total’].convert_objects(convert_numeric=True)

#set index

print (‘Question 2 Part A: The number of arrivals from each origin are: ‘)

AptsLgArrivalsQ2A = q1arrivalsbyorigin.groupby([‘OriginApt’])[‘DestApt’, ‘Total’].max()

print (AptsLgArrivalsQ2A)

print (‘Question 2 Part A: There are no selected airports in the destination ‘)

[/code]

Output:

Question 2 Part A: The number of arrivals from each origin are:

OriginApt	DestApt	Total
ABQ	YVR	133
BOI	YYZ	59
BUF	YYZ	272
CMH	YYZ	4617
SEA	YYZ	33414
TPA	ZRH	25275

Question 2 Part A: There are no selected airports in the destination

Task C: Number of Accidents/Incidents at Selected Airports with the Associated Number of Deaths.

[code language="css"]

import pandas as pd

import zipfile

#read the zip file

zip_ref = zipfile.ZipFile(‘d:/GrE4/a2010_14.zip’, ‘r’)

zip_ref.extractall(‘d:/GrE4/a2010_14’)

zip_ref.close()

#assign na values

na_values = [”, ‘N/A’, ‘0’, ‘000000’]

filein = pd.read_table(‘d:/GrE4/a2010_14/A2010_14.txt’, na_values=’NULL’,)

df=filein.filter(items=[‘c143’, ‘c76’, ‘c1’])

#remove spaces from the end of the words

df[‘c143’] = df[‘c143’].str.strip()

df[‘c1’] = df[‘c1’].str.strip()

#put the cleaned version into a dataframe

dffaa = pd.DataFrame(df)

dffaa2 = pd.DataFrame(df)

dffaa2.to_csv(‘d:/20170223df2-dffaa2.csv’)

selapts = [‘ABQ’, ‘BOI’, ‘BUF’, ‘CMH’, ‘SEA’, ‘TPA’]

print (‘———-incidents ——————‘)

#The GOLDEN NUGGET on the lookup of the subset and return results

numincidents=dffaa2.loc[dffaa2[‘c143′].isin(selapts),’c1’]

selectairports=dffaa2.loc[dffaa2[‘c143’].isin(selapts),([‘c143’, ‘c1’]) ]

dffaa2.set_index([‘c143’, ‘c1’])

print (‘Question 3 Part A: The number of incident and accidents by airport are:’)

#Create a new dataframe

dfselectairport=pd.DataFrame(selectairports)

#print count incidents/accidents by airport

NumAIbyAirport=dfselectairport.groupby([‘c143’, ‘c1’])[‘c1’].count()

print NumAIbyAirport

#The GOLDEN NUGGET on the lookup of the subset and return results

numincidents=dffaa2.loc[dffaa2[‘c143′].isin(selapts),’c76’]

selectairports2=dffaa.loc[dffaa[‘c143’].isin(selapts),([‘c143’, ‘c76’]) ]

dffaa.set_index([‘c143’, ‘c76’])

#——————————

print (‘Question 3 Part B: The number of deaths by airport are:’)

#Create a new dataframe

dfselectairport2=pd.DataFrame(selectairports2)

#print count incidents/accidents by airport

deathsbyairport=dfselectairport2.groupby([‘c143’])[‘c76’].count()

print deathsbyairport

#boolean version to see what’s in the df

selaptmatch=dffaa[‘c76’].isin(selapts)

#The GOLDEN NUGGET on the lookup of the subset and return results

numdeaths=dffaa.loc[dffaa[‘c143′].isin(selapts),’c76’].count()

print (‘—The total number of deaths at all airports: ‘)

print numdeaths

[/code]
Output

Question 3 Part A: The number of incident and accidents by airport are:
ABQ	A	5
	I	2
BOI	A	3
	I	3
CMH	A	1
	I	3
SEA	A	3
	I	3
TPA	A	2
	I	4
Name: c1, dtype: int64

Question 3 Part B: The number of deaths by airport are:
ABQ	7
BOI	6
CMH	4
SEA	6
TPA	6
Name: c76, dtype: int64

The total number of deaths at all airports:
	29

Task D: Top 10 Causes of Death with Descriptions

[code language="css"]

zip_ref = zipfile.ZipFile(‘d:/GrE4/a2010_14.zip’, ‘r’)

zip_ref.extractall(‘d:/GrE4/a2010_14’)

zip_ref.close()

#assign na values

na_values = [”, ‘N/A’, ‘0’, ‘000000’]

filein = pd.read_table(‘d:/GrE4/a2010_14/A2010_14.txt’, na_values=’NULL’,dtype={‘c78′:’str’})

df=filein.filter(items = [None, ‘c78’])

#fill na’s with blanks so they can be deleted in the next step – did so to only come up with accident causes

df = df.fillna({

‘c78’: ”})

#strip spaces

df[‘c78’]=df[‘c78’].str.strip()

#compress to only lines with values

filter=df[‘c78’] !=””

cldf2=df[filter]

cldf2.to_csv(‘d:/q4cldf.csv’) #check the file

#convert set index and create new df

cldf3 = cldf2.set_index(‘c78’)

cldf4 = pd.DataFrame(cldf3)

sublevel_map = {

’32’: ‘Pilot Fatigue/Windshield’, ’85’:’Loss of Tail/Gusty Winds’, ’73’:’Settling w/power / Pilot Fatigue’, ’72’:’Settling w/ Power / Windshield’}

#cldf4[‘c78’].map(lambda x: sublevel_map[x.lower()])

mapping = {32:’Pilot Fatigue/Windshield’, 85:’Loss of Tail/Gusty Winds’, 73:’Settling w/power / Pilot Fatigue’,72:’Settling w/ Power / Windshield’, 24:’Windshield/Rough Sod’, 27:’Windshield/Settling w/ Power’, 61:’Struck Animal/Vision obscured’, 29:’Windshield/invalid’, 21:’Windshield/Vision Obscured’}

cldf2.index.names = list(map(lambda name: mapping.get(name, name), cldf4.index.names))

cldf2.rename(columns=mapping, inplace=True)

#cldf2.replace([4],[‘jojojo’])

#cldf2 = cldf2′[c78′].replace({’32’:’Pilot Fatigue/Windshield’, ’85’:’Loss of Tail/Gusty Winds’, ’73’:’Settling w/power / Pilot Fatigue’, ’72’:’Settling w/ Power / Windshield’, ’24’:’Windshield/Rough Sod’, ’27’:’Windshield/Settling w/ Power’, ’61’:’Struck Animal/Vision obscured’, ’29’:’Windshield/invalid’, ’21’:’Windshield/Vision Obscured’})

df.rename(columns = {‘c78’: ‘Primary_Cause’}, inplace=True)

#replace the codes with reasons in text

cldf2.is_copy = False

cldf2.loc[cldf2[‘c78’] == “32”, ‘c78′] = ’32: Pilot Fatigue/Windshield’

cldf2.loc[cldf2[‘c78’] == “85”, ‘c78′] = ’85: Loss of Tail/Gusty Winds’

cldf2.loc[cldf2[‘c78’] == “73”, ‘c78′] = ’73: Settling w/power / Pilot Fatigue’

cldf2.loc[cldf2[‘c78’] == “72”, ‘c78′] = ’72: Settling w/ Power / Windshield’

cldf2.loc[cldf2[‘c78’] == “29”, ‘c78′] = ’29: Windshield/invalid’

cldf2.loc[cldf2[‘c78’] == “24”, ‘c78′] = ’24: Windshield/Rough Sod’

cldf2.loc[cldf2[‘c78’] == “27”, ‘c78′] = ’27: Windshield/Settling w/ Power’

cldf2.loc[cldf2[‘c78’] == “61”, ‘c78′] = ’61: Struck Animal/Vision obscured’

cldf2.loc[cldf2[‘c78’] == “21”, ‘c78′] = ’21: Windshield/Vision Obscured’

cldf2.loc[cldf2[‘c78’] == “71”, ‘c78’] = ‘**********************Other********************’

#show the top ten using groupby method

top10=cldf2.groupby([‘c78’]).size().sort_values(inplace=False, ascending=False)[:(10)]

print ‘The Top 10 Reasons are:’, top10

[/code]

Output

The Top 10 Reasons are: c78
32: Pilot Fatigue/Windshield 549
85: Loss of Tail/Gusty Winds 224
73: Settling w/power / Pilot Fatigue 87
72: Settling w/ Power / Windshield 85
24: Windshield/Rough Sod 62
27: Windshield/Settling w/ Power 52
61: Struck Animal/Vision obscured 50
29: Windshield/invalid 44
21: Windshield/Vision Obscured 38
**********************other******************** 35

As shown above, Python Pandas are a fabulous tool to extract information from a database, make inquiries on and summarize data.

Power of Python Pandas

Power of Python Pandas

Python Panda Tasks

Task A: Passenger Volume for Six Airports and Largest Carrier of Departures and Arrivals

Output

Task B: Largest Arrivals and Departures based on Six Selected Airports

[/code]

Output:

Task C: Number of Accidents/Incidents at Selected Airports with the Associated Number of Deaths.

[/code]
Output

Task D: Top 10 Causes of Death with Descriptions

Output

Related Posts

Leave a Reply Cancel reply