
Predicting Delayed Flights. The file FlightDelays.csv contains information on all commercial flights departing the Washington, DC area and arriving at the New York area during January 2004. For each flight, there is information on the departure and arrival airports, the distance of the route, the scheduled time and date of the flight, and so on. Below is the data description:

Variable Description

CRS_DEP_TIME Scheduled departure time.

CARRIER Eight airline codes: CO (Continental), DH (Atlantic Coast), DL (Delta), MQ (American Eagle), OH (Comair), RU (Continental Express), UA (United), and US (USAirways).

DEP_TIME Departure time.

DEST Destination. Three airport codes: JFK (Kennedy), LGA (LaGuardia), EWR (Newark).

DIST Distance in miles.

FL_DATE Flight date.

FL_NUM Flight number.

ORIGIN The origin airport. Three airport codes: DCA (Reagan National), IAD (Dulles), BWI (Baltimore–Washington Int’l).

DAY_WEEK Day of week, coded as 1 = Monday, 2 = Tuesday, ..., 7 = Sunday.

WEATHER 1 if the weather is inclement and 0 otherwise.

DAY_OF_MONTH Day of the month.

TAIL_NUM Aircraft number.

Flight_Status The flight status, either delayed or ontime.

The variable that we are trying to predict is whether or not a flight is delayed. A delay is defined as an arrival that is at least 15 minutes later than scheduled.

Data Preprocessing. Load FlightDelays.csv into the R environment. Do not include DEP_TIME (actual departure time) in the model because it is unknown at the time of prediction (unless we are generating our predictions of delays after the plane takes off, which is unlikely). Also eliminate FL_DATE, FL_NUM, DAY_OF_MONTH, and TAIL_NUM.

1. What is the class of each variable in the imported data frame? Transform the day of the week variable (DAY_WEEK) into a factor. Bin the scheduled departure time (CRS_DEP_TIME) into eight bins (use the cut() function). Partition the data into training and validation sets.
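A minimal base R sketch of the preprocessing steps above. It assumes the CSV column names match the data description (the actual file may name some columns differently). The helper names prepare_flights() and partition_flights(), the 60/40 split, and the seed are illustrative choices, not part of the assignment.

```r
# Sketch only: column names are ASSUMED to match the data description above.
prepare_flights <- function(flights) {
  # Drop the actual departure time and the identifier-like columns.
  drop_cols <- c("DEP_TIME", "FL_DATE", "FL_NUM", "DAY_OF_MONTH", "TAIL_NUM")
  flights <- flights[, !(names(flights) %in% drop_cols), drop = FALSE]

  # Day of week is a category, not a quantity.
  flights$DAY_WEEK <- factor(flights$DAY_WEEK)

  # Bin the scheduled departure time into eight equal-width intervals.
  flights$CRS_DEP_TIME <- cut(flights$CRS_DEP_TIME, breaks = 8)

  # Convert remaining character columns (carrier, airports, status) to factors.
  for (col in names(flights)[vapply(flights, is.character, logical(1))]) {
    flights[[col]] <- factor(flights[[col]])
  }
  flights
}

# Random split; train_frac and seed are hypothetical defaults.
partition_flights <- function(flights, train_frac = 0.6, seed = 1) {
  set.seed(seed)
  train_idx <- sample(nrow(flights), size = round(train_frac * nrow(flights)))
  list(train = flights[train_idx, ], valid = flights[-train_idx, ])
}

if (file.exists("FlightDelays.csv")) {
  raw <- read.csv("FlightDelays.csv")
  print(sapply(raw, class))          # class of each imported variable
  flights <- prepare_flights(raw)
  parts <- partition_flights(flights)
  train_df <- parts$train
  valid_df <- parts$valid
}
```

Note that cut() with breaks = 8 produces equal-width bins over the observed range; supplying explicit break points is an equally valid reading of the question.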

2. Fit a classification tree to the flight delay variable using all relevant predictors (in the non-partitioned data set). For the tree depth, use a maximum of 8 levels (using the maxdepth option) and set cp = 0. Plot the tree using the prp() function and express the resulting tree as a set of rules (use the style = "tall" option in the rpart.rules() function for better readability). How many rules (or terminal tree nodes) are there?
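One way the tree fit might look with rpart, sketched under the assumption that a preprocessed data frame named flights exists with a factor outcome column Flight_Status. The helper names fit_full_tree() and count_rules() are hypothetical; prp() and rpart.rules() come from the rpart.plot package, and the plotting arguments shown are only one reasonable choice.

```r
library(rpart)

# `flights` is ASSUMED to be the preprocessed, non-partitioned data frame.
fit_full_tree <- function(flights) {
  rpart(Flight_Status ~ ., data = flights, method = "class",
        control = rpart.control(maxdepth = 8, cp = 0))
}

# Each terminal node (leaf) corresponds to one rule.
count_rules <- function(tree) sum(tree$frame$var == "<leaf>")

if (exists("flights") && requireNamespace("rpart.plot", quietly = TRUE)) {
  full_tree <- fit_full_tree(flights)
  rpart.plot::prp(full_tree, type = 1, extra = 1, under = TRUE)
  print(rpart.plot::rpart.rules(full_tree, style = "tall"))
  print(count_rules(full_tree))
}
```

Other rpart.control() defaults (for example minsplit and minbucket) are left unchanged here, so they still limit how far the tree grows even with cp = 0.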

3. Interpret the numbers in the top two node levels of your tree. Also interpret the information in the top 2 rules that your tree has generated.

4. Using the train() function of the caret package, plot the model accuracy (this time, using the training set) against the cp parameter. At what value of the cp parameter is accuracy maximized?
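A hedged sketch of the cp search with caret. It assumes a training partition named train_df with the factor outcome Flight_Status. The 10-fold cross-validation, the grid size, and the helper name tune_cp() are assumptions; caret's default resampling scheme is the bootstrap, so results will differ if trControl is omitted.

```r
library(caret)

# `train_df` is ASSUMED to be the preprocessed training partition.
tune_cp <- function(train_df, n_cp = 20, seed = 1) {
  set.seed(seed)
  train(Flight_Status ~ ., data = train_df, method = "rpart",
        trControl = trainControl(method = "cv", number = 10),
        tuneLength = n_cp)   # number of candidate cp values to try
}

if (exists("train_df")) {
  cp_fit <- tune_cp(train_df)
  print(plot(cp_fit))          # resampled accuracy against cp
  print(cp_fit$bestTune$cp)    # cp value with the highest accuracy
}
```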

5. Prune and plot the tree, again using the training dataset and the optimal value of the cp parameter obtained above. You will find that the pruned tree contains a single terminal node.

a. How is the pruned tree used for classification? (What is the rule for classifying?)

b. To what is this rule equivalent?

c. Why, technically, does the pruned tree result in a single node?
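The pruning step can be sketched with rpart's prune() function. The helper names below are hypothetical, and full_tree and best_cp are assumed to come from earlier steps. A tree pruned back to its root assigns the same class to every record, which the last helper makes easy to compare against the most frequent class in the training data.

```r
library(rpart)

# Prune a fitted rpart tree and report how many terminal nodes remain.
prune_and_describe <- function(tree, best_cp) {
  pruned <- prune(tree, cp = best_cp)
  list(tree = pruned, n_leaves = sum(pruned$frame$var == "<leaf>"))
}

# Most frequent class label in a vector of outcomes.
majority_class <- function(y) names(which.max(table(y)))

if (exists("full_tree") && exists("best_cp") && exists("train_df")) {
  res <- prune_and_describe(full_tree, best_cp)
  print(res$n_leaves)
  print(table(predict(res$tree, train_df, type = "class")))
  print(majority_class(train_df$Flight_Status))
}
```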

6. If you were relying on the unpruned tree (from question 2) and needed to fly between DCA and EWR on a Monday at 7:00 AM, would you be able to use the (unpruned) tree to predict whether your flight will be delayed? What other information might you need? Is it available in practice?

7. Exclude the WEATHER predictor from the dataset and repeat the steps in questions 1-4. Be sure to display both the pruned and unpruned trees.

a. Examine the pruned tree. What are the top three predictors according to this tree?

b. Using the validation set, obtain and display the confusion matrix and accuracy of the pruned tree.
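A confusion matrix and accuracy can be computed in base R as sketched below. The helper name confusion_and_accuracy() is hypothetical, and pruned_tree and valid_df are assumed to exist from earlier steps; caret's confusionMatrix() function is an alternative that reports additional statistics.

```r
# Cross-tabulate predicted against actual classes and compute accuracy.
confusion_and_accuracy <- function(actual, predicted) {
  # Use a common level set so the table is square even if a class is never predicted.
  lv <- union(levels(factor(actual)), levels(factor(predicted)))
  cm <- table(Predicted = factor(predicted, levels = lv),
              Actual = factor(actual, levels = lv))
  list(table = cm, accuracy = sum(diag(cm)) / sum(cm))
}

if (exists("pruned_tree") && exists("valid_df")) {
  pred <- predict(pruned_tree, newdata = valid_df, type = "class")
  print(confusion_and_accuracy(valid_df$Flight_Status, pred))
}
```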

8. Fit a logistic regression to the training set (again without the weather predictor) and, using the validation set (also without the weather predictor), obtain and display the confusion matrix and accuracy of the logistic model. Which model is more accurate, the logistic regression or the pruned tree? (Notes: before fitting a logistic regression you will need to convert the categorical predictors in both the training and validation sets into dummy variables. To that end, use the dummy_cols() function of the fastDummies package, with option remove_first_dummy = TRUE. After creating the dummy variables, remove all of the original categorical variables. To generate the predicted class of delayed flights, use a cutoff probability of 0.5.)
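The dummy coding and logistic fit might be sketched as follows. This assumes train_df and valid_df were split from the same factor-coded data frame (so both share identical factor levels and therefore produce the same dummy columns), that the outcome labels are "delayed" and "ontime", and that the installed fastDummies version supports the remove_selected_columns argument. The helper names are hypothetical.

```r
library(fastDummies)

# Replace factor predictors with 0/1 dummies, dropping the first level of each.
to_dummies <- function(df, outcome = "Flight_Status") {
  is_fac <- vapply(df, is.factor, logical(1))
  cat_cols <- setdiff(names(df)[is_fac], outcome)
  out <- dummy_cols(df, select_columns = cat_cols,
                    remove_first_dummy = TRUE,
                    remove_selected_columns = TRUE)
  # Bin labels such as "(594,788]" are not syntactic names; make them safe.
  names(out) <- make.names(names(out), unique = TRUE)
  # Code the outcome as 1 = delayed, 0 = ontime (labels are ASSUMED).
  out$delayed <- as.integer(out[[outcome]] == "delayed")
  out[[outcome]] <- NULL
  out
}

fit_logit <- function(train_dummies) {
  glm(delayed ~ ., data = train_dummies, family = binomial)
}

# Classify as delayed when the predicted probability reaches the cutoff.
predict_status <- function(model, valid_dummies, cutoff = 0.5) {
  p <- predict(model, newdata = valid_dummies, type = "response")
  ifelse(p >= cutoff, "delayed", "ontime")
}

if (exists("train_df") && exists("valid_df")) {
  train_dum <- to_dummies(train_df[, names(train_df) != "WEATHER"])
  valid_dum <- to_dummies(valid_df[, names(valid_df) != "WEATHER"])
  logit <- fit_logit(train_dum)
  pred <- predict_status(logit, valid_dum)
  actual <- as.character(valid_df$Flight_Status)
  print(table(Predicted = pred, Actual = actual))
  print(mean(pred == actual))
}
```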

9. Compare the confusion matrices and the accuracy rates of the fitted tree and the logistic model. Comment.

10. Again excluding the weather predictor, fit a random forest model using the training set. Using the validation set, check the model accuracy. In terms of accuracy, did the random forest model outperform the single pruned tree or the logistic regression? Display the variable importance chart and comment on the most important variable.
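A possible random forest fit with the randomForest package is sketched below. It assumes train_df and valid_df hold factor-coded predictors and the factor outcome Flight_Status; the number of trees, the seed, and the helper names are illustrative assumptions.

```r
library(randomForest)

# Remove the weather predictor (column name ASSUMED to be WEATHER).
drop_weather <- function(df) df[, names(df) != "WEATHER", drop = FALSE]

fit_forest <- function(train_df, ntree = 500, seed = 1) {
  set.seed(seed)
  randomForest(Flight_Status ~ ., data = train_df,
               ntree = ntree, importance = TRUE)
}

# Share of validation records whose predicted class matches the actual class.
validation_accuracy <- function(model, valid_df) {
  pred <- predict(model, newdata = valid_df)
  mean(as.character(pred) == as.character(valid_df$Flight_Status))
}

if (exists("train_df") && exists("valid_df")) {
  rf <- fit_forest(drop_weather(train_df))
  print(validation_accuracy(rf, drop_weather(valid_df)))
  varImpPlot(rf)   # variable importance chart
}
```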

11. Redo question 9 but this time using the boosted tree model.
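The question does not name a boosting implementation. As one option, the sketch below uses boosting() from the adabag package; the choice of package, the number of trees, the seed, and the helper names are all assumptions. It expects plain data frames (not tibbles) with a factor outcome Flight_Status and the weather predictor already removed.

```r
library(adabag)

fit_boosted <- function(train_df, n_trees = 100, seed = 1) {
  set.seed(seed)
  boosting(Flight_Status ~ ., data = train_df, mfinal = n_trees)
}

# predict() on a boosting object returns a list; $class holds the predicted labels.
boosted_accuracy <- function(model, valid_df) {
  pred <- predict(model, newdata = valid_df)
  mean(pred$class == as.character(valid_df$Flight_Status))
}

if (exists("train_df") && exists("valid_df")) {
  keep <- names(train_df) != "WEATHER"
  boost <- fit_boosted(train_df[, keep])
  print(boosted_accuracy(boost, valid_df[, keep]))
  print(sort(boost$importance, decreasing = TRUE))   # relative variable importance
}
```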