bitwix

Tangential comments about Software Development

Wednesday, September 02, 2015

R-membering How To Code

It's an interesting exercise to see if you can still code in a language that you haven't used for a while. It's been three years since I used the R language for statistical analysis. I knew it was the right choice for the monthly analysis of my London Bus usage, so yesterday I downloaded a copy, built a script (see below) and ran it.

It all went well. Admittedly, it was a case of "Computer Programming To Be Officially Renamed 'Googling Stack Overflow'", as there was almost nothing I remembered straight off. But it was quicker to re-learn the right tool than to force an Excel-based solution or write a C# program from scratch.

Of course, you're dying to know the results. Over ten months I have taken 652 bus journeys on 54 different routes. There are 12 routes I have used ten or more times, with the 45 bus (Camberwell to Farringdon) the most frequent at 131 times.

# ctrl-L to clear console
rm(list = ls(all.names = TRUE)) 

# pattern is a regular expression, not a shell glob
files = list.files(path="c:/documents/oystercard/", pattern="\\.csv$")
files <- paste("c:/documents/oystercard/",files, sep="")
all = do.call("rbind", lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE)))

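# keep only the bus journeys ("Bus journey" is 11 characters; substr is 1-based)
y <- all[ substr( all$Journey.Action, 1, 11 ) == "Bus journey", ]
nrow(y)
# Journey.Action is character (stringsAsFactors = FALSE), so count distinct values rather than factor levels
length(unique(y$Journey.Action))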

# y$Route = factor(substr(y$Journey.Action,20,23))
y$Route = substr(y$Journey.Action,20,23)
y$Month <- factor(substr(y$Date,4,6))
y$Year <- as.integer(substr(y$Date,8,11))
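# two-digit years: add 2000 only to the rows that need it
short <- !is.na(y$Year) & y$Year < 2000
y$Year[ short ] <- 2000 + y$Year[ short ]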

y$M <- match(y$Month,month.abb)
y$YM <- y$Year * 100 + y$M

summary <- as.data.frame(table(y$YM))
colnames(summary)[1] <- "YM"
colnames(summary)[2] <- "Journeys"

s2 <- as.data.frame(table(y$YM,y$Route))
s2 <- s2[ s2$Freq > 0, ]
s2 <- as.data.frame(table(s2$Var1))
colnames(s2) <- c( "YM", "Routes")

summary <- merge( s2, summary, by="YM" ) 

ByRoute <- as.data.frame(table(y$Route))
colnames(ByRoute) <- c( "Route", "Freq" )

PopularRoutes <- ByRoute[ ByRoute$Freq >= 10, ]

stats <- NULL
# count bus journeys only (y), not every row read from the CSV files (all)
stats[ "Number of journeys" ] = nrow(y)
stats[ "Different routes" ] = nrow(ByRoute)
stats <- as.data.frame(stats)

summary
PopularRoutes[ with( PopularRoutes , order( -Freq ) ), ]
stats

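# convert route labels to numbers for sorting; any lettered route (such as an N-prefixed night bus) becomes NA and sorts last
ByRoute$Route = as.numeric(levels(ByRoute$Route))[ByRoute$Route]
ByRoute[ order( ByRoute$Route ), ]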
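
For anyone without a folder of Oyster CSV files to hand, here is a minimal sketch of the same filter-and-tally steps on a made-up data frame. The column names (Date, Journey.Action) and the "Bus journey, route NN" text follow the script above. The rows themselves, and the names toy, bus and by_route, are invented for illustration. It also takes the route as whatever follows "route ", which is one possible way to keep lettered routes as text rather than counting character positions.

```r
# Toy data (invented rows): same column names the script above relies on
toy <- data.frame(
  Date = c("01-Sep-2015", "01-Sep-2015", "02-Sep-2015", "03-Oct-2015"),
  Journey.Action = c("Bus journey, route 45",
                     "Topped up",                 # a non-bus row, to be filtered out
                     "Bus journey, route 45",
                     "Bus journey, route N35"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose action starts with "Bus journey"
bus <- toy[ grepl("^Bus journey", toy$Journey.Action), ]

# Everything after "route " is the route label, so "N35" survives as text
bus$Route <- sub("^.*route ", "", bus$Journey.Action)

# Journeys per route, most frequent first
by_route <- as.data.frame(table(Route = bus$Route), stringsAsFactors = FALSE)
by_route <- by_route[ order(-by_route$Freq), ]
by_route
```

Swapping toy for the data frame read from the real CSV files should give the same shape of result, assuming the statements use the same wording for bus journeys.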