the data from 1950 onwards is either available in the archaic FishStat Plus format or in an MS Excel format. in both cases the format of the data are in a wide format, where the country, species and division is listed in rows and the year is in column. what follows is a script in R that shows how the data can be extracted directly from the website and then put in a long format, more suitable for further exploration.
# required library
require(gdata)
require(reshape2)
require(stringr)
options(stringsAsFactors=F)
# downloading and extract
temp <- tempfile()
download.file("http://ices.dk/fish/statlant/ICES1950-2010.zip",temp)
dat <- read.xls(paste(unzip(temp, "ICES_1950-2010.xls"),sep=""),
sheet=1,strip.white=T)
unlink(temp)
# some cleaning
names(dat) <- tolower(names(dat))
dat <- melt(dat,c("country","species","division"))
dat$variable <- as.character(dat$variable)
dat$variable <- as.integer(str_sub(dat$variable,2))
dat$value <- str_replace_all(dat$value,",","")
dat$value <- ifelse(dat$value %in% "-",0,dat$value)
dat$value <- ifelse(dat$value %in% ".",NA,dat$value)
dat$value <- ifelse(dat$value %in% "<0.5",0.25,dat$value)
dat$value <- ifelse(dat$value %in% "",NA,dat$value)
dat$value <- as.numeric(dat$value)
names(dat) <- c("country","species","descr","year","landings")
dat <- dat[!is.na(dat$landings),]
dat <- dat[dat$landings > 0,]
ices.1950.2010 <- dat
rm(dat,temp)
require(gdata)
require(reshape2)
require(stringr)
options(stringsAsFactors=F)
# downloading and extract
temp <- tempfile()
download.file("http://ices.dk/fish/statlant/ICES1950-2010.zip",temp)
dat <- read.xls(paste(unzip(temp, "ICES_1950-2010.xls"),sep=""),
sheet=1,strip.white=T)
unlink(temp)
# some cleaning
names(dat) <- tolower(names(dat))
dat <- melt(dat,c("country","species","division"))
dat$variable <- as.character(dat$variable)
dat$variable <- as.integer(str_sub(dat$variable,2))
dat$value <- str_replace_all(dat$value,",","")
dat$value <- ifelse(dat$value %in% "-",0,dat$value)
dat$value <- ifelse(dat$value %in% ".",NA,dat$value)
dat$value <- ifelse(dat$value %in% "<0.5",0.25,dat$value)
dat$value <- ifelse(dat$value %in% "",NA,dat$value)
dat$value <- as.numeric(dat$value)
names(dat) <- c("country","species","descr","year","landings")
dat <- dat[!is.na(dat$landings),]
dat <- dat[dat$landings > 0,]
ices.1950.2010 <- dat
rm(dat,temp)
the data frame ices.1950.2010 now contains 5 variables, i.e. the country, the species, descr (the area fished), the year and the landings in tonnes. some initial scoping of the data, like the total nominal landings by year can now be simply plotted by e.g.
# a simple plot
require(ggplot2)
ggplot(ices.1950.2010,aes(year,weight=landings/1e6)) +
geom_bar(binwidth=1) +
ylab("Nominal landings (million metric tonnes)")
require(ggplot2)
ggplot(ices.1950.2010,aes(year,weight=landings/1e6)) +
geom_bar(binwidth=1) +
ylab("Nominal landings (million metric tonnes)")
Further exploration of the data, e.g. landings by particular species and/or country in the format provided is a breeze. However landings by particular are (division, area and subdivision) require further manipulation on the field/variable "descr". more on that later.
einar
No comments:
Post a Comment