Friday, December 11, 2009

samples from large datasets in R

I have a dataset I want to plot (say 5,000,000 data points). Plotting that many points in R can be very slow, so I take a random sample of the data and plot that instead.

Say I have a tab-delimited file with two columns, 'time' and 'count', and a header row containing those names. There are 5M rows and I'd like a simple overview of the count over time.

> data = read.delim('filename', header=TRUE) # read in the TSV (tab-separated values) data file
> s = nrow(data) # number of data points (rows)
> n = 1000 # this is my sample size
> N = sort(sample(1:s, n)) # n indices sampled from the vector 1:s, without replacement
> plot(data$time[N], data$count[N]) # plot only the sampled rows

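The magic is in the sort(sample(1:s, n)). This draws n indices from the range 1 to s, without replacement by default, so no row is picked twice. Unless a probability vector is provided, each element in the input vector (1:s) has an equal probability of being selected. I sort the output of sample so that the indices are back in their original order. Actually I just tried this without the sort and the plot looks the same. That isn't because plot() sorts its input: by default it draws a scatter plot, where the order of the points makes no difference. The sort only matters when the points are joined up, e.g. with type='l'.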
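
Whether the sort matters depends on the plot type, and that is easy to check on a small made-up dataset. Here's a rough sketch, assuming synthetic data in place of the real file (the data frame contents, the seed, and the names idx and idx_sorted are my own placeholders, not part of the original example):

```r
# Synthetic stand-in for the real file (hypothetical data, for illustration only)
set.seed(42)                        # fix the seed so the sample is reproducible
s <- 5000                           # number of rows in the stand-in dataset
data <- data.frame(time = seq_len(s), count = cumsum(rnorm(s)))

n <- 1000                           # sample size
idx <- sample.int(s, n)             # n distinct row indices, in random order
idx_sorted <- sort(idx)             # same indices, back in original row order

# Scatter plot: each point is drawn independently, so order is irrelevant
plot(data$time[idx], data$count[idx])

# Line plot: points are joined in the order given, so use the sorted indices
plot(data$time[idx_sorted], data$count[idx_sorted], type = "l")
```

Swapping idx for idx_sorted in the first plot() call should give the same picture, while using the unsorted idx in the second one produces a scribble of lines zig-zagging across the plot.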
