I have a dataset I want to plot (say 5,000,000 data points). Plotting all of it in R can be very slow, so I take a random sample of the data instead.

Say I have a tab-delimited file with two columns, 'time' and 'count', and a header row naming them. There are 5M rows and I'd like a simple overview of the count over time.

> data = read.delim('filename', header=TRUE) # read in the tsv (tab separated value) data file

> s = nrow(data) # calculate the number of data points (rows)

> n = 1000 # this is my sample size

> N = sort(sample(1:s, n)) #create a set of indices sampled from the vector 1:s

> plot(data$time[N], data$count[N]) # use the indices to sample from the set
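To try the steps above without the file, here is a minimal self-contained sketch. It assumes synthetic data in place of the real tsv, and `plot_sample` is a hypothetical helper name used only for illustration, not a function that ships with R.

```r
# Synthetic stand-in for the 5M-row file (assumption: not the real data).
set.seed(1)  # fixed seed only so the example is repeatable
s <- 5000000
data <- data.frame(time = seq_len(s), count = rpois(s, lambda = 10))

# Hypothetical helper: draw n row indices without replacement and plot them.
plot_sample <- function(df, n = 1000) {
  n <- min(n, nrow(df))                 # cannot sample more rows than exist
  idx <- sort(sample.int(nrow(df), n))  # same idea as sort(sample(1:s, n))
  plot(df$time[idx], df$count[idx], xlab = "time", ylab = "count")
  invisible(idx)                        # return the indices for reuse
}

idx <- plot_sample(data, n = 1000)
```

Returning the indices invisibly means the same sample can be reused for a second plot or a summary without drawing a new one.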

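The magic is in the `sort(sample(1:s, n))`. This draws n indices from the range 1 to s, without replacement (the default for `sample()`). Unless a probability vector is provided, each element in the input vector (1:s) has an equal probability of being selected. We sort the output of sample so that the indices are in increasing order. Actually, I just tried this without the sort and the plot looks the same: the default `plot()` draws each point independently, so the order of the points does not matter. The sort only matters if you draw lines (for example `type='l'`), because lines connect the points in the order they are given.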
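A small sketch of how the ordering affects the plot, assuming a made-up random-walk series rather than the real data (all variable names here are illustrative):

```r
set.seed(42)  # fixed seed only so the example is repeatable
time  <- 1:100
count <- cumsum(rnorm(100))  # made-up series, not the real data

idx_unsorted <- sample.int(length(time), 20)  # 20 indices, no replacement
idx_sorted   <- sort(idx_unsorted)

par(mfrow = c(1, 3))
# Scatter plot: point order is irrelevant, so no sort is needed.
plot(time[idx_unsorted], count[idx_unsorted], main = "points, unsorted")
# Line plot: points are joined in the order given, so this zig-zags.
plot(time[idx_unsorted], count[idx_unsorted], type = "l", main = "lines, unsorted")
# Line plot with sorted indices: the line runs left to right.
plot(time[idx_sorted], count[idx_sorted], type = "l", main = "lines, sorted")
```

The first and third panels show the same 20 points; only the middle panel is scrambled, which is why keeping the `sort()` is a cheap safeguard if the plot type might change later.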