Bramblecloud is right for you if you are looking for an easy way to put a large amount of computing power behind your R computations. For example, Bramblecloud is a great help with parallelized simulations that run for more than an hour on your local system. If your simulation gains accuracy the longer it runs, Bramblecloud lets you improve your results dramatically without increasing computation time.
Let’s be honest: if you are never waiting on any R computation, then you won’t profit from Bramblecloud’s superior speed. And if you just need a single cloud machine for slightly higher performance, not a whole cluster, you could also set that up yourself (e.g. look for the RStudio Server AMI on AWS). If you’re at that point already, though, chances are that you’ll need more power soon. Bramblecloud is here to give you the computing power that you need, flexibly and with no setup at all.
Standard computations in R work through a script one step after another, doing everything on one CPU. In long-running simulations, however, there are often many things that could be done at the same time because the different parts don’t depend on each other. That’s where parallel computation comes in: if you have more than one CPU available, you can distribute the different parts to different processors and thus save time.
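As a rough illustration of the idea, here is a minimal sketch that uses R’s parallel package on your local machine. The function simulate_pi and the choice of two local workers are made up for this example; they are not part of Bramblecloud.

```r
library(parallel)

# One independent simulation run: estimate pi from n random points.
simulate_pi <- function(n) {
  x <- runif(n)
  y <- runif(n)
  4 * mean(x^2 + y^2 <= 1)
}

# Start two local worker processes (a stand-in for the nodes of a real cluster).
cl <- makeCluster(2)

# Run eight independent simulations, spread over the workers.
estimates <- parLapply(cl, rep(10000, 8), simulate_pi)

stopCluster(cl)

# Combine the partial results into one estimate.
pi_hat <- mean(unlist(estimates))
```

Because the eight runs don’t depend on each other, adding more workers shortens the wall-clock time without changing the code that does the actual work.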
Here are some guides to get you started with parallel computations in R:
- Marcus Beck gives an introduction with examples in his blog post A brief foray into parallel processing with R
- A thorough overview comes from Glenn K. Lockwood: Parallel Options for R
- If you are familiar with the basic ideas, you can browse the full variety of parallel add-ons for R on CRAN’s always-up-to-date overview page High-Performance and Parallel Computing with R maintained by Dirk Eddelbuettel
Currently, we’re offering clusters in three different regions: Singapore, Ireland, and Oregon, USA. Bramblecloud uses Amazon’s EC2 instances. We’re working on expanding our locations. If you’ve got a special request, drop us a line!
Our workhorse is Amazon’s c3.8xlarge instance which is optimized for computational power. Each instance has 32 CPUs and 60 GB of RAM. We’re not offering Amazon’s c4 instances because the spot market for c3s gives you much better prices at this time. We’ll be making that switch eventually though.
Starting the permanent machines in your cluster typically takes 2 to 3 minutes. If you are using a lot of permanent machines, it might also take a little longer. Your spot instances only get requested once your permanent machines are running. Their startup usually takes a little longer as it takes some time for the requests to get fulfilled. Of course, you only pay for machines from the moment that they are at your disposal.
Signing up is easy: head over to the registration page, choose a username and password and enter your email address. We send you an email to confirm your address (also check your spam folder) – this way you can be sure not to miss any communication concerning your account.
Once your account is confirmed, you can log into your account. Before you can start using it, go to your profile and enter your payment and billing information. That’s in two separate forms as we’re using a particularly secure form for credit card data (it won’t be stored on our servers).
When your information is complete, return to your dashboard and you’re good to go.
Yes. Bramblecloud clusters allow you to mix permanent instances and spot instances. You name your own price for the spot instances. Every cluster needs to have at least one permanent machine as a reliable master node.
Spot prices are based on AWS’s spot prices, which are determined dynamically depending on the supply and demand of instances in every region. When your bid is above the spot price, your spot request will be fulfilled at the current price. That means that you only pay the current spot price, even if your bid is higher.
Just like permanent machines, you pay for spot instances by the hour. For every instance, the hourly price is determined by the spot price at the start of every hour of operation. If your spot instance gets terminated because the spot price has outgrown your bid, then you won’t be charged for the last partial hour of operation.
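The billing rule for spot instances can be sketched in a few lines of R. This is only an illustration of the rule as described above, not our billing code; the function name spot_cost and the prices are made up.

```r
# Cost of one spot instance under the rules described above.
# prices_at_hour_start: spot price (in $) at the start of each started hour.
# terminated_by_price: TRUE if the instance was terminated because the spot
#   price rose above the bid; the last partial hour is then not charged.
spot_cost <- function(prices_at_hour_start, terminated_by_price = FALSE) {
  if (terminated_by_price && length(prices_at_hour_start) > 0) {
    prices_at_hour_start <- head(prices_at_hour_start, -1)
  }
  sum(prices_at_hour_start)
}

# Made-up spot prices for three started hours.
prices <- c(0.50, 0.55, 0.70)

cost_user_stop  <- spot_cost(prices)                             # all three hours billed
cost_terminated <- spot_cost(prices, terminated_by_price = TRUE) # last hour free
```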
We do not charge any subscription fees or the like. Instead, you only pay for the time that we are providing clusters for you. We are billing your credit card every two weeks for the charges that have accumulated, and you receive an invoice from us with a detailed break-down of your usage. Partially used hours are billed as full hours of usage. We are planning to change this in the future so that you will not have to worry about cutoff times.
New users get 2.40$ credit for testing out Bramblecloud. That’s enough for running a small instance for 6 full hours. As we don’t like switching off your machines when your credits are used up though, we ask you to provide your payment information ahead of spinning up your first instance. If you never exceed your credits, we will obviously never charge your card.
To start a new cluster, click the “Launch Bramble” button in your dashboard, below the spot price chart. In the popup that appears, enter the following information:
– A unique name for the cluster containing only alphanumeric characters and “.” and “_”. No two of your running or stopped clusters may have the same name.
– The number of permanent machines. These machines are very unlikely to fail, so they are always available for your computations.
– The number of spot instances. These machines will terminate when the spot price goes above your bid, so they might not always be available.
– Your spot bid in US cents. Check out the chart of last week’s maximum spot prices to get a feeling for a good bid.
– Your cluster’s location. Choosing a site that’s close to you makes sense if you’re experiencing high latency when you’re working on your cluster.
At the bottom, we show you the maximum price per hour that we may charge you. It’s the sum of the permanent machine charges and the spot bids for the number of spot instances you’re requesting. Depending on the spot price, of course, you will end up paying less for spots most of the time.
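To make the arithmetic concrete, here is a small sketch of how that maximum is put together. The function name max_hourly_price and the example numbers (one permanent machine at 2.67$, four spots with a bid of 0.80$) are assumptions chosen for illustration.

```r
# Maximum hourly charge: permanent machines at their fixed price,
# plus every requested spot instance at the full bid.
max_hourly_price <- function(n_permanent, permanent_price, n_spot, spot_bid) {
  n_permanent * permanent_price + n_spot * spot_bid
}

# Example values (in $), for illustration only.
max_price <- max_hourly_price(n_permanent = 1, permanent_price = 2.67,
                              n_spot = 4, spot_bid = 0.80)
```

In this example the upper bound is 2.67$ + 4 × 0.80$ = 5.87$ per hour; the actual charge is lower whenever the spot price is below the bid.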
Hit “Launch” when you’re ready to get your cluster started.
Your dashboard includes a chart with last week’s spot prices, which helps you choose a sensible bid for your spot requests. The chart shows the maximum spot price per hour and per region (also taking the maximum over the region’s availability zones, in case you are familiar with Amazon EC2). The prices are derived from Amazon’s EC2 spot price market.
Choosing a bid per machine which is frequently exceeded in your region increases the risk that your spots will not be available for extended periods of time during your computation. Once the price in the chart exceeds your bid, your spots may be terminated at any time.
For most applications, it is best to choose a bid which is in between the baseline price (typically around 0.50$) and the price of a permanent machine (starting at 2.67$). No matter how high your bid, you only pay the current price every hour. There is no guarantee, however, that the spot price stays below the price of a permanent machine.
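If you want to compare a few candidate bids, one simple approach is to check how often each bid would have been exceeded by past prices. The sketch below assumes you have typed the hourly maximum prices from the chart into a vector; the numbers here are invented for illustration and are not real spot prices.

```r
# Hourly maximum spot prices in $ (made-up values for illustration only).
hourly_max_price <- c(0.52, 0.50, 0.61, 0.95, 2.80, 0.58, 0.50, 0.49)

# Share of hours in which a given bid would have been exceeded.
share_exceeded <- function(bid, prices) mean(prices > bid)

# Compare three candidate bids.
bids <- c(0.55, 0.75, 1.00)
risk <- sapply(bids, share_exceeded, prices = hourly_max_price)
names(risk) <- bids
```

A bid with a high share of exceeded hours means your spots would have been unavailable for much of that period.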
Once your dashboard shows that your cluster is running, hit the “Go” button. This opens a new browser window with an RStudio session on your cluster’s master machine. Username and password are the same as for your Bramblecloud account.
The R running on your cluster is like any other R session. So you can install packages, run scripts, create plots etc.
When you start your session, you will find the R environment bramblecloud in your workspace. It contains information about your cluster to configure your parallel computing environment. If all machines are permanent (i.e. no spots), you can simply set up a parallel cluster based on R’s parallel package like this:
library(parallel)
cl <- makeCluster(bramblecloud$clusterspec)
If you’re using spots, this may fail, either at startup because your spots are not ready yet, or later because your spots get terminated. In that case, the parallel computation will hang. You should use our brambleterminal package instead, which integrates spots in a robust manner:
cl <- makeBrambleCluster()
The cluster object created by makeBrambleCluster can be used in the same way as a parallel package cluster, with commands such as clusterCall, clusterExport, parLapply and so on. The brambleterminal package is currently in beta, and you can read more about it here.
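Here is a short sketch of how these commands fit together. It uses a local two-worker cluster from the parallel package as a stand-in so that it runs anywhere; on a Bramblecloud cluster you would create cl with makeBrambleCluster() instead. The variable scale_factor is made up for the example.

```r
library(parallel)

# Local stand-in for a cluster object.
cl <- makeCluster(2)

# Make a variable from the master session available on every worker.
scale_factor <- 10
clusterExport(cl, "scale_factor", envir = environment())

# Run the same call once on each worker, e.g. to check the setup.
worker_values <- clusterCall(cl, function() scale_factor)

# Apply a function to each element of a vector in parallel.
scaled <- parSapply(cl, 1:5, function(i) i * scale_factor)

stopCluster(cl)
```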
Stopping means that your instance goes into hibernation. Your current R session gets saved to disk and will be restored when you wake the instance back up. You should stop your instance if you want to continue your session at a later time without incurring charges in the meantime. Your cluster will also be stopped automatically if it has been idle for more than an hour.
Terminating means that your instance is shut down for good. All your data on that cluster are deleted and cannot be recovered any more. You should terminate your clusters when you don’t need them any more. Stopped clusters are terminated automatically if they have not been used for more than two weeks.
Apart from the R core (latest R version 3.3.0) and our own brambleterminal package, the following packages are already installed:
- class (classification)
- rpart (recursive partitioning and regression trees)
- e1071 (support vector machines)
Further packages can be installed at runtime from CRAN with the usual install.packages("packagename") syntax. If you’re missing your favorite package, write to us!
The brambleterminal package uses load balancing between all nodes. If you want to use spots, this is highly recommended. It enables the computation to get the most performance out of all its computing resources: permanent instances and spot instances.
We use SSL encryption for this website and all data transfer of your personal information and passwords. Currently, the RStudio browser session is unencrypted (except for authentication). We’re working on it.
Bramblecloud automatically stops clusters that are idle for more than an hour. We do this because we know how easy it is to forget that your cluster is still running. And how annoying it is to check if your long-running simulation has completed, just so you can stop paying for your cluster.
Stopping your cluster preserves your data. Your R session will be restored when you wake your cluster back up. We max out partial hours of cluster runtime before stopping idle machines. Idle means that your cluster’s workload has been below 1% for a while and you are not logged into an RStudio session.
Bramblecloud configures a personal firewall for your clusters that only lets your IP through. This way, only you can access your cluster. This is a great security feature, but it gets annoying if your IP address changes. Your internet service provider may sometimes change your IP without notice. And your IP may change if you log into a running session from another device. In those cases, your RStudio browser session will not load or will hang, as there is no connection.
Click the “Update IP” button in your dashboard to update the firewall for the respective cluster. Your browser session should now work again.
You can copy & paste scripts and plots directly to and from the RStudio browser session. If you need to move files to and from your cluster, the most comfortable way is to do this through a publicly hosted Git repository (e.g. on GitHub or Bitbucket). Git repositories can be imported directly from RStudio (read how) and you can push your files back to the repository, all under version control.
Alternatively, you can upload and download files directly to your cluster through an SSH/SFTP connection. The address of your cluster is shown in the dashboard, the username and password are the same as for your account. For Linux and Mac users, a command line connection is established with this command:
If you’re using Windows, you need an SFTP client such as PuTTY’s PSFTP. There are many other SFTP clients around, including graphical front ends that are more intuitive than the command line. We are also working on making file upload and download work straight from the Bramblecloud dashboard.
If you have a Git repository that’s reachable from the web (e.g. at GitHub or Bitbucket), then you can clone it directly to your Bramblecloud cluster from its RStudio window. Simply create a new project in RStudio (File > New Project…) and choose the “Version Control” option. RStudio also works with Subversion repositories in the same way.
Once your repository is checked out, you can commit and push your changes from RStudio’s Git menu in the top toolbar (Tools > Version Control). If you need to execute Git commands beyond the graphical interface, this can be done in RStudio’s shell (Tools > Shell).
We have developed the brambleterminal package to enable you to integrate spots into your cluster. Its functionality is based on R’s parallel package, but under the hood it rewires quite a lot:
- dynamic load balancing which is robust against the termination of nodes, handles nodes of varying capacity and latency, and keeps computational overhead low
- start-up commands that configure nodes when they become available, such as spot nodes whose request is fulfilled when the computation is already running
- documentation of intermediate results, so nothing is lost when your computation fails for any reason
- progress bar for the computation
The brambleterminal package is pre-installed on all Bramblecloud clusters. From the RStudio session on your cluster, load the package and get the cluster object like this:
library(brambleterminal)
cl <- makeBrambleCluster()
The package automatically figures out your cluster configuration, which is why you don't need to supply any arguments.
brambleterminal comes with implementations for the functions that are also supported by the parallel package. We list them here, for help refer to the
clusterCall(cl, fun, ...)
clusterApply(cl, x, fun, ...)
parApply(cl, X, MARGIN, FUN, ...)
parLapply(cl, x, fun, ...)
parRapply(cl, x, FUN, ...)
parCapply(cl, x, FUN, ...)
parSapply(cl, X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
clusterExport(cl, varlist, envir = .GlobalEnv)
clusterSetRNGStream(cl, iseed = NULL)
Although clusterSetRNGStream sets the seeds of all workers, this will generally not lead to reproducible results, because all parallel computations in the brambleterminal package use dynamic load balancing. This is done to integrate unreliable resources such as spots.
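For comparison, the sketch below shows the situation in which seeding does give reproducible results: the base parallel package with static scheduling on a fixed set of local workers, where every task always lands on the same worker. This is only a stand-in to illustrate the point; with dynamic load balancing, the assignment of tasks to workers can differ from run to run, so the same seeds no longer guarantee the same numbers.

```r
library(parallel)

# Two local workers with static scheduling (base parallel package).
cl <- makeCluster(2)

# First run: seed the workers, then draw one random number per task.
clusterSetRNGStream(cl, iseed = 42)
run_1 <- parSapply(cl, 1:6, function(i) runif(1))

# Second run with the same seed and the same task-to-worker assignment.
clusterSetRNGStream(cl, iseed = 42)
run_2 <- parSapply(cl, 1:6, function(i) runif(1))

stopCluster(cl)

# With static scheduling, both runs produce the same numbers.
reproducible <- identical(run_1, run_2)
```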
brambleterminal implements two functions to inspect and repair the cluster:
check_health(cl, intrusive = TRUE)
repair(cl, force = FALSE)
check_health checks whether all of the cluster’s nodes are reachable and responsive. Setting intrusive = TRUE (the default) pings all workers with a small test computation. This does not interfere with other running computations, but other computations may lead to an incorrect result if they take up the workers’ computational resources. Setting intrusive = FALSE does not ping the workers and only checks that all intermediate queues and objects are present.
repair stops and rebuilds all nodes for which check_health reports problems. If force = TRUE, this is done for all nodes regardless of their health. This deletes all preliminary data stores of the preceding cluster computation.
brambleterminal’s cluster objects keep a log and a digest that can be accessed after a computation is done. They are located in the cluster’s data environment:
- cl@data$log contains events during computation, such as the termination of cluster nodes
- cl@data$digest contains the distribution of jobs to the cluster’s nodes
The following commands can be used to define commands that should be run on the workers of nodes that become available:
bramble.startup executes the expression in ... and saves cluster commands to be executed when new nodes get started. Right now, only clusterCall, clusterExport, and clusterSetRNGStream are retained in this way. Subsequent calls to bramble.startup append startup commands to the previous commands. Instead of calling bramble.startup repeatedly, you can also group multiple startup commands with
bramble.startup.reset clears out all startup commands set for this cluster’s nodes.