Many models utilize random numbers during the phase where parameters are estimated.
Also, the resampling indices are chosen using random numbers. There are two main ways to control the randomness in order to assure reproducible results. How random numbers are used is highly dependent on the package author. There are rare cases where the underlying model function does not control the random number seed, especially if the computations are conducted in C code.
Also, please note that some packages use random numbers when they are loaded (directly or via a namespace), and this may affect reproducibility.
As previously mentioned, train can pre-process the data in various ways prior to model fitting. The function preProcess is used automatically for this. It can be used for centering and scaling, imputation (see details below), applying the spatial sign transformation, and feature extraction via principal component analysis or independent component analysis. To specify what pre-processing should occur, the train function has an argument called preProcess.
This argument takes a character string of methods that would normally be passed to the method argument of the preProcess function. Additional options to the preProcess function can be passed via the trainControl function.
These processing steps would be applied during any predictions generated using predict. The tuning parameter grid can be specified by the user. The argument tuneGrid can take a data frame with columns for each tuning parameter. For the previously mentioned RDA example, the names would be gamma and lambda. For the boosted tree model, we can fix the learning rate and evaluate more than three values of n. Another option is to use a random sample of possible tuning parameter combinations. This functionality is described on this page. In this situation, the tuneLength parameter defines the total number of parameter combinations that will be evaluated.
The plot function can be used to examine the relationship between the estimates of performance and the tuning parameters. For example, a simple invocation of the function shows the results for the first performance measure. Other types of plot are also available.
The code below shows a heatmap of the results. There are also plot functions that show more detailed representations of the resampled estimates. From these plots, a different set of tuning parameters may be desired. To change the final values without starting the whole process again, the update. The function trainControl generates parameters that further control how models are created. The user can change the metric used to determine the best settings. By default, the parameter values are chosen using RMSE for regression and accuracy for classification.
The metric argument of the train function allows the user to control which optimality criterion is used. If none of these parameters are satisfactory, the user can also compute custom performance metrics. The trainControl function has an argument called summaryFunction that specifies a function for computing performance. The function should have these arguments:. The output of the function should be a vector of numeric summary metrics with non-null names.
By default, train evaluates classification models in terms of the predicted classes. Optionally, class probabilities can also be used to measure performance. To obtain predicted class probabilities within the resampling process, the argument classProbs in trainControl must be set to TRUE. This merges columns of probabilities into the predictions generated from each resample (there is a column per class, and the column names are the class names).
As shown in the last section, custom functions can be used to calculate performance scores that are averaged over the resamples. Another built-in function, twoClassSummary, will compute the sensitivity, specificity and area under the ROC curve. To rebuild the boosted tree model using this criterion, we can see the relationship between the tuning parameters and the area under the ROC curve using the following code. In this case, the average area under the ROC curve associated with the optimal tuning parameters was 0.
By default, the train function chooses the model with the largest performance value (or the smallest, for mean squared error in regression models). Other schemes for selecting a model can be used. One of these is the "one standard error" rule. In this case, the model with the best performance value is identified and, using resampling, we can estimate the standard error of performance. The final model used is the simplest model within one standard error of the empirically best model.
With simple trees this makes sense, since these models will start to over-fit as they become more and more specific to the training data. The argument selectionFunction can be used to supply a function to algorithmically determine the final model. As an example, if we chose the previous boosted tree model on the basis of overall accuracy, we would choose: n. However, the scale in these plots is fairly tight, with accuracy values ranging from 0. A less complex model e. This indicates that we can get a less complex model with an area under the ROC curve of 0.
The main issue with these functions is related to ordering the models from simplest to most complex. In some cases, this is easy. In others it is not. For example, is a boosted tree model using iterations and a tree depth of 2 more complex than one with 50 iterations and a depth of 8? The package makes some choices regarding the orderings. In the case of boosted trees, the package assumes that increasing the number of iterations adds complexity at a faster rate than increasing the tree depth, so models are ordered by the number of iterations and then by depth.
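As a rough illustration of the selection logic described above, the following sketch applies a one-standard-error rule to a handful of made-up resampling results. It is written in plain Ruby rather than R and is not caret's own code: the Candidate struct, the numbers, and the helper names are hypothetical, and the standard error is assumed to be the standard deviation across resamples divided by the square root of the number of resamples. Candidates are ordered by iterations first and depth second, mirroring the ordering described for boosted trees.

```ruby
# Hypothetical row of resampling results: tuning values, mean accuracy
# across resamples, and the standard deviation of that accuracy.
Candidate = Struct.new(:iterations, :depth, :mean, :sd)

# Order from simplest to most complex: iterations first, then depth.
def by_complexity(candidates)
  candidates.sort_by { |c| [c.iterations, c.depth] }
end

# One-standard-error rule for a metric where larger is better:
# pick the simplest candidate whose mean is within one standard
# error of the best mean.
def one_se_choice(candidates, n_resamples)
  best = candidates.max_by(&:mean)
  threshold = best.mean - best.sd / Math.sqrt(n_resamples)
  by_complexity(candidates).find { |c| c.mean >= threshold }
end

# Made-up results for illustration only.
results = [
  Candidate.new(50, 1, 0.780, 0.05),
  Candidate.new(50, 2, 0.812, 0.05),
  Candidate.new(100, 1, 0.805, 0.05),
  Candidate.new(150, 3, 0.820, 0.05)
]

chosen = one_se_choice(results, 25)
puts "#{chosen.iterations} iterations, depth #{chosen.depth}"
```

With these numbers the threshold is roughly 0.81, so the rule prefers the 50-iteration, depth-2 model over the numerically best 150-iteration model.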
Predictions can be made from these objects as usual. In some cases, such as pls or gbm objects, additional parameters from the optimized fit may need to be specified. In these cases, the train object uses the results of the parameter optimization to predict new samples. For example, if predictions were created using predict. Also, for binary classification, the predictions from this function take the form of the probability of one of the classes, so extra steps are required to convert this to a factor vector.
Also, there are very few standard syntaxes for model predictions in R. For example, to get class probabilities, many predict methods have an argument called type that is used to specify whether the classes or probabilities should be generated. Different packages use different values of type, such as "prob", "posterior", "response", "probability" or "raw". In other cases, completely different syntax is used. For predict. For example:. There are several lattice functions that can be used to explore relationships between tuning parameters and the resampling results for a specific model.
The caret package also includes functions to characterize the differences between models generated using train, sbf or rfe via their resampling distributions. These functions are based on the work of Hothorn et al. First, a support vector machine model is fit to the Sonar data. The data are centered and scaled using the preProc argument. Note that the random number seed set prior to fitting this model is identical to the seed used for the boosted tree model.
This ensures that the same resampling sets are used, which will come in handy when we compare the resampling profiles between models. Given these models, can we make statistical statements about their performance differences? To do this, we first collect the resampling results using resamples. There are several lattice plot methods that can be used to visualize the resampling distributions: density plots, box-whisker plots, scatterplot matrices and scatterplots of summary statistics. Other visualizations are available in densityplot.
The Rails Tutorial often shows output from various programs. Because of the innumerable small differences between different computer systems, the output you see may not always agree exactly with what is shown in the text, but this is not cause for concern. If you run into any problems while following the tutorial, I suggest consulting the resources listed in the Rails Tutorial help page.
Because the Rails Tutorial covers testing of Rails applications, it is often helpful to know if a particular piece of code causes the test suite to fail (indicated by the color red) or pass (indicated by the color green). For convenience, code resulting in a failing test is thus indicated with red, while code resulting in a passing test is indicated with green. Finally, for convenience the Ruby on Rails Tutorial adopts two conventions designed to make the many code samples easier to understand. First, some code listings include one or more highlighted lines.
Such highlighted lines typically indicate the most important new code in the given sample, and often (though not always) represent the difference between the present code listing and previous listings. Even for experienced Rails developers, installing Ruby, Rails, and all the associated supporting software can be an exercise in frustration.
Compounding the problem is the multiplicity of environments: different operating systems, version numbers, preferences in text editor and integrated development environment (IDE), etc. The Ruby on Rails Tutorial offers two recommended solutions to this problem.
The other possibility, recommended for newer users, is to sidestep such installation and configuration issues by using a cloud integrated development environment, or cloud IDE. The cloud IDE used in this tutorial runs inside an ordinary web browser, and hence works the same across different platforms, which is especially useful for operating systems such as Windows on which Rails development has historically been difficult. It also maintains the current state of your work, so you can take a break from the tutorial and come back to the system just as you left it.
Considering various idiosyncratic customizations, there are probably as many development environments as there are Rails programmers. The resulting workspace environment comes pre-configured with most of the software needed for professional-grade Rails development, including Ruby, RubyGems, Git. Although you are welcome to develop your application locally, setting up a Rails development environment can be challenging, so I recommend the cloud IDE for most readers.
Here are the steps for getting started with the cloud development environment. Because using two spaces for indentation is a near-universal convention in Ruby, I also recommend changing the editor to use two spaces instead of the default four. Here the -v flag ensures that the specified version of Rails gets installed, which is important for getting results consistent with this tutorial.
Virtually all Rails applications start the same way, by running the rails new command. This handy command creates a skeleton Rails application in a directory of your choice. For readers coming from Windows or (to a lesser but still significant extent) macOS, the Unix command line may be unfamiliar. Luckily, if you are using the recommended cloud environment, you automatically have access to a Unix (Linux) command line running a standard shell command-line interface known as Bash.
The basic idea of the command line is simple: by issuing short commands, users can perform a large number of operations, such as creating directories (mkdir), moving and copying files (mv and cp), and navigating the filesystem by changing directories (cd). Indeed, you will rarely see the desktop of an experienced developer without several open terminal windows running command-line shells.
Notice how many files and directories the rails command creates. After creating a new Rails application, the next step is to use Bundler to install and include the gems needed by the app. This involves opening the Gemfile with a text editor. With the cloud IDE, this involves clicking the arrow in the file navigator to open the sample app directory and double-clicking the Gemfile icon. Unless you specify a version number to the gem command, Bundler will automatically install the latest requested version of the gem. This is the case, for example, in the code.
There are also two common ways to specify a gem version range, which allows us to exert some control over the version used by Rails. The first looks like this:. The second method looks like this:. Important note: For all the Gemfiles in this book, you should use the version numbers listed at gemfiles-4th-ed. In this case you should… run bundle update first.
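To make the version-range idea concrete, here is a small sketch using RubyGems' own Gem::Requirement and Gem::Version classes. It assumes that the two styles referred to above are the >= operator and the ~> ("pessimistic") operator, which are the forms commonly seen in Gemfiles. The version numbers are arbitrary examples, not the ones this book prescribes.

```ruby
require "rubygems"

# ">= 1.3.0" accepts 1.3.0 and anything newer, including new major versions.
at_least = Gem::Requirement.new(">= 1.3.0")

# "~> 4.0.0" accepts 4.0.0 and later patch releases, but stops before 4.1.
pessimistic = Gem::Requirement.new("~> 4.0.0")

# Returns true when the version string satisfies the requirement.
def allowed?(requirement, version)
  requirement.satisfied_by?(Gem::Version.new(version))
end

checks = {
  ">= 1.3.0 allows 1.2.9" => allowed?(at_least, "1.2.9"),
  ">= 1.3.0 allows 1.3.0" => allowed?(at_least, "1.3.0"),
  ">= 1.3.0 allows 7.2"   => allowed?(at_least, "7.2"),
  "~> 4.0.0 allows 4.0.0" => allowed?(pessimistic, "4.0.0"),
  "~> 4.0.0 allows 4.0.5" => allowed?(pessimistic, "4.0.5"),
  "~> 4.0.0 allows 4.1.0" => allowed?(pessimistic, "4.1.0")
}

checks.each { |label, ok| puts "#{label}: #{ok}" }
```

In a Gemfile the same requirement strings would appear as the second argument to the gem command, for instance after a (hypothetical) gem name such as "some_gem". The pessimistic form is the more conservative choice, since it never crosses into the next minor release.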
When interacting with a Rails application, a browser sends a request, which is received by a webserver and passed on to a Rails controller, which is in charge of what to do next. In some cases, the controller will immediately render a view, which is a template that gets converted to HTML and sent back to the browser. More commonly for dynamic sites, the controller interacts with a model, which is a Ruby object that represents an element of the site (such as a user) and is in charge of communicating with the database. After invoking the model, the controller then renders the view and returns the complete web page to the browser as HTML.
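The request flow just described can be imitated in a few lines of plain Ruby. This is only a toy sketch and not how Rails itself is implemented: the User and UsersController classes are hypothetical, a frozen hash stands in for the database, and ERB from the standard library stands in for a Rails view template.

```ruby
require "erb"

# Model: a Ruby object representing an element of the site.
# The RECORDS hash is a hypothetical stand-in for a real database.
class User
  RECORDS = { 1 => "Alice" }.freeze

  attr_reader :name

  def initialize(name)
    @name = name
  end

  # Look up a record by id, as a model would query the database.
  def self.find(id)
    new(RECORDS.fetch(id))
  end
end

# View: a template that gets converted to HTML.
USER_TEMPLATE = ERB.new("<h1>Hello, <%= user.name %>!</h1>")

# Controller: receives the request parameters, asks the model for
# data, then renders the view and returns the resulting HTML.
class UsersController
  def show(params)
    user = User.find(params.fetch(:id))
    USER_TEMPLATE.result_with_hash(user: user)
  end
end

# Simulate a request for the user with id 1.
html = UsersController.new.show({ id: 1 })
puts html
```

The point of the separation is that each piece has one job: the model knows about data, the view knows about markup, and the controller coordinates the two.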
As implied by their name, controller actions are defined inside controllers. Indeed, at this point the Application controller is the only controller we have, which you can verify by running. In particular, we want to change the default page, the root route, which determines the page that is served on the root URL. The syntax looks like this. Knowing how to use a version control system is a required skill for every professional-grade software developer.
There are many options for version control, but the Rails community has largely standardized on Git, a distributed version control system originally developed by Linus Torvalds to host the Linux kernel. Before using Git, you should perform a couple of one-time setup steps. These are system setups, meaning you only have to do them once per computer. Note that the name and email address you use in your Git configuration will be available in any repositories you make public.
Now we come to some steps that are necessary each time you create a new repository (sometimes called a repo for short). The first step is to navigate to the root directory of the first app and initialize a new repository.
This command adds all the files in the current directory apart from those that match the patterns in a special file called. The rails new command automatically generates a. The added files are initially placed in a staging area, which contains pending changes to our project. We can see which files are in the staging area using the status command. All the examples in this book will use the -m flag.
It is important to note that Git commits are local, recorded only on the machine on which the commits occur. This means we can still undo the changes using the checkout command with the -f flag to force overwriting the current changes. The missing files and directories are back. By far the two most popular sites for hosting Git repositories are GitHub and Bitbucket. The two services share many similarities: both sites allow for Git repository hosting and collaboration, as well as offering convenient ways to browse and search repositories.
The important differences from the perspective of this tutorial are that GitHub offers unlimited free repositories with collaboration for open-source repositories while charging for private repos, whereas Bitbucket allows unlimited free private repos while charging for more than a certain number of collaborators. Which service you use for a particular repo thus depends on your specific needs.
Learn Enough Git to Be Dangerous and some previous editions of this tutorial use GitHub because of its emphasis on supporting open-source code, but growing concerns about security have led me to recommend that all web application repositories be private by default. The issue is that such repositories might contain potentially sensitive information such as cryptographic keys or passwords, which could be used to compromise the security of a site running the code. It is possible, of course, to arrange for this information to be handled securely (by having Git ignore it, for example), but this is error-prone and requires significant expertise.
As it happens, the sample application created in this tutorial is safe for exposure on the web, but it is dangerous to rely on this fact in general. Thus, to be as secure as possible, we will err on the side of caution and use private repositories by default. Since GitHub charges for private repositories while Bitbucket offers an unlimited number for free, for our present purposes Bitbucket is a better fit than GitHub.
By the way, recently a third major Git hosting company has emerged, called GitLab. Originally designed principally as an open-source Git tool you hosted yourself, GitLab now offers a hosted version as well, and in fact allows for unlimited public and private repositories. UPDATE: GitHub announced in early that it will be offering unlimited free private repositories with a limit only on the number of collaborators.
Future editions of this tutorial may switch back to GitHub as a result. For example, the command I ran was. As indicated by the filename extension. This automatic rendering of the README is convenient, but of course it would be better if we tailored the contents of the file to the project at hand. Git is incredibly good at making branches, which are effectively copies of a repository where we can make (possibly experimental) changes without modifying the parent files. In most cases, the parent repository is the master branch, and we can create a new topic branch by using checkout with the -b flag.
The full value of branching only becomes clear when working on a project with multiple developers, but branches are helpful even for a single-developer tutorial such as this one. In particular, because the master branch is insulated from any changes we make to the topic branch, even if we really mess things up we can always abandon the changes by checking out the master branch and deleting the topic branch.
Be careful about using the -a flag improperly; if you have added any new files to the project since the last commit, you still have to tell Git about them using git add -A first. Note that we write the commit message in the present tense and, technically speaking, the imperative mood. Git models commits as a series of patches, and in this context it makes sense to describe what each commit does, rather than what it did. Moreover, this usage matches up with the commit messages generated by Git commands themselves. Your exact results will differ in these details, but otherwise should essentially match the output shown above.
This way you can switch back and forth between the topic and master branches, merging in changes every time you reach a natural stopping point. This step is optional, but deploying early and often allows us to catch any deployment problems early in our development cycle.
The alternative, deploying only after laborious effort sealed away in a development environment, often leads to terrible integration headaches when launch time comes. Deploying Rails applications used to be a pain, but the Rails deployment ecosystem has matured rapidly in the past few years, and now there are several great options. These include shared hosts or virtual private servers running Phusion Passenger (a module for the Apache and Nginx webservers), full-service deployment companies such as Engine Yard and Rails Machine, and cloud deployment services such as Engine Yard Cloud and Heroku.
My favorite Rails deployment option is Heroku, which is a hosted platform built specifically for deploying Rails and other web applications. Heroku makes deploying Rails applications ridiculously easy—as long as your source code is under version control with Git. The rest of this section is dedicated to deploying our first application to Heroku. We can commit the resulting change as follows:.