R and Python are two popular languages for data analysis. In this post, I will cover some libraries, packages and resources that will help you become proficient with these statistical and scripting languages quickly. This post is intended for the beginner to intermediate level, though you may find some useful tools no matter your skill level. Let’s start with R.
R Programming Language
R was created by Ross Ihaka and Robert Gentleman in 1993. Its source code is written largely in C and Fortran and is continually updated. The latest stable release (3.2.4) rolled out, at the time of writing, less than two weeks ago. It is free to use under the GNU General Public License. R can be accessed through the command line, but most people will recommend that you start with RStudio, a graphical development environment for R.
To learn R, most people will simply Google “R tutorial” or “how to learn R”, which returns Lynda courses and Coursera courses, all of which sit behind a paywall or registration form. There are a few high quality resources left which come close to being truly free: Stanford’s online program, Lagunita, and Swirl. The former is a traditional online course; the latter is an interactive learning experience that goes by the slogan “Learn R, in R.” Alternatively, there is a new online coding website called DataCamp, reminiscent of Codecademy, where you can register for free and begin certain intro to intermediate level courses. Completing all of the coursework requires payment, however. The web interface is professional looking and the mini video lectures are high quality as well. DataCamp’s web-based lessons are interactive and provide a console, so you do not need to download and install anything, which may be a plus for some.
Let me say first that although I think these are great resources, you still won’t get the socially interactive experience that is typical of traditional classroom courses. Being able to ask a question out loud and have someone respond instantly is priceless. However, you can sometimes make up for this by joining a community or forum, or even a Skype video call. At any rate, if you prefer to learn by doing (writing out the math problem, typing in the code, and seeing the results of your actions immediately), start with Swirl. If you prefer to listen to somebody talk, watch video lectures, take notes and look at slides, begin with a course on Stanford’s Lagunita. I preferred Swirl; it is a really neat way to teach someone statistical programming, and I was surprised that I had never heard of it. It even strives to help you understand what’s under the hood of statistical testing and the logic that supports it.
How to begin? It’s easy.
I was impressed with how straightforward Swirl’s installation and setup were. Go to the Learn tab of the Swirl website and follow the 5 steps, which involve issuing a grand total of 3 commands in the R console, then choose which course you want to do. Note: the interactive tutorial will ask if you want to receive credit for the class on Coursera. While the courses there were entirely free at one time, you will now need to register and pay Coursera in order to “receive credit”. No worries though, you can still proceed with the tutorial for free. It’s about what you know, and what you can do with that knowledge, anyway. Here’s a screenshot of the interface:
Here I have loaded the library, run the Swirl function, and navigated to the list of courses that I have downloaded. Don’t worry, there’s more than just that. The courses work by doing some of the more difficult things for you at first, then gradually shifting those tasks to the student. In the beginner and intermediate level courses, the lines of code are given to you explicitly so that you can re-type and submit them. Once you have done this correctly, the course moves forward with a progress indicator (15% complete, 27%, etc). Sometimes you’ll be asked to select the correct answer from choices 1-4. When you choose the wrong answer, the question is asked again, this time with the answers assigned to different numbers (in an attempt to prevent lazy guessing, I suppose). The side effect is that you have to read the choices over again, because the number for your second best choice has probably changed. Other than that, the material does a good job of teaching the code syntax. If you cannot answer a question no matter how hard you try, you may type “skip()” and have the correct answer entered for you automatically. Swirl does, however, leave something to be desired in terms of explaining the “under the hood” aspect of statistics and mathematics. This is understandable, as everything happens through a text interface in the console. That is why I recommend following the Stanford online classes (or another course of your choosing) in addition to this. If you want to explore other high quality learning resources, take a look at the question answered on Quora.
Where does Python come into the mix when doing a statistical analysis? As you may know, Python is a scripting language, so the code is saved to a text file and executed without a separate compilation step. This is what makes it such a great language for rapidly prototyping models and solving constantly changing problems. To learn more about Python and how to get started quickly, read here. Python is a natural complement to R when conducting analyses because it can handle the pre-processing duties with extreme ease. In pseudocode: if a cell is blank, delete the row; if 'x' is in a string, split it with string = re.split('x', string). Formatting data to your needs couldn’t be easier.
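To make the two pre-processing rules above concrete, here is a minimal sketch using only Python's standard library. The sample data, the column names ("name", "tags") and the ";" delimiter are all made up for illustration; your own data will look different.

```python
import csv
import io
import re

# Hypothetical sample data; the column names are invented for this example.
RAW_CSV = """name,tags
alice,red;green
,blue
bob,
carol,yellow;black;white
"""


def clean_rows(text, delimiter=";"):
    """Drop rows that contain a blank cell, then split the 'tags' column."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        # Rule 1: if any cell is blank (or missing), delete the row.
        if any(value is None or value.strip() == "" for value in row.values()):
            continue
        # Rule 2: if the delimiter is in the string, split on it.
        if delimiter in row["tags"]:
            row["tags"] = re.split(re.escape(delimiter), row["tags"])
        else:
            row["tags"] = [row["tags"]]
        cleaned.append(row)
    return cleaned


if __name__ == "__main__":
    for row in clean_rows(RAW_CSV):
        print(row)
```

Only the "alice" and "carol" rows survive, since the other two each have an empty cell. For a fixed delimiter like this, plain str.split would do the same job; re.split earns its keep once the separator is a pattern rather than a single character.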
Without a means to quickly communicate your data and insights, the value of your work is diminished. People want to see a visual equivalent of the mathematical jargon you’re spewing. While keeping it simple, with as few colors and distractions as possible, is a good approach when creating visual aids, you’ll need more colors, shapes, and dimensions the more variables your data has. I came across a blog post a while back about using Seaborn, a Python library created by Michael Waskom that helps create aesthetically pleasing graphs.
The graphs are professional looking and make ample use of space. The library is based on Matplotlib. Here is a link to the main page.
What about other, more advanced methods? Python has those too: scikit-learn is a free and open source collection of tools meant for all sorts of data analysis. Classification, regression, clustering, model selection, dimensionality reduction and preprocessing are all supported in this package. You will need to be fairly proficient with Python and statistics before you can begin using scikit-learn meaningfully; while there is some documentation, the library assumes a lot of existing knowledge.
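As a rough illustration of what a classification task looks like in scikit-learn, here is a small sketch. It assumes scikit-learn is installed and uses the iris dataset that ships with it; the choice of logistic regression, the 25% test split and the function name fit_and_score are my own picks for the example, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def fit_and_score(test_size=0.25, seed=0):
    """Train a classifier on iris and return accuracy on held-out data."""
    # Features (flower measurements) and labels (species) as arrays.
    X, y = load_iris(return_X_y=True)
    # Hold out part of the data so the score reflects unseen samples.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )
    # max_iter is raised so the solver has room to converge.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


if __name__ == "__main__":
    print("held-out accuracy:", fit_and_score())
```

The fit / predict / score pattern is the same for most scikit-learn estimators, so swapping in a different model usually means changing only the line that constructs it.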
For analyzing natural language, behold the Natural Language Toolkit (NLTK). There is also Orange, an open source package, partly written in Python, that provides useful tools and libraries for data mining, visualization, machine learning, etc. I am not aware of any free online courses that use these last two packages, but head over to KDnuggets to find out. It’s a website oriented towards data science, big data and data mining that has far more information than I’m able to explain here. Update: for 2017, an Oxford class on deep learning and NLP has been made available to all on GitHub.
Exploration and Environment
Another useful tool is the IPython Notebook. If you don’t like the IDLE interactive shell, try this. It has a useful web-based environment that saves sessions as JSON files. It’s great for keeping track of commands and putting together scripts. Also, I’ve heard it makes work done in Python easier to share with others who don’t have Python installed on their machine (if you want to email a colleague an example of your progress on a model and they’re not at work, for instance). As always, it is free to use.
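Because a saved notebook is just JSON, you can pull your commands back out of it with a few lines of standard-library Python. The sketch below assumes the version 4 notebook layout (a top-level "cells" list); older notebook formats may be organized differently. The notebook content here is a hand-written stand-in, and code_cells is a hypothetical helper name.

```python
import json

# A hand-written stand-in for a saved notebook; real files carry more metadata.
NOTEBOOK_TEXT = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Notes"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 1,
            "outputs": [],
            "source": ["x = 2\n", "print(x * 21)"],
        },
    ],
})


def code_cells(notebook_text):
    """Return the source of every code cell as a list of strings."""
    notebook = json.loads(notebook_text)
    sources = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and other non-code cells
        source = cell.get("source", "")
        # The source may be stored as one string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        sources.append(source)
    return sources


if __name__ == "__main__":
    for number, source in enumerate(code_cells(NOTEBOOK_TEXT), start=1):
        print("--- code cell", number, "---")
        print(source)
```

To try this on a real notebook, read the file's contents into a string and pass that to code_cells in place of NOTEBOOK_TEXT.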
Between R and Python, there are plenty of tools to aid your data exploration. I didn’t even cover popular websites for R, like r-bloggers.com. If you have any questions, or suggestions for things I should add to this post, please comment or email me! I’d love to hear from members of the community. Good luck!