Data Analytics - Which programming language to learn. R vs Python
By: Ashley J in data-science Tutorials on 2017-09-24
According to an IBM report, openings for data analytics jobs in the US will rise to 2.72 million by 2020. Yet quite a number of people still crunch numbers in spreadsheets such as Excel or Google Sheets, and others use proprietary statistical software such as SAS, Stata or SPSS.
While Excel and SAS are powerful tools, they have their limitations. For example, Excel cannot handle data sets above a certain size (a worksheet holds at most 1,048,576 rows). Tools like SAS are closed source, so there are no outside contributors who can add newer features to them. This leaves a big gap for people who want to do complex analytics and customize it to their needs. For those who have reached the limits of these programs, the next step is to learn R or Python.
Data analysts and data scientists use R and Python extensively, and both languages are open source. For anyone interested in machine learning, working with large datasets, or creating complex data visualizations, R and Python come in handy. R is geared more toward statistical analysis, while Python is geared more toward general purpose programming.
People often ask which one is better to learn, R or Python. Python is better for data manipulation and repeated tasks, while R is good for ad hoc analysis and exploring datasets. Take text analysis, for example, where you want to deconstruct paragraphs into words or phrases and then identify patterns. R is well suited to this use case and makes it simple. On the other hand, if the job runs from pulling the data, to running automated analyses over and over, to producing visualizations such as maps and charts from the results, then Python is better suited.
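To give a sense of what the text analysis task above involves, here is a minimal sketch that splits a paragraph into words and counts recurring words and two-word phrases. It is written in Python purely for illustration (the same steps can be written in R), it uses only the standard library, and the sample paragraph and the helper names `tokenize` and `top_phrases` are made up for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def top_phrases(text, n=2, k=3):
    """Return the k most common n-word phrases in the text."""
    words = tokenize(text)
    # Slide a window of n consecutive words across the token list.
    phrases = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(p) for p in phrases).most_common(k)


# Hypothetical sample paragraph, invented for illustration.
paragraph = (
    "Open source tools are popular. Open source tools are free, "
    "and open source communities add features quickly."
)

word_counts = Counter(tokenize(paragraph))        # single-word frequencies
common_pairs = top_phrases(paragraph, n=2, k=2)   # most frequent word pairs

print(word_counts.most_common(3))
print(common_pairs)
```

Real text analysis usually adds steps such as removing stop words or stemming, which this sketch leaves out.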
As for the learning curve, Python is relatively easy to learn, whereas R may be a bit intimidating for beginners. Another advantage of Python is that it is a general purpose programming language, which makes it useful for tasks other than analytics. Python feels like a conventional programming language and suits programmers, while R is more of a statistical language and may be confusing for some.
But for data analysis, the differences between R and Python are starting to diminish. Most of the common tasks once associated with one language or the other are now doable in both, so choosing between them is largely a matter of personal preference. As you can see, Python and R both have their pros and cons. Selecting one over the other will depend on the use cases, the cost of learning, and the other tools you need to work with.
When to use R?
R is mainly used when the data analysis task requires standalone computing or analysis on individual servers. When getting started with R, a good first step is to install the RStudio IDE. To get started quickly, you can use the following popular packages:
- dplyr, plyr and data.table to easily manipulate data,
- stringr to manipulate strings,
- zoo to work with regular and irregular time series,
- ggvis, lattice, and ggplot2 to visualize data, and
- caret for machine learning
When to use Python?
If you need to integrate data analysis tasks with web apps or if statistics code needs to be incorporated into a production database then you should probably use Python. Being a full-fledged programming language, it's a great tool to implement algorithms for production use.
Being a general purpose language, Python did not have many data analysis packages in the past, but this has improved significantly over the years. To get started with Python for data analytics, install NumPy and SciPy (scientific computing) and pandas (data manipulation). Also have a look at matplotlib for graphics and scikit-learn for machine learning.
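As a rough illustration of how NumPy and pandas fit together, the sketch below builds a small table, derives a column, and aggregates it by group. It assumes both packages are installed, and the sales figures and column names are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sales records, invented purely for illustration.
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "units": [10, 4, 6, 8],
    "price": [2.5, 3.0, 2.5, 3.0],
})

# Vectorized arithmetic: one expression computes the whole column.
sales["revenue"] = sales["units"] * sales["price"]

# pandas handles the grouping and aggregation.
revenue_by_region = sales.groupby("region")["revenue"].sum()

# NumPy gives a quick numeric summary of a raw column.
mean_units = float(np.mean(sales["units"].to_numpy()))

print(revenue_by_region)
print(mean_units)
```

In practice the table would usually be loaded from a file or database rather than typed in by hand.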
As for a Python IDE, have a look at Spyder, IPython Notebook and Rodeo to see which one best fits your needs.
Given this background, a growing group of people use a combination of both languages, picking whichever is appropriate for the task. If you're planning to start a career in data science, you will be well served by knowing both languages. Job trends indicate an increasing demand for both skills, and wages are well above average.
R: Pros and Cons
Pro: A picture says more than a thousand words
Visualized data can often be understood more efficiently and effectively than the raw numbers alone. R and visualization are a perfect match. Some must-see visualization packages are ggplot2, ggvis, googleVis and rCharts.
Pro: R ecosystem
R has a rich ecosystem of cutting-edge packages and an active community. Packages are available on CRAN, Bioconductor and GitHub. You can search through all R packages at Rdocumentation.
Pro: R is the lingua franca of data science
R is developed by statisticians for statisticians. They can communicate ideas and concepts through R code and packages, so you don't necessarily need a computer science background to get started.
Furthermore, it is increasingly adopted outside of academia.
Pro/Con: R is slow
R was developed to make the life of statisticians easier, not the life of your computer. Although R can be experienced as slow due to poorly written code, there are multiple projects that aim to improve R's performance: pqR, Renjin, FastR, Riposte and many more.
Con: R has a steep learning curve
R's learning curve is non-trivial, especially if you are used to a GUI for your statistical analysis. Even finding packages can be time consuming if you're not familiar with the ecosystem.
Python: Pros and Cons
Pro: IPython Notebook
The IPython Notebook makes it easier to work with Python and data. You can easily share notebooks with colleagues without them having to install anything. This drastically reduces the overhead of organizing code, output and notes files, which allows you to spend more time doing real work.
Pro: A general purpose language
Python is a general purpose language that is easy and intuitive. This gives it a relatively flat learning curve, and it increases the speed at which you can write a program. In short, you need less time to code and you have more time to play around with it!
Furthermore, Python ships with a built-in, low-barrier-to-entry testing framework that encourages good test coverage. This helps keep your code reusable and dependable.
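To show how low the barrier to the built-in testing framework is, here is a small sketch that assumes the framework meant here is the standard library's `unittest` module. The `mean` function and the test names are hypothetical and exist only for this example.

```python
import unittest


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("mean() needs at least one value")
    return sum(values) / len(values)


class MeanTest(unittest.TestCase):
    def test_typical_values(self):
        self.assertEqual(mean([1, 2, 3, 4]), 2.5)

    def test_empty_input_is_rejected(self):
        with self.assertRaises(ValueError):
            mean([])


# Load and run the two tests explicitly so the script keeps running afterwards.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(MeanTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a real project the tests would normally live in their own file and be run with `python -m unittest`.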
Pro: A multi purpose language
Python brings people with different backgrounds together. Because it is a common, easy to understand language that is known by programmers and can easily be learnt by statisticians, you can build a single tool that integrates with every part of your workflow.
Pro/Con: Visualizations
Visualizations are an important criterion when choosing data analysis software. Although Python has some nice visualization libraries, such as Seaborn, Bokeh and Pygal, there are maybe too many options to choose from. Moreover, compared to R, visualizations are usually more convoluted, and the results are not always so pleasing to the eye.
Con: Python is a challenger
Python is a challenger to R. It does not offer an alternative to the hundreds of essential R packages. It is, however, catching up.