‘Big’ Data Science and Scientists
‘BIG’ DATA SCIENCE
If you could travel back in time and tell people that a child today can interact with others anywhere in the world and query trillions of data records across the globe with a single click on a computer, they would have called it science fiction!
Today more than 2.9 million emails are sent across the internet every second. Households consume 375 megabytes of data each day. Google processes 24 petabytes of data per day. That is a lot of data! With each click, like and share, the world’s data pool is expanding faster than we can comprehend. Data is being created every minute of every day without us even noticing it. Businesses today pay attention to scores of data sources to make crucial decisions about the future. The rise of digital and mobile communication has made the world more connected, networked and traceable, which has resulted in the availability of such large-scale data sets.
So what is the buzzword “Big Data” all about? Big data may be defined as data sets whose size is beyond the ability of typical database software tools to capture, create, manage and process. The definition can differ by sector, depending on what kinds of software tools are commonly available and what sizes of data sets are common in a particular industry.
The explosion in digital data, bandwidth and processing power, combined with new tools for analyzing the data, has sparked massive interest in the emerging field of data science. Big data has now reached every sector of the global economy and has become an integral part of solving the world’s problems. It allows companies to know more about their customers, their products and their own infrastructure. More recently, attention has turned to the monetization of that data.
According to a McKinsey Global Institute report[1] in 2011, simply making big data more easily accessible to relevant stakeholders in a timely manner can create enormous value. For example, in the public sector, making relevant data more easily accessible across otherwise separated departments can sharply cut search and processing time. Big data also allows organizations to create highly specific customer segmentations and to tailor products and services precisely to the needs of each segment. This approach is widely known in marketing and risk management but can be revolutionary elsewhere.
Big data is improving transportation and power consumption in cities, making our favorite websites and social networks more efficient, and even helping to prevent suicides. Businesses are collecting more data than they know what to do with. Big data is everywhere; the volume of data produced, saved and mined is startling. Today, companies use data collection and analysis to formulate more cogent business strategies. Manufacturers use data obtained from products in actual use to improve existing products, develop new ones and create innovative after-sale service offerings. This will continue to be an emerging area for all industries. Data has become a competitive advantage and a necessary part of product development.
Companies succeed in the big data era not simply because they have more or better data, but because they have good teams that set clear objectives and define what success looks like by asking the right questions. Big data is also creating new growth opportunities and entirely new categories of companies, such as those that collect and analyze industrial data.
One of the most impressive areas in which big data is being put to work is machine learning. Machine learning can be defined as the study of computer algorithms that improve automatically through experience. It is a branch of artificial intelligence, which is itself a branch of computer science. Applications range from data mining programs that discover general rules in large data sets to information filtering systems that automatically learn a user’s interests.
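To make “improving through experience” concrete, here is a minimal sketch of a perceptron, one of the simplest learning algorithms. It is only an illustration: the six data points, the function names and the number of passes are assumptions chosen for this example, not something taken from a real application. The learner starts out knowing nothing, corrects itself each time it misclassifies a point, and soon stops making mistakes on the examples it has seen.

```python
# Toy perceptron: a learner that improves as it sees more examples.
# The data set and all names below are illustrative assumptions.

def train_perceptron(examples, epochs=3):
    """Train on 2-D points labelled +1 or -1.

    Returns (weights, bias, mistakes_per_epoch).
    """
    weights = [0.0, 0.0]
    bias = 0.0
    mistakes_per_epoch = []
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), label in examples:
            activation = weights[0] * x1 + weights[1] * x2 + bias
            # A non-positive product means the point is on the wrong side
            # (or exactly on the boundary), so nudge the boundary toward it.
            if label * activation <= 0:
                weights[0] += label * x1
                weights[1] += label * x2
                bias += label
                mistakes += 1
        mistakes_per_epoch.append(mistakes)
    return weights, bias, mistakes_per_epoch


def predict(weights, bias, point):
    """Classify a 2-D point as +1 or -1 with the learned boundary."""
    score = weights[0] * point[0] + weights[1] * point[1] + bias
    return 1 if score > 0 else -1


# Hypothetical, linearly separable training data.
EXAMPLES = [
    ((2, 1), 1), ((1, 3), 1), ((3, 3), 1),
    ((-1, -2), -1), ((-2, -1), -1), ((-3, -2), -1),
]

weights, bias, mistakes = train_perceptron(EXAMPLES)
print("mistakes per pass over the data:", mistakes)
```

On this toy data the number of mistakes drops to zero after the first pass, which is the “experience” at work. Real data sets are far larger and noisier, and practical systems use more robust algorithms than this one.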
Rising alongside the relatively new technology of big data is a new job title: data scientist. An article by Thomas H. Davenport and D.J. Patil in Harvard Business Review[2] describes ‘Data Scientist’ as the ‘Sexiest Job of the 21st Century’. The logic is that a career becomes “sexy” when demand for your skills exceeds supply, allowing you to command a sizable paycheck and options. The article compares these “data scientists” to the quants of the 1980s and 1990s on Wall Street, who pioneered “financial engineering” and algorithmic trading. The need for data experts is growing, and demand is on track to hit unprecedented levels in the next five years.
Who are Data Scientists?
Data scientists are people who know how to ask the right questions to get the most value out of massive volumes of data. Put another way, a data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.
Good data scientists do not just address business problems; they choose the problems that have the most value to the organization. They combine the analytical capabilities of a scientist or an engineer with the business acumen of an enterprise executive.
Data scientists have changed, and keep changing, the way things work. They integrate big data technology into both IT departments and business functions. Data scientists must also understand the business applications of big data and how it will affect the organization, and they must be able to communicate with both IT and business management. The best data scientists are comfortable speaking the language of business and helping companies reformulate their challenges.
Because of its interdisciplinary nature, data science sits at the intersection of three abilities: hacking skills, math and statistics knowledge, and substantive expertise in a scientific field. Hacking skills are necessary for working with the massive amounts of electronic data that must be acquired, cleaned and manipulated. Math and statistics knowledge allows a data scientist to choose appropriate methods and tools in order to extract insight from data. Substantive expertise in a scientific field is crucial for generating motivating questions and hypotheses and for interpreting results. Traditional research lies at the intersection of math and statistics knowledge with substantive expertise in a scientific field. Machine learning stems from combining hacking skills with math and statistics knowledge, but does not require scientific motivation. Science is about discovery and building knowledge, which requires motivating questions about the world and hypotheses that can be brought to data and tested with statistical methods. Hacking skills combined with substantive scientific expertise, but without rigorous methods, can produce incorrect analysis.
A good scientist can understand the current state of a field, pick challenging questions where success will actually lead to useful new knowledge, and push that field further through their work.
How to become a Data Scientist?
No university programs in India have yet been designed to develop data scientists, so recruiting them requires creativity. You cannot become a big data scientist overnight, and learning data science can be really hard. Data scientists need to know how to code and should be comfortable with mathematics and statistics. They need to know machine learning and software engineering. They also need to know how to organize large data sets and use visualization tools and techniques.
Data scientists need to know how to code in SAS, SPSS, Python or R. Statistical Package for the Social Sciences (SPSS) is a software package, currently developed by IBM, that is widely used for statistical analysis in the social sciences. Statistical Analysis System (SAS) is a software suite developed by SAS Institute that is mainly used for advanced analytics, where it holds the largest market share. Python is a high-level programming language and the one most commonly used by the data science community. Finally, R is a free programming language for statistical computing and graphics. R has become a de facto standard among statisticians for developing statistical software and is widely used for data analysis. R is part of the GNU Project, a collaborative effort to develop free software.
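To give a flavor of what such coding looks like, here is a minimal sketch in Python that computes a few descriptive statistics using only the standard library’s statistics module. The data set and the names daily_visits and summarize are made up for this illustration; real analyses usually involve much larger data and dedicated libraries.

```python
import statistics

# Hypothetical sample: number of site visits recorded on eight days.
daily_visits = [2, 4, 4, 4, 5, 5, 7, 9]


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "mean": statistics.mean(values),      # arithmetic average
        "median": statistics.median(values),  # middle value
        "pstdev": statistics.pstdev(values),  # population standard deviation
    }


summary = summarize(daily_visits)
for name, value in summary.items():
    print(name, "=", value)
```

The same summary takes only a line or two in R, SAS or SPSS as well; which tool to use is largely a matter of what an employer or team already works with.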
A few online courses can help you learn some of the main coding languages. One such course is currently available through the popular MOOC website coursera.org. A specialization offered by Johns Hopkins University through Coursera helps you learn R programming, data visualization, machine learning and the development of data products. A few more courses available through Coursera can also help you learn data science. Udacity is another popular MOOC website that offers courses on data science, machine learning and statistics. Codecademy also offers courses for learning data science and coding in Python.
When you start operating with data at the scale of the web, the fundamental approach and process of analysis must change. Most data scientists are working on problems that cannot be run on a single machine. They have large data sets that require distributed processing. Hadoop is an open-source software framework for the storage and large-scale processing of data sets on clusters of commodity hardware. MapReduce is the programming paradigm that allows for massive scalability across the servers in a Hadoop cluster. Apache Spark is Hadoop’s speedy Swiss Army knife: a fast data analysis system that adds real-time data processing to Hadoop. A data scientist must also be able to work with unstructured data, whether it comes from social media, video or even audio.
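The MapReduce idea mentioned above can be imitated on a single machine in a few lines of Python. The sketch below is only an illustration of the paradigm, not Hadoop’s actual API: the function names map_phase, shuffle_phase, reduce_phase and word_count, and the two sample documents, are assumptions made for this example. On a real cluster the map and reduce steps would run in parallel on many servers, and the framework would handle the grouping step in between.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three steps in sequence; a cluster would distribute them."""
    mapped = []
    for document in documents:
        mapped.extend(map_phase(document))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: each string stands in for a file stored on the cluster.
counts = word_count(["big data is big", "data science is fun"])
print(counts)
```

Because each document is mapped independently and each word is reduced independently, the work can be split across as many machines as are available, which is where the scalability comes from.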
KDnuggets is a website popular among data scientists that focuses mainly on the latest updates and news in the fields of business analytics, data mining and data science. KDnuggets also offers a free data mining course: the teaching modules for a one-semester introductory course on data mining, suitable for advanced undergraduates or first-year graduate students.
Kaggle is a platform for predictive modeling and analytics competitions, on which companies and researchers post their data and statisticians and data miners from all over the world compete to produce the best models. Kaggle hosts many data science competitions where you can practice, test your skills with complex, real-world data and tackle actual business problems. Many employers take Kaggle rankings seriously, as they can be seen as pertinent, hands-on project work. Kaggle aims to make data science a sport.
Finally, to be a data scientist you will need a good understanding of the industry you are working in and of the business problems your company is trying to solve. Being able to discern which problems are crucial for the business to solve is critical, as is identifying new ways in which the business should be leveraging its data.
A study by Burtch Works[3] in April 2014 found that data scientists earn a median salary that can be up to 40% higher than that of other big data professionals at the same job level. Data scientists have a median of nine years of experience, compared with a median of 11 years for other big data professionals. More than one-third of data scientists are currently in the first five years of their careers. The gaming and technology industries pay data scientists higher salaries than all other industries.
LinkedIn, a popular business-oriented social networking website, named “statistical analysis and data mining” the top skill that got people hired in 2014. Data science has a bright future ahead: there will only be more data, and more need for people who can find meaning and value in that data. Despite the growing opportunity, demand for data scientists has outpaced the supply of talent and will continue to do so for the next five years.