Updated: May 3, 2020
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It is a "concept to unify statistics, data analysis, machine learning and their related methods" in order to "understand and analyze actual phenomena" with data.
Data science is a broad concept and can seem complex, but it becomes easier to grasp once it is broken down into its components. The components of data science include:
· Data collection
· Data wrangling
· Data mining and machine learning
· Data visualization
Data Collection
Data is the foundation of data science; it is the material on which all analyses are based. In the context of data science, there are two types of data: traditional data and big data. Traditional data is structured data stored in databases that analysts can manage from a single computer; it is in table format and contains numeric or text values. Big data is typically characterized by variety (numbers, text, images, audio, mobile data, etc.), velocity (retrieved and computed in real time), and volume (measured in terabytes, petabytes, or exabytes), and it is often distributed across a network of computers.
Data Wrangling
Raw data is rarely perfect and must be corrected before further processing. The process of extracting useful details by cleansing the data, filtering for relevant records, and enriching the result for further exploratory analysis is called data preparation or data ‘wrangling’. The quality of the final data depends on this step.
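The cleansing, filtering, and enriching steps described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records, the field names (`name`, `age`, `city`), the `min_age` threshold, and the `age_group` label are all hypothetical, and real projects would usually rely on a dedicated data-frame library rather than hand-written loops.

```python
# Hypothetical raw records; field names and values are made up for illustration.
raw_records = [
    {"name": " Alice ", "age": "34", "city": "Paris"},
    {"name": "Bob", "age": "", "city": "Berlin"},         # missing age
    {"name": "alice", "age": "34", "city": "Paris"},      # duplicate of the first row
    {"name": "Carol", "age": "forty", "city": "Madrid"},  # unparseable age
    {"name": "Dave", "age": "17", "city": "Rome"},
    {"name": "Erin", "age": "25", "city": "Oslo"},
]


def clean(record):
    """Cleanse one record: trim and normalize text, parse the age.

    Returns None when the age cannot be parsed as an integer.
    """
    try:
        age = int(record["age"])
    except ValueError:
        return None
    return {
        "name": record["name"].strip().title(),
        "age": age,
        "city": record["city"].strip(),
    }


def wrangle(records, min_age=18):
    """Cleanse, de-duplicate, filter, and enrich raw records."""
    seen = set()
    result = []
    for record in records:
        row = clean(record)
        if row is None:
            continue  # cleansing: drop rows that cannot be repaired
        key = (row["name"], row["age"], row["city"])
        if key in seen:
            continue  # cleansing: drop exact duplicates
        seen.add(key)
        if row["age"] < min_age:
            continue  # filtering: keep only the relevant rows
        # enriching: derive a new column from an existing one
        row["age_group"] = "30+" if row["age"] >= 30 else "under 30"
        result.append(row)
    return result


tidy = wrangle(raw_records)
for row in tidy:
    print(row)
```

Of the six hypothetical input rows, only the ones that survive every step reach `tidy`, which is why the quality of the final data depends so heavily on the rules chosen here.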
Data Mining and Machine Learning
Data mining is a cross-disciplinary field that focuses on discovering properties of data sets. There are different approaches to doing so, and machine learning is one of them. Machine learning is a subfield of data science that focuses on designing algorithms that can learn from data and make predictions on it. Machine learning can be used for data mining, but data mining can also use other techniques besides, or on top of, machine learning.
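To make the distinction concrete, here is a minimal sketch with two parts. The first discovers a property of a data set by simple counting, with no learning involved; the second fits a straight line by ordinary least squares and uses it to predict an unseen value. The baskets, the study-hours data, and the function names `fit_line` and `predict` are assumptions made up for this illustration, and least squares is just one simple stand-in for a learning algorithm.

```python
from collections import Counter
from itertools import combinations

# Part 1: data mining without machine learning (count co-occurring items).
# Hypothetical shopping baskets, made up for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "eggs"},
]

pair_counts = Counter()
for basket in baskets:
    # sorted() gives every pair a canonical order, e.g. ("bread", "milk")
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

frequent_pairs = sorted(pair for pair, count in pair_counts.items() if count >= 2)
print("Pairs bought together at least twice:", frequent_pairs)


# Part 2: machine learning (learn a line from data, then predict).
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training data: hours studied and the resulting test score.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 59, 61, 64]

model = fit_line(hours, scores)
print("Predicted score after 6 hours:", round(predict(model, 6), 1))
```

Both parts reveal something about the data, but only the second produces a model that generalizes to inputs it has not seen.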
Data Visualization
Data visualization is the presentation of data in a pictorial or graphical format. It enables decision makers to see analytics presented visually, so they can grasp difficult concepts or identify new patterns. Data visualization is a subset of data science: whereas data science produces insights, data visualization represents the data in a way that makes those insights easier to see.
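As a small illustration of how a graphical form makes a pattern easier to spot than a list of numbers, the sketch below draws a text bar chart using only the standard library. The monthly figures and the `bar_chart` helper are hypothetical; in practice a plotting library would normally be used instead.

```python
def bar_chart(counts, width=30):
    """Render a mapping of label -> non-negative number as a text bar chart.

    The largest value gets a bar of `width` characters; the other bars
    are scaled proportionally.
    """
    largest = max(counts.values())
    label_width = max(len(label) for label in counts)
    lines = []
    for label, value in counts.items():
        bar = "#" * round(value * width / largest)
        lines.append(f"{label.ljust(label_width)} | {bar} {value}")
    return "\n".join(lines)


# Hypothetical monthly sales figures, made up for illustration.
monthly_sales = {"Jan": 12, "Feb": 30, "Mar": 21}

chart = bar_chart(monthly_sales)
print(chart)
```

Even in this crude form, the relative bar lengths show at a glance which month stands out, which is the point of presenting the numbers visually.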
The Role of a Data Scientist
A data scientist is generally someone who knows how to extract meaning from and interpret data, which requires tools and methods from statistics and machine learning, as well as human judgment. A data scientist spends a lot of time collecting, cleaning, and wrangling data, because there is a lot of data and it is rarely clean. Data scientists can be involved in all aspects and components of data science.