Introduction to Data Science

  1. Why Data Science?
  2. What is Data Science?
  3. What is the Data Science Process?
  4. What Kinds of Data Can Be Analyzed?
  5. What Kinds of Patterns Can Be Analyzed?
  6. Which Technologies are Used?
  7. Which Kinds of Applications Are Targeted?
  8. Major issues in Data Science
  9. Data Science and Society

Why Data Science?

We live in a world where vast amounts of data are collected daily. Analyzing such data to discover knowledge is an important need.

We live in the information age

"We live in the information age" is a popular saying, and it is also a fact. Every day, terabytes or petabytes of data pour into our computer networks, the World Wide Web (WWW), and various data storage devices from business, society, science and engineering, medicine, and almost every other aspect of daily life. Powerful and versatile tools are badly needed to automatically uncover valuable information in these enormous quantities of data and to transform it into organized knowledge.

What is Data Science?

Data science is a set of fundamental principles that support and guide the principled extraction of information and knowledge from data [1]. The concept most closely associated with data science is probably data mining, the actual extraction of knowledge from data through techniques that embody these principles. However, data science involves much more than data mining algorithms. Many businesses have differentiated themselves strategically through data science, sometimes to the point of evolving into data mining businesses. Successful data scientists must be able to view business problems from a data perspective. Much of what has traditionally been studied in the field of statistics is also fundamental to data science.

What is the Data Science Process?

According to Fayyad [3], data mining is one stage of the KDD (Knowledge Discovery in Databases) process, the stage primarily concerned with the means by which patterns are extracted and enumerated from data. CRISP-DM [2] was created through the efforts of a consortium originally made up of DaimlerChrysler (later Daimler AG, an automotive company), SPSS and NCR. CRISP-DM stands for CRoss-Industry Standard Process for Data Mining. It consists of a cycle that comprises six stages:

  1. Business understanding – This initial stage focuses on understanding the project goals and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan for achieving the goals.
  2. Data understanding – This stage begins with initial data collection and proceeds with activities to get acquainted with the data, identify data quality issues, gain first insights into the data, or detect interesting subsets to discover hidden information.
  3. Data preparation – This stage covers all activities needed to build the final data set from the initial raw data.
  4. Modeling – Various modeling techniques are selected and applied in this stage, and their parameters are calibrated to optimal values.
  5. Evaluation – The model (or models) obtained at this point are evaluated more thoroughly, and the steps taken to build them are reviewed to make sure the business goals are properly achieved.
  6. Deployment – Creating the model is generally not the end of the project. Even if the model’s purpose is to increase knowledge of the data, the knowledge gained will need to be organized and presented so that the client can use it.
CRISP-DM Framework

What Kinds of Data Can Be Analyzed? [5]

As a general technology, data analysis can be applied to any kind of data as long as the data are meaningful for a target application. Database data are the most fundamental form of data for analytical applications. A traditional DBMS (Database Management System) provides the relational database, which ensures data integrity and transaction consistency. In recent years, however, NoSQL (Not Only SQL) databases have emerged as an alternative, driven by the velocity of data growth and the limited support that traditional databases offer for it.

Relational Database

A database system consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data. A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large number of tuples (records or rows). A semantic data model, such as an entity-relationship (ER) data model, is often constructed for relational databases.
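
To make the terms table, attribute and tuple concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customer and purchase tables, their columns and the sample rows are assumptions invented for illustration; they do not come from any real system.

```python
import sqlite3

# In-memory database; table names, columns and rows are hypothetical examples.
conn = sqlite3.connect(":memory:")

# Each table has a unique name and a fixed set of attributes (columns).
conn.execute(
    "CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
)
conn.execute(
    "CREATE TABLE purchase ("
    "purchase_id INTEGER PRIMARY KEY, "
    "cust_id INTEGER NOT NULL REFERENCES customer(cust_id), "
    "amount REAL NOT NULL)"
)

# Each inserted row is one tuple (record) of the table.
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Ana", "Lisbon"), (2, "Ben", "Oslo")],
)
conn.executemany(
    "INSERT INTO purchase VALUES (?, ?, ?)",
    [(10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.5)],
)
conn.commit()

# A query joins the two tables through the shared cust_id attribute.
rows = conn.execute(
    "SELECT c.name, COUNT(*), SUM(p.amount) "
    "FROM customer AS c JOIN purchase AS p ON p.cust_id = c.cust_id "
    "GROUP BY c.cust_id ORDER BY c.name"
).fetchall()

for name, n_purchases, total in rows:
    print(name, n_purchases, total)
```

The cust_id column in purchase plays the role of a relationship in an ER model: it links each purchase to exactly one customer.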

NoSQL Database

The rapid growth in the amount of data and the difficulty of altering the database schema as platforms evolve were the first issues that motivated the development of NoSQL databases. Most NoSQL systems are distributed databases or distributed data stores that focus on high performance, high availability, data replication and scalability, as opposed to an emphasis on strong consistency, powerful query languages and structured data storage. NoSQL databases can be classified by how they store data into four types: key-value, column-based, document-based, and graph databases.
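
The four storage models can be illustrated roughly with plain Python structures. This is only a sketch of the data shapes, not of any real NoSQL product: there is no distribution, replication or persistence here, and every key, field and value is a made-up example.

```python
# Plain Python structures standing in for the four NoSQL data models.
# All keys, fields and values below are hypothetical sample data.

# Key-value: the store sees an opaque value and only supports lookup by key.
key_value_store = {"user:42": '{"name": "Ana", "city": "Lisbon"}'}

# Document: the store understands the nested structure and can query inside it.
document_store = [
    {"_id": 42, "name": "Ana", "orders": [{"sku": "A1", "qty": 2}]},
    {"_id": 43, "name": "Ben", "orders": []},
]

# Column-based: row key -> column family -> column -> value.
# Different rows may carry different columns.
column_store = {
    "user:42": {
        "profile": {"name": "Ana", "city": "Lisbon"},
        "activity": {"last_login": "2019-07-01"},
    },
    "user:43": {"profile": {"name": "Ben"}},
}

# Graph: nodes connected by labeled edges.
graph_edges = [("Ana", "FOLLOWS", "Ben"), ("Ben", "FOLLOWS", "Cy")]


def customers_who_ordered(sku):
    """Query inside documents: names of customers with an order for this SKU."""
    return [
        doc["name"]
        for doc in document_store
        if any(order["sku"] == sku for order in doc["orders"])
    ]


def neighbors(node, relation):
    """Follow outgoing edges with the given label from one node."""
    return [dst for src, rel, dst in graph_edges if src == node and rel == relation]


print(key_value_store["user:42"])
print(customers_who_ordered("A1"))
print(column_store["user:43"]["profile"])
print(neighbors("Ana", "FOLLOWS"))
```

Note that none of the four structures requires a schema to be declared up front, which is the flexibility the paragraph above refers to.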

What Kinds of Patterns Can Be Analyzed?

Many types of patterns can be analyzed, such as characterization and discrimination, frequent patterns, associations and correlations, classification and regression, clustering analysis, and outlier analysis. Such analytics can generally be divided into two categories: descriptive and predictive. Descriptive analytics characterizes the properties of the data in a target data set. Predictive analytics performs induction on current data in order to make predictions. Interesting patterns represent knowledge.

Data characterization is a summary of the overall characteristics or features of a target class of data. A query typically collects the data that corresponds to the user-specified class.
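
As a small sketch of data characterization, the code below first selects the records of a user-specified class (the "query" step) and then summarizes them. The customer records, the segment field and the chosen summary statistics are assumptions made for illustration.

```python
from statistics import mean, median

# Hypothetical customer records; "segment" plays the role of the class.
customers = [
    {"segment": "premium", "age": 41, "spend": 320.0},
    {"segment": "basic", "age": 23, "spend": 40.0},
    {"segment": "premium", "age": 35, "spend": 280.0},
    {"segment": "basic", "age": 29, "spend": 55.0},
    {"segment": "premium", "age": 50, "spend": 410.0},
]


def characterize(records, target_segment):
    """Summarize the general features of one target class of records."""
    # Step 1: collect the data that corresponds to the user-specified class.
    selected = [r for r in records if r["segment"] == target_segment]
    if not selected:
        return {"count": 0}
    # Step 2: summarize the selected records.
    return {
        "count": len(selected),
        "mean_age": mean(r["age"] for r in selected),
        "median_spend": median(r["spend"] for r in selected),
    }


print(characterize(customers, "premium"))
```

Running the same function on a second class, for example "basic", and comparing the two summaries would be a simple form of data discrimination.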

Association analysis is a rule-based machine learning technique for discovering interesting relationships between variables in large databases. It aims to identify strong rules found in databases using measures of interestingness.
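
Two widely used measures for judging a rule such as {diapers} -> {beer} are support and confidence. The sketch below computes both for a handful of made-up shopping baskets; the transactions and the helper names are assumptions for illustration only.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset) / len(transactions)


def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)


rule_support = support({"diapers", "beer"})
rule_confidence = confidence({"diapers"}, {"beer"})
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

A rule is usually called strong when both values exceed user-chosen minimum thresholds; the thresholds themselves depend on the application.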

Classification is the process of finding a model (or function) that describes and distinguishes classes or concepts of data. The model is derived by analyzing a set of training data (i.e. data objects whose class labels are known). The model is then used to predict the class label of objects whose class label is unknown.
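
The idea of learning from labeled objects and then labeling new ones can be sketched with a nearest-neighbor rule. Nearest neighbor is just one simple classifier picked for brevity, not a method the text prescribes, and the training points and labels are invented.

```python
import math

# Hypothetical training data: (feature vector, known class label).
training = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((5.0, 5.0), "large"),
    ((6.0, 5.5), "large"),
]


def predict(point):
    """Return the label of the training object closest to the point."""
    _, label = min(training, key=lambda example: math.dist(example[0], point))
    return label


# Objects whose class label is unknown.
print(predict((1.2, 1.4)))
print(predict((5.5, 5.0)))
```

Here the "model" is simply the stored training set plus a distance function. Other classifiers, such as decision trees, compress the training data into explicit rules instead.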

Unlike classification and regression, which analyze class-labeled (training) data sets, clustering analyzes data objects without consulting class labels. Clustering can be used to generate class labels for a group of data.
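
As an illustration of grouping without labels, here is a minimal one-dimensional k-means sketch. K-means is one common clustering algorithm chosen for this example; the data values, the starting centroids and the fixed iteration count are all assumptions.

```python
def k_means_1d(values, initial_centroids, iterations=10):
    """Group numbers around centroids; no class labels are consulted."""
    centroids = list(initial_centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else old for c, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical unlabeled measurements with two visible groups.
data = [1.0, 1.2, 0.8, 8.0, 8.4, 7.6]
centroids, clusters = k_means_1d(data, initial_centroids=[0.0, 10.0])
print(centroids)
print(clusters)
```

The index of the cluster each value lands in can then serve as a generated class label for that value.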

Outlier analysis deals with the objects in a data set that do not comply with the general behavior or model of the data. Such objects are called outliers.
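
One simple way to flag such objects is a z-score rule: mark any value that lies more than a chosen number of standard deviations from the mean. The rule, the threshold of 2.0 and the sample readings below are assumptions for illustration; real analyses often prefer more robust methods.

```python
from statistics import mean, stdev


def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` sample standard
    deviations away from the mean."""
    center = mean(values)
    spread = stdev(values)
    if spread == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - center) / spread > threshold]


# Hypothetical sensor readings with one suspicious value.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
print(z_score_outliers(readings))
```

Because the outlier itself inflates both the mean and the standard deviation, this rule can miss outliers in very small samples, which is one reason to treat it as a first check only.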

Which Technologies are Used?

Statistical models are widely used to model data and data classes. For instance, in data analytics tasks such as data characterization and classification, statistical models of the target classes can be built. Machine learning investigates how computers can learn (or improve their performance) based on data. For example, a typical machine learning problem is to program a computer so that, after learning from a set of examples, it can automatically recognize handwritten postal codes on mail. Database systems and big data research focus on creating, maintaining and using databases for organizations and end users. In particular, this field has well-established principles in data models, query languages, query processing and optimization techniques, data storage, and indexing and access techniques. Database systems are often well known for their high scalability in processing very large, relatively structured data sets. Information retrieval is the science of searching for documents or for information in documents. Natural language processing is closely related to the information retrieval field.
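
To give a flavor of information retrieval, the sketch below builds an inverted index, a basic structure that maps each term to the documents containing it. The three documents are made up, and the whitespace tokenization and AND-only search are deliberate simplifications.

```python
from collections import defaultdict

# Hypothetical document collection: id -> text.
docs = {
    1: "data science extracts knowledge from data",
    2: "search engines index web documents",
    3: "data mining finds patterns in documents",
}


def build_index(documents):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


index = build_index(docs)
print(search(index, "data"))
print(search(index, "data documents"))
```

Looking up a term in the index avoids scanning every document at query time, which is what makes searching large collections practical.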

Which Kinds of Applications Are Targeted?

There are two major applications: business intelligence (BI) and web search engines. Business intelligence techniques provide historical, current and predictive views of business operations. Clearly, data analytics is the core of business intelligence. Web search engines are essentially very large data analytics applications. Search engines differ from web directories in that web directories are maintained by human editors, while search engines operate algorithmically or through a mixture of algorithmic and human input.

Major issues in Data Science

There are three major issues in data science: methodology, user interaction, and efficiency and scalability. Researchers have been vigorously developing new data science methodologies. These methodologies should also account for problems such as data uncertainty, noise and incompleteness. The user plays an important role in the data analytics process. Interesting research areas include how to interact with a data analytics system, how to incorporate a user's background knowledge into the analysis, and how to visualize and comprehend data analytics results. In addition, efficiency and scalability are always considered when comparing data analytics algorithms.

Data Science and Society

What is the effect of data science on society? What measures can data science take to preserve individual privacy? Do we use data science in our daily lives without even knowing it? These questions raise the following issues. Regarding social impact, it is essential to study the effect of data science on society as data science penetrates our everyday lives. How can we use data science technology to benefit society? How can we guard against its misuse? The improper disclosure or use of data and the potential violation of individual privacy and data protection rights are areas of concern that need to be addressed. Studies on privacy-preserving data publishing and data science are under way. Data science will assist with scientific discovery, business management, economic recovery and security (for example, the real-time discovery of intruders and cyberattacks).

[1] Provost, F., & Fawcett, T. (2013). Data science and its relationship to big data and data-driven decision making. Big Data, 1(1), 51-59.

[2] Azevedo, A. I. R. L., & Santos, M. F. (2008). KDD, SEMMA and CRISP-DM: a parallel overview. IADS-DM.

[3] Fayyad, U. M., et al. (1996). From data mining to knowledge discovery: an overview. In Fayyad, U. M., et al. (Eds.), Advances in knowledge discovery and data mining. AAAI Press / The MIT Press.

[4] Sahatqija, K., Ajdari, J., Zenuni, X., Raufi, B., & Ismaili, F. (2018). Comparison between relational and NOSQL databases (pp. 0216-0221). doi:10.23919/MIPRO.2018.8400041.

[5] Sahatqija, K., Ajdari, J., Zenuni, X., Raufi, B., & Ismaili, F. (2018, May). Comparison between relational and NOSQL databases. In 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO) (pp. 0216-0221). IEEE.
