Getting to Know Your Data

Before analyzing a data set, you will want answers to the following questions. What kinds of attributes or fields make up your data? What kinds of values does each attribute have? Which attributes are discrete and which are continuous-valued? What does the data look like? How are the values distributed? Can we visualize the data to get a better sense of it all? Can we find outliers? Can we measure the similarity of some data objects to others? The topics below help you gain this kind of insight into your data. Knowing your data is useful for data preprocessing, the first major task of data analysis.

Data Objects and Attribute Types

Data sets consist of data objects. If the data objects are stored in a database, they are data tuples. What is an attribute? An attribute is a data field that represents a characteristic or feature of a data object. An attribute vector (or feature vector) is the set of attributes used to describe a given object.

Nominal means “relating to names.” The values of a nominal attribute are symbols or names of things. Hair color and marital status are both nominal attributes.

A binary attribute is a nominal attribute with only two states, 0 or 1, where 0 typically means that the attribute is absent and 1 means that it is present. A binary attribute is symmetric if both of its states are equally valuable and carry the same weight; that is, there is no preference as to which outcome should be coded as 0 or 1. One example is the attribute gender with the states male and female. A binary attribute is asymmetric if the outcomes of its states are not equally important, such as the positive and negative outcomes of a medical test for HIV. By convention, we code the more important outcome, which is usually the rarer one, as 1 (e.g., HIV-positive) and the other as 0 (e.g., HIV-negative).

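Ordinal attributes. An ordinal attribute has values with a meaningful order or ranking, although the magnitude of the difference between successive values is not known. Ordinal attributes are useful for recording subjective assessments of qualities that cannot be measured objectively; therefore, they are often used for ratings in surveys.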

A numeric attribute is quantitative; that is, it is a measurable quantity represented in integer or real values. Numeric attributes can be interval-scaled or ratio-scaled. Interval-scaled attributes are measured on a scale of equal-size units. Temperature is an interval-scaled attribute: temperatures in Celsius and Fahrenheit do not have a true zero-point, that is, neither 0°C nor 0°F indicates “no temperature.” (On the Celsius scale, for example, the unit of measurement is 1/100 of the difference between the melting temperature and the boiling temperature of water at atmospheric pressure.) A ratio-scaled attribute is a numeric attribute with an inherent zero-point, so one value can be described as a multiple of another. Examples of ratio-scaled attributes include count attributes such as years of experience (where the objects are employees) and number of words (where the objects are documents).
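The difference between the two scales can be illustrated with a few lines of Python. This is only a sketch: the temperature readings and the years of experience are made-up values, and the Kelvin conversion is brought in here as an example of a temperature scale that does have a true zero-point.

```python
def celsius_to_kelvin(c: float) -> float:
    """Convert Celsius (interval-scaled) to Kelvin (has a true zero-point)."""
    return c + 273.15

# Two hypothetical temperature readings in degrees Celsius.
t1_c, t2_c = 10.0, 20.0

# Differences are meaningful on an interval scale.
diff_c = t2_c - t1_c

# The naive ratio on the Celsius scale suggests "twice as warm" ...
naive_ratio = t2_c / t1_c

# ... but on a scale with a true zero-point the ratio is only about 1.035.
true_ratio = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)

# Ratio-scaled counts, such as years of experience, support ratios directly.
years_a, years_b = 4, 8
experience_ratio = years_b / years_a  # employee b has twice the experience
```

The point of the sketch is that 20°C is 10 degrees warmer than 10°C, but it is not "twice as warm", whereas 8 years of experience really is twice 4 years.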

A discrete attribute has a finite or countably infinite set of values, which may or may not be represented as integers. An attribute is countably infinite if the set of possible values is infinite but the values can be put in one-to-one correspondence with the natural numbers. If an attribute is not discrete, it is continuous. In the literature, the terms numeric attribute and continuous attribute are often used interchangeably.

Basic Statistical Descriptions of Data

Basic statistical descriptions can be used to identify properties of the data and to highlight which data values should be treated as noise or outliers. To be successful in data preprocessing, an overall picture of your data is essential. Measures of central tendency include the mean, median, and mode. We now look at measures that assess the dispersion, or spread, of numeric data: the range, quantiles, quartiles, percentiles, and the interquartile range (IQR). The five-number summary of a distribution consists of the median (Q2), the quartiles Q1 and Q3, and the smallest and largest individual observations, written in the order Minimum, Q1, Median, Q3, Maximum. In a boxplot, the whiskers are extended to the extreme low and high observations only if these values lie less than 1.5 × IQR beyond the quartiles. Variance and standard deviation are also measures of data dispersion. They indicate how spread out a data distribution is.
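The measures above can be sketched with Python's standard statistics module. The sample values are made up for illustration. The choice of method="inclusive" for the quartiles is an assumption, since several quartile conventions exist and they can give slightly different results. The population variance (pvariance) is used here; statistics.variance would give the sample variance instead.

```python
import statistics

# Made-up sample values, sorted for readability.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Quartiles; "inclusive" treats the data as the full population.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
five_number_summary = (min(data), q1, median, q3, max(data))

# Interquartile range and the 1.5 x IQR fences used for boxplot whiskers.
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

# Values beyond the fences are flagged as potential outliers.
outliers = [x for x in data if x < low_fence or x > high_fence]

# Whiskers stop at the most extreme observations inside the fences.
inside = [x for x in data if low_fence <= x <= high_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Dispersion measures (population versions).
variance = statistics.pvariance(data)
std_dev = statistics.pstdev(data)
```

With these sample values the largest observation, 110, lies beyond the upper fence, so it is flagged as a potential outlier and the upper whisker stops at 70.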

Graphic Displays of Basic Statistical Descriptions of Data

Graphic displays of basic statistical descriptions include quantile plots, quantile-quantile plots, histograms, and scatter plots. Such graphs are helpful for the visual inspection of data, which is useful for data preprocessing.
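As a rough illustration of one of these displays, the sketch below builds an equal-width histogram and prints it as text. The helper name histogram, the sample values, the bin width of 20, and the starting point of 20 are all assumptions chosen for the example, and the helper assumes that every value is at least start.

```python
from collections import Counter

def histogram(values, bin_width, start):
    """Count values in equal-width bins [start + k*w, start + (k+1)*w).

    Assumes every value is >= start. Empty bins are kept with a count of 0.
    """
    counts = Counter((v - start) // bin_width for v in values)
    last_bin = max(counts)
    return {start + k * bin_width: counts.get(k, 0) for k in range(last_bin + 1)}

# Made-up sample values.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
bins = histogram(values, bin_width=20, start=20)

# Text rendering: one row per bin, one '#' per observation.
for left_edge, count in bins.items():
    print(f"{left_edge:>3} to {left_edge + 20:<3} {'#' * count}")
```

Keeping the empty bin between 80 and 100 visible makes the gap before the isolated value 110 easy to spot, which is the kind of visual inspection these plots are meant to support.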

Data Visualization

How can we convey data to users effectively? Data visualization aims to communicate data clearly and effectively through graphical representation. More popularly, we can use visualization techniques to discover data relationships that are not easily observable by looking at the raw data. Representative approaches include pixel-oriented techniques, geometric projection techniques, icon-based techniques, and hierarchical and graph-based techniques.

Measuring Data Similarity and Dissimilarity

A cluster is a collection of data objects such that the objects within a cluster are similar to one another and dissimilar to the objects in other clusters. The higher the dissimilarity value, the more the two objects differ.

Two data structures are widely used: the data matrix (used to store the data objects) and the dissimilarity matrix (used to store dissimilarity values for pairs of objects). Object dissimilarity can be computed for objects described by nominal attributes, binary attributes, numeric attributes, ordinal attributes, or combinations of these attribute types. Similarity measures also exist for very long and sparse data vectors, such as the term-frequency vectors that represent documents in information retrieval. Knowing how to quantify dissimilarity is useful in clustering, outlier analysis, and nearest-neighbor classification.
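The two data structures can be sketched in plain Python. The text does not name a specific measure, so the choices here are assumptions: Euclidean distance stands in as the dissimilarity measure for numeric attributes, and cosine similarity stands in as the measure for sparse term-frequency vectors. The data values and helper names are made up for illustration.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two numeric objects of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dissimilarity_matrix(data, dist):
    """n-by-n matrix of pairwise dissimilarities for the rows of a data matrix."""
    n = len(data)
    return [[dist(data[i], data[j]) for j in range(n)] for i in range(n)]

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors (assumes neither is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Data matrix: one row per object, one column per numeric attribute.
data_matrix = [[1.0, 2.0], [4.0, 6.0], [1.0, 2.0]]
d = dissimilarity_matrix(data_matrix, euclidean)

# Term-frequency vectors for two hypothetical documents.
doc1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
doc2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]
sim = cosine_similarity(doc1, doc2)
```

In this sketch d[0][1] is 5.0, while d[0][2] is 0.0 because the first and third objects are identical. The matrix is symmetric with zeros on the diagonal, so in practice only the lower triangle needs to be stored. The two documents have a cosine similarity of about 0.94, which indicates that they use terms in similar proportions.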
