Continuing from our last discussion, 'Measuring Data Similarity or Dissimilarity #1', in this post we are going to see how to calculate the similarity or dissimilarity between numeric data types.
2. For Numeric Attribute:
For measuring the dissimilarity between two numeric data points, the easiest and most widely used way is to calculate the 'Euclidean distance': the higher the distance, the higher the dissimilarity. There are two more distance measures, named 'Manhattan distance' and 'Minkowski distance'. We are going to look into these one by one.
a. Euclidean distance:
Euclidean distance is widely used to calculate the dissimilarity between numeric data points. It is derived from the 'Pythagorean theorem', so it is also known as the 'Pythagorean metric' or `L^2` norm. The Euclidean distance between two points `p(x_1, y_1)` and `q(x_2, y_2)` is the length of the straight line segment that connects point p to point q.
`dis(p,q) = dis(q,p) = \sqrt((x_2 - x_1)^2 + (y_2 - y_1)^2) = \sqrt(\sum_(i=1)^N(q_i - p_i)^2)`
In One Dimension:
`dis(p,q) = dis(q,p) = \sqrt((q - p)^2) = |q - p|`
In Two Dimensions:
`dis(p,q) = dis(q,p) = \sqrt((q_1 - p_1)^2 + (q_2 - p_2)^2)`
In Three Dimensions:
`dis(p,q) = dis(q,p) = \sqrt((q_1 - p_1)^2 + (q_2 - p_2)^2 + (q_3 - p_3)^2)`
In N Dimensions:
`dis(p,q) = dis(q,p) = \sqrt((q_1 - p_1)^2 + (q_2 - p_2)^2 + (q_3 - p_3)^2 + ... + (q_N - p_N)^2)`
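The N-dimensional formula can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the function name `euclidean_distance` and the sample points are assumptions made for this example, and points are represented as plain tuples of numbers.

```python
import math


def euclidean_distance(p, q):
    """Euclidean (L2) distance between two points of equal dimension.

    `p` and `q` are sequences of numbers; the name and signature are
    illustrative assumptions, not taken from any particular library.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of dimensions")
    # Sum the squared coordinate differences, then take the square root.
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


print(euclidean_distance((7,), (3,)))      # one dimension: 4.0
print(euclidean_distance((1, 2), (4, 6)))  # two dimensions: 5.0
```

Note that the one-dimensional case returns 4.0 even though q - p is negative, because the distance is the absolute difference.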
b. Manhattan distance:
`dis(p, q) = |(x_2 - x_1)| + |(y_2 - y_1)| = \sum_(i=1)^N|(q_i - p_i)|`
Manhattan distance is the sum of the absolute differences along each dimension. It is also known as the `L^1` norm.
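As a rough sketch, the same sum can be written in Python as follows. The function name `manhattan_distance` and the sample points are assumptions chosen for illustration only.

```python
def manhattan_distance(p, q):
    """Manhattan (L1) distance between two points of equal dimension.

    The name and signature are illustrative assumptions.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of dimensions")
    # Add up the absolute coordinate differences, one per dimension.
    return sum(abs(qi - pi) for pi, qi in zip(p, q))


print(manhattan_distance((1, 2), (4, 6)))  # |4 - 1| + |6 - 2| = 7
```

For the same pair of points the Euclidean distance is 5, so the Manhattan distance is never smaller than the straight-line distance.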
c. Minkowski distance:
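This is the generalized form of the Euclidean and Manhattan distances and is represented as:
`dis(p, q) = (\sum_(i=1)^N|(q_i - p_i)|^n)^(1/n)`
where n = 1, 2, 3, ... With n = 1 it reduces to the Manhattan distance, and with n = 2 to the Euclidean distance.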
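A minimal Python sketch, assuming the usual definition of the Minkowski distance of order n (the n-th root of the summed n-th powers of the absolute coordinate differences). The function name `minkowski_distance` and its parameter names are assumptions for this example.

```python
def minkowski_distance(p, q, n):
    """Minkowski distance of order n between two points of equal dimension.

    The name and signature are illustrative assumptions; `n` is the
    order of the distance (n = 1, 2, 3, ...), not the number of dimensions.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of dimensions")
    if n < 1:
        raise ValueError("the order n must be at least 1")
    # Raise each absolute difference to the n-th power, sum, take the n-th root.
    return sum(abs(qi - pi) ** n for pi, qi in zip(p, q)) ** (1 / n)


print(minkowski_distance((1, 2), (4, 6), 1))  # same as Manhattan: 7.0
print(minkowski_distance((1, 2), (4, 6), 2))  # same as Euclidean: 5.0
```

Choosing a single function with an order parameter makes it easy to compare how the different distances rank the same pairs of points.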