PAPER CODE NO. COMP229
EXAMINER: Dr Olga Anosova, Tel. No. 0770 312 21 68
DEPARTMENT: Computer Science
FIRST SEMESTER EXAMINATIONS 2023/24
Introduction to Data Science
TIME ALLOWED : TWO Hours
INSTRUCTIONS TO CANDIDATES
Answer ALL questions. The exam is worth 100 marks (20 marks for each question). Exam weight
in the total module mark is 70%.
Calculators are NOT permitted.
PAPER CODE COMP229 page 1 of 5 Continued
1. (a) Give the definition of a metric. Provide an example of two functions d1, d2 on pairs of real
numbers, such that d1 is not a metric and d2 is a metric. Justify your answer by checking
the definition of a metric.
(b) Give the definitions of the metrics $L_1$, $L_2$ and $L_\infty$ on Euclidean space $\mathbb{R}^n$ for $n \geq 1$. Use
these definitions to compute all three distances between a place in Nottingham with
Decimal Degrees (DD) coordinates $(53, -1)$ and a place near Paris with DD coordinates
$(49, 2)$.
2. (a) The naive attempt to guess the perimeter of a polygon in $\mathbb{R}^2$ is to use the formula
$\text{Perimeter} = \sqrt{\text{Area}}$. Are there polygons that satisfy this formula? Justify your answer.
(b) State the Shoelace formula for the signed area of a polygon in $\mathbb{R}^2$. Use this formula to
check if the area of the polygon A1BA2A3A4A5 is equal to the sum of areas of polygons
A1A2A3A4A5 and A1BA2, where the points are given by their Euclidean coordinates:
A1 = (3, 1), A2 = (7, 2), A3 = (4, 4), A4 = (8, 6), A5 = (1, 7), B = (4, 3).
3. (a) For a cloud of points in any high-dimensional Euclidean space, state the k-means clustering as an optimisation problem. Find an optimal solution to the 2-means clustering
problem for the four points (0, 0), (4, 0), (0, 8), (4, 8) in $\mathbb{R}^2$.
(b) Describe standard Lloyd’s algorithm for k-means clustering. What is the output of this
algorithm for the four points in part (a) if the initial centres are (0, 4) and (4, 4)?
4. (a) The table below contains three types of blood test results for five patients. Name two
potential methods to reduce the dimension of the given 3-dimensional data.
Tests    patient 1   patient 2   patient 3   patient 4   patient 5
HbA1c    5           5           4           3           3
ACR      30          31          30          30          29
GFR      61          59          60          61          59
(b) Find the first two principal components of the 3-dimensional data in the table above.
5. (a) State Simpson’s amalgamation paradox and give a numerical example of this paradox.
(b) For HbA1c and ACR data in the table from Question 4, calculate the covariance and
the Pearson correlation coefficient. Make a conclusion about the relation between those
variables. Find the regression line y = ax + b by considering HbA1c as the independent
variable and ACR as the dependent variable. Then swap the roles of these variables.
Check if the regression line remains the same and explain your conclusion.
Solutions.
Solution B1 (20 marks).
(a) For any set C of arbitrary elements, a metric is a real-valued function $d : C \times C \to \mathbb{R}$ satisfying
the following axioms:
(1) identity: $d(p, q) = 0$ if and only if $p = q$;
(2) symmetry: $d(p, q) = d(q, p)$ for any $p, q \in C$;
(3) triangle inequality: $d(p, q) + d(q, r) \geq d(p, r)$ for any $p, q, r \in C$.
(2 marks for each axiom, 6 in total)
The function $d_1(a, b) = a^2 - ab + b^2$ on $\mathbb{R}^2$ is symmetric and always non-negative, but fails the
non-degenerate positivity due to $d_1(1, 1) = 1 \neq 0$, and fails the triangle inequality, because
$d_1(0, 1) + d_1(1, 3) = 1 + 7 < d_1(0, 3) = 9$.
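The two failures of $d_1(a, b) = a^2 - ab + b^2$ can be checked numerically. The following is a minimal sketch; the Python names `d1`, `identity_fails` and `triangle_fails` are illustrative and not part of the paper.

```python
def d1(a: float, b: float) -> float:
    """Candidate function from the solution: a^2 - ab + b^2."""
    return a * a - a * b + b * b

# Identity axiom: d(p, p) must be 0, but d1(1, 1) = 1.
identity_fails = d1(1, 1) != 0

# Triangle inequality: d1(0, 1) + d1(1, 3) should be >= d1(0, 3).
triangle_fails = d1(0, 1) + d1(1, 3) < d1(0, 3)

print(identity_fails, triangle_fails)
```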
The Euclidean distance $d_2(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$ satisfies the metric axioms: the first two follow
trivially from the properties of the absolute value.
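The triangle inequality for $\triangle abc$ in terms of vectors $\vec{u} = \overrightarrow{ab}$, $\vec{v} = \overrightarrow{bc}$ says that $|\vec{u} + \vec{v}| \leq |\vec{u}| + |\vec{v}|$. Any
vector $\vec{w}$ has angle 0 with itself, so $\vec{w} \cdot \vec{w} = |\vec{w}|^2$. Then $(\vec{u} + \vec{v}) \cdot (\vec{u} + \vec{v}) \leq |\vec{u}|^2 + 2|\vec{u}| \cdot |\vec{v}| + |\vec{v}|^2$ follows
from Cauchy’s inequality $\vec{u} \cdot \vec{v} \leq |\vec{u}| \cdot |\vec{v}|$. (3 marks for d1 and 5 marks for d2, 8 in total)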
(b) The sum metric is $L_1(p, q) = \sum_{i=1}^{n} |p_i - q_i| = |53 - 49| + |-1 - 2| = 4 + 3 = 7$.
The Euclidean metric is $L_2(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} = \sqrt{4^2 + 3^2} = \sqrt{16 + 9} = 5$.
The max metric is $L_\infty(p, q) = \max_{i=1,\dots,n} |p_i - q_i| = \max\{4, 3\} = 4$. (2 marks for each part, 6 in total)
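The three distances can be reproduced with a few lines of code. This is only a sketch: the helper name `three_distances` is made up for illustration, and the Nottingham longitude is taken as $-1$, which is the value implied by the differences 4 and 3 used in the solution.

```python
import math

def three_distances(p, q):
    """Return the (L1, L2, L-infinity) distances between points p and q."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    l1 = sum(diffs)                            # sum metric
    l2 = math.sqrt(sum(d * d for d in diffs))  # Euclidean metric
    linf = max(diffs)                          # max metric
    return l1, l2, linf

nottingham = (53, -1)  # assumed DD coordinates, longitude west of Greenwich
paris = (49, 2)

l1, l2, linf = three_distances(nottingham, paris)
print(l1, l2, linf)
```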
Solution B2 (20 marks).
(a) For almost all polygons $\text{Perimeter} \neq \sqrt{\text{Area}}$. The formula holds in the degenerate case of a
point when both area and perimeter are equal to 0. (2 marks)
(b) The Shoelace formula: the signed area of any polygon with anticlockwise ordered vertices
$(x_1, y_1), \dots, (x_n, y_n)$ in $\mathbb{R}^2$ is
$$\sum_{i=1}^{n} \frac{y_i + y_{i+1}}{2}(x_i - x_{i+1}) = \frac{1}{2} \sum_{i=1}^{n} \begin{vmatrix} x_i & x_{i+1} \\ y_i & y_{i+1} \end{vmatrix},$$
where we set $(x_{n+1}, y_{n+1}) = (x_1, y_1)$.
(4 marks)
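$$\text{Area}(A_1 B A_2 A_3 A_4 A_5) = \frac{1}{2}\left( \begin{vmatrix} 3 & 4 \\ 1 & 3 \end{vmatrix} + \begin{vmatrix} 4 & 7 \\ 3 & 2 \end{vmatrix} + \begin{vmatrix} 7 & 4 \\ 2 & 4 \end{vmatrix} + \begin{vmatrix} 4 & 8 \\ 4 & 6 \end{vmatrix} + \begin{vmatrix} 8 & 1 \\ 6 & 7 \end{vmatrix} + \begin{vmatrix} 1 & 3 \\ 7 & 1 \end{vmatrix} \right) = \frac{5 - 13 + 20 - 8 + 50 - 20}{2} = 17,$$
$$\text{Area}(A_1 A_2 A_3 A_4 A_5) = \frac{1}{2}\left( \begin{vmatrix} 3 & 7 \\ 1 & 2 \end{vmatrix} + \begin{vmatrix} 7 & 4 \\ 2 & 4 \end{vmatrix} + \begin{vmatrix} 4 & 8 \\ 4 & 6 \end{vmatrix} + \begin{vmatrix} 8 & 1 \\ 6 & 7 \end{vmatrix} + \begin{vmatrix} 1 & 3 \\ 7 & 1 \end{vmatrix} \right) = \frac{-1 + 20 - 8 + 50 - 20}{2} = 20.5,$$
$$\text{Area}(A_1 B A_2) = \frac{1}{2}\left( \begin{vmatrix} 3 & 4 \\ 1 & 3 \end{vmatrix} + \begin{vmatrix} 4 & 7 \\ 3 & 2 \end{vmatrix} + \begin{vmatrix} 7 & 3 \\ 2 & 1 \end{vmatrix} \right) = \frac{5 - 13 + 1}{2} = -3.5,$$
hence $17 = 20.5 - 3.5$ holds. (6+5+3 marks (one for each correct determinant), 14 in total)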
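The three signed areas can be double-checked with a direct implementation of the determinant form of the Shoelace formula. The sketch below uses exact fractions; the function name `signed_area` and the variable names are illustrative choices, not taken from the paper.

```python
from fractions import Fraction

def signed_area(vertices):
    """Signed area: half the sum of x_i * y_{i+1} - x_{i+1} * y_i over all edges."""
    n = len(vertices)
    total = 0
    for i in range(n):
        x_i, y_i = vertices[i]
        x_next, y_next = vertices[(i + 1) % n]  # wrap around to the first vertex
        total += x_i * y_next - x_next * y_i
    return Fraction(total, 2)

A1, A2, A3, A4, A5, B = (3, 1), (7, 2), (4, 4), (8, 6), (1, 7), (4, 3)

area_hexagon = signed_area([A1, B, A2, A3, A4, A5])
area_pentagon = signed_area([A1, A2, A3, A4, A5])
area_triangle = signed_area([A1, B, A2])  # negative: vertices are listed clockwise

print(area_hexagon, area_pentagon, area_triangle)
```

The triangle $A_1 B A_2$ is traversed clockwise, so its signed area is negative, and adding it to the pentagon's area gives the area of the hexagon.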
Solution B3 (20 marks).
(a) For a point cloud $C \subset \mathbb{R}^m$ and a given number k of clusters, the k-means clustering problem is
to split C into clusters $C_1, \dots, C_k$ in such a way that $\sum_{i=1}^{k} \sum_{p \in C_i} d^2(p, \bar{C}_i)$ is minimal, where $d^2(p, \bar{C}_i)$ is
the squared distance from a point p to the cluster centre $\bar{C}_i = \frac{1}{|C_i|} \sum_{p \in C_i} p$. (4 marks)
The optimal solution to the 2-means clustering problem for the 4 points (0, 0), (4, 0), (0, 8), (4, 8) in
$\mathbb{R}^2$ consists of the clusters $C_1 = \{(0, 0), (4, 0)\}$ and $C_2 = \{(0, 8), (4, 8)\}$ whose centres $\bar{C}_1 = (2, 0)$ and
$\bar{C}_2 = (2, 8)$ minimise the distance function $\sum_{i=1}^{k} \sum_{p \in C_i} d^2(p, \bar{C}_i) = 2(2^2 + 2^2) = 16$. For any point cloud $C \subset \mathbb{R}^m$,
the centre that minimises $\sum_{p \in C} d^2(p, \bar{C})$ is $\bar{C} = \frac{1}{|C|} \sum_{p \in C} p$. The two shortest pairwise distances
between given 4 points have the centres (2, 0) and (2, 8). For any three of the given points, their
centre will give a higher sum of squared distances. For example, the centre of (0, 0), (4, 0), (0, 8) is
$p = \left(\frac{4}{3}, \frac{8}{3}\right)$ and has the squared sum $2\left(\frac{4}{3}\right)^2 + 3\left(\frac{8}{3}\right)^2 + \left(\frac{16}{3}\right)^2 = \frac{2 \cdot 4^2 + 3 \cdot 8^2 + 16^2}{9} > 16$. (10 marks)
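Because there are only four points, the optimum can also be confirmed by brute force over all seven ways to split them into two non-empty clusters. This is a sketch for checking the answer, not a practical algorithm; the helper names `cluster_cost` and `two_partitions` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

points = [(0, 0), (4, 0), (0, 8), (4, 8)]

def cluster_cost(cluster):
    """Sum of squared distances from the cluster's points to their mean."""
    n = len(cluster)
    centre = [Fraction(sum(coords), n) for coords in zip(*cluster)]
    return sum((x - c) ** 2 for p in cluster for x, c in zip(p, centre))

def two_partitions(pts):
    """Yield every split of pts into two non-empty clusters exactly once."""
    first, rest = pts[0], pts[1:]
    for size in range(len(rest)):  # the first point's cluster never takes all points
        for extra in combinations(rest, size):
            yield [first, *extra], [p for p in rest if p not in extra]

best = min(two_partitions(points),
           key=lambda pair: cluster_cost(pair[0]) + cluster_cost(pair[1]))
best_cost = cluster_cost(best[0]) + cluster_cost(best[1])

print(best, best_cost)
```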
(b) Standard Lloyd’s algorithm starts from k initial centres c1, … , ck ∈ C. Any point p ∈ C is
assigned to its closest centre (the cluster of this centre). The centre of every cluster is updated as
the average over all its current points. For the new centres, the clusters are formed as above, the
centres are updated and so on until clusters are stable or centres move by not more than a given
threshold. (4 marks)
Starting from the centres (4, 4) and (0, 4), standard Lloyd’s algorithm stops at the first step, because
resulting clusters C1 = {(0, 0), (0, 8)} and C2 = {(4, 0), (4, 8)} remain unchanged. (2 marks)
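The description of Lloyd's algorithm above can be sketched in plain Python as follows. The function names `lloyd` and `total_cost`, the `max_iter` cap, and the choice to keep an old centre when a cluster becomes empty are assumptions made for this illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(cluster):
    """Coordinate-wise average of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def lloyd(points, centres, max_iter=100):
    centres = list(centres)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest centre.
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: sq_dist(p, centres[i]))
            clusters[nearest].append(p)
        # Update step: each centre becomes the average of its cluster.
        new_centres = [mean(c) if c else centres[i] for i, c in enumerate(clusters)]
        if new_centres == centres:  # centres (hence clusters) are stable
            break
        centres = new_centres
    return centres, clusters

def total_cost(centres, clusters):
    return sum(sq_dist(p, c) for c, cluster in zip(centres, clusters) for p in cluster)

points = [(0, 0), (4, 0), (0, 8), (4, 8)]
centres, clusters = lloyd(points, [(0, 4), (4, 4)])
print(clusters, total_cost(centres, clusters))
```

With these initial centres the algorithm stops immediately at a sum of squared distances of 64, whereas the split into $\{(0, 0), (4, 0)\}$ and $\{(0, 8), (4, 8)\}$ has cost 16, so the output of Lloyd's algorithm depends on the initialisation.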
Solution B4 (20 marks).
(a) The potential methods to reduce the dimension of the given data are the Principal Component
Analysis (PCA) and the Singular Value Decomposition (SVD). (2 marks)
(b) For a sample k × m matrix S, where $s_{ij}$ is the j-th sample value of the i-th feature, we subtract
the sample means so that each row has the mean 0. In the first step of the PCA, the data table
Tests    patient 1   patient 2   patient 3   patient 4   patient 5   sample mean
HbA1c    5           5           4           3           3           4
ACR      30          31          30          30          29          30
GFR      61          59          60          61          59          60
becomes the simpler sample matrix
$$S = \begin{pmatrix} 1 & 1 & 0 & -1 & -1 \\ 0 & 1 & 0 & 0 & -1 \\ 1 & -1 & 0 & 1 & -1 \end{pmatrix}.$$
(4 marks)
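Then we compute the covariance matrix $M = \frac{SS^T}{n - 1}$. The (i, j) element of $SS^T$ is the scalar product
of the i-th and j-th rows of S, i.e.
$$SS^T = \begin{pmatrix} 4 & 2 & 0 \\ 2 & 2 & 0 \\ 0 & 0 & 4 \end{pmatrix}, \quad M = \begin{pmatrix} 1 & 0.5 & 0 \\ 0.5 & 0.5 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$
(4 marks)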
The third step of the PCA is to find the eigenvalues of M as follows:
$$\det(M - \lambda I) = \det \begin{pmatrix} 1 - \lambda & 0.5 & 0 \\ 0.5 & 0.5 - \lambda & 0 \\ 0 & 0 & 1 - \lambda \end{pmatrix} = (1 - \lambda)\big((1 - \lambda)(0.5 - \lambda) - 0.25\big) = (1 - \lambda)(\lambda^2 - 1.5\lambda + 0.25) = 0.$$
The eigenvalues are ordered: $\lambda_1 = \frac{3 + \sqrt{5}}{4} > \lambda_2 = 1 > \lambda_3 = \frac{3 - \sqrt{5}}{4}$. The first two principal components of the data are along the eigenvectors with the highest
eigenvalues: $v_1 = (\sqrt{5} + 1, 2, 0)$ for $\lambda_1 = \frac{3 + \sqrt{5}}{4}$ and $v_2 = (0, 0, 1)$ for $\lambda_2 = 1$. (10 marks)
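As a sanity check, the covariance matrix can be rebuilt from the table and the two stated eigenpairs tested against the definition $Mv = \lambda v$. This is a minimal stdlib-only sketch; the helper names `centred`, `mat_vec` and `is_eigenpair` are hypothetical, and a numerical library would normally be used for the eigen-decomposition itself.

```python
import math

data = [
    [5, 5, 4, 3, 3],       # HbA1c
    [30, 31, 30, 30, 29],  # ACR
    [61, 59, 60, 61, 59],  # GFR
]

def centred(row):
    """Subtract the sample mean from every value in the row."""
    m = sum(row) / len(row)
    return [value - m for value in row]

S = [centred(row) for row in data]
n = len(data[0])
# Covariance matrix M = S S^T / (n - 1): scalar products of the rows of S.
M = [[sum(a * b for a, b in zip(r1, r2)) / (n - 1) for r2 in S] for r1 in S]

def mat_vec(matrix, v):
    return [sum(m * x for m, x in zip(row, v)) for row in matrix]

def is_eigenpair(matrix, lam, v):
    """True if matrix @ v equals lam * v up to rounding error."""
    return all(math.isclose(a, lam * b, abs_tol=1e-12)
               for a, b in zip(mat_vec(matrix, v), v))

lam1, v1 = (3 + math.sqrt(5)) / 4, [math.sqrt(5) + 1, 2, 0]
lam2, v2 = 1.0, [0, 0, 1]

print(M)
print(is_eigenpair(M, lam1, v1), is_eigenpair(M, lam2, v2))
```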
Solution B5 (20 marks).
(a) Simpson’s (amalgamation) paradox states that a trend (or rate) in different groups can reverse
when these groups are combined. It can be geometrically visualised by slopes in vector addition.
(2 marks)
Let box A1 contain 5 red and 6 blue balls, and box B1 contain 3 red and 4 blue balls. The probability
of taking a red ball from box A1 is $\frac{5}{11} > \frac{3}{7}$ = probability of taking a red ball from box B1.
Let box A2 contain 6 red and 3 blue balls, and box B2 contain 9 red and 5 blue balls. The probability
of taking a red ball from box A2 is $\frac{2}{3} > \frac{9}{14}$ = probability of taking a red ball from box B2.
If we combine the balls from boxes A1, A2 into one box A and B1, B2 into one box B, then the probability
of taking a red ball from box A is $\frac{11}{20} < \frac{12}{21}$ = probability of taking a red ball from box B. The relation
between probabilities has reversed after we combined boxes. (6 marks)
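The reversal in this example is easy to confirm with exact fractions. In the sketch below the helper `red_probability` and the variable names are illustrative.

```python
from fractions import Fraction

def red_probability(red, blue):
    """Probability of drawing a red ball from a box with the given counts."""
    return Fraction(red, red + blue)

p_a1, p_b1 = red_probability(5, 6), red_probability(3, 4)
p_a2, p_b2 = red_probability(6, 3), red_probability(9, 5)

# Combined boxes: add the red counts and the blue counts separately.
p_a = red_probability(5 + 6, 6 + 3)
p_b = red_probability(3 + 9, 4 + 5)

print(p_a1 > p_b1, p_a2 > p_b2, p_a < p_b)
```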
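(b) The sample means are $\bar{x} = 4$, $\bar{y} = 30$. The sample deviations are $s_x = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2} = \sqrt{\frac{1 + 1 + 0 + 1 + 1}{4}} = 1$, $s_y = \sqrt{\frac{0 + 1 + 0 + 0 + 1}{4}} = \sqrt{0.5} \approx 0.7$.
The covariance is $\mathrm{Cov}(x, y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{4}(0 + 1 + 0 + 0 + 1) = 0.5$. The Pearson
correlation is $r_{xy} = \frac{\mathrm{Cov}(x, y)}{s_x s_y} = 0.5 / \sqrt{0.5} = \sqrt{0.5} \approx 0.7$, which is a positive correlation. (6 marks)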
The least squares regression line is $y = ax + b$, where $a = r_{xy} \frac{s_y}{s_x} = 0.5$ and $b = \bar{y} - a\bar{x} = 30 - 2 = 28$.
So the regression line is $y = 0.5x + 28$.
If we swap x and y, the regression line is $x = cy + d$, where $c = r_{xy} \frac{s_x}{s_y} = 1$ and $d = \bar{x} - c\bar{y} = 4 - 30 = -26$. So the regression line is $x = y - 26$, or $y = x + 26$.
Both lines go through the central point $(\bar{x}, \bar{y})$, but they minimize different sums: the first one optimizes vertical distances, whilst the second line optimizes the horizontal distances, hence these
lines differ. (2+2+2=6 marks)
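The covariance, the correlation and both regression lines can be recomputed from the HbA1c and ACR rows of the table. The sketch below uses the slope formula $\mathrm{Cov}(x, y) / s_x^2$, which is algebraically the same as $r_{xy} \frac{s_y}{s_x}$; the helper names `sample_cov` and `regression` are assumptions for this illustration.

```python
import math

hba1c = [5, 5, 4, 3, 3]     # x
acr = [30, 31, 30, 30, 29]  # y

def mean(values):
    return sum(values) / len(values)

def sample_cov(u, v):
    """Sample covariance with the n - 1 denominator; sample_cov(u, u) is the variance."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

def regression(u, v):
    """Least squares line v = slope * u + intercept."""
    slope = sample_cov(u, v) / sample_cov(u, u)
    return slope, mean(v) - slope * mean(u)

cov_xy = sample_cov(hba1c, acr)
r_xy = cov_xy / math.sqrt(sample_cov(hba1c, hba1c) * sample_cov(acr, acr))

a, b = regression(hba1c, acr)  # y = a * x + b
c, d = regression(acr, hba1c)  # x = c * y + d

print(cov_xy, round(r_xy, 3))
print((a, b), (c, d))
```

Rewriting the second line as $y = x + 26$ shows a slope of 1 against 0.5 for the first line, although both pass through $(4, 30)$.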