In the previous lectures,

you studied examples of desktop GIS applications and server GIS applications.

In this lecture, we have a more advanced problem of spatial data analysis.

The problem is that the Ministry of Health and Welfare (MOHW) in Korea wants to check

if any spatial relationship exists between administrative districts and disease prevalence.

For the diseases that show spatial dependency,

MOHW wants to find any regional factors

that contribute to higher or lower disease prevalence,

for the purpose of improving public health in Korea.

For the problem, two spatial datasets are presented.

First, the disease prevalence rates of administrative districts,

and second, a variety of variables related to economy, demography, environment,

public health, land cover, land use,

and industry, among which the influential variables are to be found.

The solution is designed in two stages.

First, spatial autocorrelation analysis is conducted

with respect to the disease prevalence rates of administrative districts,

and a list of diseases which have spatial autocorrelation is found.

In other words, for those diseases, spatial factors would have an impact on disease prevalence.

For those diseases, decision tree classification is applied,

to find the influential variables from the given 165 input variables.

This is a typical spatial data analysis problem,

which does not fit in basic GIS software

and requires more advanced analytic power.

So, I would recommend connecting GIS software such as QGIS

with a data analytics tool such as RStudio

for solving the problem.

In the solution structure,

GIS software will take care of visualization of the outcomes,

and the data analytics tool will be used for the two data analyses:

first, spatial autocorrelation analysis,

and second, decision tree analysis for deriving

the influential variables related to the prevalence of a specific disease.

Now, let's go step by step through the procedure to solve the given problem.

The first step is to prepare the datasets for analysis.

Here, we used QGIS to build

GIS layers by joining disease prevalence rates to administrative districts,

categorized each disease prevalence into three levels,

and eventually produced tertile maps.

You are looking at a tertile map of hypertension with respect to

administrative districts of Korea.
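The join and the three-level categorization can be sketched in a few lines of Python. This is only a minimal illustration, not the QGIS workflow itself: the district codes, the prevalence values, and the function name `tertile_labels` are all made up for the example.

```python
def tertile_labels(rates):
    """Assign each value a tertile class: 0 (low), 1 (middle), 2 (high).

    Values are ranked and the ranks are cut into three equal-sized groups.
    Tied values may land in different classes; a real workflow would need
    an explicit rule for ties.
    """
    n = len(rates)
    order = sorted(range(n), key=lambda i: rates[i])
    labels = [0] * n
    for rank, idx in enumerate(order):
        labels[idx] = rank * 3 // n
    return labels


# Hypothetical district codes and prevalence rates (illustrative only).
districts = ["D01", "D02", "D03", "D04", "D05", "D06"]
prevalence = {"D01": 18.2, "D02": 21.5, "D03": 16.9,
              "D04": 24.1, "D05": 19.8, "D06": 22.7}

# Attribute join: keep only districts that have a prevalence record.
joined = [(code, prevalence[code]) for code in districts if code in prevalence]

labels = tertile_labels([rate for _, rate in joined])
tertile_map = {code: label for (code, _), label in zip(joined, labels)}
print(tertile_map)
```

Each district ends up with one of three classes, which is what the tertile map colors.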

The first analysis is to check if each disease prevalence rate is spatially autocorrelated.

The method used for checking spatial autocorrelation

is Moran's I.

So, using R, Moran's I is computed for each disease prevalence,

and then, as one more step, each Moran's I value is converted

to a z-score for a statistical test.
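To make the computation concrete, here is a minimal Python sketch of Moran's I and its z-score. The lecture used R and does not say which weights or which variance formula were used, so two assumptions are made here: binary contiguity weights, and the variance under the normality assumption. The six-district chain and its values are made up for illustration.

```python
import math


def morans_i(values, weights):
    """Moran's I for a list of values and an n x n spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * num / den


def morans_z(values, weights):
    """Return (I, z), with the variance taken under the normality assumption."""
    n = len(values)
    i_obs = morans_i(values, weights)
    expected = -1.0 / (n - 1)
    s0 = sum(sum(row) for row in weights)
    s1 = 0.5 * sum((weights[i][j] + weights[j][i]) ** 2
                   for i in range(n) for j in range(n))
    s2 = sum((sum(weights[i]) + sum(weights[j][i] for j in range(n))) ** 2
             for i in range(n))
    variance = ((n * n * s1 - n * s2 + 3 * s0 * s0)
                / ((n * n - 1) * s0 * s0)) - expected ** 2
    return i_obs, (i_obs - expected) / math.sqrt(variance)


# Hypothetical example: six districts in a row, each adjacent to its neighbors.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
n = len(values)
weights = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]

i_obs, z = morans_z(values, weights)
print(round(i_obs, 3), round(z, 3))
```

Because the values rise steadily along the chain, similar values sit next to each other, so I comes out positive and the z-score is well above zero.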

The table describes the analysis results.

Out of 24 diseases,

12 diseases, from Allergic Rhinitis to

Angina Pectoris, were spatially autocorrelated at the confidence level of 95%.
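The 95% screening can be sketched as follows. The lecture does not say whether the test was one-sided or two-sided, so this sketch assumes a two-sided test against the standard normal distribution. The disease names and z-scores are placeholders, not the values from the table.

```python
from statistics import NormalDist


def is_significant(z, confidence=0.95):
    """Two-sided test: is |z| beyond the critical value of the standard normal?"""
    alpha = 1.0 - confidence
    critical = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for 95%
    return abs(z) > critical


# Placeholder z-scores of Moran's I (illustrative only, not the lecture's results).
z_scores = {"disease_a": 4.2, "disease_b": 1.1, "disease_c": 2.0}

significant = [name for name, z in z_scores.items() if is_significant(z)]
print(significant)
```

Only the diseases that pass this screening would move on to the second stage of the analysis.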

For your better understanding of spatial autocorrelation,

the tertile map of each disease prevalence

and the corresponding z-score of Moran's I are presented here.

You're looking at the top 6 diseases, for which spatial autocorrelation is strongest.

They are Allergic Rhinitis,

Atopic Dermatitis, Dyslipidemia, Arthritis, Hypertension, and Osteoporosis.

Isn't it interesting?

We do not know why,

but at least we can speculate

that some spatial factors would have an impact on the prevalence rates of these 6 diseases.

This slide presents the next 6 diseases with high spatial autocorrelation,

which are still statistically significant at the confidence level of 95%.

They include Cataract, Diabetes,

Stroke, Tuberculosis, Myocardial Infarction, and Angina Pectoris.

I could not figure out the science behind the outcome at this point.

But it is surprising to me that the prevalence of Diabetes and

Stroke is spatially autocorrelated in Korea.

Now, you're looking at the diseases whose prevalence shows no spatial autocorrelation.

One interesting finding is that

Asthma has no spatial autocorrelation,

which is against the belief that Asthma is

a typical disease affected by environmental factors.

I guess this outcome is related to a scale issue.

The current spatial resolution of administrative districts is way too coarse,

so it cannot catch such subtle changes in environmental factors.

Again, I can tell you, spatial data science is a science of scale.

This slide presents another 6 diseases which

show no spatial autocorrelation at the given scale.

Now, we are ready to conduct decision tree analysis

for the diseases with a spatially autocorrelated prevalence rate.

Out of 12 candidates,

I selected three diseases,

Hypertension, Stroke and Diabetes.

With the 165 input variables

and the tertile of the disease prevalence rate as the target variable,

decision tree analysis is conducted.
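To show how a decision tree picks its most influential variable, here is a minimal Python sketch of choosing the root split. The lecture does not say which tree algorithm was used, so this assumes a CART-style split on Gini impurity. The variable names `var_a` and `var_b` and all the data are hypothetical, with the tertile class (0, 1, 2) as the target.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels, feature_names):
    """Return (feature, threshold, weighted_gini) of the best binary split."""
    n = len(rows)
    best = None
    for f, name in enumerate(feature_names):
        distinct = sorted({row[f] for row in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for low, high in zip(distinct, distinct[1:]):
            threshold = (low + high) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (name, threshold, score)
    return best


# Hypothetical districts: two input variables and a tertile target class.
feature_names = ["var_a", "var_b"]
rows = [(1.0, 5.0), (2.0, 1.0), (3.0, 4.0), (4.0, 2.0), (5.0, 6.0), (6.0, 3.0)]
labels = [0, 0, 1, 1, 2, 2]

feature, threshold, score = best_split(rows, labels, feature_names)
print(feature, threshold, round(score, 3))
```

The variable chosen here becomes the root node. A full tree repeats the same search inside each branch, and the variables that appear in the tree are the ones reported as influential.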

For Hypertension, 7 influential factors are retrieved.

Marriage rate

and private insurance applications are positive factors,

which means the higher the factor,

the lower the disease prevalence.

On the other hand,

employer rate ratio, widowed percentage,

number of teeth scaling, residence period,

and brushing teeth after lunch

are retrieved as negative factors.

Some variables make sense,

and some variables do not.

Anyway, it can be said that marital status, economy,

and lifestyle hygiene are major factors of the Hypertension prevalence rate.

The same decision tree analysis was conducted for finding influential factors for Stroke.

Again, 7 factors were retrieved, which are marriage rate

and weight control as positive factors,

and number of people who experienced depression,

number of people who received stress counseling, anxiety/depression level,

and residence tax as negative factors.

It is also a very interesting finding that the root node,

which is the most important factor of the decision tree, is marriage rate

for both Hypertension and Stroke.

This is the power of data science.

The data tells us, if you want to stay free from Hypertension and Stroke,

the answer is to get married.

The same analysis was conducted with respect to Diabetes.

Number of health center visits, local income tax,

urban planning tax, and average sleeping time were positive factors.

And exercise capacity, number of people on diet control,

and number of people who received the flu vaccine were negative factors.

Diabetes has influential factors mainly related to

economic variables and level of health care.