---
bibliography: blog.bib
csl: acm.csl
date: "2020-10-28T20:20:00Z"
draft: true
title: Process of Data science - Measurement
---

# Measurement variables

In a previous post, the process of data science and forming a hypothesis was discussed. A hypothesis aligns a business objective with a data science problem, and provides a "big-picture" view of the issues which need to be considered in the further steps of addressing that problem. The problem being considered is insurance fraud, and a good hypothesis for success could be "misrepresentation is different from intentional damage". This hypothesis attempts to differentiate between misrepresentation and intentional damage.

> Misrepresentation is said to occur when a claim is made on
> nonexistent assets
>
> Intentional damage is said to occur when an insured asset is
> intentionally damaged

The next step, after a hypothesis is established, is to consider the variables or factors affecting the hypothesis.

1. [Hypothesis](http://knkumar.com/blog/posts/data_science_process/)
2. Measurement variables (discussed here)
3. Latent or unobservable factors
4. Experimental design (0 to 1)
    1. Controlling other factors to observe the primary effect
5. Collection and analysis of data for pattern discovery
    1. Hypothesis-driven exploration
6. Modeling of patterns for prediction
    1. Numerical analysis for error reduction
    2. Qualitative modeling
7. Generalizing or scaling the experiment (1 to n)
8. Establishing a baseline
9. Monitoring through controls and baselines
10. Ethics and governance

## The null hypothesis

Let us call our hypothesis "misrepresentation is different from intentional damage" $H$, for mathematical convenience. This can be a hard thing to determine, and we can use ideas from *statistical testing* to develop a solution. A statistical testing process works by determining an antithesis, often called the null hypothesis, i.e., if the antithesis were true, the hypothesis under consideration would not be true. An antithesis here could be "misrepresentation is indistinguishable from intentional damage"; call this $H_0$.

In a traditional scientific experiment, a statistical test would be possible through random assignment to the conditions under test. In this scenario, one group of insured parties would generate misrepresentation claims whereas another group would generate intentional damage claims. Traditional hypothesis testing would calculate a statistic, say a mean, for the data generated from the two groups and observe whether the statistics are significantly different from each other.

$$
\text{Experimental question:}
\underbrace{\begin{cases}
H: statistic_{misrepresentation} \neq statistic_{intentional} \\
H_0: statistic_{misrepresentation} = statistic_{intentional}
\end{cases} \text{ verify truth of both statements}}_{\text{equality/inequality with an acceptable margin of statistical error}}
$$
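To see what that traditional route would look like, here is a minimal sketch of a two-sample test in a hypothetical world where claims *could* be randomly assigned to the two conditions; the claim amounts, group means, and sample sizes below are all simulated, not real insurance data:

```python
# A minimal sketch of the traditional two-group test under (hypothetical)
# random assignment. `misrep` and `intentional` are simulated claim amounts.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
misrep = rng.normal(loc=5000, scale=1200, size=200)       # simulated misrepresentation claims
intentional = rng.normal(loc=5400, scale=1200, size=200)  # simulated intentional-damage claims

# Two-sample t-test: H0 says the two group means are equal.
t_stat, p_value = stats.ttest_ind(misrep, intentional)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# A p-value below a pre-chosen significance level (e.g., 0.05) would
# reject H0 within an acceptable margin of statistical error.
```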
In our insurance scenario, however, misrepresentation and intentional damage are not randomly assigned to or generated by insured parties. In fact, it would be facetious to conduct such an experiment to study the problem at hand. This problem falls under the umbrella of a natural experiment or an observational study, depending on the circles you are in. In an observational study, the assignment of the population to the groups or conditions of the experiment is outside the investigator's purview. A hypothesis such as "smoking causes cancer" or "video games cause violence" [@engelhardt2011your] is harder to test in a purely scientific manner. In fact, the earlier position on video games by [@engelhardt2011your] has been attributed to priming by [@kuhn2019does], and the jury could still be out on this, since we cannot guarantee homogeneity of the sample when testing for observed effects. In such scenarios, the best we can do is an observational study to gain more information about our hypothesis.

## What are Measurement Variables (aka Direct Factors)?

In order to perform a *scientific study*, a data scientist should start by picking up on *signals* of misrepresentation and intentional damage. These signals are often referred to as measurement variables for modeling. The model of choice for such a problem is a discriminative model, i.e., a model discriminating the fraud of misrepresentation from intentional damage. In the old but popular example of discriminating the iris species [@fisher1936use], the petal length/width and sepal length/width provided sufficient measurement variables for discriminating the species using linear functions. In this iris analysis, the experiment was natural, i.e., not in the control of an experimenter. The term ***natural*** means the experimenter did not genetically modify the species to show variations; the variation in the species was naturally selected. On the other hand, in cases such as experiments with [fruit flies](https://bdsc.indiana.edu/about/index.html) (available at Indiana University for research), a scientist would study the species by "knocking out genes" or "inducing variations", creating a *controlled* experiment. The key in either case is understanding the *factors* or **measurement variables** for the hypothesis under study.

A **natural/observational experiment** is a useful alternative when a controlled experiment cannot be undertaken, as in the insurance example. It is important to note that a natural experiment can also have issues with confounding variables and bias which potentially invalidate the experiment. A ***confound*** (or confounding variable) can be defined as a factor which could directly or indirectly affect the response variable when considering a direct measurement.

Let's take a concrete example to understand this concept. Assume a scout is looking for talent in basketball (or a VC firm is scouting for investment; the analogy is similar). The scout assesses the talent using a few metrics, such as average points per game and assists for offense, and rebounds, blocks, and steals for defense. There are *other aspects* (or confounds) which come into the purview of a scout, such as medical history and the stability/improvement of stats, because these indicate the progression of a player and future outcomes. In many cases, a *confound* plays a large role. For example, a player with a debilitating shoulder injury could be a red flag, since the future outcome is more likely to be weaker. The difficulty lies in ascertaining the confounds for the hypothesis under study, which requires understanding the true nature of the effect a confound has on the hypothesis.

A *targeted interview* with an expert (such as a claims investigator for insurance or a talent scout for sports) is a valuable tool in a data scientist's arsenal for understanding the factors and confounds which should be considered as data to be included in a model. An interview provides the intuition, or priors in a Bayesian context, for data gathering and evaluation.

A variable or factor discriminating ***misrepresentation*** from ***intentional damage*** could be identified based on multiple perspectives. Personally, I choose the word perspective as a line of attack/strategy for understanding the contributing factors from first principles. This is, in my opinion, a preferable approach to throwing the kitchen sink at a dataset.

#### Historical variables

Historical variables can be obtained from similar categories of claims in the past. They are useful in understanding patterns of normal insurance claims and misrepresentation. Cost per type of damage could be a general factor to monitor, which requires categorizing the types of damage available in historical data. In many cases, the insurance system places restrictions on the types of damage covered and bundles similar damages under a large umbrella (because it is easier to deal with one type and have a single process). For example, flooding could be due to natural events like weather (rain, storm, waves, etc.) or a pipe breaking due to stress or damage. Classifying the category at the right level is important in order to provide models the right level of information; a purely data-driven approach to collecting data can *misclassify* labels by not having appropriate levels for a category, losing a lot of context.

#### Textual variables

Textual variables can be obtained from an insurance claim which asks pointed questions of a claimant. Many of the responses to the questions can be free-form text or speech, which allow representation of the situation in the claim. A misrepresented claim can potentially have signals in the text describing the situation. A simple construct would be the overuse of certain elements to lend validity to the claim. A speech pattern can have inflections when misrepresenting facts, which can be captured by a model. Another common pattern for obtaining signals is asking the same question with a different phrasing. The text or speech patterns for both questions should ideally be similar, and a measure of dissimilarity can be used by a model to discriminate between misrepresentation and intentional damage, as in the sketch below. The details of the spacing between the questions and their phrasing are experimental variables in the hands of the data scientist to gather useful signals.
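To make the dissimilarity measure concrete, here is a minimal sketch that scores two hypothetical answers to the same rephrased question; TF-IDF with cosine similarity is just one illustrative choice among many, and the answer strings are made up:

```python
# A minimal sketch of scoring the dissimilarity between two free-form
# answers to the same (rephrased) question. The answers are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

answer_v1 = "The basement flooded overnight after the storm and ruined the carpet."
answer_v2 = "Water damage appeared in the basement; the carpet was ruined by rain."

# Represent both answers as TF-IDF vectors over a shared vocabulary.
tfidf = TfidfVectorizer().fit_transform([answer_v1, answer_v2])

# Similarity near 1 means consistent descriptions; a low score is one
# signal (among many) that a downstream model could weigh.
similarity = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
print(f"cosine similarity = {similarity:.2f}")
```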
#### Social variables

Social variables can be obtained from aspects of social interaction, such as association with similar groups, participation in similar events, or mining social media sites such as Facebook, Twitter, Snapchat, etc. The use of social variables stems from the phrase "neurons that fire together wire together", implying that if one person filed a claim with misrepresentation or intentional damage, another person connected through social bonds could be correlated to do so as well. Personally, I am not a proponent of using social variables, but in some cases they can provide useful information akin to a prior for the model. A data scientist needs to be careful to ensure the prior from social variables can be overcome by evidence in either direction.

#### Economic variables

## Identifying measurement variables

### Correlation

### Separation of classes
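As a first look at what separation of classes means, here is a minimal sketch of eyeballing it on the iris measurement variables from [@fisher1936use]; the choice of the two petal features is illustrative:

```python
# A minimal sketch of visually checking class separation, using the iris
# measurement variables. Plotting petal length against petal width already
# separates the three species almost linearly.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
petal_length, petal_width = iris.data[:, 2], iris.data[:, 3]

for species in range(3):
    mask = iris.target == species
    plt.scatter(petal_length[mask], petal_width[mask],
                label=iris.target_names[species])

plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.legend()
plt.show()
```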