---
bibliography: blog.bib
csl: acm.csl
date: "2020-10-28T20:20:00Z"
draft: true
title: Process of Data science - Measurement
---

# Measurement variables

In a previous post, the process of data science and forming a hypothesis was discussed. A hypothesis aligns a business objective with a data science problem, and provides a "big-picture" view of the issues which need to be considered in the further steps of addressing that problem. The problem being considered is insurance fraud, and a good hypothesis for success could be "misrepresentation is different from intentional damage". This hypothesis attempts to differentiate between misrepresentation and intentional damage.

> Misrepresentation is said to occur when a claim is made on
> nonexistent assets
>
> Intentional damage is said to occur when an insured asset is
> intentionally damaged

The next step, after a hypothesis is established, is to consider the variables or factors affecting the hypothesis.

1. [Hypothesis](http://knkumar.com/blog/posts/data_science_process/)
2. Measurement variables (discussed here)
3. Latent or unobservable factors
4. Experimental design (0 to 1)
    1. Controlling other factors to observe the primary effect
5. Collection and analysis of data for pattern discovery
    1. Hypothesis-driven exploration
6. Modeling of patterns for prediction
    1. Numerical analysis for error reduction
    2. Qualitative modeling
7. Generalizing or scaling the experiment (1 to n)
8. Establishing a baseline
9. Monitoring through controls and baselines
10. Ethics and governance

## The null hypothesis

Let us call our hypothesis "misrepresentation is different from intentional damage" $H$, for mathematical convenience. This can be a hard thing to determine, and we can use ideas from *statistical testing* to develop a solution. A statistical testing process works by determining an antithesis, often called the null hypothesis, i.e., if the antithesis were true, the hypothesis under consideration would not be true. An antithesis here could be "misrepresentation is indistinguishable from intentional damage"; call this $H_0$.

In a traditional scientific experiment, a statistical test would be possible through random assignment to the conditions under test. In this scenario, one group of insured parties would generate misrepresentation claims whereas another group would generate intentional damage claims. Traditional hypothesis testing would calculate a statistic, say a mean, for the data generated from the two groups and observe whether the statistics are significantly different from each other.

$$
\text{Experimental question:}
\underbrace{\begin{cases}
H: statistic_{misrepresentation} \neq statistic_{intentional} \\
H_0: statistic_{misrepresentation} = statistic_{intentional}
\end{cases} \text{ verify truth of both statements}}_{\text{equality/inequality with an acceptable margin of statistical error}}
$$
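To see what that traditional route would look like, here is a minimal sketch of a two-sample test in a hypothetical world where claims *could* be randomly assigned to the two conditions; the claim amounts, group means, and sample sizes below are all simulated, not real insurance data:

```python
# A minimal sketch of the traditional two-group test under (hypothetical)
# random assignment. `misrep` and `intentional` are simulated claim amounts.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
misrep = rng.normal(loc=5000, scale=1200, size=200)       # simulated misrepresentation claims
intentional = rng.normal(loc=5400, scale=1200, size=200)  # simulated intentional-damage claims

# Two-sample t-test: H0 says the two group means are equal.
t_stat, p_value = stats.ttest_ind(misrep, intentional)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# A p-value below a pre-chosen significance level (e.g., 0.05) would
# reject H0 within an acceptable margin of statistical error.
```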
In our insurance scenario, however, misrepresentation and intentional damage are not randomly assigned to or generated by insured parties. In fact, it would be facetious to conduct such an experiment to study the problem at hand. This problem falls under the umbrella of a natural experiment or an observational study, depending on the circles you are in. In an observational study, the assignment of the population to the groups or conditions of the experiment is outside the investigator's purview. A hypothesis such as "smoking causes cancer" or "video games cause violence" [@engelhardt2011your] is harder to test in a purely scientific manner. In fact, the earlier position on video games by [@engelhardt2011your] has been attributed to priming by [@kuhn2019does], and the jury could still be out on this, since we cannot guarantee homogeneity of the sample when testing for observed effects. In such scenarios, the best we can do is an observational study to gain more information about our hypothesis.

## What are Measurement Variables (aka Direct Factors)?

In order to perform a *scientific study*, a data scientist should start by picking up on *signals* of misrepresentation and intentional damage. These signals are often referred to as measurement variables for modeling. The model of choice for such a problem is a discriminative model, i.e., a model discriminating the fraud of misrepresentation from intentional damage. In the old but popular example of discriminating the iris species [@fisher1936use], the petal length/width and sepal length/width provided sufficient measurement variables for discriminating the species using linear functions. In this iris analysis, the experiment was natural, i.e., not in the control of an experimenter. The term ***natural*** means the experimenter did not genetically modify the species to show variations; the variation in the species was naturally selected. On the other hand, in cases such as experiments with [fruit flies](https://bdsc.indiana.edu/about/index.html) (available at Indiana University for research), a scientist would study the species by "knocking out genes" or "inducing variations", creating a *controlled* experiment. The key in either case is understanding the *factors* or **measurement variables** for the hypothesis under study.

A **natural/observational experiment** is a useful alternative when a controlled experiment cannot be undertaken, as in the insurance example. It is important to note that a natural experiment can also have issues with confounding variables and bias which potentially invalidate the experiment. A ***confound*** (or confounding variable) can be defined as a factor which could directly or indirectly affect the response variable when considering a direct measurement.

Let's take a concrete example to understand this concept. Assume a scout is looking for talent in basketball (or a VC firm is scouting for investment; the analogy is similar). The scout assesses the talent using a few metrics, such as average points per game and assists for offense, and rebounds, blocks, and steals for defense. There are *other aspects* (or confounds) which come into the purview of a scout, such as medical history and the stability/improvement of stats, because these indicate the progression of a player and future outcomes. In many cases, a *confound* plays a large role. For example, a player with a debilitating shoulder injury could be a red flag, since the future outcome is more likely to be weaker. The difficulty lies in ascertaining the confounds for the hypothesis under study, which requires understanding the true nature of the effect a confound has on the hypothesis.

A *targeted interview* with an expert (such as a claims investigator for insurance or a talent scout for sports) is a valuable tool in a data scientist's arsenal for understanding the factors and confounds which should be considered as data to be included in a model. An interview provides the intuition, or priors in a Bayesian context, for data gathering and evaluation.

A variable or factor discriminating ***misrepresentation*** from ***intentional damage*** could be identified based on multiple perspectives. Personally, I choose the word perspective as a line of attack/strategy for understanding the contributing factors from first principles. This is, in my opinion, a preferable approach to throwing the kitchen sink at a dataset.

#### Historical variables

Historical variables can be obtained from similar categories of claims in the past. They are useful in understanding patterns of normal insurance claims and misrepresentation. Cost per type of damage could be a general factor to monitor, which requires categorizing the types of damage available in historical data. In many cases, the insurance system places restrictions on the types of damage covered and bundles similar damages under a large umbrella (because it is easier to deal with one type and have a single process). For example, flooding could be due to natural events like weather (rain, storm, waves, etc.) or a pipe breaking due to stress or damage. Classifying the category at the right level is important in order to provide models the right level of information; a purely data-driven approach to collecting data can *misclassify* labels by not having appropriate levels for a category, losing a lot of context.

#### Textual variables

Textual variables can be obtained from an insurance claim which asks pointed questions of a claimant. Many of the responses to the questions can be free-form text or speech, which allow representation of the situation in the claim. A misrepresented claim can potentially have signals in the text describing the situation. A simple construct would be the overuse of certain elements to lend validity to the claim. A speech pattern can have inflections when misrepresenting facts, which can be captured by a model. Another common pattern for obtaining signals is asking the same question with a different phrasing. The text or speech patterns for both questions should ideally be similar, and a measure of dissimilarity can be used by a model to discriminate between misrepresentation and intentional damage, as in the sketch below. The details of the spacing between the questions and their phrasing are experimental variables in the hands of the data scientist to gather useful signals.
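To make the dissimilarity measure concrete, here is a minimal sketch that scores two hypothetical answers to the same rephrased question; TF-IDF with cosine similarity is just one illustrative choice among many, and the answer strings are made up:

```python
# A minimal sketch of scoring the dissimilarity between two free-form
# answers to the same (rephrased) question. The answers are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

answer_v1 = "The basement flooded overnight after the storm and ruined the carpet."
answer_v2 = "Water damage appeared in the basement; the carpet was ruined by rain."

# Represent both answers as TF-IDF vectors over a shared vocabulary.
tfidf = TfidfVectorizer().fit_transform([answer_v1, answer_v2])

# Similarity near 1 means consistent descriptions; a low score is one
# signal (among many) that a downstream model could weigh.
similarity = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
print(f"cosine similarity = {similarity:.2f}")
```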
#### Social variables

Social variables can be obtained from aspects of social interaction, such as association with similar groups, participation in similar events, or mining social media sites such as Facebook, Twitter, Snapchat, etc. The use of social variables stems from the phrase "neurons that fire together wire together", implying that if one person filed a claim with misrepresentation or intentional damage, another person connected through social bonds could be correlated to do so as well. Personally, I am not a proponent of using social variables, but in some cases they can provide useful information akin to a prior for the model. A data scientist needs to be careful to ensure the prior from social variables can be overcome by evidence in either direction.

#### Economic variables

## Identifying measurement variables

### Correlation

### Separation of classes
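As a first look at what separation of classes means, here is a minimal sketch of eyeballing it on the iris measurement variables from [@fisher1936use]; the choice of the two petal features is illustrative:

```python
# A minimal sketch of visually checking class separation, using the iris
# measurement variables. Plotting petal length against petal width already
# separates the three species almost linearly.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
petal_length, petal_width = iris.data[:, 2], iris.data[:, 3]

for species in range(3):
    mask = iris.target == species
    plt.scatter(petal_length[mask], petal_width[mask],
                label=iris.target_names[species])

plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.legend()
plt.show()
```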