public inbox archive for pandoc-discuss@googlegroups.com
From: kiran kumar <krankumar-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: pandoc-discuss <pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
Subject: Unable to generate citations in markdown_strict
Date: Wed, 16 Jun 2021 10:01:30 -0700 (PDT)	[thread overview]
Message-ID: <f0b66f7d-b530-4c2e-9979-2fd40ff51dd4n@googlegroups.com> (raw)


[-- Attachment #1.1: Type: text/plain, Size: 761 bytes --]

 

Using the following command to generate citations

pandoc test.md -citeproc -f markdown_strict+yaml_metadata_block \
  -t markdown_strict+citations+smart+yaml_metadata_block \
  -s --bibliography blog.bib --csl acm.csl -o check.md

test.md contains a few citations, but they are not rendered as
references in check.md.

Is there something I am missing?
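[A note for later readers: since pandoc 2.11, citation processing is enabled with the double-dash flag `--citeproc`. With a single dash, pandoc likely parses `-citeproc` as the short option `-c` (i.e. `--css`) with argument `iteproc`, so no citations are processed. A sketch of the presumably intended invocation:]

```sh
# Presumably intended command: --citeproc with two dashes
# (with one dash, pandoc reads "-citeproc" as "-c iteproc", i.e. --css=iteproc)
pandoc test.md --citeproc \
  -f markdown_strict+yaml_metadata_block \
  -t markdown_strict+citations+smart+yaml_metadata_block \
  -s --bibliography blog.bib --csl acm.csl -o check.md
```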


[-- Attachment #1.2: Type: text/html, Size: 1079 bytes --]

[-- Attachment #2: test.md --]
[-- Type: text/markdown, Size: 10299 bytes --]

---
bibliography: blog.bib
csl: acm.csl
date: "2020-10-128T20:20:00Z"
draft: true
title: Process of Data science - Measurement
---

# Measurement variables

In a previous post, the process of data science and forming a
hypothesis was discussed. A hypothesis is what aligns a business
objective to a data science problem: it provides a "big-picture" view
of the issues that need to be considered in the further steps of
addressing a data science problem.

The problem being considered is insurance fraud, and a good hypothesis
for success could be “misrepresentation is different from intentional
damage”. This hypothesis attempts to differentiate between
misrepresentation and intentional damage.

> Misrepresentation is said to occur when a claim is made on
> nonexistent assets
>
> Intentional damage is said to occur when an insured asset is
> intentionally damaged

The next step, after a hypothesis is established, is to consider the
variables or factors affecting the hypothesis.

1.  [Hypothesis](http://knkumar.com/blog/posts/data_science_process/)
2.  Measurement variables (discussed here)
3.  Latent or unobservable factors
4.  Experimental design (0 to 1)
    1.  Controlling other factors to observe primary effect.
5.  Collection and analysis of data for pattern discovery
    1.  Hypothesis driven Exploration
6.  Modeling of patterns for prediction
    1.  Numerical Analysis for error reduction
    2.  Qualitative modeling
7.  Generalizing or scaling the experiment (1 to n)
8.  Establishing a baseline
9.  Monitoring through controls and baselines
10. Ethics and governance

## The null Hypothesis

Let us call our hypothesis “misrepresentation is different from
intentional damage” $H$ for mathematical convenience. This can be a
hard thing to determine, and we can use ideas from *statistical
testing* to develop a solution. A statistical testing process works by
determining an antithesis, often called the null hypothesis, i.e., if
the antithesis were true, the hypothesis under consideration would not
be true. An antithesis could be "misrepresentation is
indistinguishable from intentional damage"; call this $H_0$.

In a traditional scientific experiment, a statistical experiment would
be possible by random assignment to the conditions under test. In this
scenario, one group of insured would generate misrepresentation whereas
another group would generate intentional damage claims. Traditional
hypothesis testing would calculate a statistic, say a mean, for the
data generated from the two groups and observe whether the statistics
are significantly different from each other.

$$
\text{Experimental question:}\quad
\underbrace{\begin{cases}
H:\ statistic_{misrepresentation} \neq statistic_{intentional} \\
H_0:\ statistic_{misrepresentation} = statistic_{intentional}
\end{cases}\ \text{verify truth of both statements}}_{\text{equality/inequality with an acceptable margin of statistical error}}
$$

In this scenario, misrepresentation and intentional damage are not
randomly assigned or generated from insured parties. In fact, it would
be facetious to conduct an experiment to study the problem at hand.
Such a problem falls under the umbrella of a natural experiment or an
observational study, depending on the circles you are in.
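The two-group comparison above can be sketched numerically. The data below are entirely hypothetical (simulated, not from any insurer), and Welch's t statistic is computed by hand with the standard library:

```python
# Illustrative two-sample test of H vs H_0 on hypothetical data:
# a large |t| favors rejecting H_0 (the statistics are equal).
import random
import statistics as stats

random.seed(42)
# Hypothetical claim-severity scores for the two groups
misrep = [random.gauss(5.0, 1.0) for _ in range(100)]
intentional = [random.gauss(5.8, 1.0) for _ in range(100)]

def welch_t(a, b):
    # Welch's t statistic: mean difference over its standard error
    ma, mb = stats.mean(a), stats.mean(b)
    va, vb = stats.variance(a), stats.variance(b)
    return (ma - mb) / ((va / len(a) + vb / len(b)) ** 0.5)

t = welch_t(misrep, intentional)
print(round(abs(t), 2))
```

With real claims data the statistic and its reference distribution would need more care (the groups are not randomly assigned), which is exactly the observational-study caveat discussed next.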

In an observational study, the assignment of the population to groups
or conditions of the experiment is outside the investigator's purview.
A hypothesis such as "smoking causes cancer" or "video games cause
violence" [@engelhardt2011your] is harder to test in a purely
scientific manner. In fact, the earlier position on video games by
[@engelhardt2011your] has been attributed to priming by
[@kuhn2019does] and the jury could still be out on this since we
cannot guarantee homogeneity of the sample in testing for observed
effects. In such scenarios the best we can do are observational studies
to gain more information about our hypothesis.

## What are Measurement Variables (aka Direct Factors)?

In order to perform a *scientific study*, a data scientist should start
by picking up on *signals* of misrepresentation and intentional damage.
These signals are often referred to as measurement variables for
modeling. The model of choice for such a problem is a discriminative
model, i.e., a model discriminating fraud of misrepresentation and
intentional damage. In the old but popular example of discriminating the
iris species [@fisher1936use], the petal length/width and sepal
length/width provided sufficient measurement variables for
discrimination of the species using linear functions. In this iris
analysis, the experiment was natural, i.e., not in the control of an
experimenter.
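As a toy illustration of such a linear discrimination, here is a sketch with hypothetical petal-length values (not Fisher's actual measurements): a single threshold at the midpoint of the class means already separates the two species.

```python
# Hypothetical petal lengths (cm) for two iris species; a 1-D linear
# decision boundary (midpoint of class means) discriminates them.
setosa_petal = [1.4, 1.3, 1.5, 1.4, 1.7]      # hypothetical values
versicolor_petal = [4.7, 4.5, 4.9, 4.0, 4.6]  # hypothetical values

threshold = (sum(setosa_petal) / len(setosa_petal)
             + sum(versicolor_petal) / len(versicolor_petal)) / 2

def classify(petal_length):
    # Discriminative rule: which side of the boundary is the measurement on?
    return "setosa" if petal_length < threshold else "versicolor"

print(classify(1.6), classify(4.2))
```

Fisher's actual analysis used a linear combination of all four measurements; the one-variable threshold here is only the simplest instance of the same discriminative idea.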

The term ***natural*** means the experimenter did not genetically
modify the species to show variations; the variation in the species
arose through natural selection. On the other hand, in cases such as
experiments with
[fruit flies](https://bdsc.indiana.edu/about/index.html) (available at
Indiana University for research), a scientist would study the species by
"knocking out genes" or "inducing variations" creating a *controlled*
experiment. The key in either case would be understanding the *factors*
or **measurement variables** for the hypothesis under study.

A **natural/observational experiment** is a useful alternative when a
controlled experiment cannot be undertaken, as in the insurance example.
It is important to note that a natural experiment can also have issues
regarding confounding variables and bias which potentially invalidate
the experiment.

A ***confound*** (or confounding variable) can be defined as a factor
which could directly or indirectly affect the response variable when
considering a direct measurement. Let's take a concrete example here to
understand this concept. Assume a scout is looking for talent in
basketball (or a VC firm is scouting for investment, the analogy is
similar). The scout assesses the talent using a few metrics such as
average points per game, assists for offense and rebounds, block, steals
for defense. There are *other aspects* (or confounds) which come into
the purview of a scout, such as medical history and
stability/improvement of stats because these indicate the progression of
a player and future outcomes. In many cases, a *confound* plays a large
role. For example, a player with a debilitating shoulder injury could be
a red flag since the future outcome could be weaker with a higher
probability. The difficulty lies in ascertaining the confounds for the
hypothesis under study, which requires understanding the true nature of
the effect a confound has on the hypothesis. A *targeted interview*
with an expert (such as a claims investigator for insurance or a talent
scout for sports) is a valuable tool in a data scientist's arsenal for
understanding the factors and confounds that should be considered as
data to be included in a model. An interview provides the intuition, or
priors in a Bayesian context, for data gathering and evaluation.
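The effect of a confound can be sketched with simulated player data (the generator, effect sizes, and injury rate are all invented for illustration): an unobserved injury drives both the measured stat and the future outcome, inflating their raw correlation.

```python
# Hypothetical simulation: injury (the confound) lowers both points per
# game and future performance, so the raw ppg-future correlation partly
# reflects the confound rather than a direct effect.
import random

random.seed(7)

def player():
    injured = random.random() < 0.3                   # unobserved confound
    ppg = random.gauss(15, 3) - (5 if injured else 0)
    future = random.gauss(15, 3) - (6 if injured else 0)
    return injured, ppg, future

players = [player() for _ in range(500)]
healthy = [p for p in players if not p[0]]

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / (vx * vy) ** 0.5

r_all = corr([p[1] for p in players], [p[2] for p in players])
r_healthy = corr([p[1] for p in healthy], [p[2] for p in healthy])
# Conditioning on the confound shrinks the apparent correlation
print(round(r_all, 2), round(r_healthy, 2))
```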

A variable or factor discriminating ***misrepresentation*** from
***intentional damage*** could be identified based on multiple
perspectives. Personally, I choose the word perspective as a line of
attack/strategy to understand the contributing factors from first
principles. This is a preferred approach, in my opinion, to throwing the
kitchen sink at a dataset.

#### Historical variables

Historical variables can be obtained from similar categories of claims
in the past. They are useful in understanding patterns of normal
insurance claims and of misrepresentation. Cost per type of damage
could be a general factor to monitor, which requires categorizing the
types of damage available in historical data. In many cases, the
insurance system would place restrictions on the types of damage
covered and bundle similar damages under a large umbrella (because it's
easier to deal with one type and have a single process). For example,
flooding could be due to natural events like weather (rain, storm,
waves, etc.) or a pipe breaking due to stress or damage. Classifying
the category at the right level is important in order to provide models
the right level of information; data-driven approaches that do not
define appropriate levels for a category when collecting data can
*misclassify* labels and lose a lot of context.
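One way to keep that context is to record categories hierarchically rather than under the umbrella label alone; a minimal sketch with hypothetical claim records:

```python
# Hypothetical claim records with hierarchical categories: the middle
# path segment preserves the cause that the umbrella label "flooding"
# would lose.
claims = [
    {"id": 1, "category": "flooding/weather/rain"},
    {"id": 2, "category": "flooding/weather/storm"},
    {"id": 3, "category": "flooding/plumbing/pipe_burst"},
]

def cause(claim):
    # Second segment of the hierarchical label, e.g. "weather"
    return claim["category"].split("/")[1]

by_cause = {}
for c in claims:
    by_cause.setdefault(cause(c), []).append(c["id"])
print(by_cause)
```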

#### Textual variables

Textual variables can be obtained from an insurance claim which asks
pointed questions to a claimant. Many of the responses to the questions
can be free form text or speech which allow representation of the
situation in the claim. A misrepresented claim can potentially have
signals in the text used to describe the situation. Simple constructs
would be the overuse of certain elements to lend validity to the claim.
A speech pattern can carry inflections when facts are misrepresented,
which a model can capture.

Another common pattern to obtain signals is asking the same question
with a different phrasing. Text or speech patterns for both questions
should ideally be similar, and a measure of dissimilarity can be used
by a model to discriminate between misrepresentation and intentional
damage. The details of the spacing between the questions and their
phrasing are experimental variables in the hands of the data scientist
for gathering useful signals.
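One crude way to turn such dissimilarity into a feature is a Jaccard distance over word sets; the answers below are hypothetical, and a real model would use far richer text representations:

```python
# Hypothetical feature: Jaccard distance between two answers to the
# same question phrased differently; a consistent retelling should
# score lower than a divergent one.
def jaccard_distance(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return 1 - len(sa & sb) / len(sa | sb)

answer1 = "the pipe broke and flooded the basement overnight"
answer2 = "the basement flooded overnight when the pipe broke"   # consistent
answer3 = "water damage happened because of heavy rain last week" # divergent

print(round(jaccard_distance(answer1, answer2), 2))
print(round(jaccard_distance(answer1, answer3), 2))
```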

#### Social variables

Social variables can be obtained from aspects of social interaction such
as association to similar groups, participation in similar events or
mining social media sites such as Facebook, Twitter, Snapchat, etc. The
usage of social variables stems from the phrase "neurons that fire
together wire together", implying that if one person filed a claim with
misrepresentation or intentional damage, another person connected
through social bonds could be correlated to do the same.

Personally, I am not a proponent of using social variables but in some
cases they can provide useful information akin to a prior for the model.
A data scientist needs to be careful in ensuring the prior or social
variables can be overcome by evidence in either direction.
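The prior-versus-evidence point can be sketched as a simple Bayes odds update (the numbers and likelihood ratios are assumptions chosen purely for illustration):

```python
# Hypothetical sketch: a social-network signal as a weak prior P(fraud)
# that evidence can overcome via Bayes updates in odds form.
def posterior(prior: float, likelihood_ratio: float) -> float:
    """Update P(fraud) given evidence with the stated likelihood ratio."""
    odds = prior / (1 - prior) * likelihood_ratio
    return odds / (1 + odds)

p = 0.10               # weak prior from a social-tie signal (assumed)
p = posterior(p, 0.2)  # documentation checks out (LR < 1 favors no fraud)
p = posterior(p, 0.5)  # consistent answers across interviews
print(round(p, 3))     # evidence has largely overcome the prior
```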

#### Economic variables

## Identifying measurement variables

### Correlation

### Separation of classes

[-- Attachment #3: blog.bib --]
[-- Type: text/x-bibtex, Size: 2073 bytes --]

# articles for reinforcement learning
@article{vinyals2017starcraft,
  title={Starcraft ii: A new challenge for reinforcement learning},
  author={Vinyals, Oriol and Ewalds, Timo and Bartunov, Sergey and Georgiev, Petko and Vezhnevets, Alexander Sasha and Yeo, Michelle and Makhzani, Alireza and K{\"u}ttler, Heinrich and Agapiou, John and Schrittwieser, Julian and others},
  journal={arXiv preprint arXiv:1708.04782},
  url={https://arxiv.org/pdf/1708.04782},
  year={2017}
}
@article{dulac2019challenges,
  title={Challenges of real-world reinforcement learning},
  author={Dulac-Arnold, Gabriel and Mankowitz, Daniel and Hester, Todd},
  journal={arXiv preprint arXiv:1904.12901},
  url={https://arxiv.org/pdf/1904.12901},
  year={2019}
}

# articles on data science
@article{engelhardt2011your,
  title={This is your brain on violent video games: Neural desensitization to violence predicts increased aggression following violent video game exposure},
  author={Engelhardt, Christopher R and Bartholow, Bruce D and Kerr, Geoffrey T and Bushman, Brad J},
  journal={Journal of Experimental Social Psychology},
  volume={47},
  number={5},
  pages={1033--1036},
  year={2011},
  url={https://hal.archives-ouvertes.fr/peer-00995254/document},
  publisher={Elsevier}
}
@article{kuhn2019does,
  title={Does playing violent video games cause aggression? A longitudinal intervention study},
  author={K{\"u}hn, Simone and Kugler, Dimitrij Tycho and Schmalen, Katharina and Weichenberger, Markus and Witt, Charlotte and Gallinat, J{\"u}rgen},
  journal={Molecular psychiatry},
  volume={24},
  number={8},
  pages={1220--1234},
  year={2019},
  url={https://www.nature.com/articles/s41380-018-0031-7},
  publisher={Nature Publishing Group}
}
@article{fisher1936use,
  title={The use of multiple measurements in taxonomic problems},
  author={Fisher, Ronald A},
  journal={Annals of eugenics},
  volume={7},
  number={2},
  pages={179--188},
  year={1936},
  url={https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1469-1809.1936.tb02137.x},
  publisher={Wiley Online Library}
}

             reply	other threads:[~2021-06-16 17:01 UTC|newest]

Thread overview: 4+ messages
2021-06-16 17:01 kiran kumar [this message]
     [not found] ` <f0b66f7d-b530-4c2e-9979-2fd40ff51dd4n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2021-06-16 17:19   ` Joseph Reagle
2021-06-16 19:34   ` John MacFarlane
     [not found]     ` <m21r9138os.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org>
2021-06-16 21:53       ` kiran kumar
