Data analysis approach

From Reproducible Research Course Notes, Xing Su.

Data Collection

When you have a research question, it is probably about a population as a whole. However, you are only ever able to collect a sample. You want to collect your sample in such a way that the population estimates you compute from the sample are as accurate as possible.


To do so, you want to collect your sample randomly. However, even then there is a possibility of introducing some bias into the sample, which can make it non-representative of the population.
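As a rough illustration of why random collection matters, here is a minimal sketch that compares a simple random sample against a deliberately biased one. The population, the sample sizes, and the "take the first 500 records" selection rule are all made up for the example; they are assumptions, not part of the course notes.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 values, stored in sorted order so that
# a record's position is related to its value.
population = sorted(random.gauss(50, 10) for _ in range(10_000))
true_mean = statistics.fmean(population)

# Simple random sample: every member is equally likely to be chosen.
random_sample = random.sample(population, 500)

# Biased "convenience" sample: just the first 500 records, which here
# happen to be the smallest values in the population.
biased_sample = population[:500]

random_error = abs(statistics.fmean(random_sample) - true_mean)
biased_error = abs(statistics.fmean(biased_sample) - true_mean)

print(f"error of random sample mean: {random_error:.2f}")
print(f"error of biased sample mean: {biased_error:.2f}")
```

The random sample's mean should land close to the population mean, while the convenience sample's mean is far off, because the way it was selected is tied to the quantity being estimated.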

A couple of examples of bias:

There are a few different ways of randomly collecting your sample:


Broadly, data may be collected in one of two ways, which are further broken down into categories:

The only data you have available to analyze is the data that you collect. That is, you can only examine the variables for which you have collected data. Perhaps you want to see how two of them are correlated. But there is always the possibility of a confounding variable; that is, some variable that correlates with both. Say you see that $A$ is correlated with $B$ and you suspect there may be a causal relationship. There may be a third variable $C$ which correlates with both $A$ and $B$ and might be the underlying cause.
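The confounding scenario can be sketched with a small simulation. In this hypothetical setup, $C$ drives both $A$ and $B$, and there is no direct link between $A$ and $B$ at all. The helper names `pearson` and `residuals`, the noise levels, and the sample size are assumptions chosen for illustration. Controlling for $C$ here means removing its linear effect from each variable and correlating what is left.

```python
import random
import statistics

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def residuals(y, x):
    """Residuals of y after a least-squares straight-line fit on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [(b - my) - slope * (a - mx) for a, b in zip(x, y)]

random.seed(1)  # fixed seed so the sketch is repeatable
n = 5000

# C is the confounder; A and B each depend on C plus independent noise.
c = [random.gauss(0, 1) for _ in range(n)]
a = [ci + random.gauss(0, 1) for ci in c]
b = [ci + random.gauss(0, 1) for ci in c]

corr_ab = pearson(a, b)
# Correlation between A and B once the linear effect of C is removed.
corr_ab_given_c = pearson(residuals(a, c), residuals(b, c))

print(f"corr(A, B)             = {corr_ab:.2f}")
print(f"corr(A, B | C removed) = {corr_ab_given_c:.2f}")
```

$A$ and $B$ come out clearly correlated, yet the correlation nearly vanishes once $C$ is accounted for. This check is only possible because $C$ was recorded; if it had not been collected, the raw correlation would be all you could see.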


Learning: Tips

(my main takeaways from the Sparse Spaces, Phonology lecture)

An ideal approach to AI problems is:

  1. Specify the problem
  2. Devise a representation suited to the problem
  3. Determine an approach or method
  4. Pick a mechanism or devise an algorithm
  5. Experiment

This isn't a linear process; while working on the problem you are likely to jump around, perhaps redefine the problem, etc. But generally, this is the sequence you'd want to follow.

However, in practice many AI practitioners fall in love with a particular mechanism or algorithm and try to apply it to everything, which is too inflexible an approach. You should match the mechanism to the problem, not vice versa.

Coming up with the right representation is crucial to success in these kinds of problems. So how do you do it? There are a few heuristics about what makes a good representation:



from Rich Caruana (Microsoft Research), at