Data analysis approach
- Clearly and unambiguously define the question you wish to solve
- Determine the ideal dataset for your goal
  - Descriptive - whole population
  - Exploratory - random sample with multiple variables measured
  - Inferential - drawing a conclusion about a larger population from a random sample
  - Predictive - need training and testing data from the same population to build a model and a classifier
  - Causal - experimental data from a randomized study
  - Mechanistic - data from all components of the system you want to describe
- Obtain the data that you need - you may need to pay for it, or generate it yourself
- Clean the data and convert it to a format suitable for your needs
- Exploratory data analysis - become familiar with the data
- Statistical modeling/prediction
- Interpret results, with detailed explanations
- Challenge your results (and entire process), audit the whole thing. Try to come up with alternative analyses, and so on.
- Synthesize and write up your results
- Create reproducible code
From Reproducible Research Course Notes, Xing Su.
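For the predictive case in the list above, the training and testing data should come from the same population. One simple way to arrange that is to shuffle a single collected dataset and split it. The following is a minimal sketch of that idea, not something from the course notes; the function name `train_test_split`, the 25% test fraction, and the stand-in data are all assumptions made for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for 100 collected cases (hypothetical data).
collected = list(range(100))
train, test = train_test_split(collected)
print(len(train), len(test))
```

Because both parts are drawn from the same shuffled dataset, neither one is systematically different from the other, and no case appears in both.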
When you have a research question, it is probably about a population as a whole. However, you are only ever able to collect a sample. You want to collect your sample in such a way that the population estimates you compute from the sample are as accurate as possible.
To do so, you want to collect your sample randomly. However, even then there is the possibility of introducing some bias into the sample, which can make it non-representative of the population.
A couple of examples of bias:
- non-response bias - if you collect data via surveys and non-response is high, the collected results might not really reflect the population. Perhaps there are common factors which led many people not to respond; these factors may influence your study in an unintended way.
- convenience bias - similarly, your sample may only reflect cases which were more accessible due to your collection methodology.
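Non-response bias can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the population (weekly work hours), the response probabilities (0.8 and 0.3), and the function name `responds` are all hypothetical. The point is that a randomly invited sample still gives a skewed estimate when the chance of responding depends on the quantity being measured.

```python
import random

rng = random.Random(42)

# Hypothetical population: weekly work hours for 100,000 people.
population = [rng.gauss(40, 10) for _ in range(100_000)]


def responds(hours):
    """Assumed response model: people working 40+ hours answer less often."""
    probability = 0.8 if hours < 40 else 0.3
    return rng.random() < probability


# The invitations are a simple random sample, but the responses are not.
invited = rng.sample(population, 5_000)
respondents = [hours for hours in invited if responds(hours)]

true_mean = sum(population) / len(population)
survey_mean = sum(respondents) / len(respondents)

print(f"population mean: {true_mean:.1f}")
print(f"respondent mean: {survey_mean:.1f}")
```

Under these assumptions the respondent mean comes out lower than the population mean, because the people with more hours are under-represented among those who answered.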
There are a few different ways of randomly collecting your sample:
- simple random sampling - just randomly select your sample
- stratified sampling - divide the population into strata, which are groups of similar cases. Then simple random sampling is used within each stratum. All strata must be sampled from. One complication is that stratified samples are analyzed differently than simple random samples.
- cluster sampling - divide the population into clusters, randomly sample a fixed number of clusters, then collect simple random samples from each cluster. Not all clusters must be sampled from. Again, however, cluster samples require different analysis than simple random samples.
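The three sampling schemes can be sketched in a few lines using only the standard library. The population, the `region` field, and the function names here are hypothetical. For brevity the same `region` grouping serves as both the strata and the clusters, which is a simplification: in practice strata are chosen so that the cases within each one are similar, while clusters are usually chosen for convenience of collection.

```python
import random
from collections import defaultdict

rng = random.Random(0)

# Hypothetical population: 1,000 people, each tagged with one of 10 regions.
population = [{"id": i, "region": f"region_{i % 10}"} for i in range(1000)]


def group_by(cases, key):
    """Group cases into lists keyed by the value of case[key]."""
    groups = defaultdict(list)
    for case in cases:
        groups[case[key]].append(case)
    return groups


def simple_random_sample(cases, n):
    """Draw n cases uniformly at random, without replacement."""
    return rng.sample(cases, n)


def stratified_sample(cases, key, n_per_stratum):
    """Draw a simple random sample from every stratum."""
    sample = []
    for members in group_by(cases, key).values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample


def cluster_sample(cases, key, n_clusters, n_per_cluster):
    """Pick some clusters at random, then sample only within those."""
    groups = group_by(cases, key)
    chosen = rng.sample(sorted(groups), n_clusters)
    sample = []
    for name in chosen:
        sample.extend(rng.sample(groups[name], n_per_cluster))
    return sample


srs = simple_random_sample(population, 50)
strat = stratified_sample(population, "region", 5)
clust = cluster_sample(population, "region", 3, 10)

print(len({c["region"] for c in strat}), "regions in the stratified sample")
print(len({c["region"] for c in clust}), "regions in the cluster sample")
```

The stratified sample touches all 10 regions, while the cluster sample touches only the 3 regions that were drawn, which is the structural difference between the two schemes.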
Broadly, data may be collected in one of two ways, which are further broken down into categories:
- observational study - the data collection process does not interfere with the population being studied
  - prospective study - identify individuals and collect data as events happen, going forward
  - retrospective study - use data which has already been collected, such as archived data
- experimental study - the data collection process involves some kind of intervention, the effect of which is under study
The only data you have available to analyze is the data that you collect. That is, you can only examine the variables for which you have collected data. Perhaps you want to see how two of them are correlated. But there is always the possibility of a confounding variable; that is, some variable that correlates with both of them. Say you see that $A$ is correlated with $B$ and you suspect there may be a causal relationship. There may be a third variable $C$, possibly one you never measured, which correlates with both $A$ and $B$ and might be the underlying cause.
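A small simulation makes the confounding scenario concrete. This is a sketch under assumed conditions: $C$ is standard normal noise, and $A$ and $B$ are each $C$ plus independent noise, so neither influences the other. The helper `pearson` and all variable names are introduced here for illustration. The partial correlation used at the end is the standard first-order formula for the correlation of $A$ and $B$ after controlling for $C$.

```python
import random
from math import sqrt

rng = random.Random(1)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# C drives both A and B; A and B never influence each other.
c = [rng.gauss(0, 1) for _ in range(10_000)]
a = [ci + rng.gauss(0, 1) for ci in c]
b = [ci + rng.gauss(0, 1) for ci in c]

r_ab = pearson(a, b)
r_ac = pearson(a, c)
r_bc = pearson(b, c)

# Partial correlation of A and B, controlling for C.
r_ab_given_c = (r_ab - r_ac * r_bc) / sqrt((1 - r_ac**2) * (1 - r_bc**2))

print(f"corr(A, B)     = {r_ab:.2f}")
print(f"corr(A, B | C) = {r_ab_given_c:.2f}")
```

In this setup $A$ and $B$ are clearly correlated, yet the correlation roughly vanishes once $C$ is accounted for. That check is only possible because $C$ was recorded; if it had not been collected, the data alone would give no way to tell the two situations apart.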
(my main takeaways from the Sparse Spaces, Phonology lecture)
An ideal approach to AI problems is:
- Specify the problem
- Devise a representation suited to the problem
- Determine an approach or method
- Pick a mechanism or devise an algorithm
This isn't a linear process; while working on the problem you are likely to jump around, perhaps redefine the problem, etc. But generally, this is the sequence you'd want to follow.
However, in practice many AI practitioners fall in love with a particular mechanism or algorithm and try to apply it to everything, which is too inflexible an approach - you should match the mechanism to the problem, not vice versa.
Coming up with the right representation is crucial to success in these kinds of problems. So how do you do it? There are a few heuristics about what makes a good representation:
- It makes the right things (distinctive features, relationships, etc) explicit
- It exposes constraints to work off of
- It has locality - the relevant information is compact rather than spread out
- Problem definition
- Data collection
- Data cleaning
- Data coding (feature engineering)
- Metric selection
- Algorithm selection
- Parameter optimization
- Online evaluation
from Rich Caruana (Microsoft Research), at https://chronicles.mfglabs.com/learning-to-learn-or-the-advent-of-augmented-data-scientists-20873282e181