Technical Considerations for Longitudinal Data Analyses

Noushin Nabavi
DataDrivenInvestor

--

At a technical level, a data analytics workflow consists of a series of steps, from designing an analytic cohort (ideally with a control or randomized comparison group) to determining outcomes. These include (i) setting up the data while carefully considering time and interventions, (ii) visually inspecting data plots for outliers, trends, and data quality issues, (iii) running preliminary analyses to examine correlations and candidate models, and finally (iv) predicting the relative and absolute effects or changes due to an intervention.

Throughout the process of data analysis, we should keep both internal and external validation in mind. Internal validation concerns the conclusions we draw from the data within a project, while external validation asks how well the results hold up against the rest of the world. When assessing internal validity, the points to consider are threats to the study's assumptions or hypotheses. These threats include thorough consideration of (i) history, (ii) maturation, (iii) instrumentation and testing, (iv) statistical regression (regression to the mean), and (v) confounding or selection bias* among variables.

The first step requires taking the time to become intimately familiar with the data while considering ‘wild points’ (outliers), linear trends, interventions or co-interventions, and data quality issues such as missing data or classification errors. At this point it is useful to designate one group as the control or comparison cohort. It is also often helpful to add variables such as time, trend, and level of change manually to the dataset; doing so requires being aware of the counterfactuals in the data acquisition process, and the added variables may include time-variant, time-invariant, or multivariate predictors. None of this precludes a quick, high-level visual plot of the data.
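As a minimal sketch of this setup step, the snippet below prepares a monthly outcome series for an interrupted-time-series style analysis, adds time, level, and trend variables, and plots the series for a first visual inspection. The file name, column names, and intervention date are hypothetical placeholders.

```python
# Sketch of step (i)-(ii): prepare the dataset and inspect it visually.
# "monthly_outcomes.csv", "date", "outcome", and the intervention date are assumptions.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("monthly_outcomes.csv", parse_dates=["date"])
df = df.sort_values("date").reset_index(drop=True)

intervention_date = pd.Timestamp("2018-01-01")  # assumed intervention start

# Manually added variables: elapsed time, post-intervention level indicator,
# and post-intervention trend (time elapsed since the intervention).
df["time"] = range(1, len(df) + 1)
df["level"] = (df["date"] >= intervention_date).astype(int)
df["trend"] = df["level"].cumsum()

# Basic quality check: missing outcome values.
print(df["outcome"].isna().sum(), "missing outcome values")

# High-level visual inspection for outliers, trends, and the intervention point.
plt.plot(df["date"], df["outcome"], marker="o")
plt.axvline(intervention_date, linestyle="--", color="grey")
plt.xlabel("date")
plt.ylabel("outcome")
plt.show()
```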

Next, one can use the time, trend, and level variables as a basis for finding patterns in the data. This is where machine learning algorithms or statistical models come in, introduced one at a time or together, to infer, train, test, measure errors and losses, and predict differential changes in time, level, or trend. For these, one can turn to supervised, unsupervised, or reinforcement learning. Supervised learning with linear or logistic regression can be used for regression or classification of patterns; for intervention effects this is often done with segmented regression on the control and comparison cohorts. Useful diagnostics include the Durbin-Watson test, residual plots, and autocorrelation and partial autocorrelation plots (used to choose the autoregressive order p and moving-average order q). A Durbin-Watson statistic near 2 indicates no autocorrelation in the residuals; values below 2 indicate positive autocorrelation and values above 2 indicate negative autocorrelation. These models require training, testing, and prediction steps on the dataset. For unlabeled data, unsupervised learning algorithms such as clustering, anomaly detection, or neural networks* can be used. Reinforcement learning algorithms are the next mode to consider if the data do not fall into the former categories.
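A hedged sketch of the segmented-regression route, assuming the hypothetical DataFrame `df` with `time`, `level`, and `trend` columns from the previous snippet, might look like this. It fits an ordinary least squares model and then runs the diagnostics mentioned above.

```python
# Segmented regression on the time/level/trend variables, plus diagnostics.
# Assumes the hypothetical DataFrame `df` built in the earlier snippet.
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from statsmodels.stats.stattools import durbin_watson
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Baseline trend (time), level change (level), and trend change (trend).
model = smf.ols("outcome ~ time + level + trend", data=df).fit()
print(model.summary())

# Durbin-Watson: ~2 suggests no residual autocorrelation; <2 suggests positive,
# >2 suggests negative autocorrelation.
print("Durbin-Watson:", durbin_watson(model.resid))

# ACF/PACF plots of the residuals help choose AR (p) and MA (q) orders if the
# remaining autocorrelation needs to be modeled explicitly.
fig, axes = plt.subplots(2, 1)
plot_acf(model.resid, ax=axes[0])
plot_pacf(model.resid, ax=axes[1])
plt.show()
```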

Once we have settled on a model, we move on to visualizing the results and predicting change. It is also important to identify and troubleshoot potential modeling issues at this point. One issue is wild points, which may arise from anticipatory behavior, data quality problems, or historical events; the options are to model them explicitly (e.g. with fitted terms in the regression) or to omit them from the analysis. Other issues arise from seasonality in the data, whether natural patterns or artifacts of program design; to deal with these, we may need to model specific time periods or add functions of time (e.g. sines or cosines). Finally, in some instances policy implementation isn’t instantaneous, which delays when effects appear in the data; possible solutions are to exclude the transition period from the time-series analysis or to model it as a separate segment. Several strategies for modeling multiple interventions in the same time series can likewise be employed and assessed, addressing both non-linear (quadratic or different outcome models) and linear trends that may explain the data.
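To make a couple of these strategies concrete, the sketch below adds sine/cosine terms for seasonality, an indicator for a known wild point, and a refit that excludes a short phase-in window after the intervention. It assumes the DataFrame `df` and `intervention_date` from the earlier snippets; the 12-month period, the flagged date, and the 3-month transition window are all assumptions for illustration.

```python
# Troubleshooting sketch: seasonality terms, a wild-point indicator, and
# exclusion of a phase-in window. All specific dates/periods are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

period = 12  # assumed monthly data with yearly seasonality
df["sin12"] = np.sin(2 * np.pi * df["time"] / period)
df["cos12"] = np.cos(2 * np.pi * df["time"] / period)

# Indicator for a single anomalous observation (e.g. a known historical event).
df["wild_point"] = (df["date"] == pd.Timestamp("2017-06-01")).astype(int)  # hypothetical date

seasonal_model = smf.ols(
    "outcome ~ time + level + trend + sin12 + cos12 + wild_point", data=df
).fit()
print(seasonal_model.summary())

# A delayed (phased-in) intervention can alternatively be handled by dropping a
# short transition window before refitting the segmented model.
transition = (df["date"] >= intervention_date) & (
    df["date"] < intervention_date + pd.DateOffset(months=3)
)
refit = smf.ols("outcome ~ time + level + trend", data=df.loc[~transition]).fit()
print(refit.params)
```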

Finally, these analyses apply to a wide range of research questions, from political science (e.g. the impact of changes in laws and policies on the behavior of a population); economics (e.g. the impact of changes in microcredit loans on business growth); sociology (e.g. income maintenance in welfare programs); and history (e.g. the impact of major historical events on the affected population); to medicine (e.g. treatment strategies as interventions).

References:

*Interrupted time series regression for the evaluation of public health interventions

*Experimental and Quasi-Experimental Designs for Generalized Causal Inference

*Policy Analysis

*Applied Longitudinal Data Analysis

*Clustering techniques
