Translational User Guide

This page provides the code and annotated code to explain the Python routines used to generate variables and to execute the predictions in Multivariate random forest prediction of poverty and malnutrition prevalence(2021). Here you can learn how to recreate the code for your needs.

The annotated guides can not compile. They contain explanations that are not commented. If you want the code to compile, please download the desired Stata, R, or Python files.

Correlated Random Forest Sequential Nowcasting and Contemporaneous Prediction using Python

Code written by Chris Browne.

Contemporaneous Forecasting

Annotated Guide

Uncompilable Word Document

Python File

Complilable File

  import numpy as np
  import pandas as pd
  import pickle
  import pdb
  from sklearn.ensemble import RandomForestRegressor as RF
  from sklearn.metrics import r2_score as r2
  from sklearn.metrics import mean_squared_error as mse
  from sklearn.model_selection import train_test_split
  from sklearn.linear_model import Ridge as ridge
  ##set seed
  np.random.seed(6023)

  Setting the Forest Size
  nt = 2000

  Five Fold Cross Validation Testing and Training are performed 5 times for each survey
  folds = 5

  Creating a vector that has different random forest methods: independent(ind) or joint
  rftypes = ['ind','joint'] ##ind must be first

  Setting the countries, these might be different for you.
  countries = ["Bangladesh","Ethiopia","Ghana","Guatemala","Honduras","Mali","Nepal","Kenya","Senegal","Uganda","Nigeria"]

  Setting the years, these might be different for you.
  pyears = [["04","07","11","14"],["05","11","16"],["08","14"],["14"],["11"],["06","12"],["06","11","16"],['08','14'],["05",'10'],["06",'11','16'],["08","13"]]

  Setting the different possible outcomes, these might be different for you.
  outcomes = ['stunted', 'wasted',  'healthy', 'poorest','underweight_bmi']

  For loop to sort the years into a list.
  syears = np.sort(list(set([y for years in pyears for y in years])))

  Create a pandas dataframe using the data that was cleaned earlier.
  xa = pd.read_csv('data/data.csv')
  imps = [[] for country in countries]

  ##for each survey year, for each country
  for i,year in enumerate(syears):
      for j,country in enumerate(countries):
  	 prints the index of the current syears and index of current countries
          print(i,j)
          ##filter data
          x = xa[xa.country == country]
          x = x.drop('country',axis=1)
  	  Set the year for x. We do 20 + year because year only contains the last 2 digits of the year.
          x = x[x.year <= int('20' + year)]
    if statement states if x is larger than 0, it hasn't finished looking at all the countries yet
          if np.shape(x)[0]>0:

  The below code chunk takes out data where more than 20% of the data is missing. Here it is just taking out the market price data. You may need to shift this number according to your dataset.
              ##drop largely missing features
        mask = np.array(x.isnull().sum()/len(x)) The average when x is null into mask.
              x = x.loc[:,mask<.2] Locates the index of x.

   		Drops NA rows.
              ##drop rows with missing data
              x = x.dropna(axis=0)

              ##test train split and CV
  	      Set the x training data when the year is greater than the x year.
              xtr = x[x.year < int('20' + year)]
  	Set the x year when it’s equal to year.
              x = x[x.year == int('20' + year)]
  		Keep doing this when there’s still information in x(x is greater than 0)
              if np.shape(x)[0]>0:
   		Set q as the vector of x[0] from the first element to the second to last.
                  q = np.arange(np.shape(x)[0])
   	Cross Validation
                  for fold in range(folds):
  			Shuffle the countries randomly
                      np.random.shuffle(q)
                      trinds = q[:int(.8*len(q))]
                      teinds = q[int(.8*len(q)):]
                      xtrf = pd.concat([xtr,x.iloc[trinds,:]],ignore_index=True)
                      xtef = x.iloc[teinds,:]

                      ##make random forests
                      Labs is created because xtr contains the predictions at the end of the list, but when recreating this, we suggest keeping the predictions in a separate variable to minimize confusion. Additionally, the country is in the 0th column and we need to omit this when making the predictions, hence why labs starts at the 1st column.
                      labs = list(xtrf.columns)[1:-5]
                      for rftype in rftypes:
                          Joint Random Forest
                          if rftype =='joint':
                              Load the data into a numpy data frame called W
                              W = np.load('data/W/W'+year+country+'c.npy')
                              Create a variable ytrf with the y training data by taking the dot product of W and the transpose of xtr[outcomes] and the transpose of that
                              ytrf = np.dot(W,xtrf[outcomes].T).T
                              Create a variable yte with the y testing data by taking the dot product of W and the transpose of xte[outcomes] and the transpose of that
                              ytef = np.dot(W,xtef[outcomes].T).T
                              Create the Random Forests.
                              rf = RF(nt,max_depth=4,max_features = .333)
                              rf = rf.fit(xtrf[labs],ytrf)
                              Set predtr(prediction training) as the prediction of xtrf[labs](x testing random forest) from the given decision trees.
                              predtr =  rf.predict(xtrf[labs])
                              Set predte(prediction testing) as the prediction of xtef[labs](x training random forest) from the given decision trees.
                              predte = rf.predict(xtef[labs])

                              Sklearn does not support using the Random forest method on dependent data. Below, a whitening technique is used to make the data independent by orthogonalizing the data so the random forest technique can be used.

                              Set Winv as the inverse of W.
                              Winv = np.linalg.inv(W)
                              Set predtr as the transpose of the dot product of Winv and predtr.T(transpose of predtr)
                              predtr = np.dot(Winv,predtr.T).T
                              Set predte as the transpose of the dot product of Winv and predte.T(transpose of predte)
                              predte = np.dot(Winv,predte.T).T
                              Append the important features from the decision trees to imps[j]. This variable is used to produce some figures but not used to make predictions.
                              imps[j].append(rf.feature_importances_)

                              For loop of the outcomes to set each respective outcome.
                              for k,outcome in enumerate(outcomes):
                              This code stores the outcomes in xtrf,/xtef when replicating it is best to create a new variable for outcomes to minimize confusion.

                                  Set the outcome + 'rf'+ joint = the predicted outcome found above in xtrf.
                                  xtrf[outcome+'rf' + rftype] = predtr[:,k]
                                  Set the outcome +'rf' + joint = the predicted outcome found above in xtef.
                                  xtef[outcome+'rf'+ rftype] = predte[:,k]

                          Independent Random Forests
                          else:
                              Set ytrf as the training data with the specified outcomes from the top.
                              ytrf = xtrf[outcomes]
                              Set ytef as the testing data with the specified outcomes from the top.
                              ytef = xtef[outcomes]

                              For loop through the outcomes to set each respective outcome.
                              for outcome in outcomes:
                                  rf = RF(nt,max_depth=4,max_features = .333)
                                  rf = rf.fit(xtrf[labs],ytrf[outcome])
                                  This code stores the outcomes in xtrf,/xtef when replicating it is best to create a new variable for outcomes to minimize confusion.

                                  Set the x training data with the outcomes from the prediction from the random forests.
                                  xtrf[outcome+'rf' + rftype] = rf.predict(xtrf[labs])
                                  Set the x testing data with the outcomes from the prediction from the random forests.
                                  xtef[outcome+'rf' + rftype] = rf.predict(xtef[labs])

                              Set W as the correlation of each pair of residuals from the training data’s transposed outcomes, then takes the singular value decomposition and multiply the observation of that matrix which will orthogonalize them and make them independent.
                              W = np.linalg.svd(np.corrcoef((np.array(xtrf[outcomes])-np.array(xtrf[[outcome + 'rfind' for outcome in outcomes]])).T))[2]
                              Save the data file using W set above as a Numpy object.
                              np.save('data/W/W'+year+country+'c',W)

                      Set xtrf(x training) with the outcomes of independent and joint random forest of the training data. The results are kept in xtef, but you can change this variable so that the results are in a separate variable to minimize confusion.
                      xtrf = xtrf[outcomes + [outcome + 'rfind' for outcome in outcomes] + [outcome + 'rfjoint' for outcome in outcomes]]
                      Set xtef(x testing) with the outcomes of independent and joint random forest of the testing data. The results are kept in xtef, but you can change this variable so that the results are in a separate variable to minimize confusion. 
                      xtef = xtef[outcomes + [outcome + 'rfind' for outcome in outcomes] + [outcome + 'rfjoint' for outcome in outcomes]]

  Open the file with the specified year and convert training and testing results and the file as a byte stream.
                      with open('results/'+year+country+'resultscrf' +str(fold) +'.pkl','wb') as f:
                          pickle.dump([xtrf,xtef],f)

Sequential Nowcasting

Annotated Guide

Uncompilable Word Document

Python File

Complilable File

  Sequential Now-casting: To be used in Early Warning Systems.

  import numpy as np
  import pandas as pd
  import pickle
  import pdb
  from sklearn.ensemble import RandomForestRegressor as RF
  from sklearn.metrics import r2_score as r2
  from sklearn.metrics import mean_squared_error as mse
  from sklearn.model_selection import train_test_split
  from sklearn.linear_model import Ridge as ridge
  Setting the seed ensures that the results will be the same when reproduced.
  ##set seed
  np.random.seed(9823)

  Setting the Forest Size to 2000
  nt = 2000

  Creating a vector that has different random forest methods: independent(ind) or joint
  rftypes = ['ind','joint'] ##ind must be first

  Setting the countries, these might be different for you.
  countries = ["Bangladesh","Ethiopia","Ghana","Kenya","Mali","Nepal","Nigeria","Senegal","Uganda"]

  Setting the years, these might be different for you.
  pyears = [["04","07","11","14"],["05","11","16"],["08","14"],['08','14'],["06","12"],["06","11","16"],["08","13"],["05",'10'],["06",'11','16']]

  Setting the different possible outcomes, these might be different for you.
  outcomes = ['stunted', 'wasted',  'healthy', 'poorest','underweight_bmi']

  For loop to sort the years into a list. 
  syears = np.sort(list(set([y for years in pyears for y in years])))

  Create a pandas dataframe using the data that was cleaned by Linden McBride. 
  xa = pd.read_csv('data/data.csv')
  Imps is used to produce some figures in the paper and is not needed for predictions.
  imps = [[] for country in countries]

  ##for each survey year, for each country
  for i,year in enumerate(syears):
      for j,country in enumerate(countries):
   	Prints the index of the current syears and prints index of the current countries.
          print(i,j)
          ##filter data
  	  Only keep countries that we’re looking at, these might be different for you.
          x = xa[xa.country == country]
          x = x.drop('country',axis=1)
   	   Set the year for x. We do 20 + year because year only contains the last 2 digits of the year.
          x = x[x.year <= int('20' + year)]

          Split the data into Training(xtr) and Testing(xte) data
          ##test train split
          Training data is when the data’s year is less than the specified year.
          xtr = x[x.year < int('20' + year)]

          ##sequential
  	Testing Data is the data with the specified year.
          xte = x[x.year == int('20' + year)]

          This specific data handling is dependent on the dataset used.
          ##handles senegal 12
          if (np.shape(xtr)[0]*np.shape(xte)[0]>0)*(not country == 'Senegal' or not year == '12'):
   The below code chunk takes out data where more than 20% of the data is missing. Here it is just taking out the market price data. You may need to shift this number according to your dataset.
              ##drop largely missing features
              xf = pd.concat([xtr,xte],ignore_index=True)
              mask = np.array(xf.isnull().sum()/len(x))
              xtr = xtr.loc[:,mask<.2]
              xte = xte.loc[:,mask<.2]

  		Split the data into Training and Testing Data
              ##drop rows with missing data
    	Drop x training data that are NA
              xtr = xtr.dropna(axis=0)
   	 	 Drop x testing data that are NA
              xte = xte.dropna(axis=0)

              ##make random forests
              g=dict({'max_depth':4,'max_features':.333,'bootstrap':True})

   	Labs is created because xtr contains the predictions at the end of the list, but when recreating this, we suggest keeping the predictions in a separate variable to minimize confusion. Additionally, the country is in the 0th column and we need to omit this when making the predictions, hence why labs starts at the 1st column.
              labs = list(xtr.columns)[1:-5]

   	 Joint Random Forest
              for rftype in rftypes:
                  if rftype =='joint':

   		   	Load the data into a numpy data frame called W
                      W = np.load('data/W/W'+year+country+'seq.npy')

   			   Create a variable ytr with the y training data by taking the dot product of W and the transpose of xtr[outcomes] and the transpose of that
                      ytr = np.dot(W,xtr[outcomes].T).T

   			   Create a variable yte with the y testing data by taking the dot product of W and the transpose of xte[outcomes] and the transpose of that
                      yte = np.dot(W,xte[outcomes].T).T

   			   Create the random Forests
                      rf = RF(nt,max_depth=4,max_features = .333)
                      rf = rf.fit(xtr[labs],ytr)

   			   Set predtr(prediction training) as the prediction of xtr[labs](x testing random forest) from the given decision trees.
                      predtr =  rf.predict(xtr[labs])

   			Set predte(prediction testing) as the prediction of xte[labs](x training random forest) from the given decision trees.
                      predte = rf.predict(xte[labs])

  Sklearn does not support using the Random forest method on dependent data. Below, a whitening technique is used to make the data independent by orthogonalizing the data so the random forest technique can be used.

              Set Winv as the inverse of W
                      Winv = np.linalg.inv(W)

              Set predtr as the transpose of the dot product of Winv and predtr.T(transpose of predtr)
                      predtr = np.dot(Winv,predtr.T).T

              Set predte as the transpose of the dot product of Winv and predte.T(transpose of predte)
                      predte = np.dot(Winv,predte.T).T

              Append the important features from the decision trees to imps[j]. This variable is used to produce some figures but not used to make predictions.
                      imps[j].append(rf.feature_importances_)

   			      For loop of the outcomes
                      for k,outcome in enumerate(outcomes):

   			           Set the outcome + 'rf'+ joint = the predicted outcome found above in xtr
                          xtr[outcome+'rf' + rftype] = predtr[:,k]

   			     Set the outcome +'rf' + joint = the predicted outcome found above in xte
                          xte[outcome+'rf' + rftype] = predte[:,k]

         Independent Random Forests
                  else:

   			Set ytr as the training data with the specified outcomes from the top
   			 ytr = xtr[outcomes]
   			Set yte as the testing data with the specified outcomes from the top
                      yte = xte[outcomes]

   			For loop through the outcomes.
                      for outcome in outcomes:

                          rf = RF(nt,max_depth=4,max_features = .333)
                          rf = rf.fit(xtr[labs],ytr[outcome])

   				Set the x training data to the prediction from the random forests
                          xtr[outcome+'rf' + rftype] = rf.predict(xtr[labs])

   				Set the x testing data to the prediction from the random forests
                          xte[outcome+'rf' + rftype] = rf.predict(xte[labs])

   			Set W as the correlation of each pair of residuals from the training data’s transposed outcomes, then takes the singular value decomposition and multiply the observation of that matrix which will orthogonalize them and make them independent.
                      W = np.linalg.svd(np.corrcoef((np.array(xtr[outcomes])-np.array(xtr[[outcome + 'rfind' for outcome in outcomes]])).T))[2]

   			Save the data file using W set above as a Numpy object.
                      np.save('data/W/W'+year+country+'seq',W)

   		Set xtr(x training) with the outcomes of independent and joint random forest of the training data. The results are kept in xte, but you can change this variable so that the results are in a separate variable to minimize confusion.
              xtr = xtr[outcomes + [outcome + 'rfind' for outcome in outcomes] + [outcome + 'rfjoint' for outcome in outcomes]]

   		Set xte(x testing) with the outcomes of independent and joint random forest of the testing data. The results are kept in xte, but you can change this variable so that the results are in a separate variable to minimize confusion.
              xte = xte[outcomes + [outcome + 'rfind' for outcome in outcomes] + [outcome + 'rfjoint' for outcome in outcomes]]

   		Open the file with the specified year and convert training, testing results, and file as a byte stream.
              with open('results/'+year+country+'resultsseqrf.pkl','wb') as f:
                  pickle.dump([xtr,xte],f)

Python Packages Documentations

PDF file of Package Documentation Links

numpy
pandas
pickle
pbd
randomForestRegressor as RF from sklearn.ensemble
r2_score as r2 from sklearn.metrics
mean_squared_error as mse from sklearn.metrics