Generating accurate and valid scientific results

Machine learning (ML) allows software applications to become more accurate in predicting outcomes as they are trained on more data. ML involves building algorithms that can predict an output value within an acceptable margin of error.

CSISA generates numerous scientific datasets on crop production practices and agronomic field trials, but generating frequent and valid results from these thousands of observations is a challenge. ML tools can help.

CSISA organized a five-day workshop in Odisha to train CSISA scientists from Bihar, eastern Uttar Pradesh (EUP) and Odisha in the use of ML tools, based on the open-source statistical computing and graphics software R, to analyze CSISA’s crop cut and production practice survey datasets.

Each year, CSISA generates data from multi-location adaptive trials, production practice diagnostic surveys and a few other targeted, needs-based surveys in Bangladesh, India and Nepal. These datasets are used to determine the most important yield-attributing factor(s), information that could help policymakers target and refine recommendations and advisories. ML allows us to draw quick, accurate and valid results from these datasets.

Under the leadership of CSISA-Nepal’s Socioeconomist, Gokul Paudel, participants jointly reviewed production practice survey datasets, cleaned the data, applied relevant analytical tools and generated results.

The group started by reviewing basic statistics and the R software, the rationale behind ML, and algorithms such as classification and regression tree (CART) and random forest models. Using R, participants checked summary statistics and visualized the data in histograms, boxplots, scatter plots and correlation plots. With CART, the participants produced graphical results by classifying covariates hierarchically according to their possible predictive roles in a particular outcome. CART showed that sowing date is the most important factor in determining wheat yield in Bihar and EUP, followed by crop establishment method, amount of nitrogen applied and number of irrigations.
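To illustrate how a regression tree decides which covariate matters most, here is a minimal sketch of the split search behind CART: for each covariate, it tries every threshold and keeps the one that most reduces the sum of squared errors (SSE) in yield. The workshop used R, and the article does not say which packages were involved; this sketch uses Python's standard library only so that it runs anywhere. The records, the column names (`sowing_day`, `irrigations`, `yield_t_ha`) and the helper functions are all invented for illustration and are not CSISA data or code.

```python
from statistics import mean

def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(rows, target, covariate):
    """Return (threshold, sse_reduction) for the best binary split on one covariate."""
    parent = sse([r[target] for r in rows])
    best = (None, 0.0)
    levels = sorted({r[covariate] for r in rows})
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(levels, levels[1:]):
        t = (lo + hi) / 2
        left = [r[target] for r in rows if r[covariate] <= t]
        right = [r[target] for r in rows if r[covariate] > t]
        gain = parent - sse(left) - sse(right)
        if gain > best[1]:
            best = (t, gain)
    return best

def rank_covariates(rows, target, covariates):
    """Order covariates by the SSE reduction of their best single split."""
    results = {c: best_split(rows, target, c) for c in covariates}
    return sorted(results.items(), key=lambda kv: kv[1][1], reverse=True)

# Invented toy records (NOT CSISA data). sowing_day is a hypothetical
# "days after 1 November"; yields are made up so that late sowing is worse.
rows = [
    {"sowing_day": 5,  "irrigations": 2, "yield_t_ha": 4.2},
    {"sowing_day": 10, "irrigations": 3, "yield_t_ha": 4.4},
    {"sowing_day": 15, "irrigations": 2, "yield_t_ha": 4.1},
    {"sowing_day": 20, "irrigations": 3, "yield_t_ha": 4.3},
    {"sowing_day": 30, "irrigations": 2, "yield_t_ha": 3.0},
    {"sowing_day": 35, "irrigations": 3, "yield_t_ha": 3.2},
    {"sowing_day": 40, "irrigations": 2, "yield_t_ha": 2.9},
    {"sowing_day": 45, "irrigations": 3, "yield_t_ha": 3.1},
]

ranking = rank_covariates(rows, "yield_t_ha", ["sowing_day", "irrigations"])
for name, (threshold, gain) in ranking:
    print(f"{name}: split at {threshold}, SSE reduction {gain:.2f}")
```

The covariate at the top of the ranking becomes the first split of the tree. A full CART implementation repeats the same search inside each resulting group, which is what produces the tree diagrams the participants worked with. In the toy data, `sowing_day` ranks first only because the numbers were constructed that way.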

Participants also used the random forest model, which is more robust in terms of training and validation performance because it builds many decision trees, each on a different random sample of observations and covariates, and combines their predictions. The random forest results likewise identified sowing date as the most important factor and matched the CART results for the other covariates determining wheat yield.
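The idea behind a random forest can be sketched in a few lines: draw a bootstrap sample of the records, let each tree choose its split from a random subset of covariates, then average the trees' predictions and add up how much each covariate reduced the error. The sketch below is a deliberately tiny version in which every tree has a single split; it is written in standard-library Python rather than the R used at the workshop, and it is not the workshop's code. The data, column names, function names and parameter values (`n_trees`, `mtry`, `seed`) are assumptions made for illustration.

```python
import random
from statistics import mean

def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def fit_stump(rows, target, candidates):
    """Fit a one-split regression tree using the best of the candidate covariates."""
    ys = [r[target] for r in rows]
    parent = sse(ys)
    # Fallback when no split helps: predict the sample mean everywhere.
    stump = {"covariate": None, "threshold": None, "gain": 0.0,
             "left": mean(ys), "right": mean(ys)}
    for cov in candidates:
        levels = sorted({r[cov] for r in rows})
        for lo, hi in zip(levels, levels[1:]):
            t = (lo + hi) / 2
            left = [r[target] for r in rows if r[cov] <= t]
            right = [r[target] for r in rows if r[cov] > t]
            gain = parent - sse(left) - sse(right)
            if gain > stump["gain"]:
                stump = {"covariate": cov, "threshold": t, "gain": gain,
                         "left": mean(left), "right": mean(right)}
    return stump

def fit_forest(rows, target, covariates, n_trees=200, mtry=2, seed=1):
    """Grow n_trees stumps, each on a bootstrap sample and a random covariate subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]    # sample with replacement
        candidates = rng.sample(covariates, mtry)    # random covariate subset
        forest.append(fit_stump(sample, target, candidates))
    return forest

def predict(forest, row):
    """Average the predictions of all trees for one record."""
    preds = []
    for s in forest:
        if s["covariate"] is None or row[s["covariate"]] <= s["threshold"]:
            preds.append(s["left"])
        else:
            preds.append(s["right"])
    return mean(preds)

def importance(forest, covariates):
    """Total SSE reduction credited to each covariate across the forest."""
    totals = {c: 0.0 for c in covariates}
    for s in forest:
        if s["covariate"] is not None:
            totals[s["covariate"]] += s["gain"]
    return totals

# Invented toy records (NOT CSISA data), constructed so late sowing lowers yield.
covariates = ["sowing_day", "irrigations", "nitrogen_kg_ha"]
rows = [
    {"sowing_day": 5,  "irrigations": 2, "nitrogen_kg_ha": 100, "yield_t_ha": 4.2},
    {"sowing_day": 10, "irrigations": 3, "nitrogen_kg_ha": 120, "yield_t_ha": 4.4},
    {"sowing_day": 15, "irrigations": 2, "nitrogen_kg_ha": 100, "yield_t_ha": 4.1},
    {"sowing_day": 20, "irrigations": 3, "nitrogen_kg_ha": 120, "yield_t_ha": 4.3},
    {"sowing_day": 30, "irrigations": 2, "nitrogen_kg_ha": 120, "yield_t_ha": 3.0},
    {"sowing_day": 35, "irrigations": 3, "nitrogen_kg_ha": 100, "yield_t_ha": 3.2},
    {"sowing_day": 40, "irrigations": 2, "nitrogen_kg_ha": 120, "yield_t_ha": 2.9},
    {"sowing_day": 45, "irrigations": 3, "nitrogen_kg_ha": 100, "yield_t_ha": 3.1},
]

forest = fit_forest(rows, "yield_t_ha", covariates)
scores = importance(forest, covariates)
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.2f}")
```

Because each tree sees a different sample and a different set of candidate covariates, no single unusual record or covariate can dominate the result, which is the source of the robustness described above. Real random forests grow much deeper trees and usually validate on held-out records; in R, packages such as `rpart` and `randomForest` are commonly used for this kind of analysis.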

These ML results provide further evidence of the role of sowing date in wheat yield in Bihar and EUP, which CSISA has also documented previously.

This team of CSISA scientists successfully analyzed and visualized data with modern statistical tools. They gained the confidence to consistently undertake robust diagnostic surveys and collaborative research trials, generate location-specific insights, discuss these insights with partners and inform decision makers at relevant levels. All publications, along with full datasets, will be made available to the public through open-source channels.