Forecasting Crop Yield Using Multi-Layered, Whole-Farm Data Sets and Machine Learning
P. Filippi, M. Fajardo, E. J. Jones, B. M. Whelan, T. F. Bishop
The University of Sydney, School of Life and Environmental Sciences, Sydney Institute of Agriculture, Sydney, New South Wales, Australia
The ultimate goal of Precision Agriculture is to improve decision making in the business of farming. Many broadacre farmers now have several years of crop yield data for their fields, often augmented with additional spatial data such as apparent soil electrical conductivity (ECa), soil gamma radiometrics, terrain attributes and soil sample information. In addition, there are now freely available public datasets, such as rainfall, digital soil maps and archives of satellite remote sensing, which can be used to interpret the crop-growing environment. Rather than analysing one field at a time, as is typical in precision agriculture research, there is an opportunity to explore the value of combining all this data for multiple fields, farms and years into one dataset. Using these datasets in conjunction with machine learning approaches offers the possibility of building predictive models of crop yield. In this study, several large farms in Western Australia were used as a case study, with yield monitor data from wheat, barley and canola crops from three sequential seasons, covering approximately 11,000 to 17,000 hectares in each year. The yield data was processed to a 10 m grid, and a space-time cube of predictor variables was built at this scale. The cube consisted of grower-collected data, such as ECa and gamma radiometrics surveys, and the freely available public data. The data was then aggregated to a 100 m spatial resolution for modelling yield. Random Forest models were used to predict the yield of wheat, barley and canola from this dataset. Three separate models were created, based on pre-sowing, mid-season and late-season conditions, to explore how the predictive ability of the model changed as more within-season information became available. These time points also coincide with points in the season when a management decision is made, such as the application of fertiliser.
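As an illustration of the aggregation step described above, the following is a minimal sketch of coarsening a 10 m grid to 100 m by averaging non-overlapping 10 x 10 blocks of cells. The abstract does not state which aggregation statistic was used, so the block mean, the treatment of missing cells and the function name `aggregate_grid` are assumptions made here for illustration only.

```python
import math


def aggregate_grid(grid, factor=10):
    """Average a 2-D grid (list of rows) over non-overlapping factor x factor blocks.

    Assumption: aggregation is a simple block mean. Cells holding NaN
    (e.g. no yield monitor reading) are skipped, and a block with no valid
    cells becomes NaN. Trailing rows or columns that do not fill a whole
    block are dropped.
    """
    n_block_rows = len(grid) // factor
    n_block_cols = len(grid[0]) // factor
    coarse = []
    for bi in range(n_block_rows):
        coarse_row = []
        for bj in range(n_block_cols):
            # Collect the valid fine-resolution values inside this block.
            values = [
                grid[i][j]
                for i in range(bi * factor, (bi + 1) * factor)
                for j in range(bj * factor, (bj + 1) * factor)
                if not math.isnan(grid[i][j])
            ]
            coarse_row.append(sum(values) / len(values) if values else math.nan)
        coarse.append(coarse_row)
    return coarse


# Hypothetical 200 m x 200 m patch on a 10 m grid (20 x 20 cells), in t/ha.
fine = [[2.0] * 20 for _ in range(20)]
fine[0][0] = math.nan  # one missing yield reading
coarse = aggregate_grid(fine, factor=10)  # 2 x 2 grid at 100 m
```

The same function could be applied to each predictor layer in turn so that yield and predictors share one 100 m grid before modelling.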
The models were evaluated with cross-validation using both fields and years for data splitting, and performance was assessed at the field spatial resolution. Cross-validated results showed the models predicted yield accurately, with a root mean square error (RMSE) of 0.36 to 0.42 t ha⁻¹ and a Lin’s concordance correlation coefficient (LCCC) of 0.89 to 0.92 at the field resolution. The models performed better as the season progressed, largely because more within-season information became available (e.g. rainfall, remote sensing). The yield forecasts were used to formulate basic nitrogen application scenarios. The more years of yield data that were available for a field, the better the predictions were, and future work should use a longer time-series of yield data. The generic nature of this method makes it applicable to other agricultural systems where yield monitor data is available.
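The evaluation described above can be sketched as follows, assuming grouped cross-validation in which all observations from one group (a field or a year) are held out together, and using the standard definitions of RMSE and Lin's concordance correlation coefficient with population (1/n) variances. The function names and the example values are hypothetical and are not taken from the study.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between paired observations and predictions."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def lccc(observed, predicted):
    """Lin's concordance correlation coefficient.

    LCCC = 2 * cov / (var_obs + var_pred + (mean_obs - mean_pred)^2),
    with variances and covariance computed using a 1/n divisor.
    """
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    var_o = sum((o - mean_o) ** 2 for o in observed) / n
    var_p = sum((p - mean_p) ** 2 for p in predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted)) / n
    return 2 * cov / (var_o + var_p + (mean_o - mean_p) ** 2)


def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each group.

    `groups` holds one label per observation, e.g. a field ID or a year,
    so that a whole field or season is withheld from model fitting.
    """
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


# Hypothetical field-level yields (t/ha) with a constant +1 prediction bias.
observed = [1.0, 2.0, 3.0]
predicted = [2.0, 3.0, 4.0]
error = rmse(observed, predicted)
agreement = lccc(observed, predicted)

# Hypothetical field labels for four observations.
splits = list(leave_one_group_out(["A", "A", "B", "C"]))
```

Unlike a plain correlation coefficient, the LCCC penalises systematic bias: in the example the predictions are perfectly correlated with the observations, yet the constant offset pulls the LCCC well below 1.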