
Data Cleaning

Detailed supply chain data is increasingly important for deforestation compliance. In industries like palm oil, this kind of geographic information has been widely accessible for several years for large industrial plantations. However, other commodities are only now gathering plot data on a large scale, due to the remote and decentralized nature of the data. We see this mostly in the cocoa and coffee industries, along with an increased mapping effort for oil palm smallholders / plasma. This document outlines some basic steps we recommend taking before providing Satelligence with your supply chain data.

 

Visual check for obvious outliers: polygons gathered in the field often contain one or more coordinates that deviate from the other points due to lower GPS accuracy.

Inaccurate coordinates from field visits often lead to these obvious errors. Visualizing only the outer boundaries of a polygon will make these stand out.
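A visual check can be backed up with a simple screening script. The sketch below is a rough heuristic, not a substitute for looking at the map: it flags vertices that lie much further from the polygon's median centre than is typical for that polygon. The function name and the `factor` threshold are assumptions chosen for illustration, and long, narrow plots may need a different threshold.

```python
from statistics import median

def flag_outlier_vertices(ring, factor=5.0):
    """Return indices of vertices unusually far from the ring's median centre.

    `ring` is a list of (x, y) tuples; `factor` is an illustrative threshold.
    """
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    # The median centre is barely moved by a single bad GPS fix.
    cx, cy = median(xs), median(ys)
    dists = [((x - cx) ** 2 + (y - cy) ** 2) ** 0.5 for x, y in ring]
    typical = median(dists)
    if typical == 0:
        return []  # degenerate ring, nothing sensible to compare against
    return [i for i, d in enumerate(dists) if d > factor * typical]

# Hypothetical farm boundary whose last vertex is a stray GPS fix.
farm = [(0, 0), (0, 10), (10, 10), (10, 0), (500, 400)]
suspects = flag_outlier_vertices(farm)
print(suspects)
```

Flagged vertices are candidates for manual review; whether to delete or re-survey them remains a judgment call.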

 

If you are using QGIS, you can verify the validity of the polygons with the Check Validity tool. If invalid geometries are found, the tool produces a separate output dataset showing the farms that require fixing.
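To illustrate what a validity check looks for, here is a minimal sketch of one common cause of invalid geometries: a boundary that crosses itself. It is not how QGIS implements Check Validity, and it covers only edge crossings within a single ring (no holes, no duplicate-vertex checks). The function names are assumptions for this example.

```python
def _orient(a, b, c):
    """Signed area test: > 0 if c is left of the line a->b, < 0 if right."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def _segments_cross(p1, p2, p3, p4):
    """True if segments p1-p2 and p3-p4 cross at an interior point."""
    d1 = _orient(p3, p4, p1)
    d2 = _orient(p3, p4, p2)
    d3 = _orient(p1, p2, p3)
    d4 = _orient(p1, p2, p4)
    return d1 * d2 < 0 and d3 * d4 < 0

def has_self_intersection(ring):
    """Check every pair of non-adjacent edges of a ring for a crossing."""
    pts = ring[:-1] if len(ring) > 1 and ring[0] == ring[-1] else ring
    n = len(pts)
    edges = [(pts[i], pts[(i + 1) % n]) for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if j == i + 1 or (i == 0 and j == n - 1):
                continue  # adjacent edges share a vertex by design
            if _segments_cross(*edges[i], *edges[j]):
                return True
    return False

square = [(0, 0), (2, 0), (2, 2), (0, 2), (0, 0)]
bow_tie = [(0, 0), (2, 2), (2, 0), (0, 2), (0, 0)]  # vertices recorded out of order
print(has_self_intersection(square), has_self_intersection(bow_tie))
```

The pairwise comparison is quadratic in the number of vertices, which is acceptable for typical farm polygons but slow for very detailed boundaries.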

 

Repair invalid geometries: this is a standard tool available in most modern GIS software (ArcGIS: Repair Geometry, QGIS: Fix geometries).

 

Note: This only fixes self-intersections of polygons. It does not tidy up the plot boundary or remove the outliers described in the visual check above. If many polygons contain incorrect holes or other errors, you can choose to apply a “convex hull”, which returns only the outer boundary of each polygon (the smallest convex shape that encloses all of its points).

The “Fix geometries” tool in QGIS makes self-intersecting polygons like these usable for other GIS operations. It does not change any of the border points to make the polygon visually more consistent. Alternatively, applying a buffer of 0 should also resolve these self-intersections.
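As an illustration of the convex hull option mentioned above, the following sketch computes a hull from a polygon's vertices using Andrew's monotone chain algorithm. This is a standard textbook method, not necessarily what your GIS software uses internally, and the function name is an assumption for this example.

```python
def convex_hull(points):
    """Return the convex hull of (x, y) points in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # > 0 when the turn o -> a -> b is counter-clockwise
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]

# A self-intersecting "bow tie" with one stray interior vertex.
messy_ring = [(0, 0), (2, 2), (2, 0), (1, 1), (0, 2)]
hull = convex_hull(messy_ring)
print(hull)
```

Keep in mind that a hull of a concave plot covers more area than the plot itself, so it is best reserved for polygons that cannot be repaired otherwise.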

 

Remove polygons that are in the wrong place: some points or polygons might end up in a different country or continent, offshore, or otherwise far from the regular supply chain entities.

Somehow these Ugandan coffee farms ended up in a city in eastern India. “Zoom to layer” makes these cases very clear: the map zooms to a much larger extent than you expect.
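The same check can be scripted by comparing every vertex against the region you expect to source from. In this sketch the farm IDs and coordinates are made up, and the Uganda bounding box is a rough approximation used only for illustration; a real check would use your own sourcing region.

```python
def farms_outside_bbox(farms, bbox):
    """Return IDs of farms with at least one vertex outside the bounding box.

    `farms` maps a farm ID to a list of (lon, lat) tuples;
    `bbox` is (min_lon, min_lat, max_lon, max_lat) in decimal degrees.
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    flagged = []
    for farm_id, ring in farms.items():
        inside = all(
            min_lon <= lon <= max_lon and min_lat <= lat <= max_lat
            for lon, lat in ring
        )
        if not inside:
            flagged.append(farm_id)
    return flagged

# Approximate, illustrative bounding box for Uganda (assumption, not authoritative).
UGANDA_BBOX = (29.5, -1.5, 35.1, 4.3)

farms = {
    "UG-001": [(32.58, 0.35), (32.59, 0.35), (32.59, 0.36)],
    "UG-002": [(85.82, 20.29), (85.83, 20.29), (85.83, 20.30)],  # lands in eastern India
}
misplaced = farms_outside_bbox(farms, UGANDA_BBOX)
print(misplaced)
```

A bounding box is coarse: it will not catch farms that sit just across a border or offshore within the box, so a visual check is still worthwhile.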

 

Remove null/empty geometries: sometimes field data contain all the necessary information (e.g. farm ID, supplier) but lack geographic information. These records are unusable for any GIS analysis and need to be removed. Note: these empty rows might still contain valid and valuable information like farm IDs and cooperative names, so it’s best to keep a good record of which farms are removed, so these can be updated in the future or forwarded to the original supplier of the data.
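One way to keep that record is to split the dataset in two rather than deleting rows. The sketch below assumes each record is a dictionary with a hypothetical "geometry" key; the other field names are illustrative as well.

```python
def split_missing_geometry(records):
    """Separate records with a geometry from those without one.

    A geometry of None, an empty string or an empty list counts as missing.
    """
    kept, removed = [], []
    for record in records:
        if record.get("geometry"):
            kept.append(record)
        else:
            removed.append(record)
    return kept, removed

# Hypothetical field records.
records = [
    {"farm_id": "F001", "cooperative": "Coop A", "geometry": [(32.58, 0.35), (32.59, 0.35), (32.59, 0.36)]},
    {"farm_id": "F002", "cooperative": "Coop A", "geometry": None},
    {"farm_id": "F003", "cooperative": "Coop B", "geometry": []},
]
kept, removed = split_missing_geometry(records)
removed_ids = [r["farm_id"] for r in removed]
print(removed_ids)
```

The `removed` list keeps farm IDs and cooperative names intact, so it can be sent back to the original supplier of the data for completion.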

 

Remove duplicates: GIS software offers ways to remove geometries that are exact copies of others in the same dataset, which is a recommended way to easily remove unwanted overlap. However, in many cases overlapping farms do not match exactly, so these automated tools miss them. To check visually for overlap between farms, apply a semi-transparent style to the dataset in your GIS software. This makes it very clear where farms overlap, since the overlapping areas appear darker.

 

 

50% transparency of a cocoa farm dataset shows where polygons overlap
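Both situations can be screened for in a script. The sketch below is a simplified approach under stated assumptions: it treats two polygons as duplicates when they share the same set of vertices after rounding, and it lists pairs whose bounding boxes overlap as candidates for a closer look. Overlapping bounding boxes do not prove that the farms themselves overlap. All names and the rounding precision are illustrative.

```python
from itertools import combinations

def ring_key(ring, decimals=5):
    """Vertex set after rounding; ignores start vertex and drawing direction."""
    return frozenset((round(x, decimals), round(y, decimals)) for x, y in ring)

def find_duplicates(farms, decimals=5):
    """Return (first_id, duplicate_id) pairs for rings with identical vertex sets."""
    seen, duplicates = {}, []
    for farm_id, ring in farms.items():
        key = ring_key(ring, decimals)
        if key in seen:
            duplicates.append((seen[key], farm_id))
        else:
            seen[key] = farm_id
    return duplicates

def bbox(ring):
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return min(xs), min(ys), max(xs), max(ys)

def bbox_overlap_pairs(farms):
    """Return ID pairs whose bounding boxes share some area (candidates only)."""
    boxes = {farm_id: bbox(ring) for farm_id, ring in farms.items()}
    pairs = []
    for a, b in combinations(boxes, 2):
        ax0, ay0, ax1, ay1 = boxes[a]
        bx0, by0, bx1, by1 = boxes[b]
        if ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1:
            pairs.append((a, b))
    return pairs

# Hypothetical farms: B repeats A from a different start vertex, C partly overlaps A.
farms = {
    "A": [(0.0, 0.0), (0.0, 0.001), (0.001, 0.001), (0.001, 0.0), (0.0, 0.0)],
    "B": [(0.001, 0.001), (0.001, 0.0), (0.0, 0.0), (0.0, 0.001), (0.001, 0.001)],
    "C": [(0.0005, 0.0005), (0.0005, 0.0015), (0.0015, 0.0015), (0.0015, 0.0005), (0.0005, 0.0005)],
    "D": [(1.0, 1.0), (1.0, 1.001), (1.001, 1.001), (1.001, 1.0), (1.0, 1.0)],
}
duplicates = find_duplicates(farms)
candidates = bbox_overlap_pairs(farms)
print(duplicates, candidates)
```

Candidate pairs still need inspection in GIS software, for example with the semi-transparent styling described above.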

 

If polygon data is not available, the next best thing is to have GPS locations of your supply chain instead. While this gives you a lot of information about the regions you are sourcing from, it does not delineate the specific area used for cultivating the crop. These data are often only available in Excel files (.xlsx), which makes it harder to visualize and check the validity of the locations. To visualize them, save the Excel file as a CSV file with “longitude” and “latitude” columns containing the coordinates. This format can be visualized in GIS software.

In QGIS you can visualize CSV files as geographic points with “Add Delimited Text Layer”. This allows you to perform the same visual quality checks as described for the polygons.
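Before loading the CSV into GIS software, a short script can catch rows with missing or impossible coordinates. This sketch uses the “longitude” and “latitude” column names described above; the “farm_id” column and the sample rows are assumptions for illustration.

```python
import csv
import io

def read_points(csv_file):
    """Split CSV rows into valid points and rejected rows.

    Returns (valid, rejected): valid holds (farm_id, lon, lat) tuples,
    rejected holds (line_number, row) for rows that need attention.
    """
    valid, rejected = [], []
    # Line 1 is the header, so data rows start at line 2.
    for line_number, row in enumerate(csv.DictReader(csv_file), start=2):
        try:
            lon = float(row["longitude"])
            lat = float(row["latitude"])
        except (KeyError, TypeError, ValueError):
            rejected.append((line_number, row))
            continue
        if -180 <= lon <= 180 and -90 <= lat <= 90:
            valid.append((row.get("farm_id"), lon, lat))
        else:
            rejected.append((line_number, row))
    return valid, rejected

# In practice, pass open("farms.csv", newline="") instead of this sample text.
sample = io.StringIO(
    "farm_id,longitude,latitude\n"
    "F001,32.58,0.35\n"
    "F002,,0.40\n"
    "F003,320.5,0.31\n"
)
valid, rejected = read_points(sample)
rejected_lines = [line for line, _ in rejected]
print(valid, rejected_lines)
```

The range check only confirms that values are plausible decimal degrees; it cannot tell whether latitude and longitude were swapped or whether a point is in the right country.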

 

Make sure that all coordinates are in the same coordinate reference system.

Do not use coordinates in degrees, minutes and seconds (38°53′23″N, 77°00′32″W); this notation is very error-prone when the dataset is edited manually.
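If your data are only available in degrees, minutes and seconds, they can be converted with decimal degrees = degrees + minutes/60 + seconds/3600, negated for the southern and western hemispheres. The sketch below assumes the exact notation shown above (degree sign, prime and double-prime marks, or plain quote characters, followed by a hemisphere letter); other notations would need a different pattern.

```python
import re

# Matches e.g. 38°53′23″N; accepts ′ or ' for minutes and ″ or " for seconds.
_DMS = re.compile(r"""(\d+)\s*°\s*(\d+)\s*[′']\s*(\d+(?:\.\d+)?)\s*[″"]\s*([NSEW])""")

def dms_to_decimal(text):
    """Convert one degrees-minutes-seconds coordinate to decimal degrees."""
    match = _DMS.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"Unrecognised DMS coordinate: {text!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    # South and west are negative in decimal degrees.
    return -value if hemisphere in "SW" else value

latitude = dms_to_decimal("38°53′23″N")
longitude = dms_to_decimal("77°00′32″W")
print(round(latitude, 6), round(longitude, 6))
```

Raising an error on unrecognised input is deliberate: silently guessing at a malformed coordinate is how points end up on the wrong continent.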

 

Instead, use geographic coordinates in latitude and longitude in decimal degrees (e.g. 51.702, 5.545). The most commonly used coordinate system is WGS84 (EPSG:4326). If a dataset has no coordinate system defined, you can use “Assign projection” to make sure the polygons/points are displayed in the right place. If the data are in a local projected coordinate system, use “Reprojection” to translate the coordinates from the non-standard system to a more common one.
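To show what reprojection does, the sketch below converts coordinates between WGS84 (EPSG:4326) and one specific projected system, Web Mercator (EPSG:3857), using the standard spherical Mercator formulas. Web Mercator is chosen here only as an example; a local projected system needs its own transformation, which is best left to GIS software or a dedicated projection library. The helper `looks_like_degrees` is a hypothetical quick test for spotting projected coordinates.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by Web Mercator (EPSG:3857)

def looks_like_degrees(x, y):
    """Rough test: values outside these ranges cannot be lon/lat in degrees."""
    return -180 <= x <= 180 and -90 <= y <= 90

def wgs84_to_web_mercator(lon, lat):
    """Project lon/lat in decimal degrees to Web Mercator metres."""
    x = EARTH_RADIUS_M * math.radians(lon)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y

def web_mercator_to_wgs84(x, y):
    """Convert Web Mercator metres back to lon/lat in decimal degrees."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(2 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2)
    return lon, lat

# The example point from the text, written as (lon, lat) rather than (lat, lon).
x, y = wgs84_to_web_mercator(5.545, 51.702)
lon, lat = web_mercator_to_wgs84(x, y)
print(looks_like_degrees(x, y), round(lon, 6), round(lat, 6))
```

Note the axis order: the text lists latitude first (51.702, 5.545), while these functions take longitude first, which is a common source of swapped coordinates.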