EduSource Specialty: Data Cleanup via Machine Learning
Our client is a nonprofit that has developed STEM curricula since 1997 for US schools, from pre-K through high school, across all 50 states. The nonprofit's data system held 5.5 million student records, including a very large number of duplicates. Many student records had also been lost or disconnected. At the time, the nonprofit had no efficient way to remove the duplicate records and reconnect the lost student data.
Our goal was two-fold:
- Find ways to connect disconnected student records by looking through old database backups.
- Train a machine-learning model on sample data so that it learns to recognize duplicates. Then run the model on the entire data set to identify duplicates in the system.
How We Solved It
The EduSource team was given data backups and reconnected the majority of the disconnected student records, using Python code to populate a new Postgres database. Then, using the Dedupe.io library, EduSource manually trained a machine-learning model to identify duplicates. The model learned which fields were important for identifying duplicates and applied that knowledge to the rest of the data. Half a million student records were identified as duplicates with 95% accuracy and were automatically merged. Finally, the EduSource team worked out a way to re-run this process on a regular basis so that new student records continue to be de-duplicated.
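To make the idea concrete, here is a minimal sketch of learning field importance from hand-labeled pairs and then flagging likely duplicates. It is not the Dedupe.io API and not the code used on the project: it uses only the Python standard library, and the field names, sample records, weighting heuristic, and 0.9 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

def fuzzy(a, b):
    """String similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def exact(a, b):
    """1.0 when the values are identical, otherwise 0.0."""
    return 1.0 if a == b else 0.0

# Hypothetical schema: one comparison function per field.
COMPARATORS = {"first_name": fuzzy, "last_name": fuzzy,
               "birth_date": exact, "school": fuzzy}
FIELDS = list(COMPARATORS)

def similarities(a, b):
    return [COMPARATORS[f](a[f], b[f]) for f in FIELDS]

def learn_weights(labeled_pairs):
    """Weight each field by how well it separates labeled duplicates
    from labeled non-duplicates (a simple stand-in for real training)."""
    dup = [similarities(a, b) for a, b, is_dup in labeled_pairs if is_dup]
    non = [similarities(a, b) for a, b, is_dup in labeled_pairs if not is_dup]
    gaps = []
    for i in range(len(FIELDS)):
        gap = sum(s[i] for s in dup) / len(dup) - sum(s[i] for s in non) / len(non)
        gaps.append(max(gap, 0.0))
    total = sum(gaps) or 1.0
    return [g / total for g in gaps]

def score(a, b, weights):
    return sum(w * s for w, s in zip(weights, similarities(a, b)))

def find_duplicates(records, weights, threshold=0.9):
    """Group record ids whose pairwise score reaches the threshold."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Compares every pair; fine for a demo, far too slow for millions of rows.
    for a, b in combinations(records, 2):
        if score(a, b, weights) >= threshold:
            parent[find(a["id"])] = find(b["id"])
    clusters = {}
    for r in records:
        clusters.setdefault(find(r["id"]), []).append(r["id"])
    return sorted(sorted(c) for c in clusters.values() if len(c) > 1)

def rec(rid, first, last, born, school):
    return {"id": rid, "first_name": first, "last_name": last,
            "birth_date": born, "school": school}

# Made-up student records, for illustration only.
RECORDS = [
    rec(1, "Maria", "Lopez", "2008-03-14", "Lincoln Elementary"),
    rec(2, "Maria", "Lopes", "2008-03-14", "Lincoln Elem"),
    rec(3, "Mario", "Lopez", "2010-07-02", "Lincoln Elementary"),
    rec(4, "Noah", "Patel", "2009-11-30", "Roosevelt Middle"),
]

# Hand-labeled training pairs: (record, record, is_duplicate).
LABELED_PAIRS = [
    (RECORDS[0], RECORDS[1], True),
    (RECORDS[0], RECORDS[2], False),
    (rec(5, "James", "Nguyen", "2009-11-30", "Roosevelt Middle"),
     rec(6, "Jim", "Nguyen", "2009-11-30", "Roosevelt Middle"), True),
    (rec(7, "Ava", "Smith", "2011-01-05", "Roosevelt Middle"),
     rec(8, "Noah", "Patel", "2008-03-14", "Lincoln Elementary"), False),
]

weights = learn_weights(LABELED_PAIRS)
print(dict(zip(FIELDS, weights)))
print(find_duplicates(RECORDS, weights))
```

In this toy data the birth date ends up with the largest weight, because it matches in every labeled duplicate pair and in none of the non-duplicates. A production run over millions of records would also need a way to avoid comparing every pair of records and a review step before merging, both of which this sketch leaves out.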
The EduSource team can craft all kinds of custom software, but one of our sweet spots is big data cleanup and data analytics.
If you aren’t getting the information you need out of your data, email Brett Ridoux at firstname.lastname@example.org for a free consultation.