The Problem:

An Indianapolis-based nonprofit that regulates student athletes receives student transcripts from high schools around the country. The transcripts contain the required information, but in thousands of different formats. As a result, employees had to enter the documents into a database manually, a time-consuming and error-prone process that cost the organization hundreds of man-hours per week. Other local companies had been tasked with automating the data entry but were ultimately unsuccessful.

The Goal:

Systematically identify courses and grades across academic years regardless of transcript format, and parse all transcripts into a standard data format that can eventually be entered into the client’s transcript system. This reduces the time and cost of manual entry.


How We Solved It:

  • EduSource used a text extraction library to pinpoint the x and y coordinates for course name, course ID, and course grade.
  • We then captured the coordinates of each data point within each document format.
  • We also pre-scrubbed and parsed the data before loading it into the database. Until now, those steps had to be done manually.
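
To make the coordinate-based approach above more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the production code: it assumes a text extraction step has already produced each word with its x and y position, and that one template per transcript format records the x-range of each column. The names Word, Template, group_rows, and extract_courses, along with the sample coordinates, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Word:
    """One extracted word and its position on the page (hypothetical shape)."""
    text: str
    x: float  # horizontal position of the word's left edge
    y: float  # vertical position of the word's line


@dataclass
class Template:
    """Hypothetical per-format template: the x-range of each column."""
    columns: Dict[str, Tuple[float, float]]
    y_tolerance: float = 2.0  # words this close vertically share a row


def group_rows(words: List[Word], y_tolerance: float) -> List[List[Word]]:
    """Group words into rows by comparing y to the first word of the last row."""
    rows: List[List[Word]] = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if rows and abs(rows[-1][0].y - word.y) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return rows


def extract_courses(words: List[Word], template: Template) -> List[Dict[str, str]]:
    """Assign each word in a row to a column by x position, then scrub it."""
    records = []
    for row in group_rows(words, template.y_tolerance):
        ordered = sorted(row, key=lambda w: w.x)
        record = {}
        for field, (x_min, x_max) in template.columns.items():
            parts = [w.text for w in ordered if x_min <= w.x < x_max]
            record[field] = " ".join(parts).strip()
        # Keep only rows that fill every column; this drops headers and footers.
        if all(record.values()):
            # Simple scrubbing step: normalize grade casing.
            record["course_grade"] = record["course_grade"].upper()
            records.append(record)
    return records


# Made-up sample data standing in for the output of a text extraction library.
words = [
    Word("ENG101", 50, 100), Word("English", 120, 100),
    Word("I", 170, 100.5), Word("a-", 400, 100),
    Word("MAT201", 50, 115), Word("Algebra", 120, 115),
    Word("II", 172, 115), Word("B+", 400, 115),
    Word("Page", 50, 700), Word("1", 80, 700),
]
template = Template(columns={
    "course_id": (40, 110),
    "course_name": (110, 380),
    "course_grade": (380, 450),
})
records = extract_courses(words, template)
```

In this sketch the "Page 1" footer is skipped because it fills only one column, and each remaining row becomes a record with the same three fields regardless of the source layout. Supporting another transcript format would mean adding another template rather than changing the extraction logic.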


  • EduSource demonstrated the ability to automate the process of identifying and capturing transcript data across thousands of formats. Phase 2 of this project will save the company hundreds of thousands of dollars annually.
  • We are also starting to apply machine learning trained on previous transcripts, so that when the algorithm recognizes a format it places the transcript into the correct template.
  • Future steps are to build an API to the database and to use Optical Character Recognition (OCR) to capture information from scanned transcripts and images.