CSCE Capstone

Student Site for Individual and Collaborative Activities

Team 7 – PDF Extraction and Cleanup

Team Members:

Sarah Bondurant

Nathan Davis

Richard Mays

Keegan Riley

Hayden Willeford

Project Summary:

PDF is one of the most widely used file formats in academia. PDFs are easy for humans to read, but machines often struggle to interpret their text. This project will use Natural Language Processing (NLP) and computer vision to restructure extracted PDF text into a more machine-readable format, making PDFs easier to use.
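To illustrate the kind of problem involved, here is a minimal sketch of one cleanup step applied to text already extracted from a PDF. It is not the project's method and does not use Parsr, NLP, or computer vision; the function name clean_extracted_text and the sample string are hypothetical, and the approach (plain regular expressions) is an assumption chosen only to show why raw extracted text is awkward for machines.

```python
import re


def clean_extracted_text(raw: str) -> str:
    """Rejoin words split across line breaks and unwrap hard-wrapped lines.

    Hypothetical helper for illustration only.
    """
    # Join a word hyphenated at a line end: "read-\nable" -> "readable".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Blank lines mark paragraph breaks; single newlines are layout wrapping.
    paragraphs = re.split(r"\n\s*\n", text)
    # Collapse all whitespace inside each paragraph to single spaces.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())


# Made-up sample resembling text pulled from a PDF column.
sample = (
    "PDF files are easily read-\n"
    "able to humans, but machines\n"
    "can struggle.\n"
    "\n"
    "Next paragraph."
)
print(clean_extracted_text(sample))
```

A rule this simple also merges genuine compounds that happen to break at a line end (for example "well-" followed by "known"), which is one reason language-aware techniques are attractive for this task.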

Project Task List:
  1. Research NLP/Computer Vision: 1/13 – 1/17
  2. Integrate into Sorcero’s workflow: 1/21
  3. Research promising PDF extraction implementations: 1/20 – 2/1
  4. Set up & test Parsr: 2/3 – 2/15
  5. Familiarize with the Parsr codebase: 2/17 – 2/29
  6. Work on contribution to Parsr: 3/2 – 3/14
  7. Experiment with Python NLP: 3/16 – 3/28
  8. Continue experiments: 3/30 – 4/11
  9. Documentation write-up: 4/13 – 4/18
  10. Final Report & Presentation: 4/20 – 4/25
Documents:

Final Proposal 

Final Report

Presentation Video (Must be logged into UARK account)

Poster