CSCE Capstone
Student Site for Individual and Collaborative Activities
Team 7 – PDF Extraction and Cleanup
Team Members:
Project Summary:
PDF is one of the most widely used file formats in today’s academic world. PDF files are easy for humans to read, but machines can struggle to interpret their text. This project will use Natural Language Processing (NLP) and computer vision to restructure the extracted text into a more machine-readable format, making PDFs even easier to use.
Project Task List:
- Research NLP/Computer Vision: 1/13 – 1/17
- Integrate into Sorcero’s workflow: 1/21
- Research promising PDF extraction implementations: 1/20 – 2/1
- Set up & test Parsr: 2/3 – 2/15
- Familiarize with the Parsr codebase: 2/17 – 2/29
- Work on contribution to Parsr: 3/2 – 3/14
- Experiment with Python NLP: 3/16 – 3/28
- Continue experiments: 3/30 – 4/11
- Documentation write-up: 4/13 – 4/18
- Final Report & Presentation: 4/20 – 4/25
Documents:
Presentation Video (Must be logged into UARK account)