Team 7 – PDF Extraction and Cleanup

CSCE Capstone

Student Site for Individual and Collaborative Activities

Team Members:

Project Summary:

PDF is one of the most widely used file formats in today’s academic world. PDF files are easy for humans to read, but machines can struggle to interpret their text. This project will use Natural Language Processing (NLP) and computer vision to restructure the text into a more machine-readable format, making PDFs even easier to use.

Project Task List:

Research NLP/Computer Vision: 1/13 – 1/17
Integrate into Sorcero’s workflow: 1/21
Research promising PDF extraction implementations: 1/20 – 2/1
Set up & test Parsr: 2/3 – 2/15
Familiarize with Parsr codebase: 2/17 – 2/29
Work on contribution to Parsr: 3/2 – 3/14
Experiment with Python NLP: 3/16 – 3/28
Continue experiments: 3/30 – 4/11
Documentation write-up: 4/13 – 4/18
Final Report & Presentation: 4/20 – 4/25

Documents:

Final Proposal

Final Report

Presentation Video (Must be logged into UARK account)

Poster
