Story Squad: Treading on New Ground

Rose Wachira
4 min read · Jun 24, 2021

Learning at Lambda School has been a journey that was both challenging and useful. I learned a lot, and at the end of it I got to work in a team for the last unit, Labs. For 8 weeks we worked with the stakeholders of Story Squad, who want to create and fine-tune a Tesseract OCR model that is better at transcribing kids' handwriting than the Google Vision model they have been using.

Story Squad

Story Squad is the creation of Graig Peterson, a former 6th-grade teacher. His dream is to create opportunities for kids to be creative in writing and drawing and to reduce the time they spend on screens. How it works:

Kids read a chapter of a story each week

Kids write and upload their original handwritten stories and drawings

They then receive and give real feedback

Share points in a squad vs squad match

See if their team wins

The handwritten stories are then transcribed to text and screened for inappropriate language before being sent to a moderator. The kids are then paired up and matched against another team, forming squads that head into a game and compete by voting for the best submissions.

The work begins:

We worked in three teams, each composed of iOS, Web, Data Science, and a Technical Product Lead. For our Data Science team, the objective was to improve the text transcription done by the previous team, generate synthetic data, and fine-tune the Tesseract model. As this is an ongoing project, we had access to data and models produced by previous cohorts.

For my part, I worked on the generation of synthetic data with one other teammate. I hit a roadblock at the beginning, as I had difficulties scraping Google Images using Beautiful Soup.

I had to find another way to scrape the images. In my search, I came across ParseHub, a free web scraping tool that made extracting images easy and quick. For a step-by-step guide, follow this link: ParseHub web scraping guide.

Scraped kids handwriting images gif

The image URLs are stored in a CSV file. The images themselves can then be downloaded using the Tab Save extension for Chrome and saved as JPG files.
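For anyone who prefers a script over a browser extension, the download step could be sketched in Python roughly as follows. This is not what I used on the project, only a minimal illustration using the standard library. The column name `image_url` and the `handwriting_0001.jpg` file naming are assumptions for the example; the actual column name depends on how the ParseHub project is set up.

```python
import csv
import io
import urllib.request
from pathlib import Path


def read_image_urls(csv_text, column="image_url"):
    """Return the non-empty URLs found in one column of CSV text.

    The column name is hypothetical; use whatever the export contains.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row.get(column) or "").strip()
        for row in reader
        if (row.get(column) or "").strip()
    ]


def target_path(out_dir, index):
    """Build a numbered path such as out_dir/handwriting_0001.jpg."""
    return Path(out_dir) / f"handwriting_{index:04d}.jpg"


def download_images(urls, out_dir):
    """Download each URL to a numbered file and skip URLs that fail.

    Note: the bytes are saved as received. Naming the file .jpg does not
    convert an image that was served in another format.
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(urls, start=1):
        path = target_path(out_dir, index)
        try:
            urllib.request.urlretrieve(url, path)
        except (OSError, ValueError) as err:
            # OSError covers network errors; ValueError covers malformed URLs.
            print(f"skipped {url}: {err}")
            continue
        saved.append(path)
    return saved
```

Typical use would be to read the exported CSV with `Path("export.csv").read_text()`, pass the text to `read_image_urls`, and hand the result to `download_images` together with an output folder.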

Tab Save extension

Blockers

It was not easy finding images of kids' handwriting and it took longer than expected to get good quality images for generating the synthetic training data.

Configuring Tesseract:

A local copy of the Tesseract source code had to be present on the system. In addition, several packages needed to be installed for the configuration to succeed:

  1. libtool
  2. pkg-config
  3. build-essential
  4. automake
  5. libpango1.0-dev
  6. libicu-dev
  7. libcairo-dev
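As a rough sketch, the list above can be turned into a single install command. This assumes a Debian or Ubuntu system with `apt-get`, which the package names suggest but which is an assumption here. The package names are copied from the list as written and may be named differently on some releases.

```python
import shlex

# Package names copied verbatim from the list above.
TESSERACT_BUILD_PACKAGES = [
    "libtool",
    "pkg-config",
    "build-essential",
    "automake",
    "libpango1.0-dev",
    "libicu-dev",
    "libcairo-dev",
]


def build_install_command(packages, use_sudo=True):
    """Return the apt-get command (as an argument list) for the packages."""
    command = ["apt-get", "install", "-y", *packages]
    return ["sudo", *command] if use_sudo else command


if __name__ == "__main__":
    # Print the command so it can be reviewed before running it by hand.
    print(shlex.join(build_install_command(TESSERACT_BUILD_PACKAGES)))
```

Printing the command rather than executing it keeps the script safe to run; the output can be pasted into a terminal once it looks right.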

It took me a long time to get Tesseract running successfully on my computer, and even then I was still having difficulties processing the images.

Image Preprocessing

Image preprocessing is a long and tedious process that involves converting images to PNG, segmenting them, and removing noise so they can be used. Some of the techniques used include:

Binarization

Image rescaling to 300dpi

Noise Removal

Rotation/Deskewing
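To make two of these techniques concrete, here is a small pure-Python sketch of binarization (using Otsu's method to pick the threshold) and of the arithmetic behind rescaling to 300 DPI. It is an illustration only, not the code our team ran: the function names are made up for this example, the image is represented as a plain list of rows of 0 to 255 grayscale values, and in practice an image library would do this work. Noise removal and deskewing are not covered here.

```python
def otsu_threshold(pixels):
    """Return Otsu's threshold for a flat list of 0-255 grayscale values."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += histogram[level]
        sum_bg += level * histogram[level]
        if weight_bg == 0:
            continue  # no pixels at or below this level yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the background class
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance; the best threshold maximizes it.
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(image):
    """Map each pixel to 0 or 255 using the Otsu threshold of the image."""
    flat = [value for row in image for value in row]
    threshold = otsu_threshold(flat)
    return [[255 if value > threshold else 0 for value in row] for row in image]


def scaled_size(width, height, source_dpi, target_dpi=300):
    """Return the pixel size an image needs to reach target_dpi."""
    factor = target_dpi / source_dpi
    return round(width * factor), round(height * factor)


if __name__ == "__main__":
    sample = [[10, 12, 200], [11, 210, 205]]  # dark ink vs. bright paper
    print(binarize(sample))
    print(scaled_size(800, 600, source_dpi=150))
```

Otsu's method is used here because it chooses the threshold from the image itself, which helps when scans differ in brightness; a fixed threshold would be simpler but less forgiving.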

As a Data Science team, we were able to preprocess at least one zip folder of images provided by the stakeholders. The generation of synthetic data was our other accomplishment.

With more time, I would have been able to finish processing the images that I had. There are several steps involved, and I got as far as converting the images from JPG to PNG files. The remaining steps for future cohorts will be rescaling the synthetic data images and removing noise.

Working with all these new tools has made me realize that with the right mindset we can achieve a lot. I feel more confident pursuing a job as a Data Scientist/Machine Learning Engineer and, with more experience and teamwork, continuing to deliver better results.
