Wednesday, February 17, 2010

Week 7 Update

Current progress:

  • Image loading and feature extraction are working. I am computing the mean and standard deviation of grayscale brightness over adjacent rectangular regions; these values are my features for training.
  • I have processed the training data available to me and am currently working with roughly 12,000 negative samples (no text in the image, or text not centered) and 6,000 positive samples (text completely fills the window).
  • I'm constructing one large training matrix, with each row labeled 0 or 1, and passing it into OpenCV's SVM class.
  • I am having issues with training -- I believe the SVM is overfitting the data. I'm reading more about training SVMs to see whether I can prepare my training data better or choose better initial parameter values. Right now my results are terrible: I get either all false positives or all false negatives.
  • In addition to the training code, I have written the testing code, which takes an arbitrary image and passes a 48x24 window over it at various resolutions to detect text.
  • Essentially the only thing preventing a demo at this point is that my classifier is not working!
To do:

  • Incorporate gradients into the feature analysis. The trouble with mean and standard deviation is that they ignore spatial relationships within the regions being summed: I could randomly permute all of the pixels within each region and still get exactly the same feature values. Gradient magnitude and angle would give a measure of the texture of a region rather than just its average brightness, and should be simple to add to my existing pipeline.
  • Clean up the code a little. Everything has been written quickly in prototype mode, but the code is getting to the point (375+ lines) where I need to organize it better to keep working.
  • Once my detector is reasonably accurate, perform OCR on the results. I've looked into Tesseract as an OCR engine I can add at the end of my pipeline. I'm not going to optimize those results at this stage; I'm going to hope it will "just work."
  • Lastly, I need to work Google Street View data into the pipeline and find a way to evaluate the results.
Schedule:

  • Week 7/8: Debug the classifier, add gradient angle/magnitude features, clean up code
  • Week 8/9: Incorporate Tesseract & GSV data into pipeline
  • Week 9/10: Freeze any implementations of new features, focus on getting optimal results from what I have, and work on final report
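To make the gradient idea above concrete, here is a minimal sketch of per-pixel gradient magnitude and angle via central differences. This is plain C++ with placeholder names, not my actual pipeline code:

```cpp
#include <cmath>
#include <vector>

// Per-pixel gradient magnitude and angle via central differences on a
// grayscale image stored row-major as floats. Border pixels are left at
// zero for simplicity. Names here are illustrative, not my pipeline's.
struct Gradients {
    std::vector<float> magnitude;
    std::vector<float> angle;  // radians, in (-pi, pi]
};

Gradients computeGradients(const std::vector<float>& img, int w, int h) {
    Gradients g;
    g.magnitude.assign(w * h, 0.0f);
    g.angle.assign(w * h, 0.0f);
    for (int y = 1; y < h - 1; ++y) {
        for (int x = 1; x < w - 1; ++x) {
            // Central differences in x and y
            float dx = (img[y * w + (x + 1)] - img[y * w + (x - 1)]) * 0.5f;
            float dy = (img[(y + 1) * w + x] - img[(y - 1) * w + x]) * 0.5f;
            g.magnitude[y * w + x] = std::sqrt(dx * dx + dy * dy);
            g.angle[y * w + x] = std::atan2(dy, dx);
        }
    }
    return g;
}
```

The same mean/std box features I already compute over brightness could then be computed over these two channels as well.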
CONCEPT: (excuse my crude drawings)

Feature extraction:

Sliding window to detect text (with background scaled)
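A rough sketch of how the multi-resolution sliding window can be enumerated. Rather than resizing the image, this version grows the window and reports its position in original-image coordinates; the stride and scale step here are placeholder values, not my actual settings:

```cpp
#include <vector>

// Enumerate 48x24 detection windows at multiple scales over an imgW x imgH
// image. Each returned rectangle would be cropped, resized back down to
// 48x24, and handed to the classifier. Parameters are illustrative.
struct Window { int x, y, w, h; };

std::vector<Window> slidingWindows(int imgW, int imgH,
                                   int winW = 48, int winH = 24,
                                   int stride = 8, double scaleStep = 1.5) {
    std::vector<Window> out;
    for (double s = 1.0; winW * s <= imgW && winH * s <= imgH; s *= scaleStep) {
        int w = (int)(winW * s);
        int h = (int)(winH * s);
        int step = (int)(stride * s);  // stride grows with the window
        for (int y = 0; y + h <= imgH; y += step)
            for (int x = 0; x + w <= imgW; x += step)
                out.push_back({x, y, w, h});
    }
    return out;
}
```

Scaling the window instead of the image is equivalent to scanning a fixed window over a downsampled background, which is what the drawing above is meant to show.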

Wednesday, January 27, 2010

Week 4 Update

Change in Course

I was originally going to do this project in Matlab, using a multilayer perceptron as my classifier. However, I've since decided to work in C++/OpenCV and use an SVM.

Reason for C++:
I've been working with Kai Wang, a graduate student in the CSE department. He is also working on text detection, and his project is implemented in C++ with OpenCV. I figured it would be much easier to draw on his help if the two projects were on the same platform.

Reason for SVM:
In my experience, ANNs can be very effective classifiers but can take a very long time to train properly. Since this project will only be active for 6 more weeks, I think that an SVM is a safer bet.


First feature set I will be trying: Image integral boxes

By taking various per-pixel features, such as brightness, gradient magnitude, and gradient orientation, and then computing the sum and standard deviation of those values over various boxes, we can build a reasonable feature set for detecting text.
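The standard trick for making these box sums cheap is an integral image: precompute cumulative sums (and sums of squares) so the mean and standard deviation of any box come from four lookups each. A rough sketch with assumed names, not code from my pipeline:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Integral image over a row-major grayscale image: sum and sum-of-squares
// tables padded with a zero row/column so any box statistic is O(1).
struct IntegralImage {
    int w, h;
    std::vector<double> sum, sqSum;  // (w+1) x (h+1)

    IntegralImage(const std::vector<float>& img, int width, int height)
        : w(width), h(height),
          sum((width + 1) * (height + 1), 0.0),
          sqSum((width + 1) * (height + 1), 0.0) {
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x) {
                double v = img[y * w + x];
                int i = (y + 1) * (w + 1) + (x + 1);
                sum[i]   = v     + sum[i - 1] + sum[i - (w + 1)] - sum[i - (w + 1) - 1];
                sqSum[i] = v * v + sqSum[i - 1] + sqSum[i - (w + 1)] - sqSum[i - (w + 1) - 1];
            }
    }

    // Mean and standard deviation of the box [x, x+bw) x [y, y+bh).
    void boxStats(int x, int y, int bw, int bh, double& mean, double& stddev) const {
        auto box = [&](const std::vector<double>& t) {
            return t[(y + bh) * (w + 1) + x + bw] - t[y * (w + 1) + x + bw]
                 - t[(y + bh) * (w + 1) + x] + t[y * (w + 1) + x];
        };
        double n = (double)bw * bh;
        mean = box(sum) / n;
        stddev = std::sqrt(std::max(0.0, box(sqSum) / n - mean * mean));
    }
};
```

Once the two tables are built, every box feature in every window costs the same no matter how big the box is, which is what makes the sliding-window scan feasible.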

(Image from Chen & Yuille)

Once the features have been computed for all regions of an image, the resulting feature vector for each region can be sent to the classifier to determine whether or not that region contains text.
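As a sketch of how those per-region vectors might be stacked up for training -- one sample per row with a parallel 0/1 label vector, the layout OpenCV's SVM trainer expects -- here is the idea in plain C++ (the struct name is made up for illustration):

```cpp
#include <vector>

// Accumulate per-window feature vectors into one flat row-major matrix
// with a parallel 0/1 label vector. Illustrative only, not my real code.
struct TrainingSet {
    std::vector<float> data;    // numSamples x featureDim, row-major
    std::vector<int>   labels;  // 0 = no text, 1 = text
    int featureDim = 0;

    void addSample(const std::vector<float>& features, int label) {
        if (featureDim == 0) featureDim = (int)features.size();
        data.insert(data.end(), features.begin(), features.end());
        labels.push_back(label);
    }
    int numSamples() const { return (int)labels.size(); }
};
```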

To give everyone an idea of what kind of image I'll be training on, here is an example from the training set:

TODO for next week:

Implement the full pipeline of load image -> compute feature set -> pass into training algorithm for text detection. I want to get a basic pipeline running and then will debug and add more interesting features for detection.

Wednesday, January 13, 2010

Introduction and Project Outline


Recognition of text in arbitrary real-world images is a largely unsolved problem in Computer Vision and Machine Learning. Contrast this with document-based OCR of scanned text, which is solved to a near-human degree of accuracy.

Image Text Recognition includes all of the difficulties of traditional OCR methods with some additional challenges.

  • The first problem I will refer to as "text detection." Text detection is the problem of finding the bounding boxes of possible text in the sample image. It is essentially a binary classification problem where the goal is to determine whether or not a given region of an image may contain letters or words. Humans are very good at this problem. Even when looking at a language we don't understand or when words are obscured or too distant to recognize entirely, we can still determine the presence of written language.
  • The second problem I will refer to as "word recognition." This is the challenge of taking an image that contains text and outputting the text as a string. This is what OCR engines do, but they are designed for the very constrained case of black text on a white background with uniform font face, size, distortion, color, angle, lighting, and lexicon. In arbitrary real-world images, all of these variables are unconstrained.

My goal is to implement a text-detection and word-recognition algorithm and apply it to Google Street View as demonstrated in the following slides:

Work To Be Done
  • Implement and train text detection algorithm
  • Implement word recognition algorithm
  • Collect Google Street View data (screenshots)
  • Examine the results of correlating metadata from the map with text recognized in images
One addition I would like to make to my text recognition algorithm, if things stay on schedule, is to use multiple viewing angles of the same text to boost recognition. This would require registering the same text across images taken from different angles and combining the recognition results from each. This would help provide invariance to viewing angle, lighting angle, and partial occlusion of the text.

Training Data

I won't be training my detector on GSV images; instead, I will use pre-made data sets designed for this kind of work. There are a few linked here, which I'll use primarily to train my algorithm:

A Few Links to Work I'll Be Building Off