Beating Captchas With Machine Learning

Neil Narvekar
12 min read · May 13, 2021

Team members: Pearse Flood, Sam Dauenbaugh, Neil Narvekar, Allen Shufer

Abstract

CAPTCHA images are widely used as a verification method to make sure that a user is human while performing actions like creating a new account or submitting a form. While this can be a very useful security tool, our team wondered how accurately a machine learning model could read CAPTCHAs, and thereby highlight vulnerable designs that need greater complexity to resist automated attacks. Two different datasets of CAPTCHA images from Kaggle were used to test the effectiveness of our model: one with colorful but uniformly shaped letters, and another with heavily distorted black and white letters. After taking many steps to preprocess the images and experimenting with our model using tools such as OpenCV and the Keras API, we were able to create a simple machine learning model that predicted the text of both CAPTCHA datasets with fairly high accuracy, with the black and white, distorted dataset yielding the higher classification accuracy.

Introduction & Background

CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart, and it has become a commonly used tool online to prevent bot fraud. These tests come in a variety of designs, but usually take the form of an image test. Tasks such as reading characters from a distorted image or selecting images that contain part of an object are common examples, as they are quick and easy for a human to complete but can be very difficult for a computer to solve. A CAPTCHA design that can't stop computers fails to serve its purpose, so for our project, we wanted to create a model that can predict CAPTCHA solutions to point out vulnerable designs.

To guide our approach, we looked at how similar character-recognition problems were solved, especially how models were trained to recognize images from the MNIST database of handwritten digits. Since there is a lot of distortion and color variation in CAPTCHA designs, we also knew that most of the work would be focused on image preprocessing. This way, we could simplify the images down to just the characters and make it easier for the model to predict. We used the OpenCV library to filter the noise from the images, which were then fed to a CNN for character recognition.

Data Collection/Description

It was important for us to include a variety of different datasets to train and test our model with since we were interested in figuring out which designs could be the most easily predicted. We found a selection of unique datasets on Kaggle, including this 10-character color CAPTCHA.

A few of the other datasets we used included a 5-character colorful design with heavy distortion and a 5-character black and white dataset.

The data we used was pretty easy to obtain. The answers to the CAPTCHAs were all included in each file’s name, and all the files were neatly organized in a .zip file that could be extracted for preprocessing within our source code.
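In code, this looks roughly like the sketch below: unzip the archive and treat each filename (minus its extension) as the label. The archive and folder names here are placeholders for wherever the Kaggle download ends up.

```python
import os
import zipfile

import cv2

# Placeholder paths; the actual archive and folder names depend on the Kaggle download.
ARCHIVE_PATH = "captcha_images.zip"
EXTRACT_DIR = "captcha_images"

# Unpack the archive once so the images can be read during preprocessing.
with zipfile.ZipFile(ARCHIVE_PATH) as zf:
    zf.extractall(EXTRACT_DIR)

# Each file is named after its CAPTCHA solution (e.g. "2b827.png"),
# so the label is simply the filename without its extension.
# This assumes the images sit directly inside the extracted folder.
samples = []
for filename in os.listdir(EXTRACT_DIR):
    label = os.path.splitext(filename)[0]
    image = cv2.imread(os.path.join(EXTRACT_DIR, filename))
    if image is not None:
        samples.append((image, label))
```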

Data Pre-Processing & Exploration

Before we can begin making predictions on our data, we need to identify the features we want to use to model the problem. We knew, both from previous experience and from studying what others had done on similar problems, that we would want to isolate a grayscale version of each letter for our model to predict on.

We began by looking at ways to separate letters from the background. Using morphologies proved to be the most effective way to remove lines and dots that we didn't need. Then, a threshold applied over the image removed some of the noise in the background along with a few remaining distortions.
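A minimal OpenCV sketch of this cleaning step might look like the following, assuming dark characters on a lighter background; the kernel size and threshold value are illustrative defaults, not the tuned settings we used.

```python
import cv2
import numpy as np

def clean_image(gray, kernel_size=3, threshold_value=127):
    # A morphological closing (dilation then erosion) fills in dark lines and
    # dots thinner than the kernel, while the thicker letter strokes survive.
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    closed = cv2.morphologyEx(gray, cv2.MORPH_CLOSE, kernel)

    # Thresholding then strips the remaining background noise, leaving white
    # letter shapes on a black background for the contour step that follows.
    _, binary = cv2.threshold(closed, threshold_value, 255, cv2.THRESH_BINARY_INV)
    return binary
```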

After cleaning the image, we needed to extract the letters from their surroundings. We accomplished this with a function provided in OpenCV called findContours(). This function detects continuous white spaces in the black and white image and returns a set of pixels that make up that region. We can then fit a bounding box around that region and extract the sections of the image that were most likely to be letters.
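A sketch of that extraction step, assuming the OpenCV 4 signature of findContours and an arbitrary minimum area to skip leftover specks:

```python
import cv2

def extract_letters(binary_image, min_area=50):
    # findContours traces the outlines of connected white regions.
    contours, _ = cv2.findContours(binary_image, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        if w * h >= min_area:  # skip regions too small to be letters
            boxes.append((x, y, w, h))

    # Sort left to right so the crops line up with the characters in the label.
    boxes.sort(key=lambda box: box[0])
    return [binary_image[y:y + h, x:x + w] for (x, y, w, h) in boxes]
```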

Moving into our final preprocessing design, we chose to add a few more filters that refined the process. The first set of filters were a low-pass and a high-pass filter. We apply one of these filters to the image depending on whether the image is blurry (apply a high-pass) or noisy (apply a low-pass). Removing frequency distortions improves the effectiveness of the morphologies. We also introduced some parameters that would give us finer control over the way the preprocessor functions.
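One simple way to realize these two filters, assuming a Gaussian blur as the low-pass and an unsharp-mask style subtraction as the high-pass (the kernel size is again just a placeholder):

```python
import cv2

def frequency_filter(gray, mode, kernel_size=(5, 5)):
    blurred = cv2.GaussianBlur(gray, kernel_size, 0)
    if mode == "low":
        # Low-pass: keep the smooth content and suppress speckle noise.
        return blurred
    # High-pass (sharpening): boost the original and subtract the smooth
    # content, which emphasizes edges in a blurry image.
    return cv2.addWeighted(gray, 1.5, blurred, -0.5, 0)
```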

One thing to note is that each captcha dataset requires a unique set of parameters for preprocessing. What works well for one set of captchas doesn’t always work well for another set. So, we had to experiment with each dataset we used to try to find the parameters that would yield the best results.

[Figures: the ten-character and five-character CAPTCHAs, shown as the original images and after preprocessing with two different parameter sets.]
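In code, that per-dataset tuning boils down to a small parameter set fed into the helpers sketched above; the values here are placeholders rather than the settings we actually converged on.

```python
# Placeholder parameter sets; the real values were found by trial and error.
PREPROCESS_PARAMS = {
    "color_10_char": {"filter_mode": "low", "kernel_size": 5, "threshold_value": 140},
    "grayscale_5_char": {"filter_mode": "high", "kernel_size": 3, "threshold_value": 120},
}

def preprocess(image, dataset_name):
    # Chain the earlier sketches: frequency filter, cleaning, letter extraction.
    params = PREPROCESS_PARAMS[dataset_name]
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    filtered = frequency_filter(gray, params["filter_mode"])
    binary = clean_image(filtered, params["kernel_size"], params["threshold_value"])
    return extract_letters(binary)
```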

Learning/Modeling

Based on the information we learned from experimenting with the data and preprocessing, we were able to come up with a model that maximized our classification accuracy for both datasets. Below is a screenshot of the code that encompassed the bulk of the model and training methods.

When we researched approaches to this problem, one of the main elements discussed was the use of convolutional neural networks. To apply this, we used a single convolutional layer, effectively applying a filter to the input data to generate a feature map summarizing the presence of detected characters or text in the CAPTCHA. Because this feature map was still quite busy and would give a relatively low accuracy on its own, we needed to isolate only the most informative features. To do this, we added a max pooling layer that keeps only the strongest activations in the feature map.

The images were then flattened, or converted into a 1D array, in order to be fed into a dense layer that generates the final character predictions. This dense layer was made up of 62 nodes, one for each of the 26 uppercase letters, 26 lowercase letters, and 10 digits. Feeding the feature maps into this layer produces the text predictions.
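Put together, a minimal Keras version of the architecture described above looks something like this; the filter count, kernel size, input crop size, and optimizer are illustrative assumptions rather than the exact values from our code screenshot.

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 62           # 26 uppercase + 26 lowercase letters + 10 digits
INPUT_SHAPE = (28, 28, 1)  # assumed size of a single grayscale character crop

model = keras.Sequential([
    keras.Input(shape=INPUT_SHAPE),
    # A single convolutional layer builds a feature map of the character strokes.
    layers.Conv2D(32, (3, 3), activation="relu"),
    # Max pooling keeps only the strongest responses in each local region.
    layers.MaxPooling2D((2, 2)),
    # Flatten the pooled feature map into a 1D vector...
    layers.Flatten(),
    # ...and classify it with one dense node per possible character.
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# x_train / y_train would hold the preprocessed character crops and their
# one-hot encoded labels; training ran for 10 epochs in our experiments.
# history = model.fit(x_train, y_train, epochs=10, validation_split=0.2)
```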

This image does a good job of illustrating how applying the convolutional and max pooling layers and then flattening the images creates a more compact, focused representation that feeds into a connected node network like the one on the right-hand side. Luckily, these steps were not too difficult to implement in code because we were able to leverage the built-in functions of the Keras API.

Although it is often beneficial to stack multiple convolutional and max pooling layers for greater accuracy, we found that this was not the case with our data. When we implemented two or more of these layers, our classification accuracy dropped below what it was with just one of each in the model. Thus, while the example image shows a process with two of each layer, our model consists of only one such sequence because it gave us the highest classification accuracy, which will be discussed in the next section.

Results

After running our model for 10 epochs on both of our datasets, our results were as follows.

For our dataset with uniform, colored characters, our model achieved a mean accuracy of 76.888% on our testing dataset.

In these plots, the x-axis is the epoch number and the y-axis is the cross-entropy loss and the classification accuracy, respectively. The blue line is for the training data, and the yellow line is for the validation data. We can see how the testing accuracy increases slightly and then flattens out at about 77% as the epochs increase.
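For reference, plots like these can be generated straight from the History object returned by model.fit(); the colors below just approximate the ones in our figures.

```python
import matplotlib.pyplot as plt

def plot_history(history):
    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))

    # Cross-entropy loss per epoch: blue for training, yellow for validation.
    ax_loss.plot(history.history["loss"], color="blue", label="train")
    ax_loss.plot(history.history["val_loss"], color="gold", label="validation")
    ax_loss.set_xlabel("epoch")
    ax_loss.set_ylabel("cross-entropy loss")
    ax_loss.legend()

    # Classification accuracy per epoch, same color scheme.
    ax_acc.plot(history.history["accuracy"], color="blue", label="train")
    ax_acc.plot(history.history["val_accuracy"], color="gold", label="validation")
    ax_acc.set_xlabel("epoch")
    ax_acc.set_ylabel("classification accuracy")
    ax_acc.legend()

    plt.show()
```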

In this example, we can see our model in action attempting to predict individual images from the colored characters dataset. The characters have been preprocessed by making the images black and white and split into separate characters. We can see that some of these characters become almost impossible to read by humans, showing the limits of our preprocessing on this dataset. However, we can see the model at work for the many readable characters.

For our dataset with the grayscale images, our results were significantly better. The exact same model was able to achieve about an 88% accuracy on this dataset.

We can see how the accuracy remains fairly stable at about 88% across all the epochs for the testing data.

Here is an example of our model working with this second dataset. Compared to the colored images dataset, we can see that this dataset was preprocessed much more cleanly, as far fewer characters end up unreadable by humans. This is likely why the model ended up much more accurate on this dataset than on the colored images dataset.

The difference between the 77% accuracy on the color images dataset and the 88% accuracy on the grayscale dataset likely comes down to preprocessing. For both of these datasets, we have two main preprocessing steps: first, we make the image black and white, and second, we create bounding boxes around each character. For the first step, significantly more work must be done on the color images dataset, since the grayscale dataset is much easier to convert to black and white.

These are two letters, e and a, from the color images dataset. What the letter truly looks like is on the bottom, and our preprocessed version is above it. We can see from these examples how difficult it is to convert these images to black and white. The processed images are basically unrecognizable to humans, so we can hypothesize that the same problem occurs for the neural network. In the grayscale dataset, there are far fewer examples of a character becoming unrecognizable after this preprocessing step, which likely improves the accuracy on that dataset.

For our second step in creating bounding boxes around each character, both datasets suffer from bounding boxes being created incorrectly.

This is an example from the dataset of grayscale characters. For this image, the bounding box was placed around both the 9 and the F, combining them into one letter. This crop would also be given the incorrect true label of '9', meaning our model could end up training and validating on incorrect data like this. The same issue occurs with the colored images dataset, and it is likely a large reason that the accuracy on both datasets isn't higher.

Another approach we tried, to alleviate the issue in the first preprocessing step (characters in the color dataset becoming unreadable after conversion to black and white), was to simply not convert the image at all. Instead, we skipped this step and only created bounding boxes around each letter, changing our network slightly so that it could take 3-channel RGB images rather than single-channel black and white images. We hypothesized that since converting colored images to black and white often makes them unreadable, leaving them in their original state might improve the accuracy of our model.
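Concretely, the only architectural change this experiment requires is a three-channel input; everything else mirrors the model sketched earlier, with the same assumed crop size and layer settings.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Same assumed architecture as before, but the input now has 3 color channels.
rgb_model = keras.Sequential([
    keras.Input(shape=(28, 28, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(62, activation="softmax"),
])
```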

The result was that the dataset with black and white characters retained about the same accuracy of 88%. This is likely because there wasn't much difference between the converted black and white images and the originals. However, the colored images dataset unfortunately saw a drop in accuracy from 77% to about 50%. We hypothesize this result is due to the fact that the color of an individual letter has no correlation with which letter it is. This is intuitive to us as humans, as the color a letter is printed in doesn't affect what we read the letter as. However, the model may be making some connection between the color of the letter and its label, in which case it makes sense to simplify the problem to a black and white image.

Conclusion

Overall, our final model gave us a 77% accuracy on the colored images dataset and an 88% accuracy on the grayscale dataset.

One important takeaway from the other approach discussed in the results section, our failed attempt to feed the network the full color image rather than the black and white conversion, is knowing when it is beneficial to simplify the problem by converting an image to black and white and when it is better to leave it as is.

For instance, after researching neural networks that classify dogs and cats, we found that most of these networks do no preprocessing and simply give the network the entire color image. This is likely because color is an important attribute that helps the network distinguish dogs from cats. However, that isn't the case for a character recognition problem like ours. Thus, if we can hypothesize that color isn't an important attribute for the classification, removing it will likely increase accuracy; if color might matter, the original image can be used.

The main improvement that could be made to this model is in the preprocessing. As discussed above, converting the color image dataset to black and white distorts many letters, and improving that conversion would be the best way to raise accuracy on the color dataset. Additionally, the contour boxes often capture two letters that are close together instead of one, or capture a part of the image with no characters at all due to stray marks. Fixing these contour boxes would improve accuracy on both the color and grayscale datasets.

Transfer learning would be the next step in improving the neural network itself. Our current network is fairly simple, but using transfer learning with an already-trained image classification network may improve its accuracy. Transfer learning lets us take the knowledge a network gained from a more general task and apply it to a more specific one. There are many pretrained networks available through Keras that we could use for this, such as ResNet50, VGG16, and Inception. These models were trained on more general image classification tasks, but by using them for feature extraction we could improve our own model greatly. Another idea would be to train our own model to high accuracy on a general character recognition dataset that requires no preprocessing, and then use that model as the starting point for transfer learning to improve the accuracy of our CAPTCHA reader.
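A rough sketch of what the feature-extraction route could look like in Keras, using ResNet50 as a frozen base; the input size, head layers, and the choice of ResNet50 itself are assumptions, and our grayscale character crops would need to be resized and stacked to three channels first.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Use a pretrained ResNet50 as a frozen feature extractor and train only a
# small classification head on top of it.
base = keras.applications.ResNet50(weights="imagenet", include_top=False,
                                   input_shape=(64, 64, 3), pooling="avg")
base.trainable = False  # keep the pretrained weights fixed (feature extraction)

transfer_model = keras.Sequential([
    base,
    layers.Dense(128, activation="relu"),
    layers.Dense(62, activation="softmax"),  # one node per possible character
])

transfer_model.compile(optimizer="adam",
                       loss="categorical_crossentropy",
                       metrics=["accuracy"])
```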

References

Black and White characters dataset

https://www.kaggle.com/greysky/captcha-dataset

Color captchas dataset

https://www.kaggle.com/aadhavvignesh/captcha-images

Convolutional Neural Network Image

https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53

Neural Network on MNIST

https://machinelearningmastery.com/how-to-develop-a-convolutional-neural-network-from-scratch-for-mnist-handwritten-digit-classification/

Relevant project links

https://drive.google.com/drive/folders/1TYxsM-8dxTftjuQf-uw33GJ2Bml11gXU?usp=sharing
