Wednesday, April 2, 2008

Requesters: Getting High-Quality Results

Sounds like TagCow has had some great publicity and response to their enterprise! This means more work, but a few snags as well. Georgetag posted some questions to Turker Nation. These are problems most Requesters will run into, particularly those with high-volume HITs. Although my response is aimed at the image-tagging set, the tactics outlined below are general enough for any Requester to use.

Quote:
Garbage tags (intentional and unintentional):
We filter out meaningless words (the, it, an, a, with) but we are getting some totally irrelevant tags
Vulgarity (jokesters are hijacking the program)
It looks like MTurk is doing some filtering and we are doing some filtering as well. (Can anyone confirm that?)
...
Incomplete taggings:
We have gotten some images tagged with "boy" where there is more that could obviously be said about the photo, like "boy", "playing", "trains"

There are several things you can do to keep the quality high. Here is a list of tactics other Requesters have implemented on MTurk. I'm certainly not suggesting you use all of these ideas, but one or two might work well.
  1. Have a good, representative list of examples, and a good description.
    I know folks over at Turker Nation have already mentioned this, but your description is very vague. We aren't certain whether you want us to list everything we see in the picture or keep it as simple as possible. A separate webpage with lots of examples will go a long way; we can then emulate those examples.

  2. Warn workers which responses will be rejected, and what behavior will get them banned.
    A clear (but not overly dramatic) warning might be just enough to scare off the Roboform-type workers.

  3. Require a minimum approval rating for workers.
    Some recent HITs from Amazon's Media Content group and the information-extraction group required an approval rating greater than 90%. (Smart Travel Media also uses this threshold.) That sounds about right to me: after doing 40k HITs, mine is 99.8%, and in the forums even workers who do HITs with high rejection rates (like the items HITs) still seem to be above 90%. This will screen out some of the workers who continually try to "game" the system. (There's a sketch of setting this requirement through the API further down.)

  4. Offer a bonus for high-quality work.
    Georgetag is already offering a volume bonus based on approvals, which is excellent! Few Requesters do this, and more should. Once you have a good verification workflow established, you could track workers' responses and reward those who submit the highest quality and volume. Money talks.

  5. Include "gotchas."
    Include some pictures that should have an obvious response, like a flower, a bird, etc., and especially ones where a particular word clearly has to appear. Then you can start to weed out or ban workers who miss these images. The "are these items different" HITs have a qualification set up with a large set of gotchas: miss one and your qualification drops by 200 points, and after getting three wrong you get timed out for some length of time. (A rough sketch of this kind of score tracking follows the list.)

  6. Ban very bad workers.
    And ban them quickly. If you haven't already implemented it, check whether a single worker is giving you the same response (or the same few responses) over and over again. I wouldn't be surprised if workers are bypassing the "Enter ALL text found in image" step. Luckily this check can be automated (see the sketch after the list), and don't forget to hand out some rejections if a worker repeats responses beyond an acceptable level.

  7. Set up verification HITs.
    After collecting all the tags for a given image, you could create a HIT in which a worker verifies that the tags are relevant and flags any that are vulgar or meaningless. An image with a row of checkboxes would be ideal: check off any tags that are bad, with a comment field for anything unusual. That way you end up with a clean set of tags and a second way of identifying bad workers. (A sketch of generating such a form is below as well.)

  8. Use a qualification test.
    This one might be a pain to grade, but you could require the worker to successfully tag 5 sample images before being able to do your HITs. Some qualifications even use a quiz about the purpose of the HIT and whether an example response is appropriate or not; the quiz style can be graded automatically (sketched further down).
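
A few of these are straightforward to sketch out in code, so here are some rough outlines in Python. Every specific number and name in them is my own guess, not anything a particular Requester actually uses. First, the score-keeping behind item 5's gotchas: start every worker at some score, dock points for each missed known-answer image, and stop serving them once they fall below your threshold.

    from collections import defaultdict

    # Illustrative numbers only: 1000 starting points, 200 per missed gotcha,
    # and a 600-point floor means a third miss locks the worker out.
    START_SCORE = 1000
    GOTCHA_PENALTY = 200
    REQUIRED_SCORE = 600

    scores = defaultdict(lambda: START_SCORE)

    def record_gotcha_result(worker_id, submitted_tags, required_tags):
        """Dock a worker's score if their tags miss any tag the gotcha image
        obviously requires; return whether they are still allowed to work."""
        submitted = {tag.strip().lower() for tag in submitted_tags}
        if not set(required_tags).issubset(submitted):
            scores[worker_id] -= GOTCHA_PENALTY
        return scores[worker_id] >= REQUIRED_SCORE

If you keep the score in a custom MTurk qualification that your HITs require, the timeout enforces itself.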

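For item 6, the duplicate-response check really is just a few lines over your downloaded results. Here results is assumed to be (worker ID, response text) pairs, however you export them, and the thresholds are guesses to tune.

    from collections import Counter, defaultdict

    def find_repeat_offenders(results, min_submissions=20, max_repeat_share=0.5):
        """Flag workers whose single most common response accounts for more
        than max_repeat_share of everything they have submitted."""
        by_worker = defaultdict(list)
        for worker_id, response in results:
            by_worker[worker_id].append(response.strip().lower())

        flagged = []
        for worker_id, responses in by_worker.items():
            if len(responses) < min_submissions:
                continue  # not enough data to judge fairly
            top_response, count = Counter(responses).most_common(1)[0]
            if count / len(responses) > max_repeat_share:
                flagged.append((worker_id, top_response, count, len(responses)))
        return flagged

Anyone this flags deserves a quick manual look before you reject or block; an honest worker can legitimately answer "none" a lot on some image sets.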

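For the verification pass in item 7, the form itself is easy to generate. This is just a sketch of the HTML you might drop into such a HIT; the field names are made up.

    def verification_form(image_url, tags):
        """Build the body of a verification HIT: the image, one checkbox per
        tag (checked means the tag is bad), and a free-text comment box."""
        checkboxes = "\n".join(
            f'<label><input type="checkbox" name="bad_tag" value="{tag}"> {tag}</label>'
            for tag in tags
        )
        return (
            f'<img src="{image_url}" alt="Image being verified">\n'
            '<p>Check any tags that do NOT describe this image:</p>\n'
            f'{checkboxes}\n'
            '<p>Anything vulgar, meaningless, or odd? <input type="text" name="comment"></p>'
        )
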
Implementing 1, 2, and 3 is dead easy. Making a nice page with a list of good and bad examples will go a long way toward fixing some of your problems; I think some workers may be inadvertently giving you poor tags simply because they don't understand what you want.
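
On item 3 specifically, if you create HITs through the API, the approval-rating requirement is just one extra block on the request. Here is a sketch using Amazon's Python client (boto3); the HIT details and the form URL are placeholders, but '000000000000000000L0' is the built-in qualification ID for a worker's percentage of approved assignments.

    import boto3

    mturk = boto3.client('mturk', region_name='us-east-1')

    # Built-in qualification: percentage of the worker's assignments approved.
    APPROVAL_RATE = '000000000000000000L0'

    question = """<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
      <ExternalURL>https://example.com/tag-form</ExternalURL>
      <FrameHeight>600</FrameHeight>
    </ExternalQuestion>"""

    hit = mturk.create_hit(
        Title='Tag everything you see in this image',
        Description='List every object, person, and action visible in the photo.',
        Keywords='image, tagging, labels',
        Reward='0.03',
        MaxAssignments=3,
        AssignmentDurationInSeconds=300,
        LifetimeInSeconds=86400,
        Question=question,
        QualificationRequirements=[{
            'QualificationTypeId': APPROVAL_RATE,
            'Comparator': 'GreaterThanOrEqualTo',
            'IntegerValues': [90],
            'ActionsGuarded': 'Accept',  # below 90% cannot accept the HIT
        }],
    )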

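And if you do go the qualification-test route from item 8, a multiple-choice quiz can be graded automatically from an answer key, so nobody has to mark tests by hand. Another boto3 sketch, where questions.xml and answerkey.xml stand in for the QuestionForm and AnswerKey documents you would write.

    import boto3

    mturk = boto3.client('mturk', region_name='us-east-1')

    # questions.xml: a QuestionForm with, say, five sample images and
    # multiple-choice tag sets. answerkey.xml: the matching AnswerKey, which
    # lets MTurk score the test the moment a worker submits it.
    with open('questions.xml') as f:
        test_xml = f.read()
    with open('answerkey.xml') as f:
        answer_key_xml = f.read()

    qual = mturk.create_qualification_type(
        Name='Image-tagging quiz',
        Keywords='image, tagging, quiz',
        Description='Pick the best set of tags for each of five sample images.',
        QualificationTypeStatus='Active',
        Test=test_xml,
        AnswerKey=answer_key_xml,
        TestDurationInSeconds=600,
        RetryDelayInSeconds=86400,  # failed workers can retake it the next day
    )

    # Attach qual['QualificationType']['QualificationTypeId'] to your HITs as a
    # QualificationRequirement, the same way as the approval rating above.
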
Politically, using any of tactics 2-8 can be slightly tricky. (Item 1 should be done by all Requesters; don't forget that the "description" field cannot be seen by workers once they are in the HIT.) You don't want to scare off your best workers or stop people from trying your HITs, so don't be too threatening or too strict. Workers get very upset if they feel wrongly slighted, and will happily share the story with everyone on the Turker Nation forums.

To end on a positive note:
You'll find that most workers really do want to give you exactly the high-quality responses you require. When paid well, we are eager to perfect our responses, and we love discussion and feedback. Keep up a good dialog, and you will have a group of willing, quality workers in no time!

