Amazonfail
Apr. 15th, 2009 08:36 pm
I have a sneaking suspicion that this may have involved second-order data analysis.
That is, nobody went and explicitly selected a set of metadata categories as "adult" with the aim of delisting them.
My guess is more along the following lines:
If you take all the individual works which have been manually flagged as "adult" in the past -- possibly with no effect at the outset, merely as an internal marker -- you can run a high-level analysis which looks at the categories grouping those works. You'd put in extra requirements -- for example, that there be a minimum number of "adult" works in a category, and possibly that a new work match more than one such category before it gets tagged.
The actual human inputs into this would be the seed data -- which would be works, not categories -- plus the numeric thresholds used as parameters to the program.
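To make the guess concrete, here is a minimal sketch of that kind of second-order tagging. Everything in it is my own invention for illustration: the function names, the toy catalogue, the category labels and the two thresholds (`min_count`, `min_matches`) are assumptions, not anything known about Amazon's systems.

```python
from collections import Counter


def infer_adult_categories(works, seed_ids, min_count=2):
    """Return categories containing at least `min_count` seed-flagged works."""
    counts = Counter()
    for work_id in seed_ids:
        # Count each category once per work, even if listed twice.
        counts.update(set(works.get(work_id, ())))
    return {cat for cat, n in counts.items() if n >= min_count}


def tag_works(works, adult_categories, min_matches=1):
    """Return IDs of works matching at least `min_matches` inferred categories."""
    return {
        work_id
        for work_id, cats in works.items()
        if len(set(cats) & adult_categories) >= min_matches
    }


# Hypothetical catalogue: work ID -> publisher-supplied categories.
works = {
    "w1": ["Erotica", "Romance"],
    "w2": ["Erotica", "Romance"],
    "w3": ["Romance", "Historical"],
    "w4": ["Health", "Parenting"],
}
seed = {"w1", "w2"}  # the only human input besides the thresholds

cats = infer_adult_categories(works, seed, min_count=2)
flagged_loose = tag_works(works, cats, min_matches=1)
flagged_strict = tag_works(works, cats, min_matches=2)
```

In this toy data the two seed works drag "Romance" in alongside "Erotica", so with `min_matches=1` the unflagged historical romance `w3` gets tagged as well; requiring two matching categories happens to spare it here, but only because of how these particular labels overlap.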
The problem with this approach, which might not be visible to a programmer trying to implement an automatic labelling scheme, is that category metadata, which is basically CIP information, is set by the publisher, is wildly inconsistent, and can't really be used in this way in the first place. In addition, trying to cross-check by using multiple categories won't work because the labels aren't orthogonal.
The next problem (probably not on the developer's side, since they'd probably set this up to be tweakable) is that if you set the thresholds for this kind of analysis too low you get very unexpected results.
Finally, you would have to do extensive hard-coded tweaking for (1) categories which are too broad, and therefore useless at actually capturing useful metadata for this purpose, and (2) categories which are so small that, although they are what you want to target, they never get enough input to push them over the trigger limits. You would really need to do an iterative application without generating anything other than internal lists (generate tagging; check with human judgement; tweak; run again; tweak...) with knowledgeable people assessing the results each time. (One problem with this is that it's iterative -- even assuming it would work in the first place, every run produces more inputs for the next one, which generates horrendous positive feedback unless something keeps it strictly in check.)
What you would really have to do is always have any new additions to the adult category found by this sort of iterative "search" vetted by human eyes. But Amazon's whole model (as far as I can tell) is to have as much as possible done automatically: e.g. many of their recommendations, which are based on purchasing patterns rather than CIP data, are wide of the mark, but enough are close to the mark that there's a better cost/benefit point in just generating them than in having them vetted by anyone (except the end user, who can provide tuning feedback).
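The positive feedback is easy to see in a toy model. The sketch below is again purely hypothetical (invented names, data and threshold): each round feeds the previous round's flagged works back in as seed data, and an optional `vet` callback stands in for a human reviewer approving each new addition.

```python
from collections import Counter


def run_round(works, flagged, min_count, vet=lambda work_id: True):
    """One pass: infer 'hot' categories from flagged works, then tag matches."""
    counts = Counter()
    for work_id in flagged:
        counts.update(set(works[work_id]))
    hot = {cat for cat, n in counts.items() if n >= min_count}
    # New additions only stick if the (stand-in) reviewer accepts them.
    added = {
        w for w, cats in works.items()
        if w not in flagged and hot & set(cats) and vet(w)
    }
    return flagged | added


def iterate(works, seed, min_count, vet=lambda work_id: True, max_rounds=10):
    """Repeat rounds until nothing new is flagged; return every round's set."""
    history = [set(seed)]
    for _ in range(max_rounds):
        nxt = run_round(works, history[-1], min_count, vet)
        if nxt == history[-1]:
            break
        history.append(nxt)
    return history


# Hypothetical catalogue with overlapping, non-orthogonal labels.
works = {
    "a": ["Erotica", "Romance"],
    "b": ["Erotica", "Romance"],
    "c": ["Romance", "Historical"],
    "d": ["Romance", "Historical"],
    "e": ["Historical", "Biography"],
    "f": ["Cooking"],
}

unchecked = iterate(works, {"a", "b"}, min_count=2)
sizes = [len(s) for s in unchecked]

# A reviewer who rejects the two historical romances stops the cascade.
vetted = iterate(works, {"a", "b"}, min_count=2,
                 vet=lambda w: w not in {"c", "d"})
```

Left unchecked, two seed works become four and then five, with a historical biography flagged purely because it shares a label with books that share a label with the seeds. With the reviewer rejecting `c` and `d`, the flagged set never grows beyond the seed.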
That's assuming that it was a good idea in the first place -- or at least, a good idea as an automatic filter rather than one which could be turned on optionally by the user. The presence of multiple communities coexisting on the net basically renders it unlikely to impossible that you'd ever get any consensus on what was a "proper" result.