Hacker News
Battling the Internet Water Army: Detection of Hidden Paid Posters (arxiv.org)
66 points by yuhong on Nov 23, 2011 | 24 comments


OK, so I've learned from this paper that this sort of paid posting to influence public opinion is done systematically by trained teams in China. But my praise ends there.

I don't much trust their machine learning detection approach. But before we get into that, let's describe what they did. These researchers collected about 21,000 comments from a news site, representing 552 users. The researchers then manually selected which of these 552 users they thought were paid posters, without any external confirmation. According to their methodology, they assumed stupid or contradictory comments were by paid posters. The approach from here was to do lots of fancy math on novel comments not screened by the researchers, and classify whether these novel comments were by paid posters or not.

If you haven't guessed it by now, the flaw here is that the researchers decided which commenters they thought were paid based on the stupidity of the comment. It's pretty easy to manually classify e-mail spam, but I'd have a hard time identifying a paid public opinion shill based on a comment. Furthermore, if the researchers are using the intelligence of the comment as a marker, my experience with YouTube comments, r/politics, and a few other internet forums leads me to believe that there's no shortage of stupid, contradictory ideas espoused by real unpaid people.
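
To make that concern concrete, here is a toy sketch in Python. Only the 552-user total comes from the paper as summarized above; the split into paid/unpaid and poor/plausible commenters is entirely made up, and `User`, `heuristic_label`, and `score` are hypothetical names. It shows that a classifier which reproduces the labelers' judgment perfectly can still be a poor detector of who is actually paid.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    truly_paid: bool   # ground truth, which nobody in this scenario can observe
    low_quality: bool  # what a human labeler can actually see in the comments


def build_population():
    # Hypothetical counts; they only need to sum to the 552 users in the study.
    users = []
    users += [User(True, True)] * 30     # paid, writes poor comments
    users += [User(True, False)] * 20    # paid, writes plausible comments
    users += [User(False, True)] * 100   # unpaid, writes poor comments
    users += [User(False, False)] * 402  # unpaid, writes plausible comments
    return users


def heuristic_label(user):
    # Assumed labeling rule: "poor comments" means "paid poster".
    return user.low_quality


def score(users, predict, truth):
    # Precision and recall of `predict` measured against `truth`.
    tp = sum(1 for u in users if predict(u) and truth(u))
    fp = sum(1 for u in users if predict(u) and not truth(u))
    fn = sum(1 for u in users if not predict(u) and truth(u))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


users = build_population()
classifier = heuristic_label  # best case: the model mimics the labelers exactly

vs_labels = score(users, classifier, heuristic_label)
vs_truth = score(users, classifier, lambda u: u.truly_paid)

print("precision/recall against the researchers' labels:", vs_labels)
print("precision/recall against the (unknown) truth:", vs_truth)
```

Against the researchers' own labels the classifier looks perfect, while against the invented ground truth most of its "paid" calls are ordinary people writing bad comments. The real numbers could be better or worse; without external confirmation there is no way to tell.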


Automatic Detection of Stupid Posters would also be acceptable.



"To the best of our knowledge, this paper is the first to study the social phenomenon of paid posters."

A simple Google search shows there are all sorts of papers that have been written on detecting astroturfing. This statement is an epic fail.


Well.. maybe you don't get those results in China.

But seriously, you're right, there's even a similar one posted to arXiv: http://arxiv.org/abs/1011.3768


Sorry about following your OT here, but it's a very interesting thought: that the Great Firewall might be a factor in overestimating the uniqueness of your own work if prior art is censored from your view.

This reminds me of a video I saw a while back where they showed Chinese students the iconic "man in front of a line of tanks" picture and they failed to recognize it. I wonder whether irony will eat itself in the future - maybe some "capitalist" country will decide to use tanks on its people and they will try to stop it with a human shield. Cue "look at the failure of capitalism - using tanks on their own people who bravely stand up to it!" comments from China's statesmen.

Shielding your citizens from material that encourages dissenting thought is a cute plan, but you may end up making them look like fools on the international stage.


The simple reason is that the author of this paper apparently doesn't know the word "astroturfing".


Might be. Personally, I discovered this word some time around last week, via HN.


I'd love to see this applied to some of the "patriotic" Facebook pages (like the "Being American" page).



In the USA, it is now law that bloggers have to mention any gifts of products, books, licenses, etc. that might influence blog posts.

I think this is good!

I am an author, and I get comped a lot of books, and some software products. It just feels right to say something like, for example, "thanks to publisher Z for sending me a review copy of book B" when talking about book B. If I get comped something that I don't like, then I won't talk about it.

Paid posters on Reddit, etc., are more insidious because you just have to guess if they might be paid by a company or government to push desired hype.


not actually a "law" -- more of an F.T.C. regulation

"The Guides are administrative interpretations of the law intended to help advertisers comply with the Federal Trade Commission Act; they are not binding law themselves." -- http://www.ftc.gov/opa/2009/10/endortest.shtm


FTC regs are backed by truth-in-advertising law.


Can you point me in the direction of more info about that law? Maybe I'm just choosing the wrong search terms, but I couldn't find any concrete info.



Relevant "law":

   § 255.5   Disclosure of material connections.

   When there exists a connection between the endorser and
   the seller of the advertised product that might materially 
   affect the weight or credibility of the endorsement ( i.e.,
   the connection is not reasonably expected by the audience), 
   such connection must be fully disclosed.

Relevant example:

   Example 7:  A college student who has earned a reputation as
   a video game expert maintains a personal weblog or “blog” 
   where he posts entries about his gaming experiences.  Readers 
   of his blog frequently seek his opinions about video game 
   hardware and software.  As it has done in the past, the 
   manufacturer of a newly released video game system sends the 
   student a free copy of the system and asks him to write about 
   it on his blog.  He tests the new gaming system and writes a 
   favorable review. Because his review is disseminated via a 
   form of consumer-generated media in which his relationship to 
   the advertiser is not inherently obvious, readers are 
   unlikely to know that he has received the video game system 
   free of charge in exchange for his review of the product, and 
   given the value of the video game system, this fact likely 
   would materially affect the credibility they attach to his 
   endorsement.  Accordingly, the blogger should clearly and 
   conspicuously disclose that he received the gaming system 
   free of charge.  The manufacturer should advise him at the 
   time it provides the gaming system that this connection 
   should be disclosed, and it should have procedures in place 
   to try to monitor his postings for compliance.


From the geographical distribution of users vs. paid posters on p. 7, two provinces, SICHUAN and especially SHANDONG, stand out for their lower ratio of paid posters. Any idea why that is?


I guess their data sample isn't strong enough. I don't see any reason why those two provinces would have such a low ratio.
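
As a rough illustration of how noisy a per-province ratio can be when only a few dozen users fall in each province, here is a small sketch using the Wilson score interval. The counts below are hypothetical and not taken from the paper, and `wilson_interval` is just a helper written for this example.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Hypothetical counts, NOT from the paper: (paid posters, all users) per province.
province_a = (2, 40)
province_b = (10, 60)

lo_a, hi_a = wilson_interval(*province_a)
lo_b, hi_b = wilson_interval(*province_b)

# Two closed intervals overlap when each one starts before the other ends.
intervals_overlap = lo_a <= hi_b and lo_b <= hi_a

print(f"province A: {lo_a:.3f} to {hi_a:.3f}")
print(f"province B: {lo_b:.3f} to {hi_b:.3f}")
print("intervals overlap:", intervals_overlap)
```

With counts this small the two intervals overlap, so an apparently "low" province may not differ meaningfully from the others.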


or bad GeoIP db


This is a fantastic study! The authors are top class in their field and I think we should all take notice of what they say!!!


I see what you did there...


I suspect quite a bit of astroturfing is going on in app store reviews. My national iOS app store has a fairly small volume of ratings/reviews, usually single or double digits, only hitting hundreds for a very few apps. I've spotted two or three weird scattershots of desultory, ill-fitting, generic-sounding reviews. It even looked like the effort needed was overestimated out of ignorance ...


Oh, that's a certainty. I've seen Amazon Mechanical Turk work items that have you place a five star review and receive payment for having done so.


Hmm, that explains why they are "reviews" (with verifiable names) instead of just ratings (untraceable externally).

Fortuitous interesting excursion on arXiv: top-200 heavy physics papers readers, "2010 Institutional arXiv Usage Data" at http://arxiv.org/help/support/2010_usage



