John Malloy—SEO Specialist

Archive for September 2011

Technological Singularity

I started a little miniseries last week about the types of online personas, and yesterday I continued with why Google would want to know whether you are human or not. I just want to finish the thought on online personalities here by cataloging a few of the ways humans are separated from automated processes in the online world.

The effort to separate artificial intelligence from human intelligence starts around 1950 with Alan Turing and the Turing Test. What was a somewhat fantastical notion then is commonplace today. Often when entering form data you are challenged to solve a small puzzle to prove you are human. In an extremely interesting twist, researchers at Carnegie Mellon developed reCAPTCHA, which not only identifies humans but slyly uses them to transcribe scanned text. For example, a great deal of the New York Times is not in digital format. We can try to scan this text, but because computers can’t always read it (the print may be slightly garbled), some of it doesn’t get transcribed. These bits are placed alongside CAPTCHAs for humans to transcribe, a brilliant bit of crowdsourcing developed in part by Luis von Ahn.

reCAPTCHA
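To make the pairing described above concrete, here is a minimal sketch of the idea: a word the system already knows is shown next to a word the scanner could not read. If the visitor gets the known word right, their reading of the unknown word counts as one vote. The function names and the three-vote threshold are my own assumptions for illustration, not reCAPTCHA's actual implementation.

```python
from collections import Counter


def check_and_collect(control_answer, control_guess, unknown_guess, votes):
    """Return True if the visitor solved the known (control) word.

    Only when the control word is right do we trust the visitor enough
    to record their reading of the unknown scanned word.
    """
    passed = control_guess.strip().lower() == control_answer.strip().lower()
    if passed:
        votes[unknown_guess.strip().lower()] += 1
    return passed


def consensus(votes, min_votes=3):
    """Return the agreed reading once it has enough votes, else None.

    min_votes is an assumed threshold chosen for this sketch.
    """
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count >= min_votes else None


# Hypothetical usage: three visitors who pass the control word agree
# on the unreadable word, one visitor fails and is ignored.
votes = Counter()
check_and_collect("morning", "Morning", "upon", votes)
check_and_collect("morning", "mornin", "spam", votes)  # fails, not counted
check_and_collect("morning", "morning", "upon", votes)
check_and_collect("morning", "morning ", "Upon", votes)
print(consensus(votes))
```

The design point is that the human test and the transcription work happen in the same ten seconds: the control word does the filtering, the unknown word does the crowdsourcing.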

The strange coincidence of simultaneously using computers to test humans and humans to test computers has led to some interesting modern applications of process filtering:

Once again, we see that using machines to identify humans turns into an arms race. Spammers even fight back using the exact same conceptual approaches, like bypassing email spam filters using images.

I love this quote from an interview with Luis von Ahn in Walrus Magazine, where he says he had “unwittingly created a system that was frittering away, in ten-second increments, millions of hours of a most precious resource: human brain cycles”.

This is exactly what Google and other search engines need in order to find the best content on the web: the real value of human brain cycles. We’ve seen the flaws of humans identifying machines, and likewise of machines identifying humans. Until automated processes become indistinguishable from humans, the solution may simply be to let humans identify humans through actual interaction, one human being at a time.

No tags

The Human Being

I just finished reading this great post on the spam arms race problem that Google and other search engines have. It is a great summary of the problem in general. A really interesting implied point of the article is that in some ways the algorithmic approach to search indexing is impossibly flawed. It puts algorithm gamers eternally in front of great content, because people focused on great content aren’t gaming algorithms (or, more realistically, they are afraid to because they might get blacklisted or penalized). So there will always be some latency before great content rises to the top of search results, while search engines work to filter out gamed results.

Near the end of the post he says:

    “The good news is that webmasters who don’t invest in gaming will still see the best long-term results. Focusing on quality, basic promotion through guest blogging or social sites, and honestly providing value will get organic results over time – and won’t be tossed to the sidelines with algorithm updates.”

Why? I mean, we all hope this is true, but what is the actual next step in addressing the problems Rob raises? What advantage could search engines get back over the algorithm gamers? There are human indexes, but they can’t seem to keep up with all the content on the web, especially emerging, fresh, or daily content. One solution might be to crowdsource the problem. I think Google has attempted to do this a little bit with its +1 button. But this could be gamed as well by sending out bots to vote up content.

In every case, the problem keeps coming back to automated processes. And there are only two solutions to this problem that I can see:

1. Build an automated process to identify automated processes. Presumably automated processes have patterns or signals that can be identified. Content that is promoted or generated by an automated process could then be given lower value in a search ranking system.
2. Somehow harness the abilities of real human beings to help identify valuable content.

The problem with #1 is that it remains an arms race. With each development comes a new battleground. With every adjustment Google makes comes a counter-adjustment by anyone looking to game them. #1 will presumably always be a part of the search indexing process, but it alone is not enough.

#2 is the answer, because at the end of the day humans know what they want. They may even *want* what spammers are trying to get to them. But Google doesn’t necessarily care about that; it just wants to get searchers paired with what they are looking for. So by threading real human feedback into the quality of the results, you can ensure that the stuff people don’t want gets marginalized or de-prioritized.

Of course, this is easier said than done.

The gamers may create an army of things that look human, or even employ cheap labor in order to create networks of people that are directed to promote their product, creating an artificial demand that would get noticed by Google.

Last Friday I talked about the different kinds of personas that may exist on the web. Tomorrow I will take a look at the different ways human beings can be validated and therefore used to improve search quality.


Actroid: Human?

The way we interact with each other within a digital space is sort of incredible if you take a step back and think about it. It removes all physical time and space and replaces these concepts with new definitions, new ways of moving around, and new types of engagement, all of which were impossible before.

On Google Plus I can socialize with someone who doesn’t even speak my language as if they were just another friend. On a place like Reddit, I can sign up for an account with a minimum amount of detail and instantly have a brand new persona, history-free. A new start on digital life or a way to separate my real life from my false one. Lately we’ve seen groups of users harness the concept of an anonymous mob into the form of a powerful, anti-corporate vigilante.

The arguments over who you are and the ethics of purposefully shedding your “true” identity are of primary importance to how we interact digitally. On the front lines of this at the moment are Google, its new social network Google Plus, and a vocal group of users.

In this context, there are essentially four different kinds of identity within the digital world:

1. Named Persons – this would be a direct mapping of a real-world person to their online presence. The level of detail given might be extremely minimal, gratuitously unnecessary or anywhere in between. An offline parallel would be a driver’s license or a passport.

2. Pseudonyms – an identity taken up that is usually intentionally separated from the offline person. These are useful when discussing sensitive topics, protecting identity or just providing a new outlet of expression wholly unattached to an offline history. Pseudonyms generally intend to have their own, separate history, almost like the creation of a new person. In the offline world, these are often pen names, dummy corporations or front organizations.

3. Anonymous Personas – unlike named persons, anonymity means no tie to an offline personality, but unlike a pseudonym there is also no intent to establish a new personality or even have a history. Anonymous could be anyone. Websites like 4chan demonstrate what anonymous communities with an extended lifetime can develop into. Offline examples include symbolic uses, anonymous donation, and protection of sources in media.

4. Non-humans, Bots, or Automated processes – In this context, these are programs that are written in order to accomplish tasks that are either well-defined, repetitive or technological. In most cases within this context, these programs are made to look like humans. An offline example that I found yesterday is this actroid, which at the moment is better at acting human than performing tasks. Automated technology is certainly improving, though. Meanwhile AI is a little behind.

What does any of this have to do with SEO? I think identity is important to SEO for two reasons: search indexing and search quality.

If you didn’t know already, web crawlers index the web for search engines. Since the web is so big, automating the job of indexing what’s out there makes sense. These web crawlers are part of group #4, especially in that they are an attempt to mimic human behavior: how would a human navigate the page and how would a human understand what the page is all about. In this context web forms are an excellent illustration of the problem behind this methodology. Bots don’t have personal data. They aren’t human. So how can a search engine truly index the content on the web that requires human information or personal data? Back in 2008 Google announced it was experimenting with forms, but we haven’t heard much about this since.

One cool side note for technical SEOs: you can now fetch your website *as* Googlebot to see what your website looks like to the crawler.

But the real interest here is search quality. Search engines have a vested interest in identifying the four identity types above.

The most obvious of these is #4 again. Generally speaking, if people can write bots that look like humans, they can create a false market of demand for content. If the search engines don’t identify these false markets, their search results could be things that aren’t what people really want, or simply biased toward one product or service.

At the moment, however, Google is particularly interested in identifying the difference between the first two: named persons and pseudonyms. Google has stated they are an identity service and that Google “works best in an identified state”. Twitter, Facebook and other networks have also encountered this problem, especially with celebrity impersonators, often resorting to “verified accounts”.

But I think Google has a bigger problem on its hands that it isn’t discussing openly. My belief is that Google wants to use individual opinions as a major search ranking signal. This could significantly help with search quality: if real humans are helping shape the results in addition to bots, we can index the web much better (even the stuff behind forms!). Individuals are becoming more prominent on the web (think about blogs becoming as popular as commercial websites). If links from websites were a way of understanding value in the last decade, links (or referrals) from individuals could become a big part of understanding value during the next decade.

But what if people make TWO profiles? Or five? Or 500? Will they have a greater voice? If Google allows pseudonyms it has to account for this problem. Google has already dealt with this problem with the proliferation of websites to create links. I don’t think they want to deal with it again.

It seems to me pseudonyms, like websites, could have their own PageRank. Who cares if it’s a pseudonym? What really matters is whether or not that pseudonym is human.
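Here is a rough sketch of what a PageRank for profiles might look like, and of why the 500-profiles question matters. It is a plain textbook power-iteration PageRank run over a made-up endorsement graph. The profile names, the `endorsements` helper and the damping value of 0.85 are all assumptions for illustration; nothing here reflects how Google actually scores people.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over {profile: [profiles it endorses]}."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every profile starts each round with the same small base score.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A profile that endorses nobody spreads its score evenly.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


def endorsements(num_puppets):
    """Hypothetical graph: three people endorsing each other, plus one
    spammer whose sock-puppet profiles all endorse the spammer."""
    links = {
        "alice": ["bob", "carol"],
        "bob": ["alice", "carol"],
        "carol": ["alice"],
        "spammer": [],
    }
    for i in range(num_puppets):
        links[f"puppet{i}"] = ["spammer"]
    return links


for puppets in (0, 5, 500):
    score = pagerank(endorsements(puppets))["spammer"]
    print(puppets, "puppets -> spammer score", round(score, 3))
```

In this naive version every new profile arrives with a little base score of its own, so piling up fake profiles does buy the spammer a greater voice. That is exactly why the score alone is not enough, and why knowing whether a profile is human matters.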

