# Pros & Cons Of Using Google Totals?

Continuing from here:

Bokeh: All those occur on both sides of the equation. In maths/statistics, when that happens, such effects are considered cancelled.
How do you know they occur equally on both sides, out of interest?

Oh, come on MrP, there was no need to open a thread! Well, if you want and you're interested in discussing this... I guess it's ok. It's just that the title seems a little strange to me: Pros and cons of using Google totals. What do you mean by totals? Using the number of results Google gives on the first page? If so, then that has no pros at all, in my opinion, and I told you why. Just look at that example I gave earlier: if you don't check the real number of pages, you'll see that one string gets 3,380 hits and the other 94,400 and you'll think that the latter is more common on the net. There are a lot of examples like that one, just try some searches.

As you said, we need to limit results in order to be able to check them. Oh, ok, I'll give a little example of what I do. The first one who starts laughing is a dead man, is that clear? LOL
Us girls are... or We girls are... Hmmm:
"us girls are" site:www.myspace.com ---> 1,820 results indicated on the first page --> 69 real results
"we girls are" site:www.myspace.com ---> 139 results indicated on the first page --> 20 real results
So "us girls are" seems 13 times more common according to the data on the first page, but it is actually 3.5 times more common.
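The arithmetic behind that comparison can be sketched in a few lines of Python. The counts are the ones quoted above; the names `counts` and `ratio` are only illustrative.

```python
# Counts quoted above for the site:www.myspace.com searches.
# "estimated" = figure shown on the first results page,
# "real" = results actually listed when you page through to the end.
counts = {
    "us girls are": {"estimated": 1820, "real": 69},
    "we girls are": {"estimated": 139, "real": 20},
}

def ratio(kind):
    """How many times more common "us girls are" looks than "we girls are"."""
    return counts["us girls are"][kind] / counts["we girls are"][kind]

estimated_ratio = ratio("estimated")
real_ratio = ratio("real")

print(f"first-page ratio: {estimated_ratio:.1f}")  # first-page ratio: 13.1
print(f"real ratio: {real_ratio:.2f}")             # real ratio: 3.45
```

Even the "real" ratio is only as good as the pages behind it, so it still needs the manual check described next.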

Now, what can you say about those real results? Well, you should check them. Not all of them, but when there aren't too many it is easier to take a quick look and see if most of them are relevant. Then, you should check some pages to see the context they are in. For example, in blogs, it's a good thing to check the profile and see if that kind of English could come from a native speaker. I've found a lot of weird things on myspace, but I then realized they came mainly from non-natives.

Conclusion: I don't know if there is anything to conclude, but it's clear that there's a lot of native speakers who would naturally say "...hope you boys show up in your big boy pants, cause us girls are gonna bring the house down!"

Freakin sweeeeet...
You don't. We are not talking real maths here, just probability. What you do know is that the same algorithm is applied to both sides of the equation/search. That means both sides are subject to the same juxtapositions, disregard for punctuation and case, etc. Obviously certain specific strings may have some amount of bias but, used in context as an indicator, Google is a useful tool. What else is there that takes in such a diverse range?
I don't deny that it's useful; but if one string in a comparison has greater potential for ambiguity, as in this case, it's very difficult to see how we can justify a figure as precise as 38%.

For instance, on the first page for "us three are", I find:

1 case where the occurrence is only in a link

5 cases where "us" = "U.S."

3 cases where "us" is indeed used for "we".
In the first page for "we three are", on the other hand, I find 10 cases where "we three are" means exactly that. (Remarkably, the result is the same for page 2 in each case.)

If we assume that this is a fair representation of the distribution, then allowing for the fact that I only show 18100 hits for "we three are" and 12100 for "us three are", and excluding the autoreferential hit, though including the link, genuine cases of "us three are" amount to only 16% of the total, rather than the 40% that the "it all evens out" method would derive from the same figures.
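As a rough check on that adjustment, here is a minimal Python sketch. It assumes the first-page sample is scaled up to the full totals, and it assumes 3 genuine hits out of 10 for "us three are" and 10 out of 10 for "we three are"; that is one reading of the counts above, and the function name `genuine_share` is made up for illustration.

```python
def genuine_share(hits_a, genuine_frac_a, hits_b, genuine_frac_b):
    """Share of string A among the genuine uses of A and B combined."""
    genuine_a = hits_a * genuine_frac_a
    genuine_b = hits_b * genuine_frac_b
    return genuine_a / (genuine_a + genuine_b)

US_HITS = 12100  # total shown for "us three are"
WE_HITS = 18100  # total shown for "we three are"

# "It all evens out": take both totals at face value.
raw = genuine_share(US_HITS, 1.0, WE_HITS, 1.0)

# Assumption: 3 of 10 sampled hits for "us three are" are genuine,
# and all 10 of 10 for "we three are".
adjusted = genuine_share(US_HITS, 3 / 10, WE_HITS, 10 / 10)

print(f"face value: {raw:.1%}")      # face value: 40.1%
print(f"adjusted:   {adjusted:.1%}")  # adjusted:   16.7%
```

Under these assumptions the adjusted share lands close to the 16% quoted above; treating the sample as 3 genuine hits out of 9 instead would give roughly 18%. Either way it is far below the face-value 40%.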

Interestingly, "us three are" site:www.EnglishForward.com returns 2 different occurrences, while "we three are" site:www.EnglishForward.com returns 2 versions of the same occurrence. So even in the microcosm of our original thread, the googles are erratic.

As I say, I don't deny that it's a useful tool; but we can't take the totals at face value.

MrP
Bokeh: You don't. We are not talking real maths here, just probability.
Uh? But you are not comparing probabilities, you are comparing results. If two random variables have the same probability distribution, it doesn't mean they are going to take the same values at the same time...

Bokeh: What you do know is the same algorithm is applied to both sides of the equation/search.
Yep, that must be true but, as I told you, if you try some searches, you'll notice that the number of results on the first page is definitely not a linear function of the real results. In fact, you'll notice that it is not even a function of the real results alone. The number shown on the first page must depend on a lot of things (could be the number of links to certain pages, page rankings, type of websites, etc.), not only on the number of real results. It is probably a function of all those variables [f(x1, x2, x3, ..., xn)], not a function of the number of real results [f(nr)]. And even if it were, it would definitely be non-linear, so it would be difficult to understand and compare the results unless you knew the function. And if you wanted to find that non-linear function, you could draw a graph of real results vs estimated results, but then you'd have no way to go on once you reached 999 (Google only shows 999 real results at most). So "big numbers" would make no sense at all in any case.
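One rough way to see this, using only the two myspace searches quoted earlier in the thread: if the first-page figure were simply proportional to the real count, dividing one by the other would give about the same factor for every query. A minimal Python sketch (the names `searches` and `inflation` are illustrative):

```python
# (first-page estimate, real results) for the two myspace searches quoted earlier.
searches = {
    "us girls are": (1820, 69),
    "we girls are": (139, 20),
}

# How many times larger the first-page estimate is than the real count.
inflation = {query: est / real for query, (est, real) in searches.items()}

for query, factor in inflation.items():
    print(f'"{query}": estimate is {factor:.2f}x the real count')
```

The two factors differ by a wide margin. Two data points can't tell us what the relationship actually is, only that a single constant multiplier doesn't fit them.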
How do you know they occur equally on both sides, out of interest?
I think Bokeh is saying that it's like throwing dice. You can throw this pair of dice or that pair of dice, but if both pairs are fair dice, the probability distribution for both situations is the same. Even if they're crooked in the same way, the distributions will be the same.

Likewise, in some strange but not exactly equivalent way, for lookups of "I is" vs. "I am". Whatever 'unfairness' is built into the one search is built into the other search.

A business manager once said to me that he didn't care if the cost figures weren't correct, as long as they were incorrect in the same way this year as they were last year.

We don't know the exact number of stars in the universe, but compared to planets visible from earth, "lots" is a good enough estimate. That even includes the case where we take the controversy surrounding Pluto into account.

And anyway, I've learned tons of idiomatic Spanish and French by Googling to see which of two of my guesses is 'correct' for how to express some thought or another. I've checked my results with native speakers, and 95% of the time there was nothing about using Google that threw me off the correct path.

Obviously, for situations where fine distinctions are needed, like exactly how many angels can dance on this pinhead vs. that pinhead, the noise level is greater than the signal, and all bets are off -- probably.

CJ
Hi,

I do not know much about probability et cetera, but I do think Google provides an unprecedented corpus. The question is whether it serves the purpose, that is, leading the searcher to the intended destination. When I carry out a search for a phrase, I generally double-check the result with NYTimes or bbc.co.uk. Sometimes the result is consistent, sometimes not.

"the edge of the precipice" = 40.300 (google) // 9 (nyt) // 50 (bbc)

"the rim of the precipice" = 783 (google) // 0 (nyt) // 0 (bbc)

"the verge of the precipice" = 4.630 (google) // 0 (nyt) // 0 (bbc)

"dull sound" = 44.100 (google) // 6 (nyt) // 9 (bbc)

"drab sound" = 196 (google) // 0 (nyt) // 1 (bbc)

"monotonous sound" = 14.300 (google) // 1 (nyt) // 4 (bbc)

I take a tentative approach towards Google results. More examples can be given, but the numbers above show that restricted and non-restricted results are not always consistent.
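One way to put those figures side by side is to turn each count into a share of its own group, per source. The Python sketch below uses only the numbers listed above and assumes the dots in the Google figures are thousands separators (40.300 = 40300); the names `groups` and `shares` are illustrative.

```python
# Hit counts quoted above as (google, nyt, bbc).
# Assumption: "40.300" etc. use "." as a thousands separator.
groups = {
    "precipice": {
        "the edge of the precipice": (40300, 9, 50),
        "the rim of the precipice": (783, 0, 0),
        "the verge of the precipice": (4630, 0, 0),
    },
    "sound": {
        "dull sound": (44100, 6, 9),
        "drab sound": (196, 0, 1),
        "monotonous sound": (14300, 1, 4),
    },
}
SOURCES = ("google", "nyt", "bbc")

def shares(group):
    """Each phrase's share of its group's hits, computed separately per source."""
    totals = [sum(hits[i] for hits in group.values()) for i in range(len(SOURCES))]
    return {
        phrase: tuple(h / t for h, t in zip(hits, totals))
        for phrase, hits in group.items()
    }

for name, group in groups.items():
    for phrase, row in shares(group).items():
        cells = ", ".join(f"{src} {val:.0%}" for src, val in zip(SOURCES, row))
        print(f"{phrase}: {cells}")
```

With these numbers the most frequent phrase in each group is the same in all three sources, but the shares differ, and the small NYT and BBC counts mean one or two hits can move a share a long way.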
Linguaphile: I generally double-check the result with NYTimes or bbc.co.uk.
That's a fair approach if you are looking for prescriptive grammar or correctness, but not if you are looking for the frequency of common "errors".
CalifJim: You can throw this pair of dice or that pair of dice, but if both pairs are fair dice, the probability distribution for both situations is the same. Even if they're crooked in the same way, the distributions will be the same.

I think with text strings, though, there are factors that we wouldn't find with dice – two strings may not have equal potential for ambiguity, for instance ("we three are" vs "us/U.S. three are").

Or to take another example: in the case of "if it were" (14.9m) vs "if it was" (22m), we would not know without checking how many hits for the latter were non-counterfactual (e.g. "Not sure if it was reading that or my hardcore revision for my swiftly approaching politics exam, but yeah, I have a headache"). There isn't an equivalent ambiguity in "if it were", however.

MrP

