
How many fish in the pond?

Each day, the 2 guestbooks I'm responsible for receive 4 to 15 spam entries, advertising such needful things as rolex replicas, gold coins, ringtones; amitriptyline, atenolol, valium, tamiflu, cyclobenzaprine, zoloft, clonazepam, codeine, alprazolam, furosemide, trazodone, even methadone and morphine, of course viagra; girls (latinsex included... whatever it is), Russian women; not to forget apartments in Bologna, Milano, and Udine; antique oriental furniture; una bicicletta elettrica, porno videos, lingerie; and most recently a 1,000-watt car amplifier (just to mention the least monotonous stuff).
What can a poor webmaster do? Fortunately, I keep a copy of the data files on my hard disk, and since these files are rarely changed by ‘regular’ entries (3-5 per year), I start each day by simply replacing the new data files infested by spam entries with the saved ones, which takes just a few seconds. I could have been happy with that, but one day one of these impudent spammers added the line: "To the guestbook! Muhahahahahaha!" That was too much for me. I decided to take stronger action against these transgressions!
By comparing the exact times of the inappropriate guestbook entries with the web server's log statistics, I identified the sending IP addresses and added them to the .htaccess file guarding the respective guestbook site, after the words "deny from" (I really denied the littering of ‘my’ guestbooks to these villains, from the bottom of my heart!). After 13 days, I had collected 216 black sheep on the two blacklists together (interestingly, 19 of them appeared on both lists). To my disappointment, however, both books still received an unabating number of spam entries each day.
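The matching of entry times against the log can be sketched in a few lines of Python. This is only an illustration under assumptions: it supposes the server writes Apache's Common Log Format, that the guestbook script lives at the hypothetical path /cgi-bin/guestbook.cgi, and that a 5-second window is tight enough. The function name ips_near and the sample log lines (with IP addresses from the reserved documentation ranges) are made up for this example.

```python
import re
from datetime import datetime, timedelta, timezone

# Start of a Common Log Format line: IP, two ignored fields,
# [timestamp], then "METHOD URL ..." (the log format is an assumption).
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)')


def ips_near(log_lines, entry_time, path="/cgi-bin/guestbook.cgi",
             window_seconds=5):
    """Return the IPs that POSTed to `path` within the window around `entry_time`."""
    window = timedelta(seconds=window_seconds)
    hits = set()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not look like log entries
        ip, stamp, method, url = match.groups()
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        if (method == "POST" and url.startswith(path)
                and abs(when - entry_time) <= window):
            hits.add(ip)
    return hits


# Invented sample data, not taken from any real log.
sample_log = [
    '192.0.2.7 - - [14/Nov/2006:03:12:41 +0100] "POST /cgi-bin/guestbook.cgi HTTP/1.0" 200 512',
    '198.51.100.3 - - [14/Nov/2006:03:12:45 +0100] "GET /index.html HTTP/1.0" 200 2048',
    '203.0.113.9 - - [14/Nov/2006:09:30:00 +0100] "POST /cgi-bin/guestbook.cgi HTTP/1.0" 200 512',
]
spam_time = datetime(2006, 11, 14, 3, 12, 40, tzinfo=timezone(timedelta(hours=1)))
suspects = ips_near(sample_log, spam_time)
print(sorted(suspects))
```

Each address found this way would then go on its own "deny from" line in the .htaccess file.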
With this frustrating result, I could have given up, but these numbers continued to intrigue me. Why did I find 19 IP addresses on both lists? Since both lists were roughly the same size, each of them might be regarded as a random sample of size 108. Let's imagine a pond with an unknown number of fish (this method is widely used to monitor population densities in the wild, see Pollock et al. 2002). Let's further draw (e.g. with a net) a first random sample of 108 fish. This number alone still doesn't say much about the total number of fish roaming these waters. But let's now assume further that I mark these fish and throw them back into the pond. The next day I again draw a sample of 108. This time my sample contains 19 marked and 89 unmarked fish. What might we think now about the total number?
We might now imagine that we have kind of ‘diluted’ the sample of marked fish in the greater (unknown) reservoir of unmarked ones. The ‘degree of dilution’ can be calculated by relating the number of marked fish found in the 2nd catch (19) to the total size of that catch (108). Since we know the total number of marked fish (108), we can calculate the total number of fish in the pond by multiplying 108 by the ‘dilution factor’ (108/19), which gives roughly 614. This was good news for me. While it might appear fruitless to embark on setting up a blacklist with several thousand entries, the number 614 was a glimpse of light at the end of the tunnel. So I kept on collecting the sinners, until...
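The ‘dilution’ calculation is what the wildlife literature usually calls the Lincoln-Petersen estimator. A minimal sketch, with function and parameter names chosen only for this example:

```python
def estimate_population(marked_first, caught_second, recaptured):
    """Mark-recapture estimate: marked fish times the 'dilution factor'."""
    if recaptured == 0:
        raise ValueError("no marked fish recaptured; the estimate is undefined")
    return marked_first * caught_second / recaptured


# 108 marked fish, a second catch of 108, of which 19 carry a mark.
estimate = estimate_population(108, 108, 19)
print(round(estimate))  # 614
```

The estimate only holds if every fish is equally likely to end up in the net on both days.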
... until I found out that many of these sinners don't give a damn about my "deny from". They simply continue to leave their garbage every 2nd or 3rd day in at least one of these guestbooks. Apparently, they can do that even without accessing the web site at all. Not only can I not prevent them from doing so; my nice calculation above is also simply wrong, since it was based (to a still unknown, but most likely large extent) on habitual sinners. To stay with the picture of the fish in the pond: My pond contains 19 fish that jump into my net the moment they see it. These 19 will always contribute to my catch, and they will always escape to come again the next day. The rest I have never seen twice.
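How badly such eager fish distort the estimate can be shown with a small simulation. It is a sketch under invented assumptions: 19 habitual fish land in every catch, the remaining 89 fish of each catch are drawn at random from everyone else, and the pond sizes tried below are arbitrary.

```python
import random


def simulate(true_population, habitual=19, sample_size=108, seed=0):
    """Draw two catches that both contain the habitual fish; return
    (number recaptured, naive mark-recapture estimate)."""
    rng = random.Random(seed)
    regulars = set(range(habitual))            # jump into the net every time
    others = range(habitual, true_population)  # caught purely by chance

    def catch():
        return regulars | set(rng.sample(others, sample_size - habitual))

    first, second = catch(), catch()
    recaptured = len(first & second)           # at least `habitual`
    return recaptured, sample_size * sample_size / recaptured


# Hypothetical pond sizes; the true number of spammers is unknown.
for pond in (5_000, 500_000):
    recaptured, estimate = simulate(pond)
    print(pond, recaptured, round(estimate))
```

Because the 19 regulars turn up in both catches whatever the size of the pond, the naive estimate can never exceed about 614, even when the true number is hundreds of times larger.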
We will have a new homepage soon. Including a new guestbook. A real good one. At least that's what I've been told. In the meantime, I start each day by simply replacing the new data files infested by spam entries with the saved ones, which takes just a few seconds.
Kenneth H. Pollock, James D. Nichols, Theodore R. Simons, George L. Farnsworth, Larissa L. Bailey, John R. Sauer (2002) Large scale wildlife monitoring studies: statistical methods for design and analysis. Environmetrics 13: 105-119.
Latest news: Following valuable suggestions, I changed the name of the guestbook.cgi file. The unwanted entries stopped immediately. My blacklist will not grow further (at least for now...).
MB (11/06)