HomeSite Help - Site Blog
2002-07-29

Sidetracked

No, I haven't stopped with my log file analysis in order to set up server-side UA sniffing. I got sidetracked by some other problems that needed immediate attention. All solved or worked around now, and if you read on, you'll see I have actually come full circle.

Eudora blowing chunks

The first thing that happened was that Eudora started blowing chunks. It seems that Eudora (at least version 4.3.2, which is what I'm using) gets into trouble when a folder has more than 32,000 records (emails) or so, and it needs to put more in it as a result of a filter. Every time it processed mails, it would tell me the index had become corrupt, and would I like to rebuild it? Yes, go ahead. This always fixed things for that session, but the next time it would happen again. And the rebuilding took quite a bit of time, too. Obviously, I would have to split up that folder or get rid of some stuff.

But why did I have such a huge folder? you'll ask. On the server I've set up a few PHP routines that handle every external link when it's clicked: one counts each activated link per page and stores the result in a database so I get an idea of which links on which pages are popular. Another actually checks the target URL, just like a link checker; if there's a problem, I automatically get an email, and in some cases you get an error page, informing you that I've been notified. When there's no problem, it's all completely transparent, you'll just be redirected to the link target; and I get to know "dynamically" which links have problems so I can try to repair them. And for those problem mails I had set up a filter in Eudora that would automatically put them in a special folder.

That's the theory. I soon found that some servers are cheating a bit; if a page no longer exists they don't send you a '404' result, but instead a '302 Moved Temporarily' to a page explaining the error, or even just to the home page.
Of course, a '302' result normally means no trouble, so you won't get an error page when you try such a link (but you'll wonder what happened when you get somewhere else than expected!); but with such 'irregular' results being returned I did have to get the program to send me an email anyway, so I could check whether that server was cheating and there was a problem after all. Unfortunately, that increased the number of mails a lot, since some links (such as those to Amazon pages) will always result in a 302 redirect.

Then I had also tried to make the actual link code shorter, by creating a virtual directory "go" for the link, and using Apache mod_rewrite to point to the actual PHP routine. A logical (but unexpected) result was that now all those links looked like internal links to all the bots, including Atomz which re-indexes the whole site every week. The compound result was more than 700 emails per week. At first I thought this was handy: Atomz was now working for me as a link checker as well. Then other robots started finding The Bookstore and following those links too, resulting in well over a thousand emails every week. Right - in no time at all, that apparently magic limit of 32,000 emails per folder had been reached and it wasn't "handy" anymore.

The first emergency measure was to add the virtual "go" directory to the robots.txt file as forbidden for all bots. That stopped Atomz and the search engine bots from following those external links. Still, you clicked on links, and I continued to get more mails, and there were (by now) 33,000 emails in that folder, needing a rebuild every session.

So the next step was to split up that folder into separate ones: creating temporary filters to sort out the existing mails, and a bunch of new ones to guide new mails to the new folders. Everything that's normally no problem (like the Amazon redirects) into separate folders, and special errors into folders by error code, and the rest into a "check this" folder. Eudora was happy again.
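As an aside, the robots.txt measure can be checked with a few lines of code. This is only a sketch in Python (not the PHP used on this site): the exact rule text and the path "/go/" are assumptions based on the virtual "go" directory described above, and "ExampleBot" is a made-up robot name.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: the source only says the virtual "go"
# directory was made forbidden for all bots, not what the file looked like.
ROBOTS_TXT = """\
User-agent: *
Disallow: /go/
"""


def may_fetch(path, user_agent="ExampleBot"):
    """Return True if a robot honoring ROBOTS_TXT may request `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    print(may_fetch("/go/12345"))    # a redirect link under the "go" directory
    print(may_fetch("/index.html"))  # an ordinary page
```

Of course this only describes what a polite robot would do; a robot that ignores robots.txt will still follow the links.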
And now I could see what problems I actually had... as opposed to Eudora, I was far from happy.

Darned Book Publishers

The decent thing to do when you move content to somewhere else on your server, is to make sure the old location results in a '301 Moved Permanently' redirect. In practice, very few webmasters are 'decent' enough to do this - and they're losing visitors as a result. Worse, it seems that all book publishers' sites not only regularly do huge reorganizations, but also never leave a clue where the content has gone. Instead, they all seem to prefer to set up a '302 Moved Temporarily' to an error page, instead of giving a '404 Not Found'. Not nice.

Now that all problems were neatly sorted into separate folders, I found they had been at it again. Content gone missing all over the place, and many "Background" links in The Bookstore not going anywhere sensible. While I managed to find most of it back, it took a lot of searching and poking around at a lot of publishers' sites. Which took a lot of time. Grrr.

Don't do that guys! If you move stuff, leave a note in the form of a '301' redirect pointing to the new location! It isn't rocket science. And if you can't do it server-side, at least use a META refresh header in a document at the old location, pointing to the new location. Don't lose links! Don't lose people linking to you!

Microsoft

I also have a few links to Microsoft: also notorious for moving things around without telling you. But that wasn't the problem this time. At first, the results varied: I'd get 404, or 302, or even 500 ("server error") results. And sometimes the links actually worked. This had been going on for a long time, but nothing I tried would consistently solve the problem. Now that I had all problem emails neatly sorted, I found that lately Microsoft had become consistent: now, I would always get a '400 Bad Request' error.
Of course, the request never was "bad" at all; my routines produced syntactically correct HTTP 1.1 headers; they work with all other servers, too, only not with Microsoft (or rather, the MSDN section of the Microsoft site). Why was Microsoft telling me the request was bad when in reality it wasn't? Sometimes a HEAD request will yield a different header than a GET request, but I'd solved that already by trying a second pass using GET in case HEAD was behaving unexpectedly. Microsoft remained stubborn. I tried to reproduce it with some local tools, and could not: tested locally, Microsoft returned '200 OK', but my server-side routine would always result in '400 Bad Request'.... Why?

I decided to let it rest for a while; let the back of my mind puzzle over it while I continued the analysis of my server logs. Well, it seems those links are fairly popular; I got quite a few problem emails, and they consistently told me Microsoft was returning '400' for all those links. And the mails were telling me that you weren't getting to the Microsoft pages I'm linking to.

So I tried again. Maybe it wants some extra headers? Add an Accept header (Accept: */*) - no difference. Add a port to the Host header - no difference. Add a User-Agent header (User-Agent: PHP/3.0.12), different user agents - no difference. All extended headers still worked with all other servers, just not with Microsoft's MSDN server. All these experiments required some debugging, and links not working at all, every now and then. I did my experiments when it's supposed to be quiet, but some of you may have noticed weird results for a while. Sorry about that!

At the receiving end?

I ended up with only one possible conclusion: Microsoft's MSDN server, for some unknown reason, is doing its own server-side "UA sniffing" and refusing access from my server. I have no idea why, though. And it's only a theory, because I cannot think of anything else: the requests are not bad, even though Microsoft is telling me so.
It seems as if that server is simply sending '400 Bad Request' instead of '403 Forbidden' since the exact same request works from my local machine. It's a theory that I could do some more work to prove, but for now I'm working on that assumption. So now, apparently, I find myself at the receiving end of "UA sniffing". Ironic. Anyway, instead of giving you an error page when this happens, I
just redirect to the actual page; but I continue to send myself the problem mails - if only to remind myself that I need to do some more work
testing the theory. Everything back to normal for all of you - but I still have a niggling little problem...
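For readers who'd like to see the shape of such a link-checking routine, here is a minimal sketch in Python (my own routines are PHP and are not shown here). The function names, the verdict labels and the exact policy per status code are assumptions made for illustration; only the "HEAD first, then a second pass with GET" idea and the special treatment of 302 come from the story above. The `fetch` argument is a stand-in for whatever actually performs the HTTP request, so the logic can be tried without touching the network.

```python
def check_link(url, fetch):
    """Check a link target and decide what to do with the result.

    `fetch(method, url)` is a caller-supplied function returning the
    numeric HTTP status; it is a hypothetical hook, not a library call.
    """
    status = fetch("HEAD", url)
    if status != 200:
        # Some servers answer HEAD differently from GET: second pass.
        status = fetch("GET", url)

    if status == 200:
        verdict, notify = "ok", False
    elif status == 302:
        # Normally harmless, but may hide a missing page: let the
        # visitor through and send a mail so it can be checked by hand.
        verdict, notify = "suspect", True
    else:
        verdict, notify = "problem", True
    return {"status": status, "verdict": verdict, "notify": notify}


if __name__ == "__main__":
    # A fake fetcher: HEAD misbehaves, GET works.
    def fake_fetch(method, url):
        return 500 if method == "HEAD" else 200

    print(check_link("http://www.example.com/", fake_fetch))
```

What to do with a "problem" verdict (error page, or redirect anyway as I now do for the MSDN links) is a separate decision that this sketch leaves out.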
2002-07-12

UA sniffing: Strategy 1

Here's the promised sequel to my Chain Reactions posts - or at least the first installment of that. The first thing to determine about server-side UA sniffing is which User Agents to recognize and handle by their UA string, and which ones to handle by IP address.

End-user tools

The principle is simple: clients that are run by an end user on their own machine cannot be recognized by IP address: each person normally has a different one. Worse, we should even be careful not to completely block the IP address of someone using a nasty bot like a spambot, because with some (large) access providers the addresses may be assigned out of a pool; this implies two things: consecutive accesses may come from different addresses, and the same address may later be (re)used by a completely innocent person. This effect of different addresses for consecutive accesses can be easily recognized in our log files. So for end-user run tools, our only option is to use the User Agent string to recognize them. This class includes:
So, we'll need to gather the strings by which we can recognize these tools for what they are; more about that later.

Server-run clients

There's a host of these, too. The most obvious ones are the search engine robots that crawl the web to build an index. But not all crawlers are run by search engines; and not all crawlers are even run by their developers. And there are server-side services that can be run by end users, for instance validation tools and link checkers. For the latter class it's fairly safe to recognize them by a User Agent string.

But if we even consider serving something slightly different to a search engine robot than to a normal browser, we must recognize them by IP address, since they may sometimes use "normal" browser UA strings to check whether you're "cloaking". Now "cloaking" can mean anything from serving completely targeted pages to serving a slightly different version, just like you may serve a slightly different version dependent on a user's browser. There's a reverse to that: if your "cloaking" leads to a very good position in a search engine, your competitors may want to check what you're feeding the search engine by faking a search engine's UA string; they won't succeed with that if you recognize the search engines by their address and base your content on that. So search engine robots must be recognized by IP address.

An extra wrinkle here is formed by the translation services: they serve their translated result to the user's browser, so here the content should be exactly as for a browser run by an end user. But these services may be run by the search engines themselves, so we'll need to exclude them from the special treatment for the search engines.

Spambots, first line of defense

I've already mentioned a number of known spambots that actually visited my server. I've found a few more, thanks to these two sites:
My treatment will be simple: I'll recognize the known (behaving as) spambots in my Apache configuration; but instead of blocking them, I'll just serve them a page without any email address. This takes care of possible innocent use of these user agents (that I haven't observed before). So I'll just use Apache to recognize them and set an environment variable; and then use PHP to detect that variable and treat email addresses accordingly.

With the list of known spambots (that actually show this behavior) I've mentioned earlier, and the ones newly discovered, I've come up with the following bit of Apache code:
The last two match all possible patterns for the DSurf15a 01 variants mentioned by Neil Gunton and found elsewhere. Since I don't really block anything, I could actually add the UA strings for the clients that don't always show spambot behavior (all they'll miss is the email addresses), but I won't do that for now; I prefer to watch them for a while and I can always add them later. For PHP I wrote a little function that either just shows text, or builds an email link, dependent on the setting of the spambot environment variable. It handles email address and linktext, and can add a title attribute and an image as well. I never use an image alone but you can easily adapt the code if you want to:
This code goes in a file in a directory that's always searched for includes, and it's included in each document at the top with: require COMMON_LIB.'maillink.php3';
Now I can build an email link with a call like this (assuming $address already has a value): maillink($address,"Marjolein Katsma");
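To make the idea concrete, here is a rough sketch of such a function in Python. It is an illustration only, not the PHP function described above: the parameter names, the `is_spambot` flag and the environment variable name "spambot" are assumptions, and the optional image is left out.

```python
from html import escape


def spambot_flag(environ):
    """Assumed convention: the web server sets a variable named
    "spambot" (hypothetical name) when the UA matched a known harvester."""
    return "spambot" in environ


def maillink(address, linktext, is_spambot=False, title=None):
    """Return an email link, or just the link text for suspected spambots."""
    if is_spambot:
        # No address anywhere in the output: nothing to harvest.
        return escape(linktext)
    title_attr = f' title="{escape(title)}"' if title else ""
    return (
        f'<a href="mailto:{escape(address)}"{title_attr}>'
        f"{escape(linktext)}</a>"
    )


if __name__ == "__main__":
    address = "someone@example.com"  # placeholder address
    print(maillink(address, "Marjolein Katsma"))
    print(maillink(address, "Marjolein Katsma", is_spambot=True))
```

The point of the design is that the decision is made in one place: every page calls the same function, and only the flag decides whether an address ever reaches the output.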
The last task is to replace every email link with an appropriate call to the maillink function; that took some extra effort to standardize code and hunt down all email addresses (I even found a few that should have been replaced long ago!); all in all a few hours' work. These spambots can come - they won't find anything to their liking anymore!

Testing the spambot solution

All well and good - but does this actually work? There are two possible approaches to testing this:
The first approach has two disadvantages: completely innocent visitors that happen to have the matched string will get the doctored version, and it's also easy to forget to remove the temporary string: it's error-prone and might hinder passers-by. The second approach needs a tool that makes it easy to send a completely user-defined UA string to the server. As it happens, I have a tool that can do just that: The Proxomitron is a very powerful 'Universal Web Filter'; it sits as a local proxy server between the browser and its outside connection (which can be another proxy if necessary) and uses a match-and-replace engine to make replacements in incoming and outgoing HTTP headers as well as web pages. And since it can replace outgoing headers, that includes the browser UA string - which comes in quite handy when testing our spambot defeating routine. Go get it and try it out: this amazingly powerful program is completely free and comes with a large number of pre-built filters.

And the rest?

We'll still need to gather information about all the other user agents, both end-user tools and server-side tools. Tomorrow (I hope) I'll
explain how I'm doing that; the log file analyses I made, and the log files themselves, play a big role, of course. And I'll mention
another tool and give more useful links.
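If you don't have a header-rewriting proxy at hand, the same test (send a user-defined UA string and look at what comes back) can be scripted. A minimal sketch in Python, assuming nothing about the server: the URL is a placeholder and "EmailSiphon" is just one of the harvester names mentioned in these posts.

```python
from urllib.request import Request, urlopen


def build_test_request(url, user_agent):
    """Build a request that announces itself with the given UA string."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_as(url, user_agent):
    """Fetch `url` pretending to be `user_agent`; returns the page text."""
    with urlopen(build_test_request(url, user_agent)) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    req = build_test_request("http://www.example.com/", "EmailSiphon")
    print(req.get_header("User-agent"))
    # To test for real, compare the page served to a faked spambot with
    # the page served to a normal browser UA, e.g.:
    #   "mailto:" in fetch_as(url, "EmailSiphon")
```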
2002-07-07

Chain reactions 2

Server logs

Now that was a mini chain reaction all by itself. I've always downloaded my server logs, but manually, and irregularly. Stored them initially on a JAZ disk - until smoke came out of it one day (I thought it was the power supply but when I finally got a new one, I found it was the drive - maybe the drive as well). How to rescue the data on my JAZ disks? I really needed some of that - not just the log files. Not many JAZ drives around any more, but after exploring a lot of (non) possibilities, it turned out that buying a new one was actually the cheapest way of getting at my data. Sigh. Amazingly, the disk that had been in the drive when all the smoke came out turned out to be quite readable.

So, put all of my log files together. Eliminate the overlaps. One garbage section found (must have been smoke damage), so delete that. Now we're in business. (I thought.)

Email harvesters (AKA spambots)

I really want to stop those getting at email addresses - it's bad enough that some old addresses are constantly getting spam. Consider this: two addresses keep popping up in pairs (the same spam from the same source sent to both addresses at the same time, often the same second). Not bloody likely I used both addresses to sign up for the same information, is it? Forget it - they were harvested and are obviously on the same CD(s). Those two addresses are spam traps now - but I'd like to prevent this from happening again. I'd looked into this before but never did find much I could use. This time it was different, probably because I came at it from a different direction. Some useful links I found about email harvesters:
Let's see now, which of the known email harvesters have actually visited my sites? And how often? All I needed to do was plug in each of the known names in Windows search to go through my (locally-stored) log files. I was pretty shocked by how much there was! Here's what I found, in alphabetical order:
While some of these products are no longer distributed via their original maker's web sites (they have often disappeared after complaints), that doesn't mean spammers don't still use them; and they may get them also via some of those nice spammers' toolsets they can buy on CDs (the CDs you get spam for!)... EmailSiphon is a real oldie - but I can see it's still being actively used. So, that's definitely something I'll need to take care of (and notice how we're no longer doing browser sniffing but something else entirely?). I found several interesting pages about dealing with these nasties; the most useful ones (for me, at any rate) were:

The second article goes into quite sophisticated measures you can take with Apache combined with scripting. I guess that's where I'll start once I move on from reporting all this to real coding.

OK, now we know what to do about spambots (mostly); so what other user agents do we have? Just searching server logs won't do any more, I'll need to do some real analysis here. Of course that leads to a whole new chain reaction, so here we go yet again:

Analyzing server logs

When I started out with Digital Daze where this site is hosted (see link in the page footer) they were running a home-grown variation of a web server with many interesting capabilities that Apache acquired only later (these guys really know what they're doing!). So when Apache was offered later, conversion was a no-brainer. One of the few things I had to pay attention to was the format of the server logs. I was using a "combined format" for the old server; Apache doesn't have an internal definition for this, which turned out not to be a problem since in Apache you can configure exactly what a server log record looks like: I could make it just like the old combined format. Here's what the combined log format definition looks like in my Apache httpd.conf:
LogFormat "%h %l %u %t \"%r\" %s %b \"%{Referer}i\" \"%{User-agent}i\""
So we have:
A typical log entry might then look like this (all on one line):
203.12.97.39 - - [08/Jan/1998:03:23:05 -0700] "GET / HTTP/1.0" 200 3570 "-" "Mozilla/2.02 (OS/2; I)"
So if we're interested in the user agents (and we are) all we need to do is look at the third quoted string. Right? So I write a Perl script to go through all my log files and pull out the third quoted string of each to build a list of user agents. Sigh - here we go again:

Pulling out user agents

Guess what? Server logs aren't that simple. They record what's coming in (well, that's what this one is designed to do) - and what's actually coming in may not be what should be coming in. Making a script to pull out every third string wasn't all that hard - but when I looked at the results it was soon clear that wasn't always the user agent. The first thing I noted about these strange results was that what was supposed to be the user agent, was actually the tail end of the referrer. The referrer can be the URI of a search engine with search parameters: and those search parameters can contain a quoted string; so we have double quotes embedded in what should be a simple quoted string.

Well, OK then: instead of looking for the third quoted string, I just look for the last one, which then must be the user agent, right? Wrong again: It seemed to work, mostly, but I still got a few really strange user agent strings. Now there are really strange user agent strings, of course (I'll show you some in a minute) but just a closing bracket? Not very likely, something's still wrong. Windows search to the rescue again - find the log files that contain "); load them in an editor, search for that string within the file. Brilliant: User agent strings can also contain embedded double quotes! I found these, for instance:
Mozilla/3.0 (compatible; Webinator-cigweb01.ge.com/2.54; "I'll be back")
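To see both shortcuts fail on a string like that, here is a small sketch in Python (my own script was Perl, and is not shown). The log line below is assembled for illustration: the first part is borrowed from the sample entry shown earlier and the user agent is the Webinator string above; it is not a line copied from my logs.

```python
import re

# Illustrative record: sample entry's fields + the Webinator user agent.
LINE = (
    '203.12.97.39 - - [08/Jan/1998:03:23:05 -0700] "GET / HTTP/1.0" 200 3570 "-" '
    '"Mozilla/3.0 (compatible; Webinator-cigweb01.ge.com/2.54; "I\'ll be back")"'
)


def quoted_strings(line):
    """Naive approach: every run of non-quote characters between quotes."""
    return re.findall(r'"([^"]*)"', line)


if __name__ == "__main__":
    parts = quoted_strings(LINE)
    print(parts[2])   # "third quoted string": the user agent, cut short
    print(parts[-1])  # "last quoted string": only the closing bracket
```

The embedded quotes split the user agent into pieces, so neither the third nor the last piece is the whole string.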
Well, that was one simple strategy shot to pieces: If I want to get at the user agents I can't just look at the third string and I can't even look at the last string. I'll have to build a regular expression to match the whole of the log file record, and then pull out what I need. Indeed, here we go once again:

Matching the log file record

Well, that took a little time. More than a little. Let's see how I went about it. First, we have a bit we're not interested in at all: that's everything up to the first request header (which starts with a double quote). Then we get the request header: something in double quotes but not containing quotes. Then a space, a number, a space, a number. Then the referrer which may contain quotes as we've seen but at least not spaces (it's a URI, and a URI cannot contain a space), then a space, and finally the user agent, which again starts with a quote, and may contain both spaces and quotes.

Um - not quite. Those two numbers were wrong, as it turned out: some requests result in only a header being sent back and no content, hence no Content-Length: this happens for instance when the server reports nothing has changed (304), so it doesn't need to send any content again (the user agent already has it). In this case, the second "number" is not a number, but just a hyphen. For instance (one line):

bos-bc02s06.bos.lycos.com - - [01/Jan/2001:11:00:36 -0700] "GET /robots.txt HTTP/1.0" 304 - "-" "Lycos_Spider_(T-Rex)"
Simple enough change. Then I found that the first request header can actually contain quotes as well. OK, but it ends with the protocol, right? Wrong again. It may end with HTTP/1.0 or HTTP/1.1 or just HTTP; but the protocol may actually not be there at all, and I even found a case with an empty first request header. Instead of the fairly simple regular expressions used to pull out the third, or last, quoted string, I finally wound up with this beauty:

/^[^"]+("(?:(.+( HTTP(\/1.[01])?)?)|-?)" )((?:(\d+|-) ){2})"(.*)" "(.*)$/
This works! Here, the last bracketed expression is the user agent - but why no quote at the end after that? Because I found it was missing, at times (no idea why), so it will need to be stripped off if it's there. So I have: $agent = $8;
and then $agent =~ s/"$//;
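For those who don't read Perl, the same two steps can be sketched in Python. The pattern is the one above, carried over unchanged apart from the delimiters and the escaped slash, so the group numbers match ($8 is the user agent, $7 the referrer, $1 the first request header); the function name and the returned tuple are my own choices for this illustration.

```python
import re

# Same expression as the Perl version; groups keep their numbers.
LOG_RECORD = re.compile(
    r'^[^"]+("(?:(.+( HTTP(/1.[01])?)?)|-?)" )((?:(\d+|-) ){2})"(.*)" "(.*)$'
)


def parse_record(line):
    """Return (request, referrer, agent) or None if the line doesn't match."""
    match = LOG_RECORD.match(line.rstrip("\n"))
    if match is None:
        return None
    request = match.group(1)                     # includes quotes + space
    referrer = match.group(7)
    agent = re.sub(r'"$', "", match.group(8))    # strip closing quote if present
    return request, referrer, agent


if __name__ == "__main__":
    line = (
        'bos-bc02s06.bos.lycos.com - - [01/Jan/2001:11:00:36 -0700] '
        '"GET /robots.txt HTTP/1.0" 304 - "-" "Lycos_Spider_(T-Rex)"'
    )
    print(parse_record(line))
```

Because the last group is greedy and the closing quote is stripped afterwards, a user agent with embedded quotes comes out whole, and a record with the closing quote missing still matches.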
After getting to this point through numerous trials and error and test files with problem records, I just set it to go on all my log files. And found it failed on every single line of the oldest one. Huh? Oh yes, I'd forgotten that: when I set up the server, way back in 1997, I didn't at first use the combined format. I didn't start doing that until I added a second site to the server. No problem, I'll just combine the access log with the agents log myself; a simple little script should be able to handle that. Well, "should" is right: I did have an access log and an agents log - but they didn't have the same number of records. Oh, yes, of course. Here we go again:

Combining log files

This was fun - NOT! The access log had about 18,000 lines, the agents log less. Instead of an empty record for an undefined agent, no record was produced. So I loaded both files in an editor, and started adding empty lines to the agents log until both files actually matched. Not too hard, just a lot of grunt work requiring concentration - and half a day. Then I wrote a script to combine the two, putting in "-" for the referrers. Then I found I actually had a referrer log as well. Well, fine, another half day's grunt work and five minutes to adapt the script should take care of that.

Did I tell you I'm an optimist? I'm an optimist. There were far more cases with unknown referrers than with unknown agents - and they didn't simply match with particular agents, but with requests: I had to go through them line by line for large parts of the files. Well, I'd promised myself to get my log files in order, so it got done. All in all, just the manual "matching" took two long days, and adapting the combining script took a "little" more than five minutes, too. All done now. Hooray! Then I also combined and subdivided my log files, so I had separate ones each for a full year. That was easy.

Pulling out agents - at last

This time, it worked.
All log files complete and in combined format, and an ugly regular expression to take care of all requests, including malformed ones. I'd already added a counter. I did a couple of variations, all combining logs for the different sites on the same server:
By the way, while the user agent is $8 in that regular expression, the first request header is $1, so for the third type of analysis I'd need both.

Now the fun starts. Almost 5 years of web history here. Anyone remember BackRub? That was (I first saw it in 1998) what evolved into Google (also 1998); like this:
BackRub/2.1 backrub@backrub.stanford.edu http://backrub.stanford.edu/ (11x)
(Googlebot is at version 2.1 by now.) Then there are the security and privacy protecting programs that will hide the actual user agent from the server, and some ordinary browsers that allow the user to define the user agent string. While this is fine from a security or privacy point of view, for our original purpose of browser sniffing and serving the browser the best it can handle it's useless. Those hidden or "fantasy" user agents will just have to get the plain linear version (but we can add an explanation to the page for them!). Sometimes it's clear where the obfuscation comes from:
Sometimes the source is unknown but the intention is clear:
And some user-defined strings can be quite amusing, too:
And then there are the countless variations of UA strings for real browsers - sometimes with interesting additions if used 'through' other software.

Polite?

Polite robots ask for a robots.txt file first, before actually downloading material from your site. For an HTML page, they can then (also) look for specific meta tags, which ideally are an extension on what you specify in robots.txt. If you don't have access to the server root, meta tags are all you have. All of this is described in A Standard for Robot Exclusion, already dating back to 1994. Each of the methods can do some things the other can not: for instance, with robots.txt you can deny access to all images for particular robots, or to whole directories. With meta tags, you can tell a robot it can index the page, but not follow links, or vice versa.

These days, the major search engines all fully support the Robots Exclusion protocol. Many types of off-line browsers do, too, whether they're dedicated programs like Teleport Pro from Tennyson Maxwell (highly recommended!) or just a browser function such as Internet Explorer provides (an early version ignored robots.txt completely, the current one does not). Alas, for the off-line browsers this is often an option that can be turned off (don't!). Still, I thought it useful to see which user agents actually did ask for robots.txt.

But what of this scenario? Someone knocks on your door, and politely asks if he can come in; but he refuses to give his name. And a few weeks later he's back with the same spiel. Would you let him in? Curiously, I found there are robots that do ask for robots.txt (knock on the door, can I come in?) but hide their identity (user agent just "-"). Since 1999, I get more than 60 or 70 such requests per year; for this year it's already 71! I'm inclined to send them packing, but will need to work out a strategy for that.
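A tally of the "polite" agents needs both the first request header and the user agent from each record. As a rough illustration in Python (again, not my Perl script): the pattern is the one given earlier, the simple substring test for "/robots.txt" is a shortcut chosen for this sketch, and the anonymous sample line is invented.

```python
import re
from collections import Counter

# The record pattern from earlier: group 1 = request, group 8 = agent.
LOG_RECORD = re.compile(
    r'^[^"]+("(?:(.+( HTTP(/1.[01])?)?)|-?)" )((?:(\d+|-) ){2})"(.*)" "(.*)$'
)


def robots_txt_requesters(lines):
    """Count, per user agent, the requests that asked for robots.txt."""
    counts = Counter()
    for line in lines:
        match = LOG_RECORD.match(line.rstrip("\n"))
        if match is None:
            continue  # unparseable record: skip rather than guess
        if "/robots.txt" not in match.group(1):
            continue
        agent = re.sub(r'"$', "", match.group(8))
        counts[agent] += 1
    return counts


if __name__ == "__main__":
    sample = [
        'bos-bc02s06.bos.lycos.com - - [01/Jan/2001:11:00:36 -0700] '
        '"GET /robots.txt HTTP/1.0" 304 - "-" "Lycos_Spider_(T-Rex)"',
        # Invented record for an agent that hides its identity:
        '192.0.2.1 - - [02/Jan/2001:10:00:00 -0700] '
        '"GET /robots.txt HTTP/1.0" 200 120 "-" "-"',
    ]
    for agent, count in robots_txt_requesters(sample).most_common():
        print(f"{agent} ({count}x)")
```

The anonymous visitors then simply show up under the key "-".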
Have a look!

Since I think these simple statistics are actually quite interesting, I've uploaded them all for your amusement. The files that list the "polite" robots are pretty small, so I simply left them as the original .txt files and you can choose whether to look at them in the browser, or download them. Most of the other files are pretty big though, so I've zipped them up (they compress nicely, with all the repeating strings). You can also see how often each of the spambots (and spambot candidates) I mentioned has visited so far. These files can all be found in this directory: http://blogs.hshelp.com/loganal/. The file names should be self-explanatory. Let me know if the ZIP format is a problem; I can provide other formats if necessary.

Now that I've done all that analysis, what am I going to do with it? Tomorrow, we'll have a look at developing a strategy.
2002-07-06

[Accessibility] Before I continue with the Chain reactions story I started yesterday, a small update. I just found a very interesting weblog called 30 days to a more accessible weblog, written by Mark Pilgrim. It's not just interesting - it gives a lot of practical advice on how to make your pages (weblogs or otherwise) accessible for all. Now my blogs here weren't too bad, but a few improvements could be made, so I did. I quickly implemented two tips I found there:
Since all three blogs on this site use the same (PHP) template (which uses includes to pull in the content created by Blogger), a few simple changes have now made all three blogs (and the archives) more accessible. And I could do more - but that's for later. Do read (and use) those tips, though! And it's useful to regularly check your work with Lynx (available for many platforms), or use Delorie's Lynx viewer.

Update: 2002-07-07 I'd been doing this in The BookStore for a long time but forgot to do it here: I'm now going through all posts and marking up every acronym as such, so you can hover the mouse over it and see a tooltip with an expanded version. Don't know what to do with "PHP" though: it looks like an acronym, but isn't. No markup for that, for now. In newer browsers you should also see a dotted underline for each (marked up) acronym. Let me know if I forgot one!

Update: 2002-07-07 Same updates done for the index page giving access to all three blogs and their archives; also moved the links for the archives to the navigation column.

2002-07-05
[Back] Yes, I know, it's been a while. I've been rather depressed (never mind). And then I went on a short vacation in Northern Germany (see the
electronic postcard), which cheered me up no end. So then I've been
programming very hard (more about that later). And wanted to put up a new page as a result of that - which led to a chain reaction. See below!

Spyonit buttons

Now that was stupid of me. I put these handy Spyonit buttons on a lot of pages, so you can quickly sign up to be notified of changes on a page. They worked nicely, too: I tested that. Until I changed the code a bit, that is, and didn't test that. I only just found out I managed to break them all. Sorry about that, folks - all fixed now! So use them :)

Chain reactions 1

Now how do I so often manage to get myself in a chain reaction when working on something? I'm not quite sure, but part of it is that I tend to postpone things until I really need them, so I'll have time now for what I really need now. (Hmm - is there a bit of recursion in there?). Here's a recent one that I'd like to report, since it has some interesting results (and will have more!):

3-column layout

Quite a while ago already, I was thinking about how to change the design of HomeSite Help. Playing with a nice 3-column layout done completely with style sheets (CSS-Positioning). I played with it enough to see it was quite feasible - and then postponed playing with it more until I would actually need to publish a new page. Went off to do other things (the new release of The Bookstore, and programming VTML).

So a week and a half ago I was about ready to publish a new page (the result of another chain reaction, more about that later) - and set out to do what I promised myself: finish the new 3-column page design and publish it with that. Here we go: I worked out a better version of the design, and created a test page with some major elements. Tested it with a couple of new, and a couple of old browsers, and it worked fine: the new ones would get a nice layout, the old ones get all the content but without the nice layout: just the linearized code (optimized for accessibility). But I have only a couple of browsers to test with (still collecting...)
and only a Windows machine to work with. So I went and asked a group of fellow web developers to have a peek. Got some very helpful observations, thanks, folks!

But well, ouch. Not quite as simple as I had imagined: I found out there are browsers that do support @import, but do not support positioning (endless scrolling). And browsers that do support positioning, but only if it's on the left, so what's supposed to be on the right ends up overlapping what is already on the left (can't read that anymore). And so on. And I already knew I'd need some variations of my style sheets for different browsers. It was becoming obvious, then, that I'd need more than a simple @import trick to separate capable browsers from the less-capable ones: I'd have to do actual "browser sniffing". Here we go again:

Server-side browser sniffing

Did a bit of research; something ready-made in PHP would be helpful since that is what I'm using here. I found a few candidates:
The first three are nice to show how things work, but too simple for what I need. The last two are good candidates, but neither does completely what I need. Well, OK, I'll make my own variation then.

So I needed to know more about recognizing browsers, and did some more research. Lots of interesting stuff, including browser sniffing in JavaScript (not that I'd want that - but there was some useful information in there). Then I realized that when you use JavaScript to do browser sniffing, browser sniffing really is what you're doing: JavaScript runs in the browser then. Do it on the server, however, and you'll first need to find out whether the client accessing your site really is a browser. OK, so there's browsers and robots, right? And robots don't need style sheets. Simple. Um - here we go again:

Browser sniffing?

More research. Apart from browsers, there's robots, and then there's robots, and more robots... and you need to do something sensible for them. So what do we have?
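To make the "is it really a browser?" question concrete, here is a minimal sketch of a first-pass guess based only on the User-Agent header. It is written in Python rather than the PHP used on this site, and the function name, the category names and the marker lists are all assumptions made up for illustration; the two harvester names are just examples of the kind of fancy names such tools announce themselves with.

```python
# Hypothetical marker substrings, for illustration only; a real list would
# have to be built from server logs and published user-agent lists.
HARVESTER_MARKERS = ("emailsiphon", "mailthief")
ROBOT_MARKERS = ("bot", "crawler", "spider")


def classify_user_agent(user_agent):
    """Return a rough first guess: 'harvester', 'robot', 'browser' or 'unknown'."""
    ua = (user_agent or "").lower()
    if not ua:
        return "unknown"
    # Check the most hostile category first, so it cannot be mistaken
    # for an ordinary robot or browser further down.
    if any(marker in ua for marker in HARVESTER_MARKERS):
        return "harvester"
    if any(marker in ua for marker in ROBOT_MARKERS):
        return "robot"
    # Most graphical browsers announce themselves with one of these prefixes.
    if ua.startswith("mozilla/") or ua.startswith("opera/"):
        return "browser"
    return "unknown"
```

Since the user agent string is trivially faked, a result of 'browser' here only means "claims to be a browser"; it is a starting point, not proof.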
Get the idea? Probably not complete even, but it does indicate what you need to take into account when looking at all this from the server.

Actual browser sniffing would apply to only the first two (well, more or less). Email harvesters should be banned, or get a page without any email address, or get a page with lots of spam trap email addresses. And bots (including off-line browsers) should be polite and honor the robots exclusion protocol, so should we ban them if they don't?

You cannot count on recognizing a search engine by the user agent string (a person can fake that, and some search engines pose as ordinary browsers), so you need to recognize them by IP address. Search engine bots don't need all that style sheet stuff, and could actually be presented with a slightly optimized version of the linear code: by serving things in a slightly different order, it could be optimized for keywords rather than accessibility.

But wait, some search engines cache pages (Google and Alexa (for the web archive)); but then I won't know with which browser that cached page will be viewed. OK, so you can use a meta tag to tell them not to cache the page, and they'll get the linear version as well. But the translating services will need to present their result in the user's browser - so hopefully you can recognize the service by IP address (which may be inside an IP block owned by the search engine) and still recognize the user's browser by the user agent string (if not, they'll have to get the linearized version).

Well, that's all in one paragraph, but it took a day or two of research to get to this point.

Um, actually: user-agent sniffing

And those email harvesters have always bothered me. I really need to take care of them. Now, with the research I'd done, I finally had a
good list of possible user agent strings: though some of course will masquerade as an ordinary browser, surprisingly they often give themselves
away with fancy names like EmailSiphon and MailThief. Which ones did and do visit my server? And I'll need to
recognize the real search engines by IP number (and exclude the translation services). And so on. Time to get out my server logs. Right, here we
go again.

(Yes, there's more - that's for tomorrow!)

2002-03-27

[Blogs] Tip for bloggers with PHP: One thing that bugged me when setting up my blogs was that for a blog post I can format the date in the settings, but not the dates appearing in the generated link to an archive (<$BlogArchiveName$>). The result would be a mix of date formats, even more confusing to many people than the default "American" format alone (which is ambiguous). So I created a little PHP script that takes the archive name, parses out the dates, reformats them, and then puts it all back together again in the link description. Here's the code:
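The PHP listing itself is not shown here. As a rough illustration of the same idea only, and not the original script, here is a minimal sketch in Python. It assumes the archive name contains dates written as MM/DD/YYYY (for example "03/01/2002 - 03/31/2002"); that input format, the function name and the default output format are all assumptions.

```python
import re
from datetime import datetime

# Assumption: dates in the archive name are written as MM/DD/YYYY.
US_DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def reformat_archive_name(name, out_format="%Y-%m-%d"):
    """Rewrite every MM/DD/YYYY date in `name`, leaving all other text as-is."""

    def replace(match):
        month, day, year = (int(part) for part in match.groups())
        try:
            return datetime(year, month, day).strftime(out_format)
        except ValueError:
            # Not a real calendar date: keep the original text untouched.
            return match.group(0)

    return US_DATE.sub(replace, name)


print(reformat_archive_name("03/01/2002 - 03/31/2002"))
```

Because only the date substrings are replaced, whatever separator or wording surrounds them in the archive name passes through unchanged.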
Feel free to steal it!

2002-03-25

[Blogs] After some more tweaks, I've declared it good enough for now: I'm going live by adding a link to the HomeSite Help home page. I'm considering an option to let people add their comments. I've seen a tool for that (in PHP) but the code was very messy; it shouldn't be too hard to write my own version: you'll see it when it's there :) Expect more "real" content in the coming days about subjects other than setting up this blogging subsite.
[Blogs] I've added a "Spymaker" button to all three Blog pages: with that you can easily create a "spy" which will let you know when there's something new on that page. All links to external resources now open in a separate window: you can keep it open if you want to follow other links.
[Blogs] A few more minor tweaks to the Blogs styling, and a better hierarchy for the headings. Time to go to bed.