Independent Evaluations of Networking Products and Tools

The Black Art of Networking

BYTE Magazine, 1991

 

LANs aren't yet simple, off-the-shelf products – there's nothing simple about managing today's LANs

 

 

I recently ran into two situations that point up a fact of life about LANs. Vendor claims to the contrary, most LANs still require a high level of expertise and many hours of ongoing troubleshooting to make them work.

 

Sometimes the problem lies in the physical LAN, and you have to figure out which cable, network adapter, or repeater/access unit isn't working right. Sometimes the problem is relatively simple -- a missing or out-of-date software component or a configuration change. Less obvious are the problems that involve LAN setup parameters and tuning. The most complicated kind of LAN problem, however, is the one in which you discover a bug in your LAN vendor's hardware or software. Such bugs are fairly rare, thank goodness, but I recently had the misfortune of encountering two of them.

 

I get a lot of personal satisfaction from the frequent problem solving I do on networks. Unless you're prepared to diagnose and solve LAN problems, though, you may find your new LAN more of a challenge than you had bargained for. Here are just two examples of how problems crop up.

 

Token Ring Trouble

Not long ago I recommended buying Thomas Conrad Model 4045 4/16 Mb/sec token ring cards to solve performance problems with my company's NetWare 3.11 file servers (I mentioned this problem in the August column). These Model 4045 cards have 128 KB of RAM, a 16-bit path to the CPU, an onboard processor, optimized driver software, and a few other bells and whistles. When I tried one out, it did indeed perform well with most network file-sharing tasks. Unfortunately, my company's application (an insurance rating system) crashed when run on a Thomas Conrad-equipped workstation.

 

Right away I suspected incorrect jumper settings on the adapter; the card has 12 jumpers--plenty of room for error. I checked and rechecked the settings, and then tried several different jumper placements to see if that would solve the problem. None worked. I replaced the new network adapters with IBM and then Western Digital token ring cards. In both cases the application ran slower, but didn't crash.

 

Perhaps I had a bug in one of my assembler modules. The interaction between my code and the adapter driver code might have somehow brought this problem to the surface. I pored over my assembler modules to see what the interaction might be. I tried to use a debugger to trace into the adapter driver instructions after my code had issued a server file request, but the length of the code path (the sheer number of instructions), the lack of source code, and the inability of most commercial debuggers to reveal timing dependencies and other potential problems made this approach difficult.

 

Was the problem in the driver software? I called the Thomas Conrad customer support line. They asked me to sign a non-disclosure agreement, then sent a beta version of their newest driver software. It didn't solve the problem: our application still crashed at the same place in the application program.

 

A Thomas Conrad engineer then asked for a sample of code with which to reproduce the problem. I obliged, and shortly thereafter Thomas Conrad called back to confirm that I had indeed discovered a bug in their driver software. The bug could occur only when a combination of message packets filled the adapter's packet buffer precisely to the last byte. The application issued a sequence of file requests, and NETx.COM responded by building full packets that caused the driver to fail. We have since received corrected drivers from Thomas Conrad, and they work just fine. One case closed.
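
This kind of boundary bug can be sketched in a few lines of Python. The sketch is purely illustrative and is not Thomas Conrad's driver code: the `PacketBuffer` class, the 1,024-byte capacity, and the faulty `<` comparison are all assumptions chosen to show how code can work for every packet mix except the one that fills the buffer exactly.

```python
class PacketBuffer:
    """Hypothetical fixed-size adapter buffer (illustration only)."""

    def __init__(self, capacity, buggy=False):
        self.capacity = capacity
        self.used = 0
        self.buggy = buggy

    def fits(self, packet_len):
        end = self.used + packet_len
        if self.buggy:
            # Off-by-one: an exact fit (end == capacity) is wrongly rejected.
            return end < self.capacity
        return end <= self.capacity

    def add(self, packet_len):
        if not self.fits(packet_len):
            raise OverflowError("packet does not fit")
        self.used += packet_len


def fill(buffer, packet_lengths):
    """Queue packets in order; return True if every one was accepted."""
    try:
        for n in packet_lengths:
            buffer.add(n)
    except OverflowError:
        return False
    return True


# One byte short of full: both versions cope.
print(fill(PacketBuffer(1024, buggy=True), [512, 511]))
# Filled precisely to the last byte: only the buggy version fails.
print(fill(PacketBuffer(1024, buggy=True), [512, 512]))
print(fill(PacketBuffer(1024, buggy=False), [512, 512]))
```

A test suite that never happens to produce an exact fit would pass the buggy version every time, which is one reason a bug of this sort can survive until a particular application's traffic pattern exposes it.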

 

A Tale of Corruption

Another, unrelated incident occurred when we delivered our software application to the CIGNA insurance company. They use IBM's OS/2 LAN Server 1.2 and OS/2 Extended Edition 1.2 (CSD level 4098) on the file server; the workstations run DOS LAN Requester. The software failed to run--we ended up with "device I/O error" messages and some corrupted files. Again, we first suspected a configuration error. Perhaps our software had been installed incorrectly, or the file server parameters were "too tight" and didn't allow us to share files and lock records to the extent we needed. I flew to Philadelphia to track down the problem.

 

CIGNA technical experts and I went over the installation and the file server parameters (the IBMLAN.INI file) in great detail. Everything checked out; something else was causing the I/O errors and corrupted files.

 

I traced through my code with a debugger, using a monitoring tool (WATCH, from Jensen & Partners International) to display DOS function calls as they occurred. I could now see the two problems with some clarity. DOS LAN Requester (DLR) would not let me lock records on a file located on the local hard disk; instead, it returned a DOS error code, which the application received as a "Device I/O Error." Record locking worked fine for files on the file server, but failed miserably on files on the local hard drive. The problem was version-specific, too; we have other clients using OS/2 1.3, OS/2 LAN Server 1.3, and DOS LAN Requester who haven't run into the same error messages. Since our application performs generic file I/O operations that don't depend on the location of the file, the behavior of LAN Server/DLR was definitely a problem for us.

 

The second problem, file corruption, also involved record locking. When you lock a record, you specify the beginning of the record as a byte offset into the file and you specify the region to be locked as a number of bytes. Our application often needs to momentarily lock the entire file as one big record. To avoid having to exactly express the current length of the file each time, I designed the locking mechanism to always lock from the first byte of the file up through the 268,435,455th byte (0xFFFFFFF in hexadecimal).
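
The whole-file lock described above comes down to a fixed (offset, length) pair. Here is a minimal Python sketch of that request. The names `LockRegion`, `WHOLE_FILE`, `split_words`, and `covers_file` are hypothetical, and the high/low word split assumes only that a DOS-era interface passes a 32-bit value as two 16-bit halves.

```python
from typing import NamedTuple


class LockRegion(NamedTuple):
    """A byte-range lock request: starting offset and byte count."""
    offset: int
    length: int


# Lock "the whole file" without asking for its size: start at byte 0
# and cover 0x0FFFFFFF (268,435,455) bytes.
WHOLE_FILE = LockRegion(offset=0, length=0x0FFFFFFF)


def split_words(value):
    """Split a 32-bit value into (high, low) 16-bit words."""
    return (value >> 16) & 0xFFFF, value & 0xFFFF


def covers_file(region, file_size):
    """True if the region spans every byte of a file of the given size."""
    return region.offset == 0 and region.length >= file_size


print(WHOLE_FILE.length)
print([hex(word) for word in split_words(WHOLE_FILE.length)])
print(covers_file(WHOLE_FILE, 1536))
```

Because the region is a constant, the application never has to query the file's current length before locking, which is the saving the design was after.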

 

Why on earth would I do this? The IBM DOS Technical Reference says, "Locking beyond end-of-file is not an error." I relied on this in my code, thinking that I would save a small amount of LAN traffic by not asking each time about the length of the file. Our application has used this approach for many years, working successfully on top of such network operating systems as NetWare, PC LAN Program, 3+Share, and others. But on these versions of LAN Server/DLR, at this particular client site, one of our files inexplicably grew from its correct size of 1,536 bytes to 8 megabytes. I could see with WATCH that the file corruption related to the record locking scheme I had adopted.

 

One of the IBM system engineers who works with the CIGNA account put me in touch with the IBM development staff in Austin, Texas. I talked with Bill Cartright at IBM/Austin, the IBM programmer who maintains the LAN Server code. After some discussion about the techniques I used for record locking and file sharing, Bill mentioned an old bug in LAN Server having to do with record locks that extended beyond the end of a file. Evidently this old bug, or a remnant of it, still existed in the version of LAN Server we were using at this client site.

 

I owe Bill a letter and some software that demonstrates the error, so IBM can issue an APAR (IBM-speak for problem log) and fix the problem. In the meantime, I decided it would be just as effective to lock the first five bytes of every file as to lock the first 268 million bytes. OS/2 LAN Server and DLR would still give me the same record-level collisions, since I was in essence locking the entire file as one small record. I would avoid the locking-beyond-end-of-file bug in LAN Server with this simple change, and I wouldn't have to think of it as a workaround--this technique works in all LAN environments, so I could make the change permanent.
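
The reason the short lock is equivalent can be shown with a toy model of byte-range locking. This Python sketch is an in-memory stand-in, not the LAN Server or DOS locking interface: the `LockTable` class and the client names are hypothetical, and the only rule modeled is that two clients cannot hold overlapping ranges.

```python
class LockTable:
    """Toy in-memory model of byte-range locks on a single file."""

    def __init__(self):
        self._locks = []  # (owner, start, end), with end exclusive

    def lock(self, owner, offset, length):
        """Grant the lock unless it overlaps another owner's range."""
        start, end = offset, offset + length
        for other, s, e in self._locks:
            if other != owner and start < e and s < end:
                return False
        self._locks.append((owner, start, end))
        return True


def second_client_blocked(length):
    """Client A locks [0, length); report whether client B is refused."""
    table = LockTable()
    granted = table.lock("A", 0, length)
    return granted and not table.lock("B", 0, length)


# The huge range and the five-byte range collide in exactly the same way.
print(second_client_blocked(0x0FFFFFFF))
print(second_client_blocked(5))
```

In this model the collision depends only on both clients asking for the same leading bytes, not on how far the range extends. It also shows the convention's one requirement: every cooperating program must lock those same bytes, because a lock placed elsewhere in the file would not overlap them.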

 

No Easy Answers

Problems like these don't happen every day, and I hope they never happen to you. Fortunately, both IBM and Thomas Conrad were very cooperative. I'm not pointing a finger here; these problems are examples of the complicated nature of LANs, not of poor quality.

 

I wish I could predict that LANs will become as easy to use and trouble-free as standalone PCs. However, as your computers become interconnected in increasingly complex LANs, you'll find it takes a certain amount of black magic to keep the network going. Frankly, I don't see that changing any time soon.

 

I'm writing this in a hotel room in Dallas; I'm at yet another client site to determine why their file server isn't serving up files as quickly as it should. I'm not sure yet what the problem might be; perhaps some of the configuration parameters just need tuning ...

 

 

 
