"The (B)Leading Edge"
String in the Real World – Part 2

by

Jack Reeves
©C++ Report

Introduction

In my previous column, I started out examining the Standard string class "in the real world". That column looked primarily at the Standard string interface. In this column I will get more into the implementation. Let me preface these remarks with a caveat: it is not my intent to in any way suggest that there is anything "wrong" with the Standard string class. For the vast majority of programmers, the Standard string class will adequately meet all of their needs, and do so with elegant efficiency. Nevertheless, as with any other library, there is simply no such thing as "one size fits all". This column is about situations that might affect only a small fraction of C++ users, and then only occasionally. As I hope to show, even when the implementation of the string class that comes with your compiler is not what you need, my preferred solution is to stick with the string specification as provided by the Standard and provide a more appropriate implementation.

[Soapbox - On the one hand I really feel like I should not have to constantly make these qualifications; most good programmers understand that it is up to them to determine whether a given function, class, or library is appropriate to their application or not. On the other hand, I seem to be running into an awful lot of marginally competent (shall we say) programmers these days. In particular, there seem to be a lot of people out there who have nothing but criticism for the C++ Standard. The C++ Standard is not perfect, but nobody (at least nobody I know) claims that it is. It is true that in the end the committee was constrained by time and procedural rules, but most programmers working in the real world understand how deadlines and QA requirements can keep you from making last minute changes to a product. The overall consensus of the committee remains: "The C++ Standard is the best we could make it. If we could have agreed on how to make it better, then we would have made it better." This is especially true for the Standard C++ Library. In spite of this, there apparently are a lot of people who seem convinced that had they been on the C++ committee, it would have turned out better. I want to make it clear that I am not one of them. It is true that I personally prefer a slightly different interface from the Standard string class in my own work, but I am aware that there are others whose opinions differ from mine. I am also aware of how difficult it is to come up with something as good as the Standard string class when starting from scratch. The key point is not that the Standard isn’t perfect as is, but that it is an excellent starting point that most users will find perfectly adequate for their tasks, and that others can use as a basis for something more appropriate when necessary. While the pundits and the amateurs complain about C++’s complexity and the size of its defining Standard, those of us with large-scale problems to solve in the real world now have at our disposal a programming language of almost immeasurable power and flexibility. You want C-like efficiency? Use the C subset. Garbage collection? Link in a garbage collector. Need multiple inheritance? No problem. Want to do generic programming? Ditto. Need the flexibility of a symbolic programming language? You can do it in C++ (though it does take some work). Yes, C++ has depths of complexity that can require study and practice to master. In my opinion, the Standard did not really add a lot of complex features to what was already in the C++ language -- namespaces being the only one I consider important (of course, if your older C++ compiler did not support templates or exceptions (or you were carefully avoiding them) then you may disagree). The library is another story. Unfortunately, study and practice seem to be anathema to the vast majority of programmers out there. For myself, study and practice of my craft do not bother me; if I wanted to be half-good at something, I would still be playing the guitar. End Soapbox]

Background

I was recently working on a financial data server. In that application a financial instrument was identified by three elements: its type, source, and symbol. Each of these was represented in network requests by short strings. It therefore seemed reasonable to create a class such as the following:

class ItemName {
    string _type, _source, _symbol;
public:
    . . .
    const string& type() const { return _type; }
    const string& source() const { return _source; }
    const string& symbol() const { return _symbol; }
    ItemName& type(const string& s) { _type = s; return *this; }
    ItemName& source(const string& s) { _source = s; return *this; }
    ItemName& symbol(const string& s) { _symbol = s; return *this; }
};

This is pretty straightforward, and everything worked fine, but let’s look a little deeper into the actual values that this class represents.

The type value is determined by the software itself (in this case it is actually determined by a legacy C library). There are about 10 different types defined, each represented by a three or four character string. The most common type is "gnrl". The source value is an identifier for the actual source of the instrument data. Various codes represent these sources. Since these codes have to be transmitted over the network to the server, they were deliberately chosen to conserve bandwidth. For example, the New York Stock Exchange is coded as the single character ‘N’. Most source codes are 2 to 5 characters in length. The symbol is the actual assigned symbol for the instrument. Most of these are short 3-5 character strings.

Now consider the typical string implementation. Many Standard libraries implement string using a reference counting scheme similar to that shown in figure 1. The actual string object is just a pointer (often called a ‘handle’) to the data on the free-store. The Standard makes it clear that string can be implemented using such a shared data scheme, but it must act as if every copy of a string were separate. This is often referred to as "delayed copy" or "copy on write." The point is that reference counting is an internal optimization that vendors can, and often do, adopt to improve the efficiency of copying strings.
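To make the idea concrete, here is a minimal sketch of how such a scheme can work. It is purely illustrative -- it is not any particular vendor's implementation, and it ignores the complications the Standard imposes (allocators, exception guarantees, and so on):

#include <cstddef>
#include <cstring>

// A minimal sketch of reference counting with copy-on-write.  Copying just
// shares the representation and bumps a count; a mutating operation makes a
// private copy first.
class SharedString {
    struct Rep {
        int refCount;
        std::size_t length;
        char* data;
    };
    Rep* rep;

    void release() {
        if (--rep->refCount == 0) { delete [] rep->data; delete rep; }
    }
    void detach() {                          // the "copy on write" step
        if (rep->refCount > 1) {
            Rep* r = new Rep;
            r->refCount = 1;
            r->length = rep->length;
            r->data = new char[rep->length + 1];
            std::memcpy(r->data, rep->data, rep->length + 1);
            --rep->refCount;
            rep = r;
        }
    }
public:
    SharedString(const char* s) {
        rep = new Rep;
        rep->refCount = 1;
        rep->length = std::strlen(s);
        rep->data = new char[rep->length + 1];
        std::memcpy(rep->data, s, rep->length + 1);
    }
    SharedString(const SharedString& s) : rep(s.rep) { ++rep->refCount; }  // cheap copy
    SharedString& operator=(const SharedString& s) {
        ++s.rep->refCount;                   // order handles self-assignment
        release();
        rep = s.rep;
        return *this;
    }
    ~SharedString() { release(); }

    std::size_t size() const { return rep->length; }
    char operator[](std::size_t i) const { return rep->data[i]; }       // read: keep sharing
    char& operator[](std::size_t i) { detach(); return rep->data[i]; }  // write: copy first
};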

The header of the data area often looks something like this:

struct stringRep {
    int refCount;
    int length;
    int size;
    bool flag;
};

On a typical machine with a 32-bit word size, this struct will take up 4 words, or 16 bytes. Most implementations will have at least one ‘flag’ field, many will have more than one, but I will assume that all of the flags can fit into one word. Some implementations will also include a copy of the allocator object that was an argument of the string constructor. Since the Standard string class uses the default allocator, which in turn is specified to use operator new and operator delete, it is possible for an implementation to optimize this field away. For now I will assume that this has been done.

In addition to the stringRep header, most string implementations will also have a minimum data allocation size. I have seen this range from as low as 8 bytes (my own string implementation) to as high as 64 bytes. Let’s assume 16 bytes for now. This means that our "typical" string class will allocate a minimum of 32 bytes (StringRep plus data) from the free-store for even the shortest strings.

The Problem

Now let’s return to ItemName. One of the applications that used ItemName was a local financial data server. This application was to act as a buffer for the global financial network. As such, it would receive requests for instruments and cache the data. An instrument’s ItemName was the key by which its data could be requested, and in turn was the index into the cache. Needless to say, a lot of ItemName objects were typically in existence within this application. A typical test run involved a cache of 10,000 instruments. A real-world instance of this application was expected to be able to handle up to 100,000 instruments, with future versions going as high as 500,000 instruments. Looking only at the data cache, 10,000 ItemNames used for indexes implies 30,000 strings. With a typical minimal allocation of 32 bytes, this implies almost a megabyte of data held in strings. Not a big deal by itself, but when you consider that the average length of each of those strings was less than 4 characters, the 700% overhead starts to look pretty bad. It became unacceptable when we started to consider real-world size caches and the hardware it would take to support them. Since I was the C++ / library / STL / middle-ware guru, I was given the problem of reducing this memory overhead to an acceptable level (the problem was much larger than ItemName; ItemName is just the subject of this column).
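For the record, here is the back-of-the-envelope arithmetic, using the 32-byte-per-string figure assumed above (the exact numbers obviously depend on your library and your heap):

#include <iostream>

int main() {
    const long itemNames      = 10000;   // instruments in a test-run cache
    const long stringsPerName = 3;       // _type, _source, _symbol
    const long bytesPerString = 32;      // stringRep header + minimum data block
    const long avgPayload     = 4;       // useful characters per string (roughly)

    std::cout << "strings held: " << itemNames * stringsPerName << '\n'
              << "bytes held:   " << itemNames * stringsPerName * bytesPerString
              << "  (just under a megabyte)\n"
              << "overhead:     "
              << 100 * (bytesPerString - avgPayload) / avgPayload << "%\n";  // (32-4)/4 = 700%
    return 0;
}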

My first attempt at reducing ItemName’s overhead was aimed at the _type field. Since there were only a limited number of types I knew I could share the data representations. I decided that I should be able to take advantage of the sharing inherent in the string class -– if I just did the copies right. So I changed the type update function like so:

typedef set<string> Types;
const Types types = init_types();   // built at startup

ItemName& ItemName::type(const string& s) {
    Types::const_iterator it = types.find(s);
    check(it != types.end());       // like assert
    _type = *it;
    return *this;
}

The idea here is to look up the new value for the type in a table and then make the assignment from the value in the table. For the table, I naturally turned to the STL and used a set<string>. Assuming the assignment just increments the reference count of the string in the table, this should mean that every _type field in the application just points to one of the elements in the table.

In the world of strange coincidences, it turned out that I had not had this change in place 24 hours when we received a new C++ environment for one of our platforms. I took some time to do an evaluation of this version’s Standard string class and immediately discovered that their Standard string did not use a reference counting / copy-on-write scheme. Instead, it was implemented much like most vector<> implementations, and made a copy of the data within the object at the point where a copy was invoked. Initially, I was slightly annoyed at this; I considered a non-reference-counted implementation slightly archaic. Nevertheless, I quickly realized that the string implementation was completely compliant with the Standard, and my annoyance was better aimed at myself for developing code that depended on a given implementation. As I noted above, the use of reference counting is an internal optimization. If you want portable code, you should only depend upon what is guaranteed by the Standard and not make any assumptions about the implementation. In this particular case, if I wanted different string objects to share the same data, it was up to me to code it so that they would.

At this point I admit to a brief period of weakness whereby I fell back into old C habits. In C, if you want to share common data, you use pointers. So I switched ItemName’s _type member to be a const char* and its mutator and accessor functions became:

typedef set<const char*, cstr_less> Types;
const Types types = init_types();

ItemName& ItemName::type(const string& s) {
    Types::const_iterator it = types.find(s.c_str());
    check(it != types.end());
    _type = *it;
    return *this;
}

string ItemName::type() const {
    return string(_type);
}

You will note that I had to change the return type of type() from a const string& to just a string. You will also note that the instantiation of the set<> template includes the template argument cstr_less. This is one of several little utility functors that I have sitting around in a utility library. As you would expect, it compares two C-style strings using the strcmp() function.
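For completeness, cstr_less is nothing more than a strict weak ordering over C-style strings; a minimal version (which may differ in detail from the one in my utility library) looks like this:

#include <cstring>

struct cstr_less {
    bool operator()(const char* a, const char* b) const {
        return std::strcmp(a, b) < 0;   // order by the characters, not the pointer values
    }
};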

With this done, I turned my attention to the _source field. Like the _type field, the typical execution run would include only a few distinct values for _source. Unlike the _type field, I could not enumerate in advance all the possible candidate values for source. I needed something a little more dynamic than I had used for _type. It is true that I could have just expanded upon the set<const char*> idea, but that would have meant allocating character arrays and copying strings whenever a new source value came along. Since new source values are rare, there was no performance issue involved; my mind simply balked at doing anything so C-like. What I wanted to do was put strings into a set, and then share the strings themselves. In essence, I wanted to hold a pointer to the string in the set<>. I could have used a string*, but I realized that I already had something just like a pointer-to-string that I could use instead. ItemName became:

class ItemName {
    typedef set<string> StringSet;
    static StringSet sources;
    StringSet::iterator _source;
    . . . .
};

ItemName& ItemName::source(const string& s) {
    _source = sources.insert(s).first;
    return *this;
}

const string& ItemName::source() const {
    return *_source;
}

Most people think of iterators as just being useful for stepping through a collection. Here, I take advantage of an iterator’s pointer-like property of being a reference to something in a collection, and hold on to it. While the mutator function looks like it inserts every requested source value into the set, it really doesn’t. Since a set<> cannot have duplicates, the insert function returns a pair<iterator, bool>. The bool element indicates whether the insert succeeded or not. If the insert succeeded, the iterator portion of the pair<> references the newly inserted value. If the insert fails because there is already an element in the set<> with the same key, the iterator references the element already in the set<>. In either case, the iterator returned from insert is what I want, so I ignore the second element in the pair<> and save the iterator.
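A trivial example makes that behavior concrete (illustrative code only, not part of ItemName):

#include <cassert>
#include <set>
#include <string>

int main() {
    std::set<std::string> sources;
    std::pair<std::set<std::string>::iterator, bool> r1 = sources.insert("N");
    assert(r1.second);              // "N" was new, so it was inserted
    std::pair<std::set<std::string>::iterator, bool> r2 = sources.insert("N");
    assert(!r2.second);             // already present -- nothing inserted
    assert(r1.first == r2.first);   // both iterators refer to the same element
    return 0;
}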

Naturally, this scheme works for type as well, so I went back and changed the _type member to also be a StringSet::iterator. Since new types cannot be added at run time, the mutator function remained a lookup rather than an insert:

ItemName& ItemName::type(const string& s) {
    StringSet::iterator it = types.find(s);
    check(it != types.end());
    _type = it;
    return *this;
}

With these two changes, I significantly collapsed the memory footprint used by ItemNames in an execution. The fact that the allocated data for a type or source string might still have a considerable amount of overhead was quite a bit less important when thousands of such strings were all sharing the same data.

That left symbol. Up to this point, the "improvements" had been relatively painless. Clients of ItemName had been forced to recompile (several times), but there had been no changes to the interface which required coding changes (this had been an unstated, but important, requirement of the tuning effort). Furthermore, the coding effort necessary to realize significantly measurable improvements had been small. Things might have been left as they were at this point, but symbol was still a problem. With an average symbol size of less than 4 characters, and the typical string allocation to hold those 3.x characters being 32+ bytes, there was still room for improvement. Again, the first ideas involved falling back to ordinary C-style strings. Besides my personal aversion to using C-style strings, I was now sensitized to the whole idea of free-store overhead. While it was certainly true that

char* p = std::strcpy(new char[4], "IBM");

would actually use much less space than

string s("IBM");

it was still likely to use more than I wanted. Besides the 4 characters of actual data, the typical heap manager would also include 4-8 bytes for its own control information. It would also be likely to align allocations on double-word boundaries. Since I did not consider any of this heap manager overhead in discussing the string class overhead, it is not really fair to consider it now, but it is there. Since "IBM" would fit into the space actually used by the char*, it seemed a shame to have to put up with any heap overhead for such short strings.

At this point, let’s admit that a great many programmers would just pick a size (4, 6, or 8 bytes) for a fixed-length character buffer, say something like "that will work for the vast majority of cases," and let it go at that. Since it probably will work for all the test cases, it is quite possible that the program could actually be in users’ hands before somebody had a problem. Then the "fix" would be to just make the buffer size bigger. While there are a lot of advantages in simple solutions, I have had to fix enough systems designed around such short-sighted limitations that I don’t even consider them acceptable for in-house products. Of course this is an overwhelming argument for the use of Standard C++ strings (and the STL): they provide open-ended, general-purpose solutions that are as simple (or even simpler) to use as the more limited older versions.

The problem in this case was how to deal with those occasions when the symbol would not fit into a fixed 4-byte area (or 6, or 8, or whatever). I came up with a new class that I called cpstring (that was supposed to stand for ‘Compact String’ – in retrospect the name wasn’t very good, but I am pretty much stuck with it now). Its representation looks something like this:

class cpstring {
    struct Local {
        short int len;
        char data[6];
    };
    struct Remote {
        short int len;
        short int siz;
        char* data;
    };
    union Impl {
        Local loc;
        Remote rem;
    };
    Impl _impl;
public:
    . . . .
};

Within the implementation, if _impl.loc.len < 6 then the actual character data is in _impl.loc.data. If _impl.loc.len >= 6 then the data is on the heap and is referenced via _impl.rem.data. I don’t show the details of the cpstring interface because they are exactly the same as the Standard string -- with some additions described below. Readers who are interested can get a copy of the entire implementation from my web site (www.bleading-edge.com). I will mention a few of the implementation details.
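To make the local/remote selection concrete, the read accessors conceptually look something like the following (a sketch only -- the posted implementation differs in detail, and I am assuming data() and size() members that were elided from the class outline above). Note that Local and Remote share the same initial member, so len can always be read through loc, whichever representation is active:

const char* cpstring::data() const {
    return _impl.loc.len < 6 ? _impl.loc.data      // short string: stored in the object
                             : _impl.rem.data;     // long string: stored on the free-store
}

size_t cpstring::size() const {
    return _impl.loc.len;                          // valid for either representation
}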

Obviously, there is no reference counting with cpstring – any copies result in physical copies being made.

Also, since the length and size elements are declared to be shorts, the maximum length of string that can be held in a cpstring is limited to 32767. While this is usually not a problem, it is probably less than what a Standard string will support (although there are no guarantees).

Since the objective of cpstring is to conserve memory, it tries to keep empty space allocated on the free-store to a minimum. To hold a string of length l (where l >= 6), it will allocate a buffer of size l+1 rounded up to the next double-word boundary. If the string shrinks, the allocated memory will shrink with it. Naturally, if the size of the string shrinks below 6 characters, the data is copied into the object itself and any memory allocated from the free-store is released. This scheme would make the _impl.rem.siz element redundant were it not for reserve().
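The rounding itself is the usual bit trick; a hypothetical helper (not necessarily what cpstring actually calls it) would be:

#include <cstddef>

// Round an allocation request up to the next double-word (8-byte) boundary.
inline std::size_t round_up_to_dword(std::size_t n) {
    return (n + 7) & ~static_cast<std::size_t>(7);
}
// e.g. a string of length 9 needs 9 + 1 = 10 bytes, which rounds up to 16.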

The member function reserve() can be called to allocate a bigger chunk of memory than might otherwise be needed, but the allocation created by reserve() will remain only so long as changes to the string do not cause its current length to shrink. In other words, you cannot do a reserve() on a cpstring to create a fixed-size character buffer that you can then fill and empty as needed. (The reserve() function of the Standard string class does not guarantee this kind of behavior either, though many implementations handle it that way. This is another example of why you need to be careful not to depend upon a given implementation.) If you need a fixed-size character buffer, then you should allocate a fixed-size buffer. If you want to manipulate that buffer as a string, then you need another of my ‘string’ implementations – one that lets you specify a fixed-size data area, or wrap a string interface around an existing character buffer. Next column.

To facilitate the use of cpstrings interchangeably with Standard strings, there is a converting constructor which takes a const string& and makes a cpstring. There is also a conversion operator string() which will make a string from a cpstring. In addition, most of the functions of cpstring (assign, append, insert, replace, the search functions, the comparison functions, and the free comparison operators) also have versions that take a const string& argument. This allows common operations such as a == b where one is a cpstring and the other a string to be carried out without having to create a temporary of one type or the other.
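In outline, those extra pieces look like this (the signatures are illustrative and far from complete; the full set is in the posted implementation):

#include <string>

class cpstring {
    // ... representation and string-compatible interface as above ...
public:
    cpstring(const std::string& s);              // converting constructor
    operator std::string() const;                // conversion back to a string
    cpstring& assign(const std::string& s);      // string overloads of the usual
    cpstring& append(const std::string& s);      //   member functions
    int compare(const std::string& s) const;
};

// Free comparison operators in both argument orders, so that, for example,
// cpstring == string compares the characters directly with no temporary:
bool operator==(const cpstring& a, const std::string& b);
bool operator==(const std::string& a, const cpstring& b);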

Naturally, after creating cpstring, I went back and used it to store the type and source elements in ItemName. The final (or at least the current) version of ItemName now looks like this:

class ItemName {
    typedef set<cpstring> StringSet;
    static const StringSet types;
    static StringSet sources;
    StringSet::const_iterator _type;
    StringSet::const_iterator _source;
    cpstring _symbol;
public:
    . . . .   // constructors, destructors, and such
    const cpstring& type() const { return *_type; }
    const cpstring& source() const { return *_source; }
    const cpstring& symbol() const { return _symbol; }
    ItemName& type(const string& s);
    ItemName& source(const string& s);
    ItemName& symbol(const string& s);
    . . . .   // etc.
};

I will close this (already too long) column with a few final observations. ItemName is a very small class, but it provides a number of valuable insights into the problems of designing reusable, general-purpose types. One of the often-stated rules of C++ is "always make data elements private." In the original version of ItemName, this may have seemed like a rather silly restriction. As we saw, however, when we started changing the implementation of ItemName (with many thousands of lines of code already using it), we were very glad that we had isolated the data behind a functional interface. Even then we had problems.

As I changed ItemName’s representation, all clients of ItemName were forced to recompile. Even though I only posted new versions of ItemName to the released libraries at carefully timed intervals, the fact remained that whenever a new version of ItemName appeared, people would get annoyed at the time it would take to rebuild their code. I considered completely hiding the details of ItemName’s implementation in a separate class, with ItemName becoming just a ‘handle’. While that might have helped while I was working on things, I rejected it as a long-term solution. Since these changes came about from the need to address problems with excessive memory allocation, I decided that adding another level of free-store allocation was not appropriate. The best thing I could do was to stabilize ItemName again as soon as possible.

I also decided to go ahead and return cpstring references in the interface of ItemName. Most of us never think twice about having a member function such as

const string& type() const;

The problem is that by returning a reference we are exposing details of the implementation, even though we are using a member function. If we change the implementation, we will be forced to change the interface, which could cause our clients to have to change code. As I noted above, one of my unstated requirements in changing ItemName was that I not break any existing code. What this meant in practice was that I had to continue to return a string from the accessor functions, or something that could be automatically converted into one.

I decided to go ahead and return a const cpstring& (and to do so with an inline function) for the sake of efficiency (the usual reason for returning references instead of objects). As it was, I worried about the possibility that a lot of code using ItemName might have to convert the cpstrings to regular strings, with the attendant memory allocation that would go along with it. This is one of the primary reasons that I went to the trouble of making cpstring a complete replacement for string. If I had not been worried about the overhead of converting cpstrings to strings, I could have solved the memory problem with symbol in some simpler fashion, including using ordinary C-style strings. If I had done so however, I would have been forced to create a string temporary in order to return the value from an accessor function (you can see this in my second solution for type, above). Doing anything else would have broken code. Since ItemName was a heavily used type, adding this level of overhead could have caused performance problems in its own right. By returning a cpstring&, and by making cpstring’s interface accept string objects without conversion, I hoped to keep to a minimum the extra overhead I would introduce by using cpstring instead of string. As it was, I came real close to sticking with the Standard string class for _type and _source since I was storing these in a container and sharing them anyway.

I also thought long and hard (in fact I am still thinking) about not providing cpstring with an operator string() member function. By preventing cpstrings from being automatically converted back into strings, I would ensure that the compiler would not silently start creating string temporaries that were not there before. The downside of preventing this performance hit was that I could possibly break some existing code. This last possibility turned out to be the deciding factor in this case – I simply could not afford to break code at that point. After examining some typical code, we decided that we could worry about performance problems with string temporaries after we discovered that we actually had performance problems with string temporaries. If I had been tuning ItemName earlier in the project, I am pretty sure that I would have provided an explicit member function (as_string()?) instead of operator string(). This would have forced users of cpstring to be aware of any string temporaries they created.

In the final analysis, cpstring turned out to be a useful class. It helped solve ItemName’s memory problem, though the object-sharing approaches used for type and source provided much higher payoff for the effort involved. Outside of ItemName, cpstring has turned out to be only of limited use – there are few uses for a string optimized to hold fewer than 6 characters. Still, I think the effort involved in creating cpstring was worth it, though if I could have come up with a more general-purpose solution, I would have preferred it. (One option I have been considering is to make cpstring a template with an argument to specify the size of the internal buffer. That way, a user with an application where most of the strings are less than ‘n’ characters could create a cpstring optimized for ‘n’ characters, but still have a general-purpose type that could accommodate longer strings if necessary.)
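The representation for that template might look something like this -- just a sketch of the idea, not something I have implemented, and the name basic_cpstring is purely hypothetical:

#include <cstddef>

template <std::size_t N>
class basic_cpstring {
    struct Local {
        short int len;
        char data[N];                 // strings shorter than N live in the object
    };
    struct Remote {
        short int len;
        short int siz;
        char* data;                   // longer strings live on the free-store
    };
    union Impl {
        Local loc;
        Remote rem;
    };
    Impl _impl;
public:
    // ... the same interface as cpstring, with the local/remote
    // cutover at N instead of 6
};

// Today's cpstring would then be just a special case:
typedef basic_cpstring<6> cpstring;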

In summary, the Standard string class provides a very useful abstraction. If you find yourself in a situation where your standard implementation is inadequate for your application, you definitely want to consider keeping the interface, but providing your own implementation. If your situation is like mine and you discover your problems late in the project (and that is probably what will happen), then you may have very little choice. Even if you anticipate them up front, it is probably better to stick as close to the Standard string specification as you can. More and more, this is what programmers are going to expect, and by using Standard string you will minimize the surprises that your maintenance staff will have down the road.

Figure 1. A typical reference-counted string implementation: the string object is a handle pointing to a shared stringRep header and character data on the free-store.