"The (B)Leading Edge"
String in the Real World

Jack Reeves
©C++ Report

Introduction

With this column I am going to begin a series with the working subtitle of "Using the Standard Library in the Real World." Since everyone has their own version of "the Real World", not everything I will discuss may be applicable to you, but sooner or later, if you develop large-scale programs in C++, you will run into similar situations. Let us start with the Standard string class. It is my opinion that the string class is one of the most important new classes added to the Standard Library. There are parts of the Standard Library that many people (including myself) will probably never use; there are other parts that programmers may only use occasionally; still other parts may be used heavily in some areas but not at all in others; but it is a rare program indeed that doesn't need to do some type of string manipulation.

There are three separate parts of the Standard string class. In fact, this triumvirate is true of most of the classes defined in the Standard Library, and it is one of the key points that I will be emphasizing in these columns. Part 1 is the intent of the class. The Standard provides this. You may say "huh, what does he mean by 'intent', that's obvious." Maybe it is, and then again maybe it isn't. I think it is important to understand the intent of a class. Obviously, if you understand what the class is trying to provide, the details of the class definition will make a lot more sense. If you misconstrue the intent however, you may find many things do not work as you expect. It is often even more important to understand what the intent isn't. For example, the Standard string class is not intended to be just a wrapper for the C-library string functions. This may seem obvious to readers of this magazine, but I have encountered C++ programmers who apparently believed this, and they were a bit annoyed when they discovered otherwise.

A key aspect of understanding the intent of the string class is to realize that it has a dual personality. The first and most obvious meaning of a string is that it represents a sequence of characters. This is how string started out. When the STL became part of the Standard Library, string got a second meaning: that of a container. The two approaches are not opposed to each other, but they do have slightly different mind-sets about the operations that it makes sense to provide.

The second part of a class is the specification of the class itself. Again, the Standard provides this. For the most part, this is what people think the Standard provides. In truth, it is often necessary to figure out the intent from the specification. The specification of string gives two clues to the intent of the committee. If you have looked at clause 21 of the FDIS you will have noted that string (actually the basic_string template) has a large number of functions specified in its interface. This is usually termed a "fat" interface. There are two reasons (at least) that the committee chose to give string a fat interface. One was efficiency. Consider the definition of operator==(). Besides the expected

bool operator==(const string& lhs, const string& rhs);

there are also versions taking arguments of ordinary c-style strings

bool operator==(const string& lhs, const char* rhs);

bool operator==(const char* lhs, const string& rhs);

Strictly speaking, neither of these latter two definitions is necessary. Standard string defines a constructor which will convert a c-style string to a Standard string. Therefore, in the absence of the extra definitions, the statement

if (str == "DATE")

would have reduced to

if (operator==(str, string("DATE")))

Nevertheless, it was the intent of the committee that the Standard library should be efficient enough so that it would be usable in typical applications. For string, this meant being usable most places where the old C-library string functions would have been used. The implicit conversion of "DATE" to a string creates a temporary. In a typical implementation this temporary will likely allocate memory from the free store and copy the c-string. Finally, the allocated memory will have to be freed when the temporary is destroyed. This extra overhead is likely to be unacceptable for certain applications. More to the point, it isn't directly comparable with the C-library strcmp() function. The committee added the extra function definitions to allow comparisons to be made directly, without the overhead of creating a temporary. While it is certainly possible to implement the latter two functions in terms of the first one, it would be a rare implementation that did so (but something worth checking).
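As a rough illustration of the comparison discussed above, the following sketch puts the string form and the C-library form side by side. The function names are invented for this example, and whether a particular library really avoids the temporary is something to check against your own implementation.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper: true when `field` holds exactly the text "DATE".
// The expression selects the operator== overload that takes a
// const char* on the right, so the literal does not have to be
// converted to a std::string first.
bool is_date_field(const std::string& field) {
    return field == "DATE";
}

// The C-library spelling of the same test, shown only for comparison.
bool is_date_field_c(const char* field) {
    return std::strcmp(field, "DATE") == 0;
}
```

Both helpers answer the same question; the point of the extra overloads is that the first one should not have to cost noticeably more than the second.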

The other intent I deduced from the fat interface was that the committee wanted to make it easy to do common tasks. Consider the substr() member function. This takes a starting position within a string and a length and returns a string which contains the sub-string. This isn't a required function -- there is a string constructor that will create a string from a portion of another string. Yet the committee obviously felt (and I agree) that having both functions available was useful.
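To make the redundancy concrete, here is a small sketch with two invented helper functions that extract the same sub-string, once through substr() and once through the sub-string constructor.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: extract `len` characters starting at `pos`
// using the convenience member function.
std::string middle_by_substr(const std::string& s,
                             std::size_t pos, std::size_t len) {
    return s.substr(pos, len);
}

// The same extraction spelled with the sub-string constructor.
std::string middle_by_ctor(const std::string& s,
                           std::size_t pos, std::size_t len) {
    return std::string(s, pos, len);
}
```

For any valid position the two helpers return equal strings; substr() simply reads more naturally at the call site.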

Another example is the copy() function. I have found that most people do not even know about this function. It is a member function and its signature is

size_type copy(char* s, size_type n, size_type pos = 0) const;

copy() will safely copy the characters from the string, starting at 'pos', into a buffer 's' of size 'n'. When I say "safely" I mean that the copy will stop when either there are no more characters in the string, or there is no more space in the buffer. The number of characters actually copied is returned. Again, copy() is not required -- there are other ways to accomplish the same thing. Furthermore, unlike substr(), this is not something that is typically a very common operation. Nevertheless, the need does occur, and it is simpler to write, easier to get correct, and typically more efficient to use copy() than the alternatives. Again the committee obviously felt this was useful enough to provide. The problem with "usefulness" is that the list of useful functions can go on forever (I will have a later column on some of the useful functions that didn't make it into the Standard), so what the committee decided to include in string is somewhat arbitrary, but it is certainly more than the bare minimum.
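A minimal sketch of how copy() might be used to fill a fixed-size buffer follows. The helper name copy_to_buffer is an assumption of this example. Note that copy() does not write a terminating null, so the sketch reserves one slot and adds it by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: copy as much of `src` (starting at `pos`) as fits
// into `buf`, then null-terminate. Returns the number of characters copied.
// `pos` must not exceed src.size(), or copy() throws std::out_of_range.
std::size_t copy_to_buffer(const std::string& src, char* buf,
                           std::size_t buf_size, std::size_t pos = 0) {
    if (buf_size == 0) {
        return 0;                      // no room even for the terminator
    }
    // copy() stops at the end of the string or after buf_size - 1
    // characters, whichever comes first.
    std::size_t copied = src.copy(buf, buf_size - 1, pos);
    buf[copied] = '\0';                // copy() itself never adds this
    return copied;
}
```

With a four-character buffer and the string "abcdef", the helper copies three characters and terminates the buffer, which is the "safe" behavior described above.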

If I seem to be dwelling on this intent thing, it is because now we come to the third aspect of the Standard string class -- the implementation. The Standard does not specify the implementation; that is up to the individual library vendors. Every implementation of anything as complex as the Standard string class will involve some element of compromise. The different vendors will make their choices based upon what they feel is important for their typical customers, on their typical platforms. In other words, the vendors will apply their own interpretation of the intent of the class. In the process, they will fill in a lot of things that the Standard does not mention. Chances are fairly good that most vendors will understand and correctly interpret the intent of the committee, but there are a lot of gray areas in the Standard (I will be touching on some of these). If your understanding of the intent of the string class differs from your vendor's, you could be in for a shock when something doesn't work the way you think it should.

I thought I would devote the rest of this column to just looking at the definition of the Standard string class and at some of the little surprises that I have run into when using Standard string. In other words, these are some examples where my expectations of a string class did not quite match those provided by the Standard string class. In certain cases, I found these annoying enough that I have created my own version of string.

There are 101 member function signatures defined for string, including the constructors, but excluding the destructor (5 of these are member templates). When you consider the default arguments, there are 129 different ways you can call the basic string member functions. Obviously, the 5 member template functions can be invoked in an infinite variety of ways. Finally, there are also 28 stand-alone functions defined as part of the string interface. That is a big class, and not surprisingly, I found that there were a few things in there that did not work quite the way I thought they would.

For example, there are member functions to append, assign, insert, and replace characters in a string. There are versions of each of these functions which take (a) another string, (b) a sub-string of another string, (c) a character buffer (represented by a const char* argument) with a length, (d) a c-style null-terminated string (also represented by a const char* argument), (e) a repeat count and a single character, and (f) an arbitrary collection of characters represented by two iterators. That seems like a pretty comprehensive assortment, but after using string for just a short while I noticed an absence -- there is no version of these functions that takes a single character. For example, if you want to append a period to a string you can write

str.append(1, '.');

or you can use the operator form (which does support individual characters)

str += '.';

This bugged me, particularly with insert(), since I continually found myself writing things like

str.insert(p, ' ');

str.insert(p, '\0');

As you would expect, the compiler would complain about these examples, but there are a couple of cases that it doesn't complain about.
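For reference, the forms that do compile with the Standard interface can be sketched as below. The helper names are made up for this example; the point is only that the offset-based overloads need an explicit repeat count of 1.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: append a period with the count-and-character form.
std::string with_period(std::string s) {
    s.append(1, '.');        // no append(char) exists; the count is required
    return s;
}

// Hypothetical helper: insert one space at offset `pos`
// (`pos` must be <= s.size()).
std::string with_space_at(std::string s, std::string::size_type pos) {
    s.insert(pos, 1, ' ');   // the offset form also needs the count of 1
    return s;
}
```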

As I mentioned above, the string class has a dual personality. Primarily, it is a sequence of characters, but there are also certain functions that support the view that it is a container of individual characters. Most of the time, this dual personality is hardly noticeable, but in certain circumstances it can lead to inconsistencies. Occasionally, these inconsistencies would drive me nuts.

I mentioned that each of the append(), assign(), insert(), and replace() functions have a version (actually a member template function) that takes two iterators to specify the new characters. This is the container viewpoint. In addition to specifying the new characters, the insert() function must specify the position where they are to be inserted. All of the "string as sequence" versions of functions specify a location in the string as a numerical offset from the beginning of the string. Thus, the first position in the string is location 0. The "string as container" versions use iterators, however. In the container viewpoint, the first location is at str.begin(). Naturally enough, the version of insert() which uses iterators to specify the new characters also uses an iterator to specify the position. Its signature is

template<class InputIterator>

void insert(iterator p, InputIterator first, InputIterator last);

So far, so good; but in addition to allowing a collection of characters to be inserted somebody decided that string should also have the standard container functions to insert a single character, and a repeated sequence of a single character. The function signatures are

iterator insert(iterator p, char c = char());

void insert(iterator p, size_type n, char c);

Now suppose I want to put a backslash at the beginning of a string. I should write

str.insert(0, 1, '\\');

but suppose I incorrectly write

str.insert(0, '\\');

There are two versions of insert() which can be called with two arguments. Only one of these takes a char argument as its second parameter, and its first parameter is an iterator. Now, assume that the string implementor has defined the string class iterator to be just a synonym for a char*. Naturally, the compiler finds that it can convert the 0 into a char* and blithely calls the container style function. The Standard is quite clear that the position iterator should be a valid iterator of the string, but it makes no requirement that implementations check the validity of the iterator (the efficiency consideration I suppose -- after some soul searching, I decided that in my own implementation of string I would attempt to validate the iterator only during debugging). In any case, I prefer compile time errors to runtime errors, and this situation happened to me often enough (twice was once too often, in my opinion) that I decided to do something about it. More on that below.
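The two intended spellings of "put a backslash at the front" can be sketched as follows. The helper names are invented here, and the typed offset variable is a choice of this sketch: it keeps a literal 0 from being considered as a null pointer on implementations where the iterator is a plain pointer.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: "string as sequence" form, using a numeric offset.
std::string escape_front_by_offset(std::string s) {
    const std::string::size_type front = 0;  // typed offset, not a bare 0
    s.insert(front, 1, '\\');
    return s;
}

// Hypothetical helper: "string as container" form, using an iterator.
std::string escape_front_by_iterator(std::string s) {
    s.insert(s.begin(), '\\');               // single-character overload
    return s;
}
```

Both helpers produce the same result; the trouble described above only arises when the offset style and the single-character container overload are mixed in one call.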

Another area of inconsistency is in the erase() function. There are three signatures defined for erase()

string& erase(size_type pos = 0, size_type n = npos);

iterator erase(iterator pos);

iterator erase(iterator first, iterator last);

In addition, there is the function clear() which takes no arguments. Now, suppose I want to just erase one character (perhaps the backslash I tried to add above). I want to write

str.erase(0); // remove first character

but that doesn't work very well. The best I can hope for is that the compiler will flag this as ambiguous, and that depends upon string::iterator being a typedef for char*. In that case a Standard-compliant compiler will report the ambiguity, but there are still a lot of older compilers which will choose one function or the other. Likewise, if string::iterator is actually a class type, then the call is not ambiguous. In these cases, either choice yields a runtime error. If I write something like

str.erase(p);

where 'p' is a position offset and not an iterator, I don't even have a hope of the compiler helping me. Because of the default arguments, instead of erasing one character, this function call actually erases all the characters from position 'p' to the end of the string. I have never actually had to track down such a bug, for the simple reason that when I discovered this little trap I decided I would remove it.
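The difference is easy to see in a small sketch. The three helper names are assumptions of this example; each takes its string by value so the caller's copy is untouched.

```cpp
#include <cassert>
#include <string>

typedef std::string::size_type size_type;

// Hypothetical helper: erase(p) with the default length removes
// everything from offset `p` to the end of the string.
std::string erase_from(std::string s, size_type p) {
    s.erase(p);
    return s;
}

// Hypothetical helper: an explicit length of 1 removes one character.
std::string erase_one_by_offset(std::string s, size_type p) {
    s.erase(p, 1);
    return s;
}

// Hypothetical helper: the iterator overload also removes one character.
std::string erase_one_by_iterator(std::string s, size_type p) {
    s.erase(s.begin() + static_cast<std::string::difference_type>(p));
    return s;
}
```

Called with "abcdef" and an offset of 2, the first helper leaves "ab" while the other two leave "abdef".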

Some of these inconsistencies are the result of trying to support two slightly different intents -- but not all of them. For example, in the "string as sequence" view a sub-string is always specified by a starting position and a length; both are of type size_type. Some functions which take arguments specifying a sub-string have defaults for both the starting position and the length. This usually means that a signature can be invoked in one of three ways

str.substr(); // the entire string

str.substr(n); // the sub-string starting at n to the end

str.substr(n, l); // the sub-string starting at n of length l

On the other hand, other functions do not have the defaults. Usually, there are two different signatures available so you can write either of these

str.append(s); // append all of s

str.append(s, n, l); // append s.substr(n, l);

but not

str.append(s, n); // append s.substr(n);

As near as I can tell, the constructor, erase(), and substr() functions have the default arguments, whereas the append(), assign(), insert(), replace(), and compare() functions do not. These inconsistencies are in the FDIS, and I found myself stumbling across them more often than I would like. The question became what could I do about them? I tried several different approaches before I settled on the one described here. Each of the other approaches I tried had its advantages, but each had the primary disadvantage that it was not portable. As we moved our code base from one platform to another, the string code would break. What I finally came up with seems obvious now, but it didn't at the time.

I defined my own version of Standard string. I called it stringx (for string-extended). It is shown in listing 1. Allow me to summarize some key requirements I had for class stringx. First, it had to be interchangeable with the Standard string class. I wanted to be able to use one of my stringx objects anywhere a string object was required. This was fairly simple: I just derived stringx from string. Since I wanted to replace certain function signatures of string I could not use public inheritance; instead I used private inheritance and added two conversion operators which will automatically convert a stringx object to either a string& or a const string&. I also wanted to be able to use a string object anywhere I expected a stringx object. This was more difficult. Class stringx has a converting constructor which will take a const string& as an argument, but this still makes a new stringx object. If I want to treat an existing string object as a stringx object, I have to do an explicit cast. In practice, this has not turned out to be a problem.

Most of the changes I made to stringx are indicated by comments in the listing: //x indicates an extension, while //c indicates a change (I probably didn't get them all). One of my biggest changes was to add the functions to support appending, assigning, inserting, and replacing a single character. I also added a constructor which would make a stringx from a single character.

Another change was to explicitly remove any default arguments for sub-strings. I decided that functions which could specify a sub-string should always require both the start position and the length. For example, with a stringx object, the function calls

strx.substr();

strx.substr(p);

both cause compile errors. Only the full version

strx.substr(p, n);

is acceptable.

The one exception to this rule is the erase() function. As I described above, I often found myself writing

str.erase(p);

when I meant to erase only the single character at position 'p'. So in stringx, the erase() function is different from string.

erase(); // disallowed, use clear() instead

erase(p); // erases the single character at 'p'

erase(p, n); // erases the sub-string from 'p' to 'p+n'

The same two forms work with iterators

erase(i1); // erases the single character at i1

erase(i1, i2); // erases the sub-string from i1 to i2

I prefer this kind of consistency.
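The offset-based rules just described can be tried out with a much smaller stand-in. The class below, mini_stringx, is hypothetical and is not the class from listing 1: it holds a std::string member instead of deriving privately from string, which sidesteps the question of how a given compiler treats conversion operators to a base class. Only erase(), clear(), substr(), and the conversion to const string& are sketched.

```cpp
#include <cassert>
#include <string>

// Hypothetical, heavily reduced stand-in for the stringx idea.
class mini_stringx {
public:
    typedef std::string::size_type size_type;

    mini_stringx(const std::string& s) : rep_(s) {}
    mini_stringx(const char* s) : rep_(s) {}

    // Lets the object be passed wherever a const std::string& is expected.
    operator const std::string&() const { return rep_; }

    // erase(p) removes one character; erase(p, n) removes n characters.
    // There is deliberately no zero-argument erase(); use clear() instead.
    mini_stringx& erase(size_type pos, size_type n = 1) {
        rep_.erase(pos, n);
        return *this;
    }

    void clear() { rep_.clear(); }

    // Both arguments are required, mirroring the no-defaults rule.
    mini_stringx substr(size_type pos, size_type n) const {
        return mini_stringx(std::string(rep_, pos, n));
    }

private:
    std::string rep_;  // the wrapped Standard string
};

// An ordinary function written against std::string.
std::string::size_type length_of(const std::string& s) {
    return s.size();
}
```

With this sketch, erasing offset 0 from a mini_stringx holding a backslash followed by "path" leaves "path", and the result can still be handed to length_of() without an explicit cast.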

I also added a number of insert() functions which take an iterator to specify the insertion position, but in truth I have never used them.

I use stringx strictly within the implementation of a function, never in its interface. This means that users of my classes and functions never have to know anything about stringx if they prefer not to. Typically, I will create a stringx object so I can manipulate it with my preferred methods, and then either return it or save it as a string. Since all of the differences between string and stringx functionality involve mutators, and since almost inevitably string objects are passed to functions as reference to constant, I almost never have reason to explicitly convert a string object reference to a stringx reference.

Since the implementation of stringx depends only on the public interface of string it is portable. It is also efficient -- stringx is implemented entirely with inline functions. The only drawback is that someone who has to maintain code written using stringx may not realize the slight, but important differences between it and string. Oh well, maintenance is always tough.

Class stringx is my solution to my problems with the definition of string. In a very large sense, these are my quirks, and are not meant to imply any real shortcomings of the Standard string class itself. Many other programmers may not notice the things I did, and may consider my version the inconsistent one. I will note, however, that at least a few of my coworkers also complained about the lack of functions which take a single character argument. Most of them seem quite happy with the stringx version.

Listing 1.

/*******************************************************************************

extended string

//$-----------------------------------------------------------------------------

Author: Jack W. Reeves

//!----|-------|-------|-------|-------|-------|-------|-------|-------|-------|

Description:

This file defines an extended string type. It adds all the extensions

that I feel ought to be in the string class.

//#-----------------------------------------------------------------------------

Notes:

//*****************************************************************************/

#ifndef STLUTIL_STRINGX_H

#define STLUTIL_STRINGX_H

#include <std/stdext>

#include <std/string>

class stringx : private string {

public:

// types

typedef string::size_type size_type;

typedef string::reference reference;

typedef string::const_reference const_reference;

typedef string::iterator iterator;

typedef string::const_iterator const_iterator;

typedef string::reverse_iterator reverse_iterator;

typedef string::const_reverse_iterator const_reverse_iterator;

static const size_type npos = -1;

// construct/copy/destroy:

explicit stringx()

: string() {}

stringx(const string& str)

: string(str) {}

stringx(const string& str, size_type pos, size_type n) //c - jwr

: string(str, pos, n) {}

stringx(const char* s, size_type n)

: string(s, n) {}

stringx(const char* s)

: string(s) {}

stringx(size_type n, char c)

: string(n, c) {}

template<class InputIterator>

stringx(InputIterator begin, InputIterator end)

: string(begin, end) {}

stringx(char c) //x - jwr

: string(1, c) {}

stringx(const stringx& str) //x - jwr (copy constructor)

: string(str) {}

~stringx() {}

stringx& operator=(const string& str)

{ return static_cast<stringx&>(string::operator=(str)); }

stringx& operator=(const char* s)

{ return static_cast<stringx&>(string::operator=(s)); }

stringx& operator=(char c)

{ return static_cast<stringx&>(string::operator=(c)); }

stringx& operator=(const stringx& str) //x - jwr (copy assignment)

{ return static_cast<stringx&>(string::operator=(str)); }

// convert to string reference

operator string&()

{ return *static_cast<string*>(this); }

operator const string&() const

{ return *static_cast<const string*>(this); }

// iterators

iterator begin()

{ return string::begin(); }

const_iterator begin() const

{ return string::begin(); }

iterator end()

{ return string::end(); }

const_iterator end() const

{ return string::end(); }

reverse_iterator rbegin()

{ return string::rbegin(); }

const_reverse_iterator rbegin() const

{ return string::rbegin(); }

reverse_iterator rend()

{ return string::rend(); }

const_reverse_iterator rend() const

{ return string::rend(); }

// capacity

size_type size() const

{ return string::size(); }

size_type length() const

{ return string::length(); }

size_type max_size() const

{ return string::max_size(); }

stringx& resize(size_type n, char c = char(0))

{ string::resize(n, c); return *this; }

size_type capacity() const

{ return string::capacity(); }

stringx& reserve(size_type res_arg = 0)

{ string::reserve(res_arg); return *this; }

void clear()

{ return string::clear(); }

bool empty() const

{ return string::empty(); }

// element access

const_reference operator[](size_type pos) const

{ return string::operator[](pos); }

reference operator[](size_type pos)

{ return string::operator[](pos); }

const_reference at(size_type n) const

{ return string::at(n); }

reference at(size_type n)

{ return string::at(n); }

// modifiers:

stringx& operator+=(const string& str)

{ return static_cast<stringx&>(string::operator+=(str)); }

stringx& operator+=(const char* s)

{ return static_cast<stringx&>(string::operator+=(s)); }

stringx& operator+=(char c)

{ return static_cast<stringx&>(string::operator+=(c)); }

stringx& append(const string& str)

{ return static_cast<stringx&>(string::append(str)); }

stringx& append(const string& str, size_type pos, size_type n)

{ return static_cast<stringx&>(string::append(str, pos, n)); }

stringx& append(const char* s, size_type n)

{ return static_cast<stringx&>(string::append(s, n)); }

stringx& append(const char* s)

{ return static_cast<stringx&>(string::append(s)); }

stringx& append(size_type n, char c)

{ return static_cast<stringx&>(string::append(n, c)); }

template<class InputIterator>

stringx& append(InputIterator first, InputIterator last)

{ return static_cast<stringx&>(string::append(first, last)); }

stringx& append(char c) //x - jwr

{ return static_cast<stringx&>(string::append(1, c)); }

stringx& assign(const string& str)

{ return static_cast<stringx&>(string::assign(str)); }

stringx& assign(const string& str, size_type pos, size_type n)

{ return static_cast<stringx&>(string::assign(str, pos, n)); }

stringx& assign(const char* s, size_type n)

{ return static_cast<stringx&>(string::assign(s, n)); }

stringx& assign(const char* s)

{ return static_cast<stringx&>(string::assign(s)); }

stringx& assign(size_type n, char c)

{ return static_cast<stringx&>(string::assign(n, c)); }

template<class InputIterator>

stringx& assign(InputIterator first, InputIterator last)

{ return static_cast<stringx&>(string::assign(first, last)); }

stringx& assign(char c) //x - jwr

{ return static_cast<stringx&>(string::assign(1, c)); }

stringx& insert(size_type pos1, const string& str)

{ return static_cast<stringx&>(string::insert(pos1, str)); }

stringx& insert(size_type pos1, const string& str,

size_type pos2, size_type n)

{ return static_cast<stringx&>(string::insert(pos1, str, pos2, n)); }

stringx& insert(size_type pos, const char* s, size_type n)

{ return static_cast<stringx&>(string::insert(pos, s, n)); }

stringx& insert(size_type pos, const char* s)

{ return static_cast<stringx&>(string::insert(pos, s)); }

stringx& insert(size_type pos, size_type n, char c)

{ return static_cast<stringx&>(string::insert(pos, n, c)); }

stringx& insert(size_type pos, char c) //x - jwr

{ return static_cast<stringx&>(string::insert(pos, 1, c)); }

stringx& insert(iterator p, const string& str) //x - jwr

{ return static_cast<stringx&>(string::insert(p-begin(), str)); }

stringx& insert(iterator p, const char* s, size_type n) //x - jwr

{ return static_cast<stringx&>(string::insert(p-begin(), s, n)); }

stringx& insert(iterator p, const char* s) //x - jwr

{ return static_cast<stringx&>(string::insert(p-begin(), s)); }

iterator insert(iterator p, char c)

{ return string::insert(p, c); }

stringx& insert(iterator p, size_type n, char c)

{ string::insert(p, n, c); return *this; }

template<class InputIterator>

stringx& insert(iterator p, InputIterator first, InputIterator last) //c - jwr

{ string::insert(p, first, last); return *this; }

stringx& erase(size_type pos, size_type n = 1) //c - jwr

{ return static_cast<stringx&>(string::erase(pos, n)); }

iterator erase(iterator position)

{ return string::erase(position); }

iterator erase(iterator first, iterator last)

{ return string::erase(first, last); }

stringx& replace(size_type pos1, size_type n1, const string& str)

{ return static_cast<stringx&>(string::replace(pos1, n1, str)); }

stringx& replace(size_type pos1, size_type n1, const string& str,

size_type pos2, size_type n2)

{ return static_cast<stringx&>(string::replace(pos1, n1, str, pos2, n2)); }

stringx& replace(size_type pos, size_type n1, const char* s, size_type n2)

{ return static_cast<stringx&>(string::replace(pos, n1, s, n2)); }

stringx& replace(size_type pos, size_type n1, const char* s)

{ return static_cast<stringx&>(string::replace(pos, n1, s)); }

stringx& replace(size_type pos, size_type n1, size_type n2, char c)

{ return static_cast<stringx&>(string::replace(pos, n1, n2, c)); }

stringx& replace(size_type pos, size_type n1, char c) //x - jwr

{ return static_cast<stringx&>(string::replace(pos, n1, 1, c)); }

stringx& replace(iterator i1, iterator i2, const string& str)

{ return static_cast<stringx&>(string::replace(i1, i2, str)); }

stringx& replace(iterator i1, iterator i2, const char* s, size_type n)

{ return static_cast<stringx&>(string::replace(i1, i2, s, n)); }

stringx& replace(iterator i1, iterator i2, const char* s)

{ return static_cast<stringx&>(string::replace(i1, i2, s)); }

stringx& replace(iterator i1, iterator i2, size_type n, char c)

{ return static_cast<stringx&>(string::replace(i1, i2, n, c)); }

stringx& replace(iterator i1, iterator i2, char c) //x - jwr

{ return static_cast<stringx&>(string::replace(i1, i2, 1, c)); }

template<class InputIterator>

stringx& replace(iterator i1, iterator i2, InputIterator j1, InputIterator j2)

{ return static_cast<stringx&>(string::replace(i1, i2, j1, j2)); }

size_type copy(char* s, size_type n, size_type pos = 0) const

{ return string::copy(s, n, pos); }

void swap(stringx& s)

{ string::swap(s); }

// string operations

const char* c_str() const

{ return string::c_str(); }

const char* data() const

{ return string::data(); }

stringx substr(size_type pos, size_type n) const //c - jwr

{ return stringx(*this, pos, n); }

// search functions

// ... details omitted

// comparison

int compare(const string& str) const

{ return string::compare(str); }

int compare(size_type pos1, size_type n1, const string& str) const

{ return string::compare(pos1, n1, str); }

int compare(const string& str, size_type pos2, size_type n2) const //x - jwr

{ return string::compare(0, size(), str, pos2, n2); }

int compare(size_type pos1, size_type n1, const string& str,

size_type pos2, size_type n2) const

{ return string::compare(pos1, n1, str, pos2, n2); }

int compare(const char* s) const

{ return string::compare(s); }

int compare(size_type pos1, size_type n1, const char* s) const

{ return string::compare(pos1, n1, s); }

int compare(const char* s, size_type n2) const //x - jwr

{ return string::compare(0, size(), s, n2); }

int compare(size_type pos1, size_type n1, const char* s, size_type n2) const

{ return string::compare(pos1, n1, s, n2); }

};

#endif