Trending
Opinion: How will Project 2025 impact game developers?
The Heritage Foundation's manifesto for the possible next administration could do great harm to many, including large portions of the game development community.
Tools, game engines, and games can benefit from safe, fast handling of strings, and in this Gamasutra technical feature, veteran game coder Joe Linhoff explains and makes available a new, low-level library to avoid issues created by standard C string functions.
This article presents a high performance approach to string handling and a low level C library. Tools, game engines, and games can benefit from safe, fast string handling. Many string libraries are based on C's standard functions. While the standard C string functions are sometimes adequate, they are often unsafe, slow, or not suited to multi-threaded programming. Often, they thwart safe, clean, efficient problem solving.
Many high level string libraries build on C's standard approach and inherit these problems. This article presents a new, low-level library to address and overcome these issues. In addition, the code for the string library has also been made available here for download.
Strings are a derived data type used to organize sets of characters. Applications of all kinds, including word processors, text editors, programming tools, and games make extensive use of strings. Most games use strings for messages, player information, configuration, scripting, and internal identification of resources.
A string is an array of characters. The most popular approach to string handling dates back to the creation of the C language.[1] This approach uses a pointer to the first character in the string and a zero to signal the end of the string.
If the first character in string is a zero, then the string is said to be empty. A non-empty string can be a single a single character such as “A”, a word like “hello”, a whole sentence, or even an entire file. Standard double quotes are used to demarcate a string and single quotes are used for individual characters.
string | characters in the string and the zero-terminator | string length |
0 | 0 | |
"A" | 'A' | 0 |
"hello" | 'h' | 'e' |
"set x 3.14" | 's' | 'e' |
The new library defines and uses the type 'chr' for a character. The library can be rebuilt for any power-of-two character size.
The new library supports standard zero-terminated C strings and is designed to co-exist with all existing code. The biggest difference between the new library and the standard C library is the addition of an optional pointer-terminator parameter to the string handling routines. The pointer-terminator is a pointer to one character past the last valid character in the string. That is, it is a pointer to the zero-terminator, or where the zero-terminator would be if it were present.
There are significant benefits to supporting pointer-terminated strings. One benefit is that pointer-terminated strings allow you to safely define words in a shared buffer without modifying the buffer. For example, if you are scanning a buffer and find a single word that you would like to operate on as an individual string, you have two options with C's standard routines: Either temporarily poke a zero into the buffer to end the word, or copy the word into a second buffer. Modifying the buffer is sometimes not possible and usually not desirable. Copying the word to a second buffer will typically involve memory allocation and freeing and always increase processing time. Both options are undesirable. With pointer-terminated strings, you avoid these problems by passing pointers to the start and end of the word.
Pointer-terminated strings also allow the programmer to find the length of a string by subtracting the starting pointer from the pointer-terminator. Furthermore, routines that scan backwards through strings are simplified when the end of the string is known.
One rule the new library follows is that buffer building functions always add or force a zero-terminator into the buffer. This helps simplify the handing of buffer overflow issues by truncating strings that do not fit entirely into the destination buffer. This also makes it easier to integrate the new library with existing code.
There are potential drawbacks to using pointer-terminated strings. One drawback is that you often don't have a pointer to the end of the string, and to get that pointer, you have to scan the string yourself. To address this drawback, the pointer-terminator may always be passed as a null pointer. When the pointer-terminator is passed as a null pointer, the routine treats the string as a standard zero-terminated string.
Another slight drawback is the performance hit incurred when passing null pointer-terminators. To address this, many of the routines have multiple versions, some of which do not take a pointer-terminator and rely on the string being zero-terminated.
Confusion over terminology can lead to hard-to-find bugs. Like the standard routines, the routines in the new library use the the term 'length' to mean the number of characters in the string. Unlike the standard routines, the term 'size' refers to the byte size of the string buffer including the terminator. For example, the string “abc” has a length of 3 and a size of 4*sizeof(chr). The size of the string “abc” includes the 3 characters and the zero-terminator.
int n;
n = szlen("abc"); // n == 3
n = szlen2("abc",NULL); // n == 3
n = szsize("abc"); // n == 4*sizeof(chr)
n = szsize2("abc",NULL); // n == 4*sizeof(chr)
Another fundamental difference between this approach and the standard string routines is that the routines in this library do not allocate or free memory. When writing fast code, you should do your best to avoid calls to system library functions -- especially ones that involve memory management. If a string needs to grow, either the application or a higher level string manager must handle the method in which to grow that string.
All routines that build buffers require a pointer to the start of the buffer, and a byte size of the whole buffer. The routines return the number of characters put into the buffer, not including the zero-terminator. The slight discontinuity between passing in the byte size of the destination buffer and returning the character count is provided to facilitate buffer manipulation.
The following example copies a zero-terminated string into a buffer:
int n;
chr buf[8];
n = szcpy(buf,sizeof(buf),"abc",NULL); // n == 3
n = szsize(buf); // n == 4*sizeof(chr)
The next example defines a string and copies part of it. The comment block below labels the character index for each character in the string. The code that follows, loads up the start pointer with a pointer to the character at index 10 and the pointer-terminator with a pointer to the character at index 27. This defines the string "buffer that might".
// character 0 1 2 3 4 5
// index 01234567890123456789012345678901234567890123456789012345678
chr bigstr[] = "This is a buffer that might be shared by multiple routines.";
int n;
chr buf[32];
chr *ss = &bigstr[10]; // start at "buffer"
chr *sx = &bigstr[27]; // end after "might"
The line of code below copies the string into the buffer. The number of characters copied into the buffer is returned. The copying routine, like all buffer building routines, guarantees that, if the size of destination buffer is not zero, it will terminate or truncate the destination string with a zero.
n = szcpy(buf,sizeof(buf),ss,sx); // n == 17, buf == "buffer that might"
Like the standard string functions, case sensitive and insensitive compare functions are provided. A few variations of these functions are provided to allow you to choose the one closest to your data. The szcmp() function compares two strings, s and t. The parameter 'ss' is a pointer to the start of string s. The parameter 'ts' is a pointer to the start of string t. The parameters 'sx' and 'tx' are the pointer-terminators for strings s and t.
int n;
chr *ss = &bigstr[2]; // start of string s
chr *sx = &bigstr[4]; // pointer-terminator for string s
chr *ts = &bigstr[5]; // start of string t
chr *tx = &bigstr[7]; // pointer-terminator for string t
n = szcmp("is",ss,sx); // n == 0 signaling strings match
n = szcmp2(ss,sx,ts,tx); // n == 0 signaling strings match
In addition to the standard forward and backward looking character routines and substring searching routines, routines that facilitate fast character scanning are provided. These routines take a pointer to a flag table and flags. These scanning routines, szfskip() and szftill(), either skip over characters or scan until they hit the first character in a character set.
The flag table is an array of integers, one integer for each character. Each integer is used to store flags for each character. The flags signal which character sets the character belongs to. The following code sets up a simple flag table for ASCII characters.
// define character sets using the bottom 24 bits
#define M_ALPHA 0x0000001
#define M_NUMERIC 0x0000002
#define NUM(_a_) (sizeof(_a_)/sizeof(_a_[0]))
int i,flagtable[128]; // an array of 128 works well for ASCII
// initialize the flagtable
for(i=0;i
flagtable[i]=0; // zero whole table
flagtable[0]=NUM(flagtable); // first entry must contain table size
// define character sets -- assume characters in 1..127 range
for(i='a';i<='z';i++) // lowercase alpha
flagtable[i]|=M_ALPHA;
for(i='A';i<='Z';i++) // uppercase alpha
flagtable[i]|=M_ALPHA;
for(i='0';i<='9';i++) // numbers
flagtable[i]|=M_NUMERIC;
Setting flag bits for each character allows the string routines to do a quick lookup and bit test to determine if that character belongs to a specific character set or sets. The top 8 flag bits are reserved for the library's internal use.
Finding words in a string buffer can be accomplished by scanning forward through the buffer until an alphabetical character is hit; this is the word start. The routine can then skip all subsequent characters until it finds the first non-alpha-numeric character; this is the end of the word. The following code shows an example of this.
// scanning
chr *ss,*sx;
ss="a bc de\n\rz\n\r"; // example buffer to scan
// scan forward till first alpha, then skip to first non-alpha-numeric
// anything in between is considered a word
ss=szftill(flagtable,ss,NULL,M_ALPHA); // scan till alpha
sx=szfskip(flagtable,ss,NULL,M_ALPHA|M_NUMERIC); // skip alpha-numerics
The above code fragment leaves 'ss' and 'sx' with pointers to the start and end of a word in the string. This word can then be mapped to a token value with szmap() as shown below.
// word start is ss, word end is sx
// map word to token
switch(szmap(tokens,ss,sx))
{
case TOKEN_A:
printf("TOKEN_A ");
break; // TOKEN_A
case TOKEN_BC:
printf("TOKEN_BC ");
break; // TOKEN_BC
case TOKEN_DE:
printf("TOKEN_DE ");
break; // TOKEN_DE
case TOKEN_Z:
printf("TOKEN_Z ");
break; // TOKEN_Z
default:
printf("* unknown token ");
break;
} // switch
The tokens use above are defined with enumerations and a corresponding string table as shown below:
enum { // token enumerations
TOKEN_RESERVED, // reserve token 0
TOKEN_A,
TOKEN_BC,
TOKEN_DE,
TOKEN_Z,
}; // TOKEN_
chr *tokens[] = { // string table
"", // reserve token 0
"a", // TOKEN_A
"bc", // TOKEN_BC
"de", // TOKEN_DE
"z", // TOKEN_Z
NULL // terminate token list
}; // tokens[]
As a note, the simple flag table shown here works well for eight-bit ASCII and may be used with a sixteen-bit character set. However, because the current routines require the size of the flag table to match the number of characters in the character set, larger character sets would be better served with a modified approach.
Converting a string to a numerical value is a common operation. The function sztobin() accomplishes this. For parameters, pass the string start, the string pointer-terminator, the conversion base, the destination type, and a pointer to the destination variable in which to store the converted value. The routine returns a pointer to the first character after the conversion. The destination type is a character code from the list below which identifies the type of the destination variable.
// destination types
// 'i' = pointer to int
// 'u' = pointer to unsigned int
// 'f' = pointer to float
// 'd' = pointer to double
For example, the following code converts a string to a floating point value:
float fval;
chr *ss;
ss = sztobin("1.23;",NULL,10,'f',&fval); // fval == 1.23 ss == ";"
Use the string formatting routine introduced below, szfmt(), to convert a numerical value into a string.
The library includes szfmt(), a sprintf-like formatting routine. The routine supports the basic sprintf-formatting commands. This routine is a buffer building routine and follows the rule that the byte size of the destination buffer is passed into the routine, and the length of the string built is returned.
The following example shows how to use szfmt() to build a buffer:
int n;
chr buf[32],*ss;
ss = "world";
n = 2007;
n = szfmt(buf,sizeof(buf),"Hello %s %d.",ss,n); // n == 17
The standard C library approach to string handling can lead to unsafe, slow, and often problematic code that may not be suited to multi-threaded programming. This article presents a new string handling library based on pointer-terminated strings to help overcome these problems.
This library can be used in C or C++ projects directly or as the foundation for a higher level class. To achieve maximum benefit, all string handling in the program should support pointer-terminated strings.
I have not tried to duplicate all the functionality present in other libraries. Rather, this represents a minimalist and easily extensible library. I have been using variations of this library in tools, projects, and games for the past five years with great results.
[1] Dennis M. Ritchie, The Development of the C Language, (Association for Computing Machinery, Inc., 1993) (http://www.cs.bell-labs.com/who/dmr/chist.html visited 7/13/2007).
when building non-zero-length buffers, always force a zero-terminator
pass in byte sizes of buffers for buffer building commands
return character lengths not including the terminator
extern int szlen(chr *ss); // number of chrs in string
extern int szlen2(chr *ss,chr *sx);
extern int szsize(chr *ss); // byte size of string and terminator
extern int szsize2(chr *ss,chr *sx);
extern int szcpy(chr *dst,int dstsize,chr *ss,chr *sx);
extern int szcmp(chr *ss,chr *ts,chr *tx); // string s is zero-terminated
extern int szcmp2(chr *ss,chr *sx,chr *ts,chr *tx);
extern chr* szfskip(int *flagtable,chr *ss,chr *sx,int flags); // skip matching
extern chr* szftill(int *flagtable,chr *ss,chr *sx,int flags); // till match
extern int szmap(chr *table[],chr *ss,chr *sx);
extern chr* sztobin(chr *ss,chr *sx,int base,chr dsttype,void *dst);
extern int szfmt(chr *dst,int dstsize,chr *fmt,...);
Read more about:
FeaturesYou May Also Like