Sponsored By

A High Performance, Low Level String Library

Tools, game engines, and games can benefit from safe, fast handling of strings, and in this Gamasutra technical feature, veteran game coder Joe Linhoff explains and makes available a new, low-level library to avoid issues created by standard C string functions.

Joe Linhoff, Blogger

August 15, 2007

17 Min Read
Game Developer logo in a gray background | Game Developer

This article presents a high performance approach to string handling and a low level C library. Tools, game engines, and games can benefit from safe, fast string handling. Many string libraries are based on C's standard functions. While the standard C string functions are sometimes adequate, they are often unsafe, slow, or not suited to multi-threaded programming. Often, they thwart safe, clean, efficient problem solving.

Many high level string libraries build on C's standard approach and inherit these problems. This article presents a new, low-level library to address and overcome these issues. In addition, the code for the string library has also been made available here for download.

Strings are a derived data type used to organize sets of characters. Applications of all kinds, including word processors, text editors, programming tools, and games make extensive use of strings. Most games use strings for messages, player information, configuration, scripting, and internal identification of resources.

A string is an array of characters. The most popular approach to string handling dates back to the creation of the C language.[1] This approach uses a pointer to the first character in the string and a zero to signal the end of the string.

If the first character in string is a zero, then the string is said to be empty. A non-empty string can be a single a single character such as ā€œAā€, a word like ā€œhelloā€, a whole sentence, or even an entire file. Standard double quotes are used to demarcate a string and single quotes are used for individual characters.

Ā 

string

characters in the string and the zero-terminator

string length

0

0

"A"

'A'

0

"hello"

'h'

'e'

"set x 3.14"

's'

'e'

Chracter Types

The new library defines and uses the type 'chr' for a character. The library can be rebuilt for any power-of-two character size.

Terminating Strings

The new library supports standard zero-terminated C strings and is designed to co-exist with all existing code. The biggest difference between the new library and the standard C library is the addition of an optional pointer-terminator parameter to the string handling routines. The pointer-terminator is a pointer to one character past the last valid character in the string. That is, it is a pointer to the zero-terminator, or where the zero-terminator would be if it were present.

There are significant benefits to supporting pointer-terminated strings. One benefit is that pointer-terminated strings allow you to safely define words in a shared buffer without modifying the buffer. For example, if you are scanning a buffer and find a single word that you would like to operate on as an individual string, you have two options with C's standard routines: Either temporarily poke a zero into the buffer to end the word, or copy the word into a second buffer. Modifying the buffer is sometimes not possible and usually not desirable. Copying the word to a second buffer will typically involve memory allocation and freeing and always increase processing time. Both options are undesirable. With pointer-terminated strings, you avoid these problems by passing pointers to the start and end of the word.

Pointer-terminated strings also allow the programmer to find the length of a string by subtracting the starting pointer from the pointer-terminator. Furthermore, routines that scan backwards through strings are simplified when the end of the string is known.

Forced Zero-Termination

One rule the new library follows is that buffer building functions always add or force a zero-terminator into the buffer. This helps simplify the handing of buffer overflow issues by truncating strings that do not fit entirely into the destination buffer. This also makes it easier to integrate the new library with existing code.

Pointer-Termination Parameters

There are potential drawbacks to using pointer-terminated strings. One drawback is that you often don't have a pointer to the end of the string, and to get that pointer, you have to scan the string yourself. To address this drawback, the pointer-terminator may always be passed as a null pointer. When the pointer-terminator is passed as a null pointer, the routine treats the string as a standard zero-terminated string.

Another slight drawback is the performance hit incurred when passing null pointer-terminators. To address this, many of the routines have multiple versions, some of which do not take a pointer-terminator and rely on the string being zero-terminated.

Lengths And Sizes

Confusion over terminology can lead to hard-to-find bugs. Like the standard routines, the routines in the new library use the the term 'length' to mean the number of characters in the string. Unlike the standard routines, the term 'size' refers to the byte size of the string buffer including the terminator. For example, the string ā€œabcā€ has a length of 3 and a size of 4*sizeof(chr). The size of the string ā€œabcā€ includes the 3 characters and the zero-terminator.

int n;

n = szlen("abc"); // n == 3

n = szlen2("abc",NULL); // n == 3

n = szsize("abc"); // n == 4*sizeof(chr)

n = szsize2("abc",NULL); // n == 4*sizeof(chr)

Memory Management And Strings

Another fundamental difference between this approach and the standard string routines is that the routines in this library do not allocate or free memory. When writing fast code, you should do your best to avoid calls to system library functions -- especially ones that involve memory management. If a string needs to grow, either the application or a higher level string manager must handle the method in which to grow that string.

Building Buffers

All routines that build buffers require a pointer to the start of the buffer, and a byte size of the whole buffer. The routines return the number of characters put into the buffer, not including the zero-terminator. The slight discontinuity between passing in the byte size of the destination buffer and returning the character count is provided to facilitate buffer manipulation.

The following example copies a zero-terminated string into a buffer:

int n;

chr buf[8];

n = szcpy(buf,sizeof(buf),"abc",NULL); // n == 3

n = szsize(buf); // n == 4*sizeof(chr)

The next example defines a string and copies part of it. The comment block below labels the character index for each character in the string. The code that follows, loads up the start pointer with a pointer to the character at index 10 and the pointer-terminator with a pointer to the character at index 27. This defines the string "buffer that might".

// character 0 1 2 3 4 5

// index 01234567890123456789012345678901234567890123456789012345678

chr bigstr[] = "This is a buffer that might be shared by multiple routines.";

int n;

chr buf[32];

chr *ss = &bigstr[10]; // start at "buffer"

chr *sx = &bigstr[27]; // end after "might"

The line of code below copies the string into the buffer. The number of characters copied into the buffer is returned. The copying routine, like all buffer building routines, guarantees that, if the size of destination buffer is not zero, it will terminate or truncate the destination string with a zero.

n = szcpy(buf,sizeof(buf),ss,sx); // n == 17, buf == "buffer that might"

Comparing Strings

Like the standard string functions, case sensitive and insensitive compare functions are provided. A few variations of these functions are provided to allow you to choose the one closest to your data. The szcmp() function compares two strings, s and t. The parameter 'ss' is a pointer to the start of string s. The parameter 'ts' is a pointer to the start of string t. The parameters 'sx' and 'tx' are the pointer-terminators for strings s and t.

int n;

chr *ss = &bigstr[2]; // start of string s

chr *sx = &bigstr[4]; // pointer-terminator for string s

chr *ts = &bigstr[5]; // start of string t

chr *tx = &bigstr[7]; // pointer-terminator for string t

n = szcmp("is",ss,sx); // n == 0 signaling strings match

n = szcmp2(ss,sx,ts,tx); // n == 0 signaling strings match

Scanning

In addition to the standard forward and backward looking character routines and substring searching routines, routines that facilitate fast character scanning are provided. These routines take a pointer to a flag table and flags. These scanning routines, szfskip() and szftill(), either skip over characters or scan until they hit the first character in a character set.

The flag table is an array of integers, one integer for each character. Each integer is used to store flags for each character. The flags signal which character sets the character belongs to. The following code sets up a simple flag table for ASCII characters.

// define character sets using the bottom 24 bits

#define M_ALPHA 0x0000001

#define M_NUMERIC 0x0000002

#define NUM(_a_) (sizeof(_a_)/sizeof(_a_[0]))

int i,flagtable[128]; // an array of 128 works well for ASCII

// initialize the flagtable

for(i=0;i

flagtable[i]=0; // zero whole table

flagtable[0]=NUM(flagtable); // first entry must contain table size

// define character sets -- assume characters in 1..127 range

for(i='a';i<='z';i++) // lowercase alpha

flagtable[i]|=M_ALPHA;

for(i='A';i<='Z';i++) // uppercase alpha

flagtable[i]|=M_ALPHA;

for(i='0';i<='9';i++) // numbers

flagtable[i]|=M_NUMERIC;

Setting flag bits for each character allows the string routines to do a quick lookup and bit test to determine if that character belongs to a specific character set or sets. The top 8 flag bits are reserved for the library's internal use.

Scanning For Words

Finding words in a string buffer can be accomplished by scanning forward through the buffer until an alphabetical character is hit; this is the word start. The routine can then skip all subsequent characters until it finds the first non-alpha-numeric character; this is the end of the word. The following code shows an example of this.

// scanning

chr *ss,*sx;

ss="a bc de\n\rz\n\r"; // example buffer to scan

// scan forward till first alpha, then skip to first non-alpha-numeric

// anything in between is considered a word

ss=szftill(flagtable,ss,NULL,M_ALPHA); // scan till alpha

sx=szfskip(flagtable,ss,NULL,M_ALPHA|M_NUMERIC); // skip alpha-numerics

The above code fragment leaves 'ss' and 'sx' with pointers to the start and end of a word in the string. This word can then be mapped to a token value with szmap() as shown below.

// word start is ss, word end is sx

// map word to token

switch(szmap(tokens,ss,sx))

{

case TOKEN_A:

printf("TOKEN_A ");

break; // TOKEN_A

case TOKEN_BC:

printf("TOKEN_BC ");

break; // TOKEN_BC

case TOKEN_DE:

printf("TOKEN_DE ");

break; // TOKEN_DE

case TOKEN_Z:

printf("TOKEN_Z ");

break; // TOKEN_Z

default:

printf("* unknown token ");

break;

} // switch

The tokens use above are defined with enumerations and a corresponding string table as shown below:

enum { // token enumerations

TOKEN_RESERVED, // reserve token 0

TOKEN_A,

TOKEN_BC,

TOKEN_DE,

TOKEN_Z,

}; // TOKEN_

chr *tokens[] = { // string table

"", // reserve token 0

"a", // TOKEN_A

"bc", // TOKEN_BC

"de", // TOKEN_DE

"z", // TOKEN_Z

NULL // terminate token list

}; // tokens[]

Flag Tables And Large Character Sets

As a note, the simple flag table shown here works well for eight-bit ASCII and may be used with a sixteen-bit character set. However, because the current routines require the size of the flag table to match the number of characters in the character set, larger character sets would be better served with a modified approach.

Numerical Conversions

Converting a string to a numerical value is a common operation. The function sztobin() accomplishes this. For parameters, pass the string start, the string pointer-terminator, the conversion base, the destination type, and a pointer to the destination variable in which to store the converted value. The routine returns a pointer to the first character after the conversion. The destination type is a character code from the list below which identifies the type of the destination variable.

// destination types

// 'i' = pointer to int

// 'u' = pointer to unsigned int

// 'f' = pointer to float

// 'd' = pointer to double

For example, the following code converts a string to a floating point value:

float fval;

chr *ss;

ss = sztobin("1.23;",NULL,10,'f',&fval); // fval == 1.23 ss == ";"

Use the string formatting routine introduced below, szfmt(), to convert a numerical value into a string.

Formatting

The library includes szfmt(), a sprintf-like formatting routine. The routine supports the basic sprintf-formatting commands. This routine is a buffer building routine and follows the rule that the byte size of the destination buffer is passed into the routine, and the length of the string built is returned.

The following example shows how to use szfmt() to build a buffer:

int n;

chr buf[32],*ss;

ss = "world";

n = 2007;

n = szfmt(buf,sizeof(buf),"Hello %s %d.",ss,n); // n == 17

Conclusion

The standard C library approach to string handling can lead to unsafe, slow, and often problematic code that may not be suited to multi-threaded programming. This article presents a new string handling library based on pointer-terminated strings to help overcome these problems.

This library can be used in C or C++ projects directly or as the foundation for a higher level class. To achieve maximum benefit, all string handling in the program should support pointer-terminated strings.

I have not tried to duplicate all the functionality present in other libraries. Rather, this represents a minimalist and easily extensible library. I have been using variations of this library in tools, projects, and games for the past five years with great results.

Reference

[1] Dennis M. Ritchie, The Development of the C Language, (Association for Computing Machinery, Inc., 1993) (http://www.cs.bell-labs.com/who/dmr/chist.html visited 7/13/2007).

Buffer Building Rules The Library Routine Follows

  • when building non-zero-length buffers, always force a zero-terminator

  • pass in byte sizes of buffers for buffer building commands

  • return character lengths not including the terminator

Prototypes For Functions Used In This Article

extern int szlen(chr *ss); // number of chrs in string

extern int szlen2(chr *ss,chr *sx);

extern int szsize(chr *ss); // byte size of string and terminator

extern int szsize2(chr *ss,chr *sx);

extern int szcpy(chr *dst,int dstsize,chr *ss,chr *sx);

extern int szcmp(chr *ss,chr *ts,chr *tx); // string s is zero-terminated

extern int szcmp2(chr *ss,chr *sx,chr *ts,chr *tx);

extern chr* szfskip(int *flagtable,chr *ss,chr *sx,int flags); // skip matching

extern chr* szftill(int *flagtable,chr *ss,chr *sx,int flags); // till match

extern int szmap(chr *table[],chr *ss,chr *sx);

extern chr* sztobin(chr *ss,chr *sx,int base,chr dsttype,void *dst);

extern int szfmt(chr *dst,int dstsize,chr *fmt,...);

Read more about:

Features

About the Author

Joe Linhoff

Blogger

Joe Linhoff has been a professional game developer for over 25 years. He was worked as a game developer in Chicago, San Diego, and Minneapolis. Published titles he has worked on include Big Buck Hunter Pro, Johnny Nero Action Hero, Big Buck Hunter, Indigo Swing, Invasion: The Abductors, War Gods, Dungeon Master: Chaos Strikes Back, Stickers, Certificate Maker, and others. He is now a professor of game development at DePaul University in Chicago.

Daily news, dev blogs, and stories from Game Developer straight to your inbox

You May Also Like