In-place vector modification with standard C

May 09, 2014
Recently I decided to do a quick search on stackoverflow to see what was considered the “best” way to remove tokens from a string. There were many suggestions, from elaborate C functions to some more minimal C++ functions utilizing the standard containers and algorithms. Surprisingly, there was not much consideration giving to the removal of tokens in-place.  for (char *tmp = str; (tmp = strchr(tmp, ‘+’)) && (str = tmp); tmp = str) while (*tmp++=*(tmp+1)); Not only does this algorithm get the job done in a minimal two lines of code (well actually, effectively the string modification/removal occurs in one line), but it also has the added benefit of completing the job in-place, meaning there is no additional buffer used to complete the task. This is very desirable within resource constrained environments such as embedded systems, or in real time systems, where dynamic allocation of memory (or unnecessary stack modification) can be considered an impediment to stability. To be fair, this algorithm does have its downfalls; it can become embarrassingly expensive on longer strings containing a larger number of occurrences of the sought after token. For example, we can represent the average performance of this algorithm as such: (len/2)*x*y where len = string length (divided by two to represent an average distribution of occurrences of x throughout), x = number of occurrences, and y equals the length of the sought token (this mechanism can be modified to suit the removal of substrings as well) which in this example will be a constant of 1 and thus can be negated. so for an example string of 2000 chars with 3 occurrences of a sought token, we have: (2000/2)*3 = 3000 total characters copied. vs a linear copy to a double buffer involving 2000 characters copied. However don’t forget to consider the additional performance ramifications of memory allocation, adjustment of memory pointers (or possibly even worse, an additional copy of the entire string back into its original buffer) or the fact that in certain situations you simply just don’t have the memory to spare. Obviously, as buffer size increases along with token occurrences this is not the most ideal implementation. However, to demonstrate the contrary, if there is only one occurrence of the sought token existing within the string, then on average (no matter the length of the string) this implementation will perform 50 percent better than the alternative. For this sort of usage, not only is this algorithm the most aesthetically pleasing and resource optimal, but it also the yields the highest performance. NOTE: Don’t fear the syntax 🙂 As long as the original string ‘str’ is nul-terminated, the result will also be.   Marcus Sokes

No comments

Leave a Reply

Your email address will not be published. Required fields are marked *