String

String Class

Objects of String class contain arrays of ASCII characters. The value of a String object is organized like a C language string: a NUL-terminated array of char values.

In most cases, String objects can be used like a collection of Character objects. The overloaded operators ++, --, +, and - all work similarly to the operators in List or AssociativeArray objects.

Some of String class’s methods add semantics to operators. The += method, for example, behaves differently depending on whether its argument is another String object or an Integer object.


myString = "Hello, ";    /* The resulting value is, */
myString += "world!";    /* "Hello, world!"         */

myString = "Hello, ";    /* The resulting value is, */
myString += 3;           /* "lo, "                  */

The main exception to this is the map method, which doesn’t allow incrementing self within an argument block. This is because String objects don’t use Key objects internally to order a String object’s individual Character objects. If it’s necessary to treat a String object as a collection, the asList method will organize the receiver String into a List of Character objects.

Conversely, Array and List classes contain the asString method, which translates an Array or List into a String object.

In addition, methods like matchRegex, =~, and !~ can accept as arguments strings that contain regular expression metacharacters and use them to perform regular expression matches on the receiver String. See Pattern Matching.

Instance Variables

value

The value is a pointer to the character string.

Instance Methods

* (void)

When used as a prefix operator, overloads C’s ‘*’ dereference operator and returns the first element of the receiver, a Character object.

= (char *s)

Set the value of the receiver object to s.

== (char *s)

Return TRUE if s and the receiver are identical, FALSE otherwise.

=~ (char *pattern)

Returns a Boolean value of true if the receiver contains the regular expression pattern, false otherwise. See Pattern Matching.

!~ (char *pattern)

Returns a Boolean value of true if the receiver does not contain the argument, pattern, which may contain regular expression metacharacters, false otherwise. See Pattern Matching.

!= (char *s)

Return TRUE if s and the receiver are not identical, FALSE otherwise.

+ (String s)
+ (Integer i)

If the argument is a String, concatenate the receiver and s and return the new String. If the argument is an Integer, return a reference to the receiver plus i.

++ (void)

Increment the value of the receiver as a char *. This method uses __ctalkIncStringRef () to handle the pointer math.

In other words, this method effectively sets the receiver String's value from, for example, ‘Hello, world!’ to ‘ello, world!’. If the receiver is incremented to the end of its contents, then its value is NULL.

+= (String s)
+= (Integer i)

If the argument is an Integer, increment the reference to the receiver by that amount. If the argument is a String or an object of any other class, concatenate the argument to the receiver and return the receiver, formatting the argument as a string first if necessary.

- (Integer i)

Return a reference to the receiver String minus i. If the reference is before the start of the string, return NULL. That means the method is only effective after a call to ++ or a similar method.


String new str;

str = "Hello, world!";

str += 1;

printf ("%s\n", str);    /* Prints, "ello, world!" */

--str;

printf ("%s\n", str);    /* Prints, "Hello, world!" */


-- (void)

Decrement the value of the receiver as a char *. The effect is the converse of ++, above. The method doesn’t decrement the reference so that it points before the beginning of the String object’s contents. That means, like - above, the method only returns a pointer to somewhere in the receiver’s value after a previous call to ++ or a similar method. For example,


String new str;

str = "Hello, world!";

++str;

printf ("%s\n", str);    /* Prints, "ello, world!" */

--str;

printf ("%s\n", str);    /* Prints, "Hello, world!" */


-= (Integer i)

Decrement the reference to the receiver’s value by i, an Integer. Like the other methods that decrement the reference to the receiver’s value, this method has an effect only after the program has incremented the reference past the start of the string.
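The reference arithmetic that ++, --, +=, and -= perform can be pictured in plain C. The following is only a sketch of the behavior described above, not Ctalk's implementation. The StrRef type and the str_ref_* helpers are hypothetical names, and leaving the reference unchanged when a decrement would move it before the start of the text is an assumption based on the description of --.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a String's value: 'base' is the start of the
   character data and 'cur' is the position the reference points to. */
typedef struct {
  const char *base;
  const char *cur;
} StrRef;

/* Start with the reference at the beginning of the text. */
static StrRef str_ref_new (const char *text) {
  StrRef r = { text, text };
  return r;
}

/* Like "str += n": advance the reference by n characters.  Advancing
   to or past the end of the text yields NULL, as described for ++. */
static void str_ref_inc (StrRef *r, size_t n) {
  if (r->cur == NULL)
    return;
  if (n >= strlen (r->cur))
    r->cur = NULL;
  else
    r->cur += n;
}

/* Like "str -= n": move the reference back by n characters, but never
   before the start of the text.  Assumption: the reference is left
   unchanged in that case.  A NULL reference also stays NULL. */
static void str_ref_dec (StrRef *r, size_t n) {
  if (r->cur == NULL)
    return;
  if ((size_t)(r->cur - r->base) >= n)
    r->cur -= n;
}
```

With a reference created from "Hello, world!", str_ref_inc with 1 leaves cur at "ello, world!", and a following str_ref_dec with 1 restores "Hello, world!". A second decrement has no effect because the reference is already at the start.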

asExpanded (void)

Return the expanded directory path for a directory glob pattern contained in the receiver.

asInteger (void)

Return an Integer object with the value of the receiver.

asList (List newList)

Store each character of the receiver String as Character object members of newList.

at (int index)

Return the character at index. The first character of the string is at index 0. If index is greater than the length of the string, return ‘NULL’.

atPut (int n, char c)

Replace the n’th character of the receiver with c. Has no effect and returns NULL if n is greater than the length of the receiver.

The atPut method interprets the following character sequences (shown with their ASCII values).


Sequence   ASCII Value
\0         0
\a         7
\b         8
\n         10
\e         27
\f         12
\r         13
\t         9
\v         11

The ‘\e’ escape sequence is an extension to the C language standard.
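The escape sequences that standard C shares with this list have fixed values on ASCII systems, so a small C lookup can serve as a cross-check. This is a sketch only: escape_table and escape_value are hypothetical names, not part of Ctalk, and ‘\e’ is written as the number 27 because it is not standard C.

```c
#include <stddef.h>

/* One entry per escape sequence: the character that follows the
   backslash, and the character value the sequence stands for. */
struct escape_row {
  char letter;
  int value;
};

static const struct escape_row escape_table[] = {
  { '0', '\0' }, { 'a', '\a' }, { 'b', '\b' }, { 'n', '\n' },
  { 'e', 27 },   /* '\e' is an extension, so the value is given directly. */
  { 'f', '\f' }, { 'r', '\r' }, { 't', '\t' }, { 'v', '\v' }
};

/* Return the character value for the escape letter, or -1 if the
   letter is not in the table. */
static int escape_value (char letter) {
  size_t i;
  for (i = 0; i < sizeof escape_table / sizeof escape_table[0]; ++i) {
    if (escape_table[i].letter == letter)
      return escape_table[i].value;
  }
  return -1;
}
```

On an ASCII system, escape_value ('b') returns 8 and escape_value ('f') returns 12.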

The method returns the receiver (with the new value) if successful.

You should note that the method does not do any conversion of the argument; that is, if c isn’t a Character object, then the results are probably not going to be what you want. For example, if you try to store an Integer in a String, like this:


myInt = 1;

myString atPut 0, myInt + '0';

This doesn’t work as intended: adding ASCII ‘'0'’ doesn’t convert myInt to a Character object. You still need to use the asCharacter method from Magnitude class to create a Character object, as in this example.


myInt = 1;

myString atPut 0, (myInt + '0') asCharacter;

The parentheses in the second argument are necessary; otherwise, asCharacter would use ‘'0'’ as its receiver because asCharacter, which is a method message, has a higher precedence than ‘+’. Instead, asCharacter's receiver should be the value of ‘myInt + '0'’, so we enclose the first part of the expression in parentheses so that it gets evaluated first.

callStackTrace (void)

Print a call stack trace.

charPos (char c)

Return an Integer with the position of c in the receiver. Returns an Integer between 0 (the first character) and the receiver’s length, minus one (the last character). If the receiver does not contain c, returns -1.

charPosR (char c)

Return an Integer with the position of the last occurrence of c in the receiver. Returns an Integer between 0 (the first character) and the receiver’s length, minus one (the last character). If the receiver does not contain c, returns -1.
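For readers who think in C, the return values of charPos and charPosR resemble what the following helpers compute with strchr(3) and strrchr(3). The names char_pos and char_pos_r are hypothetical, and this is a sketch of the documented results rather than the methods' actual implementation.

```c
#include <string.h>

/* Index of the first occurrence of c in s, or -1 if s does not
   contain c.  The terminating NUL is not treated as a match. */
static int char_pos (const char *s, char c) {
  const char *p = strchr (s, c);
  return (p != NULL && c != '\0') ? (int)(p - s) : -1;
}

/* Index of the last occurrence of c in s, or -1 if s does not
   contain c. */
static int char_pos_r (const char *s, char c) {
  const char *p = strrchr (s, c);
  return (p != NULL && c != '\0') ? (int)(p - s) : -1;
}
```

For the text "Hello, world!", char_pos with 'o' returns 4 and char_pos_r with 'o' returns 8. Both return -1 for 'z'.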

chomp (void)

Removes a trailing newline character (‘\n’) if the receiver contains one. Named after Perl’s very useful string trimming function.
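In C terms, the effect is roughly that of the helper below. The name chomp_line is hypothetical, and removing at most one newline per call is an assumption based on the description above.

```c
#include <string.h>

/* Remove one trailing '\n' from s in place, if there is one.
   Returns s so that calls can be nested. */
static char *chomp_line (char *s) {
  size_t len = strlen (s);
  if (len > 0 && s[len - 1] == '\n')
    s[len - 1] = '\0';
  return s;
}
```

Applied to a buffer that holds "quit\n", chomp_line leaves "quit". A second call changes nothing.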

consoleReadLine (String promptStr)

Print the promptStr on the terminal and wait for the user to enter a line of text. If Ctalk is built with the GNU readline libraries, adds readline’s standard line editing and command history facilities. In that case, Ctalk also defines the HAVE_GNU_READLINE preprocessor definition to ‘1’. You can build Ctalk with or without readline; see the options to ./configure for further information.

Here is a sample program that shows how to use consoleReadLine.

int main (int argc, char **argv) {
  String new s;
  String new promptStr;

  if (argc > 1)
    promptStr = argv[1];
  else
    promptStr = "Prompt ";

  printf ("Readline test.  Type ^C or, \"quit,\" to exit.\n");
#if HAVE_GNU_READLINE
  printf ("Ctalk built with GNU Readline Support.\n");
#else
  printf ("Ctalk built without GNU Readline Support.\n");
#endif
  while (1) {
    s consoleReadLine promptStr;
    printf ("You typed (or recalled), \"%s.\"\n", s);
    /*
     *  Matches both, "quit," and, "quit\n."
     */
    if (s match "quit")
      break;
  }
}

contains (String pattern)
contains (String pattern, Integer starting_offset)

Returns a Boolean value of True if the receiver string contains an exact match of the text in pattern, False otherwise.

With a second argument, starting_offset, an Integer, the method begins its search from that character position in the receiver string.
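The two forms of contains can be compared with an exact-text search in C that starts at a given index. The helper contains_from is a hypothetical name, and returning 0 when the offset lies past the end of the text is an assumption.

```c
#include <string.h>

/* Return 1 if text contains pattern as exact text at or after the
   character index 'start', 0 otherwise. */
static int contains_from (const char *text, const char *pattern,
                          size_t start) {
  if (start > strlen (text))
    return 0;
  return strstr (text + start, pattern) != NULL;
}
```

Searching "Hello, world!" for "world" succeeds with a start of 0 or 7, and fails with a start of 8 because the search begins after the 'w'.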

envVarExists (char *envVarName)

Test for the presence of an environment variable. Return TRUE if the variable exists, FALSE otherwise.

eval (void)

Evaluate the content of the receiver String’s value as if it were an argument to eval.

getEnv (char *envVarName)

Return the value of environment variable envVarName as the value of the receiver, or (null). Note that this method generates an internal exception if the environment variable does not exist. To test for the presence of an environment variable without generating an exception, see envVarExists, above.
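The same test-before-read pattern looks like this with C's getenv(3). The helper names env_var_exists and env_or_default are hypothetical, and the sketch only illustrates the idea of checking for a variable before using its value.

```c
#include <stdlib.h>

/* Return 1 if the environment variable is set, 0 otherwise. */
static int env_var_exists (const char *name) {
  return getenv (name) != NULL;
}

/* Return the variable's value, or 'fallback' if it is not set. */
static const char *env_or_default (const char *name,
                                   const char *fallback) {
  const char *value = getenv (name);
  return value != NULL ? value : fallback;
}
```

A caller can check env_var_exists first and read the value only when the check succeeds, or use env_or_default to supply a fallback in one step.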

getRS (void)

Returns a Character with the current record separator.

The record separator determines whether the regular expression metacharacters ‘^’ and ‘$’ recognize line endings. The default value of the record separator is a newline ‘\n’ character, which means that a ‘^’ character will match an expression at the start of a string, or starting at the beginning of a text line. Likewise, a ‘$’ metacharacter matches both the end of a line and the end of the string.

To match only at the beginning and end of the string, set the record separator to a NUL character (‘\0’). See Pattern Matching.

isXLFD (void)

Returns a Boolean value of True if the receiver is an XLFD font descriptor, False otherwise. For more information about font selection, refer to the X11Font class (see X11Font) and the X11FreeTypeFont class (see X11FreeTypeFont).

length (void)

Return an object of class Integer with the length of the receiver in characters.

map (OBJECT *(*method)())

Execute method, an instance method of class String, for each character of the receiver object. For example,


String instanceMethod printSpaceChar (void) {
  printf (" %c", self);  /* Here, for each call to the printSpaceChar
                             method, "self" is each of myString's
                             successive characters. */
}

int main () {

  String new myString;

  myString = "Hello, world!";

  myString map printSpaceChar;

  printf ("\n");
}

The argument to map can also be a code block:


int main () {

  String new myString;

  myString = "Hello, world!";

  myString map {
    printf (" %c", self);
  }

  printf ("\n");
}

match (char *pattern)

Returns TRUE if pattern matches the receiver String regardless of case, FALSE otherwise. Both match and matchCase, below, are being superseded by matchRegex and quickSearch, also below.

matchAt (Integer idx)

Returns the text of the idx’th parenthesized match resulting from a previous call to matchRegex, =~, or !~. See Pattern Matching.

matchCase (char *pattern)

Returns TRUE if pattern matches the receiver case-sensitively, FALSE otherwise. Like match, above, matchCase is being superseded by matchRegex and quickSearch, below.

matchIndexAt (Integer idx)

Returns the character position in the receiver String of the idx’th parenthesized match resulting from a previous call to matchRegex, =~, or !~. See Pattern Matching.

matchLength (void)

Returns the length of a regular expression match from the previous call to the matchRegex method, below.

matchRegex (String pattern, Array offsets)

Searches the receiver, a String object, for all occurrences of pattern. The matchRegex method places the positions of the matches in the offsets array, and returns an Integer that contains the number of matches. See Pattern Matching.

The quickSearch method, below, matches exact text only, but it uses a much faster search algorithm.

nMatches (void)

Returns an Integer with the number of matches from the last call to the matchRegex method.

printMatchToks (Integer yesNo)

If the argument is non-zero, print the tokens of regular expression patterns and the matching text after each regular expression match. This can be useful when debugging regular expressions. See DebugPattern.

printOn (char *fmt, ...)

Format and print the method’s arguments to the receiver.

quickSearch (String pattern, Array offsets)

Searches the receiver, a String object, for all occurrences of pattern. The quickSearch method places the positions of the matches in the offsets array, and returns an Integer that contains the number of matches.

Unlike matchRegex, above, quickSearch matches exact text only, but it uses a much faster search algorithm.
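The shape of the result (a list of match positions, and a count as the return value) can be sketched in C with strstr(3). This is not the fast algorithm that quickSearch uses. The name exact_search is hypothetical, the terminating -1 follows the description of the offsets array in the pattern matching section, and resuming the search after the end of each match is an assumption.

```c
#include <string.h>

/* Record the index of each occurrence of pattern in text, followed
   by a terminating -1, and return the number of matches.  The
   offsets array must have room for max_matches + 1 entries.
   Assumption: the search resumes after the end of each match, so
   matches never overlap. */
static int exact_search (const char *text, const char *pattern,
                         int *offsets, int max_matches) {
  size_t pat_len = strlen (pattern);
  const char *p = text;
  int n = 0;

  if (pat_len > 0) {
    while (n < max_matches && (p = strstr (p, pattern)) != NULL) {
      offsets[n++] = (int)(p - text);
      p += pat_len;
    }
  }
  offsets[n] = -1;
  return n;
}
```

Searching "Hello, world! Hello, world, Hello, world!" for "llo" returns 3 and stores 2, 16, 30, and -1 in the offsets array.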

readFormat (char *fmt, ...)

Scan the receiver into the arguments, using fmt.

search (String pattern, Array offsets)

This method is a synonym for matchRegex, above, and is here for backward compatibility.

setRS (char record_separator_char)

Sets the current application’s record separator character, which determines how regular expression metacharacters match line endings, among other uses. See RecordSeparator. See Pattern Matching.

split (char delimiter, char ** resultArray)

Split the receiver at each occurrence of delimiter, and save the result in resultArray. The delimiter argument can be either a Character object or a String object. If delimiter is a String, it uses Ctalk’s pattern matching library to match the delimiter string. See Pattern Matching.

However, the pattern matching library only records the length of the last match, so if you use a pattern like ‘" *"’, the results may be inaccurate unless all of the delimiters are the same length.

subString (int index, int length)

Return the substring of the receiver of length characters beginning at index. String indexes start at 0. If index + length is greater than the length of the receiver, return the substring from index to the end of the receiver.
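The clamping rule can be written out in C as follows. The helper sub_string is a hypothetical name, and returning NULL when index lies past the end of the text is an assumption, because the description above does not cover that case.

```c
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated copy of up to 'length' characters of s,
   starting at 'index'.  If the request runs past the end of s, the
   copy stops at the end of s.  Returns NULL if index is past the end
   of s (an assumption) or if allocation fails.  The caller frees the
   result. */
static char *sub_string (const char *s, size_t index, size_t length) {
  size_t s_len = strlen (s);
  char *result;

  if (index > s_len)
    return NULL;
  if (length > s_len - index)
    length = s_len - index;
  result = malloc (length + 1);
  if (result == NULL)
    return NULL;
  memcpy (result, s + index, length);
  result[length] = '\0';
  return result;
}
```

For example, sub_string ("strlen ()", 0, 6) yields "strlen", and sub_string ("Hello", 3, 10) yields "lo" because only two characters remain after index 3.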

sysErrnoStr (void)

Sets the receiver’s value to the text message of the last system error (the value of errno(3)).

tokenize (List tokens)

Splits the receiver String at each whitespace character or characters (spaces, horizontal and vertical tabs, or newlines) and pushes each non-whitespace set of characters (words, numbers, and miscellaneous punctuation) onto the List given as the argument. The method uses ispunct(3) to separate punctuation, except for ‘_’ characters, which are used in labels.

Note that this method can generate lists with hundreds or even thousands of tokens, so you need to take care with large (or even medium sized) input Strings as receivers.

tokenizeLine (List tokens)

Similar to tokenize, above. This method also treats newline characters as tokens, which makes it easier to parse input that relies on newlines (for example, C++ style comments, preprocessor directives, and some types of text files).

vPrintOn (String calling_methods_fmt_arg)

This function formats the variable arguments of its calling method on the receiver String object.

The argument is the format argument of the calling method. When vPrintOn is called, it uses the argument as the start of the caller’s variable argument list.

Here is an example of vPrintOn's use.


Object instanceMethod myPrint (String fmt, ...) {
  String new s;
  s vPrintOn fmt;
  return s;
}

int main () {
  Object new obj;
  Integer new i;
  String new str;

  i = 5;

  str = obj myPrint "Hello, world no. %d", i;

  printf ("%s\n", str);
}

writeFormat (char *fmt, ...)

Write the formatted arguments using fmt to the receiver. Note that Ctalk stores scalar types as formatted strings. See Variable arguments.

String Searching and Pattern Matching

String class defines a number of methods for searching and matching String objects. The matchRegex method recognizes some basic metacharacters to provide regular expression search capabilities. The quickSearch method searches String objects for exact text patterns, but it uses a much faster search algorithm.

The operators, =~ and !~ return true or false depending on whether the receiver contains the pattern given as the argument. If the argument contains metacharacters, then Ctalk conducts a regular expression search; otherwise, it tries to match (or not match, in the case of !~) the receiver and the pattern exactly.

If you want more thorough information about the search, the matchRegex and quickSearch methods allow an additional argument after the text pattern: an Array object that the methods use to return the character positions of the matches within the receiver. After the method is finished searching, the second argument contains the position of the first character wherever the text pattern matched text in the receiver. The last offset is ‘-1’, indicating that there are no further matches. The methods also return an Integer object that contains the number of matches.

Here is an example from LibrarySearch class that contains the additional ‘offsets’ argument.


if ((inputLine match KEYPAT) && 
       (inputLine matchRegex (pattern, offsets) != 0)) {

...

}

Searches can provide even more information than this, however. Pattern strings may contain backreferences, which save the text and position of any of the receiver string’s matched text that the program needs. The sections just below describe backreferences in detail.

All of these methods (except quickSearch) recognize a few regular expression metacharacters. They are:

.

Matches any single character.

^

Matches text at the beginning of the receiver String's text, or at the beginning of a line when the record separator is a newline.

$

Matches text at the end of the receiver String's text, or the end of a line (that is, the character before a ‘\n’ or ‘\r’ newline character).

*

Matches zero or more occurrences of the character or expression it follows.

+

Matches one or more occurrences of the character or expression it follows.

?

Matches zero or one occurrence of the character or expression it follows.

\

Escapes the next character so it is interpreted literally; e.g., the sequence ‘\*’ is interpreted as a literal asterisk. Because Ctalk’s lexical analysis also processes backslash escapes, you need to type ‘\\’ if you want a backslash to appear in a pattern. For example,


myPat = "\\*";   /* The '\\' tells Ctalk's lexer that we really
                    want a '\' to appear in the pattern string,
                    so it will still be there when we use myPat
                    as a regular expression. */

However, Ctalk also recognizes patterns, which only need to be evaluated by the regular expression parser. Patterns do not get checked immediately for things like balanced quotes and ASCII escape sequences; instead, they get evaluated by the regular expression parser when the program actually tries to perform some pattern matching. Otherwise, patterns are identical to Strings. Expressed as a pattern, myPat in the example above would look like this.


myPat = /\*/;

Pattern strings are described in their own section, below. See Pattern Strings.

(
)

Begin and end a match reference (i.e., a backreference). Matched text between ‘(’ and ‘)’ is saved, along with its position in the receiver String, and can be retrieved with subsequent calls to the matchAt and matchIndexAt methods. The match information is saved until the program performs another pattern match.

\W
\d
\p
\w
\l
\x

In patterns, these escape sequences match characters of different types. The escape sequences have the following meanings.


Character Class      Matches
---------------      ------
\W                   'Word' Characters (A-Z, a-z)
\d                   Decimal Digits (0-9)
\w                   White Space (space, \t, \n, \f, \v)
\p                   Punctuation (Any other character.)
\l                   'Label' Characters (A-Z, a-z, 0-9, and _)
\x                   Hexadecimal Digits (0-9, a-f, A-F, x, and X)
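One way to read the table is as a set of predicates over single characters. The C sketch below uses the <ctype.h> classifiers. The function in_char_class is a hypothetical name, not part of Ctalk, and using ispunct(3) for \p is an assumption, since the table describes that class only as "any other character."

```c
#include <ctype.h>

/* Return 1 if character c belongs to the class named by the letter
   that follows the backslash (for example, 'd' for \d), 0 otherwise. */
static int in_char_class (char class_letter, char c) {
  unsigned char uc = (unsigned char) c;

  switch (class_letter) {
  case 'W':   /* 'Word' characters */
    return isalpha (uc) != 0;
  case 'd':   /* Decimal digits */
    return isdigit (uc) != 0;
  case 'w':   /* White space */
    return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\v';
  case 'l':   /* 'Label' characters */
    return isalnum (uc) || c == '_';
  case 'x':   /* Hexadecimal digits, plus x and X */
    return isxdigit (uc) || c == 'x' || c == 'X';
  case 'p':   /* Punctuation; ispunct() is an approximation. */
    return ispunct (uc) != 0;
  default:
    return 0;
  }
}
```

In the default C locale, in_char_class ('l', '_') returns 1, while in_char_class ('W', '_') returns 0, which is the difference between label characters and word characters.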

The following program contains a pattern that looks for alphabetic characters, punctuation, and whitespace.


int main (int argc, char **argv) {
  String new str;

  str = "Hello, world!";

  if (str =~ /e(\W*\p\w*\W)/) {
    printf ("match - %s\n", str matchAt 0);
  }
}

When run, the expression,


str =~ /e(\W*\p\w*\W)/

produces the following output.


match - llo, w

|

Matches either of the expressions on each side of the ‘|’. The expressions may be either a character expression, or a set of characters enclosed in parentheses. Here are some examples of alternate patterns.

a|b
a*|b*
a+|b+
\W+|\d+
(ab)|(cd)

When matching alternate expressions, using ‘*’ in the expressions can produce unexpected results because a ‘*’ can provide a zero-length match, and the ‘|’ metacharacter is most useful when there is some text to be matched.

If one or both expressions are enclosed in parentheses, then the expression that matches is treated as a backreference, and the program can retrieve the match information with the matchAt and matchIndexAt methods.

The following example shows how to use some of the matching features in an actual program. This program saves the first non-label character (either a space or parenthesis) of a function declaration, and its position, so we can retrieve the function name and display it separately.


int main () {
  String new text, pattern, fn_name;
  List new fn_list;

  fn_list = "strlen ()", "strcat(char *)", "strncpy (char *)",
    "stat (char *, struct stat *)";

  /* Match the first non-label character: either a space or a
     parenthesis.  The double backslashes cause the content of
     'pattern' (after the normal lexical analysis for the string) to
     be,
     
       "( *)\("

     So the regular expression parser can check for a backslashed
     opening parenthesis (i.e., a literal '(', not another
     backreference delimiter).
  */

  pattern = "( *)\\(";

  fn_list map {
    if (self =~ pattern) {
      printf ("Matched text: \"%s\" at index: %d\n",
	      self matchAt 0, self matchIndexAt 0);
      fn_name = self subString 0, self matchIndexAt 0;
      printf ("Function name: %s\n", fn_name);
    }
  }

  return 0;
}

When run, the program should produce results like this.


Matched text: " " at index: 6
Function name: strlen
Matched text: "" at index: 6
Function name: strcat
Matched text: " " at index: 7
Function name: strncpy
Matched text: " " at index: 4
Function name: stat

Note that the first backreference is numbered ‘0’, in the expression ‘self matchAt 0’. If there were another set of (unescaped) parentheses in pattern, then its text would be referred to as ‘self matchAt 1’.

You should also note that the second function match saved an empty string. The backreferenced pattern resulted in a zero-length match there, because a ‘*’ metacharacter matches zero or more occurrences of the character that precedes it.

The program could also use the charPos method to look for the ‘ ’ and/or ‘(’ characters, but using a regular expression gives us information about which non-label character appears first more efficiently.

Here’s another example. The pattern contains only one set of parentheses, but Ctalk saves a match reference every time the pattern matches characters in the target string.


int main () {
  String new string, pattern;
  Array new offsets;
  Integer new nMatches, i;

  pattern = "(l*o)";
  string = "Hello, world! Hello, world, Hello, world!";
  
  nMatches = string matchRegex pattern, offsets;

  printf ("nMatches: %d\n", nMatches);
  offsets map {
    printf ("%d\n", self);
  }
  for (i = 0; i < nMatches; ++i) {
    printf ("%s\n", string matchAt i);
  }
}

When run, the program produces output like this.


nMatches: 6
2
8
16
22
30
36
-1
llo
o
llo
o
llo
o

The character classes match anywhere they find text in a target string, including control characters like ‘\n’ and ‘\f’, regardless of the record separator character. For a brief example, refer to the section, The Record Separator Character, below.

This example matches one of two patterns joined by a ‘|’ metacharacter.


int main () {
  String new s, pat;
  Array new matches;
  Integer new n_matches, n_th_match;

  pat = "-(mo)|(ho)use";

  s = "-mouse-house-";

  n_matches = s matchRegex pat, matches;

  for (n_th_match = 0; n_th_match < n_matches; ++n_th_match) {
    printf ("Match %d. Matched %s at character index %ld.\n",
	    n_th_match, s matchAt n_th_match, s matchIndexAt n_th_match);
  }

  matches delete;

}

When run, the program should produce output like this.


Match 0. Matched mo at character index 0.
Match 1. Matched ho at character index 6.

You should note that if a pattern in a backreference results in a zero length match, then that backreference contains a zero length string. While not incorrect, it can produce confusing results when examining matched text. The following program shows one way to indicate a zero-length backreference. It prints the string ‘(null)’ whenever a backreference contains a zero-length string.


int main () {
  String new s;
  String new pat;
  Integer new n_matches;
  Array new offsets;
  Integer new i;

  s = "1.mobile 2mobile mobile";
  pat = "(\\d\\p)?m";
  
  n_matches = s matchRegex pat, offsets;
  
  for (i = 0; i < n_matches; ++i) {
    printf ("%Ld\n", offsets at i);
  }

  for (i = 0; i < n_matches; ++i) {
    if ((s matchAt i) length == 0) {
      printf ("%d: %s\n", s matchIndexAt i, "(null)");
    } else {
      printf ("%d: %s\n", s matchIndexAt i, s matchAt i);
    }
  }
}

When run, the program should produce output that looks like this.


0
10
17
0: 1.
17: (null)
22: (null)

Pattern Strings

When writing a regular expression, it’s necessary to take into account all of the processing that String objects encounter when they are evaluated, before they reach the Ctalk library’s regular expression parser. To help facilitate lexical analysis and parsing, Ctalk also provides pattern strings, which allow Ctalk to defer the evaluation of a pattern until the regular expression parser actually performs the text matching.

Ctalk also provides the =~ and !~ operators as shorthand for matching patterns against text.

Pattern constants at this time may only follow the =~ and !~ operators, but you can use the matchAt, matchIndexAt, and nMatches methods to retrieve the match information. As with Strings that are used as patterns, you must enclose the pattern in ‘(’ and ‘)’ metacharacters in order to create a backreference.

Here is a simple string matching program that matches text against a pattern constant.


int main () {

  String new s;
  Integer new n_offsets;
  Integer new i;
  
  s = "Hello?";

  if (s =~ /(o\?)/) {
    printf ("match\n");
    i = 0;
    n_offsets = s nMatches;
    while (i < n_offsets) {
      printf ("%d: %s\n", s matchIndexAt i, s matchAt i);
      ++i;
    }
  }
}

The clearest advantage of patterns shows up when writing backslash escapes. In a String that is used as a pattern, you need to write at least two backslashes for a single backslash to reach the regular expression parser, where it escapes the following character. If you want to match an escaped backslash, then you need to write at least four backslashes.


String         Pattern
"\\*"          /\*/        # Matches a literal '*'.
"\\\\*"        /\\*/       # Matches the expression '\*'.

To create a pattern, you delimit the characters of the pattern with slashes (‘//’) instead of double quotes. Other delimiters can also signify patterns if the pattern starts with an ‘m’ character, followed by the delimiter character, which must be non-alphanumeric.


String         Pattern     Alternate Pattern
"\\*"          /\*/        m|\*|
"\\\\*"        /\\*/       m|\\*|

There is no single rule that governs how often String objects are evaluated when a program runs. So writing patterns helps take some of the work out of testing an application’s pattern matching routines.
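The backslash doubling that quoted Strings require is the same effect that C string literals show: the lexer consumes one level of backslashes before the text reaches any later stage. The sketch below counts the backslashes that survive in a C string at run time. The name count_backslashes is hypothetical, and the sketch assumes that Ctalk's lexer treats backslashes in quoted Strings the way C does, as the comments in the myPat example suggest.

```c
/* Count the backslash characters that remain in s at run time, that
   is, after the compiler has processed the escapes in the literal. */
static int count_backslashes (const char *s) {
  int n = 0;

  for (; *s != '\0'; ++s) {
    if (*s == '\\')
      ++n;
  }
  return n;
}
```

The literal "\\*" contains one backslash at run time, and "\\\\*" contains two, so count_backslashes returns 1 and 2 for them respectively.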

Debugging Pattern Matches

Ctalk allows you to view the parsed pattern tokens, and the text that each token matches. Token printing is enabled using the printMatchToks method, like this.


myString printMatchToks TRUE;

When token printing is enabled, then Ctalk’s pattern matching routines print the tokens of the pattern and the text that each token matches after every pattern match attempt.

If we have a program like the following:


int main () {

  String new s;

  s printMatchToks TRUE;

  s = "192.168.0.1";

  if (s =~ /\d+\.(\d+)\.\d+\.\d+/) {
    printf ("match!\n");
  }

}

Then, when this program is run with token printing enabled, the output should look similar to this.


joeuser@myhost:~$ ./mypatprogram 
PATTERN: /\d+\.(\d+)\.\d+\.\d+/         TEXT: "192.168.0.1"
TOK: d+         (character class)               MATCH: "192"
TOK: .          (literal character)             MATCH: "."
TOK: (          (backreference start)           MATCH: ""
TOK: d+         (character class)               MATCH: "168"
TOK: )          (backreference end)             MATCH: ""
TOK: .          (literal character)             MATCH: "."
TOK: d+         (character class)               MATCH: "0"
TOK: .          (literal character)             MATCH: "."
TOK: d+         (character class)               MATCH: "1"
match!
joeuser@myhost:~$ 

The processed token text is followed by any attributes that the regular expression parser finds (for example, a pattern like ‘\d+’ becomes the token ‘d+’ with the attribute of a character class identifier, and the ‘(’ and ‘)’ characters have backreference attributes). Finally, the library prints the text that matches each token.

Successful matches have text matched by each token in the pattern (except for zero-length metacharacters like ‘(’, ‘)’, ‘^’, or ‘$’).

Unsuccessful matches, however, may display text that matches where you don’t expect it. That’s because the regular expression parser scans along the entire length of the text, trying to match the first pattern token, then the second pattern token, and so on.

Although this doesn’t always pinpoint the exact place where a match first failed, the output shows what the regular expression parser is doing internally, and it can serve as a roadmap for building a complex pattern from simpler, perhaps single-metacharacter, patterns.

The Record Separator Character

Ctalk uses a record separator character to determine how the metacharacters ‘^’ and ‘$’ match line endings, among other uses.

The default record separator character is a newline (‘\n’). In this case a ‘^’ metacharacter in an expression matches the beginning of a string as well as the character(s) immediately following a newline. Similarly, a ‘$’ metacharacter anchors a match to the characters at the end of a line and at the end of a string.

Setting the record separator character to NUL (‘\0’) causes ‘^’ and ‘$’ to match only the beginning and the end of a string.

Here is an example that prints the string indexes of matches with the default newline record separator and with a NUL record separator character.

When the record separator is ‘'\n'’, the ‘$’ metacharacter in our pattern matches the text immediately before a ‘\n’ character, as well as the text at the end of the string.


int main () {

  String new s;
  Integer new n_indexes;
  Array new match_indexes;
  String new pattern;
  
  printf ("\tMatch Indexes\n");

  /* Begin with the default record separator ('\n'). */

  s = "Hello, world!\nHello, wo\nHello, wo";
  pattern = "wo$";
  n_indexes = s matchRegex pattern, match_indexes;

  printf ("With newline record separator:\n");
  match_indexes map {
    printf ("%d\n", self);
  }

  s setRS '\0';   /* Set the record separator to NUL ('\0'). */

  match_indexes delete; /* Remember to start with an empty Array again. */

  n_indexes = s matchRegex pattern, match_indexes;

  printf ("With NUL record separator:\n");
  match_indexes map {
    printf ("%d\n", self);
  }
}

When run, the program should produce output like this.


        Match Indexes
With newline record separator:
21
31
-1
With NUL record separator:
31
-1
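The effect of the record separator on ‘$’ can be imitated in C for the special case of a literal text followed by ‘$’. This is only an illustration of the rule, not Ctalk's regular expression parser, and match_at_record_end is a hypothetical name.

```c
#include <string.h>

/* Find each place where the literal text 'lit' ends a record, that
   is, where it is followed by the record separator 'rs' or by the
   end of the string.  With rs == '\0', only the end of the string
   qualifies.  Stores up to max_matches indexes plus a terminating -1
   in offsets, and returns the number of matches.  The offsets array
   must have room for max_matches + 1 entries. */
static int match_at_record_end (const char *text, const char *lit,
                                char rs, int *offsets,
                                int max_matches) {
  size_t lit_len = strlen (lit);
  const char *p = text;
  int n = 0;

  if (lit_len > 0) {
    while (n < max_matches && (p = strstr (p, lit)) != NULL) {
      char next = p[lit_len];
      if (next == '\0' || next == rs)
        offsets[n++] = (int)(p - text);
      ++p;   /* Continue looking after the start of this candidate. */
    }
  }
  offsets[n] = -1;
  return n;
}
```

With the text "Hello, world!\nHello, wo\nHello, wo" and the literal "wo", a separator of '\n' gives the indexes 21 and 31, and a separator of '\0' gives only 31.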

Likewise, a ‘^’ metacharacter matches text immediately after the ‘\n’ record separator, or at the beginning of a string.

It’s also possible, though, to match newlines (and other ASCII escape characters) in patterns, either with a character class match, or by adding the escape sequence to the pattern. To do that, the program should use a double backslash with the ASCII escape sequence, as with the newline escape sequence in this example.


int main () {
  String new s;

  s = "Hello,\nworld!";

  if (s =~ /(\W\p\\n)/)
    printf ("%s\n", s matchAt 0);
  
}

