WhizzML Reference Manual

4.6 Strings

(string? obj) \(\rightarrow \) boolean

String values can be recognized via the predicate string?.

4.6.1 Coercion to string

(str obj1 …) \(\rightarrow \) string

Any value can be coerced to a string using str. The procedure takes an arbitrary number of arguments, and the result is the concatenation of all the coercions. When acting on string values, str is thus the string concatenation function.

(str 3) ;; => "3"
(str 3 5) ;; => "35"
(str "3" (+ 3 2)) ;; => "35"
(str "Hello" " " "world") ;; => "Hello world"

When applied to strings, str drops their surrounding quotation marks. To preserve them in the result (for instance, because you are generating WhizzML source code), use the standard procedure pr-str.

(pr-str obj1 …) \(\rightarrow \) string

(pr-str 3) ;; => "3"
(pr-str 3 5) ;; => "35"
(pr-str "3" (+ 3 2)) ;; => "\"3\"5"
(pr-str "Hello" " " "world") ;; => "\"Hello\"\" \"\"world\""

We also provide a standard procedure that generates the JSON representation of a given WhizzML value:

(json-str obj1) \(\rightarrow \) string

(json-str 3) ;; => "3"
(json-str [2 true]) ;; => "[2,true]"
(json-str {"a" 2.2 "b" [1 [false "c"]]}) ;; => "{\"a\":2.2,\"b\":[1,[false,\"c\"]]}"

and it is also possible to parse a JSON string into its corresponding WhizzML value:

(read-json-str strjson) \(\rightarrow \) object

(read-json-str "3") ;; => 3
(read-json-str "[2, true]") ;; => [2 true]
(read-json-str (json-str {"A" 2})) ;; => {"A" 2}

4.6.2 Digests

There are three hashing procedures available in the standard library:

(md5 str) \(\rightarrow \) string
(sha1 str) \(\rightarrow \) string
(sha256 str) \(\rightarrow \) string

These primitives act on the stream of bytes of their input string, str, and return the cryptographic digest they name as a string in hexadecimal representation:

(md5 "a text") ;; => "b229386ec4627869d2c71b7df3c9600a"
(sha1 "a text") ;; => "7081f2babbafff16b4bae16282859c844baa14ef"
(sha256 "") ;; => "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"

As shown, the returned strings use characters in [0-9a-f] to represent the values of the output bytes: md5 produces 16 bytes (a 128-bit digest), sha1 produces 20 bytes (160 bits) and sha256 produces 32 bytes (256 bits).
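Since every output byte is written as two hexadecimal characters, the length of each digest string is fixed. A minimal sketch of that relationship, relying only on count (subsection 4.6.5) and matches? (subsection 4.6.7) as documented in this manual:

```whizzml
;; Two hexadecimal characters per byte of digest
(count (md5 "a text"))  ;; => 32 (16 bytes)
(count (sha1 "a text")) ;; => 40 (20 bytes)
(count (sha256 ""))     ;; => 64 (32 bytes)

;; The result consists only of characters in [0-9a-f]
(matches? "[0-9a-f]+" (sha256 "")) ;; => true
```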

4.6.3 Pretty printing WhizzML code

As shown in subsection 4.6.1, values can be coerced to strings using str and pr-str. In addition, one can use

(ppr-str obj1 [width]) \(\rightarrow \) string

to coerce an arbitrary value to a string preserving quotations (like pr-str) and formatting and indenting the result as if it were WhizzML code. Unlike pr-str, ppr-str accepts only one value to print and, optionally, the line width used during formatting.

(ppr-str {"a" 22343 "bbbbb" 3333}) ;; => "{\"a\" 22343 \"bbbbb\" 3333}"
(ppr-str {"a" 22343 "bbbbb" 3333} 10) ;; => "{\"a\" 22343\n \"bbbbb\"\n 3333}"
(ppr-str {"a" 22343 "bbbbb" 3333} 10) ;; => "{\"a\" 22343\n \"bbbbb\" 3333}"

If, instead of a value, you have a string containing WhizzML code and want to pretty-print it, use pretty-whizzml.

(pretty-whizzml str-of-code [width]) \(\rightarrow \) string

str-of-code must be a syntactically correct WhizzML code string, and width is the maximum number of characters per line in the resulting code string.

Pretty-printing procedures are mainly useful for code generators (such as reify) and are seldom needed when programming Machine Learning workflows.

4.6.4 String manipulation

(subs str int1 [int2]) \(\rightarrow \) string

The subs procedure extracts the substring of str starting at the zero-based index int1 and extending up to (but not including) the character at index int2. The latter is optional; if it is not provided, the substring extends to the end of str.

(subs "a string" 3) ;; => "tring"
(subs "a string" 0 3) ;; => "a s"
(subs "a" 2) ;; => ""

As you can easily check, the following expression will always evaluate to true when n is positive:

(= s (str (subs s 0 n) (subs s n)))  ;; true for all strings s

If int2 is greater than the length of the string, we just take characters up to its end:

(subs "a string" 3 2500) ;; => "tring"

If int1 or int2 is negative, it refers to a position counted from the end of the string, i.e., a negative index i stands for (+ (count str) i). For instance:

(subs "abcd" -1) ;; => "d"
(subs "abcd" -2) ;; => "cd"
(subs "abcd" 0 -1) ;; => "abc"
(subs "abcd" -1 -1) ;; => "d"
(subs "abcd" -3 2) ;; => "b"

(join list-of-strings) \(\rightarrow \) string
(join str-sep list-of-strings) \(\rightarrow \) string

The join procedure concatenates a list of strings, optionally inserting a separator between consecutive elements:

(join "/" ["a" "path" "x.whizzml"]) ;; => "a/path/x.whizzml"
(join "" ["1" "2"]) ;; => "12"
(join ["whizz" "ml" "!"]) ;; => "whizzml!"

The inverse operation, splitting a given string, is performed by the standard procedures split and split-regexp, each of which accepts two or three arguments:

(split str str-sep) \(\rightarrow \) string list
(split str str-sep int) \(\rightarrow \) string list
(split-regexp str rx-sep) \(\rightarrow \) string list
(split-regexp str rx-sep int) \(\rightarrow \) string list

These procedures take a string to split (str) and either a literal separator (str-sep) or a regular expression that separators should match (rx-sep), and return a list of strings. The optional argument int specifies the maximum length of the returned list:

(split "a,b,c" ",") ;; => ["a" "b" "c"]
(split "a,b,c" "," 2) ;; => ["a" "b,c"]
(split "a,b,c" "," 0) ;; => []
(split "a,b,c" "," -2) ;; => []
(split "a,,b,c" ",") ;; => ["a" "" "b" "c"]
(split-regexp "a,,b,c" ",+") ;; => ["a" "b" "c"]
(split-regexp "a,,b,c" ",+" 2) ;; => ["a" "b,c"]
(split-regexp "a,,b,c" ",+" 0) ;; => []
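In the examples above, joining the pieces with the same literal separator recovers the original string. The following sketch illustrates this, assuming only the behaviour of split, split-regexp and join documented in this subsection:

```whizzml
(join "," (split "a,b,c" ","))  ;; => "a,b,c"
(join "," (split "a,,b,c" ",")) ;; => "a,,b,c"

;; split-regexp with ",+" treats a run of commas as a single separator,
;; so the round trip collapses it
(join "," (split-regexp "a,,b,c" ",+")) ;; => "a,b,c"
```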

The standard library also includes the following case conversion procedures:

(lower-case str) \(\rightarrow \) string
(upper-case str) \(\rightarrow \) string
(capitalize str) \(\rightarrow \) string

which perform the expected conversions:

(lower-case "An Example") ;; => "an example"
(upper-case "An Example") ;; => "AN EXAMPLE"
(capitalize "an Example") ;; => "An example"
(capitalize "3 EXAMPLES") ;; => "3 examples"

Note that, as shown in the above examples, capitalize treats its argument as a single unstructured token, upcasing only its first character and downcasing the rest.
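To capitalize every word rather than only the first character of the string, the procedures in this subsection can be combined. A minimal sketch, in which title-case is a hypothetical helper (not part of the standard library) and words are assumed to be separated by single spaces:

```whizzml
;; Hypothetical helper: capitalize each space-separated word of s
(define (title-case s)
  (join " " (map (lambda (w) (capitalize w)) (split s " "))))

(title-case "an example here") ;; => "An Example Here"
```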

4.6.5 String length and distance

(count str) \(\rightarrow \) integer
(empty? str) \(\rightarrow \) boolean

The length of a string can be obtained with the polymorphic procedure count, which can also be applied to lists and maps. For convenience, the predicate empty?, also polymorphic, is shorthand for (zero? (count str)).

The primitive levenshtein computes, as a non-negative integer, the distance between two given string values:

(levenshtein str1 str2) \(\rightarrow \) integer

The Levenshtein distance between two strings is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one into the other.

(levenshtein "a text" "a text") ;; => 0
(levenshtein "a text" "b text") ;; => 1
(levenshtein "a text" "another xxx") ;; => 7
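A common use of this distance is tolerant string comparison. The sketch below defines similar?, a hypothetical helper (not part of the standard library) that accepts two strings when they are at most max-dist edits apart:

```whizzml
;; Hypothetical helper: true when s1 and s2 are at most max-dist edits apart
(define (similar? s1 s2 max-dist)
  (<= (levenshtein s1 s2) max-dist))

(similar? "a text" "b text" 1)      ;; => true  (distance 1)
(similar? "a text" "another xxx" 5) ;; => false (distance 7)
```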

4.6.6 Flatline strings

WhizzML provides helpers to generate flatline s-expressions as strings (typically for use in resource creation parameters). The basic function for flatline generation is flatline, which constructs strings via interpolation of variables.

(flatline str …) \(\rightarrow \) string

The arguments to flatline are templates, or format strings, which are concatenated to generate the final flatline expression. Each string may refer to any WhizzML variable in scope, which is replaced by its value, quoted or not according to the following rules (let’s call the variable to be substituted x):

{x} to replace x’s value into the format string:

(let (w "world")
  (flatline "(hello {w})")) ;; => "(hello world)"

(define days 12)
(let (delta 2)
  (flatline "(= (+ {days} {delta}) (field \"000000\"))"))
  ;; => "(= (+ 12 2) (field \"000000\"))"

{{x}} to replace x as a quoted value into the format string:

(let (w "something blue")
  (flatline "this is {{w}}, " " right?"))
  ;; => "this is \"something blue\", right?"

@{x} when x is a list, to splice its elements into the format string:

(let (x [1 2 3])
  (flatline "(+ @{x})")) ;; => "(+ 1 2 3)"

@{{x}} when x is a list, to splice its (recursively) quoted elements into the format string:

(let (ids ["000000" "000001"])
  (flatline "(fields @{{ids}})"))
  ;; => "(fields \"000000\" \"000001\")"

(let (ids ["0" "1"]
      rows [[1 2] ["A" 3]]
      eqs (map (lambda (r) (flatline "(= fs (list @{{r}}))")) rows))
  (flatline "(let (fs (fields @{{ids}}))\n (not (or @{eqs})))"))
  ;; => "(let (fs (fields \"0\" \"1\"))
  ;;       (not (or (= fs (list 1 2)) (= fs (list \"A\" 3)))))"

As shown, braces have a special meaning in flatline’s format strings. If you need to introduce them literally, you should use a quoted variable, to avoid ambiguities and parsing errors. For example:

(let (ob "{"
      cb "}")
  (flatline "(if (even? x) {{ob}} {{cb}})"))
  ;; => "(if (even? x) \"{\" \"}\")"

Since braces are not part of Flatline’s syntax, the need to quote them arises only, as in the above example, when they appear within string values in the resulting Flatline expression.

4.6.7 Regular expressions

A regular expression in WhizzML is represented as a string following the Perl or Java standard notation. There is no “regular expression” type, just strings that comply with that format.

(regexp? str) \(\rightarrow \) boolean
(re-quote str) \(\rightarrow \) regular expression (string)

The regexp? predicate checks whether the string str represents a valid regular expression, and can therefore be directly used as such, and re-quote returns a regular expression that matches the given string literally. Thus, (regexp? (re-quote s)) is identically true for any string s.

(regexp? "a") ;; => true
(regexp? "[ab]x.") ;; => true
(regexp? "x[a") ;; => false

(re-quote "no special symbols") ;; => "no special symbols"
(re-quote "a dot: .") ;; => "\\Qa dot: .\\E"

To check whether a string matches a given regular expression, use the following standard library procedures:

(matches rx str) \(\rightarrow \) list of string
(matches? rx str) \(\rightarrow \) boolean

matches returns the list of matching groups of the regular expression rx found in the string str, or an empty list if no matches are found, while matches? checks whether the given string matches the given regular expression, i.e., whether its list of matches is not empty. Hence (matches? r x) is just syntactic sugar for (not (empty? (matches r x))).

The list returned by matches always contains the original string as its first element, followed by other matching subgroups in the regular expression, if any. For instance:

(matches ".*x.*" "axz") ;; => ["axz"]
(matches "x([yzk]+)3" "xzzky3") ;; => ["xzzky3" "zzky"]
(matches "x(y)x([zj])" "xyxj") ;; => ["xyxj" "y" "j"]

Note that both matches and matches? perform full-string matching, not substring matching; e.g.:

(matches? "an x" "an x or two") ;; => false
(matches? "an x [a-z ]+" "an x or two") ;; => true
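Because matching is anchored to the whole string, a substring test with a regular expression needs explicit wildcards on both sides. A small sketch based on the examples above:

```whizzml
;; Wrapping the pattern in .* turns the full-string match into a substring test
(matches? ".*an x.*" "an x or two") ;; => true

;; A group still captures only the part of interest
(matches ".*(an x).*" "an x or two") ;; => ["an x or two" "an x"]
```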

At the substring level, WhizzML provides the following replacement primitives:

(replace str-target rx str-repl) \(\rightarrow \) string
(replace-first str-target rx str-repl) \(\rightarrow \) string

replace substitutes the value of its third argument (another string) for all (partial) matches of rx in str-target; replace-first works like replace, but performs only one substitution (the first match).

(replace "replace me here and there" "e\\b" "X")
  ;; => "replacX mX herX and therX"
(replace-first "replace me here and there" "e\\B" "Y")
  ;; => "rYplace me here and there"

We provide a convenience predicate to check for occurrences of a term within a string, with a case-sensitivity flag:

(contains-string? str-needle str-hay [bool-cs]) \(\rightarrow \) boolean

The predicate checks whether the string str-needle occurs as a substring in str-hay. By default, the matching is case-sensitive, but a case-insensitive search can be requested by passing false as the third argument.

(contains-string? "foo" "bazquuxfoooo") ;; => true
(contains-string? "foo" "bazquuxFOooo" true) ;; => false
(contains-string? "foo" "bazquuxFOooo" false) ;; => true

Likewise, these variants of replace and replace-first take as second argument a literal string to be replaced, rather than a regular expression:

(replace-string str-target str-needle str-repl) \(\rightarrow \) string
(replace-first-string str-target str-needle str-repl) \(\rightarrow \) string

For instance:

(replace-string "[ab] in a regexp is not '[ab]' in a string" "[ab]" ".")
  ;; => ". in a regexp is not '.' in a string"
(replace-first-string "[ab] in a regexp is not '[ab]' in a string" "[ab]" ".")
  ;; => ". in a regexp is not '[ab]' in a string"