8.5 Regular Expressions
The ‘-regex’ and ‘-iregex’ tests of ‘find’ allow matching by regular expression, as does the ‘--regex’ option of ‘locate’. Your locale configuration affects how regular expressions are interpreted; *note Environment Variables::, for a description of this effect.
There are several different types of regular expression, and each is interpreted differently. Normally, the type of regular expression used by ‘find’ and ‘locate’ is almost identical to that used in GNU Emacs. The single difference is that in ‘find’ and ‘locate’, a ‘.’ will match a newline character.
Both ‘find’ and ‘locate’ provide an option for selecting an alternative regular expression syntax: for ‘find’ this is the ‘-regextype’ option, and for ‘locate’ this is the ‘--regextype’ option. Each takes a single argument, which names the regular expression syntax and behaviour to use. This should be one of the following:
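Before the individual syntaxes are described, here is a minimal sketch of switching syntax with ‘-regextype’. It assumes GNU ‘find’; the scratch directory and file names are hypothetical. Note that ‘-regextype’ has to appear before the ‘-regex’ test it should affect, and that ‘-regex’ is matched against the whole path, not just the base name.

```shell
# Scratch directory with made-up file names for this illustration.
dir=$(mktemp -d)
touch "$dir/report.txt" "$dir/report.log" "$dir/notes.txt"

# Default (Emacs-like) syntax: groups are \( \) and alternation is \|.
default_matches=$(find "$dir" -regex '.*/report\.\(txt\|log\)' -printf '%f\n' | sort)

# POSIX extended syntax: the same pattern with unescaped ( | ).
extended_matches=$(find "$dir" -regextype posix-extended \
    -regex '.*/report\.(txt|log)' -printf '%f\n' | sort)

echo "$default_matches"
echo "$extended_matches"
rm -r "$dir"
```

Both commands should select the same two files; only the spelling of the grouping and alternation operators differs.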
8.5.1 ‘findutils-default’ regular expression syntax
The character ‘.’ matches any single character.
- ‘+’ indicates that the regular expression should match one or more occurrences of the previous atom or regexp.
- ‘?’ indicates that the regular expression should match zero or one occurrence of the previous atom or regexp.
- ‘\+’ matches a ‘+’.
- ‘\?’ matches a ‘?’.
Bracket expressions are used to match ranges of characters. Bracket expressions where the range is backward, for example ‘[z-a]’, are ignored. Within square brackets, ‘\’ is taken literally. Character classes are not supported, so for example you would need to use ‘[0-9]’ instead of ‘[[:digit:]]’.
GNU extensions are supported:
- ‘\w’ matches a character within a word
- ‘\W’ matches a character which is not within a word
- ‘\<’ matches the beginning of a word
- ‘\>’ matches the end of a word
- ‘\b’ matches a word boundary
- ‘\B’ matches characters which are not a word boundary
- ‘\`’ matches the beginning of the whole input
- ‘\'’ matches the end of the whole input
Grouping is performed with backslashes followed by parentheses ‘\(’, ‘\)’. A backslash followed by a digit acts as a back-reference and matches the same thing as the previous grouped expression indicated by that number. For example ‘\2’ matches the second group expression. The order of group expressions is determined by the position of their opening parenthesis ‘\(’.
The alternation operator is ‘\|’.
The character ‘^’ only represents the beginning of a string when it appears:
- At the beginning of a regular expression
- After an open-group, signified by ‘\(’
- After the alternation operator ‘\|’
The character ‘$’ only represents the end of a string when it appears:
- At the end of a regular expression
- Before a close-group, signified by ‘\)’
- Before the alternation operator ‘\|’
‘*’, ‘+’ and ‘?’ are special at any point in a regular expression except:
- At the beginning of a regular expression
- After an open-group, signified by ‘\(’
- After the alternation operator ‘\|’
The longest possible match is returned; this applies to the regular expression as a whole and (subject to this constraint) to subexpressions within groups.
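To illustrate the difference between the ‘+’ operator and a backslash-escaped plus sign in this syntax, here is a small sketch. It assumes GNU ‘find’ and uses a scratch directory with two hypothetical file names.

```shell
# Scratch directory; the file names are made up for this illustration.
dir=$(mktemp -d)
touch "$dir/aab" "$dir/a+b"

# A bare + is an operator here: one or more 'a' followed by 'b'.
plus_operator=$(find "$dir" -regextype findutils-default \
    -regex '.*/a+b' -printf '%f\n')

# A backslash-escaped + stands for a literal plus sign.
plus_literal=$(find "$dir" -regextype findutils-default \
    -regex '.*/a\+b' -printf '%f\n')

echo "operator form selects: $plus_operator"
echo "escaped form selects:  $plus_literal"
rm -r "$dir"
```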
8.5.2 ‘emacs’ regular expression syntax
The character ‘.’ matches any single character except newline.
- ‘+’ indicates that the regular expression should match one or more occurrences of the previous atom or regexp.
- ‘?’ indicates that the regular expression should match zero or one occurrence of the previous atom or regexp.
- ‘\+’ matches a ‘+’.
- ‘\?’ matches a ‘?’.
Bracket expressions are used to match ranges of characters. Bracket expressions where the range is backward, for example ‘[z-a]’, are ignored. Within square brackets, ‘\’ is taken literally. Character classes are not supported, so for example you would need to use ‘[0-9]’ instead of ‘[[:digit:]]’.
GNU extensions are supported:
- ‘\w’ matches a character within a word
- ‘\W’ matches a character which is not within a word
- ‘\<’ matches the beginning of a word
- ‘\>’ matches the end of a word
- ‘\b’ matches a word boundary
- ‘\B’ matches characters which are not a word boundary
- ‘\`’ matches the beginning of the whole input
- ‘\'’ matches the end of the whole input
Grouping is performed with backslashes followed by parentheses ‘\(’, ‘\)’. A backslash followed by a digit acts as a back-reference and matches the same thing as the previous grouped expression indicated by that number. For example ‘\2’ matches the second group expression. The order of group expressions is determined by the position of their opening parenthesis ‘\(’.
The alternation operator is ‘\|’.
The character ‘^’ only represents the beginning of a string when it appears:
- At the beginning of a regular expression
- After an open-group, signified by ‘\(’
- After the alternation operator ‘\|’
The character ‘$’ only represents the end of a string when it appears:
- At the end of a regular expression
- Before a close-group, signified by ‘\)’
- Before the alternation operator ‘\|’
‘*’, ‘+’ and ‘?’ are special at any point in a regular expression except:
- At the beginning of a regular expression
- After an open-group, signified by ‘\(’
- After the alternation operator ‘\|’
The longest possible match is returned; this applies to the regular expression as a whole and (subject to this constraint) to subexpressions within groups.
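Back-references can be easier to follow with a concrete case. The sketch below assumes GNU ‘find’ and two hypothetical file names; it selects only the name whose two hyphen-separated parts are identical.

```shell
# Scratch directory; the file names are made up for this illustration.
dir=$(mktemp -d)
touch "$dir/foo-foo.txt" "$dir/foo-bar.txt"

# \( \) captures a run of lowercase letters; \1 must repeat the same text.
repeated=$(find "$dir" -regextype emacs \
    -regex '.*/\([a-z]+\)-\1\.txt' -printf '%f\n')

echo "$repeated"
rm -r "$dir"
```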
8.5.3 ‘gnu-awk’ regular expression syntax
The character ‘.’ matches any single character.
- ‘+’ indicates that the regular expression should match one or more occurrences of the previous atom or regexp.
- ‘?’ indicates that the regular expression should match zero or one occurrence of the previous atom or regexp.
- ‘\+’ matches a ‘+’.
- ‘\?’ matches a ‘?’.
Bracket expressions are used to match ranges of characters. Bracket expressions where the range is backward, for example ‘[z-a]’, are invalid. Within square brackets, ‘\’ can be used to quote the following character. Character classes are supported; for example ‘[[:digit:]]’ will match a single decimal digit.
GNU extensions are supported:
- ‘\w’ matches a character within a word
- ‘\W’ matches a character which is not within a word
- ‘\<’ matches the beginning of a word
- ‘\>’ matches the end of a word
- ‘\b’ matches a word boundary
- ‘\B’ matches characters which are not a word boundary
- ‘\`’ matches the beginning of the whole input
- ‘\'’ matches the end of the whole input
Grouping is performed with parentheses ‘()’. An unmatched ‘)’ matches just itself. A backslash followed by a digit acts as a back-reference and matches the same thing as the previous grouped expression indicated by that number. For example ‘\2’ matches the second group expression. The order of group expressions is determined by the position of their opening parenthesis ‘(’.
The alternation operator is ‘|’.
The characters ‘^’ and ‘$’ always represent the beginning and end of a string respectively, except within square brackets. Within brackets, ‘^’ can be used to invert the membership of the character class being specified.
‘*’, ‘+’ and ‘?’ are special at any point in a regular expression except:
- At the beginning of a regular expression
- After an open-group, signified by ‘(’
- After the alternation operator ‘|’
Intervals are specified by ‘{’ and ‘}’. Invalid intervals are treated as literals, for example ‘a{1’ is treated as ‘a\{1’.
The longest possible match is returned; this applies to the regular expression as a whole and (subject to this constraint) to subexpressions within groups.
8.5.4 ‘grep’ regular expression syntax
The character ‘.’ matches any single character.
- ‘\+’ indicates that the regular expression should match one or more occurrences of the previous atom or regexp.
- ‘\?’ indicates that the regular expression should match zero or one occurrence of the previous atom or regexp.
- ‘+’ and ‘?’ match themselves.
Bracket expressions are used to match ranges of characters. Bracket expressions where the range is backward, for example ‘[z-a]’, are invalid. Within square brackets, ‘\’ is taken literally. Character classes are supported; for example ‘[[:digit:]]’ will match a single decimal digit.
GNU extensions are supported:
- ‘\w’ matches a character within a word
- ‘\W’ matches a character which is not within a word
- ‘\<’ matches the beginning of a word
- ‘\>’ matches the end of a word
- ‘\b’ matches a word boundary
- ‘\B’ matches characters which are not a word boundary
- ‘\`’ matches the beginning of the whole input
- ‘\'’ matches the end of the whole input
Grouping is performed with backslashes followed by parentheses ‘\(’, ‘\)’. A backslash followed by a digit acts as a back-reference and matches the same thing as the previous grouped expression indicated by that number. For example ‘\2’ matches the second group expression. The order of group expressions is determined by the position of their opening parenthesis ‘\(’.
The alternation operator is ‘\|’.
The character ‘^’ only represents the beginning of a string when it appears:
- At the beginning of a regular expression
- After an open-group, signified by ‘\(’
- After a newline
- After the alternation operator ‘\|’
The character ‘$’ only represents the end of a string when it appears:
- At the end of a regular expression
- Before a close-group, signified by ‘\)’
- Before a newline
- Before the alternation operator ‘\|’
‘*’, ‘+’ and ‘?’ are special at any point in a regular expression except:
- At the beginning of a regular expression
- After an open-group, signified by ‘\(’
- After a newline
- After the alternation operator ‘\|’
Intervals are specified by ‘\{’ and ‘\}’. Invalid intervals such as ‘a\{1z’ are not accepted.
The longest possible match is returned; this applies to the regular expression as a whole and (subject to this constraint) to subexpressions within groups.
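In this syntax the roles of the bare and the backslash-escaped plus sign are the reverse of the Emacs-style syntaxes. The following sketch, which assumes GNU ‘find’ and two hypothetical file names, shows the effect.

```shell
# Scratch directory; the file names are made up for this illustration.
dir=$(mktemp -d)
touch "$dir/aab" "$dir/a+b"

# The backslash-escaped + is the repetition operator in this syntax...
escaped_plus=$(find "$dir" -regextype grep -regex '.*/a\+b' -printf '%f\n')

# ...while a bare + matches a literal plus sign.
bare_plus=$(find "$dir" -regextype grep -regex '.*/a+b' -printf '%f\n')

echo "escaped form selects: $escaped_plus"
echo "bare form selects:    $bare_plus"
rm -r "$dir"
```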
8.5.5 ‘posix-awk’ regular expression syntax
The character ‘.’ matches any single character except the null character.
- ‘+’ indicates that the regular expression should match one or more occurrences of the previous atom or regexp.
- ‘?’ indicates that the regular expression should match zero or one occurrence of the previous atom or regexp.
- ‘\+’ matches a ‘+’.
- ‘\?’ matches a ‘?’.
Bracket expressions are used to match ranges of characters. Bracket expressions where the range is backward, for example ‘[z-a]’, are invalid. Within square brackets, ‘\’ can be used to quote the following character. Character classes are supported; for example ‘[[:digit:]]’ will match a single decimal digit.
GNU extensions are not supported and so ‘\w’, ‘\W’, ‘\<’, ‘\>’, ‘\b’, ‘\B’, ‘\`’, and ‘\'’ match ‘w’, ‘W’, ‘<’, ‘>’, ‘b’, ‘B’, ‘`’, and ‘'’ respectively.
Grouping is performed with parentheses ‘()’. An unmatched ‘)’ matches just itself. A backslash followed by a digit acts as a back-reference and matches the same thing as the previous grouped expression indicated by that number. For example ‘\2’ matches the second group expression. The order of group expressions is determined by the position of their opening parenthesis ‘(’.
The alternation operator is ‘|’.
The characters ‘^’ and ‘$’ always represent the beginning and end of a string respectively, except within square brackets. Within brackets, ‘^’ can be used to invert the membership of the character class being specified.
‘*’, ‘+’ and ‘?’ are special at any point in a regular expression except the following places, where they are not allowed:
- At the beginning of a regular expression
- After an open-group, signified by ‘(’
- After the alternation operator ‘|’
Intervals are specified by ‘{’ and ‘}’. Invalid intervals are treated as literals, for example ‘a{1’ is treated as ‘a\{1’.
The longest possible match is returned; this applies to the regular expression as a whole and (subject to this constraint) to subexpressions within groups.
8.5.6 ‘awk’ regular expression syntax
The character ‘.’ matches any single character except the null character.
- ‘+’ indicates that the regular expression should match one or more occurrences of the previous atom or regexp.
- ‘?’ indicates that the regular expression should match zero or one occurrence of the previous atom or regexp.
- ‘\+’ matches a ‘+’.
- ‘\?’ matches a ‘?’.
Bracket expressions are used to match ranges of characters. Bracket expressions where the range is backward, for example ‘[z-a]’, are invalid. Within square brackets, ‘\’ can be used to quote the following character. Character classes are supported; for example ‘[[:digit:]]’ will match a single decimal digit.
GNU extensions are not supported and so ‘\w’, ‘\W’, ‘\<’, ‘\>’, ‘\b’, ‘\B’, ‘\`’, and ‘\'’ match ‘w’, ‘W’, ‘<’, ‘>’, ‘b’, ‘B’, ‘`’, and ‘'’ respectively.
Grouping is performed with parentheses ‘()’. An unmatched ‘)’ matches just itself. A backslash followed by a digit matches that digit.
The alternation operator is ‘|’.
The characters ‘^’ and ‘$’ always represent the beginning and end of a string respectively, except within square brackets. Within brackets, ‘^’ can be used to invert the membership of the character class being specified.
‘*’, ‘+’ and ‘?’ are special at any point in a regular expression except:
- At the beginning of a regular expression
- After an open-group, signified by ‘(’
- After the alternation operator ‘|’
The longest possible match is returned; this applies to the regular expression as a whole and (subject to this constraint) to subexpressions within groups.
8.5.7 ‘posix-basic’ regular expression syntax
The character ‘.’ matches any single character except the null character.
- ‘\+’ indicates that the regular expression should match one or more occurrences of the previous atom or regexp.
- ‘\?’ indicates that the regular expression should match zero or one occurrence of the previous atom or regexp.
- ‘+’ and ‘?’ match themselves.
Bracket expressions are used to match ranges of characters. Bracket expressions where the range is backward, for example ‘[z-a]’, are invalid. Within square brackets, ‘\’ is taken literally. Character classes are supported; for example ‘[[:digit:]]’ will match a single decimal digit.
GNU extensions are supported:
- ‘\w’ matches a character within a word
- ‘\W’ matches a character which is not within a word
- ‘\<’ matches the beginning of a word
- ‘\>’ matches the end of a word
- ‘\b’ matches a word boundary
- ‘\B’ matches characters which are not a word boundary
- ‘\`’ matches the beginning of the whole input
- ‘\'’ matches the end of the whole input
Grouping is performed with backslashes followed by parentheses ‘\(’, ‘\)’. A backslash followed by a digit acts as a back-reference and matches the same thing as the previous grouped expression indicated by that number. For example ‘\2’ matches the second group expression. The order of group expressions is determined by the position of their opening parenthesis ‘\(’.
The alternation operator is ‘\|’.
The character ‘^’ only represents the beginning of a string when it appears:
- At the beginning of a regular expression
- After an open-group, signified by ‘\(’
- After the alternation operator ‘\|’
The character ‘$’ only represents the end of a string when it appears:
- At the end of a regular expression
- Before a close-group, signified by ‘\)’
- Before the alternation operator ‘\|’
‘*’, ‘+’ and ‘?’ are special at any point in a regular expression except:
- At the beginning of a regular expression
- After an open-group, signified by ‘\(’
- After the alternation operator ‘\|’
Intervals are specified by ‘\{’ and ‘\}’. Invalid intervals such as ‘a\{1z’ are not accepted.
The longest possible match is returned; this applies to the regular expression as a whole and (subject to this constraint) to subexpressions within groups.
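As a brief illustration of backslash-delimited intervals combined with a character class, consider the sketch below. It assumes GNU ‘find’; the log file names are hypothetical.

```shell
# Scratch directory; the file names are made up for this illustration.
dir=$(mktemp -d)
touch "$dir/12.log" "$dir/123.log" "$dir/1234.log"

# The interval (written with backslashes before the braces) demands exactly
# three digits between the last slash and the .log suffix.
three_digits=$(find "$dir" -regextype posix-basic \
    -regex '.*/[[:digit:]]\{3\}\.log' -printf '%f\n')

echo "$three_digits"
rm -r "$dir"
```

Because ‘-regex’ has to match the whole path, names with two or four digits are rejected rather than partially matched.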
8.5.8 ‘posix-egrep’ regular expression syntax
The character ‘.’ matches any single character.
- ‘+’ indicates that the regular expression should match one or more occurrences of the previous atom or regexp.
- ‘?’ indicates that the regular expression should match zero or one occurrence of the previous atom or regexp.
- ‘\+’ matches a ‘+’.
- ‘\?’ matches a ‘?’.
Bracket expressions are used to match ranges of characters. Bracket expressions where the range is backward, for example ‘[z-a]’, are invalid. Within square brackets, ‘\’ is taken literally. Character classes are supported; for example ‘[[:digit:]]’ will match a single decimal digit.
GNU extensions are supported:
- ‘\w’ matches a character within a word
- ‘\W’ matches a character which is not within a word
- ‘\<’ matches the beginning of a word
- ‘\>’ matches the end of a word
- ‘\b’ matches a word boundary
- ‘\B’ matches characters which are not a word boundary
- ‘\`’ matches the beginning of the whole input
- ‘\'’ matches the end of the whole input
Grouping is performed with parentheses ‘()’.
An unmatched ‘)’ matches just itself.
A backslash followed by a digit acts as a back-reference and matches the same thing as the previous grouped expression indicated by that number.
For example ‘\2’ matches the second group expression.
The order of group expressions is determined by the position of their opening parenthesis ‘(’.
The alternation operator is ‘|’.
The characters ‘^’ and ‘$’ always represent the beginning and end of a string respectively, except within square brackets. Within brackets, ‘^’ can be used to invert the membership of the character class being specified.
The characters ‘*’, ‘+’ and ‘?’ are special anywhere in a regular expression.
Intervals are specified by ‘{’ and ‘}’. Invalid intervals are treated as literals, for example ‘a{1’ is treated as ‘a\{1’.
The longest possible match is returned; this applies to the regular expression as a whole and (subject to this constraint) to subexpressions within groups.
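The unescaped grouping, alternation and repetition operators of this syntax can be combined as in the following sketch. It assumes GNU ‘find’; the image file names are hypothetical.

```shell
# Scratch directory; the file names are made up for this illustration.
dir=$(mktemp -d)
touch "$dir/img_01.jpg" "$dir/photo7.jpeg" "$dir/img.jpg" "$dir/image_1.jpg"

# (img|photo) is a group with alternation, _? an optional underscore,
# [0-9]+ one or more digits, and jpe?g accepts both jpg and jpeg.
images=$(find "$dir" -regextype posix-egrep \
    -regex '.*/(img|photo)_?[0-9]+\.jpe?g' -printf '%f\n' | sort)

echo "$images"
rm -r "$dir"
```

Here ‘img.jpg’ is skipped because it has no digits, and ‘image_1.jpg’ because its prefix is neither alternative.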
8.5.9 ‘egrep’ regular expression syntax
This is a synonym for posix-egrep.
8.5.10 ‘posix-extended’ regular expression syntax
The character ‘.’ matches any single character except the null character.
- ‘+’ indicates that the regular expression should match one or more occurrences of the previous atom or regexp.
- ‘?’ indicates that the regular expression should match zero or one occurrence of the previous atom or regexp.
- ‘\+’ matches a ‘+’.
- ‘\?’ matches a ‘?’.
Bracket expressions are used to match ranges of characters. Bracket expressions where the range is backward, for example ‘[z-a]’, are invalid. Within square brackets, ‘\’ is taken literally. Character classes are supported; for example ‘[[:digit:]]’ will match a single decimal digit.
GNU extensions are supported:
- ‘\w’ matches a character within a word
- ‘\W’ matches a character which is not within a word
- ‘\<’ matches the beginning of a word
- ‘\>’ matches the end of a word
- ‘\b’ matches a word boundary
- ‘\B’ matches characters which are not a word boundary
- ‘\`’ matches the beginning of the whole input
- ‘\'’ matches the end of the whole input
Grouping is performed with parentheses ‘()’. An unmatched ‘)’ matches just itself. A backslash followed by a digit acts as a back-reference and matches the same thing as the previous grouped expression indicated by that number. For example ‘\2’ matches the second group expression. The order of group expressions is determined by the position of their opening parenthesis ‘(’.
The alternation operator is ‘|’.
The characters ‘^’ and ‘$’ always represent the beginning and end of a string respectively, except within square brackets. Within brackets, ‘^’ can be used to invert the membership of the character class being specified.
‘*’, ‘+’ and ‘?’ are special at any point in a regular expression except the following places, where they are not allowed:
- At the beginning of a regular expression
- After an open-group, signified by ‘(’
- After the alternation operator ‘|’
Intervals are specified by ‘{’ and ‘}’. Invalid intervals such as ‘a{1z’ are not accepted.
The longest possible match is returned; this applies to the regular expression as a whole and (subject to this constraint) to subexpressions within groups.
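A short sketch of character classes and brace intervals in this syntax follows. It assumes GNU ‘find’; the spreadsheet file names are hypothetical.

```shell
# Scratch directory; the file names are made up for this illustration.
dir=$(mktemp -d)
touch "$dir/sales-2023.csv" "$dir/sales-23.csv" "$dir/2023-sales.csv"

# One or more letters, a hyphen, then exactly four digits before ".csv".
yearly=$(find "$dir" -regextype posix-extended \
    -regex '.*/[[:alpha:]]+-[[:digit:]]{4}\.csv' -printf '%f\n')

echo "$yearly"
rm -r "$dir"
```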
8.6 Environment Variables
‘LANG’ Provides a default value for the internationalisation variables that are unset or null.
‘LC_ALL’ If set to a non-empty string value, override the values of all the other internationalisation variables.
‘LC_COLLATE’ The POSIX standard specifies that this variable affects the pattern matching to be used for the ‘-name’ option. GNU find uses the GNU version of the ‘fnmatch’ library function. This variable also affects the interpretation of the response to ‘-ok’; while the ‘LC_MESSAGES’ variable selects the actual pattern used to interpret the response to ‘-ok’, the interpretation of any bracket expressions in the pattern will be affected by the ‘LC_COLLATE’ variable.
‘LC_CTYPE’ This variable affects the treatment of character classes used in regular expressions and also with the ‘-name’ test, if the ‘fnmatch’ function supports this. This variable also affects the interpretation of any character classes in the regular expressions used to interpret the response to the prompt issued by ‘-ok’. The ‘LC_CTYPE’ environment variable will also affect which characters are considered to be unprintable when filenames are printed (*note Unusual Characters in File Names::).
‘LC_MESSAGES’ Determines the locale to be used for internationalised messages, including the interpretation of the response to the prompt made by the ‘-ok’ action.
‘NLSPATH’ Determines the location of the internationalisation message catalogues.
‘PATH’ Affects the directories which are searched to find the executables invoked by ‘-exec’, ‘-execdir’, ‘-ok’ and ‘-okdir’. If the ‘PATH’ environment variable includes the current directory (by explicitly including ‘.’ or by having an empty element), and the ‘find’ command line includes ‘-execdir’ or ‘-okdir’, ‘find’ will refuse to run. *Note Security Considerations::, for a more detailed discussion of security matters.
‘POSIXLY_CORRECT’ Determines the block size used by ‘-ls’ and ‘-fls’. If ‘POSIXLY_CORRECT’ is set, blocks are units of 512 bytes. Otherwise they are units of 1024 bytes. Setting this variable also turns off warning messages (that is, implies ‘-nowarn’) by default, because POSIX requires that apart from the output for ‘-ok’, all messages printed on stderr are diagnostics and must result in a non-zero exit status. When ‘POSIXLY_CORRECT’ is set, the response to the prompt made by the ‘-ok’ action is interpreted according to the system’s message catalogue, as opposed to according to ‘find’’s own message translations.
‘TZ’ Affects the time zone used for some of the time-related format directives of ‘-printf’ and ‘-fprintf’.
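These variables can be set for a single invocation by prefixing the command. The sketch below assumes GNU ‘find’ and ‘touch’ and a hypothetical scratch directory. It relies on two assumptions beyond the text above: that only ASCII letters belong to ‘[[:alpha:]]’ in the ‘C’ locale, and that POSIX-style ‘TZ’ values such as ‘UTC0’ and ‘JST-9’ (nine hours ahead of UTC) are accepted by the system.

```shell
# Scratch directory; the file names are made up for this illustration.
dir=$(mktemp -d)

# One ASCII name and one name consisting of the UTF-8 bytes for e-acute.
touch "$dir/abc" "$dir/$(printf '\303\251')"

# LC_ALL overrides the other locale variables for this one command.
# In the C locale the non-ASCII name is not made of [[:alpha:]] characters.
c_locale_matches=$(LC_ALL=C find "$dir" -mindepth 1 -regextype posix-extended \
    -regex '.*/[[:alpha:]]+' -printf '%f\n')

# Give the file a known modification time, interpreted as UTC.
TZ=UTC0 touch -t 202001020304.00 "$dir/abc"

# TZ changes how -printf renders time directives such as %TH and %TM.
utc_time=$(TZ=UTC0 find "$dir/abc" -printf '%TH:%TM\n')
jst_time=$(TZ=JST-9 find "$dir/abc" -printf '%TH:%TM\n')

echo "$c_locale_matches"
echo "$utc_time $jst_time"
rm -r "$dir"
```

Setting the variables per command, rather than exporting them, keeps the rest of the shell session in its usual locale and time zone.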