source: kBuild/trunk/src/sed/doc/sed-in.texi@ 2773

Last change on this file since 2773 was 599, checked in by bird, 18 years ago

GNU sed 4.1.5.

1\input texinfo @c -*-texinfo-*-
2@c
3@c -- Stuff that needs adding: ----------------------------------------------
4@c (document the `;' command-separator)
5@c --------------------------------------------------------------------------
6@c Check for consistency: regexps in @code, text that they match in @samp.
7@c
8@c Tips:
9@c @command for command
10@c @samp for command fragments: @samp{cat -s}
11@c @code for sed commands and flags
12@c Use ``quote'' not `quote' or "quote".
13@c
14@c %**start of header
15@setfilename sed.info
16@settitle sed, a stream editor
17@c %**end of header
18
19@c @smallbook
20
21@include version.texi
22
23@c Combine indices.
24@syncodeindex ky cp
25@syncodeindex pg cp
26@syncodeindex tp cp
27
28@defcodeindex op
29@syncodeindex op fn
30
31@include config.texi
32
33@copying
34This file documents version @value{VERSION} of
35@value{SSED}, a stream editor.
36
37Copyright @copyright{} 1998, 1999, 2001, 2002, 2003, 2004 Free
38Software Foundation, Inc.
39
40This document is released under the terms of the @acronym{GNU} Free
41Documentation License as published by the Free Software Foundation;
42either version 1.1, or (at your option) any later version.
43
44You should have received a copy of the @acronym{GNU} Free Documentation
45License along with @value{SSED}; see the file @file{COPYING.DOC}.
46If not, write to the Free Software Foundation, 51 Franklin Street,
47Fifth Floor, Boston, MA 02110-1301, USA.
48
49There are no Cover Texts and no Invariant Sections; this text, along
50with its equivalent in the printed manual, constitutes the Title Page.
51@end copying
52
53@setchapternewpage off
54
55@titlepage
56@title @command{sed}, a stream editor
57@subtitle version @value{VERSION}, @value{UPDATED}
58@author by Ken Pizzini, Paolo Bonzini
59
60@page
61@vskip 0pt plus 1filll
62Copyright @copyright{} 1998, 1999 Free Software Foundation, Inc.
63
64@insertcopying
65
66Published by the Free Software Foundation, @*
6751 Franklin Street, Fifth Floor @*
68Boston, MA 02110-1301, USA
69@end titlepage
70
71
72@node Top
73@top
74
75@ifnottex
76@insertcopying
77@end ifnottex
78
79@menu
80* Introduction:: Introduction
81* Invoking sed:: Invocation
82* sed Programs:: @command{sed} programs
83* Examples:: Some sample scripts
84* Limitations:: Limitations and (non-)limitations of @value{SSED}
85* Other Resources:: Other resources for learning about @command{sed}
86* Reporting Bugs:: Reporting bugs
87
88* Extended regexps:: @command{egrep}-style regular expressions
89@ifset PERL
90* Perl regexps:: Perl-style regular expressions
91@end ifset
92
93* Concept Index:: A menu with all the topics in this manual.
94* Command and Option Index:: A menu with all @command{sed} commands and
95 command-line options.
96
97@detailmenu
98--- The detailed node listing ---
99
100sed Programs:
101* Execution Cycle:: How @command{sed} works
102* Addresses:: Selecting lines with @command{sed}
103* Regular Expressions:: Overview of regular expression syntax
104* Common Commands:: Often used commands
105* The "s" Command:: @command{sed}'s Swiss Army Knife
106* Other Commands:: Less frequently used commands
107* Programming Commands:: Commands for @command{sed} gurus
108* Extended Commands:: Commands specific to @value{SSED}
109* Escapes:: Specifying special characters
110
111Examples:
112* Centering lines::
113* Increment a number::
114* Rename files to lower case::
115* Print bash environment::
116* Reverse chars of lines::
117* tac:: Reverse lines of files
118* cat -n:: Numbering lines
119* cat -b:: Numbering non-blank lines
120* wc -c:: Counting chars
121* wc -w:: Counting words
122* wc -l:: Counting lines
123* head:: Printing the first lines
124* tail:: Printing the last lines
125* uniq:: Make duplicate lines unique
126* uniq -d:: Print duplicated lines of input
127* uniq -u:: Remove all duplicated lines
128* cat -s:: Squeezing blank lines
129
130@ifset PERL
131Perl regexps:: Perl-style regular expressions
132* Backslash:: Introduces special sequences
133* Circumflex/dollar sign/period:: Behave specially with regard to new lines
134* Square brackets:: Are a bit different in strange cases
135* Options setting:: Toggle modifiers in the middle of a regexp
136* Non-capturing subpatterns:: Are not counted when backreferencing
137* Repetition:: Allows for non-greedy matching
138* Backreferences:: Allows for more than 10 back references
139* Assertions:: Allows for complex look ahead matches
140* Non-backtracking subpatterns:: Often gives more performance
141* Conditional subpatterns:: Allows if/then/else branches
142* Recursive patterns:: For example to match parentheses
143* Comments:: Because things can get complex...
144@end ifset
145
146@end detailmenu
147@end menu
148
149
150@node Introduction
151@chapter Introduction
152
153@cindex Stream editor
154@command{sed} is a stream editor.
155A stream editor is used to perform basic text
156transformations on an input stream
157(a file or input from a pipeline).
158While in some ways similar to an editor which
159permits scripted edits (such as @command{ed}),
160@command{sed} works by making only one pass over the
161input(s), and is consequently more efficient.
162But it is @command{sed}'s ability to filter text in a pipeline
163which particularly distinguishes it from other types of
164editors.
165
166
167@node Invoking sed
168@chapter Invocation
169
170Normally @command{sed} is invoked like this:
171
172@example
173sed SCRIPT INPUTFILE...
174@end example
175
176The full format for invoking @command{sed} is:
177
178@example
179sed OPTIONS... [SCRIPT] [INPUTFILE...]
180@end example
181
182If you do not specify @var{INPUTFILE}, or if @var{INPUTFILE} is @file{-},
183@command{sed} filters the contents of the standard input. The @var{SCRIPT}
184is the first non-option parameter; @command{sed} treats it as a script
185rather than an input file if (and only if) no other option specifies a
186script to be executed, that is, if neither the @option{-e} nor the
187@option{-f} option is given.
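
As a minimal sketch of the rule above, assuming a POSIX shell and @value{SSED} installed as @command{sed} (the variable names are illustrative only), the same substitution can be passed either as the first non-option argument or through @option{-e}:

```shell
# With no -e or -f, the first non-option argument is taken as the script.
out_plain=$(printf 'hello\n' | sed 's/hello/world/')

# With -e, every non-option argument is an input file; "-" names standard input.
out_e=$(printf 'hello\n' | sed -e 's/hello/world/' -)

printf '%s\n%s\n' "$out_plain" "$out_e"
```

Both invocations should print the same transformed line.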
188
189@command{sed} may be invoked with the following command-line options:
190
191@table @code
192@item --version
193@opindex --version
194@cindex Version, printing
195Print out the version of @command{sed} that is being run and a copyright notice,
196then exit.
197
198@item --help
199@opindex --help
200@cindex Usage summary, printing
201Print a usage message briefly summarizing these command-line options
202and the bug-reporting address,
203then exit.
204
205@item -n
206@itemx --quiet
207@itemx --silent
208@opindex -n
209@opindex --quiet
210@opindex --silent
211@cindex Disabling autoprint, from command line
212By default, @command{sed} prints out the pattern space
213at the end of each cycle through the script.
214These options disable this automatic printing,
215and @command{sed} only produces output when explicitly told to
216via the @code{p} command.
217
218@item -i[@var{SUFFIX}]
219@itemx --in-place[=@var{SUFFIX}]
220@opindex -i
221@opindex --in-place
222@cindex In-place editing, activating
223@cindex @value{SSEDEXT}, in-place editing
224This option specifies that files are to be edited in-place.
225@value{SSED} does this by creating a temporary file and
226sending output to this file rather than to the standard
227output.@footnote{This applies to commands such as @code{=},
228@code{a}, @code{c}, @code{i}, @code{l}, @code{p}. You can
229still write to the standard output by using the @code{w}
230@cindex @value{SSEDEXT}, @file{/dev/stdout} file
231or @code{W} commands together with the @file{/dev/stdout}
232special file}.
233
234This option implies @option{-s}.
235
236When the end of the file is reached, the temporary file is
237renamed to the output file's original name. The extension,
238if supplied, is used to modify the name of the old file
239before renaming the temporary file, thereby making a backup
240copy@footnote{Note that @value{SSED} creates the backup
241 file whether or not any output is actually changed.}.
242
243@cindex In-place editing, Perl-style backup file names
244This rule is followed: if the extension doesn't contain a @code{*},
245then it is appended to the end of the current filename as a
246suffix; if the extension does contain one or more @code{*}
247characters, then @emph{each} asterisk is replaced with the
248current filename. This allows you to add a prefix to the
249backup file, instead of (or in addition to) a suffix, or
250even to place backup copies of the original files into another
251directory (provided the directory already exists).
252
253If no extension is supplied, the original file is
254overwritten without making a backup.
255
256@item -l @var{N}
257@itemx --line-length=@var{N}
258@opindex -l
259@opindex --line-length
260@cindex Line length, setting
261Specify the default line-wrap length for the @code{l} command.
262A length of 0 (zero) means to never wrap long lines. If
263not specified, it is taken to be 70.
264
265@item --posix
266@cindex @value{SSEDEXT}, disabling
267@value{SSED} includes several extensions to @acronym{POSIX}
268sed. In order to simplify writing portable scripts, this
269option disables all the extensions that this manual documents,
270including additional commands.
271@cindex @code{POSIXLY_CORRECT} behavior, enabling
272Most of the extensions accept @command{sed} programs that
273are outside the syntax mandated by @acronym{POSIX}, but some
274of them (such as the behavior of the @code{N} command
275described in @ref{Reporting Bugs}) actually violate the
276standard. If you want to disable only the latter kind of
277extension, you can set the @code{POSIXLY_CORRECT} variable
278to a non-empty value.
279
280@item -r
281@itemx --regexp-extended
282@opindex -r
283@opindex --regexp-extended
284@cindex Extended regular expressions, choosing
285@cindex @acronym{GNU} extensions, extended regular expressions
286Use extended regular expressions rather than basic
287regular expressions. Extended regexps are those that
288@command{egrep} accepts; they can be clearer because they
289usually have fewer backslashes, but are a @acronym{GNU} extension
290and hence scripts that use them are not portable.
291@xref{Extended regexps, , Extended regular expressions}.
292
293@ifset PERL
294@item -R
295@itemx --regexp-perl
296@opindex -R
297@opindex --regexp-perl
298@cindex Perl-style regular expressions, choosing
299@cindex @value{SSEDEXT}, Perl-style regular expressions
300Use Perl-style regular expressions rather than basic
301regular expressions. Perl-style regexps are extremely
302powerful but are a @value{SSED} extension and hence scripts that
303use them are not portable. @xref{Perl regexps, ,
304Perl-style regular expressions}.
305@end ifset
306
307@item -s
308@itemx --separate
309@cindex Working on separate files
310By default, @command{sed} will consider the files specified on the
311command line as a single continuous long stream. This @value{SSED}
312extension allows the user to consider them as separate files:
313range addresses (such as @samp{/abc/,/def/}) are not allowed
314to span several files, line numbers are relative to the start
315of each file, @code{$} refers to the last line of each file,
316and files invoked from the @code{R} commands are rewound at the
317start of each file.
318
319@item -u
320@itemx --unbuffered
321@opindex -u
322@opindex --unbuffered
323@cindex Unbuffered I/O, choosing
324Buffer both input and output as minimally as practical.
325(This is particularly useful if the input is coming from
326the likes of @samp{tail -f}, and you wish to see the transformed
327output as soon as possible.)
328
329@item -e @var{script}
330@itemx --expression=@var{script}
331@opindex -e
332@opindex --expression
333@cindex Script, from command line
334Add the commands in @var{script} to the set of commands to be
335run while processing the input.
336
337@item -f @var{script-file}
338@itemx --file=@var{script-file}
339@opindex -f
340@opindex --file
341@cindex Script, from a file
342Add the commands contained in the file @var{script-file}
343to the set of commands to be run while processing the input.
344
345@end table
346
347If no @option{-e}, @option{-f}, @option{--expression}, or @option{--file}
348options are given on the command-line,
349then the first non-option argument on the command line is
350taken to be the @var{script} to be executed.
351
352@cindex Files to be processed as input
353If any command-line parameters remain after processing the above,
354these parameters are interpreted as the names of input files to
355be processed.
356@cindex Standard input, processing as input
357A file name of @samp{-} refers to the standard input stream.
358The standard input will be processed if no file names are specified.
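
The following sketch exercises a few of the options described above. It assumes @value{SSED} is installed as @command{sed} and that @command{mktemp} is available; the file and variable names are hypothetical.

```shell
tmpdir=$(mktemp -d)
printf 'one\ntwo\n' > "$tmpdir/a.txt"
printf 'three\nfour\n' > "$tmpdir/b.txt"

# -n suppresses autoprint, so only lines selected by p are shown.
# Without -s the two files form one stream, so $ is the last line of b.txt.
last_joined=$(sed -n '$p' "$tmpdir/a.txt" "$tmpdir/b.txt")

# With -s, $ matches the last line of each file.
last_each=$(sed -s -n '$p' "$tmpdir/a.txt" "$tmpdir/b.txt")

# -i.bak edits a.txt in place and keeps the original as a.txt.bak.
sed -i.bak 's/one/ONE/' "$tmpdir/a.txt"
edited=$(cat "$tmpdir/a.txt")
backup=$(cat "$tmpdir/a.txt.bak")

printf '%s\n' "$last_joined" "$last_each" "$edited" "$backup"
rm -rf "$tmpdir"
```

Because @option{-i} implies @option{-s}, an in-place edit of several files treats each file as its own stream.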
359
360
361@node sed Programs
362@chapter @command{sed} Programs
363
364@cindex @command{sed} program structure
365@cindex Script structure
366A @command{sed} program consists of one or more @command{sed} commands,
367passed in by one or more of the
368@option{-e}, @option{-f}, @option{--expression}, and @option{--file}
369options, or the first non-option argument if none of these
370options is used.
371This document will refer to ``the'' @command{sed} script;
372this is understood to mean the in-order catenation
373of all of the @var{script}s and @var{script-file}s passed in.
374
375Each @code{sed} command consists of an optional address or
376address range, followed by a one-character command name
377and any additional command-specific code.
378
379@menu
380* Execution Cycle:: How @command{sed} works
381* Addresses:: Selecting lines with @command{sed}
382* Regular Expressions:: Overview of regular expression syntax
383* Common Commands:: Often used commands
384* The "s" Command:: @command{sed}'s Swiss Army Knife
385* Other Commands:: Less frequently used commands
386* Programming Commands:: Commands for @command{sed} gurus
387* Extended Commands:: Commands specific to @value{SSED}
388* Escapes:: Specifying special characters
389@end menu
390
391
392@node Execution Cycle
393@section How @command{sed} Works
394
395@cindex Buffer spaces, pattern and hold
396@cindex Spaces, pattern and hold
397@cindex Pattern space, definition
398@cindex Hold space, definition
399@command{sed} maintains two data buffers: the active @emph{pattern} space,
400and the auxiliary @emph{hold} space. Both are initially empty.
401
402@command{sed} operates by performing the following cycle on each
403line of input: first, @command{sed} reads one line from the input
404stream, removes any trailing newline, and places it in the pattern space.
405Then commands are executed; each command can have an address associated
406with it: addresses are a kind of condition code, and a command is only
407executed if the condition holds before the command is to be
408executed.
409
410When the end of the script is reached, unless the @option{-n} option
411is in use, the contents of pattern space are printed out to the output
412stream, adding back the trailing newline if it was removed.@footnote{Actually,
413 if @command{sed} prints a line without the terminating newline, it will
414 nevertheless print the missing newline as soon as more text is sent to
415 the same output stream, which gives the ``least expected surprise''
416 even though it does not make commands like @samp{sed -n p} exactly
417 identical to @command{cat}.} Then the next cycle starts for the next
418input line.
419
420Unless special commands (like @samp{D}) are used, the pattern space is
421deleted between two cycles. The hold space, on the other hand, keeps
422its data between cycles (see commands @samp{h}, @samp{H}, @samp{x},
423@samp{g}, @samp{G} to move data between both buffers).
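
To illustrate how the hold space survives between cycles, here is a small sketch (assuming @command{sed} is on the path; the variable name is illustrative) that reverses the order of its input lines:

```shell
# 1!G  on every line but the first, append the hold space to the pattern space
# h    copy the pattern space (the lines so far, newest first) to the hold space
# $p   on the last line, print the accumulated result
reversed=$(printf 'a\nb\nc\n' | sed -n '1!G;h;$p')
printf '%s\n' "$reversed"
```

Each cycle starts with a fresh pattern space holding one input line, while the hold space carries the reversed text accumulated so far.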
424
425
426@node Addresses
427@section Selecting lines with @command{sed}
428@cindex Addresses, in @command{sed} scripts
429@cindex Line selection
430@cindex Selecting lines to process
431
432Addresses in a @command{sed} script can be in any of the following forms:
433@table @code
434@item @var{number}
435@cindex Address, numeric
436@cindex Line, selecting by number
437Specifying a line number will match only that line in the input.
438(Note that @command{sed} counts lines continuously across all input files
439unless @option{-i} or @option{-s} options are specified.)
440
441@item @var{first}~@var{step}
442@cindex @acronym{GNU} extensions, @samp{@var{n}~@var{m}} addresses
443This @acronym{GNU} extension matches every @var{step}th line
444starting with line @var{first}.
445In particular, lines will be selected when there exists
446a non-negative @var{n} such that the current line-number equals
447@var{first} + (@var{n} * @var{step}).
448Thus, to select the odd-numbered lines,
449one would use @code{1~2};
450to pick every third line starting with the second, @samp{2~3} would be used;
451to pick every fifth line starting with the tenth, use @samp{10~5};
452and @samp{50~0} is just an obscure way of saying @code{50}.
453
454@item $
455@cindex Address, last line
456@cindex Last line, selecting
457@cindex Line, selecting last
458This address matches the last line of the last file of input, or
459the last line of each file when the @option{-i} or @option{-s} options
460are specified.
461
462@item /@var{regexp}/
463@cindex Address, as a regular expression
464@cindex Line, selecting by regular expression match
465This will select any line which matches the regular expression @var{regexp}.
466If @var{regexp} itself includes any @code{/} characters,
467each must be escaped by a backslash (@code{\}).
468
469@cindex empty regular expression
470@cindex @value{SSEDEXT}, modifiers and the empty regular expression
471The empty regular expression @samp{//} repeats the last regular
472expression match (the same holds if the empty regular expression is
473passed to the @code{s} command). Note that modifiers to regular expressions
474are evaluated when the regular expression is compiled, thus it is invalid to
475specify them together with the empty regular expression.
476
477@item \%@var{regexp}%
478(The @code{%} may be replaced by any other single character.)
479
480@cindex Slash character, in regular expressions
481This also matches the regular expression @var{regexp},
482but allows one to use a different delimiter than @code{/}.
483This is particularly useful if the @var{regexp} itself contains
484a lot of slashes, since it avoids the tedious escaping of every @code{/}.
485If @var{regexp} itself includes any delimiter characters,
486each must be escaped by a backslash (@code{\}).
487
488@item /@var{regexp}/I
489@itemx \%@var{regexp}%I
490@cindex @acronym{GNU} extensions, @code{I} modifier
491@ifset PERL
492@cindex Perl-style regular expressions, case-insensitive
493@end ifset
494The @code{I} modifier to regular-expression matching is a @acronym{GNU}
495extension which causes the @var{regexp} to be matched in
496a case-insensitive manner.
497
498@item /@var{regexp}/M
499@itemx \%@var{regexp}%M
500@cindex @value{SSEDEXT}, @code{M} modifier
501@ifset PERL
502@cindex Perl-style regular expressions, multiline
503@end ifset
504The @code{M} modifier to regular-expression matching is a @value{SSED}
505extension which causes @code{^} and @code{$} to match respectively
506(in addition to the normal behavior) the empty string after a newline,
507and the empty string before a newline. There are special character
508sequences
509@ifset PERL
510(@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'}
511in basic or extended regular expression modes)
512@end ifset
513@ifclear PERL
514(@code{\`} and @code{\'})
515@end ifclear
516which always match the beginning or the end of the buffer.
517@code{M} stands for @cite{multi-line}.
518
519@ifset PERL
520@item /@var{regexp}/S
521@itemx \%@var{regexp}%S
522@cindex @value{SSEDEXT}, @code{S} modifier
523@cindex Perl-style regular expressions, single line
524The @code{S} modifier to regular-expression matching is only valid
525in Perl mode and specifies that the dot character (@code{.}) will
526match the newline character too. @code{S} stands for @cite{single-line}.
527@end ifset
528
529@ifset PERL
530@item /@var{regexp}/X
531@itemx \%@var{regexp}%X
532@cindex @value{SSEDEXT}, @code{X} modifier
533@cindex Perl-style regular expressions, extended
534The @code{X} modifier to regular-expression matching is also
535valid in Perl mode only. If it is used, whitespace in the
536pattern (other than in a character class) and
537characters between a @kbd{#} outside a character class and the
538next newline character are ignored. An escaping backslash
539can be used to include a whitespace or @kbd{#} character as part
540of the pattern.
541@end ifset
542@end table
543
544If no addresses are given, then all lines are matched;
545if one address is given, then only lines matching that
546address are matched.
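
The single-address forms above can be tried with a sketch like the following. It assumes @value{SSED} (the @code{~} step and the @code{I} modifier are @acronym{GNU} extensions); the helper @code{sample} and the variable names are hypothetical.

```shell
# Six sample lines to select from.
sample() { printf '%s\n' l1 l2 l3 l4 l5 l6; }

second=$(sample | sed -n '2p')          # a line number
odd=$(sample | sed -n '1~2p')           # first~step
last=$(sample | sed -n '$p')            # last line of input
by_regexp=$(sample | sed -n '/[35]/p')  # regular expression

# \% introduces an alternative delimiter, so the slashes need no escaping.
usr=$(printf '/usr/bin\n/etc\n' | sed -n '\%^/usr/%p')

# The I modifier makes the match case-insensitive.
anycase=$(printf 'Foo\nbar\n' | sed -n '/foo/Ip')

printf '%s\n' "$second" "$odd" "$last" "$by_regexp" "$usr" "$anycase"
```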
547
548@cindex Range of lines
549@cindex Several lines, selecting
550An address range can be specified by specifying two addresses
551separated by a comma (@code{,}). An address range matches lines
552starting from where the first address matches, and continues
553until the second address matches (inclusively).
554
555If the second address is a @var{regexp}, then checking for the
556ending match will start with the line @emph{following} the
557line which matched the first address: a range will always
558span at least two lines (except of course if the input stream
559ends).
560
561If the second address is a @var{number} less than (or equal to)
562the line matching the first address, then only the one line is
563matched.
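
A short sketch of the range rules just described, assuming @command{sed} is on the path (the helper @code{digits} and the variable names are illustrative):

```shell
digits() { printf '%s\n' 1 2 3 4 5 6; }

# Lines 2 through 4, inclusive.
range=$(digits | sed -n '2,4p')

# The second number is not greater than the first, so only line 4 matches.
backward=$(digits | sed -n '4,2p')

# The end regexp is first checked on the line after the one that opened the
# range, so /a/,/a/ runs from the first "a" to the next one.
regexp_end=$(printf 'a\nb\na\nc\n' | sed -n '/a/,/a/p')

printf '%s\n' "$range" "$backward" "$regexp_end"
```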
564
565@cindex Special addressing forms
566@cindex Range with start address of zero
567@cindex Zero, as range start address
568@cindex @var{addr1},+N
569@cindex @var{addr1},~N
570@cindex @acronym{GNU} extensions, special two-address forms
571@cindex @acronym{GNU} extensions, @code{0} address
572@cindex @acronym{GNU} extensions, 0,@var{addr2} addressing
573@cindex @acronym{GNU} extensions, @var{addr1},+@var{N} addressing
574@cindex @acronym{GNU} extensions, @var{addr1},~@var{N} addressing
575@value{SSED} also supports some special two-address forms; all these
576are @acronym{GNU} extensions:
577@table @code
578@item 0,/@var{regexp}/
579A line number of @code{0} can be used in an address specification like
580@code{0,/@var{regexp}/} so that @command{sed} will try to match
581@var{regexp} in the first input line too. In other words,
582@code{0,/@var{regexp}/} is similar to @code{1,/@var{regexp}/},
583except that if @var{regexp} matches the very first line of input the
584@code{0,/@var{regexp}/} form will consider it to end the range, whereas
585the @code{1,/@var{regexp}/} form will match the beginning of its range and
586hence make the range span up to the @emph{second} occurrence of the
587regular expression.
588
589Note that this is the only place where the @code{0} address makes
590sense; there is no 0-th line and commands which are given the @code{0}
591address in any other way will give an error.
592
593@item @var{addr1},+@var{N}
594Matches @var{addr1} and the @var{N} lines following @var{addr1}.
595
596@item @var{addr1},~@var{N}
597Matches @var{addr1} and the lines following @var{addr1}
598until the next line whose input line number is a multiple of @var{N}.
599@end table
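
The special two-address forms can be compared side by side with a sketch such as this one. It assumes @value{SSED}, since all three forms are @acronym{GNU} extensions; the helper and variable names are hypothetical.

```shell
digits() { printf '%s\n' 1 2 3 4 5 6; }

# 0,/x/ may end on the very first line; 1,/x/ runs to the second match.
zero_form=$(printf 'x\ny\nx\nz\n' | sed -n '0,/x/p')
one_form=$(printf 'x\ny\nx\nz\n' | sed -n '1,/x/p')

# Line 2 and the one line that follows it.
plus=$(digits | sed -n '2,+1p')

# Line 2 up to the next line whose number is a multiple of 4.
tilde=$(digits | sed -n '2,~4p')

printf '%s\n' "$zero_form" "$one_form" "$plus" "$tilde"
```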
600
601@cindex Excluding lines
602@cindex Selecting non-matching lines
603Appending the @code{!} character to the end of an address
604specification negates the sense of the match.
605That is, if the @code{!} character follows an address range,
606then only lines which do @emph{not} match the address range
607will be selected.
608This also works for singleton addresses,
609and, perhaps perversely, for the null address.
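
As a brief sketch of negation (assuming @command{sed} is on the path; the variable names are illustrative), @code{!} selects exactly the lines that the address or range would otherwise skip:

```shell
# Print everything outside the range 2,4.
outside=$(printf '%s\n' 1 2 3 4 5 6 | sed -n '2,4!p')

# Print every line that does not match the regexp.
not_three=$(printf '%s\n' 1 2 3 4 5 6 | sed -n '/3/!p')

printf '%s\n' "$outside" "$not_three"
```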
610
611
612@node Regular Expressions
613@section Overview of Regular Expression Syntax
614
615To know how to use @command{sed}, people should understand regular
616expressions (@dfn{regexp} for short). A regular expression
617is a pattern that is matched against a
618subject string from left to right. Most characters are
619@dfn{ordinary}: they stand for
620themselves in a pattern, and match the corresponding characters
621in the subject. As a trivial example, the pattern
622
623@example
624 The quick brown fox
625@end example
626
627@noindent
628matches a portion of a subject string that is identical to
629itself. The power of regular expressions comes from the
630ability to include alternatives and repetitions in the pattern.
631These are encoded in the pattern by the use of @dfn{special characters},
632which do not stand for themselves but instead
633are interpreted in some special way. Here is a brief description
634of regular expression syntax as used in @command{sed}.
635
636@table @code
637@item @var{char}
638A single ordinary character matches itself.
639
640@item *
641@cindex @acronym{GNU} extensions, to basic regular expressions
642Matches a sequence of zero or more instances of matches for the
643preceding regular expression, which must be an ordinary character, a
644special character preceded by @code{\}, a @code{.}, a grouped regexp
645(see below), or a bracket expression. As a @acronym{GNU} extension, a
646postfixed regular expression can also be followed by @code{*}; for
647example, @code{a**} is equivalent to @code{a*}. @acronym{POSIX}
6481003.1-2001 says that @code{*} stands for itself when it appears at
649the start of a regular expression or subexpression, but many
650non-@acronym{GNU} implementations do not support this and portable
651scripts should instead use @code{\*} in these contexts.
652
653@item \+
654@cindex @acronym{GNU} extensions, to basic regular expressions
655As @code{*}, but matches one or more. It is a @acronym{GNU} extension.
656
657@item \?
658@cindex @acronym{GNU} extensions, to basic regular expressions
659As @code{*}, but only matches zero or one. It is a @acronym{GNU} extension.
660
661@item \@{@var{i}\@}
662As @code{*}, but matches exactly @var{i} sequences (@var{i} is a
663decimal integer; for portability, keep it between 0 and 255
664inclusive).
665
666@item \@{@var{i},@var{j}\@}
667Matches between @var{i} and @var{j}, inclusive, sequences.
668
669@item \@{@var{i},\@}
670Matches @var{i} or more sequences.
671
672@item \(@var{regexp}\)
673Groups the inner @var{regexp} as a whole; this is used to:
674
675@itemize @bullet
676@item
677@cindex @acronym{GNU} extensions, to basic regular expressions
678Apply postfix operators, like @code{\(abcd\)*}:
679this will search for zero or more whole sequences
680of @samp{abcd}, while @code{abcd*} would search
681for @samp{abc} followed by zero or more occurrences
682of @samp{d}. Note that support for @code{\(abcd\)*} is
683required by @acronym{POSIX} 1003.1-2001, but many non-@acronym{GNU}
684implementations do not support it and hence it is not universally
685portable.
686
687@item
688Use back references (see below).
689@end itemize
690
691@item .
692Matches any character, including newline.
693
694@item ^
695Matches the null string at beginning of line, i.e. what
696appears after the circumflex must appear at the
697beginning of line. @code{^#include} will match only
698lines where @samp{#include} is the first thing on line---if
699there are spaces before, for example, the match fails.
700@code{^} acts as a special character only at the beginning
701of the regular expression or subexpression (that is,
702after @code{\(} or @code{\|}). Portable scripts should avoid
703@code{^} at the beginning of a subexpression, though, as
704@acronym{POSIX} allows implementations that treat @code{^} as
705an ordinary character in that context.
706
707
708@item $
709It is the same as @code{^}, but refers to end of line.
710@code{$} also acts as a special character only at the end
711of the regular expression or subexpression (that is, before @code{\)}
712or @code{\|}), and its use at the end of a subexpression is not
713portable.
714
715
716@item [@var{list}]
717@itemx [^@var{list}]
718Matches any single character in @var{list}: for example,
719@code{[aeiou]} matches all vowels. A list may include
720sequences like @code{@var{char1}-@var{char2}}, which
721matches any character between (inclusive) @var{char1}
722and @var{char2}.
723
724A leading @code{^} reverses the meaning of @var{list}, so that
725it matches any single character @emph{not} in @var{list}. To include
726@code{]} in the list, make it the first character (after
727the @code{^} if needed); to include @code{-} in the list,
728make it the first or last; to include @code{^}, put
729it after the first character.
730
731@cindex @code{POSIXLY_CORRECT} behavior, bracket expressions
732The characters @code{$}, @code{*}, @code{.}, @code{[}, and @code{\}
733are normally not special within @var{list}. For example, @code{[\*]}
734matches either @samp{\} or @samp{*}, because the @code{\} is not
735special here. However, strings like @code{[.ch.]}, @code{[=a=]}, and
736@code{[:space:]} are special within @var{list} and represent collating
737symbols, equivalence classes, and character classes, respectively, and
738@code{[} is therefore special within @var{list} when it is followed by
739@code{.}, @code{=}, or @code{:}. Also, when not in
740@env{POSIXLY_CORRECT} mode, special escapes like @code{\n} and
741@code{\t} are recognized within @var{list}. @xref{Escapes}.
742
743@item @var{regexp1}\|@var{regexp2}
744@cindex @acronym{GNU} extensions, to basic regular expressions
745Matches either @var{regexp1} or @var{regexp2}. Use
746parentheses to use complex alternative regular expressions.
747The matching process tries each alternative in turn, from
748left to right, and the first one that succeeds is used.
749It is a @acronym{GNU} extension.
750
751@item @var{regexp1}@var{regexp2}
752Matches the concatenation of @var{regexp1} and @var{regexp2}.
753Concatenation binds more tightly than @code{\|}, @code{^}, and
754@code{$}, but less tightly than the other regular expression
755operators.
756
757@item \@var{digit}
758Matches the @var{digit}-th @code{\(@dots{}\)} parenthesized
759subexpression in the regular expression. This is called a @dfn{back
760reference}. Subexpressions are implicitly numbered by counting
761occurrences of @code{\(} left-to-right.
762
763@item \n
764Matches the newline character.
765
766@item \@var{char}
767Matches @var{char}, where @var{char} is one of @code{$},
768@code{*}, @code{.}, @code{[}, @code{\}, or @code{^}.
769Note that the only C-like
770backslash sequences that you can portably assume to be
771interpreted are @code{\n} and @code{\\}; in particular
772@code{\t} is not portable, and matches a @samp{t} under most
773implementations of @command{sed}, rather than a tab character.
774
775@end table
776
777@cindex Greedy regular expression matching
778Note that the regular expression matcher is greedy, i.e., matches
779are attempted from left to right and, if two or more matches are
780possible starting at the same character, it selects the longest.
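
The operators described above can be exercised with a sketch like the following. It assumes @value{SSED}, because @code{\?} and @code{\|} are @acronym{GNU} extensions; the variable names are illustrative.

```shell
# \? makes the preceding "a" optional.
optional=$(printf 'b\nab\naab\n' | sed -n '/^a\?b$/p')

# \(ab\)* repeats the whole group, not just the "b".
grouped=$(printf 'abab\nabd\n' | sed -n '/^\(ab\)*$/p')

# \| chooses between alternatives inside the group.
either=$(printf 'cat\ndog\nbird\n' | sed -n '/^\(cat\|dog\)$/p')

# \1 must repeat exactly what the first group matched.
doubled=$(printf 'abcabc\nabcabd\n' | sed -n '/^\(.*\)\1$/p')

# \{3\} requires exactly three repetitions.
exactly3=$(printf 'aa\naaa\naaaa\n' | sed -n '/^a\{3\}$/p')

# Greedy matching: .*= extends to the last "=", so only "baz" remains.
greedy=$(printf 'foo=bar=baz\n' | sed 's/.*=//')

printf '%s\n' "$optional" "$grouped" "$either" "$doubled" "$exactly3" "$greedy"
```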
781
782@noindent
783Examples:
784@table @samp
785@item abcdef
786Matches @samp{abcdef}.
787
788@item a*b
789Matches zero or more @samp{a}s followed by a single
790@samp{b}. For example, @samp{b} or @samp{aaaaab}.
791
792@item a\?b
793Matches @samp{b} or @samp{ab}.
794
795@item a\+b\+
796Matches one or more @samp{a}s followed by one or more
797@samp{b}s: @samp{ab} is the shortest possible match, but
798other examples are @samp{aaaab} or @samp{abbbbb} or
799@samp{aaaaaabbbbbbb}.
800
801@item .*
802@itemx .\+
803These two both match all the characters in a string;
804however, the first matches every string (including the empty
805string), while the second matches only strings containing
806at least one character.
807
808@item ^main.*(.*)
809This matches a string starting with @samp{main},
810followed by an opening and closing
811parenthesis. The @samp{n}, @samp{(} and @samp{)} need not
812be adjacent.
813
814@item ^#
815This matches a string beginning with @samp{#}.
816
817@item \\$
818This matches a string ending with a single backslash. The
819regexp contains two backslashes for escaping.
820
821@item \$
822Instead, this matches a string consisting of a single dollar sign,
823because it is escaped.
824
825@item [a-zA-Z0-9]
In the C locale, this matches any single @acronym{ASCII} letter or digit.
827
828@item [^ @kbd{tab}]\+
829(Here @kbd{tab} stands for a single tab character.)
830This matches a string of one or more
831characters, none of which is a space or a tab.
832Usually this means a word.
833
834@item ^\(.*\)\n\1$
835This matches a string consisting of two equal substrings separated by
836a newline.
837
838@item .\@{9\@}A$
839This matches nine characters followed by an @samp{A}.
840
841@item ^.\@{15\@}A
842This matches the start of a string that contains 16 characters,
843the last of which is an @samp{A}.
844
845@end table
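
As a quick check of the examples above, here is a minimal shell sketch,
assuming @value{SSED} is available as @command{sed}. The sample input and
the variable names (@code{star}, @code{nine}, @code{dup}) are made up for
illustration; @samp{sed -n '/@var{regexp}/p'} simply prints the lines that
match.

```shell
# Hypothetical sample input, one candidate string per line.
input='abcdef
b
aaaaab
xyz
123456789A'

# a*b: zero or more a's followed by a b, anywhere in the line.
star=$(printf '%s\n' "$input" | sed -n '/a*b/p')

# An interval: exactly nine characters followed by a final A.
nine=$(printf '%s\n' "$input" | sed -n '/^.\{9\}A$/p')

# A back reference: N joins two input lines with a newline, and \1 must
# then repeat whatever the parenthesized subexpression matched.
dup=$(printf 'foo\nfoo\n' | sed -n 'N; /^\(.*\)\n\1$/p')

printf '%s\n---\n%s\n---\n%s\n' "$star" "$nine" "$dup"
```

Only the three lines containing a @samp{b} are selected by the first
command, and only the ten-character line by the second.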
846
847
848
849@node Common Commands
850@section Often-Used Commands
851
852If you use @command{sed} at all, you will quite likely want to know
853these commands.
854
855@table @code
856@item #
857[No addresses allowed.]
858
859@findex # (comments)
860@cindex Comments, in scripts
861The @code{#} character begins a comment;
862the comment continues until the next newline.
863
864@cindex Portability, comments
865If you are concerned about portability, be aware that
866some implementations of @command{sed} (which are not @sc{posix}
867conformant) may only support a single one-line comment,
868and then only when the very first character of the script is a @code{#}.
869
870@findex -n, forcing from within a script
871@cindex Caveat --- #n on first line
872Warning: if the first two characters of the @command{sed} script
873are @code{#n}, then the @option{-n} (no-autoprint) option is forced.
874If you want to put a comment in the first line of your script
875and that comment begins with the letter @samp{n}
876and you do not want this behavior,
877then be sure to either use a capital @samp{N},
878or place at least one space before the @samp{n}.
879
880@item q [@var{exit-code}]
881This command only accepts a single address.
882
883@findex q (quit) command
884@cindex @value{SSEDEXT}, returning an exit code
885@cindex Quitting
886Exit @command{sed} without processing any more commands or input.
Note that the current pattern space is printed if auto-print is
not disabled with the @option{-n} option. The ability to return
an exit code from the @command{sed} script is a @value{SSED} extension.
890
891@item d
892@findex d (delete) command
893@cindex Text, deleting
894Delete the pattern space;
895immediately start next cycle.
896
897@item p
898@findex p (print) command
899@cindex Text, printing
900Print out the pattern space (to the standard output).
901This command is usually only used in conjunction with the @option{-n}
902command-line option.
903
904@item n
905@findex n (next-line) command
906@cindex Next input line, replace pattern space with
907@cindex Read next input line
908If auto-print is not disabled, print the pattern space,
909then, regardless, replace the pattern space with the next line of input.
910If there is no more input then @command{sed} exits without processing
911any more commands.
912
913@item @{ @var{commands} @}
914@findex @{@} command grouping
915@cindex Grouping commands
916@cindex Command groups
917A group of commands may be enclosed between
918@code{@{} and @code{@}} characters.
919This is particularly useful when you want a group of commands
920to be triggered by a single address (or address-range) match.
921
922@end table
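
The commands in this section can be tried out with a short shell sketch
such as the following. It assumes @value{SSED} and the @command{seq}
utility are installed; the variable names are invented for the example.

```shell
# Ten numbered lines serve as hypothetical input.
lines=$(seq 1 10)

# q: print up to and including line 3, then quit.
first3=$(printf '%s\n' "$lines" | sed '3q')

# d: delete lines 2 through 9; lines 1 and 10 are auto-printed.
ends=$(printf '%s\n' "$lines" | sed '2,9d')

# -n with p: print only the addressed line.
only5=$(printf '%s\n' "$lines" | sed -n '5p')

# n then d: auto-print a line, fetch the next one, and delete it,
# which leaves the odd-numbered lines.
odd=$(printf '%s\n' "$lines" | sed 'n;d')

# A group applies several commands to one address range.
marked=$(printf '%s\n' "$lines" | sed -n '2,3{s/^/> /;p;}')

printf '%s\n' "$first3" "$ends" "$only5" "$odd" "$marked"
```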
923
924@node The "s" Command
925@section The @code{s} Command
926
927The syntax of the @code{s} (as in substitute) command is
928@samp{s/@var{regexp}/@var{replacement}/@var{flags}}. The @code{/}
929characters may be uniformly replaced by any other single
930character within any given @code{s} command. The @code{/}
931character (or whatever other character is used in its stead)
932can appear in the @var{regexp} or @var{replacement}
933only if it is preceded by a @code{\} character.
934
935The @code{s} command is probably the most important in @command{sed}
936and has a lot of different options. Its basic concept is simple:
937the @code{s} command attempts to match the pattern
938space against the supplied @var{regexp}; if the match is
939successful, then that portion of the pattern
940space which was matched is replaced with @var{replacement}.
941
942@cindex Backreferences, in regular expressions
943@cindex Parenthesized substrings
944The @var{replacement} can contain @code{\@var{n}} (@var{n} being
945a number from 1 to 9, inclusive) references, which refer to
946the portion of the match which is contained between the @var{n}th
947@code{\(} and its matching @code{\)}.
948Also, the @var{replacement} can contain unescaped @code{&}
949characters which reference the whole matched portion
950of the pattern space.
951@cindex @value{SSEDEXT}, case modifiers in @code{s} commands
952Finally, as a @value{SSED} extension, you can include a
953special sequence made of a backslash and one of the letters
954@code{L}, @code{l}, @code{U}, @code{u}, or @code{E}.
955The meaning is as follows:
956
957@table @code
958@item \L
959Turn the replacement
960to lowercase until a @code{\U} or @code{\E} is found,
961
962@item \l
963Turn the
964next character to lowercase,
965
966@item \U
967Turn the replacement to uppercase
968until a @code{\L} or @code{\E} is found,
969
970@item \u
971Turn the next character
972to uppercase,
973
974@item \E
975Stop case conversion started by @code{\L} or @code{\U}.
976@end table
977
978To include a literal @code{\}, @code{&}, or newline in the final
979replacement, be sure to precede the desired @code{\}, @code{&},
980or newline in the @var{replacement} with a @code{\}.
981
982@findex s command, option flags
983@cindex Substitution of text, options
984The @code{s} command can be followed by zero or more of the
985following @var{flags}:
986
987@table @code
988@item g
989@cindex Global substitution
990@cindex Replacing all text matching regexp in a line
991Apply the replacement to @emph{all} matches to the @var{regexp},
992not just the first.
993
994@item @var{number}
995@cindex Replacing only @var{n}th match of regexp in a line
996Only replace the @var{number}th match of the @var{regexp}.
997
998@cindex @acronym{GNU} extensions, @code{g} and @var{number} modifier interaction in @code{s} command
999@cindex Mixing @code{g} and @var{number} modifiers in the @code{s} command
1000Note: the @sc{posix} standard does not specify what should happen
1001when you mix the @code{g} and @var{number} modifiers,
1002and currently there is no widely agreed upon meaning
1003across @command{sed} implementations.
1004For @value{SSED}, the interaction is defined to be:
1005ignore matches before the @var{number}th,
1006and then match and replace all matches from
1007the @var{number}th on.
1008
1009@item p
1010@cindex Text, printing after substitution
1011If the substitution was made, then print the new pattern space.
1012
1013Note: when both the @code{p} and @code{e} options are specified,
1014the relative ordering of the two produces very different results.
1015In general, @code{ep} (evaluate then print) is what you want,
1016but operating the other way round can be useful for debugging.
1017For this reason, the current version of @value{SSED} interprets
1018specially the presence of @code{p} options both before and after
1019@code{e}, printing the pattern space before and after evaluation,
1020while in general flags for the @code{s} command show their
1021effect just once. This behavior, although documented, might
1022change in future versions.
1023
1024@item w @var{file-name}
1025@cindex Text, writing to a file after substitution
1026@cindex @value{SSEDEXT}, @file{/dev/stdout} file
1027@cindex @value{SSEDEXT}, @file{/dev/stderr} file
1028If the substitution was made, then write out the result to the named file.
1029As a @value{SSED} extension, two special values of @var{file-name} are
1030supported: @file{/dev/stderr}, which writes the result to the standard
1031error, and @file{/dev/stdout}, which writes to the standard
1032output.@footnote{This is equivalent to @code{p} unless the @option{-i}
1033option is being used.}
1034
1035@item e
1036@cindex Evaluate Bourne-shell commands, after substitution
1037@cindex Subprocesses
1038@cindex @value{SSEDEXT}, evaluating Bourne-shell commands
1039@cindex @value{SSEDEXT}, subprocesses
1040This command allows one to pipe input from a shell command
1041into pattern space. If a substitution was made, the command
1042that is found in pattern space is executed and pattern space
1043is replaced with its output. A trailing newline is suppressed;
1044results are undefined if the command to be executed contains
1045a @sc{nul} character. This is a @value{SSED} extension.
1046
1047@item I
1048@itemx i
1049@cindex @acronym{GNU} extensions, @code{I} modifier
1050@cindex Case-insensitive matching
1051@ifset PERL
1052@cindex Perl-style regular expressions, case-insensitive
1053@end ifset
1054The @code{I} modifier to regular-expression matching is a @acronym{GNU}
1055extension which makes @command{sed} match @var{regexp} in a
1056case-insensitive manner.
1057
1058@item M
1059@itemx m
1060@cindex @value{SSEDEXT}, @code{M} modifier
1061@ifset PERL
1062@cindex Perl-style regular expressions, multiline
1063@end ifset
1064The @code{M} modifier to regular-expression matching is a @value{SSED}
1065extension which causes @code{^} and @code{$} to match respectively
1066(in addition to the normal behavior) the empty string after a newline,
1067and the empty string before a newline. There are special character
1068sequences
1069@ifset PERL
1070(@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'}
1071in basic or extended regular expression modes)
1072@end ifset
1073@ifclear PERL
1074(@code{\`} and @code{\'})
1075@end ifclear
1076which always match the beginning or the end of the buffer.
1077@code{M} stands for @cite{multi-line}.
1078
1079@ifset PERL
1080@item S
1081@itemx s
1082@cindex @value{SSEDEXT}, @code{S} modifier
1083@cindex Perl-style regular expressions, single line
1084The @code{S} modifier to regular-expression matching is only valid
1085in Perl mode and specifies that the dot character (@code{.}) will
1086match the newline character too. @code{S} stands for @cite{single-line}.
1087@end ifset
1088
1089@ifset PERL
1090@item X
1091@itemx x
1092@cindex @value{SSEDEXT}, @code{X} modifier
1093@cindex Perl-style regular expressions, extended
1094The @code{X} modifier to regular-expression matching is also
1095valid in Perl mode only. If it is used, whitespace in the
1096pattern (other than in a character class) and
1097characters between a @kbd{#} outside a character class and the
1098next newline character are ignored. An escaping backslash
1099can be used to include a whitespace or @kbd{#} character as part
1100of the pattern.
1101@end ifset
1102@end table
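
The following shell sketch pulls together the replacement syntax and a few
of the flags described above. It assumes @value{SSED} (the @code{\U} case
conversion and the combined @code{2g} flag are extensions), and all input
strings and variable names are illustrative only.

```shell
# \1 and \2 refer to the first and second \( \) groups.
swapped=$(echo 'hello world' | sed 's/\(hello\) \(world\)/\2 \1/')

# An unescaped & stands for the whole matched text.
quoted=$(echo 'sed' | sed 's/.*/[&]/')

# GNU extension: \U converts to uppercase until \E.
upper=$(echo 'stream editor' | sed 's/\(stream\) editor/\U\1\E editor/')

# g replaces every match; a number replaces only that occurrence.
all=$(echo 'a-a-a-a' | sed 's/a/X/g')
second=$(echo 'a-a-a-a' | sed 's/a/X/2')

# GNU sed: a number combined with g replaces from that match onward.
from2=$(echo 'a-a-a-a' | sed 's/a/X/2g')

# Another delimiter avoids escaping slashes in path names.
path=$(echo '/usr/bin' | sed 's,/usr,/opt,')

printf '%s\n' "$swapped" "$quoted" "$upper" "$all" "$second" "$from2" "$path"
```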
1103
1104
1105@node Other Commands
1106@section Less Frequently-Used Commands
1107
1108Though perhaps less frequently used than those in the previous
1109section, some very small yet useful @command{sed} scripts can be built with
1110these commands.
1111
1112@table @code
1113@item y/@var{source-chars}/@var{dest-chars}/
1114(The @code{/} characters may be uniformly replaced by
1115any other single character within any given @code{y} command.)
1116
1117@findex y (transliterate) command
1118@cindex Transliteration
1119Transliterate any characters in the pattern space which match
1120any of the @var{source-chars} with the corresponding character
1121in @var{dest-chars}.
1122
1123Instances of the @code{/} (or whatever other character is used in its stead),
1124@code{\}, or newlines can appear in the @var{source-chars} or @var{dest-chars}
lists, provided that each instance is escaped by a @code{\}.
1126The @var{source-chars} and @var{dest-chars} lists @emph{must}
1127contain the same number of characters (after de-escaping).
1128
1129@item a\
1130@itemx @var{text}
1131@cindex @value{SSEDEXT}, two addresses supported by most commands
1132As a @acronym{GNU} extension, this command accepts two addresses.
1133
1134@findex a (append text lines) command
1135@cindex Appending text after a line
1136@cindex Text, appending
1137Queue the lines of text which follow this command
1138(each but the last ending with a @code{\},
1139which are removed from the output)
1140to be output at the end of the current cycle,
1141or when the next input line is read.
1142
1143Escape sequences in @var{text} are processed, so you should
1144use @code{\\} in @var{text} to print a single backslash.
1145
As a @acronym{GNU} extension, if anything other than a
whitespace-@code{\} sequence appears between the @code{a} and the newline,
then the text of this line,
starting at the first non-whitespace character after the @code{a},
is taken as the first line of the @var{text} block.
1150(This enables a simplification in scripting a one-line add.)
1151This extension also works with the @code{i} and @code{c} commands.
1152
1153@item i\
1154@itemx @var{text}
1155@cindex @value{SSEDEXT}, two addresses supported by most commands
1156As a @acronym{GNU} extension, this command accepts two addresses.
1157
1158@findex i (insert text lines) command
1159@cindex Inserting text before a line
1160@cindex Text, insertion
1161Immediately output the lines of text which follow this command
1162(each but the last ending with a @code{\},
1163which are removed from the output).
1164
1165@item c\
1166@itemx @var{text}
1167@findex c (change to text lines) command
1168@cindex Replacing selected lines with other text
1169Delete the lines matching the address or address-range,
1170and output the lines of text which follow this command
1171(each but the last ending with a @code{\},
1172which are removed from the output)
1173in place of the last line
1174(or in place of each line, if no addresses were specified).
1175A new cycle is started after this command is done,
1176since the pattern space will have been deleted.
1177
1178@item =
1179@cindex @value{SSEDEXT}, two addresses supported by most commands
1180As a @acronym{GNU} extension, this command accepts two addresses.
1181
1182@findex = (print line number) command
1183@cindex Printing line number
1184@cindex Line number, printing
1185Print out the current input line number (with a trailing newline).
1186
1187@item l @var{n}
1188@findex l (list unambiguously) command
1189@cindex List pattern space
1190@cindex Printing text unambiguously
1191@cindex Line length, setting
1192@cindex @value{SSEDEXT}, setting line length
1193Print the pattern space in an unambiguous form:
1194non-printable characters (and the @code{\} character)
1195are printed in C-style escaped form; long lines are split,
1196with a trailing @code{\} character to indicate the split;
1197the end of each line is marked with a @code{$}.
1198
1199@var{n} specifies the desired line-wrap length;
1200a length of 0 (zero) means to never wrap long lines. If omitted,
1201the default as specified on the command line is used. The @var{n}
1202parameter is a @value{SSED} extension.
1203
1204@item r @var{filename}
1205@cindex @value{SSEDEXT}, two addresses supported by most commands
1206As a @acronym{GNU} extension, this command accepts two addresses.
1207
1208@findex r (read file) command
1209@cindex Read text from a file
1210@cindex @value{SSEDEXT}, @file{/dev/stdin} file
1211Queue the contents of @var{filename} to be read and
1212inserted into the output stream at the end of the current cycle,
1213or when the next input line is read.
1214Note that if @var{filename} cannot be read, it is treated as
1215if it were an empty file, without any error indication.
1216
1217As a @value{SSED} extension, the special value @file{/dev/stdin}
1218is supported for the file name, which reads the contents of the
1219standard input.
1220
1221@item w @var{filename}
1222@findex w (write file) command
1223@cindex Write to a file
1224@cindex @value{SSEDEXT}, @file{/dev/stdout} file
1225@cindex @value{SSEDEXT}, @file{/dev/stderr} file
1226Write the pattern space to @var{filename}.
As a @value{SSED} extension, two special values of @var{filename} are
1228supported: @file{/dev/stderr}, which writes the result to the standard
1229error, and @file{/dev/stdout}, which writes to the standard
1230output.@footnote{This is equivalent to @code{p} unless the @option{-i}
1231option is being used.}
1232
1233The file will be created (or truncated) before the
1234first input line is read; all @code{w} commands
1235(including instances of @code{w} flag on successful @code{s} commands)
1236which refer to the same @var{filename} are output without
1237closing and reopening the file.
1238
1239@item D
1240@findex D (delete first line) command
1241@cindex Delete first line from pattern space
1242Delete text in the pattern space up to the first newline.
1243If any text is left, restart cycle with the resultant
1244pattern space (without reading a new line of input),
1245otherwise start a normal new cycle.
1246
1247@item N
1248@findex N (append Next line) command
1249@cindex Next input line, append to pattern space
1250@cindex Append next input line to pattern space
1251Add a newline to the pattern space,
1252then append the next line of input to the pattern space.
1253If there is no more input then @command{sed} exits without processing
1254any more commands.
1255
1256@item P
1257@findex P (print first line) command
1258@cindex Print first line from pattern space
1259Print out the portion of the pattern space up to the first newline.
1260
1261@item h
1262@findex h (hold) command
1263@cindex Copy pattern space into hold space
1264@cindex Replace hold space with copy of pattern space
1265@cindex Hold space, copying pattern space into
1266Replace the contents of the hold space with the contents of the pattern space.
1267
1268@item H
1269@findex H (append Hold) command
1270@cindex Append pattern space to hold space
1271@cindex Hold space, appending from pattern space
1272Append a newline to the contents of the hold space,
1273and then append the contents of the pattern space to that of the hold space.
1274
1275@item g
1276@findex g (get) command
1277@cindex Copy hold space into pattern space
1278@cindex Replace pattern space with copy of hold space
1279@cindex Hold space, copy into pattern space
1280Replace the contents of the pattern space with the contents of the hold space.
1281
1282@item G
1283@findex G (appending Get) command
1284@cindex Append hold space to pattern space
1285@cindex Hold space, appending to pattern space
1286Append a newline to the contents of the pattern space,
1287and then append the contents of the hold space to that of the pattern space.
1288
1289@item x
1290@findex x (eXchange) command
1291@cindex Exchange hold space with pattern space
1292@cindex Hold space, exchange with pattern space
1293Exchange the contents of the hold and pattern spaces.
1294
1295@end table
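
A few of these commands are sketched below from the shell. This is an
illustration only: it assumes @value{SSED} (the one-line forms of @code{i}
and @code{a} are @acronym{GNU} extensions), and the inputs and variable
names are invented.

```shell
# y: transliterate characters one for one.
trans=$(echo 'aabbcc' | sed 'y/abc/ABC/')

# GNU one-line forms: insert before and append after line 2.
framed=$(printf '1\n2\n3\n' | sed -e '2i before' -e '2a after')

# =: print the input line number ahead of each line.
numbered=$(printf 'x\ny\n' | sed '=')

# Hold space: reverse the order of the lines.
# 1!G appends the lines held so far after the current line,
# h saves the result, and $p prints it once the last line is reached.
reversed=$(printf '1\n2\n3\n' | sed -n '1!G;h;$p')

printf '%s\n' "$trans" "$framed" "$numbered" "$reversed"
```

The inserted text comes out before line 2 and the appended text after it,
because @code{i} writes immediately while @code{a} only queues its text
for the end of the cycle.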
1296
1297
1298@node Programming Commands
1299@section Commands for @command{sed} gurus
1300
1301In most cases, use of these commands indicates that you are
1302probably better off programming in something like @command{awk}
1303or Perl. But occasionally one is committed to sticking
1304with @command{sed}, and these commands can enable one to write
1305quite convoluted scripts.
1306
1307@cindex Flow of control in scripts
1308@table @code
1309@item : @var{label}
1310[No addresses allowed.]
1311
1312@findex : (label) command
1313@cindex Labels, in scripts
1314Specify the location of @var{label} for branch commands.
1315In all other respects, a no-op.
1316
1317@item b @var{label}
1318@findex b (branch) command
1319@cindex Branch to a label, unconditionally
1320@cindex Goto, in scripts
1321Unconditionally branch to @var{label}.
1322The @var{label} may be omitted, in which case the next cycle is started.
1323
1324@item t @var{label}
1325@findex t (test and branch if successful) command
1326@cindex Branch to a label, if @code{s///} succeeded
1327@cindex Conditional branch
1328Branch to @var{label} only if there has been a successful @code{s}ubstitution
1329since the last input line was read or conditional branch was taken.
1330The @var{label} may be omitted, in which case the next cycle is started.
1331
1332@end table
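
Two small loops illustrate labels and branches. The sketch below assumes
@value{SSED}; the label name @code{a} and the variable names are arbitrary
choices for the example.

```shell
# t: keep inserting a comma before the last group of three digits
# until the substitution no longer succeeds.
commas=$(echo '1234567' |
  sed -e ':a' -e 's/\(.*[0-9]\)\([0-9]\{3\}\)/\1,\2/' -e 'ta')

# b: append input lines with N until the last one, then join them.
# $!ba branches back to the label on every line except the last.
joined=$(printf 'a\nb\nc\n' |
  sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ /g')

printf '%s\n%s\n' "$commas" "$joined"
```

Each label and branch is given in its own @option{-e} argument so that the
label name cannot run into the following command.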
1333
1334@node Extended Commands
1335@section Commands Specific to @value{SSED}
1336
1337These commands are specific to @value{SSED}, so you
1338must use them with care and only when you are sure that
loss of portability is acceptable. They allow you to check
1340for @value{SSED} extensions or to do tasks that are required
1341quite often, yet are unsupported by standard @command{sed}s.
1342
1343@table @code
1344@item e [@var{command}]
1345@findex e (evaluate) command
1346@cindex Evaluate Bourne-shell commands
1347@cindex Subprocesses
1348@cindex @value{SSEDEXT}, evaluating Bourne-shell commands
1349@cindex @value{SSEDEXT}, subprocesses
1350This command allows one to pipe input from a shell command
1351into pattern space. Without parameters, the @code{e} command
1352executes the command that is found in pattern space and
1353replaces the pattern space with the output; a trailing newline
1354is suppressed.
1355
1356If a parameter is specified, instead, the @code{e} command
1357interprets it as a command and sends its output to the output stream
1358(like @code{r} does). The command can run across multiple
lines, all but the last ending with a backslash.
1360
1361In both cases, the results are undefined if the command to be
1362executed contains a @sc{nul} character.
1363
1364@item L @var{n}
1365@findex L (fLow paragraphs) command
1366@cindex Reformat pattern space
1367@cindex Reformatting paragraphs
1368@cindex @value{SSEDEXT}, reformatting paragraphs
1369@cindex @value{SSEDEXT}, @code{L} command
1370This @value{SSED} extension fills and joins lines in pattern space
1371to produce output lines of (at most) @var{n} characters, like
1372@code{fmt} does; if @var{n} is omitted, the default as specified
on the command line is used. This command is considered a failed
experiment and, unless there is enough demand (which seems unlikely),
will be removed in future versions.
1376
1377@ignore
1378Blank lines, spaces between words, and indentation are
1379preserved in the output; successive input lines with different
1380indentation are not joined; tabs are expanded to 8 columns.
1381
1382If the pattern space contains multiple lines, they are joined, but
1383since the pattern space usually contains a single line, the behavior
1384of a simple @code{L;d} script is the same as @samp{fmt -s} (i.e.,
1385it does not join short lines to form longer ones).
1386
1387@var{n} specifies the desired line-wrap length; if omitted,
1388the default as specified on the command line is used.
1389@end ignore
1390
1391@item Q [@var{exit-code}]
1392This command only accepts a single address.
1393
1394@findex Q (silent Quit) command
1395@cindex @value{SSEDEXT}, quitting silently
1396@cindex @value{SSEDEXT}, returning an exit code
1397@cindex Quitting
1398This command is the same as @code{q}, but will not print the
1399contents of pattern space. Like @code{q}, it provides the
1400ability to return an exit code to the caller.
1401
1402This command can be useful because the only alternative ways
1403to accomplish this apparently trivial function are to use
1404the @option{-n} option (which can unnecessarily complicate
1405your script) or resorting to the following snippet, which
1406wastes time by reading the whole file without any visible effect:
1407
@example
:eat
$d       @i{Quit silently on the last line}
N        @i{Read another line, silently}
g        @i{Overwrite pattern space each time to save memory}
b eat
@end example
1415
1416@item R @var{filename}
1417@findex R (read line) command
1418@cindex Read text from a file
1419@cindex @value{SSEDEXT}, reading a file a line at a time
1420@cindex @value{SSEDEXT}, @code{R} command
1421@cindex @value{SSEDEXT}, @file{/dev/stdin} file
1422Queue a line of @var{filename} to be read and
1423inserted into the output stream at the end of the current cycle,
1424or when the next input line is read.
1425Note that if @var{filename} cannot be read, or if its end is
1426reached, no line is appended, without any error indication.
1427
1428As with the @code{r} command, the special value @file{/dev/stdin}
1429is supported for the file name, which reads a line from the
1430standard input.
1431
1432@item T @var{label}
1433@findex T (test and branch if failed) command
1434@cindex @value{SSEDEXT}, branch if @code{s///} failed
1435@cindex Branch to a label, if @code{s///} failed
1436@cindex Conditional branch
1437Branch to @var{label} only if there have been no successful
1438@code{s}ubstitutions since the last input line was read or
1439conditional branch was taken. The @var{label} may be omitted,
1440in which case the next cycle is started.
1441
1442@item v @var{version}
1443@findex v (version) command
1444@cindex @value{SSEDEXT}, checking for their presence
1445@cindex Requiring @value{SSED}
1446This command does nothing, but makes @command{sed} fail if
1447@value{SSED} extensions are not supported, simply because other
1448versions of @command{sed} do not implement it. In addition, you
1449can specify the version of @command{sed} that your script
1450requires, such as @code{4.0.5}. The default is @code{4.0}
1451because that is the first version that implemented this command.
1452
1453This command enables all @value{SSEDEXT} even if
1454@env{POSIXLY_CORRECT} is set in the environment.
1455
1456@item W @var{filename}
1457@findex W (write first line) command
1458@cindex Write first line to a file
1459@cindex @value{SSEDEXT}, writing first line to a file
1460Write to the given filename the portion of the pattern space up to
1461the first newline. Everything said under the @code{w} command about
1462file handling holds here too.
1463@end table
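
The difference between @code{q} and @code{Q}, and the effect of @code{T},
can be seen in a short shell sketch. It assumes @value{SSED} and
@command{seq}; the exit code 5 and the variable names are arbitrary.

```shell
# q prints the current line before quitting; Q quits without printing it.
with_q=$(seq 1 10 | sed '3q')

# The exit code given to Q becomes the exit status of sed.
status=0
with_Q=$(seq 1 10 | sed '3Q5') || status=$?

# T branches to the end of the script when no substitution has succeeded,
# so with -n only the lines that were changed reach the p command.
changed=$(printf 'foo\nbaz\nfood\n' | sed -n 's/foo/bar/;T;p')

printf '%s\n' "$with_q" "$with_Q" "$status" "$changed"
```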
1464
1465@node Escapes
1466@section @acronym{GNU} Extensions for Escapes in Regular Expressions
1467
1468@cindex @acronym{GNU} extensions, special escapes
1469Until this chapter, we have only encountered escapes of the form
1470@samp{\^}, which tell @command{sed} not to interpret the circumflex
1471as a special character, but rather to take it literally. For
1472example, @samp{\*} matches a single asterisk rather than zero
1473or more backslashes.
1474
1475@cindex @code{POSIXLY_CORRECT} behavior, escapes
1476This chapter introduces another kind of escape@footnote{All
1477the escapes introduced here are @acronym{GNU}
1478extensions, with the exception of @code{\n}. In basic regular
1479expression mode, setting @code{POSIXLY_CORRECT} disables them inside
1480bracket expressions.}---that
1481is, escapes that are applied to a character or sequence of characters
1482that ordinarily are taken literally, and that @command{sed} replaces
1483with a special character. This provides a way
1484of encoding non-printable characters in patterns in a visible manner.
There is no restriction on the appearance of non-printing characters
in a @command{sed} script, but when a script is being prepared in the
shell or by text editing, it is usually easier to use one of
the following escape sequences than the binary character it
represents.
1490
1491The list of these escapes is:
1492
1493@table @code
1494@item \a
Produces or matches a @sc{bel} character, that is, an ``alert'' (@sc{ascii} 7).
1496
1497@item \f
1498Produces or matches a form feed (@sc{ascii} 12).
1499
1500@item \n
1501Produces or matches a newline (@sc{ascii} 10).
1502
1503@item \r
1504Produces or matches a carriage return (@sc{ascii} 13).
1505
1506@item \t
1507Produces or matches a horizontal tab (@sc{ascii} 9).
1508
1509@item \v
Produces or matches a so-called ``vertical tab'' (@sc{ascii} 11).
1511
1512@item \c@var{x}
1513Produces or matches @kbd{@sc{Control}-@var{x}}, where @var{x} is
1514any character. The precise effect of @samp{\c@var{x}} is as follows:
1515if @var{x} is a lower case letter, it is converted to upper case.
1516Then bit 6 of the character (hex 40) is inverted. Thus @samp{\cz} becomes
1517hex 1A, but @samp{\c@{} becomes hex 3B, while @samp{\c;} becomes hex 7B.
1518
1519@item \d@var{xxx}
1520Produces or matches a character whose decimal @sc{ascii} value is @var{xxx}.
1521
1522@item \o@var{xxx}
1523@ifset PERL
1524@item \@var{xxx}
1525@end ifset
1526Produces or matches a character whose octal @sc{ascii} value is @var{xxx}.
1527@ifset PERL
1528The syntax without the @code{o} is active in Perl mode, while the one
1529with the @code{o} is active in the normal or extended @sc{posix} regular
1530expression modes.
1531@end ifset
1532
1533@item \x@var{xx}
1534Produces or matches a character whose hexadecimal @sc{ascii} value is @var{xx}.
1535@end table
1536
1537@samp{\b} (backspace) was omitted because of the conflict with
1538the existing ``word boundary'' meaning.
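
The following shell sketch shows some of these escapes in use. It assumes
@value{SSED}, since other implementations may treat the sequences
literally; the inputs and variable names are illustrative.

```shell
# \t in the regexp matches a tab character supplied by printf.
spaced=$(printf 'a\tb\n' | sed 's/\t/ /')

# \r matches a carriage return, e.g. to strip DOS line endings.
unix=$(printf 'dos\r\n' | sed 's/\r$//')

# Hexadecimal 41, octal 101 and decimal 065 all denote the letter A.
hex=$(echo 'x' | sed 's/x/\x41/')
oct=$(echo 'x' | sed 's/x/\o101/')
dec=$(echo 'x' | sed 's/x/\d065/')

printf '%s\n' "$spaced" "$unix" "$hex" "$oct" "$dec"
```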
1539
1540Other escapes match a particular character class and are valid only in
1541regular expressions:
1542
1543@table @code
1544@item \w
1545Matches any ``word'' character. A ``word'' character is any
1546letter or digit or the underscore character.
1547
1548@item \W
1549Matches any ``non-word'' character.
1550
1551@item \b
Matches a word boundary; that is, it matches if the character
1553to the left is a ``word'' character and the character to the
1554right is a ``non-word'' character, or vice-versa.
1555
1556@item \B
Matches everywhere but on a word boundary; that is, it matches
1558if the character to the left and the character to the right
1559are either both ``word'' characters or both ``non-word''
1560characters.
1561
1562@item \`
1563Matches only at the start of pattern space. This is different
1564from @code{^} in multi-line mode.
1565
1566@item \'
1567Matches only at the end of pattern space. This is different
1568from @code{$} in multi-line mode.
1569
1570@ifset PERL
1571@item \G
1572Match only at the start of pattern space or, when doing a global
1573substitution using the @code{s///g} command and option, at
1574the end-of-match position of the prior match. For example,
@samp{s/\Ga/Z/g} will change an initial run of @code{a}s to
a run of @code{Z}s.
1577@end ifset
1578@end table
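
A short shell sketch of the class and anchor escapes follows. It assumes
@value{SSED}, including the @code{M} flag of the @code{s} command; the
sample text and variable names are invented.

```shell
# \b anchors at word boundaries, so only the standalone word changes.
word=$(echo 'cat concat cats' | sed 's/\bcat\b/dog/g')

# \w\+ matches a run of letters, digits and underscores.
wrapped=$(echo 'foo_1 bar' | sed 's/\w\+/[&]/g')

# In multi-line mode ^ matches after every embedded newline ...
every=$(printf 'a\nb\nc\n' | sed 'N;N;s/^/X/gM')

# ... while \` still matches only at the start of the pattern space.
start=$(printf 'a\nb\nc\n' | sed 'N;N;s/\`/X/gM')

printf '%s\n' "$word" "$wrapped" "$every" "$start"
```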
1579
1580@node Examples
1581@chapter Some Sample Scripts
1582
1583Here are some @command{sed} scripts to guide you in the art of mastering
1584@command{sed}.
1585
1586@menu
1587Some exotic examples:
1588* Centering lines::
1589* Increment a number::
1590* Rename files to lower case::
1591* Print bash environment::
1592* Reverse chars of lines::
1593
1594Emulating standard utilities:
1595* tac:: Reverse lines of files
1596* cat -n:: Numbering lines
1597* cat -b:: Numbering non-blank lines
1598* wc -c:: Counting chars
1599* wc -w:: Counting words
1600* wc -l:: Counting lines
1601* head:: Printing the first lines
1602* tail:: Printing the last lines
1603* uniq:: Make duplicate lines unique
1604* uniq -d:: Print duplicated lines of input
1605* uniq -u:: Remove all duplicated lines
1606* cat -s:: Squeezing blank lines
1607@end menu
1608
1609@node Centering lines
1610@section Centering Lines
1611
This script centers all lines of a file on an 80-column width.
1613To change that width, the number in @code{\@{@dots{}\@}} must be
1614replaced, and the number of added spaces also must be changed.
1615
1616Note how the buffer commands are used to separate parts in
1617the regular expressions to be matched---this is a common
1618technique.
1619
1620@c start-------------------------------------------
@example
#!/usr/bin/sed -f

# Put 80 spaces in the buffer
1 @{
  x
  s/^$/          /
  s/^.*$/&&&&&&&&/
  x
@}

# del leading and trailing spaces
y/@kbd{tab}/ /
s/^ *//
s/ *$//

# add a newline and 80 spaces to end of line
G

# keep first 81 chars (80 + a newline)
s/^\(.\@{81\@}\).*$/\1/

# \2 matches half of the spaces, which are moved to the beginning
s/^\(.*\)\n\(.*\)\2/\2\1/
@end example
1646@c end---------------------------------------------
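
As a quick check of the halving trick, the following sketch runs the same
commands against a width of 10 columns instead of 80, so that the result
is easy to count by eye. The narrower width, the sample input and the
variable name are assumptions made for this illustration only.

```shell
# Center "ab" in a 10-column field (illustrative width, not the 80 used above).
# Line 1 primes the hold space with 10 spaces; G appends a newline plus the
# spaces; the first s keeps 11 characters (10 + the newline); the second s
# moves half of the remaining spaces in front of the text.
centered=$(printf 'ab\n' | sed -e '1{x;s/^$/          /;x;}' \
    -e 's/^ *//; s/ *$//' \
    -e 'G' \
    -e 's/^\(.\{11\}\).*$/\1/' \
    -e 's/^\(.*\)\n\(.*\)\2/\2\1/')
printf '[%s]\n' "$centered"
# prints "[    ab]" (four spaces before the text)
```

Eight spaces survive the truncation to 11 characters, and the
back-reference splits them into two runs of four, one of which is kept.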
1647
1648@node Increment a number
1649@section Increment a Number
1650
1651This script is one of a few that demonstrate how to do arithmetic
1652in @command{sed}. This is indeed possible,@footnote{@command{sed} guru Greg
1653Ubben wrote an implementation of the @command{dc} @sc{rpn} calculator!
1654It is distributed together with sed.} but must be done manually.
1655
To increment a number, you add 1 to the last digit, replacing
it with the following digit. There is one exception: when the digit
is a nine, the preceding digits must also be incremented until you
reach one that is not a nine.

This solution by Bruno Haible is very clever because
it uses a single buffer; if you don't have this limitation, the
algorithm used in @ref{cat -n, Numbering lines}, is faster.
It works by replacing trailing nines with underscores, then
using multiple @code{s} commands to increment the last digit,
and finally replacing the underscores with zeros.
1667
1668@c start-------------------------------------------
@example
#!/usr/bin/sed -f

/[^0-9]/ d

# replace all trailing 9s by _ (any other character except digits
# could be used)
:d
s/9\(_*\)$/_\1/
td

# incr last digit only.  The first line adds a most-significant
# digit of 1 if we have to add a digit.
#
# The @code{tn} commands are not necessary, but make the thing
# faster

s/^\(_*\)$/1\1/; tn
s/8\(_*\)$/9\1/; tn
s/7\(_*\)$/8\1/; tn
s/6\(_*\)$/7\1/; tn
s/5\(_*\)$/6\1/; tn
s/4\(_*\)$/5\1/; tn
s/3\(_*\)$/4\1/; tn
s/2\(_*\)$/3\1/; tn
s/1\(_*\)$/2\1/; tn
s/0\(_*\)$/1\1/; tn

:n
y/_/0/
@end example
1700@c end---------------------------------------------
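
The following shell sketch feeds a few sample values through the same
commands and records, in comments, how the pattern space changes for two
of them. The variable names and the sample inputs are made up for the
example.

```shell
# The increment commands from the script above, held in a shell variable
# (the name "incr" is only for this example).
incr='
/[^0-9]/ d
:d
s/9\(_*\)$/_\1/
td
s/^\(_*\)$/1\1/; tn
s/8\(_*\)$/9\1/; tn
s/7\(_*\)$/8\1/; tn
s/6\(_*\)$/7\1/; tn
s/5\(_*\)$/6\1/; tn
s/4\(_*\)$/5\1/; tn
s/3\(_*\)$/4\1/; tn
s/2\(_*\)$/3\1/; tn
s/1\(_*\)$/2\1/; tn
s/0\(_*\)$/1\1/; tn
:n
y/_/0/
'
# Trace for 199:  199 -> 19_ -> 1__ -> 2__ -> 200
# Trace for 999:  999 -> 99_ -> 9__ -> ___ -> 1___ -> 1000
result=$(printf '199\n999\n42\n' | sed "$incr")
printf '%s\n' "$result"
# prints 200, 1000 and 43, one per line
```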
1701
1702@node Rename files to lower case
1703@section Rename Files to Lower Case
1704
This is a pretty strange use of @command{sed}. We transform text
into shell commands, then just feed them to the shell.
1707Don't worry, even worse hacks are done when using @command{sed}; I have
1708seen a script converting the output of @command{date} into a @command{bc}
1709program!
1710
The main body of this is the @command{sed} script, which remaps the name
from upper to lower case (or vice versa) and even checks
whether the remapped name is the same as the original name.
1714Note how the script is parameterized using shell
1715variables and proper quoting.
1716
1717@c start-------------------------------------------
1718@example
1719#! /bin/sh
1720# rename files to lower/upper case...
1721#
1722# usage:
1723# move-to-lower *
1724# move-to-upper *
1725# or
1726# move-to-lower -R .
1727# move-to-upper -R .
1728#
1729
1730help()
1731@{
1732 cat << eof
Usage: $0 [-n] [-R] [-h] files...
1734
1735-n do nothing, only see what would be done
1736-R recursive (use find)
1737-h this message
1738files files to remap to lower case
1739
1740Examples:
1741 $0 -n * (see if everything is ok, then...)
1742 $0 *
1743
1744 $0 -R .
1745
1746eof
1747@}
1748
1749apply_cmd='sh'
1750finder='echo "$@@" | tr " " "\n"'
1751files_only=
1752
1753while :
1754do
1755 case "$1" in
1756 -n) apply_cmd='cat' ;;
1757 -R) finder='find "$@@" -type f';;
1758 -h) help ; exit 1 ;;
1759 *) break ;;
1760 esac
1761 shift
1762done
1763
1764if [ -z "$1" ]; then
    echo Usage: $0 [-h] [-n] [-R] files...
1766 exit 1
1767fi
1768
1769LOWER='abcdefghijklmnopqrstuvwxyz'
1770UPPER='ABCDEFGHIJKLMNOPQRSTUVWXYZ'
1771
1772case `basename $0` in
1773 *upper*) TO=$UPPER; FROM=$LOWER ;;
1774 *) FROM=$UPPER; TO=$LOWER ;;
1775esac
1776
1777eval $finder | sed -n '
1778
1779# remove all trailing slashes
1780s/\/*$//
1781
1782# add ./ if there is no path, only a filename
1783/\//! s/^/.\//
1784
1785# save path+filename
1786h
1787
1788# remove path
1789s/.*\///
1790
1791# do conversion only on filename
1792y/'$FROM'/'$TO'/
1793
1794# now line contains original path+file, while
1795# hold space contains the new filename
1796x
1797
1798# add converted file name to line, which now contains
1799# path/file-name\nconverted-file-name
1800G
1801
# check if converted file name is equal to original file name,
# if it is, do not print anything
1804/^.*\/\(.*\)\n\1/b
1805
1806# now, transform path/fromfile\n, into
1807# mv path/fromfile path/tofile and print it
1808s/^\(.*\/\)\(.*\)\n\(.*\)$/mv "\1\2" "\1\3"/p
1809
1810' | $apply_cmd
1811@end example
1812@c end---------------------------------------------
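
To see what the @command{sed} stage alone produces, the sketch below
feeds it three made-up names, with the two alphabets written out
literally in place of the shell variables. The file names are
hypothetical and nothing is renamed; only the generated commands are
printed, much as the @option{-n} option of the script would show them.

```shell
# Dry run of the sed stage only; the input names are invented for the example.
cmds=$(printf 'Foo.TXT\nsrc/README\nlower\n' | sed -n '
s/\/*$//
/\//! s/^/.\//
h
s/.*\///
y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/
x
G
/^.*\/\(.*\)\n\1/b
s/^\(.*\/\)\(.*\)\n\(.*\)$/mv "\1\2" "\1\3"/p
')
printf '%s\n' "$cmds"
# prints:
#   mv "./Foo.TXT" "./foo.txt"
#   mv "src/README" "src/readme"
```

No command is generated for @file{lower}, because its converted name is
identical to the original one.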
1813
1814@node Print bash environment
1815@section Print @command{bash} Environment
1816
1817This script strips the definition of the shell functions
1818from the output of the @command{set} Bourne-shell command.
1819
1820@c start-------------------------------------------
1821@example
1822#!/bin/sh
1823
1824set | sed -n '
1825:x
1826
1827@ifinfo
1828# if no occurrence of "=()" print and load next line
1829@end ifinfo
1830@ifnotinfo
1831# if no occurrence of @samp{=()} print and load next line
1832@end ifnotinfo
1833/=()/! @{ p; b; @}
1834/ () $/! @{ p; b; @}
1835
1836# possible start of functions section
1837# save the line in case this is a var like FOO="() "
1838h
1839
1840# if the next line has a brace, we quit because
1841# nothing comes after functions
1842n
1843/^@{/ q
1844
1845# print the old line
1846x; p
1847
1848# work on the new line now
1849x; bx
1850'
1851@end example
1852@c end---------------------------------------------
1853
1854@node Reverse chars of lines
1855@section Reverse Characters of Lines
1856
1857This script can be used to reverse the position of characters
1858in lines. The technique moves two characters at a time, hence
1859it is faster than more intuitive implementations.
1860
1861Note the @code{tx} command before the definition of the label.
1862This is often needed to reset the flag that is tested by
1863the @code{t} command.
1864
1865Imaginative readers will find uses for this script. An example
1866is reversing the output of @command{banner}.@footnote{This requires
1867another script to pad the output of banner; for example
1868
1869@example
1870#! /bin/sh
1871
1872banner -w $1 $2 $3 $4 |
1873 sed -e :a -e '/^.\@{0,'$1'\@}$/ @{ s/$/ /; ba; @}' |
1874 ~/sedscripts/reverseline.sed
1875@end example
1876}
1877
1878@c start-------------------------------------------
1879@example
1880#!/usr/bin/sed -f
1881
1882/../! b
1883
1884# Reverse a line. Begin embedding the line between two newlines
1885s/^.*$/\
1886&\
1887/
1888
# Swap the first and last characters between the markers. The regexp
# matches until there are zero or one characters between the markers
1891tx
1892:x
1893s/\(\n.\)\(.*\)\(.\n\)/\3\2\1/
1894tx
1895
1896# Remove the newline markers
1897s/\n//g
1898@end example
1899@c end---------------------------------------------
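
A short trace may help to follow the loop. The sketch below runs the same
commands on two sample words and shows, in a comment, how the newline
markers (written @samp{|} in the trace) close in on each other. The
variable names and the inputs are only illustrative.

```shell
# Same commands as the script above; "rev" is a name chosen for this example.
rev='
/../! b
s/^.*$/\
&\
/
tx
:x
s/\(\n.\)\(.*\)\(.\n\)/\3\2\1/
tx
s/\n//g
'
# Trace for "hello", with | standing for the newline markers:
#   |hello|  ->  o|ell|h  ->  ol|l|eh  ->  olleh
reversed=$(printf 'hello\nabcd\n' | sed "$rev")
printf '%s\n' "$reversed"
# prints olleh and dcba, one per line
```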
1900
1901@node tac
1902@section Reverse Lines of Files
1903
1904This one begins a series of totally useless (yet interesting)
1905scripts emulating various Unix commands. This, in particular,
1906is a @command{tac} workalike.
1907
1908Note that on implementations other than @acronym{GNU} @command{sed}
1909@ifset PERL
1910and @value{SSED}
1911@end ifset
1912this script might easily overflow internal buffers.
1913
1914@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

# reverse all lines of input, i.e. the first line becomes the last, ...

# from the second line, the buffer (which contains all previous lines)
# is *appended* to current line, so, the order will be reversed
1! G

# on the last line we're done -- print everything
$ p

# store everything on the buffer again
h
@end example
1930@c end---------------------------------------------
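
For a three-line input, the hold space evolves as noted in the comment of
the following sketch, which uses the same three commands in one-liner
form. The sample input and the variable name are assumptions made for the
example.

```shell
# Hold space after each line:  "1"  ->  "2\n1"  ->  "3\n2\n1"
# (the last value is printed by $p before being stored again)
reversed=$(printf '1\n2\n3\n' | sed -n -e '1!G' -e '$p' -e 'h')
printf '%s\n' "$reversed"
# prints 3, 2 and 1, one per line
```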
1931
1932@node cat -n
1933@section Numbering Lines
1934
1935This script replaces @samp{cat -n}; in fact it formats its output
1936exactly like @acronym{GNU} @command{cat} does.
1937
Of course this is completely useless, for two reasons: first,
because somebody else did it in C; second, because the following
Bourne-shell script could be used for the same purpose and would
be much faster:
1942
1943@c start-------------------------------------------
@example
#! /bin/sh
sed -e "=" "$@@" | sed -e '
  s/^/      /
  N
  s/^ *\(......\)\n/\1 /
'
@end example
1952@c end---------------------------------------------
1953
1954It uses @command{sed} to print the line number, then groups lines two
1955by two using @code{N}. Of course, this script does not teach as much as
1956the one presented below.
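
The pairing step is easier to see without the formatting. In this reduced
sketch the first @command{sed} emits each line number on a line of its
own, and the second joins every number with the line that follows it. The
single-space separator is a simplification and does not reproduce the
column layout of @samp{cat -n}; the sample input is made up.

```shell
# Stage 1 alone prints four lines:  1, foo, 2, bar
# Stage 2 reads them in pairs with N and replaces the embedded newline.
numbered=$(printf 'foo\nbar\n' | sed -e '=' | sed -e 'N' -e 's/\n/ /')
printf '%s\n' "$numbered"
# prints "1 foo" and "2 bar"
```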
1957
1958The algorithm used for incrementing uses both buffers, so the line
1959is printed as soon as possible and then discarded. The number
1960is split so that changing digits go in a buffer and unchanged ones go
1961in the other; the changed digits are modified in a single step
1962(using a @code{y} command). The line number for the next line
1963is then composed and stored in the hold space, to be used in the
1964next iteration.
1965
1966@c start-------------------------------------------
1967@example
1968#!/usr/bin/sed -nf
1969
1970# Prime the pump on the first line
1971x
1972/^$/ s/^.*$/1/
1973
1974# Add the correct line number before the pattern
1975G
1976h
1977
1978# Format it and print it
s/^/      /
1980s/^ *\(......\)\n/\1 /p
1981
1982# Get the line number from hold space; add a zero
1983# if we're going to add a digit on the next line
1984g
1985s/\n.*$//
1986/^9*$/ s/^/0/
1987
1988# separate changing/unchanged digits with an x
1989s/.9*$/x&/
1990
1991# keep changing digits in hold space
1992h
1993s/^.*x//
1994y/0123456789/1234567890/
1995x
1996
1997# keep unchanged digits in pattern space
1998s/x.*$//
1999
2000# compose the new number, remove the newline implicitly added by G
2001G
2002s/\n//
2003h
2004@end example
2005@c end---------------------------------------------
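
The increment step can be tried on its own. The sketch below applies only
the second half of the script to bare numbers, one per line; the variable
names and the inputs are made up for the example.

```shell
# Increment step only; "next_number" is a name chosen for this example.
next_number='
/^9*$/ s/^/0/
s/.9*$/x&/
h
s/^.*x//
y/0123456789/1234567890/
x
s/x.*$//
G
s/\n//
'
# Trace for 129:
#   mark the changing digits     1x29
#   changing part, after y       30    (kept in hold space after the swap)
#   unchanged part               1
#   after G and s/\n//           130
incremented=$(printf '129\n99\n7\n' | sed "$next_number")
printf '%s\n' "$incremented"
# prints 130, 100 and 8, one per line
```

Since every changing digit is either the last digit or a trailing nine, a
single @code{y} command is enough to increment all of them at once.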
2006
2007@node cat -b
2008@section Numbering Non-blank Lines
2009
2010Emulating @samp{cat -b} is almost the same as @samp{cat -n}---we only
2011have to select which lines are to be numbered and which are not.
2012
2013The part that is common to this script and the previous one is
2014not commented to show how important it is to comment @command{sed}
2015scripts properly...
2016
2017@c start-------------------------------------------
2018@example
2019#!/usr/bin/sed -nf
2020
2021/^$/ @{
2022 p
2023 b
2024@}
2025
2026# Same as cat -n from now
2027x
2028/^$/ s/^.*$/1/
2029G
2030h
s/^/      /
2032s/^ *\(......\)\n/\1 /p
2033x
2034s/\n.*$//
2035/^9*$/ s/^/0/
2036s/.9*$/x&/
2037h
2038s/^.*x//
2039y/0123456789/1234567890/
2040x
2041s/x.*$//
2042G
2043s/\n//
2044h
2045@end example
2046@c end---------------------------------------------
2047
2048@node wc -c
2049@section Counting Characters
2050
2051This script shows another way to do arithmetic with @command{sed}.
2052In this case we have to add possibly large numbers, so implementing
2053this by successive increments would not be feasible (and possibly
2054even more complicated to contrive than this script).
2055
2056The approach is to map numbers to letters, kind of an abacus
2057implemented with @command{sed}. @samp{a}s are units, @samp{b}s are
2058tens and so on: we simply add the number of characters
2059on the current line as units, and then propagate the carry
2060to tens, hundreds, and so on.
2061
2062As usual, running totals are kept in hold space.
2063
2064On the last line, we convert the abacus form back to decimal.
2065For the sake of variety, this is done with a loop rather than
2066with some 80 @code{s} commands@footnote{Some implementations
2067have a limit of 199 commands per script}: first we
2068convert units, removing @samp{a}s from the number; then we
2069rotate letters so that tens become @samp{a}s, and so on
2070until no more letters remain.
2071
2072@c start-------------------------------------------
2073@example
2074#!/usr/bin/sed -nf
2075
2076# Add n+1 a's to hold space (+1 is for the newline)
2077s/./a/g
2078H
2079x
2080s/\n/a/
2081
2082# Do the carry. The t's and b's are not necessary,
2083# but they do speed up the thing
2084t a
2085: a; s/aaaaaaaaaa/b/g; t b; b done
2086: b; s/bbbbbbbbbb/c/g; t c; b done
2087: c; s/cccccccccc/d/g; t d; b done
2088: d; s/dddddddddd/e/g; t e; b done
2089: e; s/eeeeeeeeee/f/g; t f; b done
2090: f; s/ffffffffff/g/g; t g; b done
2091: g; s/gggggggggg/h/g; t h; b done
2092: h; s/hhhhhhhhhh//g
2093
2094: done
2095$! @{
2096 h
2097 b
2098@}
2099
2100# On the last line, convert back to decimal
2101
2102: loop
2103/a/! s/[b-h]*/&0/
2104s/aaaaaaaaa/9/
2105s/aaaaaaaa/8/
2106s/aaaaaaa/7/
2107s/aaaaaa/6/
2108s/aaaaa/5/
2109s/aaaa/4/
2110s/aaa/3/
2111s/aa/2/
2112s/a/1/
2113
2114: next
2115y/bcdefgh/abcdefg/
2116/[a-h]/ b loop
2117p
2118@end example
2119@c end---------------------------------------------
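
The conversion between the two notations can be tried in isolation. The
following sketch starts from a line made only of @samp{a}s, performs the
first two carries, and then runs the same decimal conversion loop. It is
deliberately cut down and only handles counts below 1000; the variable
names are invented for the example.

```shell
# Two carries plus the conversion loop; "abacus" is a name for this example.
abacus='
s/aaaaaaaaaa/b/g
s/bbbbbbbbbb/c/g
:loop
/a/! s/[b-h]*/&0/
s/aaaaaaaaa/9/
s/aaaaaaaa/8/
s/aaaaaaa/7/
s/aaaaaa/6/
s/aaaaa/5/
s/aaaa/4/
s/aaa/3/
s/aa/2/
s/a/1/
y/bcdefgh/abcdefg/
/[a-h]/ b loop
'
# Build a line of 23 a's, then convert it.
# Trace:  23 a's  ->  bbaaa  ->  bb3  ->  aa3  ->  23
units=$(printf '%023d\n' 0 | tr 0 a)
count=$(printf '%s\n' "$units" | sed "$abacus")
printf '%s\n' "$count"
# prints 23
```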
2120
2121@node wc -w
2122@section Counting Words
2123
2124This script is almost the same as the previous one, once each
2125of the words on the line is converted to a single @samp{a}
2126(in the previous script each letter was changed to an @samp{a}).
2127
It is interesting that real @command{wc} programs have optimized
loops for @samp{wc -c}, so they are much slower at counting
words than at counting characters. This script's bottleneck,
by contrast, is arithmetic, and hence the word-counting one
is faster (it has to manage smaller numbers).
2133
2134Again, the common parts are not commented to show the importance
2135of commenting @command{sed} scripts.
2136
2137@c start-------------------------------------------
2138@example
2139#!/usr/bin/sed -nf
2140
2141# Convert words to a's
2142s/[ @kbd{tab}][ @kbd{tab}]*/ /g
2143s/^/ /
2144s/ [^ ][^ ]*/a /g
2145s/ //g
2146
2147# Append them to hold space
2148H
2149x
2150s/\n//
2151
2152# From here on it is the same as in wc -c.
2153/aaaaaaaaaa/! bx; s/aaaaaaaaaa/b/g
2154/bbbbbbbbbb/! bx; s/bbbbbbbbbb/c/g
2155/cccccccccc/! bx; s/cccccccccc/d/g
2156/dddddddddd/! bx; s/dddddddddd/e/g
2157/eeeeeeeeee/! bx; s/eeeeeeeeee/f/g
2158/ffffffffff/! bx; s/ffffffffff/g/g
2159/gggggggggg/! bx; s/gggggggggg/h/g
2160s/hhhhhhhhhh//g
2161:x
2162$! @{ h; b; @}
2163:y
2164/a/! s/[b-h]*/&0/
2165s/aaaaaaaaa/9/
2166s/aaaaaaaa/8/
2167s/aaaaaaa/7/
2168s/aaaaaa/6/
2169s/aaaaa/5/
2170s/aaaa/4/
2171s/aaa/3/
2172s/aa/2/
2173s/a/1/
2174y/bcdefgh/abcdefg/
2175/[a-h]/ by
2176p
2177@end example
2178@c end---------------------------------------------
2179
2180@node wc -l
2181@section Counting Lines
2182
2183No strange things are done now, because @command{sed} gives us
2184@samp{wc -l} functionality for free!!! Look:
2185
2186@c start-------------------------------------------
2187@example
2188#!/usr/bin/sed -nf
2189$=
2190@end example
2191@c end---------------------------------------------
2192
2193@node head
2194@section Printing the First Lines
2195
2196This script is probably the simplest useful @command{sed} script.
2197It displays the first 10 lines of input; the number of displayed
2198lines is right before the @code{q} command.
2199
2200@c start-------------------------------------------
2201@example
2202#!/usr/bin/sed -f
10q
2204@end example
2205@c end---------------------------------------------
2206
2207@node tail
2208@section Printing the Last Lines
2209
2210Printing the last @var{n} lines rather than the first is more complex
2211but indeed possible. @var{n} is encoded in the second line, before
2212the bang character.
2213
2214This script is similar to the @command{tac} script in that it keeps the
2215final output in the hold space and prints it at the end:
2216
2217@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

1! @{; H; g; @}
1,10 !s/[^\n]*\n//
$p
h
@end example
2226@c end---------------------------------------------
2227
Mainly, the script keeps a window of 10 lines and slides it
2229by adding a line and deleting the oldest (the substitution command
2230on the second line works like a @code{D} command but does not
2231restart the loop).
2232
2233The ``sliding window'' technique is a very powerful way to write
2234efficient and complex @command{sed} scripts, because commands like
2235@code{P} would require a lot of work if implemented manually.
2236
2237To introduce the technique, which is fully demonstrated in the
2238rest of this chapter and is based on the @code{N}, @code{P}
2239and @code{D} commands, here is an implementation of @command{tail}
2240using a simple ``sliding window.''
2241
2242This looks complicated but in fact the working is the same as
2243the last script: after we have kicked in the appropriate number
2244of lines, however, we stop using the hold space to keep inter-line
2245state, and instead use @code{N} and @code{D} to slide pattern
2246space by one line:
2247
2248@c start-------------------------------------------
@example
#!/usr/bin/sed -f

1h
2,10 @{; H; g; @}
$q
1,9d
N
D
@end example
2259@c end---------------------------------------------
2260
Note how the first, second and fourth lines are inactive after
2262the first ten lines of input. After that, all the script does
2263is: exiting on the last line of input, appending the next input
2264line to pattern space, and removing the first line.
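
Both variants can be compared on a short input. In the sketch below the
window is shrunk from 10 lines to 3, so the addresses become
@samp{1,3}, @samp{2,3} and @samp{1,2}; this is an adaptation made for the
example, as are the variable names. The bracket expression containing
@samp{\n} relies on @acronym{GNU} @command{sed}, as in the first script.

```shell
# Hold-space variant, keeping a 3-line window.
tail_hold=$(printf '1\n2\n3\n4\n5\n' | sed -n '
1! {
  H
  g
}
1,3 !s/[^\n]*\n//
$p
h
')

# Sliding-window variant with N and D, same 3-line window.
tail_slide=$(printf '1\n2\n3\n4\n5\n' | sed '
1h
2,3 {
  H
  g
}
$q
1,2d
N
D
')
printf '%s\n--\n%s\n' "$tail_hold" "$tail_slide"
# both variants print 3, 4 and 5
```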
2265
2266@node uniq
2267@section Make Duplicate Lines Unique
2268
2269This is an example of the art of using the @code{N}, @code{P}
2270and @code{D} commands, probably the most difficult to master.
2271
2272@c start-------------------------------------------
2273@example
2274#!/usr/bin/sed -f
2275h
2276
2277:b
2278# On the last line, print and exit
2279$b
2280N
2281/^\(.*\)\n\1$/ @{
    # The two lines are identical.  Undo the effect of
    # the N command.
2284 g
2285 bb
2286@}
2287
2288# If the @code{N} command had added the last line, print and exit
2289$b
2290
2291# The lines are different; print the first and go
2292# back working on the second.
2293P
2294D
2295@end example
2296@c end---------------------------------------------
2297
As you can see, we maintain a 2-line window using @code{P} and @code{D}.
2299This technique is often used in advanced @command{sed} scripts.
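
The sketch below runs the same commands on a small made-up input and
notes, in comments, what the pattern space holds at each decision point.
The variable names are chosen for the example only.

```shell
# Same commands as the script above; "uniq_script" is an illustrative name.
uniq_script='
h
:b
$b
N
/^\(.*\)\n\1$/ {
  g
  bb
}
$b
P
D
'
# Pattern space after each N:
#   "a\na"  equal: restore "a" from hold space and read on
#   "a\nb"  differ: P prints a, D leaves "b"
#   "b\nc"  differ: P prints b, D leaves "c"
#   "c\nc"  equal, and this is the last line: "c" is printed at the end
deduped=$(printf 'a\na\nb\nc\nc\n' | sed "$uniq_script")
printf '%s\n' "$deduped"
# prints a, b and c, one per line
```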
2300
2301@node uniq -d
2302@section Print Duplicated Lines of Input
2303
2304This script prints only duplicated lines, like @samp{uniq -d}.
2305
2306@c start-------------------------------------------
2307@example
2308#!/usr/bin/sed -nf
2309
2310$b
2311N
2312/^\(.*\)\n\1$/ @{
2313 # Print the first of the duplicated lines
2314 s/.*\n//
2315 p
2316
2317 # Loop until we get a different line
2318 :b
2319 $b
2320 N
2321 /^\(.*\)\n\1$/ @{
2322 s/.*\n//
2323 bb
2324 @}
2325@}
2326
2327# The last line cannot be followed by duplicates
2328$b
2329
2330# Found a different one. Leave it alone in the pattern space
2331# and go back to the top, hunting its duplicates
2332D
2333@end example
2334@c end---------------------------------------------
2335
2336@node uniq -u
2337@section Remove All Duplicated Lines
2338
2339This script prints only unique lines, like @samp{uniq -u}.
2340
2341@c start-------------------------------------------
2342@example
2343#!/usr/bin/sed -f
2344
2345# Search for a duplicate line --- until that, print what you find.
2346$b
2347N
2348/^\(.*\)\n\1$/ ! @{
2349 P
2350 D
2351@}
2352
2353:c
2354# Got two equal lines in pattern space. At the
2355# end of the file we simply exit
2356$d
2357
2358# Else, we keep reading lines with @code{N} until we
2359# find a different one
2360s/.*\n//
2361N
2362/^\(.*\)\n\1$/ @{
2363 bc
2364@}
2365
2366# Remove the last instance of the duplicate line
2367# and go back to the top
2368D
2369@end example
2370@c end---------------------------------------------
2371
2372@node cat -s
2373@section Squeezing Blank Lines
2374
2375As a final example, here are three scripts, of increasing complexity
2376and speed, that implement the same function as @samp{cat -s}, that is
2377squeezing blank lines.
2378
2379The first leaves a blank line at the beginning and end if there are
2380some already.
2381
2382@c start-------------------------------------------
2383@example
2384#!/usr/bin/sed -f
2385
2386# on empty lines, join with next
2387# Note there is a star in the regexp
2388:x
2389/^\n*$/ @{
2390N
2391bx
2392@}
2393
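# now, squeeze the leading '\n's into one (the regexp must match at
# least one newline, or a newline would be added before every line);
# this can be also done by:
# s/^\(\n\)*/\1/
s/^\n\n*/\
/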
2398@end example
2399@c end---------------------------------------------
2400
2401This one is a bit more complex and removes all empty lines
2402at the beginning. It does leave a single blank line at end
2403if one was there.
2404
2405@c start-------------------------------------------
2406@example
2407#!/usr/bin/sed -f
2408
2409# delete all leading empty lines
1,/^./@{
2411/./!d
2412@}
2413
2414# on an empty line we remove it and all the following
2415# empty lines, but one
2416:x
2417/./!@{
2418N
2419s/^\n$//
2420tx
2421@}
2422@end example
2423@c end---------------------------------------------
2424
This removes leading and trailing blank lines. It is also the
fastest. Note that loops are completely done with @code{n} and
@code{b}, without relying on @command{sed} to restart the
script automatically at the end of a line.
2429
2430@c start-------------------------------------------
2431@example
2432#!/usr/bin/sed -nf
2433
2434# delete all (leading) blanks
2435/./!d
2436
2437# get here: so there is a non empty
2438:x
2439# print it
2440p
2441# get next
2442n
2443# got chars? print it again, etc...
2444/./bx
2445
2446# no, don't have chars: got an empty line
2447:z
2448# get next, if last line we finish here so no trailing
2449# empty lines are written
2450n
2451# also empty? then ignore it, and get next... this will
2452# remove ALL empty lines
2453/./!bz
2454
2455# all empty lines were deleted/ignored, but we have a non empty. As
2456# what we want to do is to squeeze, insert a blank line artificially
2457i\
2458
2459bx
2460@end example
2461@c end---------------------------------------------
2462
2463@node Limitations
2464@chapter @value{SSED}'s Limitations and Non-limitations
2465
2466@cindex @acronym{GNU} extensions, unlimited line length
2467@cindex Portability, line length limitations
2468For those who want to write portable @command{sed} scripts,
2469be aware that some implementations have been known to
2470limit line lengths (for the pattern and hold spaces)
2471to be no more than 4000 bytes.
2472The @sc{posix} standard specifies that conforming @command{sed}
2473implementations shall support at least 8192 byte line lengths.
2474@value{SSED} has no built-in limit on line length;
2475as long as it can @code{malloc()} more (virtual) memory,
2476you can feed or construct lines as long as you like.
2477
2478However, recursion is used to handle subpatterns and indefinite
2479repetition. This means that the available stack space may limit
2480the size of the buffer that can be processed by certain patterns.
2481
2482@ifset PERL
2483There are some size limitations in the regular expression
2484matcher but it is hoped that they will never in practice
2485be relevant. The maximum length of a compiled pattern
2486is 65539 (sic) bytes. All values in repeating quantifiers
2487must be less than 65536. The maximum nesting depth of
2488all parenthesized subpatterns, including capturing and
2489non-capturing subpatterns@footnote{The
2490distinction is meaningful when referring to Perl-style
2491regular expressions.}, assertions, and other types of
2492subpattern, is 200.
2493
2494Also, @value{SSED} recognizes the @sc{posix} syntax
2495@code{[.@var{ch}.]} and @code{[=@var{ch}=]}
2496where @var{ch} is a ``collating element'', but these
2497are not supported, and an error is given if they are
2498encountered.
2499
2500Here are a few distinctions between the real Perl-style
2501regular expressions and those that @option{-R} recognizes.
2502
2503@enumerate
2504@item
Lookahead assertions do not allow repeat quantifiers after them.
Perl permits them, but they do not mean what you
might think. For example, @samp{(?!a)@{3@}} does not assert that the
next three characters are not @samp{a}. It just asserts three times that the
next character is not @samp{a} --- a waste of time and nothing else.
2510
2511@item
Capturing subpatterns that occur inside negative lookahead
assertions are counted, but their entries are counted
as empty in the second half of an @code{s} command.
2515Perl sets its numerical variables from any such patterns
2516that are matched before the assertion fails to match
2517something (thereby succeeding), but only if the negative
2518lookahead assertion contains just one branch.
2519
2520@item
2521The following Perl escape sequences are not supported:
2522@samp{\l}, @samp{\u}, @samp{\L}, @samp{\U}, @samp{\E},
2523@samp{\Q}. In fact these are implemented by Perl's general
2524string-handling and are not part of its pattern matching engine.
2525
2526@item
2527The Perl @samp{\G} assertion is not supported as it is not
2528relevant to single pattern matches.
2529
2530@item
2531Fairly obviously, @value{SSED} does not support the @samp{(?@{code@})}
2532and @samp{(?p@{code@})} constructions. However, there is some experimental
2533support for recursive patterns using the non-Perl item @samp{(?R)}.
2534
2535@item
There are at the time of writing some oddities in Perl
5.005_02 concerned with the settings of captured strings
2538when part of a pattern is repeated. For example, matching
2539@samp{aba} against the pattern @samp{/^(a(b)?)+$/} sets
2540@samp{$2}@footnote{@samp{$2} would be @samp{\2} in @value{SSED}.}
2541to the value @samp{b}, but matching @samp{aabbaa}
2542against @samp{/^(aa(bb)?)+$/} leaves @samp{$2}
2543unset. However, if the pattern is changed to
2544@samp{/^(aa(b(b))?)+$/} then @samp{$2} (and @samp{$3}) are set.
2545In Perl 5.004 @samp{$2} is set in both cases, and that is also
2546true of @value{SSED}.
2547
2548@item
Another as yet unresolved discrepancy is that in Perl
5.005_02 the pattern @samp{/^(a)?(?(1)a|b)+$/} matches
the string @samp{a}, whereas in @value{SSED} it does not.
However, in both Perl and @value{SSED} @samp{/^(a)?a/} matched
against @samp{a} leaves @samp{$1} unset.
2554@end enumerate
2555@end ifset
2556
2557@node Other Resources
2558@chapter Other Resources for Learning About @command{sed}
2559
2560@cindex Additional reading about @command{sed}
2561In addition to several books that have been written about @command{sed}
2562(either specifically or as chapters in books which discuss
2563shell programming), one can find out more about @command{sed}
2564(including suggestions of a few books) from the FAQ
2565for the @code{sed-users} mailing list, available from any of:
2566@display
2567 @uref{http://www.student.northpark.edu/pemente/sed/sedfaq.html}
2568 @uref{http://sed.sf.net/grabbag/tutorials/sedfaq.html}
2569@end display
2570
2571Also of interest are
2572@uref{http://www.student.northpark.edu/pemente/sed/index.htm}
2573and @uref{http://sed.sf.net/grabbag},
2574which include @command{sed} tutorials and other @command{sed}-related goodies.
2575
The @code{sed-users} mailing list itself is maintained by Sven Guckes.
2577To subscribe, visit @uref{http://groups.yahoo.com} and search
2578for the @code{sed-users} mailing list.
2579
2580@node Reporting Bugs
2581@chapter Reporting Bugs
2582
2583@cindex Bugs, reporting
2584Email bug reports to @email{bonzini@@gnu.org}.
2585Be sure to include the word ``sed'' somewhere in the @code{Subject:} field.
2586Also, please include the output of @samp{sed --version} in the body
2587of your report if at all possible.
2588
2589Please do not send a bug report like this:
2590
2591@example
2592@i{while building frobme-1.3.4}
2593$ configure
2594@error{} sed: file sedscr line 1: Unknown option to 's'
2595@end example
2596
2597If @value{SSED} doesn't configure your favorite package, take a
2598few extra minutes to identify the specific problem and make a stand-alone
2599test case. Unlike other programs such as C compilers, making such test
2600cases for @command{sed} is quite simple.
2601
2602A stand-alone test case includes all the data necessary to perform the
2603test, and the specific invocation of @command{sed} that causes the problem.
2604The smaller a stand-alone test case is, the better. A test case should
2605not involve something as far removed from @command{sed} as ``try to configure
2606frobme-1.3.4''. Yes, that is in principle enough information to look
2607for the bug, but that is not a very practical prospect.
2608
2609Here are a few commonly reported bugs that are not bugs.
2610
2611@table @asis
2612@item @code{N} command on the last line
2613@cindex Portability, @code{N} command on the last line
2614@cindex Non-bugs, @code{N} command on the last line
2615
Most versions of @command{sed} exit without printing anything when
the @code{N} command is issued on the last line of a file.
@value{SSED} prints pattern space before exiting unless of course
the @option{-n} command switch has been specified. This choice is
by design.
2621
2622For example, the behavior of
2623@example
2624sed N foo bar
2625@end example
2626@noindent
2627would depend on whether foo has an even or an odd number of
2628lines@footnote{which is the actual ``bug'' that prompted the
2629change in behavior}. Or, when writing a script to read the
2630next few lines following a pattern match, traditional
implementations of @command{sed} would force you to write
2632something like
2633@example
2634/foo/@{ $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N @}
2635@end example
2636@noindent
2637instead of just
2638@example
2639/foo/@{ N;N;N;N;N;N;N;N;N; @}
2640@end example
2641
2642@cindex @code{POSIXLY_CORRECT} behavior, @code{N} command
2643In any case, the simplest workaround is to use @code{$d;N} in
2644scripts that rely on the traditional behavior, or to set
2645the @code{POSIXLY_CORRECT} variable to a non-empty value.
2646
2647@item Regex syntax clashes (problems with backslashes)
2648@cindex @acronym{GNU} extensions, to basic regular expressions
2649@cindex Non-bugs, regex syntax clashes
2650@command{sed} uses the @sc{posix} basic regular expression syntax. According to
2651the standard, the meaning of some escape sequences is undefined in
2652this syntax; notable in the case of @command{sed} are @code{\|},
2653@code{\+}, @code{\?}, @code{\`}, @code{\'}, @code{\<},
2654@code{\>}, @code{\b}, @code{\B}, @code{\w}, and @code{\W}.
2655
2656As in all @acronym{GNU} programs that use @sc{posix} basic regular
2657expressions, @command{sed} interprets these escape sequences as special
2658characters. So, @code{x\+} matches one or more occurrences of @samp{x}.
2659@code{abc\|def} matches either @samp{abc} or @samp{def}.
2660
2661This syntax may cause problems when running scripts written for other
2662@command{sed}s. Some @command{sed} programs have been written with the
2663assumption that @code{\|} and @code{\+} match the literal characters
2664@code{|} and @code{+}. Such scripts must be modified by removing the
2665spurious backslashes if they are to be used with modern implementations
2666of @command{sed}, like
2667@ifset PERL
2668@value{SSED} or
2669@end ifset
2670@acronym{GNU} @command{sed}.
2671
On the other hand, some scripts use @code{s|abc\|def||g} to remove occurrences
of @emph{either} @code{abc} or @code{def}. While this worked until
@command{sed} 4.0.x, newer versions interpret this as removing the
string @code{abc|def}. This is again undefined behavior according to
@sc{posix}, and this interpretation is arguably more robust: older
@command{sed}s, for example, required that the regex matcher parse
@code{\/} as @code{/} in the common case of escaping a slash, which is
again undefined behavior; the new behavior avoids this, and this is good
because the regex matcher is only partially under our control.
2681
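One way to keep a script independent of this change is to choose a
delimiter that does not occur in the regular expression, so that the
ambiguity never arises. A minimal sketch, assuming @acronym{GNU}
@command{sed} and a made-up input line:

@example
# With / as the delimiter, \| is alternation and a bare | is literal.
either=$(printf 'abc|def,abc,def\n' | sed 's/abc\|def//g')
literal=$(printf 'abc|def,abc,def\n' | sed 's/abc|def//g')
echo "$either"    # |,,
echo "$literal"   # ,abc,def
@end example
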
2682@cindex @acronym{GNU} extensions, special escapes
2683In addition, this version of @command{sed} supports several escape characters
2684(some of which are multi-character) to insert non-printable characters
2685in scripts (@code{\a}, @code{\c}, @code{\d}, @code{\o}, @code{\r},
2686@code{\t}, @code{\v}, @code{\x}). These can cause similar problems
2687with scripts written for other @command{sed}s.
2688
2689@item @option{-i} clobbers read-only files
2690@cindex In-place editing
2691@cindex @value{SSEDEXT}, in-place editing
2692@cindex Non-bugs, in-place editing
2693
2694In short, @samp{sed -i} will let you delete the contents of
2695a read-only file, and in general the @option{-i} option
2696(@pxref{Invoking sed, , Invocation}) lets you clobber
2697protected files. This is not a bug, but rather a consequence
2698of how the Unix filesystem works.
2699
2700The permissions on a file say what can happen to the data
2701in that file, while the permissions on a directory say what can
2702happen to the list of files in that directory. @samp{sed -i}
2703will not ever open for writing a file that is already on disk.
2704Rather, it will work on a temporary file that is finally renamed
2705to the original name: if you rename or delete files, you're actually
2706modifying the contents of the directory, so the operation depends on
2707the permissions of the directory, not of the file. For this same
2708reason, @command{sed} does not let you use @option{-i} on a writeable file
2709in a read-only directory (but unbelievably nobody reports that as a
2710bug@dots{}).
2711
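The effect can be observed in a throwaway directory. This sketch
assumes @acronym{GNU} @command{sed} and the @command{mktemp} utility;
the file name @file{f} is arbitrary.

@example
dir=$(mktemp -d)
printf 'hello\n' > "$dir/f"
chmod 444 "$dir/f"               # the file itself is read-only
sed -i 's/hello/bye/' "$dir/f"   # works: the directory is writable
result=$(cat "$dir/f")
echo "$result"                   # bye
rm -rf "$dir"
@end example
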
2712@item @code{0a} does not work (gives an error)
2713There is no line 0. 0 is a special address that is only used to treat
2714addresses like @code{0,/@var{RE}/} as active when the script starts: if
2715you write @code{1,/abc/d} and the first line includes the word @samp{abc},
2716then that match would be ignored because address ranges must span at least
2717two lines (barring the end of the file); but what you probably wanted is
2718to delete every line up to the first one including @samp{abc}, and this
2719is obtained with @code{0,/abc/d}.
2720
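The difference between the two addresses shows up when the first input
line already matches. A small sketch, assuming @acronym{GNU}
@command{sed} and made-up input:

@example
# The first line already contains abc.
one=$(printf 'abc\nx\nabc\ny\n' | sed '1,/abc/d')
zero=$(printf 'abc\nx\nabc\ny\n' | sed '0,/abc/d')
echo "$one"    # prints only: y
echo "$zero"   # prints three lines: x, abc, y
@end example
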
2721@ifclear PERL
2722@item @code{[a-z]} is case insensitive
2723You are encountering problems with locales. POSIX mandates that @code{[a-z]}
2724uses the current locale's collation order -- in C parlance, that means using
2725@code{strcoll(3)} instead of @code{strcmp(3)}. Some locales have a
2726case-insensitive collation order, others don't: one of those that have
2727problems is Estonian.
2728
2729Another problem is that @code{[a-z]} tries to use collation symbols.
2730This only happens if you are on the @acronym{GNU} system, using
2731@acronym{GNU} libc's regular expression matcher instead of compiling the
2732one supplied with @acronym{GNU} sed. In a Danish locale, for example,
2733the regular expression @code{^[a-z]$} matches the string @samp{aa},
2734because this is a single collating symbol that comes after @samp{a}
2735and before @samp{b}; @samp{ll} behaves similarly in Spanish
2736locales, or @samp{ij} in Dutch locales.
2737
2738To work around these problems, which may cause bugs in shell scripts, set
2739the @env{LC_COLLATE} and @env{LC_CTYPE} environment variables to @samp{C}.
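
For a single command, the locale can also be overridden on the command
line. This sketch uses @env{LC_ALL}, which takes precedence over both
variables named above; that shortcut is an assumption of the example
rather than something this manual prescribes.

@example
out=$(printf 'aBc\n' | LC_ALL=C sed 's/[a-z]/_/g')
echo "$out"   # _B_
@end example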
2740@end ifclear
2741@end table
2742
2743
2744@node Extended regexps
2745@appendix Extended regular expressions
2746@cindex Extended regular expressions, syntax
2747
2748The only difference between basic and extended regular expressions is in
2749the behavior of a few characters: @samp{?}, @samp{+}, parentheses,
2750and braces (@samp{@{@}}). While basic regular expressions require
2751these to be escaped if you want them to behave as special characters,
2752when using extended regular expressions you must escape them if
2753you want them @emph{to match a literal character}.
2754
2755@noindent
2756Examples:
2757@table @code
2758@item abc?
2759becomes @samp{abc\?} when using extended regular expressions. It matches
2760the literal string @samp{abc?}.
2761
2762@item c\+
2763becomes @samp{c+} when using extended regular expressions. It matches
2764one or more @samp{c}s.
2765
2766@item a\@{3,\@}
2767becomes @samp{a@{3,@}} when using extended regular expressions. It matches
2768three or more @samp{a}s.
2769
2770@item \(abc\)\@{2,3\@}
2771becomes @samp{(abc)@{2,3@}} when using extended regular expressions. It
2772matches either @samp{abcabc} or @samp{abcabcabc}.
2773
2774@item \(abc*\)\1
2775becomes @samp{(abc*)\1} when using extended regular expressions.
2776Backreferences must still be escaped when using extended regular
2777expressions.
2778@end table
2779
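The correspondence can be tried from the shell. The sketch below
assumes @acronym{GNU} @command{sed}, whose @option{-r} option selects
extended regular expressions; the inputs are made up.

@example
bre=$(printf 'accc\n' | sed 's/c\+/X/')        # basic syntax
ere=$(printf 'accc\n' | sed -r 's/c+/X/')      # extended syntax
lit=$(printf 'abc?\n' | sed -r 's/abc\?/X/')   # literal question mark
echo "$bre $ere $lit"   # aX aX X
@end example
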
2780@ifset PERL
2781@node Perl regexps
2782@appendix Perl-style regular expressions
2783@cindex Perl-style regular expressions, syntax
2784
2785@emph{This part is taken from the @file{pcre.txt} file distributed together
2786with the free @sc{pcre} regular expression matcher; it was written by Philip Hazel.}
2787
2788Perl introduced several extensions to regular expressions, some
2789of them incompatible with the syntax of regular expressions
2790accepted by Emacs and other @acronym{GNU} tools (whose matcher was
2791based on the Emacs matcher). @value{SSED} implements
2792both kinds of extensions.
2793
2794@iftex
2795Summarizing, we have:
2796
2797@itemize @bullet
2798@item
2799A backslash can introduce several special sequences
2800
2801@item
2802The circumflex, dollar sign, and period characters behave specially
2803with regard to new lines
2804
2805@item
2806Strange uses of square brackets are parsed differently
2807
2808@item
2809You can toggle modifiers in the middle of a regular expression
2810
2811@item
2812You can specify that a subpattern does not count when numbering backreferences
2813
2814@item
2815@cindex Greedy regular expression matching
2816You can specify greedy or non-greedy matching
2817
2818@item
2819You can have more than ten back references
2820
2821@item
2822You can do complex look aheads and look behinds (in the spirit of
2823@code{\b}, but with subpatterns).
2824
2825@item
You can often improve performance by keeping @command{sed} from wasting
time on backtracking
2828
2829@item
2830You can have if/then/else branches
2831
2832@item
2833You can do recursive matches, for example to look for unbalanced parentheses
2834
2835@item
2836You can have comments and non-significant whitespace, because things can
2837get complex...
2838@end itemize
2839
2840Most of these extensions are introduced by the special @code{(?}
2841sequence, which gives special meanings to parenthesized groups.
2842@end iftex
2843@menu
Other extensions can be roughly subdivided into two categories.
On one hand, Perl introduces several more escaped sequences
(that is, sequences introduced by a backslash). On the other
hand, it specifies that if a question mark follows an open
parenthesis it should give a special meaning to the parenthesized
group.
2850
2851* Backslash:: Introduces special sequences
2852* Circumflex/dollar sign/period:: Behave specially with regard to new lines
2853* Square brackets:: Are a bit different in strange cases
2854* Options setting:: Toggle modifiers in the middle of a regexp
2855* Non-capturing subpatterns:: Are not counted when backreferencing
2856* Repetition:: Allows for non-greedy matching
2857* Backreferences:: Allows for more than 10 back references
2858* Assertions:: Allows for complex look ahead matches
2859* Non-backtracking subpatterns:: Often gives more performance
2860* Conditional subpatterns:: Allows if/then/else branches
2861* Recursive patterns:: For example to match parentheses
2862* Comments:: Because things can get complex...
2863@end menu
2864
2865@node Backslash
2866@appendixsec Backslash
2867@cindex Perl-style regular expressions, escaped sequences
2868
There are a few differences in the handling of backslashed
sequences in Perl mode.
2871
2872First of all, there are no @code{\o} and @code{\d} sequences.
2873@sc{ascii} values for characters can be specified in octal
2874with a @code{\@var{xxx}} sequence, where @var{xxx} is a
2875sequence of up to three octal digits. If the first digit
2876is a zero, the treatment of the sequence is straightforward;
2877just note that if the character that follows the escaped digit
2878is itself an octal digit, you have to supply three octal digits
2879for @var{xxx}. For example @code{\07} is a @sc{bel} character
2880rather than a @sc{nul} and a literal @code{7} (this sequence is
2881instead represented by @code{\0007}).
2882
2883@cindex Perl-style regular expressions, backreferences
2884The handling of a backslash followed by a digit other than 0
2885is complicated. Outside a character class, @command{sed} reads it
2886and any following digits as a decimal number. If the number
2887is less than 10, or if there have been at least that many
2888previous capturing left parentheses in the expression, the
2889entire sequence is taken as a back reference. A description
2890of how this works is given later, following the discussion
2891of parenthesized subpatterns.
2892
2893Inside a character class, or if the decimal number is
2894greater than 9 and there have not been that many capturing
2895subpatterns, @command{sed} re-reads up to three octal digits following
2896the backslash, and generates a single byte from the
2897least significant 8 bits of the value. Any subsequent digits
2898stand for themselves. For example:
2899
2900@example
2901 \040 @i{is another way of writing a space}
2902 \40 @i{is the same, provided there are fewer than 40}
2903 @i{previous capturing subpatterns}
2904 \7 @i{is always a back reference}
2905 \011 @i{is always a tab}
2906 \11 @i{might be a back reference, or another way of}
2907 @i{writing a tab}
2908 \0113 @i{is a tab followed by the character @samp{3}}
2909 \113 @i{is the character with octal code 113 (since there}
2910 @i{can be no more than 99 back references)}
2911 \377 @i{is a byte consisting entirely of 1 bits (@sc{ascii} 255)}
2912 \81 @i{is either a back reference, or a binary zero}
2913 @i{followed by the two characters @samp{81}}
2914@end example
2915
Note that octal values of 100 or greater must not be introduced
by a leading zero, because no more than three octal
digits are ever read.
2919
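The leading-zero forms can be tried with any Perl-compatible matcher.
The sketch below uses the @command{perl} program itself as a stand-in,
since @value{SSED} may not be installed; that substitution is an
assumption of this example, and the behavior of @value{SSED} is not
shown here.

@example
space=$(printf 'a b\n' | perl -lpe 's/\040/_/')   # \040 is a space
tab=$(printf 'a\tb\n' | perl -lpe 's/\011/_/')    # \011 is a tab
echo "$space $tab"   # a_b a_b
@end example
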
2920All the sequences that define a single byte value can be
2921used both inside and outside character classes. In addition,
2922inside a character class, the sequence @code{\b} is interpreted
2923as the backspace character (hex 08). Outside a character
2924class it has a different meaning (see below).
2925
In addition, there are four additional escapes specifying
generic character classes (like @code{\w} and @code{\W} do):

@cindex Perl-style regular expressions, character classes
@table @samp
@item \d
Matches any decimal digit

@item \D
Matches any character that is not a decimal digit

@item \s
Matches any white space character

@item \S
Matches any character that is not a white space character
@end table
2937
2938In Perl mode, these character type sequences can appear both inside and
2939outside character classes. Instead, in @sc{posix} mode these sequences
2940(as well as @code{\w} and @code{\W}) are treated as two literal characters
2941(a backslash and a letter) inside square brackets.
2942
2943Escaped sequences specifying assertions are also different in
2944Perl mode. An assertion specifies a condition that has to be met
2945at a particular point in a match, without consuming any
2946characters from the subject string. The use of subpatterns
2947for more complicated assertions is described below. The
2948backslashed assertions are
2949
2950@cindex Perl-style regular expressions, assertions
2951@table @samp
2952@item \b
2953Asserts that the point is at a word boundary.
2954A word boundary is a position in the subject string where
2955the current character and the previous character do not both
2956match @code{\w} or @code{\W} (i.e. one matches @code{\w} and
2957the other matches @code{\W}), or the start or end of the string
2958if the first or last character matches @code{\w}, respectively.
2959
2960@item \B
2961Asserts that the point is not at a word boundary.
2962
2963@item \A
2964Asserts the matcher is at the start of pattern space (independent
2965of multiline mode).
2966
2967@item \Z
2968Asserts the matcher is at the end of pattern space,
2969or at a newline before the end of pattern space (independent of
2970multiline mode)
2971
2972@item \z
2973Asserts the matcher is at the end of pattern space (independent
2974of multiline mode)
2975@end table
2976
2977These assertions may not appear in character classes (but
2978note that @code{\b} has a different meaning, namely the
2979backspace character, inside a character class).
Note that Perl mode does not directly support assertions
for the beginning and the end of a word; the @acronym{GNU} extensions
2982@code{\<} and @code{\>} achieve this purpose in @sc{posix} mode
2983instead.
2984
2985The @code{\A}, @code{\Z}, and @code{\z} assertions differ
2986from the traditional circumflex and dollar sign (described below)
2987in that they only ever match at the very start and end of the
2988subject string, whatever options are set; in particular @code{\A}
2989and @code{\z} are the same as the @acronym{GNU} extensions
2990@code{\`} and @code{\'} that are active in @sc{posix} mode.
2991
2992@node Circumflex/dollar sign/period
2993@appendixsec Circumflex, dollar sign, period
2994@cindex Perl-style regular expressions, newlines
2995
2996Outside a character class, in the default matching mode, the
2997circumflex character is an assertion which is true only if
2998the current matching point is at the start of the subject
2999string. Inside a character class, the circumflex has an entirely
3000different meaning (see below).
3001
3002The circumflex need not be the first character of the pattern if
3003a number of alternatives are involved, but it should be the
3004first thing in each alternative in which it appears if the
pattern is ever to match that branch. If all possible alternatives
start with a circumflex, that is, if the pattern is
constrained to match only at the start of the subject, it is
said to be an @dfn{anchored} pattern. (There are also other
constructs that can cause a pattern to be anchored.)
3010
3011A dollar sign is an assertion which is true only if the
3012current matching point is at the end of the subject string,
3013or immediately before a newline character that is the last
3014character in the string (by default). A dollar sign need not be the
3015last character of the pattern if a number of alternatives
3016are involved, but it should be the last item in any branch
3017in which it appears. A dollar sign has no special meaning in a
3018character class.
3019
3020@cindex Perl-style regular expressions, multiline
3021The meanings of the circumflex and dollar sign characters are
3022changed if the @code{M} modifier option is used. When this is
3023the case, they match immediately after and immediately
3024before an internal @code{\n} character, respectively, in addition
3025to matching at the start and end of the subject string. For
3026example, the pattern @code{/^abc$/} matches the subject string
3027@samp{def\nabc} in multiline mode, but not otherwise. Consequently,
3028patterns that are anchored in single line mode
3029because all branches start with @code{^} are not anchored in
3030multiline mode.
3031
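The example above can be reproduced from the shell. This sketch assumes
a @acronym{GNU} @command{sed} that accepts the @code{M} flag of the
@code{s} command outside Perl mode; @code{N} joins two input lines so
that pattern space contains an embedded newline.

@example
multi=$(printf 'def\nabc\n' | sed 'N;s/^abc$/X/M')
single=$(printf 'def\nabc\n' | sed 'N;s/^abc$/X/')
echo "$multi"    # two lines: def, X
echo "$single"   # two lines: def, abc (no match)
@end example
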
3032@cindex Perl-style regular expressions, multiline
3033Note that the sequences @code{\A}, @code{\Z}, and @code{\z}
3034can be used to match the start and end of the subject in both
modes, and if all branches of a pattern start with @code{\A}
it is always anchored, whether the @code{M} modifier is set or not.
3037
3038@cindex Perl-style regular expressions, single line
3039Outside a character class, a dot in the pattern matches any
3040one character in the subject, including a non-printing character,
3041but not (by default) newline. If the @code{S} modifier is used,
3042dots match newlines as well. Actually, the handling of
3043dot is entirely independent of the handling of circumflex
3044and dollar sign, the only relationship being that they both
3045involve newline characters. Dot has no special meaning in a
3046character class.
3047
3048@node Square brackets
3049@appendixsec Square brackets
3050@cindex Perl-style regular expressions, character classes
3051
3052An opening square bracket introduces a character class, terminated
3053by a closing square bracket. A closing square bracket on its own
3054is not special. If a closing square bracket is required as a
3055member of the class, it should be the first data character in
3056the class (after an initial circumflex, if present) or escaped with a backslash.
3057
3058A character class matches a single character in the subject;
3059the character must be in the set of characters defined by
3060the class, unless the first character in the class is a circumflex,
3061in which case the subject character must not be in
3062the set defined by the class. If a circumflex is actually
3063required as a member of the class, ensure it is not the
3064first character, or escape it with a backslash.
3065
For example, the character class @code{[aeiou]} matches any lower
case vowel, while @code{[^aeiou]} matches any character that is not
a lower case vowel. Note that a circumflex is just a convenient
notation for specifying the characters which are in
the class by enumerating those that are not. It is not an
assertion: it still consumes a character from the subject
string, and fails if the current pointer is at the end of
the string.
3074
3075@cindex Perl-style regular expressions, case-insensitive
3076When caseless matching is set, any letters in a class
3077represent both their upper case and lower case versions, so
3078for example, a caseless @code{[aeiou]} matches uppercase
3079and lowercase @samp{A}s, and a caseless @code{[^aeiou]}
3080does not match @samp{A}, whereas a case-sensitive version would.
3081
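A small sketch of the caseless case, assuming @acronym{GNU}
@command{sed} and its @code{I} flag on the @code{s} command; it runs
outside Perl mode, on the assumption that a class of plain letters is
treated the same way there.

@example
cased=$(printf 'Apple\n' | sed 's/[aeiou]/_/g')
caseless=$(printf 'Apple\n' | sed 's/[aeiou]/_/Ig')
echo "$cased"      # Appl_
echo "$caseless"   # _ppl_
@end example
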
3082@cindex Perl-style regular expressions, single line
3083@cindex Perl-style regular expressions, multiline
3084The newline character is never treated in any special way in
3085character classes, whatever the setting of the @code{S} and
3086@code{M} options (modifiers) is. A class such as @code{[^a]} will
3087always match a newline.
3088
3089The minus (hyphen) character can be used to specify a range
3090of characters in a character class. For example, @code{[d-m]}
3091matches any letter between d and m, inclusive. If a minus
3092character is required in a class, it must be escaped with a
3093backslash or appear in a position where it cannot be interpreted
3094as indicating a range, typically as the first or last
3095character in the class.
3096
3097It is not possible to have the literal character @code{]} as the
3098end character of a range. A pattern such as @code{[W-]46]} is
3099interpreted as a class of two characters (@code{W} and @code{-})
3100followed by a literal string @code{46]}, so it would match
3101@samp{W46]} or @samp{-46]}. However, if the @code{]} is escaped
3102with a backslash it is interpreted as the end of range, so
3103@code{[W-\]46]} is interpreted as a single class containing a
3104range followed by two separate characters. The octal or
3105hexadecimal representation of @code{]} can also be used to end a range.
3106
3107Ranges operate in @sc{ascii} collating sequence. They can also be
3108used for characters specified numerically, for example
3109@code{[\000-\037]}. If a range that includes letters is used when
3110caseless matching is set, it matches the letters in either
3111case. For example, a caseless @code{[W-c]} is equivalent to
3112@code{[][\^_`wxyzabc]}, matched caselessly, and if character
3113tables for the French locale are in use, @code{[\xc8-\xcb]}
3114matches accented E characters in both cases.
3115
3116Unlike in @sc{posix} mode, the character types @code{\d},
3117@code{\D}, @code{\s}, @code{\S}, @code{\w}, and @code{\W}
3118may also appear in a character class, and add the characters
3119that they match to the class. For example, @code{[\dABCDEF]} matches any
3120hexadecimal digit. A circumflex can conveniently be used
3121with the upper case character types to specify a more restricted
3122set of characters than the matching lower case type.
3123For example, the class @code{[^\W_]} matches any letter or digit,
3124but not underscore.
3125
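Both classes from the paragraph above can be exercised as follows. The
@command{perl} program is used here as a stand-in for a Perl-compatible
matcher (an assumption of this sketch, since @value{SSED} may not be
available), and the input strings are made up.

@example
hex=$(printf 'x7Fz\n' | perl -lpe 's/[\dABCDEF]/#/g')
word=$(printf 'a_1-\n' | perl -lpe 's/[^\W_]/#/g')
echo "$hex"    # x##z
echo "$word"   # #_#-
@end example
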
All non-alphanumeric characters other than @code{\}, @code{-},
3127@code{^} (at the start) and the terminating @code{]}
3128are non-special in character classes, but it does no harm
3129if they are escaped.
3130
3131Perl 5.6 supports the @sc{posix} notation for character classes, which
3132uses names enclosed by @code{[:} and @code{:]} within the enclosing
3133square brackets, and @value{SSED} supports this notation as well.
3134For example,
3135
3136@example
3137 [01[:alpha:]%]
3138@end example
3139
3140@noindent
3141matches @samp{0}, @samp{1}, any alphabetic character, or @samp{%}.
3142The supported class names are
3143
3144@table @code
3145@item alnum
3146Matches letters and digits
3147
3148@item alpha
3149Matches letters
3150
3151@item ascii
3152Matches character codes 0 - 127
3153
3154@item cntrl
3155Matches control characters
3156
3157@item digit
3158Matches decimal digits (same as \d)
3159
3160@item graph
3161Matches printing characters, excluding space
3162
3163@item lower
3164Matches lower case letters
3165
3166@item print
3167Matches printing characters, including space
3168
3169@item punct
3170Matches printing characters, excluding letters and digits
3171
3172@item space
3173Matches white space (same as \s)
3174
3175@item upper
3176Matches upper case letters
3177
3178@item word
3179Matches ``word'' characters (same as \w)
3180
3181@item xdigit
3182Matches hexadecimal digits
3183@end table
3184
3185The names @code{ascii} and @code{word} are extensions valid only in
3186Perl mode. Another Perl extension is negation, which is
3187indicated by a circumflex character after the colon. For example,
3188
3189@example
3190 [12[:^digit:]]
3191@end example
3192
3193@noindent
3194matches @samp{1}, @samp{2}, or any non-digit.
3195
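The negated class can be tried with @command{perl} standing in for a
Perl-compatible matcher; this is an assumption of the sketch, not a
demonstration of @value{SSED} itself.

@example
out=$(printf 'a15\n' | perl -lpe 's/[12[:^digit:]]/#/g')
echo "$out"   # ##5
@end example
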
3196@node Options setting
3197@appendixsec Options setting
3198@cindex Perl-style regular expressions, toggling options
3199@cindex Perl-style regular expressions, case-insensitive
3200@cindex Perl-style regular expressions, multiline
3201@cindex Perl-style regular expressions, single line
3202@cindex Perl-style regular expressions, extended
3203
3204The settings of the @code{I}, @code{M}, @code{S}, @code{X}
3205modifiers can be changed from within the pattern by
3206a sequence of Perl option letters enclosed between @code{(?}
3207and @code{)}. The option letters must be lowercase.
3208
3209For example, @code{(?im)} sets caseless, multiline matching. It is
3210also possible to unset these options by preceding the letter
3211with a hyphen; you can also have combined settings and unsettings:
3212@code{(?im-sx)} sets caseless and multiline matching,
while unsetting single line matching (for dots) and extended
3214whitespace interpretation. If a letter appears both before
3215and after the hyphen, the option is unset.
3216
3217The scope of these option changes depends on where in the
3218pattern the setting occurs. For settings that are outside
3219any subpattern (defined below), the effect is the same as if
3220the options were set or unset at the start of matching. The
3221following patterns all behave in exactly the same way:
3222
3223@example
3224 (?i)abc
3225 a(?i)bc
3226 ab(?i)c
3227 abc(?i)
3228@end example
3229
@noindent
which in turn is the same as specifying the pattern @code{abc} with
3231the @code{I} modifier. In other words, ``top level'' settings
3232apply to the whole pattern (unless there are other
3233changes inside subpatterns). If there is more than one setting
3234of the same option at top level, the rightmost setting
3235is used.
3236
3237If an option change occurs inside a subpattern, the effect
3238is different. This is a change of behaviour in Perl 5.005.
3239An option change inside a subpattern affects only that part
3240of the subpattern @emph{that follows} it, so
3241
3242@example
3243 (a(?i)b)c
3244@end example
3245
3246@noindent
matches @samp{abc} and @samp{aBc} and no other strings (assuming
3248case-sensitive matching is used). By this means, options can
3249be made to have different settings in different parts of the
3250pattern. Any changes made in one alternative do carry on
3251into subsequent branches within the same subpattern. For
3252example,
3253
3254@example
3255 (a(?i)b|c)
3256@end example
3257
3258@noindent
3259matches @samp{ab}, @samp{aB}, @samp{c}, and @samp{C},
3260even though when matching @samp{C} the first branch is
3261abandoned before the option setting.
3262This is because the effects of option settings happen at
3263compile time. There would be some very weird behaviour otherwise.
3264
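The scoping rule for a setting inside a subpattern can be checked
against a handful of candidate strings. In this sketch @command{perl}
stands in for a Perl-compatible matcher, which is an assumption of the
example; the candidate strings are made up.

@example
out=$(printf 'abc\naBc\nabC\nAbc\n' | perl -ne 'print if /^(a(?i)b)c$/')
echo "$out"   # two lines: abc, aBc
@end example
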
3265@ignore
3266There are two PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA
3267that can be changed in the same way as the Perl-compatible options by
3268using the characters U and X respectively. The (?X) flag
3269setting is special in that it must always occur earlier in
3270the pattern than any of the additional features it turns on,
3271even when it is at top level. It is best put at the start.
3272@end ignore
3273
3274
3275@node Non-capturing subpatterns
3276@appendixsec Non-capturing subpatterns
3277@cindex Perl-style regular expressions, non-capturing subpatterns
3278
3279Marking part of a pattern as a subpattern does two things.
3280On one hand, it localizes a set of alternatives; on the other
3281hand, it sets up the subpattern as a capturing subpattern (as
3282defined above). The subpattern can be backreferenced and
3283referenced in the right side of @code{s} commands.
3284
3285For example, if the string @samp{the red king} is matched against
3286the pattern
3287
3288@example
3289 the ((red|white) (king|queen))
3290@end example
3291
3292@noindent
3293the captured substrings are @samp{red king}, @samp{red},
3294and @samp{king}, and are numbered 1, 2, and 3.
3295
3296The fact that plain parentheses fulfil two functions is not
3297always helpful. There are often times when a grouping
3298subpattern is required without a capturing requirement. If an
3299opening parenthesis is followed by @code{?:}, the subpattern does
3300not do any capturing, and is not counted when computing the
3301number of any subsequent capturing subpatterns. For example,
3302if the string @samp{the white queen} is matched against the pattern
3303
3304@example
3305 the ((?:red|white) (king|queen))
3306@end example
3307
3308@noindent
3309the captured substrings are @samp{white queen} and @samp{queen},
3310and are numbered 1 and 2. The maximum number of captured
3311substrings is 99, while the maximum number of all subpatterns,
3312both capturing and non-capturing, is 200.
3313
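The numbering can be made visible by printing the captured substrings.
This sketch uses @command{perl} as a stand-in for a Perl-compatible
matcher (an assumption, since @value{SSED} may not be installed); in
@command{perl} the captures are written @code{$1} and @code{$2} in the
replacement.

@example
out=$(printf 'the white queen\n' |
      perl -lpe 's/the ((?:red|white) (king|queen))/$1;$2/')
echo "$out"   # white queen;queen
@end example
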
3314As a convenient shorthand, if any option settings are
required at the start of a non-capturing subpattern, the
3316option letters may appear between the @code{?} and the
3317@code{:}. Thus the two patterns
3318
3319@example
3320 (?i:saturday|sunday)
3321 (?:(?i)saturday|sunday)
3322@end example
3323
3324@noindent
3325match exactly the same set of strings. Because alternative
3326branches are tried from left to right, and options are not
3327reset until the end of the subpattern is reached, an option
3328setting in one branch does affect subsequent branches, so
3329the above patterns match @samp{SUNDAY} as well as @samp{Saturday}.
3330
3331
3332@node Repetition
3333@appendixsec Repetition
3334@cindex Perl-style regular expressions, repetitions
3335
3336Repetition is specified by quantifiers, which can follow any
3337of the following items:
3338
3339@itemize @bullet
3340@item
3341a single character, possibly escaped
3342
3343@item
3344the @code{.} special character
3345
3346@item
3347a character class
3348
3349@item
3350a back reference (see next section)
3351
3352@item
3353a parenthesized subpattern (unless it is an assertion; @pxref{Assertions})
3354@end itemize
3355
3356The general repetition quantifier specifies a minimum and
3357maximum number of permitted matches, by giving the two
3358numbers in curly brackets (braces), separated by a comma.
3359The numbers must be less than 65536, and the first must be
3360less than or equal to the second. For example:
3361
3362@example
3363 z@{2,4@}
3364@end example
3365
3366@noindent
3367matches @samp{zz}, @samp{zzz}, or @samp{zzzz}. A closing brace on its own
3368is not a special character. If the second number is omitted,
3369but the comma is present, there is no upper limit; if the
3370second number and the comma are both omitted, the quantifier
3371specifies an exact number of required matches. Thus
3372
3373@example
3374 [aeiou]@{3,@}
3375@end example
3376
3377@noindent
3378matches at least 3 successive vowels, but may match many
3379more, while
3380
3381@example
3382 \d@{8@}
3383@end example
3384
3385@noindent
3386matches exactly 8 digits. An opening curly bracket that
3387appears in a position where a quantifier is not allowed, or
3388one that does not match the syntax of a quantifier, is taken
as a literal character. For example, @code{@{,6@}} is not a quantifier,
3390but a literal string of four characters.@footnote{It
3391raises an error if @option{-R} is not used.}
3392
3393The quantifier @samp{@{0@}} is permitted, causing the expression to
3394behave as if the previous item and the quantifier were not
3395present.
3396
3397For convenience (and historical compatibility) the three
3398most common quantifiers have single-character abbreviations:
3399
3400@table @code
3401@item *
3402is equivalent to @{0,@}
3403
3404@item +
3405is equivalent to @{1,@}
3406
3407@item ?
3408is equivalent to @{0,1@}
3409@end table
3410
3411It is possible to construct infinite loops by following a
3412subpattern that can match no characters with a quantifier
3413that has no upper limit, for example:
3414
3415@example
3416 (a?)*
3417@end example
3418
3419Earlier versions of Perl used to give an error at
3420compile time for such patterns. However, because there are
3421cases where this can be useful, such patterns are now
3422accepted, but if any repetition of the subpattern does in
3423fact match no characters, the loop is forcibly broken.
3424
3425@cindex Greedy regular expression matching
3426@cindex Perl-style regular expressions, stingy repetitions
3427By default, the quantifiers are @dfn{greedy} like in @sc{posix}
3428mode, that is, they match as much as possible (up to the maximum
3429number of permitted times), without causing the rest of the
3430pattern to fail. The classic example of where this gives problems
3431is in trying to match comments in C programs. These appear between
3432the sequences @code{/*} and @code{*/} and within the sequence, individual
3433@code{*} and @code{/} characters may appear. An attempt to match C
3434comments by applying the pattern
3435
3436@example
3437 /\*.*\*/
3438@end example
3439
3440@noindent
3441to the string
3442
3443@example
 /* first comment */ not comment /* second comment */
3445@end example
3446
@noindent
fails, because it matches the entire string owing to the
3450greediness of the @code{.*} item.
3451
3452However, if a quantifier is followed by a question mark, it
3453ceases to be greedy, and instead matches the minimum number
3454of times possible, so the pattern @code{/\*.*?\*/}
3455does the right thing with the C comments. The meaning of the
3456various quantifiers is not otherwise changed, just the preferred
3457number of matches. Do not confuse this use of question
3458mark with its use as a quantifier in its own right.
3459Because it has two uses, it can sometimes appear doubled, as in
3460
3461@example
3462 \d??\d
3463@end example
3464
3465which matches one digit by preference, but can match two if
3466that is the only way the rest of the pattern matches.
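
As a rough illustration of the greedy and non-greedy behaviour described
above, the following sketch uses Python's @code{re} module as a stand-in
for the Perl-mode matcher. That substitution is an assumption made only to
obtain a runnable example, and the variable names are illustrative.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the end and backs off only as far as the last "*/".
greedy = re.search(r"/\*.*\*/", subject).group(0)

# Non-greedy: .*? stops at the first "*/" that lets the pattern succeed.
lazy = re.search(r"/\*.*?\*/", subject).group(0)

# \d??\d prefers a single digit, but takes two when the rest demands it.
one = re.match(r"\d??\d", "12").group(0)
two = re.match(r"\d??\d$", "12").group(0)

print(greedy == subject, lazy, one, two)
```

Only the preferred number of repetitions differs between the two forms;
both still backtrack when the rest of the pattern requires it.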
3467
Note that greediness does not matter when specifying addresses,
but can nevertheless be used to improve performance.
3470
3471@ignore
3472 If the PCRE_UNGREEDY option is set (an option which is not
3473 available in Perl), the quantifiers are not greedy by
3474 default, but individual ones can be made greedy by following
3475 them with a question mark. In other words, it inverts the
3476 default behaviour.
3477@end ignore
3478
When a parenthesized subpattern is quantified with a minimum
repeat count that is greater than 1 or with a limited maximum,
more memory is required for the compiled pattern, in
proportion to the size of the minimum or maximum.
3483
3484@cindex Perl-style regular expressions, single line
3485If a pattern starts with @code{.*} or @code{.@{0,@}} and the
3486@code{S} modifier is used, the pattern is implicitly anchored,
3487because whatever follows will be tried against every character
3488position in the subject string, so there is no point in
3489retrying the overall match at any position after the first.
PCRE treats such a pattern as though it were preceded by @code{\A}.
3491
3492When a capturing subpattern is repeated, the value captured
3493is the substring that matched the final iteration. For example,
3494after
3495
3496@example
3497 (tweedle[dume]@{3@}\s*)+
3498@end example
3499
3500@noindent
3501has matched @samp{tweedledum tweedledee} the value of the
3502captured substring is @samp{tweedledee}. However, if there are
3503nested capturing subpatterns, the corresponding captured
3504values may have been set in previous iterations. For example,
3505after
3506
3507@example
3508 /(a|(b))+/
3509@end example
3510
3511matches @samp{aba}, the value of the second captured substring is
3512@samp{b}.
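
The two capturing rules above can be checked with a short sketch. It
assumes that Python's @code{re} module behaves like the Perl-mode matcher
in this respect; the variable names are purely illustrative.

```python
import re

# Repeated capturing group: only the final iteration is remembered.
m = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m.group(1)

# Nested group: group 2 last took part in the second iteration ("b"),
# and that value is still visible after the third iteration matched "a".
n = re.fullmatch(r"(a|(b))+", "aba")
outer, inner = n.group(1), n.group(2)

print(last_iteration, outer, inner)
```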
3513
3514@node Backreferences
3515@appendixsec Backreferences
3516@cindex Perl-style regular expressions, backreferences
3517
3518Outside a character class, a backslash followed by a digit
3519greater than 0 (and possibly further digits) is a back
3520reference to a capturing subpattern earlier (i.e. to its
3521left) in the pattern, provided there have been that many
3522previous capturing left parentheses.
3523
3524However, if the decimal number following the backslash is
3525less than 10, it is always taken as a back reference, and
3526causes an error only if there are not that many capturing
3527left parentheses in the entire pattern. In other words, the
3528parentheses that are referenced need not be to the left of
the reference for numbers less than 10. @xref{Backslash},
for further details of the handling of digits following a backslash.
3531
3532A back reference matches whatever actually matched the capturing
3533subpattern in the current subject string, rather than
3534anything matching the subpattern itself. So the pattern
3535
3536@example
3537 (sens|respons)e and \1ibility
3538@end example
3539
3540@noindent
3541matches @samp{sense and sensibility} and @samp{response and responsibility},
3542but not @samp{sense and responsibility}. If caseful
3543matching is in force at the time of the back reference, the
3544case of letters is relevant. For example,
3545
3546@example
3547 ((?i)blah)\s+\1
3548@end example
3549
3550@noindent
3551matches @samp{blah blah} and @samp{Blah Blah}, but not
3552@samp{BLAH blah}, even though the original capturing
3553subpattern is matched caselessly.
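
A minimal sketch of both examples, assuming Python's @code{re} module as
a substitute for the Perl-mode matcher. Recent Python versions reject an
option setting such as @code{(?i)} in the middle of a pattern, so the
scoped form @code{(?i:...)} is used instead; this is an adaptation, not
the syntax shown above.

```python
import re

pair = re.compile(r"(sens|respons)e and \1ibility")
same_root = pair.fullmatch("sense and sensibility") is not None
mixed_root = pair.fullmatch("sense and responsibility") is not None

# Only the group is caseless; the back reference outside it is caseful,
# so it must repeat the captured text letter for letter.
echo = re.compile(r"((?i:blah))\s+\1")
results = [echo.fullmatch(s) is not None
           for s in ("blah blah", "Blah Blah", "BLAH blah")]

print(same_root, mixed_root, results)
```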
3554
3555There may be more than one back reference to the same subpattern.
3556Also, if a subpattern has not actually been used in a
3557particular match, any back references to it always fail. For
3558example, the pattern
3559
3560@example
3561 (a|(bc))\2
3562@end example
3563
3564@noindent
3565always fails if it starts to match @samp{a} rather than
3566@samp{bc}. Because there may be up to 99 back references, all
3567digits following the backslash are taken as part of a potential
3568back reference number; this is different from what happens
3569in @sc{posix} mode. If the pattern continues with a digit
3570character, some delimiter must be used to terminate the back
3571reference. If the @code{X} modifier option is set, this can be
3572whitespace. Otherwise an empty comment can be used, or the
3573following character can be expressed in hexadecimal or octal.
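
The following sketch illustrates an unset back reference and the
empty-comment delimiter. It assumes Python's @code{re} module, whose
handling of these two cases happens to agree with the description above;
the names are illustrative.

```python
import re

unset = re.compile(r"(a|(bc))\2")
via_a = unset.match("a")          # group 2 never took part, so \2 fails
via_bc = unset.fullmatch("bcbc")  # group 2 is "bc", so \2 matches "bc"

# An empty comment separates the back reference \1 from a literal digit 1.
delimited = re.fullmatch(r"(a)\1(?#)1", "aa1")

print(via_a, via_bc.group(2), delimited.group(0))
```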
3574
3575A back reference that occurs inside the parentheses to which
3576it refers fails when the subpattern is first used, so, for
3577example, @code{(a\1)} never matches. However, such references
3578can be useful inside repeated subpatterns. For example, the
3579pattern
3580
3581@example
3582 (a|b\1)+
3583@end example
3584
3585@noindent
3586matches any number of @samp{a}s and also @samp{aba}, @samp{ababbaa},
3587etc. At each iteration of the subpattern, the back reference matches
3588the character string corresponding to the previous iteration. In
3589order for this to work, the pattern must be such that the first
3590iteration does not need to match the back reference. This can be
3591done using alternation, as in the example above, or by a
3592quantifier with a minimum of zero.
3593
3594@node Assertions
3595@appendixsec Assertions
3596@cindex Perl-style regular expressions, assertions
3597@cindex Perl-style regular expressions, asserting subpatterns
3598
3599An assertion is a test on the characters following or
3600preceding the current matching point that does not actually
3601consume any characters. The simple assertions coded as @code{\b},
3602@code{\B}, @code{\A}, @code{\Z}, @code{\z}, @code{^} and @code{$}
3603are described above. More complicated assertions are coded as
3604subpatterns. There are two kinds: those that look ahead of the
3605current position in the subject string, and those that look behind it.
3606
3607@cindex Perl-style regular expressions, lookahead subpatterns
3608An assertion subpattern is matched in the normal way, except
3609that it does not cause the current matching position to be
3610changed. Lookahead assertions start with @code{(?=} for positive
3611assertions and @code{(?!} for negative assertions. For example,
3612
3613@example
3614 \w+(?=;)
3615@end example
3616
3617@noindent
3618matches a word followed by a semicolon, but does not include
3619the semicolon in the match, and
3620
3621@example
3622 foo(?!bar)
3623@end example
3624
3625@noindent
3626matches any occurrence of @samp{foo} that is not followed by
3627@samp{bar}.
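
Both lookahead forms can be tried out with the sketch below, which
assumes Python's @code{re} module in place of the Perl-mode matcher. The
subject strings are made up for the illustration.

```python
import re

# The semicolon is tested but not included in the match.
word_before_semicolon = re.search(r"\w+(?=;)", "return value;").group(0)

# Start offsets of every "foo" that is not followed by "bar".
foo_hits = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz foo")]

print(word_before_semicolon, foo_hits)
```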
3628
3629Note that the apparently similar pattern
3630
3631@example
3632 (?!foo)bar
3633@end example
3634
3635@noindent
3636@cindex Perl-style regular expressions, lookbehind subpatterns
3637finds any occurrence of @samp{bar} even if it is preceded by
3638@samp{foo}, because the assertion @code{(?!foo)} is always true
3639when the next three characters are @samp{bar}. A lookbehind
3640assertion is needed to achieve this effect.
3641Lookbehind assertions start with @code{(?<=} for positive
3642assertions and @code{(?<!} for negative assertions. So,
3643
3644@example
3645 (?<!foo)bar
3646@end example
3647
3648achieves the required effect of finding an occurrence of
3649@samp{bar} that is not preceded by @samp{foo}. The contents of a
3650lookbehind assertion are restricted
3651such that all the strings it matches must have a fixed
3652length. However, if there are several alternatives, they do
3653not all have to have the same fixed length. This is an extension
3654compared with Perl 5.005, which requires all branches to match
3655the same length of string. Thus
3656
3657@example
 (?<!dogs|dog|cats|cat)
3659@end example
3660
3661@noindent
3662is permitted, but the apparently equivalent regular expression
3663
3664@example
3665 (?<!dogs?|cats?)
3666@end example
3667
3668@noindent
3669causes an error at compile time. Branches that match different
3670length strings are permitted only at the top level of
3671a lookbehind assertion: an assertion such as
3672
3673@example
3674 (?<=ab(c|de))
3675@end example
3676
3677@noindent
3678is not permitted, because its single top-level branch can
3679match two different lengths, but it is acceptable if rewritten
3680to use two top-level branches:
3681
3682@example
3683 (?<=abc|abde)
3684@end example
3685
3686All this is required because lookbehind assertions simply
3687move the current position back by the alternative's fixed
3688width and then try to match. If there are
3689insufficient characters before the current position, the
match is deemed to fail. Lookbehinds, in conjunction with
non-backtracking subpatterns, can be particularly useful for
3692matching at the ends of strings; an example is given at the end
3693of the section on non-backtracking subpatterns.
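
The difference between the negative lookahead and the negative lookbehind
can be seen in the following sketch, which assumes Python's @code{re}
module. Note that Python is stricter than the rules given above: it
requires every alternative of a lookbehind to have the same width, so it
would also reject alternatives of different fixed lengths.

```python
import re

subjects = ("foobar", "bazbar")

# (?!foo) is tested where "bar" starts, so it can never see a "foo" there.
ahead = [re.search(r"(?!foo)bar", s) is not None for s in subjects]

# (?<!foo) looks at the three characters before "bar".
behind = [re.search(r"(?<!foo)bar", s) is not None for s in subjects]

# A branch that can match two different lengths is refused at compile time.
try:
    re.compile(r"(?<=dogs?|cats?)")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(ahead, behind, variable_width_ok)
```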
3694
3695Several assertions (of any sort) may occur in succession.
3696For example,
3697
3698@example
3699 (?<=\d@{3@})(?<!999)foo
3700@end example
3701
3702@noindent
3703matches @samp{foo} preceded by three digits that are not @samp{999}.
3704Notice that each of the assertions is applied independently
3705at the same point in the subject string. First there is a
3706check that the previous three characters are all digits, and
3707then there is a check that the same three characters are not
@samp{999}. This pattern does not match @samp{foo} preceded by six
characters, the first three of which are digits and the last three
of which are not @samp{999}. For example, it doesn't match
@samp{123abcfoo}. A pattern to do that is
3712
3713@example
3714 (?<=\d@{3@}...)(?<!999)foo
3715@end example
3716
3717@noindent
3718This time the first assertion looks at the preceding six
3719characters, checking that the first three are digits, and
3720then the second assertion checks that the preceding three
3721characters are not @samp{999}. Actually, assertions can be
3722nested in any combination, so one can write this as
3723
3724@example
3725 (?<=\d@{3@}(?!999)...)foo
3726@end example
3727
3728or
3729
3730@example
3731 (?<=\d@{3@}...(?<!999))foo
3732@end example
3733
3734@noindent
3735both of which might be considered more readable.
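
The sketch below compares the three six-character variants on a few
made-up subjects, and also runs the simpler three-character pattern. It
assumes Python's @code{re} module, which accepts all of these lookbehinds
because each has a single fixed width.

```python
import re

separate = re.compile(r"(?<=\d{3}...)(?<!999)foo")
nested_ahead = re.compile(r"(?<=\d{3}(?!999)...)foo")
nested_behind = re.compile(r"(?<=\d{3}...(?<!999))foo")

subjects = ("123abcfoo", "123999foo", "abc123foo")
table = [[p.search(s) is not None for s in subjects]
         for p in (separate, nested_ahead, nested_behind)]

# The simpler pattern only inspects the three characters before "foo".
simple = [re.search(r"(?<=\d{3})(?<!999)foo", s) is not None
          for s in ("123foo", "999foo", "123abcfoo")]

print(table, simple)
```

All three six-character variants agree on every subject, which is what
the text claims when it calls them alternative spellings.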
3736
3737Assertion subpatterns are not capturing subpatterns, and may
3738not be repeated, because it makes no sense to assert the
3739same thing several times. If any kind of assertion contains
3740capturing subpatterns within it, these are counted for the
3741purposes of numbering the capturing subpatterns in the whole
3742pattern. However, substring capturing is carried out only
3743for positive assertions, because it does not make sense for
3744negative assertions.
3745
3746Assertions count towards the maximum of 200 parenthesized
3747subpatterns.
3748
3749@node Non-backtracking subpatterns
3750@appendixsec Non-backtracking subpatterns
3751@cindex Perl-style regular expressions, non-backtracking subpatterns
3752
3753With both maximizing and minimizing repetition, failure of
3754what follows normally causes the repeated item to be evaluated
3755again to see if a different number of repeats allows the
3756rest of the pattern to match. Sometimes it is useful to
3757prevent this, either to change the nature of the match, or
to cause it to fail earlier than it otherwise might, when the
3759author of the pattern knows there is no point in carrying
3760on.
3761
3762Consider, for example, the pattern @code{\d+foo} when applied to
3763the subject line
3764
3765@example
3766 123456bar
3767@end example
3768
3769After matching all 6 digits and then failing to match @samp{foo},
3770the normal action of the matcher is to try again with only 5
3771digits matching the @code{\d+} item, and then with 4, and so on,
3772before ultimately failing. Non-backtracking subpatterns
3773provide the means for specifying that once a portion of the
3774pattern has matched, it is not to be re-evaluated in this way,
3775so the matcher would give up immediately on failing to match
3776@samp{foo} the first time. The notation is another kind of special
3777parenthesis, starting with @code{(?>} as in this example:
3778
3779@example
 (?>\d+)foo
3781@end example
3782
3783This kind of parenthesis ``locks up'' the part of the pattern
3784it contains once it has matched, and a failure further into
3785the pattern is prevented from backtracking into it.
3786Backtracking past it to previous items, however, works as
3787normal.
3788
3789Non-backtracking subpatterns are not capturing subpatterns. Simple
3790cases such as the above example can be thought of as a maximizing
3791repeat that must swallow everything it can. So,
3792while both @code{\d+} and @code{\d+?} are prepared to adjust the number of
3793digits they match in order to make the rest of the pattern
3794match, @code{(?>\d+)} can only match an entire sequence of digits.
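
The effect of locking up a repeat can be sketched in Python. Because
Python's @code{re} module accepts @code{(?>...)} only from version 3.11
onwards, this example substitutes a common emulation: a lookahead
captures the digits and a back reference then consumes them. The
emulation is an assumption of this sketch, not the notation described
above.

```python
import re

# Ordinary repeat: \d+ gives back one digit so that the final \d can match.
backtracking = re.search(r"\d+\d", "123")

# (?=(\d+))\1 grabs the whole digit run inside a lookahead and then consumes
# it through the back reference, so no digit can be handed back afterwards.
locked = re.search(r"(?=(\d+))\1\d", "123")

print(backtracking.group(0), locked)
```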
3795
3796This construction can of course contain arbitrarily complicated
3797subpatterns, and it can be nested.
3798
3799@cindex Perl-style regular expressions, lookbehind subpatterns
3800Non-backtracking subpatterns can be used in conjunction with look-behind
3801assertions to specify efficient matching at the end
3802of the subject string. Consider a simple pattern such as
3803
3804@example
3805 abcd$
3806@end example
3807
3808@noindent
3809when applied to a long string which does not match. Because
3810matching proceeds from left to right, @command{sed} will look for
3811each @samp{a} in the subject and then see if what follows matches
3812the rest of the pattern. If the pattern is specified as
3813
3814@example
3815 ^.*abcd$
3816@end example
3817
3818@noindent
3819the initial @code{.*} matches the entire string at first, but when
3820this fails (because there is no following @samp{a}), it backtracks
3821to match all but the last character, then all but the
3822last two characters, and so on. Once again the search for
3823@samp{a} covers the entire string, from right to left, so we are
3824no better off. However, if the pattern is written as
3825
3826@example
3827 ^(?>.*)(?<=abcd)
3828@end example
3829
@noindent
there can be no backtracking for the @code{.*} item; it can match
3831only the entire string. The subsequent lookbehind assertion
3832does a single test on the last four characters. If it fails,
3833the match fails immediately. For long strings, this approach
3834makes a significant difference to the processing time.
3835
3836When a pattern contains an unlimited repeat inside a subpattern
3837that can itself be repeated an unlimited number of
times, the use of a non-backtracking subpattern is the only way to
3839avoid some failing matches taking a very long time
3840indeed.@footnote{Actually, the matcher embedded in @value{SSED}
3841 tries to do something for this in the simplest cases,
3842 like @code{([^b]*b)*}. These cases are actually quite
3843 common: they happen for example in a regular expression
3844 like @code{\/\*([^*]*\*)*\/} which matches C comments.}
3845
3846The pattern
3847
3848@example
3849 (\D+|<\d+>)*[!?]
3850@end example
3851
@c ([^0-9<]+<(\d+>)?)*[!?]
3853
3854@noindent
3855matches an unlimited number of substrings that either consist
of non-digits, or digits enclosed in angle brackets, followed by
3857an exclamation or question mark. When it matches, it runs quickly.
3858However, if it is applied to
3859
3860@example
3861 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
3862@end example
3863
3864@noindent
3865it takes a long time before reporting failure. This is
3866because the string can be divided between the two repeats in
3867a large number of ways, and all have to be tried.@footnote{The
3868example used @code{[!?]} rather than a single character at the end,
3869because both @value{SSED} and Perl have an optimization that allows
3870for fast failure when a single character is used. They
3871remember the last single character that is required for a
3872match, and fail early if it is not present in the string.}
3873
3874If the pattern is changed to
3875
3876@example
3877 ((?>\D+)|<\d+>)*[!?]
3878@end example
3879
3880sequences of non-digits cannot be broken, and failure happens
3881quickly.
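
A rough way to observe the difference is to time both forms on a short
run of @samp{a} characters. The sketch assumes Python's @code{re} module
and, for portability across Python versions, emulates the non-backtracking
group with a lookahead followed by a back reference. The subject is kept
to 18 characters so that the slow form still finishes; the timings depend
on the machine and are only indicative.

```python
import re
import time

subject = "a" * 18
plain = re.compile(r"(\D+|<\d+>)*[!?]")
locked = re.compile(r"(?:(?=(\D+))\1|<\d+>)*[!?]")

def timed(pattern):
    """Search the subject and return the result with the elapsed seconds."""
    start = time.perf_counter()
    found = pattern.search(subject)
    return found, time.perf_counter() - start

plain_result, plain_seconds = timed(plain)
locked_result, locked_seconds = timed(locked)

# Both searches fail; only the time needed to find that out differs.
print(plain_result, locked_result, plain_seconds, locked_seconds)
```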
3882
3883@node Conditional subpatterns
3884@appendixsec Conditional subpatterns
3885@cindex Perl-style regular expressions, conditional subpatterns
3886
3887It is possible to cause the matching process to obey a subpattern
3888conditionally or to choose between two alternative
3889subpatterns, depending on the result of an assertion, or
3890whether a previous capturing subpattern matched or not. The
3891two possible forms of conditional subpattern are
3892
3893@example
3894 (?(@var{condition})@var{yes-pattern})
3895 (?(@var{condition})@var{yes-pattern}|@var{no-pattern})
3896@end example
3897
3898If the condition is satisfied, the yes-pattern is used; otherwise
3899the no-pattern (if present) is used. If there are more than two
3900alternatives in the subpattern, a compile-time error occurs.
3901
3902There are two kinds of condition. If the text between the
3903parentheses consists of a sequence of digits, the condition
3904is satisfied if the capturing subpattern of that number has
3905previously matched. The number must be greater than zero.
3906Consider the following pattern, which contains non-significant
3907white space to make it more readable (assume the @code{X} modifier)
3908and to divide it into three parts for ease of discussion:
3909
3910@example
3911 ( \( )? [^()]+ (?(1) \) )
3912@end example
3913
3914The first part matches an optional opening parenthesis, and
3915if that character is present, sets it as the first captured
3916substring. The second part matches one or more characters
3917that are not parentheses. The third part is a conditional
3918subpattern that tests whether the first set of parentheses
matched or not. If they did, that is, if the subject started
3920with an opening parenthesis, the condition is true, and so
3921the yes-pattern is executed and a closing parenthesis is
3922required. Otherwise, since no-pattern is not present, the
3923subpattern matches nothing. In other words, this pattern
3924matches a sequence of non-parentheses, optionally enclosed
3925in parentheses.
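
The pattern can be exercised as follows. The sketch assumes Python's
@code{re} module, which supports the numeric form of the condition, and
uses its @code{re.VERBOSE} flag in place of the @code{X} modifier. The
subjects are illustrative.

```python
import re

optional_parens = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

# A closing parenthesis is demanded only when an opening one was captured.
checks = [optional_parens.fullmatch(s) is not None
          for s in ("abc", "(abc)", "(abc", "abc)")]

print(checks)
```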
3926
3927@cindex Perl-style regular expressions, lookahead subpatterns
3928If the condition is not a sequence of digits, it must be an
3929assertion. This may be a positive or negative lookahead or
3930lookbehind assertion. Consider this pattern, again containing
non-significant white space, and with the two alternatives
on the second and third lines:
3933
3934@example
3935 (?(?=...[a-z])
3936 \d\d-[a-z]@{3@}-\d\d |
3937 \d\d-\d\d-\d\d )
3938@end example
3939
3940The condition is a positive lookahead assertion that matches
3941a letter that is three characters away from the current point.
3942If a letter is found, the subject is matched against the first
3943alternative @samp{@var{dd}-@var{aaa}-@var{dd}} (where @var{aaa} are
3944letters and @var{dd} are digits); otherwise it is matched against
3945the second alternative, @samp{@var{dd}-@var{dd}-@var{dd}}.
3946
3947
3948@node Recursive patterns
3949@appendixsec Recursive patterns
3950@cindex Perl-style regular expressions, recursive patterns
3951@cindex Perl-style regular expressions, recursion
3952
3953Consider the problem of matching a string in parentheses,
3954allowing for unlimited nested parentheses. Without the use
3955of recursion, the best that can be done is to use a pattern
3956that matches up to some fixed depth of nesting. It is not
3957possible to handle an arbitrary nesting depth. Perl 5.6 has
3958provided an experimental facility that allows regular
3959expressions to recurse (amongst other things). It does this
3960by interpolating Perl code in the expression at run time,
and the code can refer to the expression itself. A Perl pattern
to solve the parentheses problem can be created like
this:
3964
3965@example
3966 $re = qr@{\( (?: (?>[^()]+) | (?p@{$re@}) )* \)@}x;
3967@end example
3968
3969The @code{(?p@{...@})} item interpolates Perl code at run time,
3970and in this case refers recursively to the pattern in which it
3971appears. Obviously, @command{sed} cannot support the interpolation of
3972Perl code. Instead, the special item @code{(?R)} is provided for
3973the specific case of recursion. This pattern solves the
3974parentheses problem (assume the @code{X} modifier option is used
3975so that white space is ignored):
3976
3977@example
3978 \( ( (?>[^()]+) | (?R) )* \)
3979@end example
3980
3981First it matches an opening parenthesis. Then it matches any
3982number of substrings which can either be a sequence of
3983non-parentheses, or a recursive match of the pattern itself
3984(i.e. a correctly parenthesized substring). Finally there is
3985a closing parenthesis.
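
Python's @code{re} module has no @code{(?R)} item, so the sketch below
does not run the pattern itself. Instead it spells out, as a hypothetical
hand-written function, the same three steps the pattern performs: an
opening parenthesis, any number of non-parentheses or nested groups, and
a closing parenthesis.

```python
def match_parenthesized(text, start=0):
    """Return the index just past a balanced group at text[start], or -1."""
    if start >= len(text) or text[start] != "(":
        return -1
    pos = start + 1
    while pos < len(text):
        ch = text[pos]
        if ch == ")":
            return pos + 1                        # the closing parenthesis
        if ch == "(":
            pos = match_parenthesized(text, pos)  # plays the role of (?R)
            if pos == -1:
                return -1
        else:
            pos += 1                              # a non-parenthesis
    return -1                                     # ran out of input

print(match_parenthesized("(ab(cd)ef)"), match_parenthesized("(aaaa()"))
```

The function never reconsiders characters it has already accepted, which
is the same property the non-backtracking group gives the pattern.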
3986
3987This particular example pattern contains nested unlimited
3988repeats, and so the use of a non-backtracking subpattern for
3989matching strings of non-parentheses is important when applying
3990the pattern to strings that do not match. For example, when
3991it is applied to
3992
3993@example
3994 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
3995@end example
3996
it yields a ``no match'' response quickly. However, if a
non-backtracking subpattern is not used, the match runs
3999for a very long time indeed because there are so many different
4000ways the @code{+} and @code{*} repeats can carve up the subject,
4001and all have to be tested before failure can be reported.
4002
4003The values set for any capturing subpatterns are those from
4004the outermost level of the recursion at which the subpattern
4005value is set. If the pattern above is matched against
4006
4007@example
4008 (ab(cd)ef)
4009@end example
4010
4011@noindent
4012the value for the capturing parentheses is @samp{ef}, which is
4013the last value taken on at the top level.
4014
4015@node Comments
4016@appendixsec Comments
4017@cindex Perl-style regular expressions, comments
4018
The sequence @code{(?#} marks the start of a comment which continues
up to the next closing parenthesis. Nested parentheses
4021are not permitted. The characters that make up a comment
4022play no part in the pattern matching at all.
4023
4024@cindex Perl-style regular expressions, extended
4025If the @code{X} modifier option is used, an unescaped @code{#} character
4026outside a character class introduces a comment that continues
4027up to the next newline character in the pattern.
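
Both comment forms can be illustrated with Python's @code{re} module,
whose @code{re.VERBOSE} flag is assumed here to play the role of the
@code{X} modifier. The patterns and subjects are made up for the example.

```python
import re

# An inline comment takes no part in the match.
inline = re.fullmatch(r"ab(?#this text is ignored)c", "abc")

# In extended mode, "#" starts a comment that runs to the end of the line.
extended = re.compile(r"""
    \d+   # one or more digits
    -     # a literal hyphen
    \d+   # one or more digits again
""", re.VERBOSE)
ranged = extended.fullmatch("10-20")

print(inline.group(0), ranged.group(0))
```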
4028@end ifset
4029
4030
4031@page
4032@node Concept Index
4033@unnumbered Concept Index
4034
4035This is a general index of all issues discussed in this manual, with the
4036exception of the @command{sed} commands and command-line options.
4037
4038@printindex cp
4039
4040@page
4041@node Command and Option Index
4042@unnumbered Command and Option Index
4043
4044This is an alphabetical list of all @command{sed} commands and command-line
4045options.
4046
4047@printindex fn
4048
4049@contents
4050@bye
4051
4052@c XXX FIXME: the term "cycle" is never defined...