YuPcre2 v1.5.0 D7-XE10.1 » Developer.Team

YuPcre2 v1.5.0 D7-XE10.1

YuPcre2 v1.5.0 D7-XE10.1
YuPcre2 v1.5.0 D7-XE10.1 | 6 Mb


YuPcre2 is a library of Delphi components and procedures that implement regular expression pattern matching using the same syntax and semantics as Perl, with just a few differences. There are two matching algorithms, the standard Perl and alternative DFA algorithm:

The Perl algorithm is what you are used to from Perl and j@vascript. It is fast and supports the complete pattern syntax. You will likely be using it most of the time.
DFA is a special purpose algorithm. If finds all possible matches and, in particular, it finds the longest. It never backtracks and supports partial matching better, in particular multi-segment matching of very long subject strings.

YuPcre2 has native interfaces for 8-bit, 16-bit, and 32-bit strings. Component wrappers are available for UnicodeString / WideString and AnsiString / Utf8String / RawBytestring:

The YuPcre2 RegEx2 classes descend from common ancestors which implement the core functionalities:

Match strings and and extract full or substring matches.
Search for regular expressions within streams and memory buffers. TDIRegExSearchStream descendants employ a buffered search within streams and files (of virtually unlimited size) and use little memory.
Replace full matches or partial substrings.
List full matches or partial substrings.
Format full matches or partial substrings by adding static or dynamic text.

Users familiar with the DIRegEx might be interessted in the differences between YuPcre2 and DIRegEx.

Pattern Syntax

YuPcre2 RegEx2 Workbench Application The YuPcre2 regular expression pattern syntax is mostly compatible with Perl. It includes the following:

Quoting
Escaped Characters
Character Types
General Category Properties for \p and \P
PCRE2 Special Category Properties for \p and \P
Script Names for \p and \P
Character Classes
Quantifiers
Anchors and Simple Assertions
Match Point Reset
Alternation
Capturing
Atomic Groups
Comment
Option Setting
Newline Convention
What \R Matches
Lookahead and Lookbehind Assertions
Backreferences
Subroutine References (possibly recursive)
Conditional Patterns
Backtracking Control
Callouts

YuPcre2 RegEx2 String Processing

YuPcre2 can Replace, List, or Format regular expressions matches or any of its substrings, useful for text editors and word processors. Variable portions of the match can be included into the result text. The full match can be referenced by number, substrings also by name. The character to introduce these reference is freely configurable. FormatOptions allow to turn features on or off as required.

Replace returns the original subject string with matches replaced, similar to but more flexible than Delphi's StringReplace() function.
List collects all string matches into a single string. It extracts multiple phone numbers, e-mail addresses, or URLs, with a single call.

YuPcre2 RegEx2 MaskControls

The YuPcre2 RegEx2 MaskControls Demo ApplicationYuPcre2 includes two regular expression mask edits: TDIRegEx2MaskEdit and TDIRegEx2ComboBox. Both controls validate keyboard input against a regular expression. They work similar to Delphi's TMaskEdit, but more flexible and powerful.

The regular expression mask edits can:

accept / reject specific characters at determined positions;
allow / reject particular characters if they follow defined character(s);
restrict input text to begin / end with exact character(s);
flag incomplete text to show that more input is needed.

Examples: Numbers, number ranges, dates, phone numbers, e-mail addresses, URLs, currency, and more.

Workbench Application

The YuPcre2 RegEx2 Workbench helps to design and test regular expressions. It allows to set options, measure execution times, and to save and load settings for later use.

The YuPcre2 RegEx2 Workbench is available as

Design-Time Component Editor and
Standalone Application.

YuPcre2 1.5.0

New features:

Implemented pcre2_code_copy_with_tables.
\g{+} (e.g. \g{+2}) is now supported. It is a “forward back reference” and can be useful in repetitions (compare \g{-}). Perl does not recognize this syntax.
Optimizations:

When a pattern is too complicated, PCRE2 gives up trying to find a minimum matching length and just records zero. Typically this happens when there are too many nested or recursive back references. If the limit was reached in certain recursive cases it failed to be triggered and an internal error could be the result.
The pcre2_dfa_match function now takes note of the recursion limit for the internal recursive calls that are used for lookrounds and recursions within the pattern.
Detecting patterns that are too large inside the length-measuring loop saves processing ridiculously long patterns to their end.
When autopossessifying, skip empty branches without recursion, to reduce stack usage. Example pattern: X?(R||){3335}.
A pattern with very many explicit back references to a group that is a long way from the start of the pattern could take a long time to compile because searching for the referenced group in order to find the minimum length was being done repeatedly. Now up to 128 group minimum lengths are cached and the attempt to find a minimum length is abandoned if there is a back reference to a group whose number is greater than 128. (In that case, the pattern is so complicated that this optimization probably isn't worth it.)
Bug fixes:

In any wide-character mode (8-bit UTF or any 16-bit or 32-bit mode), without PCRE2_UCP set, a negative character type such as \D in a positive class should cause all characters greater than 255 to match, whatever else is in the class. There was a bug that caused this not to happen if a Unicode property item was added to such a class, for example [\D\P{Nd}] or [\W\pL].
There has been a major re-factoring of pcre2_compile. Most syntax checking is now done in the pre-pass that identifies capturing groups. While doing this, some minor bugs and Perl incompatibilities were fixed, including:
\Q\E in the middle of a quantifier such as A+\Q\E+ is now ignored instead of giving an invalid quantifier error.
{0} can now be used after a group in a lookbehind assertion; previously this caused an “assertion is not fixed length” error.
Perl always treats (?(DEFINE) as a “define” group, even if a group with the name “DEFINE” exists. PCRE2 now does likewise.
A recursion condition test such as (?(R2)…) must now refer to an existing subpattern.
A conditional recursion test such as (?(R)…) misbehaved if there was a group whose name began with “R”.
A hyphen appearing immediately after a POSIX character class (for example [[:ascii:]-z]) now generates an error. Perl does accept this as a literal, but gives a warning, so it seems best to fail it in PCRE.
An empty \Q\E sequence may appear after a callout that precedes an assertion condition (it is, of course, ignored).

One effect of the refactoring is that some error numbers and messages have changed, and the pattern offset given for compiling errors is not always the right-most character that has been read. In particular, for a variable-length lookbehind assertion it now points to the start of the assertion. Another change is that when a callout appears before a group, the “length of next pattern item” that is passed now just gives the length of the opening parenthesis item, not the length of the whole group. A length of zero is now given only for a callout at the end of the pattern. Automatic callouts are no longer inserted before and after explicit callouts in the pattern. * Back references are now permitted in lookbehind assertions when there are no duplicated group numbers (that is, (?| has not been used), and, if the reference is by name, there is only one group of that name. The referenced group must, of course be of fixed length.
Automatic callouts are no longer generated before and after callouts in the pattern.
A number of bugs have been mended relating to match start-up optimizations when the first thing in a pattern is a positive lookahead. These all applied only when PCRE2_NO_START_OPTIMIZE was *not* set:
A pattern such as (?=.*X)X$ was incorrectly optimized as if it needed both an initial 'X' and a following 'X'.
Some patterns starting with an assertion that started with .* were incorrectly optimized as having to match at the start of the subject or after a newline. There are cases where this is not true, for example, (?=.*[A-Z])(?=.{8,16})(?!.*[\s]) matches after the start in lines that start with spaces. Starting .* in an assertion is no longer taken as an indication of matching at the start (or after a newline).
A pattern with PCRE2_DOTALL (/s) set but not PCRE2_NO_DOTSTAR_ANCHOR, and which started with .* inside a positive lookahead was incorrectly being compiled as implicitly anchored.
Fix out-of-bounds read for partial matching of . against an empty string when the newline type is CRLF.
The appearance of \p, \P, or \X in a substitution string when PCRE2_SUBSTITUTE_EXTENDED was set caused a segmentation fault (nil dereference).
If the starting offset was specified as greater than the subject length in a call to pcre2_substitute an out-of-bounds memory reference could occur.
Incorrect data was compiled for a pattern with PCRE2_UCP set without PCRE2_UTF if a class required all wide characters to match (for example, [\s[:^ascii:]]).
The limit in the auto-possessification code that was intended to catch overly-complicated patterns and not spend too much time auto-possessifying was being reset too often, resulting in very long compile times for some patterns. Now such patterns are no longer completely auto-possessified.
Ignore PCRE2_CASELESS when processing \h, \H, \v, and \V in classes as it just wastes time. In the UTF case it can also produce redundant entries in XCLASS lists caused by characters with multiple other cases and pairs of characters in the same “not-x” sublists.


[/b]

[b] Only for V.I.P
Warning! You are not allowed to view this text.
SiteLock