\documentclass[11pt]{article}
\hyphenation{white-space}
\usepackage[utf8]{inputenc}
\usepackage[margin=0.9in]{geometry}
\renewcommand\arraystretch{2.7}
\renewcommand\tabcolsep{20pt}
\usepackage{array}
\usepackage{hyperref}
\usepackage{booktabs}
\usepackage{longtable}
\title{The Little Regular Expressionist}
\author{Vilja Hulden}
\date{August 2016 \\ v0.1b \\ CC-BY-SA 4.0}
\begin{document}
\setlength{\parindent}{0pt}
\setlength{\parskip}{1.5ex plus 0.5ex minus 0.2ex}
\maketitle

This little pamphlet, which is inspired by \emph{The Little Schemer} by Daniel Friedman and Matthias Felleisen, aims to serve as a gentle introduction to regular expressions.

You may want to cover the right half of the page, and only move the cover down answer by answer; that way you give yourself time to digest the question and maybe even come up with the answer (sometimes you will have enough information to at least take a guess, though not always).

This pamphlet is by no means exhaustive; there is much more to regular expressions. It's a good idea to test and play; \url{http://regexr.com/} has both a tester and reference resources, and a good comprehensive cheat sheet can be found at \url{https://www.cheatography.com/davechild/cheat-sheets/regular-expressions/}.

\section{Basics}
\begin{longtable}{m{2.5in}m{2.5in}}
Is \verb!a! a regular expression? & Yes, it matches the string \verb!a!.\\ \hline
Is \verb!ba*c! a regular expression? & Yes.\\ \hline
Does \verb!ba*c! match \verb!bac!? & Yes.\\ \hline
Does \verb!ba*c! match \verb!baaaac!? & Yes.\\ \hline
Does \verb!ba*c! match \verb!bc!? & Yes.\\ \hline
Hmm. So do you mean that \verb!ba*c! can be read as ``\verb!b! followed by zero or more \verb!a!'s and then a \verb!c!?'' & I do indeed.\\ \hline
So \verb!Mo*re! would match both \verb!More! and \verb!Moore!? & Yes, but it would of course also match \verb!Mre!.\\ \hline
% (and \verb!Moooore!, and \verb!Moooooooore!, and\ldots).\\ \hline
Oh. What if I don't want it to match \verb!Mre!?
& Use the plus (\verb!+!) instead of the star (\verb!*!).\\ \hline
Aha! So \verb!ba+c! matches \verb!bac! but not \verb!bc!? & Exactly.\\ \hline
OK, so does \verb!ba+c! match \verb!baaaac!? & Yes, because \verb!+! means ``one or more.''\\ \hline
And it also matches \verb!baac! and \verb!baaaaac! and \verb!baaaaaaaac! and \ldots & Yes, you've got the idea.\\ \hline
So \verb!ba+c! can be read as ``\verb!b! followed by one or more \verb!a!'s and then a \verb!c!?'' & Quite so.\\ \hline
OK, let me check a few more. Does \verb!b*a! match \verb!ba!? & Yes.\\ \hline
Does \verb!b*a! match \verb!a!? & Yes, because the star means you don't have to have a \verb!b!.\\ \hline
Does \verb!b+a+! match \verb!aaaa!? & No, because the plus means that you have to have at least one \verb!b!.\\ \hline
Does \verb!b+a! match \verb!bbbba!? & Yes, because you can have as many \verb!b!s as you like, as long as you have at least one.\\ \hline
Does \verb!b*a! match \verb!bbbb!? & No. The \verb!a! is not followed by a star, so it has to be there.\\ \hline
Does \verb!ba*c! match \verb!BAC!? & No, because the matches are case-sensitive.\\ \hline
Does \verb!b*a! match \verb!bbBBa!? & No, because the matches are case-sensitive! A lowercase \verb!b! only matches a lowercase \verb!b!, not an uppercase \verb!B!.\\ \hline
\label{Bb}Oh, OK. What if I want to match both? & We'll get to that, don't worry.\\ \hline
Fine. What if I want to match any character? & Use \verb!.! (the period).\\ \hline
Like this: \verb!b.c! to match \verb!bac!? & Yes.\\ \hline
Or the same thing, \verb!b.c!, to match \verb!bxc!? & Yes.\\ \hline
Does \verb!b.c! match \verb!baaaac! too? & No.\\ \hline
Does it match \verb!bxxxxc!? & No. The period only matches a single character.\\ \hline
Oh. So does \verb!b.*c! match \verb!baaaac!? & Yes!\\ \hline
Does \verb!b.*c! match \verb!bbbbbaaaac!? & Yes, because \verb!b! is also a character.\\ \hline
Does \verb!b.*k! match \verb!bark!? & Yes, because both \verb!a! and \verb!r!
are characters.\\ \hline
Does \verb!B.*K! match \verb!bark!? & No, because the matches are case-sensitive. \verb!B!~only matches an uppercase \verb!B!, not a lowercase \verb!b!.\\ \hline
OK. But does \verb!c.*p! match \verb!co-op!? & Yes, because \verb!-! (the hyphen) is a character too.\\ \hline
So \verb!c.*p! would match \verb!c&?$#e$p!? & Yes, those are all characters.\\ \hline
Does \verb!a.*p! match \verb!b635%#p!? & No, because there's no \verb!a!.\\ \hline
Does \verb!a.*p! match \verb!a&4>p!? & Yes.\\ \hline
What if I have a word that looks like this: \newline \verb!ab! \newline \verb!cd! \newline Will \verb!a.*d! match it? & No, the one character that \verb!.! does not match is the line break.\\ \hline
Oh, so it will only match a string if the string is all on one line? & Yes.\\ \hline
So let's say I have a line like this: \verb!Cats beat dogs! \newline I want to match the \verb!Cats! in that line. I write \verb!C.*s! to do that, right? & Actually, no.\\ \hline
What?? & \verb!C.*s! will match the whole line.\\ \hline
Why?? & Because \textbf{regular expressions are greedy.} They take everything they can.\\ \hline
Oh, so \verb!C.*s! will actually match the whole string \verb!Cats beat dogs! because that string ends with \verb!s! too? & Exactly.\\ \hline
So, let me make sure I've got all this right. & Please do.\\ \hline
What's the minimum number of \verb!a!'s a string has to have for the expression \verb!ba*c! to match? & Zero.\\ \hline
And what's the minimum number of \verb!a!'s a string has to have for the expression \verb!ba+! to match? & One.\\ \hline
Is there an upper limit to how many consecutive \verb!a!'s there can be in a string for the expressions \verb!ba*! or \verb!ba+! to match? & No.\\ \hline
What does \verb!.! match? & Any character except line breaks.\\ \hline
And, \textbf{a regular expression is greedy.} & Yes!!\\ \hline
Can I make it ``ungreedy?'' Can I match the \verb!Cats! in \verb!Cats beat dogs! somehow? & Yes, you can add \verb!?!
to the quantifier (the star or plus).\\ \hline
Like this: \verb!C.*?s!? & Exactly.\\ \hline
\end{longtable}
\bigskip
\begin{center}
\renewcommand\arraystretch{1}
\begin{tabular}{ll}\toprule
\multicolumn{2}{l}{Cheat sheet} \\\midrule
\verb!.! & any character (except a line break) \\ \hline
\verb!*! & zero or more \\ \hline
\verb!+! & one or more \\ \hline
\verb!*?! & zero or more, ungreedy \\ \hline
\verb!+?! & one or more, ungreedy \\ \bottomrule
\end{tabular}
\end{center}
\clearpage
\section{Alternatives}
\begin{longtable}{m{2.7in}m{2.7in}}
Does \verb!a|b! match \verb!a!? & Yes.\\ \hline
Does \verb!a|b! match \verb!b!? & Yes.\\ \hline
Does \verb!a|b! match \verb!c!? & No, there's nothing in the expression that could match \verb!c!.\\ \hline
Does \verb!a|b! match \verb!ab!? & No.\\ \hline
So \verb!a|b! means ``a or b''? & Yes!\\ \hline
Does \verb!a|b+! match \verb!abbbb!? & No, it only matches one or the other side of the \verb!|!, not both at the same time.\\ \hline
OK. So does \verb!a|b+! match \verb!bbbb!? & Yes.\\ \hline
And does \verb!a|b+! match \verb!a!? & Yes.\\ \hline
Does \verb!a|b+! match \verb!aaa!? & No, because there's only one \verb!a! in the expression.\\ \hline
Does \verb!a+|b+! match \verb!aaabbb!? & No, it still only matches one or the other side of the pipe!\\ \hline
Oh, right. So does \verb!a+|b+! match \verb!aaa!? & Yes.\\ \hline
Does \verb!a+|b+! match \verb!bbb!? & Yes.\\ \hline
So \verb!dog|cat! matches \verb!dog!? & Yes.\\ \hline
But does \verb!dog|cat! match \verb!dogs!? & No, there's no \verb!s! in the expression.\\ \hline
Oh yeah, that's true. But does \verb!dog|cat! match the \verb!dog! in \verb!dogs!? & Yes!\\ \hline
Does \verb!dog|cat! match \verb!docat!? & No, because the whole expression on one side of the \verb!|! (the pipe) has to match.\\ \hline
Ah, right. So does \verb!dog|cat! match the \verb!cat! in \verb!docat!? & Yes!\\ \hline
What if I want to match both \verb!dogat! and \verb!docat!? & Group the \verb!g|c!
with parentheses.\\ \hline
Like this: \verb!do(g|c)at!? & Yes. That matches \verb!dogat! as well as \verb!docat!.\\ \hline
Does \verb!dog|cats! match \verb!dogs!? & No, you can't combine the two sides of the pipe.\\ \hline
Does \verb!dog|cats! match \verb!cats!? & Yes, because \verb!cats! is all on one side of the pipe.\\ \hline
Does \verb!(dog|cat)s! match \verb!dogs!? & Yes, because now you've grouped the alternatives.\\ \hline
Does \verb!(dog|cat)s! match \verb!cats!? & Yes, because you've grouped the alternatives.\\ \hline
Does \verb!(dog|cat)s! match \verb!cat!? & No, because the \verb!s! has to be there.\\ \hline
Does \verb!(dog|cat)s*! match \verb!cat!? & Yes, because now you've made the \verb!s! optional.\\ \hline
%Does \verb!(dog|cat)s*! match \verb!dog!? & Yes.
\end{longtable}
\clearpage
\section{Character classes}
\begin{longtable}{m{2.7in}m{2.7in}}
Does \verb![abc]! match \verb!a!? & Yes, because \verb!a! is included in the class.\\ \hline
Does \verb![abc]! match \verb!b!? & Yes, because \verb!b! is included in the class.\\ \hline
Does \verb![abc]! match \verb!d!? & No, because \verb!d! is not included in the class.\\ \hline
Does \verb![abc]! match \verb!ab!? & No, because the class only represents one of its members at a time.\\ \hline
Does \verb![abc][abc]! match \verb!ab!? & Yes, because now there are two classes in a row.\\ \hline
Does \verb![abc][abc]! match \verb!cb!? & Yes, because the class can represent any one of its members.\\ \hline
Does \verb![abc]+! match \verb!ab!? & Yes! Very good.\\ \hline
Does \verb![abc]+! match \verb!ba!? & Yes.\\ \hline
Does \verb!b[abc]*! match \verb!ba!? & Yes.\\ \hline
Does \verb![abc]+! match \verb!bacab!? & Yes.\\ \hline
So if a class is followed by \verb!*! or \verb!+!, any number of the members of that class can be strung together in any order? & Yes (well, to be precise, zero or more for \verb!*!
and one or more for \verb!+!).\\ \hline
If a class contains all letters between a and h, do I have to list them all, like this: \verb![abcdefgh]!? & No, you can define the range like this: \verb![a-h]!.\\ \hline
So, \verb![a-z]! matches \verb!a!? & Yes.\\ \hline
And \verb![a-z]! matches \verb!g!? & Yes.\\ \hline
And \verb![a-z]! matches \verb!k!? & Yes. We could go on.\\ \hline
Let's not. But does \verb![a-z]! match \verb!A!? & No, the class is case-sensitive.\\ \hline
Does \verb![a-z]! match \verb!aa!? & No, because the class only represents one of its members at a time.\\ \hline
Does \verb![a-z]+! match \verb!abc!? & Yes, of course.\\ \hline
Does \verb![a-z]+! match \verb!bye!? & Yes, of course.\\ \hline
Does \verb![a-z]! match \verb!a-z!? & No, the expression defines a class, not a string.\\ \hline
Does \verb![a-k]! match \verb!k!? & Yes.\\ \hline
Does \verb![a-k]! match \verb!x!? & No, \verb!x! is not included in the range \verb!a-k!.\\ \hline
Does \verb![a-z]! match \verb!2!? & No, no digits are included in the range \verb!a-z!.\\ \hline
Does \verb![a-z]! match \verb!&!? & No, no punctuation marks are included in the range \verb!a-z!.\\ \hline
Does \verb![0-5]! match \verb!2!? & Yes.\\ \hline
Does \verb![0-5]! match \verb!7!? & No, \verb!7! is not included in the range \verb!0-5!.\\ \hline
Does \verb![0-5]! match \verb!b!? & No, no letters are included in the range \verb!0-5!.\\ \hline
Does \verb![a-z0-9]! match \verb!b!? & Yes.\\ \hline
Does \verb![a-z0-9]! match \verb!5!? & Yes.\\ \hline
Does \verb![a-z0-9]! match \verb!b5!? & No, because the class only represents one of its members at a time.\\ \hline
Does \verb![a-z0-9]+! match \verb!b5!? & Yes, because both \verb!b! and \verb!5! are members of the class and \verb!+! means one or more.\\ \hline
Does \verb![a-z0-9]+! match \verb!good4you!? & Yes.\\ \hline
Does \verb![a-z0-9]+! match \verb?good4you!? ? & No, because the exclamation mark is not a member of the class.\\ \hline
Does \verb?[a-z0-9!]+?
match \verb?good4you!? ? & Yes, because now you've added the exclamation mark to the class.\\ \hline
So if I put any characters inside square brackets, those characters become members of a class? & Yes.\\ \hline
Hey, couldn't I use this to match both uppercase and lowercase letters, like I wanted to do earlier (on page \pageref{Bb})? & Yes!\\ \hline
Like this: \verb![Bb]ob! to match both ``bob'' and ``Bob''? & Yes!\\ \hline
Or different spellings, like \verb!gr[ae]y! to match both \verb!grey! and \verb!gray!? & Absolutely!\\ \hline
OK, so by saying \verb![0-9]! I can match any digit. & That's right.\\ \hline
But what if I want to match anything \emph{except} a digit? Do I have to make a class that lists everything that's not a digit? & No, silly, of course not.\\ \hline
So how do I match anything except a digit? & You negate the digit class with \verb!^! (a~caret).\\ \hline
Oh, like this: \verb![^0-9]!? & Exactly.\\ \hline
\end{longtable}
\bigskip
\begin{center}
\renewcommand\arraystretch{1}
\begin{tabular}{ll}\toprule
\multicolumn{2}{l}{\textbf{Cheat sheet}}\\ \midrule
\verb!a|b! & a or b \\
\verb![abc]! & a or b or c \\
\verb![a-z]! & any lowercase letter in the range a-z \\
\verb![A-Z]! & any uppercase letter in the range A-Z \\
\verb![Bb]! & uppercase or lowercase b \\
\verb![0-9]! & any digit in the range 0-9 \\
\verb![0-4]! & any digit in the range 0-4 \\
\verb![^246]! & not 2, 4, or 6 \\
\verb![^0-9]! & not a digit \\ \bottomrule
\end{tabular}
\end{center}
\clearpage
\section{Special characters and shorthands}
\begin{longtable}{m{2.7in}m{2.7in}}
So if a period is short for ``any character,'' then how do I match a period and nothing else? & Good question! You have to ``escape'' it with a backslash, like this: \verb!\.!\\ \hline
So \verb!\.org! would only match \verb!.org!, not, say, \verb!borg!? & That's right.\\ \hline
Is it the same for \verb!*! and \verb!+!? & Yes.
The period, star, and plus are all special characters with a special meaning. You have to escape them to make them represent the literal character.\\ \hline
So writing \verb!\*borg\.org\+! would only match \verb!*borg.org+! and nothing else? & That's right.\\ \hline
Hey, could I also match \verb!.org! by saying \verb![.]org!? & Yes! Inside a character class, most special characters lose their special meaning; the ones to watch for are \verb!-! (the hyphen) and \verb!^! (the caret).\\ \hline
What if I want to include the hyphen in a character class? & You put it first, since then it can't define a range.\\ \hline
So \verb![-a-z]+! would match \verb!bye-bye!? & Yes.\\ \hline
And the caret? & You put it anywhere but first, since it only negates if it's the first character.\\ \hline
So \verb![a-z^]+! would match \verb!o^o!? & Yep.\\ \hline
What if I want to match all whitespace? Do I make a character class? & You could, or you can just say \verb!\s! to cover spaces, tabs, and the various flavors of newlines.\\ \hline
So \verb!kitty\s*cat! would match both \verb!kittycat! and \verb!kitty cat!? & Exactly.\\ \hline
Are there more shorthands like that? & You bet. Too many to list here.\\ \hline
Is there a shorthand for ``all digits''? & Yes! It's \verb!\d!.\\ \hline
So \verb!\d! matches the same thing as \verb![0-9]!? & Exactly.\\ \hline
Is there another way to say \verb![^0-9]!, too? & Yes, \verb!\D!.\\ \hline
Oh! So is \verb!\S! then ``any character except whitespace''? & It is!\\ \hline
Speaking of special characters, does the caret mean anything outside a character class? & Yes, it means ``beginning of line.''\\ \hline
So \verb!^dog! would find all lines beginning with \verb!dog!? & Exactly.\\ \hline
Is there a character for ``end of line,'' too? & Yes, \verb!$!.\\ \hline
\end{longtable}
\bigskip
\section{Grouping and substitution}
\begin{longtable}{m{2.7in}m{2.7in}}
Say I want to match \verb!kittykittykitty! as well as \verb!kittykitty!. Can I do that with a plus, like I can match \verb!aa! and \verb!aaa!
with \verb!a+!? & Yes! You can make \verb!kitty! a single group by putting it in parentheses.\\ \hline
Like this: \verb!(kitty)+!? & Exactly.\\ \hline
So if I replace \verb!(kitty)+! with \verb!doggie!, do I get \verb!doggiedoggiedoggie!? & No, you get \verb!doggie!, because \verb!(kitty)+! matches the whole string, no matter how many times \verb!kitty! repeats.\\ \hline
Oh. So actually to replace \verb!kittykittykitty! with \verb!doggiedoggiedoggie!, I should just replace \verb!kitty! with \verb!doggie!? & Yes.\\ \hline
Can I also say \verb!(kitty|doggie)+! and match \emph{either} \verb!kittykittykitty! or \verb!doggiedoggiedoggie!? & Yes, of course you can. And that will of course also match \verb!kitty! and \verb!doggiedoggie! and so on.\\ \hline
Right. So what if I wanted to replace \verb!kittykitty! with \verb!here-kittykitty! and \verb!doggiedoggiedoggie! with \verb!here-doggiedoggiedoggie!? Can I do that with one expression? & Yes; you can use a backreference. Any grouped expression is saved and numbered and can be accessed by \verb!$1!, \verb!$2!, and so on.\\ \hline
So replacing \verb!That's a cute (kitty)! with \verb!I like that $1! would produce \verb!I like that kitty!? & Absolutely.\\ \hline
So then can I replace \verb!(kitty|doggie)+! with \verb!here-$1! to get \verb!here-doggiedoggie! and so on? & No, because now the grouped part only has \verb!kitty! or \verb!doggie! once.\\ \hline
Oh, so I'll always get \verb!here-kitty! or \verb!here-doggie! and never \verb!here-doggiedoggie!. & Yes; you have to include the plus in a group.\\ \hline
Would this work: \verb!((kitty|doggie)+)!? & It would.\\ \hline
\end{longtable}\enlargethispage{\baselineskip}
%\bigskip
\begin{center}
\renewcommand\arraystretch{1}
\begin{tabular}{ll}\toprule
\multicolumn{2}{l}{Cheat sheet} \\\midrule
\verb!\! & escape character \\ \hline
\verb!\.! & period (literal) \\ \hline
\verb!\s! & any whitespace character (space, tab, newline) \\ \hline
\verb!\d!
& any digit \\ \hline
\verb!\S! & any non-whitespace character \\ \hline
\verb!\D! & any non-digit character \\ \hline
\verb!(ab)*! & ab zero or more times \\ \hline
\verb!$1! & \verb!cat! in \verb!kitty(cat)! \\ \hline
\verb!$2! & \verb!s! in \verb!kitty(cat)(s)! \\ \hline
\verb!$1! & \verb!kittycat! in \verb!(kitty(cat))! \\ \hline
\verb!$2! & \verb!cat! in \verb!(kitty(cat))! \\ \bottomrule
\end{tabular}
\end{center}
%Maybe like this:
\end{document}