SCoRE: Search a Corpus for Regular Expressions

Help

Search String | Sentences/Turns | Case/Punctuation Sensitivity | Included/Excluded Files | Directory Restriction | First/Last Sentence/Turn | BNC PoS Tag Scheme


Search String

At its simplest, the search string can be a list of words. This will match any sentence/turn which contains these words, adjacent to each other, in the same order.

Word Adjacency

Words required to be adjacent should be separated by a space; words that may be non-adjacent should be separated by a comma and a space. For example, a dialogue containing
   "... a xyz b ..."
would be matched by the search string "a, b", but not by "a b".
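
The adjacency rule can be illustrated with a small Python sketch. This is not SCoRE's own code: the function names parse_terms and matches are hypothetical, and the sentence/turn is assumed to be already split into a list of word tokens, with plain words only (no wildcards or tags).

```python
def parse_terms(search):
    """Split a search string into (word, gap_allowed_after) pairs.

    A trailing comma on a term means the next term need not be adjacent.
    """
    terms = []
    for chunk in search.split():
        if len(chunk) > 1 and chunk.endswith(","):
            terms.append((chunk[:-1], True))
        else:
            terms.append((chunk, False))
    return terms


def matches(search, tokens):
    """Return True if the tokens contain the search terms in order."""
    terms = parse_terms(search)

    def match_at(term_idx, tok_idx, gap_allowed):
        if term_idx == len(terms):
            return True
        word, gap_after = terms[term_idx]
        # If adjacency is required, only the very next token may be tried.
        last = len(tokens) if gap_allowed else min(tok_idx + 1, len(tokens))
        for i in range(tok_idx, last):
            if tokens[i] == word and match_at(term_idx + 1, i + 1, gap_after):
                return True
        return False

    # The first term may start anywhere in the sentence/turn.
    return match_at(0, 0, True)


tokens = "a xyz b".split()
print(matches("a, b", tokens))  # True
print(matches("a b", tokens))   # False
```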

Wild Cards

Wild cards are allowed, but must be specified as "DOS-style" wildcards rather than in Unix regular expression syntax:
   "*" matches zero or more alphanumeric chars,
   "+" matches one or more,
   "?" matches exactly one.

For example,
   "... axy b ..."
will be matched by "a* b" or "a+ b", but not by "a? b".
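
For readers more used to regular expressions, the three wildcards can be read as in the Python sketch below. It is an illustration only, not the search engine's implementation: the names wildcard_to_regex and term_matches are hypothetical, and "alphanumeric" is assumed to mean ASCII letters and digits.

```python
import re

# Assumed regex equivalents of the three wildcards.
WILDCARDS = {
    "*": "[A-Za-z0-9]*",  # zero or more alphanumeric chars
    "+": "[A-Za-z0-9]+",  # one or more
    "?": "[A-Za-z0-9]",   # exactly one
}


def wildcard_to_regex(term):
    """Translate one search term into a compiled regular expression."""
    parts = [WILDCARDS.get(ch, re.escape(ch)) for ch in term]
    return re.compile("".join(parts))


def term_matches(term, word):
    """Return True if the whole word is matched by the term."""
    return wildcard_to_regex(term).fullmatch(word) is not None


print(term_matches("a*", "axy"))  # True
print(term_matches("a+", "axy"))  # True
print(term_matches("a?", "axy"))  # False
```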

Punctuation

Punctuation marks can be included as part of the string, but must be separated by spaces from other parts of the string. This is partly due to the BNC tagging scheme, but also to distinguish between the use of "?" as a wildcard and as a punctuation mark. For example, a search for sentences ending "why?" requires the search string
   "why ?", rather than
   "why?" (which will match four-letter words beginning with "why").

Parts-of-Speech

Putting a term in <> brackets means it will match any word in a given PoS category (rather than a specific word). Details of the BNC PoS tagging scheme are given below. Wildcards can be used within the tag specification:
   "<AVQ>" would match any wh-adverb (when, how, why)
   "<??Q>" would match any wh-question word.
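
To see which tags a tag specification covers, the wildcards can be applied to the tag names themselves. The Python sketch below does this for a handful of tags copied from the scheme listed further down; the function names are hypothetical and this is not how SCoRE itself is implemented.

```python
import re

# A few tags copied from the BNC PoS tag scheme on this page.
TAGS = ["AJ0", "AV0", "AVP", "AVQ", "DT0", "DTQ", "NN1", "PNP", "PNQ", "VVB"]

# Assumed reading of the wildcards when applied to tag names.
WILDCARDS = {"*": "[A-Z0-9]*", "+": "[A-Z0-9]+", "?": "[A-Z0-9]"}


def tag_term_to_regex(term):
    """Turn a term such as "<??Q>" into a compiled regex over tag names."""
    if not (term.startswith("<") and term.endswith(">")):
        raise ValueError("a PoS term must be written in <> brackets")
    body = term[1:-1]
    return re.compile("".join(WILDCARDS.get(ch, re.escape(ch)) for ch in body))


def matching_tags(term, tags=TAGS):
    """List the tags that a bracketed tag specification would cover."""
    pattern = tag_term_to_regex(term)
    return [tag for tag in tags if pattern.fullmatch(tag)]


print(matching_tags("<AVQ>"))   # ['AVQ']
print(matching_tags("<??Q>"))   # ['AVQ', 'DTQ', 'PNQ']
```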

Optional Characters

Optional words or characters can be contained within square brackets.
   "[good]bye" matches both "... goodbye ..." and "... bye ...".
   "hello [there]" matches both "... hello ..." and "... hello there ...".
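
One way to think about square brackets is that each bracketed part doubles the number of plain strings being searched for. The Python sketch below expands a term into those alternatives; expand_optional is a hypothetical helper written for this illustration, and nested brackets are not handled.

```python
import itertools
import re


def expand_optional(term):
    """Expand "[...]" parts into every plain string the term stands for."""
    # The capturing group keeps the bracketed pieces in the split result.
    pieces = re.split(r"(\[[^\]]*\])", term)
    options = []
    for piece in pieces:
        if piece.startswith("[") and piece.endswith("]"):
            options.append(["", piece[1:-1]])  # absent or present
        elif piece:
            options.append([piece])
    expanded = ("".join(combo) for combo in itertools.product(*options))
    # Normalise any spaces left behind by a dropped optional word.
    return [" ".join(text.split()) for text in expanded]


print(expand_optional("[good]bye"))      # ['bye', 'goodbye']
print(expand_optional("hello [there]"))  # ['hello', 'hello there']
```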

Sentence/Turn Beginnings/Ends

"^" matches the beginning of a sentence/turn, "$" matches the end (including an optional punctuation mark). Again, these characters must be separated by spaces from other search terms. "^ what, ? $" would match any sentence/turn beginning with "what" and ending with a "?".

Sentence/Turn Breaks

"|" (the pipe character) matches the break between sentences/turns. Together with the "\n" variables (see Variables below), this allows searches for repeats: "* | ^ \1 ? $" would match single-word questions which repeat a word in the previous sentence/turn (such as clarification ellipsis). Beware, though: searches like this, where the first term can be matched by anything, will be slow.

Variables

"\n" matches whatever matched the nth term in the string. All terms count here, so if string 1 were "^ wh*, ? $", "\1" would match the "^" (not particularly useful), and "\2" would match whichever word the "wh*" matched (more useful).
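
The numbering of terms can be checked with a short Python sketch. The helpers number_terms and resolve are hypothetical names used only for this illustration, and the bindings dictionary stands in for whatever each term matched during a search.

```python
def number_terms(search):
    """Number every term in a search string, starting from 1."""
    numbered = {}
    for n, chunk in enumerate(search.split(), start=1):
        # A trailing comma marks non-adjacency and is not part of the term.
        if len(chunk) > 1 and chunk.endswith(","):
            chunk = chunk[:-1]
        numbered[n] = chunk
    return numbered


def resolve(term, bindings):
    """Replace a back-reference such as "\\2" by the text bound to term 2."""
    if len(term) > 1 and term[0] == "\\" and term[1:].isdecimal():
        return bindings[int(term[1:])]
    return term


print(number_terms("^ wh*, ? $"))    # {1: '^', 2: 'wh*', 3: '?', 4: '$'}
print(resolve("\\2", {2: "what"}))   # what
```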

Sentences/Turns

This determines whether the scope of a simple search string is a single sentence or a whole speaker turn (there may be more than one sentence within a turn). It also determines whether "|" breaks (see above) match the end of individual sentences or speaker turns.

Case/Punctuation Sensitivity

Case Sensitivity

If checked, only case-sensitive matches are allowed - this may prevent your search string from matching in sentence-initial positions.

Punctuation Sensitivity

If checked, punctuation is not allowed to separate words required to be adjacent (i.e. those not followed by "," in the search string).

Included/Excluded Files

Included Files

The search will ONLY consider those files that have been defined as belonging to the categories selected here (so at least one must be selected). These categories are broad descriptions: "monologue" and "dialogue". Non-spoken texts are currently unavailable but will be added in later versions.

Excluded Files

The search will NOT consider those files that have been defined as belonging to the sub-categories selected here. These sub-categories define contextual parameters for spoken files.

Directory Restriction

OPTIONALLY restricts the search to a particular subdirectory of the BNC. This is useful when testing out a new search string, as full searches take a long time. K and L are good choices for context-free conversation, as are their subdirectories (such as K/KB).

First/Last Sentence/Turn

OPTIONALLY restricts the search to a particular section of each file (leaving these fields blank means the search considers the whole file). This can be used to select a small sub-corpus while still maintaining a broad range of contexts.

BNC PoS Tag Scheme

AJ0  adjective (general or positive), e.g. `good', `old'
AJC  comparative adjective, e.g. `better', `older'
AJS  superlative adjective, e.g. `best', `oldest'
AT0  article, e.g. `the', `a', `an', `no'
AV0  adverb (general, not sub-classified as AVP or AVQ), e.g. `often', `well', `longer', `furthest'
AVP  adverb particle, e.g. `up', `off', `out'
AVQ  wh-adverb, e.g. `when', `how', `why'
CJC  coordinating conjunction, e.g. `and', `or', `but'
CJS  subordinating conjunction, e.g. `although', `when'
CJT  the subordinating conjunction `that', when introducing a relative clause, as in ``the day that follows Christmas''
CRD  cardinal numeral, e.g. `one', `3', `fifty-five', `6609'
DPS  possessive determiner form, e.g. `your', `their', `his'
DT0  general determiner: a determiner which is not a DTQ, e.g. `this' both in ``This is my house'' and ``This house is mine''
DTQ  wh-determiner, e.g. `which', `what', `whose'
EX0  existential `there', the word `there' appearing in the constructions ``there is ...'', ``there are ...''
ITJ  interjection or other isolate, e.g. `oh', `yes', `mhm', `wow'
NN0  common noun, neutral for number, e.g. `aircraft', `data', `committee'
NN1  singular common noun, e.g. `pencil', `goose', `time', `revelation'
NN2  plural common noun, e.g. `pencils', `geese', `times', `revelations'
NP0  proper noun, e.g. `London', `Michael', `Mars', `IBM'
ORD  ordinal numeral, e.g. `first', `sixth', `77th', `next', `last'
PNI  indefinite pronoun, e.g. `none', `everything', `one' (pronoun), `nobody'
PNP  personal pronoun, e.g. `I', `you', `them', `ours'
PNQ  wh-pronoun, e.g. `who', `whoever', `whom'
PNX  reflexive pronoun, e.g. `myself', `yourself', `itself', `ourselves'
POS  the possessive or genitive marker `'s' or `''
PRF  the preposition `of'
PRP  preposition other than `of', e.g. `about', `at', `in', `on behalf of', `with'
TO0  the infinitive marker `to'
UNC  unclassified items which are not appropriately classified as items of the English lexicon
VBB  the present tense forms of the verb `be', except for `is' or `'s', i.e. `am', `'m', `are', `'re', `be', `ai' (as in `ain't')
VBD  the past tense forms of the verb `be', i.e. `was', `were'
VBG  the -ing form of the verb `be', i.e. `being'
VBI  the infinitive form of the verb `be', i.e. `be'
VBN  the past participle form of the verb `be', i.e. `been'
VBZ  the -s form of the verb `be', i.e. `is', `'s'
VDB  the finite base form of the verb `do', i.e. `do'
VDD  the past tense form of the verb `do', i.e. `did'
VDG  the -ing form of the verb `do', i.e. `doing'
VDI  the infinitive form of the verb `do', i.e. `do'
VDN  the past participle form of the verb `do', i.e. `done'
VDZ  the -s form of the verb `do', i.e. `does'
VHB  the finite base form of the verb `have', i.e. `have', `'ve'
VHD  the past tense form of the verb `have', i.e. `had', `'d'
VHG  the -ing form of the verb `have', i.e. `having'
VHI  the infinitive form of the verb `have', i.e. `have'
VHN  the past participle form of the verb `have', i.e. `had'
VHZ  the -s form of the verb `have', i.e. `has', `'s'
VM0  modal auxiliary verb, e.g. `can', `could', `will', `'ll', `'d', `wo' (as in `won't')
VVB  the finite base form of lexical verbs, e.g. `forget', `send', `live', `return'
VVD  the past tense form of lexical verbs, e.g. `forgot', `sent', `lived', `returned'
VVG  the -ing form of lexical verbs, e.g. `forgetting', `sending', `living', `returning'
VVI  the infinitive form of lexical verbs, e.g. `forget', `send', `live', `return'
VVN  the past participle form of lexical verbs, e.g. `forgotten', `sent', `lived', `returned'
VVZ  the -s form of lexical verbs, e.g. `forgets', `sends', `lives', `returns'
XX0  the negative particle `not' or `n't'
ZZ0  alphabetical symbols, e.g. `A', `a', `B', `b', `c', `d'

This page, search engine © Matthew Purver, 2001
Problems, comments, suggestions: matthew.purver@kcl.ac.uk