(12) United States Patent
Dill et al.

(10) Patent No.: US 7,730,013 B2
(45) Date of Patent: Jun. 1, 2010

(54) SYSTEM AND METHOD FOR SEARCHING DATES EFFICIENTLY IN A COLLECTION OF WEB DOCUMENTS

(75) Inventors: Stephen Dill, San Jose, CA (US); Madhukar R. Korupolu, Sunnyvale, CA (US)

(73) Assignee: International Business Machines Corporation, Armonk, NY (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 248 days.

(21) Appl. No.: 11/259,664
(22) Filed: Oct. 25, 2005

(65) Prior Publication Data
US 2007/0094246 A1, Apr. 26, 2007

(51) Int. Cl.
G06F 7/30 (2006.01)
G06F 7/00 (2006.01)
(52) U.S. Cl.: 707/5; 704/1; 704/8; 704/9
(58) Field of Classification Search: 707/5, 707/6, 102, 105, 999.005, 999.006, 999.102, 707/999.105; 704/1, 8, 9
See application file for complete search history.
Primary Examiner: James Trujillo
Assistant Examiner: Dawaune Conyers
(74) Attorney, Agent, or Firm: Samuel A. Kassatly; Shimokaji & Associates, P.C.
`
(57) ABSTRACT

A date querying system processes free-form text in documents to identify and locate some or all of the dates in the documents using extended regular expression matching to capture various date formats. The system packages a canonicalized format of each identified date to support various types of queries such as, for example, specific date querying, hierarchical date querying, range date querying, proximity queries comprising a date and any keywords, and any combination of types of queries. The system scans a document to identify the various format dates occurring in the document, disambiguates the resulting occurrences of dates, and canonicalizes the dates according to one or more predetermined formats.
`
`25 Claims, 6 Drawing Sheets
`
`
`
`Bright Data Ltd. v. Oxylabs, UAB
`IPR2023-01442 | Oxylabs EX2011
`Page 1 of 14
`
`
`
`
`
`
[Sheet 1 of 6: FIG. 1, drawing not reproduced]
`
`
`
`
`
[Sheet 2 of 6: FIG. 2, block diagram of the date query system 10]
EXTENDED MATCHING MODULE: DATE MODULE 215, PREFIX MODULE, SUFFIX MODULE, DISAMBIGUATOR, CANONICALIZER 235
PACKAGING MODULE 210
OUTPUT 245
`
`
`
[Sheet 3 of 6: FIG. 3, process flow chart]
400: Extended matching module uses extended regular expression matching to identify dates in input and generate a canonicalized format for each identified date (method 400, FIG. 4)
305: Packaging module packages the canonicalized format of each identified date
310: Output occurrences of the queried date based on packaging
`
`
`
[Sheet 4 of 6: FIG. 4A, process flow chart of method 400]
405: Select a document
410: Date module scans selected document for occurrences of a date in numeric format or a date comprising a month name in alphabetic format
415: Found date comprises month name? (YES leads to 420)
420: Prefix module finds a numeric prefix part of a date preceding the found month name
425: Suffix module finds a numeric suffix part of a date following the found month name
430: Extended matching module determines the format of a date occurring with the alphabetic month name by correlating the prefix part with the suffix part
`
`
`
[Sheet 5 of 6: FIG. 4B, continuation of method 400]
435: Disambiguator disambiguates the format of each of the occurrences of a date
440: Canonicalizer creates canonicalized formats for each of the occurrences of a date
`
`
`
[Sheet 6 of 6: FIG. 5, exemplary sample results 500 for "all dates in October 2004"]
505: Menu bar
510: Search control button(s)
515: 1. Title 1; "... from the October 19 2004 ..."; URL1; crawl date
520: 2. Title 2; "Headlines Oct 26 2004 ..."; URL2; crawl date
525: 3. Title 3; "Last Updated October 14, 04"; URL3; crawl date
530: 4. Title 4; "article text Oct 18 2004 ..."; URL4; crawl date
535: 5. Title 5; "Published 10.14.2004 ..."; URL5; crawl date
`
`
`
SYSTEM AND METHOD FOR SEARCHING DATES EFFICIENTLY IN A COLLECTION OF WEB DOCUMENTS
`
`FIELD OF THE INVENTION
`
The present invention generally relates to text analysis of electronic documents. More specifically, the present invention relates to identifying dates in electronic documents in which dates occur in various formats and further relates to packaging the dates uniformly for purposes of querying.
`
`BACKGROUND OF THE INVENTION
`
Searching for dates is a useful primitive in understanding and extracting relevant pieces from large collections of documents. Locating a source date for content on the web is especially useful in determining relevancy to a search request comprising a date. However, efficiently performing a query for dates is challenging since dates tend to occur in various formats in unstructured text.

For example, the date October 11, 2004 can occur in text as 11th of October 2004, 11-10-2004, 11 October, '04, Oct. 11th '04, 11/10/04, 10.11.2004, 2004 Oct 11, etc. Variations in date expression can be even more pronounced on a diversified collection such as the web, where many different people and organizations write web content such as free-form text. This is a natural consequence of the decentralized nature of the web and the few rigid requirements imposed on free-form text.

Nevertheless, the free-form text on the web is an important source of information, both current and archived. Newspapers and magazines provide news articles online on the web; an estimate for news sources on the web is over 10,000. Covering a range of topics, these news articles cater to the needs of both businesses and individuals. Moreover, organizations such as companies and universities post a wealth of information online. Some search engine sites estimate the number of web pages indexed at over 8 billion.

Given the large number of sources and the large number of pages on the web, the need for automated techniques for searching and navigating such a large collection is increasing. Dates are an important means to understand the temporal context of the information found near the dates or on the same web page as the dates. Queries such as:
Show all pages that mention a particular date D (e.g., 11 Oct 2004),
Show all pages that mention any date in a given month (e.g., Oct 2004), or
Show all pages that mention any date in a given year (e.g., 2004) with one or more keywords with a specified context such as "on the same page", "on the same line", etc.
are natural and useful ways to filter and navigate such large collections of pages.

Although conventional web search engines perform well using standard keyword and proximity searches, it would be desirable to present additional improvements. Conventional web search engines do not adequately search by dates. Even a basic date query such as "find all pages that mention 11th October 2004" requires a separate search for each possible date format. Such a search is tedious and unmanageable since the number of possible date formats is sizeable. Furthermore, some formats such as 11.10.2004 are difficult to search because some search engines ignore the numbers and periods in a date format if they occur frequently.
`
Searching on dates using a conventional web search engine becomes more unmanageable for hierarchical date queries such as "find all pages that mention any date in October 2004".

Conventional web search engines have further difficulty searching for dates in ambiguous format. For example, 11.10.2004 can mean either 11th October 2004 or 10th November 2004, depending on the context. The ambiguity is further compounded when the year is specified as a two-digit number and the month, day, and year are similar in value (for example, 01/04/05).

Another conventional approach for finding a source date finds a single date for each page, representing when the page may have been written, i.e., a date-of-page. However, this date-of-page may not exist for all web pages. A date-of-page is typically not well defined and is usually a best guess based on different dates that appear on the page or in the HTTP header of the page. Furthermore, this conventional approach still retains only one date per page even when a page contains additional dates. Consequently, the information about other dates is lost, including the locations of the other dates for proximity queries.

A further conventional approach, which identifies named entities such as different forms in which a keyword can be referenced in text, lists all possible alternatives explicitly. This conventional approach works well in cases where the number of variants is small. However, in the context of locating source dates on the web, the large number of possible formats for each date and the large number of possible distinct dates renders this approach cumbersome. Consequently, regular expression-based spotting is a better alternative for dates.

Yet another conventional approach comprises a natural single-step regular expression matching. In particular contexts such as weblogs (also known as blogs), this conventional approach addresses identification of dates to some extent based on the structure of blogs. However, this conventional approach does not address the wide range of possible formats for dates that appear on the web and the resulting disambiguation required to identify dates. Furthermore, efficiency and processing time become serious issues for this conventional approach considering the large number of possible formats and the large number of pages requiring processing.

What is therefore needed is a system, a computer program product, and an associated method for searching dates efficiently in a large collection of web documents. The need for such a solution has heretofore remained unsatisfied.
`
`SUMMARY OF THE INVENTION
`
The present invention satisfies this need, and presents a system, a computer program product, and an associated method (collectively referred to herein as "the system" or "the present system") for searching dates efficiently in a large collection of web documents.

A date matching module of the present system processes free-form text in documents to identify and locate some or all of the dates in the documents using extended regular expression matching to capture various date formats. A packaging module of the present system packages a canonicalized format of each identified date to support various types of queries such as, for example, specific date querying, hierarchical date querying, range date querying, proximity date querying, proximity queries comprising a date and any keywords, and any combination of types of queries.

The date module scans a document for some or all occurrences of dates, searching for numerical dates and month names in alphabetic format. If a month name is found, a prefix module applies prefix regular expression matching to a prefix substring preceding the found month name to identify a prefix part of a date, that is, a portion of the date preceding the month name. A suffix module applies suffix regular expression matching to a suffix substring following the found month name to identify a suffix part of a date, that is, a portion of the date following the month name. The date matching module determines one or more formats for a date corresponding to the found month name by correlating the prefix part and the suffix part. The date matching module generates a date in the selected format(s) from the found month, the prefix part, and the suffix part.

A disambiguator of the present system disambiguates found occurrences of dates comprising either a found numerical date or the date generated by the date matching module. Disambiguation is desired for dates with a day, month, or year that cannot easily be discerned. A canonicalizer formats the disambiguated occurrences of dates in one or more canonical forms.
`
BRIEF DESCRIPTION OF THE DRAWINGS

The various features of the present invention and the manner of attaining them will be described in greater detail with reference to the following description, claims, and drawings, wherein reference numerals are reused, where appropriate, to indicate a correspondence between the referenced items, and wherein:

FIG. 1 is a schematic illustration of an exemplary operating environment in which a date searching system of the present invention can be used;
FIG. 2 is a block diagram of the high-level architecture of the date searching system of FIG. 1;
FIG. 3 is a process flow chart illustrating a method of operation of the date searching system of FIGS. 1 and 2;
FIG. 4 is comprised of FIGS. 4A and 4B and represents a process flow chart illustrating a method of operation of the date matching module of the date searching system of FIGS. 1 and 2; and
FIG. 5 is a diagram illustrating an exemplary screen shot of a response to a query by a user submitted to a search engine comprising the date searching system of FIGS. 1 and 2.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

The following definitions and explanations provide background information pertaining to the technical field of the present invention, and are intended to facilitate the understanding of the present invention without limiting its scope:

Free-form text: Unstructured text such as the input to a word processor or text editor comprising, for example, words, sentences, dates, numbers, etc.

HTML (Hypertext Markup Language): A standard language for attaching presentation and linking attributes to informational content within documents. During a document authoring stage, HTML "tags" are embedded within the informational content of the document. When a web server transmits the web document (or "HTML document") to a web browser, the tags are interpreted by the browser and used to parse and display the document. In addition to specifying how the web browser is to display the document, HTML tags can be used to create hyperlinks to other web documents.

Internet: A collection of interconnected public and private computer networks that are linked together with routers by a set of standard protocols to form a global, distributed network.

World Wide Web (WWW, also Web): An Internet client-server hypertext distributed information retrieval system.

FIG. 1 portrays an exemplary overall environment in which a system, a service, a computer program product, and an associated method (the "date searching system 10" or the "system 10") for searching dates efficiently in a collection of web documents according to the present invention may be used. A text analysis system 15 comprises system 10 and a search engine 20. The text analysis system 15 analyzes documents obtained from a source such as, for example, the WWW, for data analysis, trend discovery, etc. The text analysis system 15 comprises search functionalities provided by the search engine 20. The text analysis system 15 is installed on a computer such as the host server 25.

System 10 comprises a software programming code or a computer program product that is typically embedded within, or installed on, the host server 25. Alternatively, system 10 can be saved on a suitable storage medium such as a diskette, a CD, a hard drive, or like devices. A database 30 (dB 30) comprises documents from sources such as the WWW. While the system 10 will be described in connection with the WWW, the system 10 can be used with a stand-alone dB 30 of content that may have been derived from the WWW or other sources.

Users, such as remote Internet users, are represented by a variety of computers such as computers 35, 40, 45, and can access the host server 25 through a network 50. Computers 35, 40, 45 each comprise software that allows the user to interface securely with the host server 25. The host server 25 is connected to network 50 via a communications link 55 such as a telephone, cable, or satellite link. Computers 35, 40, 45 can be connected to network 50 via communications links 60, 65, 70, respectively. While system 10 is described in terms of network 50, computers 35, 40, 45 may also access system 10 locally rather than remotely. Computers 35, 40, 45 may access system 10 either manually, or automatically through the use of an application. Users query data on dB 30 via network 50 and the search engine 20.

FIG. 2 illustrates a high-level hierarchy of system 10. System 10 comprises an extended matching module 205 and a packaging module 210. The extended matching module 205 comprises a date module 215, a prefix module 220, a suffix module 225, a disambiguator 230, and a canonicalizer 235.

An input 240 comprises documents that have no HTML tags such as, for example, any content derived from the WWW that has been de-tagged. Data in dB 30 is processed using standard de-tagging methods to remove the HTML tags from the crawled HTML content of the document. Removing HTML tags simplifies a process of spotting a date by removing the complexity of HTML tags interleaved within a date string. Removing HTML tags requires reconstruction of the span and location of the occurrence of a date; span and location are required for queries comprising proximity. An output 245 generated by system 10 comprises the span and location of each identified date to support proximity querying.

FIG. 3 illustrates a method 300 of system 10. The extended matching module 205 processes documents in dB 30 to identify and locate some or all of the dates in the document (step 400, further illustrated in method 400 of FIG. 4). The extended matching module 205 uses extended regular expression matching to capture various date formats.

In free-form text, dates can occur either numerically or alphanumerically. Alphanumeric dates occur in forms such as, for example:
October 11 hh:mm:ss EST 2004
October 11 hh:mm:ss 2004
11 October 2004
October 11 2004
2004 October 11.
In each of these exemplary alphanumeric formats, additional variants result when a day 11 of a month is written with a superscripted "th" as 11th. In general, a day of a month can be written as 1st, 2nd, 3rd, 4th, 5th, etc., or as 1, 2, 3, 4, 5, etc.

The number of possible text formats in which any date may be written alphanumerically is non-trivial. A date format comprises a year format, a month format, a day format, and a separator format.
Text representations of years comprise a variety of year formats. For example, a year such as 2004 can be written as 2004, 04, or '04. In general, the year can occur as a full four-digit integer (e.g., 2004), as a two-digit integer (e.g., 04), or as a two-digit integer with a preceding apostrophe (e.g., '04). For disambiguation purposes, system 10 uses a convention specifying that years before 1970 are fully specified with four digits. System 10 interprets two-digit years from 00 to 69 as 2000 to 2069. System 10 interprets two-digit years from 70 to 99 as 1970 to 1999.
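By way of a non-limiting illustration, the two-digit year convention described above can be sketched as follows. The function name `expand_year` and its error handling are assumptions made for this example and are not taken from the specification.

```python
def expand_year(token: str) -> int:
    """Expand a year token using the 1970 pivot convention described above.

    Accepts four-digit years ("2004"), two-digit years ("04"), and
    two-digit years with a preceding apostrophe ("'04").
    """
    digits = token.lstrip("'")
    if not (digits.isascii() and digits.isdigit()) or len(digits) not in (2, 4):
        raise ValueError(f"not a year token: {token!r}")
    value = int(digits)
    if len(digits) == 4:
        return value  # fully specified years are taken as written
    # 00 to 69 map to 2000 to 2069; 70 to 99 map to 1970 to 1999.
    return 2000 + value if value <= 69 else 1900 + value
```

For example, `expand_year("'04")` and `expand_year("04")` both yield 2004, while `expand_year("70")` yields 1970.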
Text representations of a month comprise a variety of month formats. For example, months can be written either spelled out in a complete form or in an abbreviated form such as Oct for October. Months can also be written in any number of languages, either in complete form or abbreviated form. Months can be capitalized, lower case, or upper case.

Furthermore, incompletely specified alphanumeric formats of dates are common. For example, a date may be represented as a month and year without an accompanying day such as, for example, October 2004 or 2004 October. A date may be further represented as a month and day without an accompanying year such as, for example, October 11th, 11th October, 11 October, or October 11.

Moreover, various characters such as a space, a dash ("-"), a period ("."), a comma (","), etc. may separate components of a date in alphanumeric format.
Numeric dates comprise numeric formats such as, for example:
11/10/2004, 11-10-2004, 11.10.2004,
10/11/2004, 10-11-2004, 10.11.2004, or
2004/10/11, 2004-10-11, 2004.10.11

Components of a date in numeric format may be separated by various characters such as a space, a dash ("-"), a period ("."), an underline ("_"), a forward slash ("/"), etc. In numeric format, a year can comprise two digits (e.g., 04) or four digits (e.g., 2004).

A URL string can comprise dates without any separators between day, month, or year. In this case, the date is written as a numeric string such as, for example, the following dates for October 11, 2004: 11102004, 10112004, or 20041011.
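As an illustrative sketch only, the separator-free strings above can be expanded into every calendar-valid reading. The function name `undelimited_candidates`, the three layouts tried, and the plausibility window of 1900 to 2100 for the year are assumptions introduced for this example; the specification does not prescribe them.

```python
from datetime import date

def undelimited_candidates(digits: str) -> list:
    """Return every plausible (layout, date) reading of an 8-digit string."""
    if len(digits) != 8 or not (digits.isascii() and digits.isdigit()):
        return []
    layouts = {
        # layout name: (year, month, day) slices of the string
        "ddmmyyyy": (digits[4:], digits[2:4], digits[:2]),
        "mmddyyyy": (digits[4:], digits[:2], digits[2:4]),
        "yyyymmdd": (digits[:4], digits[4:6], digits[6:]),
    }
    readings = []
    for name, (year, month, day) in layouts.items():
        if not 1900 <= int(year) <= 2100:  # assumed plausibility window
            continue
        try:
            readings.append((name, date(int(year), int(month), int(day))))
        except ValueError:
            continue  # not a valid calendar date under this layout
    return readings
```

Under these assumptions, 20041011 has a single reading, whereas 11102004 remains ambiguous between October 11 and November 10 and would be passed on for disambiguation.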
For either alphanumeric dates or numeric dates appearing in HTML, HTML tags may be embedded within a string representing the date. For example, the string 11th October 2004 may have been written as "11th <b>October</b> 2004" to make the month October a bold word.
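A minimal sketch of de-tagging that keeps enough information to reconstruct the location of a date in the original markup is given below. The specification refers only to standard de-tagging methods; the regular expression `TAG`, the function name `detag_with_offsets`, and the offset list are assumptions for illustration, and a naive tag pattern of this kind does not handle comments, scripts, or character entities.

```python
import re

# Naive tag pattern, assumed for illustration only.
TAG = re.compile(r"<[^>]+>")

def detag_with_offsets(html: str):
    """Strip tags from html.

    Returns (text, offsets) where offsets[i] is the index in html of the
    i-th character of the de-tagged text.
    """
    chars = []
    offsets = []
    pos = 0
    for tag in TAG.finditer(html):
        for i in range(pos, tag.start()):  # copy text before this tag
            chars.append(html[i])
            offsets.append(i)
        pos = tag.end()  # skip over the tag itself
    for i in range(pos, len(html)):  # copy text after the last tag
        chars.append(html[i])
        offsets.append(i)
    return "".join(chars), offsets

text, offsets = detag_with_offsets("11 <b>October</b> 2004")
```

Here `text` is the contiguous string "11 October 2004", which is easier to match, and `offsets` maps each matched character back to its position in the crawled HTML.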
`FIG. 4 illustrates a method 400 of the extended matching
`module 205 of system 10. System 10 selects a document for
`processing (step 405). The date module 215 scans the selected
`document for one or more occurrences of a date in numeric
`format or a date comprising a month name in alphabetic
`format in either full form or abbreviated form (step 410).
The extended matching module 205 determines whether the found date comprises a month name (decision step 415). A date comprising a month name typically comprises an alphanumeric portion preceding the month name (further referenced as the prefix part) and an alphanumeric portion following the month name (further referenced as the suffix part). If the found date comprises a month name, the prefix module 220 applies prefix regular expression matching to the characters preceding the found month name to identify the prefix part (step 420). The prefix module 220 captures any possible day-year patterns occurring in a prefix substring of length ranging from 5 characters to 10 characters. An exemplary length of the prefix substring is approximately 10 characters.

The suffix module 225 applies suffix regular expression matching to the characters following the found month name to identify the suffix part (step 425). The suffix module 225 captures any possible day-year patterns occurring in a suffix substring of length ranging from 10 characters to 30 characters. An exemplary length of the suffix substring is approximately 20 characters.
The extended matching module 205 determines one or more formats for the date comprising the month name by correlating the prefix part and the suffix part (step 430). The extended matching module 205 appropriately handles any overlapping dates. For example, a string "2003 Oct 11, 2004" may represent "2003 Oct 11" and "Oct 11 2004".
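The prefix and suffix matching of steps 420 through 430 can be sketched as follows. This is an assumption-laden illustration and not the patented expressions: the patterns `MONTH_RE`, `PREFIX_RE`, and `SUFFIX_RE`, the function `dates_around_months`, the English-only month list, and the restriction to complete dates with four-digit years are all choices made for this example. The window lengths of 10 and 20 characters follow the exemplary lengths given above.

```python
import re

MONTH_NAMES = ["january", "february", "march", "april", "may", "june", "july",
               "august", "september", "october", "november", "december"]
MONTH_NUM = {}
for number, name in enumerate(MONTH_NAMES, start=1):
    MONTH_NUM[name] = number
    MONTH_NUM[name[:3]] = number  # three-letter abbreviation

MONTH_RE = re.compile(
    r"\b(" + "|".join(sorted(MONTH_NUM, key=len, reverse=True)) + r")\b",
    re.IGNORECASE)
SEP = r"[\s.,\-/]*"
# A year or a day at the very end of the prefix substring.
PREFIX_RE = re.compile(
    r"(?<!\d)(?:(?P<year>\d{4})|(?P<day>\d{1,2})(?:st|nd|rd|th)?)" + SEP + r"$")
# A day (optionally followed by a year), or a year alone, at the start of the suffix.
SUFFIX_RE = re.compile(
    r"^" + SEP + r"(?:(?P<day>\d{1,2})(?:st|nd|rd|th)?(?!\d)"
    r"(?:" + SEP + r"(?P<year>\d{4})(?!\d))?|(?P<year_only>\d{4})(?!\d))")

def dates_around_months(text, prefix_len=10, suffix_len=20):
    """Return (year, month, day) tuples found around month names in text."""
    found = []
    for m in MONTH_RE.finditer(text):
        month = MONTH_NUM[m.group(1).lower()]
        prefix = PREFIX_RE.search(text[max(0, m.start() - prefix_len):m.start()])
        suffix = SUFFIX_RE.match(text[m.end():m.end() + suffix_len])
        p_year = prefix.group("year") if prefix else None
        p_day = prefix.group("day") if prefix else None
        s_day = suffix.group("day") if suffix else None
        s_year = suffix.group("year") if suffix else None
        s_year_only = suffix.group("year_only") if suffix else None
        # Correlate the two parts; overlapping readings are all kept.
        if p_year and s_day:            # e.g. "2004 Oct 11"
            found.append((int(p_year), month, int(s_day)))
        if s_day and s_year:            # e.g. "Oct 11, 2004"
            found.append((int(s_year), month, int(s_day)))
        elif p_day and s_year_only:     # e.g. "11 Oct 2004"
            found.append((int(s_year_only), month, int(p_day)))
    return found
```

With these patterns, the overlapping string "2003 Oct 11, 2004" yields both October 11, 2003 and October 11, 2004, mirroring the example above. Matching only short windows around each month name avoids applying one large expression to the whole document.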
If the found date comprises a numeric format (decision step 415), the disambiguator 230 disambiguates the format of each of the found occurrences of dates (step 435). Input to the disambiguator 230 comprises a date in numeric format found in step 410 or a date generated from a prefix part, a month name, and a suffix part in step 430.
Ambiguities arise both with a numeric date (e.g., 11.10.2004) and an alphanumeric date (e.g., 02 October 04) in that the day, month, or year may not be easily discerned. The disambiguator 230 checks the value ranges of the portions of the date to reduce ambiguity. For example, a number greater than 12 is either a day or a year since each year contains 12 months. Similarly, a number greater than 31 is a year since the maximum number of days in a month is 31.
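The range checks described above can be sketched as follows. The function names `possible_roles` and `day_month_layouts` and the layout labels are hypothetical and serve only to illustrate the rule.

```python
def possible_roles(value: int) -> set:
    """Return the roles a numeric date component can play, judged by range alone."""
    roles = {"year"}  # any number may be a year (two-digit or four-digit)
    if 1 <= value <= 31:
        roles.add("day")
    if 1 <= value <= 12:
        roles.add("month")
    return roles

def day_month_layouts(first: int, second: int) -> list:
    """List the day/month orders that remain possible for 'first.second.yyyy'."""
    layouts = []
    if "day" in possible_roles(first) and "month" in possible_roles(second):
        layouts.append("dd.mm")
    if "month" in possible_roles(first) and "day" in possible_roles(second):
        layouts.append("mm.dd")
    return layouts
```

For 11.10.2004 both orders survive the range check, so further disambiguation is needed; for 15.10.2004 only the day-first order remains.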
The disambiguator 230 uses information from the page on which a date is found to further disambiguate the date, if necessary. For example, an unambiguous date elsewhere on the same page may provide a clear indication of the date format. A page may comprise the date 9.10.2004; this date can be either October 9, 2004 or September 10, 2004. By examining other dates on the same page, the disambiguator 230 may find the date 15.10.2004; this date can only be October 15, 2004. Consequently, the format for dates on this page is dd.mm.yyyy. The disambiguator 230 extrapolates and infers that the format of the currently selected date is similar to that of other dates on the selected page, assuming that a given page uses the same format throughout.
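One possible sketch of this page-level inference is shown below. It handles only period-separated dates with four-digit years; the pattern `NUMERIC_DATE` and the functions `infer_page_layout` and `resolve` are assumptions for this example, and returning no layout when the evidence conflicts is a design choice of the sketch, not a statement of the specification.

```python
import re
from datetime import date
from typing import Optional

NUMERIC_DATE = re.compile(r"(?<!\d)(\d{1,2})\.(\d{1,2})\.(\d{4})(?!\d)")

def infer_page_layout(page_text: str) -> Optional[str]:
    """Infer the page's layout from its unambiguous dates, if they agree."""
    evidence = set()
    for first, second, _year in NUMERIC_DATE.findall(page_text):
        a, b = int(first), int(second)
        if 12 < a <= 31 and 1 <= b <= 12:
            evidence.add("dd.mm.yyyy")  # first component can only be a day
        elif 12 < b <= 31 and 1 <= a <= 12:
            evidence.add("mm.dd.yyyy")  # second component can only be a day
    return evidence.pop() if len(evidence) == 1 else None

def resolve(token: str, layout: str) -> date:
    """Interpret a 'x.y.yyyy' token under the inferred layout."""
    first, second, year = (int(part) for part in token.split("."))
    day, month = (first, second) if layout == "dd.mm.yyyy" else (second, first)
    return date(year, month, day)

layout = infer_page_layout("Posted 9.10.2004. Updated 15.10.2004.")
```

In this example the date 15.10.2004 fixes the layout as dd.mm.yyyy, so 9.10.2004 on the same page resolves to October 9, 2004.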
Disambiguator 230 applies additional rules as desired to further disambiguate a date. Disambiguator 230 applies these additional rules based on continuity in the page. For example, a date on a page comprising 08.10.2004, 11.10.2004, and 04.10.2004 likely comprises format dd.mm.yyyy. Disambiguator 230 further applies additional rules based on comparing dates to the past or the future. For example, if one interpretation of a date is after the current crawl date for a collection of documents in which the selected document resides, then the interpretation is rejected. Disambiguator 230 applies additional rules based on recency and closeness. If the possible interpretations of a date are earlier than the crawl date, then the interpretation that is closer to the crawl date is selected. For example, a page crawled in December 2004 lists a date 01 October 04. The disambiguator 230 interprets the date 01 October 04 as 1st October 2004 instead of 4th October 2001.
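The crawl-date rules above can be sketched as follows. The function name `choose_by_crawl_date` is hypothetical, and the specific crawl day of December 15, 2004 used in the example is an assumption, since the text states only the month of the crawl.

```python
from datetime import date
from typing import List, Optional

def choose_by_crawl_date(candidates: List[date], crawl_date: date) -> Optional[date]:
    """Reject readings after the crawl date, then keep the one closest to it."""
    not_future = [d for d in candidates if d <= crawl_date]
    # Among past readings, the latest one is the closest to the crawl date.
    return max(not_future, default=None)

# "01 October 04" read as 1 October 2004 or as 4 October 2001.
chosen = choose_by_crawl_date([date(2004, 10, 1), date(2001, 10, 4)],
                              crawl_date=date(2004, 12, 15))
```

In this example `chosen` is October 1, 2004, consistent with the interpretation described above.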
`In one embodiment, the disambiguator 230 uses site-level
`information as available. The disambiguator 230 gathers date
`format information from one or more pages of a site and uses
`the gathered date information for other pages where a date is
`ambiguous.
`
The canonicalizer 235 takes the month, day, and year determined by the disambiguator 230, and represents the date in a canonicalized form as MMM-dd-yyyy (step 440).
The packaging module 210 packages the canonicalized format of each identified date (step 305) to, for example, support specific date querying, hierarchical date querying, and range date querying. Specific date querying queries content for documents comprising a specific date; an exemplary specific date query is:
Show all pages that mention a particular date D (e.g., 11 Oct 2004).
Hierarchical date querying queries content for documents comprising a month, a month and year, or a year; exemplary hierarchical date queries are:
Show all pages that mention any date in a given month (e.g., Oct 2004), or
Show all pages that mention any date in a given year (e.g., 2004).
Range date querying queries content for documents comprising a range of dates; an exemplary range date query is:
Show all pages that mention any date between two dates (e.g., 11 Oct 2004 and 22 Oct 2004).
The packaging module 210 packages the canonicalized date in the following formats: MMM-dd-yyyy, MMM-yyyy, yyyy, and a 32-bit integer. The first format, MMM-dd-yyyy, assists in performing specific date querying; the second and third formats, MMM-yyyy and yyyy, assist in performing hierarchical date querying; and the fourth format assists in performing date range querying.

For example, to search for all occurrences of a specific date such as 11 October 2004 in a specific date query, a user requests Oct-11-2004. This query returns occurrences of the requested date in all formats including 11.10.2004, 11-10-04, 2004 oct. 11, etc. Similarly, to search for occurrences of any date in October 2004 in a hierarchical query, the user requests Oct-2004. To search for any date in 2004, the user requests 2004.
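As a non-authoritative sketch, the three textual canonical forms can be generated and used as index keys as shown below. The names `MONTH_ABBR`, `canonical_keys`, and `add_occurrence`, the dictionary-based index, and the zero-padded day are assumptions for this example; the specification does not describe the index structure.

```python
from datetime import date

MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def canonical_keys(d: date) -> list:
    """Return the MMM-dd-yyyy, MMM-yyyy, and yyyy forms of a date."""
    month = MONTH_ABBR[d.month - 1]
    return [f"{month}-{d.day:02d}-{d.year}", f"{month}-{d.year}", f"{d.year}"]

def add_occurrence(index: dict, d: date, doc_id: str, location: int) -> None:
    """Record one occurrence of a date under every canonical key."""
    for key in canonical_keys(d):
        index.setdefault(key, set()).add((doc_id, location))

index = {}
add_occurrence(index, date(2004, 10, 11), "doc1", 42)  # e.g. found as "11.10.2004"
add_occurrence(index, date(2004, 10, 26), "doc2", 7)   # e.g. found as "Oct 26 2004"
```

A lookup of `index["Oct-11-2004"]` then answers a specific date query, while `index["Oct-2004"]` and `index["2004"]` answer hierarchical queries with a single key lookup regardless of how each date was originally written.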
The packaging module 210 further packages the canonicalized format of each identified date (step 305) to, for example, support proximity queries comprising a date and any other keywords. Support of proximity queries requires knowledge of the location of a date in a document, a span of the date, and a span associated with words in some or all of the documents in the input 240. A span comprises a count of the number of tokens in the date or in a phrase. In one embodiment, the canonicalized format of a date generated by the packaging module 210 comprises the location of the date within a document. In another embodiment, the canonicalized format of a date generated by the packaging module 210 comprises the span of the date within a document.

The packaging module 210 further packages the canonicalized format of each identified date (step 305) to, for example, support proximity queries comprising dates and any other keywords with any other type of date querying such as, for example, specific date querying, hierarchical date querying, and range date querying.
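The role of location and span in a proximity query can be illustrated with the following sketch. It assumes that locations are token indices and that proximity is measured as a token gap; the function `within_proximity` and the `window` parameter are hypothetical and not taken from the specification.

```python
def within_proximity(date_location: int, date_span: int,
                     keyword_location: int, window: int) -> bool:
    """Check whether a keyword token lies within `window` tokens of a date.

    date_location is the index of the date's first token and date_span is
    the number of tokens the date occupies.
    """
    date_end = date_location + date_span - 1
    if keyword_location < date_location:
        gap = date_location - keyword_location
    elif keyword_location > date_end:
        gap = keyword_location - date_end
    else:
        gap = 0  # keyword falls inside the date's own span
    return gap <= window

tokens = "Headlines for Oct 26 2004 budget vote".split()
# The date "Oct 26 2004" starts at token 2 and spans 3 tokens.
near = within_proximity(2, 3, tokens.index("budget"), window=1)
```

Without the span, a keyword immediately after a multi-token date would appear to be several tokens away from the date's starting location.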
`System 10 supports date querying of dB 30, outputting all
`occurrences of a queried date using the packaging generated
`by the packaging module 210 (step 310).
Exemplary canonical packages comprise MMM-dd-yyyy, MMM-yyyy, yyyy, and a 32-bit integer. The canonicalizer 235 indexes the formatted date according to the canonical forms. For example, the canonicalizer 235 indexes 11 October 2004 as Oct-11-2004, Oct-2004, and 2004. Indexing a date in canonical forms allows the date to match specific queries and hierarchical queries. Each date is converted to an integer using, for example, 11 bits for the year, 4 bits for the month, and 5 bits for the day. These 20 bits are packaged as a 32-bit integer.
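A minimal sketch of the integer packaging is given below. The bit widths follow the text; placing the year in the most significant bits, followed by the month and then the day, is an assumption made here so that integer order matches chronological order. The function names `pack_date` and `unpack_date` are likewise hypothetical.

```python
def pack_date(year: int, month: int, day: int) -> int:
    """Pack a date into 20 bits: 11 for the year, 4 for the month, 5 for the day."""
    if not (0 <= year < 2 ** 11 and 1 <= month <= 12 and 1 <= day <= 31):
        raise ValueError("date component out of range for the assumed layout")
    return (year << 9) | (month << 5) | day

def unpack_date(packed: int) -> tuple:
    """Recover (year, month, day) from a packed integer."""
    return packed >> 9, (packed >> 5) & 0xF, packed & 0x1F

low, high = pack_date(2004, 10, 11), pack_date(2004, 10, 22)
in_range = low <= pack_date(2004, 10, 15) <= high
```

Under this layout a range date query such as "between 11 Oct 2004 and 22 Oct 2004" reduces to a comparison of integers. Note that 11 bits can represent years only up to 2047.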
A date ambiguity not resolved by the disambiguator 230 is indexed for each possible interpretation, allowing any interpretation of the ambiguous date to match a corresponding user query. Indexing ambiguous dates increases recall for the user.
FIG. 5 illustrates an exemplary screenshot 500 displaying results of a single query for "all dates in Oct-2004". The screenshot 500 comprises a menu bar 505 and one or more search control button(s) 510. Listed in the screenshot 500 are exemplary query responses 515, 520, 525, 530, 535 (collectively referenced as query responses 540). Additional query responses may be found by the query but are not shown in screenshot 500.
`As seen in query responses 540, system 10 id



