US007730013B2

(12) United States Patent
Dill et al.

(10) Patent No.: US 7,730,013 B2
(45) Date of Patent: Jun. 1, 2010
`
(54) SYSTEM AND METHOD FOR SEARCHING
DATES EFFICIENTLY IN A COLLECTION OF
WEB DOCUMENTS

(75) Inventors: Stephen Dill, San Jose, CA (US);
Madhukar R. Korupolu, Sunnyvale, CA (US)
(73) Assignee: International Business Machines
Corporation, Armonk, NY (US)
(*) Notice: Subject to any disclaimer, the term of this
patent is extended or adjusted under 35
U.S.C. 154(b) by 248 days.
(21) Appl. No.: 11/259,664
(22) Filed: Oct. 25, 2005
(65) Prior Publication Data
US 2007/0094246 A1    Apr. 26, 2007
`
(51) Int. Cl.
G06F 7/30 (2006.01)
G06F 7/00 (2006.01)
(52) U.S. Cl. ........................ 707/5; 704/1; 704/8; 704/9
(58) Field of Classification Search ..................... 707/5,
707/6, 102, 105, 999.005, 999.006, 999.102,
707/999.105; 704/1, 8, 9
See application file for complete search history.

(56) References Cited
`
U.S. PATENT DOCUMENTS

6,144,963 A *       11/2000  Tsuda ................. 707/10
6,167,368 A *       12/2000  Wacholder ............. 704/9
6,249,765 B1 *       6/2001  Adler et al. .......... 704/500
7,107,528 B2 *       9/2006  Gerstl et al. ......... 715/534
2002/0143871 A1 *   10/2002  Meyer et al. .......... 709/204
2003/0200199 A1 *   10/2003  Snyder ................ 707/2
2003/0212649 A1 *   11/2003  Denesuk et al. ........ 707/1
2003/0212675 A1 *   11/2003  Denesuk et al. ........ 707/5
2003/0212699 A1 *   11/2003  Denesuk et al. ........ 707/102
2004/0123240 A1 *    6/2004  Gerstl et al. ......... 715/513
2005/0057584 A1 *    3/2005  Gruen et al. .......... 345/752
2005/0149858 A1 *    7/2005  Stern et al. .......... 715/513
2005/0177564 A1      8/2005  Kobayashi et al.
2006/0101005 A1 *    5/2006  Yang et al. ........... 707/3
2006/0248456 A1 *   11/2006  Bender et al. ......... 715/531
`
FOREIGN PATENT DOCUMENTS

JP    2004141855    5/2004

OTHER PUBLICATIONS

"Date and Time Formats," W3C, Aug. 27, 1998,
http://www.w3.org/TR/NOTE-datetime.
"Convert HTML to text and remove markup," Jafsoft, Oct. 9, 2004,
http://web.archive.org/web/20041009125145/http://www.jafsoft.com/detagger/.
`
`(Continued)
Primary Examiner: James Trujillo
Assistant Examiner: Dawaune Conyers
(74) Attorney, Agent, or Firm: Samuel A. Kassatly;
Shimokaji & Associates, P.C.
`
(57) ABSTRACT

A date querying system processes free-form text in documents
to identify and locate some or all of the dates in the
documents using extended regular expression matching to
capture various date formats. The system packages a
canonicalized format of each identified date to support various
types of queries such as, for example, specific date querying,
hierarchical date querying, range date querying, proximity
queries comprising a date and any keywords, and any
combination of types of queries. The system scans a document to
identify the various format dates occurring in the document,
disambiguates the resulting occurrences of dates, and
canonicalizes the dates according to one or more predetermined
formats.
`
`25 Claims, 6 Drawing Sheets
`
`
`
`
`Bright Data Ltd. v. Oxylabs, UAB
`IPR2023-01442 | Oxylabs EX2011
`Page 1 of 14
`
`

`

`US 7,730,013 B2
`Page 2
`
OTHER PUBLICATIONS

Friedl, J., "Mastering Regular Expressions, Second Edition,"
O'Reilly & Associates, Inc., Sebastopol, CA, 2002.*
Klopping, H., Mesman, Beno, Plomp, P., Schreuder, W., "The LPIC-2
Exam Prep," Sep. 12, 2004.*
Ignat et al., "Extending an Information Extraction tool set to Central
and Eastern European languages," Sep. 2003, Proceedings of the
International Workshop, pp. 33-39.*

* cited by examiner
`
`

`

U.S. Patent    Jun. 1, 2010    Sheet 1 of 6    US 7,730,013 B2

[FIG. 1: drawing of the operating environment. The only label
recoverable from the scan is the rotated text "DATE SEARCHING
SYSTEM".]
`
`
`
`

`

U.S. Patent    Jun. 1, 2010    Sheet 2 of 6    US 7,730,013 B2

[FIG. 2: block diagram of DATE QUERY SYSTEM 10. Legible blocks:
EXTENDED MATCHING MODULE (containing DATE MODULE 215, PREFIX
MODULE, SUFFIX MODULE, DISAMBIGUATOR, and CANONICALIZER 235),
PACKAGING MODULE 210, and OUTPUT 245.]
`
`

`

U.S. Patent    Jun. 1, 2010    Sheet 3 of 6    US 7,730,013 B2

[FIG. 3: flow chart.
400: EXTENDED MATCHING MODULE USES EXTENDED REGULAR EXPRESSION
MATCHING TO IDENTIFY DATES IN INPUT AND GENERATE CANONICALIZED
FORMAT FOR EACH IDENTIFIED DATE (METHOD 400, FIG. 4)
305: PACKAGING MODULE PACKAGES THE CANONICALIZED FORMAT OF EACH
IDENTIFIED DATE
310: OUTPUT OCCURRENCES OF THE QUERIED DATE BASED ON PACKAGING]
`
`

`

U.S. Patent    Jun. 1, 2010    Sheet 4 of 6    US 7,730,013 B2

[FIG. 4A: flow chart of method 400.
405: SELECT A DOCUMENT
410: DATE MODULE SCANS SELECTED DOCUMENT FOR OCCURRENCES OF A
DATE IN NUMERIC FORMAT OR A DATE COMPRISING A MONTH NAME IN
ALPHABETIC FORMAT
415 (decision): FOUND DATE COMPRISES MONTH NAME? The YES branch
leads to 420; the other branch leads to an off-page connector.
420: PREFIX MODULE FINDS A NUMERIC PREFIX PART OF A DATE
PRECEDING THE FOUND MONTH NAME
425: SUFFIX MODULE FINDS A NUMERIC SUFFIX PART OF A DATE
FOLLOWING THE FOUND MONTH NAME
430: EXTENDED MATCHING MODULE DETERMINES THE FORMAT OF A DATE
OCCURRING WITH THE ALPHABETIC MONTH NAME BY CORRELATING THE
PREFIX PART WITH THE SUFFIX PART]
`
`

`

U.S. Patent    Jun. 1, 2010    Sheet 5 of 6    US 7,730,013 B2

[FIG. 4B: flow chart of method 400, continued.
435: DISAMBIGUATOR DISAMBIGUATES THE FORMAT OF EACH OF THE
OCCURRENCES OF A DATE
440: CANONICALIZER CREATES CANONICALIZED FORMATS FOR EACH OF THE
OCCURRENCES OF A DATE]
`
`

`

U.S. Patent    Jun. 1, 2010    Sheet 6 of 6    US 7,730,013 B2

[FIG. 5: screenshot 500, EXEMPLARY SAMPLE RESULTS FOR 'ALL DATES
IN OCTOBER 2004', with MENU BAR 505 and SEARCH CONTROL BUTTON(S)
510. Listed results:
515: 1. Title 1 / "... from the October 19 2004." / URL1 / crawl date
520: 2. Title 2 / "Headlines Oct 26 2004..." / URL2 / crawl date
525: 3. Title 3 / "Last Updated October 14, 04" / URL3 / crawl date
530: 4. Title 4 / "article text Oct 18 2004." / URL4 / crawl date
535: 5. Title 5 / "Published 10.14.2004..." / URL5 / crawl date]
`
`

`

SYSTEM AND METHOD FOR SEARCHING
DATES EFFICIENTLY IN A COLLECTION OF
WEB DOCUMENTS

FIELD OF THE INVENTION

The present invention generally relates to text analysis of
electronic documents. More specifically, the present invention
relates to identifying dates in electronic documents in
which dates occur in various formats and further relates to
packaging the dates uniformly for purposes of querying.
`
`
`BACKGROUND OF THE INVENTION
`
`
Searching for dates is a useful primitive in understanding
and extracting relevant pieces from large collections of
documents. Locating a source date for content on the web is
especially useful in determining relevancy to a search request
comprising a date. However, efficiently performing a query
for dates is challenging since dates tend to occur in various
formats in unstructured text.
For example, the date October 11, 2004 can occur in text as
11th of October 2004, 11-10-2004, 11 October, '04, Oct. 11th
04, 11/10/04, 10.11.2004, 2004 Oct 11, etc. Variations in date
expression can be even more pronounced on a diversified
collection such as the web, where many different people and
organizations write web content such as free-form text. This
is a natural consequence of the decentralized nature of the
web and the few rigid requirements imposed on free-form
text.
Nevertheless, the free-form text on the web is an important
source of information, both current and archived. Newspapers
and magazines provide news articles online on the web;
an estimate for news sources on the web is over 10,000.
Covering a range of topics, these news articles cater to the
needs of both businesses and individuals. Moreover,
organizations such as companies and universities post a wealth of
information available online. Some search engine sites
estimate the number of web pages indexed at over 8 billion.
Given the large number of sources and the large number of
pages on the web, the need for automated techniques for
searching and navigating such a large collection is increasing.
Dates are an important means to understand the temporal
context of the information found near the dates or on the same
web page as the dates. Queries such as:
  Show all pages that mention a particular date D (e.g., 11
  Oct 2004),
  Show all pages that mention any date in a given month
  (e.g., Oct 2004), or
  Show all pages that mention any date in a given year (e.g.,
  2004) with one or more keywords with a specified context
  such as "on the same page", "on the same line", etc.
are natural and useful ways to filter and navigate such
large collections of pages.
Although conventional web search engines perform well
using standard keyword and proximity searches, it would be
desirable to present additional improvements. Conventional
web search engines do not adequately search by dates. Even
a basic date query such as "find all pages that mention 11th
October 2004" requires a separate search for each possible
date format. Such a search is tedious and unmanageable since
the number of possible date formats is sizeable. Furthermore,
some formats such as 11.10.2004 are difficult to search
because some search engines ignore the numbers and periods
in a date format if they occur frequently.
`
Searching on dates using a conventional web search engine
becomes more unmanageable for hierarchical date queries
such as "find all pages that mention any date in October
2004."
Conventional web search engines have further difficulty
searching for dates in ambiguous format. For example,
11.10.2004 can mean either 11 October 2004 or 10
November 2004, depending on the context. The ambiguity is
further compounded when the year is specified as a two-digit
number and the month, day, and year are similar in value
(for example, 01/04/05).
`Another conventional approach for finding a source date
`finds a single date for each page, representing when the page
`may have been written, i.e., a date-of-page. However, this
`date-of-page may not exist for all web pages. A date-of-page
`is typically not well defined and is usually a best guess based
`on different dates that appear on the page or in the http header
`of the page. Furthermore, this conventional approach still
`retains only one date per page even when a page contains
`additional dates. Consequently, the information about other
`dates is lost, including the locations of the other dates for
`proximity queries.
A further conventional approach, which identifies named
entities such as the different forms in which a keyword can be
referenced in text, lists all possible alternatives explicitly. This
conventional approach works well in cases where the number
of variants is small. However, in the context of
locating source dates on the web, the large number of possible
formats for each date and the large number of possible distinct
dates render this approach cumbersome. Consequently,
regular expression-based spotting is a better alternative for
dates.
Yet another conventional approach comprises a natural
single-step regular expression matching. In particular
contexts such as weblogs (also known as blogs), this conventional
approach addresses identification of dates to some extent
based on the structure of blogs. However, this conventional
approach does not address the wide range of possible formats
for dates that appear on the web and the resulting
disambiguation required to identify dates. Furthermore, efficiency and
processing time become serious issues for this conventional
approach considering the large number of possible formats
and the large number of pages requiring processing.
What is therefore needed is a system, a computer program
product, and an associated method for searching dates
efficiently in a large collection of web documents. The need for
such a solution has heretofore remained unsatisfied.
`
`SUMMARY OF THE INVENTION
`
The present invention satisfies this need, and presents a
system, a computer program product, and an associated
method (collectively referred to herein as "the system" or "the
present system") for searching dates efficiently in a large
collection of web documents.
A date matching module of the present system processes
free-form text in documents to identify and locate some or all
of the dates in the documents using extended regular
expression matching to capture various date formats. A packaging
module of the present system packages a canonicalized
format of each identified date to support various types of queries
such as, for example, specific date querying, hierarchical date
querying, range date querying, proximity date querying,
proximity queries comprising a date and any keywords, and
any combination of types of queries.
The date module scans a document for some or all
occurrences of dates, searching for numerical dates and month
names in alphabetic format. If a month name is found, a prefix
module applies a prefix regular expression matching to a
prefix substring preceding the found month name to identify
a prefix part of a date, a portion of the date preceding the
month name. The suffix module applies a suffix regular
expression matching to a suffix substring following the found
month name to identify a suffix part of a date, a portion of the
date following the month name. The date matching module
determines one or more formats for a date corresponding to
the found month name by correlating the prefix part and the
suffix part. The date matching module generates a date in the
selected format(s) from the found month, the prefix part, and
the suffix part.
A disambiguator of the present system disambiguates
found occurrences of dates comprising either a found
numerical date or the date generated by the date matching module.
Disambiguation is desired for dates with a day, month, or year
that cannot easily be discerned. A canonicalizer formats dates
in one or more canonical forms for the disambiguated
occurrences of dates.
`
World Wide Web (WWW, also Web): An Internet client-server
hypertext distributed information retrieval system.
FIG. 1 portrays an exemplary overall environment in which
a system, a service, a computer program product, and an
associated method (the "date searching system 10" or the
"system 10") for searching dates efficiently in a collection of
web documents according to the present invention may be
used. A text analysis system 15 comprises system 10 and a
search engine 20. The text analysis system 15 analyzes
documents obtained from a source such as, for example, the
WWW, for data analysis, trend discovery, etc. The text analysis
system 15 comprises search functionalities provided by the
search engine 20. The text analysis system 15 is installed on
a computer such as the host server 25.
System 10 comprises a software programming code or a
computer program product that is typically embedded within,
or installed on the host server 25. Alternatively, system 10 can
be saved on a suitable storage medium such as a diskette, a
CD, a hard drive, or like devices. A database 30 (dB 30)
comprises documents from sources such as the WWW. While
the system 10 will be described in connection with the WWW,
the system 10 can be used with a stand-alone dB 30 of content
that may have been derived from the WWW or other sources.
Users, such as remote Internet users, are represented by a
variety of computers such as computers 35, 40, 45, and can
access the host server 25 through a network 50. Computers
35, 40, 45 each comprise software that allows the user to
interface securely with the host server 25. The host server 25
is connected to network 50 via a communications link 55 such
as a telephone, cable, or satellite link. Computers 35, 40, 45
can be connected to network 50 via communications links 60,
65, 70, respectively. While system 10 is described in terms of
network 50, computers 35, 40, 45 may also access system 10
locally rather than remotely. Computers 35, 40, 45 may
access system 10 either manually, or automatically through
the use of an application. Users query data on dB 30 via
network 50 and the search engine 20.
FIG. 2 illustrates a high-level hierarchy of system 10.
System 10 comprises an extended matching module 205 and a
packaging module 210. The extended matching module 205
comprises a date module 215, a prefix module 220, a suffix
module 225, a disambiguator 230, and a canonicalizer 235.
An input 240 comprises documents that have no HTML tags
such as, for example, any content derived from the WWW that
has been de-tagged. Data in dB 30 is processed using standard
de-tagging methods to remove the HTML tags from the crawled
HTML content of the document. Removing HTML tags simplifies
the process of spotting a date by removing the complexity of
HTML tags interleaved within a date string. Removing HTML
tags requires reconstruction of the span and location of the
occurrence of a date; span and location are required for
queries comprising proximity. An output 245 generated by
system 10 comprises the span and location of each identified
date to support proximity querying.
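As a rough illustration of the span-and-location bookkeeping described above, the following Python sketch strips tags while recording, for each character of the de-tagged text, its offset in the original HTML. The function name `detag_with_offsets` and the simple regex-based tag pattern are assumptions made for illustration; the text above refers only to "standard de-tagging methods".

```python
import re

# Naive tag pattern: an assumption for this sketch, not a full HTML parser.
TAG = re.compile(r"<[^>]*>")

def detag_with_offsets(html):
    """Return (text, offsets), where offsets[i] is the index in `html`
    of the i-th character of the de-tagged `text`."""
    chars, offsets = [], []
    pos = 0
    for m in TAG.finditer(html):
        # Copy the characters between the previous tag and this one.
        for i in range(pos, m.start()):
            chars.append(html[i])
            offsets.append(i)
        pos = m.end()
    # Copy whatever follows the last tag.
    for i in range(pos, len(html)):
        chars.append(html[i])
        offsets.append(i)
    return "".join(chars), offsets

html = "Updated 11 <b>October</b> 2004 today"
text, offsets = detag_with_offsets(html)
start = text.index("11 October 2004")
end = start + len("11 October 2004")
# Map the date's span in the de-tagged text back to the original HTML.
html_span = (offsets[start], offsets[end - 1] + 1)
```

Here the date is spotted in the clean text, and `html_span` recovers where that date begins and ends in the tagged source.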
FIG. 3 illustrates a method 300 of system 10. The extended
matching module 205 processes documents in dB 30 to
identify and locate some or all of the dates in the document (step
400, further illustrated in method 400 of FIG. 4). The
extended matching module 205 uses extended regular
expression matching to capture various date formats.
`In free-form text, dates can occur either numerically or
`alphanumerically. Alphanumeric dates occur in forms, such
`as, for example:
`October 11 hh:mm:ss EST 2004
`October 11 hh:mm:ss 2004
`11 October 2004
`October 11 2004
`2004 October 11.
`
BRIEF DESCRIPTION OF THE DRAWINGS

The various features of the present invention and the manner
of attaining them will be described in greater detail with
reference to the following description, claims, and drawings,
wherein reference numerals are reused, where appropriate, to
indicate a correspondence between the referenced items, and
wherein:
FIG. 1 is a schematic illustration of an exemplary operating
environment in which a date searching system of the present
invention can be used;
FIG. 2 is a block diagram of the high-level architecture of
the date searching system of FIG. 1;
FIG. 3 is a process flow chart illustrating a method of
operation of the date searching system of FIGS. 1 and 2;
FIG. 4 is comprised of FIGS. 4A and 4B and represents a
process flow chart illustrating a method of operation of the
date matching module of the date searching system of FIGS.
1 and 2; and
FIG. 5 is a diagram illustrating an exemplary screen shot of
a response to a query by a user submitted to a search engine
comprising the date searching system of FIGS. 1 and 2.

DETAILED DESCRIPTION OF PREFERRED
EMBODIMENTS

`
The following definitions and explanations provide
background information pertaining to the technical field of the
present invention, and are intended to facilitate the
understanding of the present invention without limiting its scope:
Free-form text: Unstructured text such as the input to a
word processor or text editor comprising, for example, words,
sentences, dates, numbers, etc.
HTML (Hypertext Markup Language): A standard language
for attaching presentation and linking attributes to
informational content within documents. During a document
authoring stage, HTML "tags" are embedded within the
informational content of the document. When a web server
transmits the web document (or "HTML document") to a web
browser, the tags are interpreted by the browser and used to
parse and display the document. In addition to specifying how
the web browser is to display the document, HTML tags can
be used to create hyperlinks to other web documents.
Internet: A collection of interconnected public and private
computer networks that are linked together with routers by a
set of standard protocols to form a global, distributed
network.
`
In each of these exemplary alphanumeric formats, additional
variants result when a day 11 of a month is written with a
superscripted "th" as 11th. In general, a day of a month can be
written as 1st, 2nd, 3rd, 4th, 5th, etc., or as 1, 2, 3, 4, 5, etc.
The number of possible text formats in which any date may
be written alphanumerically is non-trivial. Date format
comprises year format, month format, day format, and separator
format.
Text representations of years comprise a variety of year
formats. For example, a year such as 2004 can be written as
2004, 04, or '04. In general, the year can occur as a full
four-digit integer (e.g., 2004), as a two-digit integer (e.g., 04),
or as a two-digit integer with a preceding apostrophe (e.g.,
'04). For disambiguation purposes, system 10 uses a
convention specifying that years before 1970 are fully specified with
four digits. System 10 interprets two-digit years from 00 to 69
as 2000 to 2069. System 10 interprets two-digit years from 70
to 99 as 1970 to 1999.
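The two-digit year convention above can be written down directly. The following Python sketch is illustrative only; the function name `expand_year` and the choice to reject tokens that are neither two nor four digits are assumptions, not part of the described system.

```python
def expand_year(token):
    """Expand a year token using the stated convention:
    two-digit years 00-69 map to 2000-2069, 70-99 map to 1970-1999,
    and four-digit years are taken as written."""
    digits = token.lstrip("'")  # allow a leading apostrophe, as in '04
    if not digits.isdigit() or len(digits) not in (2, 4):
        raise ValueError(f"not a recognized year token: {token!r}")
    value = int(digits)
    if len(digits) == 4:
        return value
    return 2000 + value if value <= 69 else 1900 + value
```

Under this convention a year such as 1965 can only be expressed with four digits, since "65" would be read as 2065.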
`Text representations of a month comprise a variety of
`month formats. For example, months can be written either
`spelled out in a complete form or in an abbreviated form such
`as Oct for October. Months can also be written in any number
`of languages, either in complete form or abbreviated form.
`Months can be capitalized, lower case, or uppercase.
Furthermore, incompletely specified alphanumeric formats
of dates are common. For example, a date may be
represented as a month and year without an accompanying
day such as, for example, October 2004 or 2004 October. A
date may be further represented as a month and day without
an accompanying year such as, for example, October 11th,
11th October, 11 October, or October 11.
Moreover, various characters such as a space, a dash ("-"),
a period ("."), a comma (","), etc. may separate components of
a date in alphanumeric format.
Numeric dates comprise numeric formats such as, for
example:
11/10/2004, 11-10-2004, 11.10.2004,
10/11/2004, 10-11-2004, 10.11.2004, or
2004/10/11, 2004-10-11, 2004.10.11
Components of a date in alphanumeric format may be
separated by various characters such as a space, a dash ("-"),
a period ("."), an underline ("_"), a forward slash ("/"), etc. In
alphanumeric format, a year can comprise two digits (e.g., 04)
or four digits (e.g., 2004).
A URL string can comprise dates without any separators
between day, month, or year. In this case, the date is written as
a numeric string such as, for example, the following dates for
October 11, 2004: 11102004, 10112004, or 20041011.
For either alphanumeric dates or numeric dates appearing
in HTML, HTML tags may be embedded within a string
representing the date. For example, the string 11th October 2004 may
have been written as "11th<b> October </b>2004" to make
the month October a bold word.
`FIG. 4 illustrates a method 400 of the extended matching
`module 205 of system 10. System 10 selects a document for
`processing (step 405). The date module 215 scans the selected
`document for one or more occurrences of a date in numeric
`format or a date comprising a month name in alphabetic
`format in either full form or abbreviated form (step 410).
The extended matching module 205 determines whether
the found date comprises a month name (decision step 415).
A date comprising a month name typically comprises an
alphanumeric portion preceding the month name (further
referenced as the prefix part) and an alphanumeric portion
following the month name (further referenced as the suffix part).
If the found date comprises a month name, the prefix module
220 applies a prefix regular expression matching to the
characters preceding the found month name to identify the prefix
part (step 420). The prefix module 220 captures any possible
day-year patterns occurring in a prefix substring of length
ranging from 5 characters to 10 characters. An exemplary
length of the prefix substring is approximately 10 characters.
The suffix module 225 applies suffix regular expression
matching to the characters following the found month name
to identify the suffix part (step 425). The suffix module 225
captures any possible day-year patterns occurring in a suffix
substring of length ranging from 10 characters to 30
characters. An exemplary length of the suffix substring is
approximately 20 characters.
The extended matching module 205 determines one or
more formats for the date comprising the month name by
correlating the prefix part and the suffix part (step 430). The
extended matching module 205 appropriately handles any
overlapping dates. For example, a string "2003 Oct 11, 2004"
may represent "2003 Oct 11" and "Oct 11 2004."
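The month-anchored matching of steps 410 through 430 might be sketched in Python as follows. This is a simplified illustration under stated assumptions: the patterns `MONTH`, `PREFIX`, and `SUFFIX`, the functions `spot_month_dates` and `correlate`, and the exact window handling are hypothetical, and only English month names and simple day/year forms are covered.

```python
import re

MONTHS = ("January February March April May June July August "
          "September October November December").split()
# Full names and three-letter abbreviations; "may" as a verb will also match.
MONTH = re.compile(
    r"\b(" + "|".join(f"{m}|{m[:3]}" for m in MONTHS) + r")\b\.?", re.I)
# Prefix part: one number (day or year) ending just before the month name.
PREFIX = re.compile(
    r"(?<!\d)(\d{4}|\d{1,2})(?:st|nd|rd|th)?[\s,.\-]*(?:of\s+)?$", re.I)
# Suffix part: one or two numbers (day and/or year) just after the month name.
SUFFIX = re.compile(
    r"^[\s,.\-]*(\d{4}|'?\d{1,2})(?!\d)(?:st|nd|rd|th)?"
    r"(?:[\s,.\-]+(\d{4}|'?\d{2})(?!\d))?", re.I)

def spot_month_dates(text, prefix_len=10, suffix_len=20):
    """Find month names, then match short windows before and after each."""
    hits = []
    for m in MONTH.finditer(text):
        pre = PREFIX.search(text[max(0, m.start() - prefix_len):m.start()])
        suf = SUFFIX.search(text[m.end():m.end() + suffix_len])
        hits.append({
            "month": m.group(1)[:3].title(),
            "prefix": pre.group(1) if pre else None,
            "suffix": tuple(g for g in suf.groups() if g) if suf else (),
        })
    return hits

def correlate(hit):
    """List candidate readings implied by a prefix/suffix pair."""
    prefix, month, suffix = hit["prefix"], hit["month"], hit["suffix"]
    if prefix and len(suffix) == 2:
        # Overlapping dates, e.g. "2003 Oct 11, 2004".
        return [(prefix, month, suffix[0]), (month,) + suffix]
    if prefix and suffix:
        return [(prefix, month, suffix[0])]
    if suffix:
        return [(month,) + suffix]
    if prefix:
        return [(prefix, month)]
    return [(month,)]
```

Anchoring on the month name and matching only short windows around it keeps each regular expression small, which is one plausible reading of why the prefix and suffix substrings are bounded in length.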
If the found date comprises a numeric format (decision step
415), the disambiguator 230 disambiguates the format of each
of the found occurrences of dates (step 435). Input to the
disambiguator 230 comprises a date in numeric format found
in step 410 or a date generated from a prefix part, a month
name, and a suffix part in step 430.
Ambiguities arise both with numeric dates (e.g.,
11.10.2004) and alphanumeric dates (e.g., 02 October 04) in
that the day, month, or year may not be easily discerned. The
disambiguator 230 checks the value ranges of the portions of the
date to reduce ambiguity. For example, a number greater than
12 is either a day or a year since each year contains 12 months.
Similarly, a number greater than 31 is a year since the
maximum number of days in a month is 31.
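The range checks above can be illustrated with a short Python sketch. The function name `interpretations` and the restriction to three candidate field orders (day-month-year, month-day-year, year-month-day) are assumptions made for the example.

```python
def interpretations(a, b, c):
    """Enumerate readings of three numeric components that pass the
    range checks: a day is 1-31 and a month is 1-12."""
    readings = []
    for order in ("dmy", "mdy", "ymd"):
        fields = dict(zip(order, (a, b, c)))
        if 1 <= fields["d"] <= 31 and 1 <= fields["m"] <= 12:
            readings.append((order, fields["d"], fields["m"], fields["y"]))
    return readings
```

For 11.10.2004 two readings survive (11 October and 10 November), while 15.10.2004 has only one because 15 cannot be a month. A date such as 01/04/05 passes the checks under all three orders, which is why further rules are needed.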
The disambiguator 230 uses information from a page on
which a date is found to further disambiguate a date, if
necessary. For example, an unambiguous date elsewhere on the
same page may provide a clear indication of the date
format. A page may comprise date 9.10.2004; this date can be
either October 9, 2004 or September 10, 2004. By examining
other dates on the same page, the disambiguator 230 may find
the date 15.10.2004; this date can only be October 15, 2004.
Consequently, the format for dates on this page is
dd.mm.yyyy. The disambiguator 230 extrapolates and infers that
the format of the currently selected date is similar to that of
other dates on the selected page, assuming that a given page
uses the same format throughout.
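One way to sketch this page-level inference in Python is to intersect the field orders that remain valid for every numeric date on the page. The function name `page_format` and the three candidate orders are assumptions for illustration; the sketch relies on the same single-format-per-page assumption stated above.

```python
def page_format(dates):
    """Return the set of field orders consistent with every numeric
    date (a, b, c) found on one page."""
    consistent = {"dmy", "mdy", "ymd"}
    for a, b, c in dates:
        valid = set()
        for order in ("dmy", "mdy", "ymd"):
            fields = dict(zip(order, (a, b, c)))
            # Range checks: a day is 1-31 and a month is 1-12.
            if 1 <= fields["d"] <= 31 and 1 <= fields["m"] <= 12:
                valid.add(order)
        consistent &= valid
    return consistent
```

With only 9.10.2004 on the page, both "dmy" and "mdy" remain; adding 15.10.2004 narrows the result to "dmy", matching the dd.mm.yyyy conclusion in the example.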
Disambiguator 230 applies additional rules as desired to
further disambiguate a date. Disambiguator 230 applies these
additional rules based on continuity in the page. For example,
a date on a page comprising 08.10.2004, 11.10.2004, and
04.10.2004 likely has format dd.mm.yyyy.
Disambiguator 230 further applies additional rules based on dates
compared to past/future. For example, if one interpretation of
a date is after the current crawl date for a collection of
documents in which the selected document resides, then the
interpretation is rejected. Disambiguator 230 applies
additional rules based on recency and closeness. If the possible
interpretations of a date are earlier than the crawl date, then the
interpretation that is closer to the crawl date is selected. For
example, a page crawled in December 2004 lists a date 01
October 04. The disambiguator 230 interprets the date 01
October 04 as 1st October 2004 instead of 4th October 2001.
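The past/future and recency rules can be combined in a few lines of Python. The function name `pick_by_crawl_date` is hypothetical, and the specific crawl day used in the usage lines (15 December 2004) is an assumption, since the example above gives only the month.

```python
from datetime import date

def pick_by_crawl_date(candidates, crawl_date):
    """Reject interpretations after the crawl date, then select the
    remaining interpretation closest to the crawl date."""
    not_future = [d for d in candidates if d <= crawl_date]
    if not not_future:
        return None  # every interpretation lies in the future
    return max(not_future)

# "01 October 04" on a page crawled in December 2004.
crawl = date(2004, 12, 15)
choice = pick_by_crawl_date([date(2004, 10, 1), date(2001, 10, 4)], crawl)
```

Because all surviving candidates are on or before the crawl date, the latest of them is also the closest.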
`In one embodiment, the disambiguator 230 uses site-level
`information as available. The disambiguator 230 gathers date
`format information from one or more pages of a site and uses
`the gathered date information for other pages where a date is
`ambiguous.
`
The canonicalizer 235 takes the month, day, and year
determined by the disambiguator 230, and represents the date in a
canonicalized form as MMM-dd-yyyy (step 440).
The packaging module 210 packages the canonicalized
format of each identified date (step 305) to, for example,
support specific date querying, hierarchical date querying,
and range date querying. Specific date querying queries
content for documents comprising a specific date; an exemplary
specific date query is:
  "Show all pages that mention a particular date D (e.g., 11
  Oct 2004)."
Hierarchical date querying queries content for documents
comprising a month, a month and year, or a year; exemplary
hierarchical date queries are:
  "Show all pages that mention any date in a given month
  (e.g., Oct 2004)," or
  "Show all pages that mention any date in a given year (e.g.,
  2004)."
Range date querying queries content for documents
comprising a range of dates; an exemplary range date query is:
  "Show all pages that mention any date between two dates
  (e.g., 11 Oct 2004 and 22 Oct 2004)."
The packaging module 210 packages the canonicalized
date in the following formats: MMM-dd-yyyy, MMM-yyyy,
yyyy, and a 32-bit integer. The first format
MMM-dd-yyyy assists in performing specific date querying,
the second and third formats MMM-yyyy and yyyy
assist in performing hierarchical date querying, and the fourth
format assists in performing date range querying.
For example, to search for all occurrences of a specific date
such as 11 October 2004 in a specific date query, a user
requests Oct-11-2004. This query returns occurrences of
the requested date in all formats including 11.10.2004,
11-10-04, 2004 oct. 11, etc. Similarly, to search for occurrences of
any date in October 2004 in a hierarchical query, the user
requests Oct-2004. To search for any date in 2004, the user
requests 2004.
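A minimal Python sketch of how the three textual keys could serve specific and hierarchical queries is shown below. The helper names `canonical_keys` and `add_to_index`, the dictionary-based index, and the document identifiers are all hypothetical; the text above does not specify an index structure.

```python
from datetime import date

MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def canonical_keys(d):
    """Return the MMM-dd-yyyy, MMM-yyyy, and yyyy keys for a date."""
    mmm = MONTH_ABBR[d.month - 1]
    return [f"{mmm}-{d.day:02d}-{d.year}", f"{mmm}-{d.year}", f"{d.year}"]

def add_to_index(index, d, doc_id):
    """Record doc_id under every canonical key of the date."""
    for key in canonical_keys(d):
        index.setdefault(key, set()).add(doc_id)

index = {}
add_to_index(index, date(2004, 10, 11), "doc-A")  # e.g. found as 11.10.2004
add_to_index(index, date(2004, 10, 26), "doc-B")  # e.g. found as Oct 26 2004
```

A lookup of `index["Oct-11-2004"]` then answers a specific date query, while `index["Oct-2004"]` and `index["2004"]` answer hierarchical queries regardless of how each date was originally written.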
The packaging module 210 further packages the
canonicalized format of each identified date (step 305) to, for
example, support proximity queries comprising a date and
any other keywords. Support of proximity queries requires
knowledge of the location of a date in a document, a span of
the date, and a span associated with words in some or all of the
documents in the input 240. A span comprises a count of the
number of tokens in the date or in a phrase. In one
embodiment, the canonicalized format of a date generated by the
packaging module 210 comprises the location of the date
within a document. In another embodiment, the
canonicalized format of a date generated by the packaging module 210
comprises the span of the date within a document.
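To make the role of location and span concrete, the following Python sketch tests whether a keyword token falls within a given number of tokens of a date. The function name `within_proximity`, the use of token indices for location, and the `max_gap` parameter are assumptions for illustration; the text above does not define a specific proximity measure.

```python
def within_proximity(date_loc, date_span, keyword_loc, max_gap):
    """True if the keyword token lies within max_gap tokens of the
    date, which occupies tokens date_loc .. date_loc + date_span - 1."""
    date_end = date_loc + date_span - 1
    if keyword_loc < date_loc:
        gap = date_loc - keyword_loc
    elif keyword_loc > date_end:
        gap = keyword_loc - date_end
    else:
        gap = 0
    return gap <= max_gap

tokens = "Headlines Oct 26 2004 election results".split()
# The date "Oct 26 2004" starts at token 1 and spans 3 tokens.
near = within_proximity(1, 3, tokens.index("election"), max_gap=2)
```

Without the span, the distance from the end of a multi-token date to a following keyword could not be computed correctly.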
The packaging module 210 further packages the
canonicalized format of each identified date (step 305) to, for
example, support proximity queries comprising dates and any
other keywords with any other type of date querying such as,
for example, specific date querying, hierarchical date
querying, and range date querying.
`System 10 supports date querying of dB 30, outputting all
`occurrences of a queried date using the packaging generated
`by the packaging module 210 (step 310).
Exemplary canonical packages comprise MMM-dd-yyyy,
MMM-yyyy, yyyy, and a 32-bit integer. The
canonicalizer 235 indexes the formatted date according to the
canonical forms. For example, the canonicalizer 235 indexes
11 October 2004 as Oct-11-2004, Oct-2004, and
2004. Indexing a date in canonical forms allows the date to
match specific queries and hierarchical queries. Each date
will be converted to integers using, for example, 11 bits for the
year, 4 bits for the month, and 5 bits for the day. These 20 bits
are packaged as a 32-bit integer.
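The bit widths above can be illustrated with a small Python sketch. The bit order (year in the highest bits, then month, then day) is an assumption: the text gives only the widths, and this ordering is chosen here because it makes integer comparison agree with chronological order. The function names are hypothetical.

```python
def pack_date(year, month, day):
    """Pack a date into 20 bits: 11 for the year, 4 for the month,
    5 for the day. The result fits easily in a 32-bit integer."""
    if not (0 <= year < 2048 and 1 <= month <= 12 and 1 <= day <= 31):
        raise ValueError("date component out of range")
    return (year << 9) | (month << 5) | day

def unpack_date(packed):
    """Recover (year, month, day) from a packed integer."""
    return packed >> 9, (packed >> 5) & 0xF, packed & 0x1F

# Range query: any date between 11 Oct 2004 and 22 Oct 2004.
lo, hi = pack_date(2004, 10, 11), pack_date(2004, 10, 22)
in_range = lo <= pack_date(2004, 10, 15) <= hi
```

With this layout a range date query reduces to two integer comparisons per indexed date.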
A date ambiguity not resolved by the disambiguator 230 is
indexed for each possible interpretation, allowing any
interpretation of the ambiguous date to match a corresponding
user query. Indexing ambiguous dates increases recall for the
user.
FIG. 5 illustrates an exemplary screenshot 500 displaying
results of a single query for "all dates in Oct-2004." The
screenshot 500 comprises a menu bar 505 and one or more
search control button(s) 510. Listed in the screenshot 500 are
exemplary query responses 515, 520, 525, 530, 535
(collectively referenced as query responses 540). Additional query
responses may be found by the query but not shown in
screenshot 500.
`As seen in query responses 540, system 10 id
