WO2009000141A1 - Representation method, system and device of layout file logical structure information - Google Patents

Representation method, system and device of layout file logical structure information

Info

Publication number
WO2009000141A1
Authority
WO
WIPO (PCT)
Prior art keywords
content
file
logical structure
structure information
description
Prior art date
Application number
PCT/CN2008/000910
Other languages
French (fr)
Chinese (zh)
Inventor
Jing Qu
Zhensheng He
Yi Wang
Li Zhang
Original Assignee
Peking University Founder Group Co., Ltd.
Beijing Founder Apabi Technology Ltd.
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Founder Group Co., Ltd., Beijing Founder Apabi Technology Ltd., Peking University filed Critical Peking University Founder Group Co., Ltd.
Publication of WO2009000141A1 publication Critical patent/WO2009000141A1/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/103Formatting, i.e. changing of presentation of documents

Definitions

  • The present invention relates to a method and system for representing structural information of a computer electronic document, and more particularly to a method, system and apparatus for representing logical structure information of a layout file.
  • Background
  • Layout file technology converts original files of various formats into a unified format, faithfully preserves the layout and the text, graphics, formulas and colors of the original file during conversion, and achieves consistent display results across different terminal devices and reading software.
  • The layout file adopts an absolute description method. In a customized coordinate system, the position and size of each primitive (such as characters, pictures, tables, etc.) are explicitly recorded, so that the printed result of the document and the result of browsing it on a computer are consistent, and display consistency is achieved in any computing environment (such as a Windows system or the operating system of a PDA, a smart phone, etc.), ensuring that the original appearance of the document is faithfully reproduced.
  • the current layout file formats mainly include PDF (Portable Document Format) from Adobe, XPS (Xml Paper Specification) from Microsoft Corporation, and CEB (Chinese e-Paper Basic) from Beijing Founder Apabi Technology Co., Ltd. Electronic files in other formats (such as WPS, Microsoft Word, etc.) can also be easily converted into layout files.
  • The logical structure information of a document refers to the logical meaning of each part of the document, under a given interpretation, and the relationships between the parts, such as hierarchical information about document content like the title, body text, paragraphs, and tables.
  • The logical structure information of a document includes the logical units of the document and the hierarchical relationships between them. Each logical unit corresponds to a certain part of the document and is an abstract concept that humans can understand; the relationships between logical units represent logical combinations of these concepts. As shown in Figure 1, the logical units of an article may include a title, author, abstract, body, etc. These logical units form a tree structure, and each logical unit corresponds to one or more text blocks.
  • This type of logical structure information is not included in a large number of layout files.
  • Adobe's Tagged PDF technology represents the logical structure information of the document inside the layout file. It delimits logical units by adding special symbols to the content description instruction stream of the layout file. As shown in Figure 2, tags are added to the content data stream, and a Tag ... End Tag pair represents one logical unit.
  • This method has several drawbacks in practical applications. First, modifying, adding, or deleting the logical structure information of the document requires modifying the content instruction stream of the layout file, a process that is complicated and error-prone.
  • Second, the granularity of the instruction stream partitioning (one unit of granularity can be regarded as a logical unit) is limited. The minimum granularity is the entire content of one output instruction, so a given content fragment sometimes cannot be divided any further.
  • Embodiments of the present invention provide a method, system, and apparatus for representing logical structure information of a layout file, to solve the problems in the prior art that logical structure information in a layout file is handled inflexibly, is inconvenient to add or modify, and cannot meet users' needs.
  • An embodiment of the present invention provides a method for expressing logical structure information of a layout file, including the following steps:
  • the step of obtaining the logical structure information of the layout file includes:
  • the steps of obtaining the content reference sequence of the layout file include:
  • the content of the layout file is read, and the content reference sequence is generated according to the order in which the primitives in the content of the layout file appear in the content data stream or the traversal order of the document tree.
  • the steps of dividing the content reference sequence into a plurality of content reference subsequences include:
  • the content reference sequence is divided into a plurality of content reference sub-sequences according to the primitives in the contents of the layout file at the offset position of the content reference sequence or the primitive symbols in the content reference sequence.
  • The step of associating the content division description file with the logical unit description file includes: associating the content division description file with the logical unit description file by the number of the content reference subsequence.
  • the content division description file or the logical unit description file is a data block in a separate file or a layout file on the storage device.
  • The content division description file or the logical unit description file is described in a structured markup language.
  • An embodiment of the present invention further provides a system for representing logical structure information of a layout file, including: a logical structure information obtaining system, configured to obtain logical structure information of the layout file; a logical structure description generating module, configured to acquire a content reference sequence, divide the content reference sequence into a plurality of content reference sub-sequences according to the logical structure information, and generate a content division description file and a logical unit description file;
  • the logical structure description parsing module is configured to parse and associate the content partitioning description file and the logical unit description file.
  • the foregoing logical structure description generating module includes:
  • a content reference sequence generating module configured to read a layout file content, and generate a content reference sequence
  • a content division description generation module configured to divide the content reference sequence into a plurality of content reference sub-sequences according to the logical structure information, and generate Content division description file
  • the logical unit description generation module generates a logical unit description file according to the logical structure information.
  • the foregoing logical structure description generating module further includes: a storage device, configured to store a content reference sequence generated by the content reference sequence generating module, or a plurality of content reference sub-sequences divided by the content partitioning description generating module.
  • the foregoing logical structure description parsing module further includes:
  • a content reference sequence generating module configured to read the layout file content, and generate a content reference sequence
  • the content partitioning parsing module is configured to divide the content reference sequence into a plurality of content reference sub-sequences, and generate a content division description file.
  • the above logical structure description parsing module further includes:
  • the logical unit description parsing module is configured to read and parse the data in the logical unit description file
  • mapping module configured to associate the content division description file with the logical unit description file.
  • An embodiment of the present invention provides a device for displaying logical structure information of a layout file, including: a logical structure information acquiring module, configured to obtain logical structure information of a layout file;
  • a logical structure description generating module, configured to acquire a content reference sequence, divide the content reference sequence into a plurality of content reference sub-sequences according to the logical structure information, and generate a content division description file according to the plurality of content reference sub-sequences; and to generate a logical unit description file according to the logical structure information;
  • the logical structure description parsing module is configured to parse and associate the content partitioning description file and the logical unit description file.
  • The above technical solution divides the content reference sequence of the layout file into a plurality of content reference sub-sequences, generates a corresponding content division description file, generates a logical unit description file, and then associates the content division description file with the logical unit description file. As a result, the logical structure information and the layout file are separated from each other, any content in the layout file can be separately described and extracted, and the content can be described according to different document logical structure models. The description range is more precise, the representation of logical structure information is more flexible, and multiple logical structure descriptions can be added to the same layout file.
  • FIG. 1 is a schematic diagram of the logical structure information of an existing layout file
  • FIG. 2 is a schematic diagram showing the structure of the logical structure information of the document in the layout file by the existing Adobe Tagged PDF technology
  • FIG. 3 is a schematic diagram of a method for representing logical structure information of a layout file according to an embodiment of the present invention
  • FIG. 4 is a schematic diagram of relationship between logical structure information and a layout file of a layout file according to an embodiment of the present invention
  • FIG. 5 is a schematic diagram of a layout file and its content reference sequence according to an embodiment of the present invention;
  • FIG. 6 is a schematic diagram showing the structure of an offset position of the content reference sequence shown in FIG. 5;
  • FIG. 7 is a content division description file according to the content of the layout file document shown in FIG. 5;
  • FIG. 8 is a divisional description file according to another content of the layout file content shown in FIG. 5;
  • FIG. 9 is a logical unit description file according to the layout file shown in FIG. … or FIG. 8;
  • FIG. 10 is another logical unit description file according to the layout file shown in FIG. 6, FIG. 7, or FIG. 8.
  • FIG. 11 is another logical unit description file according to the layout file shown in FIG. 6, FIG. 7, or FIG. 8.
  • FIG. 13 is a schematic structural diagram of a logic structure description generation module in a logical structure information representation system of a layout file according to an embodiment of the present invention
  • FIG. 14 is a schematic structural diagram of a logical structure description parsing module in a logical structure information representation system of a layout file according to an embodiment of the present invention.
  • Detailed description
  • The method for representing the logical structure information of the layout file includes the following steps:
  • Step 31: Obtain the logical structure information and a content reference sequence of the layout file;
  • Step 32: Divide the content reference sequence into a plurality of content reference sub-sequences according to the logical structure information, and generate a content division description file;
  • Step 33: Generate a logical unit description file according to the logical structure information;
  • Step 34: Associate the content division description file with the logical unit description file.
  • The method divides the content reference sequence into a plurality of content reference sub-sequences, generates the corresponding content division description file, generates a logical unit description file, and then associates the two files, so that the logical structure information and the layout file are separated from each other and any content in the layout file can be described separately.
  • The description range is more precise and the representation of logical structure information is more flexible. Multiple logical structure descriptions can be added to the same layout file, and adding or modifying the logical structure information of a document does not require modifying the content description of the layout file, which reduces the possibility of error. This flexible representation can also describe the large number of existing layout files, improving compatibility without affecting existing systems.
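  • Steps 31 to 34 can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `represent_logical_structure`, the `(name, start, length)` form of the logical structure information, and the sample text are all hypothetical.

```python
def represent_logical_structure(primitives, logical_structure):
    """Sketch of Steps 31-34; all names and data shapes here are assumptions."""
    # Step 31: the content reference sequence is the primitives in stream order
    # (the logical structure information is passed in by the caller).
    sequence = list(primitives)

    division_description = {}  # number -> (start offset, length)
    unit_description = []      # (logical unit name, number)
    for number, (name, start, length) in enumerate(logical_structure, start=1):
        # Step 32: record one content reference subsequence per logical unit.
        division_description[number] = (start, length)
        # Step 33: record the logical unit and the number it refers to.
        unit_description.append((name, number))

    # Step 34: associate logical units with content through the number.
    associated = {}
    for name, number in unit_description:
        start, length = division_description[number]
        associated[name] = sequence[start:start + length]
    return division_description, unit_description, associated


primitives = list("Report by Ann")                      # one primitive per character
logical_structure = [("title", 0, 6), ("author", 10, 3)]
division_description, unit_description, associated = represent_logical_structure(
    primitives, logical_structure
)
print("".join(associated["title"]), "/", "".join(associated["author"]))
```

  • Note that the primitives themselves are never rewritten; both descriptions live outside the content, which is the property the method relies on.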
  • The logical structure information of the layout file may be obtained by analyzing an electronic document that already contains logical structure information, by marking the layout file with a computer application, or by a processing system based on document analysis and document understanding.
  • For an electronic document that already contains logical structure information, the document's own processing system can be used to extract it; for example, for Microsoft Word documents, Office automation objects can be used to get the logical structure information.
  • the user can mark the logical unit of the layout file through a computer application with a graphical interface. It is also possible to obtain its logical structure information through a processing system based on document analysis and document understanding.
  • the content of the layout file may be read first, and then the content reference is generated according to the order in which the primitives (such as characters, pictures, tables, etc.) in the content of the layout file appear in the content data stream or the traversal order of the document tree. sequence.
  • A content reference sequence is an ordered collection of information about the primitives in a layout file.
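  • As a rough illustration of generating a content reference sequence from the order in which primitives appear in the content data stream, the sketch below pairs each primitive with its offset. The `Primitive` fields and the sample stream are assumptions made for the example; the source does not define a concrete data layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Primitive:
    """Hypothetical primitive record: a kind, a value, and an absolute position."""
    kind: str    # e.g. "char", "image", "table"
    value: str   # the character, or a resource name for non-text primitives
    x: float
    y: float


def build_content_reference_sequence(stream):
    """Return (offset, primitive) pairs in the order the primitives appear."""
    return list(enumerate(stream))


stream = [
    Primitive("char", "H", 10.0, 20.0),
    Primitive("char", "i", 16.0, 20.0),
    Primitive("image", "logo.png", 10.0, 40.0),
]
sequence = build_content_reference_sequence(stream)
offsets = [offset for offset, _ in sequence]
kinds = [primitive.kind for _, primitive in sequence]
print(offsets, kinds)
```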
  • For the layout file 43 shown in FIG. 4 (the CEB file Sample.ceb), a logical unit description file 41 and a content division description file 42 are generated according to the logical structure information acquired above.
  • In this embodiment, the logical unit description file 41 and the content division description file 42 are described in XML.
  • the logical unit description file 41 and the content division description file 42 herein may also be described by other structured markup languages, such as the SGML language.
  • The content reference sequence may be divided into multiple content reference sub-sequences according to the offset positions of the primitives of the layout file content in the content reference sequence, or according to the primitive symbols in the content reference sequence, and each content reference subsequence is assigned a number. This number can be saved in the content division description file.
  • As shown in FIG. 5, the layout file 51 has a document content data stream description 52, which contains text primitives.
  • Figure 6 is a specific embodiment of the logical structure in accordance with the layout file 51 of Figure 5.
  • 61 is the content reference sequence of the layout file; it is arranged in the order in which the primitives appear in the content description 52.
  • 62 represents the offset positions of the primitives in the content reference sequence.
  • 71 or 81 is a content division description file. The description file defines each division by specifying the starting offset position of the content reference subsequence in the content reference sequence and the length of the subsequence.
  • Each division is given a unique number, PID. As shown in Figure 7, number 8 corresponds to the "before the bed, moonlight," subsequence, and number 9 corresponds to the "suspicious ground frost, head to see the moon," subsequence.
  • The content division description files of FIG. 7 and FIG. 8 can exist at the same time for the same layout file.
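  • The division by starting offset and length, and the coexistence of two divisions over one content reference sequence, can be sketched as follows. The `Division` record, the helper `slice_subsequence`, and the sample text are assumptions for illustration only; they are not the contents of FIG. 7 or FIG. 8.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Division:
    """Hypothetical entry of a content division description file."""
    pid: int      # unique number of the content reference subsequence
    start: int    # starting offset in the content reference sequence
    length: int   # number of primitives in the subsequence


def slice_subsequence(sequence, division):
    """Return the primitives covered by one division, checking its bounds."""
    end = division.start + division.length
    if division.start < 0 or division.length < 0 or end > len(sequence):
        raise ValueError(f"division {division.pid} falls outside the sequence")
    return sequence[division.start:end]


sequence = list("Title Body text")  # one primitive per character

# Two independent divisions of the same sequence, at different granularity.
coarse = [Division(1, 0, 5), Division(2, 6, 9)]
fine = [Division(1, 0, 5), Division(2, 6, 4), Division(3, 11, 4)]

coarse_text = {d.pid: "".join(slice_subsequence(sequence, d)) for d in coarse}
fine_text = {d.pid: "".join(slice_subsequence(sequence, d)) for d in fine}
print(coarse_text)
print(fine_text)
```

  • Because a division is only an offset and a length, it can split content at any primitive boundary, which is finer than the per-instruction granularity described for the tag-based approach.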
  • 91, 101 and 111 in FIG. 9, FIG. 10 and FIG. 11 are logical unit description files written in XML; a logical unit can be associated with a content reference subsequence through the PID of that subsequence.
  • The logical unit description file in step 33 above includes the logical units of the layout file and the relationships between them, as shown in FIG. 9, FIG. 10 and FIG. 11. A structured description language such as XML or SGML can be used to describe the logical units and their relationships, and the relationships between logical units can reflect the reading order of the layout file.
  • the content division description file may be associated with the logical unit description file by the number given above for the content reference subsequence.
  • the logical unit and its corresponding content reference subsequence can be associated by the number of the content reference subsequence.
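  • For example, the logical unit description entry ending in 8"/> is associated with the "before the bed, moonlight" content reference subsequence, that is, the subsequence numbered 8.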
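  • A logical unit description file in XML might look like the sketch below, which also shows how the numbers it references can be read back. The element names `document`, `unit` and `ref` and the attribute `pid` are hypothetical; the actual markup of FIG. 9 to FIG. 11 is not reproduced in this text.

```python
import xml.etree.ElementTree as ET

# Hypothetical logical unit description: nested units, each referencing
# content reference subsequences by number.
LOGICAL_UNITS_XML = """
<document>
  <unit name="title"><ref pid="1"/></unit>
  <unit name="body">
    <unit name="paragraph"><ref pid="2"/><ref pid="3"/></unit>
  </unit>
</document>
"""


def collect_units(element, path=()):
    """Yield (path, [numbers]) for every unit, parents before children."""
    for unit in element.findall("unit"):
        unit_path = path + (unit.get("name"),)
        numbers = [int(ref.get("pid")) for ref in unit.findall("ref")]
        yield unit_path, numbers
        yield from collect_units(unit, unit_path)


root = ET.fromstring(LOGICAL_UNITS_XML)
units = list(collect_units(root))
for unit_path, numbers in units:
    print("/".join(unit_path), numbers)
```

  • The nesting carries the hierarchy between logical units, and document order within the file can carry the reading order.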
  • the content division description file or the logical unit description file in the above embodiment may be a separate file on the storage device, so that the logical structure information and the layout file are separated from each other, and the representation of the logical structure information is more flexible.
  • the content division description file or the logical unit description file in the above embodiment may also be a data block in the layout file.
  • The embodiment of the present invention further provides a system for representing the logical structure information of the layout file, including: a logical structure information acquiring system, configured to obtain logical structure information of the layout file; a logical structure description generating module, configured to obtain a content reference sequence from the layout file parsing system, divide the obtained content reference sequence into multiple content reference sub-sequences according to the logical structure information, and generate a content division description file and a logical unit description file;
  • the logical structure description parsing module is configured to parse and associate the content partitioning description file and the logical unit description file.
  • the logical structure description generating module in FIG. 12 above includes:
  • a content reference sequence generating module configured to read the content of the layout file, and generate a content reference sequence in a specified order; the specified order may be a sequence in which the primitives in the content of the layout file appear in the content data stream, or may be a traversal of the document tree order.
  • a content division description generating module, configured to divide the content reference sequence into a plurality of content reference sub-sequences according to the logical structure information, and generate a content division description file; the division may be made according to the offset positions of the primitives of the layout file content in the content reference sequence, or according to the primitive symbols in the content reference sequence, and a number is assigned to each content reference subsequence; the number can be saved in the content division description file.
  • a logical unit description generating module, configured to generate a logical unit description file according to the logical structure information, where the logical unit description file includes a plurality of logical units and the relationships between them; a structured description language such as XML or SGML may be used to describe the logical units and their relationships, and the relationships between logical units can reflect the reading order of the layout file.
  • The foregoing logical structure description generating module may further include: a storage device, configured to store the content reference sequence generated by the content reference sequence generating module, the plurality of content reference sub-sequences divided by the content division description generating module, or the logical unit description file generated by the logical unit description generating module.
  • the above content reference sequence and content reference subsequence may or may not be stored in the storage device.
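  • When the content reference sequence generating module follows the traversal order of the document tree instead of the content data stream, the result depends on which traversal is used. The sketch below assumes a pre-order, depth-first traversal; the source does not say which traversal is intended, and the `Node` type and sample tree are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """Hypothetical document tree node; leaf nodes carry a primitive."""
    label: str
    primitive: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def traverse(node: Node) -> Iterator[str]:
    """Pre-order depth-first walk that yields primitives in visiting order."""
    if node.primitive is not None:
        yield node.primitive
    for child in node.children:
        yield from traverse(child)


tree = Node("page", children=[
    Node("block", children=[Node("glyph", "A"), Node("glyph", "B")]),
    Node("image", "fig1.png"),
])
sequence = list(traverse(tree))
print(sequence)
```

  • Whatever order is chosen, the generating side and the parsing side need to use the same one, since offsets recorded in the content division description file are only meaningful against that order.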
  • the logical structure description parsing module in FIG. 12 above includes:
  • a logical unit description parsing module, configured to read and parse the data in the logical unit description file; and a mapping module, configured to associate the content division description file with the logical unit description file. Specifically, a logical unit and its corresponding content reference subsequence can be associated by the number of the content reference subsequence.
  • The logical structure description parsing module should also include the following modules:
  • a content reference sequence generating module configured to read a layout file content, and generate a content reference sequence
  • a content division description parsing module, configured to divide the content reference sequence into a plurality of content reference sub-sequences according to the logical structure information, and generate a content division description file.
  • In practical applications, using the content reference sequence generating module and the content division description parsing module to regenerate the content reference sequence and the content division description file is faster and more efficient than reading large amounts of content reference sequence and content division description data from the storage device.
  • the logical structure description generation module works as follows:
  • the logical structure information acquisition system obtains the logical structure information of the layout file.
  • For an electronic document that already contains logical structure information, the document's own processing system can be used to extract it; for example, for a Microsoft Word document, Office automation objects can be used to get the logical structure information.
  • Alternatively, the user can mark the logical units of the layout file through a computer application with a graphical interface. It is also possible to obtain the logical structure information through a processing system based on document analysis and document understanding.
  • the content reference sequence generation module uses the layout file parsing system to arrange the contents of the layout file into an ordered sequence according to a certain order, and obtain a content reference sequence of the layout file.
  • the content division description generation module divides the content reference sequence according to the logical structure information obtained in the above-mentioned logical structure information acquisition system, and outputs a content division description file.
  • The logical unit description generation module outputs a logical unit description file according to the logical structure information obtained by the logical structure information acquisition system.
  • the content partitioning description file and the logical unit description file can be embedded in the layout file or saved separately.
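  • Saving the content division description separately from the layout file could be sketched as below, using XML as the structured markup language. The element name `contentDivision`, the `division` entries and their attributes are assumptions for illustration; only the file name Sample.ceb comes from the embodiment.

```python
import xml.etree.ElementTree as ET


def build_division_description(layout_name, divisions):
    """Serialize {number: (start, length)} into a standalone XML description."""
    root = ET.Element("contentDivision", {"layout": layout_name})
    for number, (start, length) in sorted(divisions.items()):
        ET.SubElement(
            root,
            "division",
            {"pid": str(number), "start": str(start), "length": str(length)},
        )
    return ET.tostring(root, encoding="unicode")


xml_text = build_division_description("Sample.ceb", {2: (6, 9), 1: (0, 5)})
print(xml_text)

# Reading it back shows that the description stands on its own.
parsed = ET.fromstring(xml_text)
pids = [division.get("pid") for division in parsed.findall("division")]
```

  • The same string could equally be stored as a data block inside the layout file; the sketch does not depend on where it is kept.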
  • the logical structure description parsing module works as follows:
  • The content reference sequence generation module again uses the layout file parsing system to arrange the content data of the layout file into an ordered sequence in a certain order, producing the content reference sequence.
  • the content division description parsing module reads the content division description file, and divides the content reference sequence obtained in the logic structure description generation module shown in FIG. 13 above.
  • the logical unit description parsing module reads the logical unit description file in the logical structure description generating module shown in Fig. 13 above and verifies its validity.
  • the mapping module associates the logical unit with the content reference subsequence according to the content reference sub-sequence number in the content partition description file and the logical unit description file.
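  • The association performed by the mapping module, together with a basic validity check, can be sketched as follows. The function name `associate`, the dictionary shapes and the sample data are hypothetical, and the check shown (every referenced number must exist in the content division description) is just one plausible form of validation.

```python
def associate(sequence, divisions, units):
    """Map each logical unit to its primitives via subsequence numbers.

    divisions: {number: (start, length)}, from the content division description
    units:     {unit name: [numbers]},    from the logical unit description
    """
    referenced = {number for numbers in units.values() for number in numbers}
    missing = sorted(referenced - set(divisions))
    if missing:
        raise KeyError(f"logical units reference unknown numbers: {missing}")

    result = {}
    for name, numbers in units.items():
        parts = []
        for number in numbers:
            start, length = divisions[number]
            parts.extend(sequence[start:start + length])
        result[name] = parts
    return result


sequence = list("Title Body text")
divisions = {1: (0, 5), 2: (6, 4), 3: (11, 4)}
units = {"title": [1], "paragraph": [2, 3]}
result = associate(sequence, divisions, units)
print({name: "".join(parts) for name, parts in result.items()})
```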
  • External systems interacting with the system may include a layout file parsing system, a logical structure information acquisition system, and other document processing systems.
  • Other document processing systems may be format conversion systems, layout rearrangement systems, and the like. These systems use logical structure information to process layout files, for example to extract information, rearrange pages, or convert to other formats.
  • the content division description file and the logical unit description file described above may be saved in the layout document or may be separately saved as a separate file from the layout file. For the same layout file, you can have multiple logical structure information descriptions.
  • the embodiment of the present invention further provides a device for expressing logical structure information of a layout file, where the device includes a logical structure information acquiring module, a logical structure description generating module, and a logical structure description parsing module, where:
  • a logical structure information obtaining module configured to obtain logical structure information of the layout file
  • a logical structure description generating module, configured to acquire a content reference sequence, divide the content reference sequence into a plurality of content reference sub-sequences according to the logical structure information, and generate a content division description file according to the plurality of content reference sub-sequences; and to generate a logical unit description file according to the logical structure information;
  • a logical structure description parsing module, configured to parse and associate the content division description file and the logical unit description file.
  • the logical structure description generation module includes a content reference sequence generation module, a content division description generation module, and a logic unit description generation module, where:
  • a content reference sequence generating module configured to read the content of the layout file, and generate a content reference sequence in a specified order; the specified order may be a sequence in which the primitives in the content of the layout file appear in the content data stream, or may be a traversal of the document tree order.
  • a content division description generating module, configured to divide the content reference sequence into a plurality of content reference sub-sequences according to the logical structure information, and generate a content division description file; the division may be made according to the offset positions of the primitives of the layout file content in the content reference sequence, or according to the primitive symbols in the content reference sequence, and a number is assigned to each content reference subsequence; the number can be saved in the content division description file.
  • a logic unit description generating module configured to generate a logical unit description file according to the logical structure information, where the logical unit description file includes a plurality of logical units and a relationship between the logical units, and the structured description language may be used to describe the logic Units and their relationships, such as XML, SGML language, and the relationship between logical units can reflect the reading order of the layout files.
  • The foregoing logical structure description generating module may further include: a storage device, configured to store the content reference sequence generated by the content reference sequence generating module, the plurality of content reference sub-sequences divided by the content division description generating module, or the logical unit description file generated by the logical unit description generating module.
  • the above content reference sequence and content reference subsequence may or may not be stored in the storage device.
  • the logical structure description parsing module includes a logical unit description parsing module and a mapping module, where:
  • a logic unit description parsing module configured to read and parse data in the logic unit description file
  • a mapping module, configured to associate the content division description file with the logical unit description file.
  • the logical unit and its corresponding content reference subsequence may be associated by the number of the content reference subsequence.
  • The logical structure description parsing module should also include the following modules:
  • a content reference sequence generating module, configured to read the content of the layout file and generate a content reference sequence
  • a content division description parsing module, configured to divide the content reference sequence into a plurality of content reference sub-sequences according to the logical structure information, and generate a content division description file.
  • The method and system of the present invention divide the content reference sequence of a layout file into a plurality of content reference sub-sequences, generate a corresponding content division description file, generate a logical unit description file, and then associate the content division description file with the logical unit description file, so that the logical structure information and the layout file are separated from each other, any content in the layout file can be separately described and extracted, and the content can be described according to different document logical structure models.
  • the description range is more accurate, the representation of logical structure information is more flexible, and multiple logical structure information descriptions can be added to the same layout file, that is, the same layout file can have multiple content division description files and logical unit description files, and is added.

Abstract

A method, system and device for representing logical structure information of a layout file are disclosed, relating to information representation in computer information processing. The invention addresses the problems that existing layout files handle logical structure information inflexibly and are not easy to add to or modify. The method obtains the logical structure information and the content reference sequence of a layout file; divides the content reference sequence into multiple content reference sub-sequences according to the logical structure information and creates a content division description file; creates a logical unit description file according to the logical structure information; and associates the content division description file with the logical unit description file. The logical structure information of the layout file is thereby represented effectively and flexibly without modifying the original layout file, and any content in the layout file can be separately described with logical structure information, extracted, and reused under different document logical structure models.

Description

Method, System and Device for Representing Logical Structure Information of a Layout File

Technical Field

The present invention relates to a method and system for representing structural information of computer electronic documents, and in particular to a method, system and device for representing logical structure information of a layout file.

Background
Layout file technology converts files of various formats into a unified format while faithfully preserving their original appearance. The conversion retains the layout and content of the original file, including text, charts, formulas and colors, so that the file displays consistently across different terminal devices and reading software. A layout file uses absolute description: within a custom coordinate system, it explicitly records the position and size of every primitive (such as a character, image or table). As a result, the printed document matches what is viewed on screen, and the display is consistent in any computing environment (such as Windows, or the operating system of a terminal such as a PDA or smartphone), so the original appearance of the document is faithfully reproduced.
Current layout file formats mainly include PDF (Portable Document Format) from Adobe, XPS (XML Paper Specification) from Microsoft, and CEB (Chinese e-Paper Basic) from Beijing Founder Apabi Technology Ltd. Electronic files in other formats (such as WPS or Microsoft Word files) can also be easily converted into layout files.
Because layout files are relatively stable, they are well suited as the final publishing and distribution form of electronic documents and are widely used for electronic official documents, e-books, electronic journals, electronic newspapers and similar fields. However, a layout file describes local information absolutely (that is, the display position of text is specified explicitly in the layout file's coordinates and is independent of the logical order of the text), which makes it hard to edit: every change to the document content requires recalculating the layout and rewriting the layout information of the whole document. Retrieval, structured storage, modification and other editing operations on layout file content are therefore cumbersome. Meanwhile, the variety of client devices is growing (PDAs, smartphones and so on), and users expect to read layout files conveniently on all of them. This requires the client to overcome the fixed display of a layout file and reflow its content according to the screen size of the display device, so that the document can be read continuously without dragging horizontal or vertical scroll bars. All of these applications need the document logical structure information of the layout file. The logical structure information of a document is the logical meaning of each part of the document's content under some interpretation, together with the relationships between the parts, for example hierarchical information reflecting the document's title, body, paragraphs and tables.
The logical structure information of a document comprises the document's logical units and the hierarchical relationships among them. Each logical unit corresponds to some part of the document content. A logical unit is an abstract concept that humans can understand, and the relationships among logical units represent a logical combination of those concepts. As shown in FIG. 1, the logical units of an article may include a title, author, abstract, body and so on; these logical units form a tree structure, and each logical unit corresponds to one or more text blocks.
At present, a large number of layout files contain no such logical structure information. Adobe's Tagged PDF technology does represent document logical structure information in a layout file: it delimits logical units by inserting special symbols into the content description instruction stream of the layout file. As shown in FIG. 2, tag marks are added to the content data stream, and a Tag ... End Tag pair denotes one logical unit. This approach has several drawbacks in practice. First, modifying, adding or deleting document logical structure information requires modifying the content instruction stream of the layout file, a process that is complex and error-prone. Second, the granularity with which the instruction stream can be divided is limited (one granule can be regarded as one logical unit): the smallest granule is the entire content of a single output instruction, so a content fragment may be impossible to divide further.
Two problems therefore remain: many existing layout files contain no document logical structure information even though applications need to add it, and layout files that do contain such information handle it inflexibly, making it hard to add to or modify and failing to meet user needs. Designing a method for representing document logical structure information in layout files is thus of significant value for their practical application.

Summary of the Invention
Embodiments of the present invention provide a method, system and device for representing logical structure information of a layout file, to solve the prior-art problems that layout files handle logical structure information inflexibly, are inconvenient to add to and modify, and cannot meet user needs.
An embodiment of the present invention provides a method for representing logical structure information of a layout file, comprising the following steps:
obtaining the logical structure information and the content reference sequence of the layout file;
dividing the content reference sequence into a plurality of content reference sub-sequences according to the logical structure information, and generating a content division description file;
generating a logical unit description file according to the logical structure information;
associating the content division description file with the logical unit description file.
The step of obtaining the logical structure information of the layout file comprises:
annotating the layout file with a computer application, or obtaining the logical structure information of the layout file through a processing system based on document analysis and document understanding.
The step of obtaining the content reference sequence of the layout file comprises:
reading the layout file content and generating the content reference sequence according to either the order in which the primitives of the layout file content appear in the content data stream or the traversal order of the document tree.
The step of dividing the content reference sequence into a plurality of content reference sub-sequences comprises:
dividing the content reference sequence into a plurality of content reference sub-sequences according to the offset positions of the primitives of the layout file content in the content reference sequence, or according to primitive symbols in the content reference sequence.
A number may be assigned to each of the content reference sub-sequences.
The step of associating the content division description file with the logical unit description file comprises:
associating the content division description file with the logical unit description file through the numbers of the content reference sub-sequences.
The content division description file or the logical unit description file is either an independent file on a storage device or a data block within the layout file.
The content division description file or the logical unit description file is described in a structured markup language.
An embodiment of the present invention further provides a system for representing logical structure information of a layout file, comprising:
a logical structure information acquisition system, configured to obtain the logical structure information of the layout file;
a logical structure description generating module, configured to obtain a content reference sequence, divide the content reference sequence into a plurality of content reference sub-sequences according to the logical structure information, and generate a content division description file and a logical unit description file;
a logical structure description parsing module, configured to parse and associate the content division description file and the logical unit description file.
The logical structure description generating module comprises:
a content reference sequence generating module, configured to read the layout file content and generate the content reference sequence;
a content division description generating module, configured to divide the content reference sequence into a plurality of content reference sub-sequences according to the logical structure information, and to generate a content division description file;
a logical unit description generating module, configured to generate a logical unit description file according to the logical structure information.
The logical structure description generating module further comprises a storage device, configured to store the content reference sequence generated by the content reference sequence generating module, or the plurality of content reference sub-sequences produced by the content division description generating module.
When the content reference sequence and the content reference sub-sequences have not been saved to the storage device, the logical structure description parsing module additionally needs to comprise:
a content reference sequence generating module, configured to read the layout file content and generate the content reference sequence;
a content division description parsing module, configured to divide the content reference sequence into a plurality of content reference sub-sequences and generate a content division description file.
The logical structure description parsing module further comprises:
a logical unit description parsing module, configured to read and parse the data in the logical unit description file;
a mapping module, configured to associate the content division description file with the logical unit description file.
An embodiment of the present invention provides a device for representing logical structure information of a layout file, comprising:
a logical structure information acquisition module, configured to obtain the logical structure information of the layout file;
a logical structure description generating module, configured to obtain a content reference sequence, divide the content reference sequence into a plurality of content reference sub-sequences according to the logical structure information, generate a content division description file according to the plurality of content reference sub-sequences, and generate a logical unit description file according to the logical structure information;
a logical structure description parsing module, configured to parse and associate the content division description file and the logical unit description file.
By dividing the content reference sequence of a layout file into a plurality of content reference sub-sequences, generating a corresponding content division description file, generating a logical unit description file, and then associating the content division description file with the logical unit description file, the above technical solution separates the logical structure information from the layout file. Any content in the layout file can be given its own logical structure description and extracted, and can be described according to different document logical structure models; the scope of a description is more precise and the representation of logical structure information more flexible. Multiple document logical structure descriptions can also be added to the same layout file. Adding or modifying document logical structure information does not require modifying the content description of the layout file, which reduces the chance of errors. Moreover, this flexible representation can describe the large number of existing layout files without affecting existing systems, which improves compatibility.

Brief Description of the Drawings
FIG. 1 is a schematic diagram of the representation of logical structure information in an existing layout file;
FIG. 2 is a schematic diagram of how Adobe's existing Tagged PDF technology represents document logical structure information in a layout file;
FIG. 3 is a schematic diagram of the method for representing logical structure information of a layout file according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the relationship between the logical structure information of a layout file and the layout file according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a layout file and its content reference sequence according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the offset positions in the content reference sequence shown in FIG. 5;
FIG. 7 is a content division description file for the document content of the layout file shown in FIG. 5;
FIG. 8 is another content division description file for the document content of the layout file shown in FIG. 5;
FIG. 9 is a logical unit description file for the layout file shown in FIG. 6, FIG. 7 or FIG. 8;
FIG. 10 is another logical unit description file for the layout file shown in FIG. 6, FIG. 7 or FIG. 8;
FIG. 11 is yet another logical unit description file for the layout file shown in FIG. 6, FIG. 7 or FIG. 8;
FIG. 12 is a schematic structural diagram of the system for representing logical structure information of a layout file according to an embodiment of the present invention;
FIG. 13 is a schematic structural diagram of the logical structure description generating module in the system according to an embodiment of the present invention;
FIG. 14 is a schematic structural diagram of the logical structure description parsing module in the system according to an embodiment of the present invention.

Detailed Description
The technical solution of the present invention is described below with reference to specific embodiments.
As shown in FIG. 3, the method for representing logical structure information of a layout file comprises the following steps:
Step 31: obtain the logical structure information and the content reference sequence of the layout file;
Step 32: divide the content reference sequence into a plurality of content reference sub-sequences according to the logical structure information, and generate a content division description file;
Step 33: generate a logical unit description file according to the logical structure information;
Step 34: associate the content division description file with the logical unit description file.
By dividing the content reference sequence of the layout file into a plurality of content reference sub-sequences, generating a corresponding content division description file, generating a logical unit description file, and then associating the two files, the method separates the logical structure information from the layout file. Any content in the layout file can be given its own logical structure description and extracted, and can be described according to different document logical structure models; the scope of a description is more precise and the representation of logical structure information more flexible. Multiple document logical structure descriptions can also be added to the same layout file. Adding or modifying document logical structure information does not require modifying the content description of the layout file, which reduces the chance of errors, and this flexible representation can describe the large number of existing layout files without affecting existing systems, which improves compatibility.
In step 31, the logical structure information of the layout file may be obtained by analyzing an electronic document that already contains logical structure information, by annotating the layout file with a computer application, or by means of a processing system based on document analysis and document understanding.
For example, for an electronic document that corresponds to the layout file and already contains logical structure information, such as an HTML or Microsoft Word document, the document processing system for that document can be used to extract the logical structure information; for a Microsoft Word document, Office automation objects can be used. Alternatively, a user can annotate the logical units of the layout file through a computer application with a graphical interface. The logical structure information can also be obtained through a processing system based on document analysis and document understanding.
In step 31, the layout file content may first be read, and the content reference sequence then generated according to either the order in which the primitives (such as characters, images and tables) appear in the content data stream or the traversal order of the document tree. The content reference sequence is the ordered collection of primitive information in the layout file. FIG. 4 shows a layout file 43, here a CEB file named Sample.ceb. From the logical structure information obtained above, a logical unit description file 41 and a content division description file 42 are generated. In this embodiment, XML is used to describe the logical units of layout file 43 and the relationships among them (for example, Document_structure.xml), and XML is likewise used to describe the content division (for example, Piece.xml). The logical unit description file 41 and the content division description file 42 may also be described in another structured markup language, such as SGML.
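As a minimal sketch of how a content reference sequence might be built in stream order, the Python snippet below pairs each primitive with its offset. The `Primitive` type, its fields and the sample coordinates are illustrative assumptions and not a format defined by the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Primitive:
    """Hypothetical primitive record: a kind, a payload and an absolute position."""
    kind: str    # e.g. "char", "image", "table"
    value: str   # payload; a single character for text primitives
    x: float
    y: float


def build_content_reference_sequence(stream):
    """Pair every primitive with its offset, in content-stream order."""
    return list(enumerate(stream))


# Assumed sample: the stream emits a later character first,
# so stream order differs from reading order.
stream = [
    Primitive("char", "霜", 40.0, 20.0),
    Primitive("char", "床", 0.0, 0.0),
    Primitive("char", "前", 10.0, 0.0),
]
sequence = build_content_reference_sequence(stream)
for offset, prim in sequence:
    print(offset, prim.value)
```

The sequence deliberately keeps stream order rather than reading order; under this sketch, reading order is left to the logical unit description.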
In step 32, the content reference sequence may be divided into a plurality of content reference sub-sequences according to the offset positions of the primitives of the layout file content in the content reference sequence, or according to primitive symbols in the content reference sequence, and a number is assigned to each content reference sub-sequence. The number may be saved in the content division description file.
As shown in FIGS. 5 to 8, a layout file displayed as 51 has a document content data stream described as 52, which contains text primitives. FIG. 6 is a specific embodiment based on the logical structure of layout file 51 in FIG. 5. Here 61 is the content reference sequence of the layout file, arranged in the order in which the primitives appear in content description 52, and 62 indicates the offset position of each primitive in the content reference sequence. 71 or 81 is a content division description file, which defines the division by specifying the starting offset of each content reference sub-sequence in the content reference sequence and the length of the sub-sequence. Each division is assigned a unique number, PID. As shown in FIG. 7, number 8 corresponds to the sub-sequence "床前明月光，" and number 9 corresponds to the sub-sequence "疑是地上霜，举头望明月，". In practice, the two content division description files shown in FIG. 7 and FIG. 8 can coexist.
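The division by starting offset and length can be sketched as follows. This is only an illustration: the element and attribute names (`pieces`, `piece`, `pid`, `start`, `length`) and the offsets are assumptions, because the actual contents of FIG. 7 are not reproduced here, and the sequence is assumed to be in reading order for simplicity.

```python
import xml.etree.ElementTree as ET

# Content reference sequence (assumed to be in reading order for this sketch).
sequence = "床前明月光，疑是地上霜，举头望明月，低头思故乡。"

# Hypothetical divisions: PID -> (start offset, length).
divisions = {8: (0, 6), 9: (6, 12)}


def subsequence(pid):
    """Slice the content reference sub-sequence identified by a PID."""
    start, length = divisions[pid]
    return sequence[start:start + length]


def to_xml(divs):
    """Serialize the divisions as a small content division description."""
    root = ET.Element("pieces")
    for pid, (start, length) in sorted(divs.items()):
        ET.SubElement(root, "piece", pid=str(pid), start=str(start), length=str(length))
    return ET.tostring(root, encoding="unicode")


print(subsequence(8))
print(subsequence(9))
print(to_xml(divisions))
```

Because a division is just an offset and a length, it can be as fine as a single primitive, and a second description with different boundaries can sit alongside the first without touching the layout file.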
In FIGS. 9, 10 and 11, 91, 101 or 111 is a logical unit description file in XML. A logical unit can be associated with a content reference sub-sequence through the PID of that sub-sequence. In FIG. 9, <line="9"/> is one logical unit and <line="8"/> is another. The figure also shows that traversing logical unit description file 91 in pre-order yields the document content in the reading order of layout file 51 in FIG. 5, even though the content description data stream 52 shown in FIG. 5 is not output in reading order.
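The pre-order property can be illustrated with a small sketch. The XML below is a hypothetical logical unit description (its tag names, the `pid` attribute and the PID values are assumptions, not the content of FIG. 9); walking it in document order yields the PIDs in reading order.

```python
import xml.etree.ElementTree as ET

# Hypothetical logical unit description; tag and attribute names are assumptions.
LOGICAL_UNITS = """
<document>
  <title><line pid="3"/></title>
  <body>
    <paragraph><line pid="8"/><line pid="9"/></paragraph>
    <paragraph><line pid="5"/></paragraph>
  </body>
</document>
"""


def reading_order(xml_text):
    """Return PIDs in pre-order (document order) of the logical unit tree."""
    root = ET.fromstring(xml_text)
    # Element.iter() walks the tree depth-first in document order (pre-order).
    return [int(el.get("pid")) for el in root.iter() if el.get("pid") is not None]


print(reading_order(LOGICAL_UNITS))
```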
In step 33, the logical unit description file comprises the logical units of the layout file and the relationships among them, as shown in FIGS. 9, 10 and 11. The logical units and their relationships may be described in a structured description language such as XML or SGML, and the relationships among logical units can reflect the reading order of the layout file.
In step 34, the content division description file may be associated with the logical unit description file through the numbers assigned to the content reference sub-sequences. Specifically, each logical unit is associated with its corresponding content reference sub-sequence by that sub-sequence's number. For example, number 8 in FIG. 9 corresponds to offset address 113 in FIG. 7, and that offset address corresponds to the content reference sub-sequence "床前明月光" in FIG. 6; that is, number 8 associates logical unit <line="8"/> with the content reference sub-sequence "床前明月光".
The content division description file or the logical unit description file in the above embodiment may be an independent file on a storage device, so that the logical structure information is separated from the layout file and its representation is more flexible.
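The association in step 34 can be sketched as a lookup by PID. Both XML fragments, their tag and attribute names, and the offsets are hypothetical stand-ins for the two description files; the sequence is assumed to be in reading order.

```python
import xml.etree.ElementTree as ET

sequence = "床前明月光，疑是地上霜，举头望明月，低头思故乡。"

# Hypothetical content division description (PID, start offset, length).
PIECES = ('<pieces><piece pid="8" start="0" length="6"/>'
          '<piece pid="9" start="6" length="12"/></pieces>')

# Hypothetical logical unit description referring to pieces by PID.
UNITS = '<poem><line pid="8"/><line pid="9"/></poem>'


def load_divisions(xml_text):
    """Map each PID to its (start, length) pair."""
    return {
        int(p.get("pid")): (int(p.get("start")), int(p.get("length")))
        for p in ET.fromstring(xml_text).iter("piece")
    }


def resolve(units_xml, pieces_xml, seq):
    """Associate every logical unit that carries a PID with its sub-sequence text."""
    divisions = load_divisions(pieces_xml)
    resolved = []
    for unit in ET.fromstring(units_xml).iter():
        pid = unit.get("pid")
        if pid is not None:
            start, length = divisions[int(pid)]
            resolved.append((unit.tag, int(pid), seq[start:start + length]))
    return resolved


for tag, pid, text in resolve(UNITS, PIECES, sequence):
    print(tag, pid, text)
```

Since the layout content is only referenced by offset, a different `UNITS` document could be resolved against the same `PIECES` without modifying either the layout file or the division.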
当然, 上述实施例中的内容划分描述文件或者逻辑单元描述文件也可以 为版式文件中的一个数据块。  Of course, the content division description file or the logical unit description file in the above embodiment may also be a data block in the layout file.
如图 12所示, 与上迷版式文件逻辑结构信息的表示方法相应的, 本发明 实施例还提供一种版式文件逻辑结构信息的表示系统, 包括: 逻辑结构信息获取系统, 用于获取版式文件的逻辑结构信息; 逻辑结构描述生成模块, 用于从版式文件解析系统中获取内容参考序列, 并根据逻辑结构信息将其获取的内容参考序列划分为多个内容参考子序列, 生成内容划分描述文件和逻辑单元描述文件; As shown in FIG. 12, in accordance with the method for representing the logical structure information of the layout file, the embodiment of the present invention further provides a system for expressing the logical structure information of the layout file, including: a logical structure information acquiring system, configured to obtain logical structure information of the layout file; a logical structure description generating module, configured to obtain a content reference sequence from the layout file parsing system, and divide the content reference sequence obtained by the logical reference information into multiple Content reference subsequences, generating a content division description file and a logical unit description file;
逻辑结构描述解析模块, 用于对所述内容划分描述文件和所述逻辑单元 描述文件进行解析和关联。  The logical structure description parsing module is configured to parse and associate the content partitioning description file and the logical unit description file.
As shown in FIG. 13, the logical structure description generation module in FIG. 12 includes:
a content reference sequence generation module, configured to read the content of the layout file and generate the content reference sequence in a specified order; the specified order may be the order in which the primitives of the layout file content appear in the content data stream, or the traversal order of the document tree;
a content division description generation module, configured to divide the content reference sequence into multiple content reference subsequences according to the logical structure information and to generate the content division description file; the division may be made according to the offset positions of the primitives of the layout file content within the content reference sequence, or according to primitive symbols in the content reference sequence, and each content reference subsequence is assigned a number, which can be saved in the content division description file;
a logical unit description generation module, configured to generate the logical unit description file according to the logical structure information. The logical unit description file contains multiple logical units and the relationships between them; a structured description language such as XML or SGML may be used to describe the logical units and their relationships, and the relationships between logical units can reflect the reading order of the layout file.
The logical structure description generation module may further include a storage device, configured to store the content reference sequence generated by the content reference sequence generation module, or the multiple content reference subsequences produced by the content division description generation module, or the logical unit description file generated by the logical unit description generation module. The content reference sequence and the content reference subsequences may or may not be saved in the storage device.
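The division by offset position performed by the content division description generation module can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the function name `divide_by_offsets`, the representation of primitives as list items, the requirement that the first subsequence start at offset 0, and numbering from 1 are all hypothetical choices.

```python
def divide_by_offsets(sequence, starts):
    """Split a content reference sequence at the given start offsets.

    Returns (number, offset, subsequence) triples; the number is the
    identifier that a content division description file could store.
    """
    if not starts or starts[0] != 0:
        raise ValueError("the first subsequence must start at offset 0")
    if any(a >= b for a, b in zip(starts, starts[1:])) or starts[-1] >= len(sequence):
        raise ValueError("offsets must be strictly increasing and inside the sequence")
    bounds = list(starts) + [len(sequence)]
    return [
        (number, bounds[number - 1], list(sequence[bounds[number - 1]:bounds[number]]))
        for number in range(1, len(starts) + 1)
    ]


# Illustrative input: nine primitives, divided into two subsequences.
primitives = list("TitleBody")
parts = divide_by_offsets(primitives, [0, 5])
```

Storing only the number and the offset keeps the division description small; the subsequence itself can be recovered from the content reference sequence whenever it is needed.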
As shown in FIG. 14, the logical structure description parsing module in FIG. 12 includes:
a logical unit description parsing module, configured to read and parse the data in the logical unit description file;
a mapping module, configured to associate the content division description file with the logical unit description file. Specifically, each logical unit can be associated with its corresponding content reference subsequence by the number of that subsequence.
When the content reference sequence generated by the content reference sequence generation module of the logical structure description generation module, or the multiple content reference subsequences generated by the content division description generation module, have not been saved in the storage device, the logical structure description parsing module should also include the following modules:
a content reference sequence generation module, configured to read the content of the layout file and generate the content reference sequence;
a content division description parsing module, configured to divide the content reference sequence into multiple content reference subsequences according to the logical structure information and to generate the content division description file.
When the content reference sequence or the multiple content reference subsequences of the logical structure description generation module have already been saved in the storage device, they can be read directly and need not be generated again.
In practical applications, regenerating the content reference sequence and the content division description file with the content reference sequence generation module and the content division description parsing module is faster and more efficient than reading large amounts of content reference sequence and content division description file data from storage.
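The choice between reading a stored content reference sequence and regenerating it can be sketched as below. The JSON storage format, the function name and the `regenerate` callback are assumptions made for illustration; the sketch does not measure which path is faster.

```python
import json
from pathlib import Path


def get_content_reference_sequence(stored, regenerate):
    """Return the stored sequence if one was saved, otherwise rebuild it.

    `stored` is a hypothetical path to a saved sequence; `regenerate` is a
    callable that re-reads the layout file and rebuilds the sequence.
    """
    if stored.exists():
        return json.loads(stored.read_text(encoding="utf-8"))
    return regenerate()
```

A caller that follows the preference stated above for regeneration would simply pass a path that is never written, so that `regenerate` always runs.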
The working process of the system for processing the logical structure information of a layout file according to the present invention is described below with reference to FIG. 13 and FIG. 14.
As shown in FIG. 13, the logical structure description generation module works as follows:
The logical structure information acquisition system obtains the logical structure information of the layout file. For an electronic document that corresponds to the layout file and already contains logical structure information, such as an HTML or Microsoft Word document, the document processing system for that document can be used to extract the logical structure information; for a Microsoft Word document, for example, Office automation objects can be used to obtain it. In addition, a user can annotate the logical units of the layout file through a computer application with a graphical interface. The logical structure information can also be obtained by a processing system based on document analysis and document understanding.
The content reference sequence generation module uses the layout file parsing system to arrange the content of the layout file into an ordered sequence according to a given order, yielding the content reference sequence of the layout file.
The content division description generation module divides the content reference sequence according to the logical structure information obtained by the logical structure information acquisition system, and outputs the content division description file.
The logical unit description generation module outputs the logical unit description file according to the logical structure information obtained by the logical structure information acquisition system.
The content division description file and the logical unit description file can be embedded in the layout file or saved separately.
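Two of the generation steps above can be illustrated with a short sketch: building a content reference sequence by traversing a document tree, and writing logical units in XML so that element order reflects reading order. The tree layout, the pre-order traversal, and the element and attribute names (`document`, `paragraph`, `line`, `ref`) are all assumptions; the description only says that a document tree traversal order and a structured language such as XML may be used.

```python
import xml.etree.ElementTree as ET


def content_reference_sequence(node):
    """Pre-order walk of a hypothetical document tree: a node's own
    primitives come first, then those of its children, left to right."""
    sequence = list(node.get("primitives", []))
    for child in node.get("children", []):
        sequence.extend(content_reference_sequence(child))
    return sequence


def logical_unit_description(paragraphs):
    """Build an XML tree whose element order mirrors the reading order.

    `paragraphs` is a list of lists of content reference subsequence numbers.
    """
    root = ET.Element("document")
    for numbers in paragraphs:
        paragraph = ET.SubElement(root, "paragraph")
        for number in numbers:
            ET.SubElement(paragraph, "line", {"ref": str(number)})
    return root


tree = {"primitives": ["A"],
        "children": [{"primitives": ["B", "C"]}, {"primitives": ["D"]}]}
sequence = content_reference_sequence(tree)

root = logical_unit_description([[7], [8, 9]])
xml_text = ET.tostring(root, encoding="unicode")
reading_order = [int(line.get("ref")) for line in root.iter("line")]
```

The resulting `xml_text` could be written to a separate file or embedded in the layout file as a data block, matching the two storage options described above.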
As shown in FIG. 14, the logical structure description parsing module works as follows:
If the content reference sequence, the content reference subsequences (which can also be regarded as the content division description file) and the logical unit description file have not been saved, the content reference sequence generation module must again use the layout file parsing system to arrange the content of the layout file into an ordered sequence according to a given order, yielding the content reference sequence.
The content division description parsing module reads the content division description file and divides the content reference sequence obtained by the logical structure description generation module shown in FIG. 13.
The logical unit description parsing module reads the logical unit description file produced by the logical structure description generation module shown in FIG. 13 and verifies its validity.
The mapping module associates the logical units with the content reference subsequences according to the content reference subsequence numbers in the content division description file and the logical unit description file.
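The parsing, validity check and mapping steps above can be sketched together. The XML vocabulary (a `ref` attribute holding the subsequence number) and the form of the division table (number to offset) are hypothetical, and the check shown here only covers well-formedness and dangling numbers, since the description does not say what the validity check consists of.

```python
import xml.etree.ElementTree as ET


def parse_and_map(unit_xml, division):
    """Parse a logical unit description and map it onto the division table.

    Raises xml.etree.ElementTree.ParseError if the XML is not well-formed,
    and ValueError if a logical unit refers to an unknown subsequence number.
    """
    root = ET.fromstring(unit_xml)
    mapping = {}
    for element in root.iter():
        ref = element.get("ref")
        if ref is None:
            continue  # structural element with no direct content reference
        number = int(ref)
        if number not in division:
            raise ValueError(f"logical unit refers to unknown subsequence {number}")
        mapping[number] = (element.tag, division[number])
    return mapping


# Number 8 and offset address 113 follow the example given for step 34.
mapping = parse_and_map('<document><line ref="8"/></document>', {8: 113})
```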
As a further explanation of the system for processing logical structure information in layout files, the external systems that interact with it may include a layout file parsing system, a logical structure information acquisition system and other document processing systems. The other document processing systems may be format conversion systems, layout reflow systems and the like. These systems use the logical structure information to process the layout file further, for example to extract information, reflow pages or convert the file to other formats.
In addition, the content division description file and the logical unit description file can be saved inside the layout file, or saved as separate files apart from the layout file. A single layout file can have multiple logical structure information descriptions.
An embodiment of the present invention further provides a device for representing the logical structure information of a layout file. The device includes a logical structure information acquisition module, a logical structure description generation module and a logical structure description parsing module, wherein:
the logical structure information acquisition module is configured to acquire the logical structure information of the layout file;
the logical structure description generation module is configured to obtain a content reference sequence, divide the content reference sequence into multiple content reference subsequences according to the logical structure information, generate a content division description file according to the multiple content reference subsequences, and generate a logical unit description file according to the logical structure information;
the logical structure description parsing module is configured to parse and associate the content division description file and the logical unit description file.
Referring again to FIG. 13, the logical structure description generation module includes a content reference sequence generation module, a content division description generation module and a logical unit description generation module, wherein:
the content reference sequence generation module is configured to read the content of the layout file and generate the content reference sequence in a specified order; the specified order may be the order in which the primitives of the layout file content appear in the content data stream, or the traversal order of the document tree;
the content division description generation module is configured to divide the content reference sequence into multiple content reference subsequences according to the logical structure information and to generate the content division description file; the division may be made according to the offset positions of the primitives of the layout file content within the content reference sequence, or according to primitive symbols in the content reference sequence, and each content reference subsequence is assigned a number, which can be saved in the content division description file;
the logical unit description generation module is configured to generate the logical unit description file according to the logical structure information. The logical unit description file contains multiple logical units and the relationships between them; a structured description language such as XML or SGML may be used to describe the logical units and their relationships, and the relationships between logical units can reflect the reading order of the layout file.
The logical structure description generation module may further include a storage device, configured to store the content reference sequence generated by the content reference sequence generation module, or the multiple content reference subsequences produced by the content division description generation module, or the logical unit description file generated by the logical unit description generation module. The content reference sequence and the content reference subsequences may or may not be saved in the storage device.
Referring again to FIG. 14, the logical structure description parsing module includes a logical unit description parsing module and a mapping module, wherein:
the logical unit description parsing module is configured to read and parse the data in the logical unit description file;
the mapping module is configured to associate the content division description file with the logical unit description file. Specifically, each logical unit can be associated with its corresponding content reference subsequence by the number of that subsequence.
When the content reference sequence generated by the content reference sequence generation module of the logical structure description generation module, or the multiple content reference subsequences generated by the content division description generation module, have not been saved in the storage device, the logical structure description parsing module should also include the following modules:
a content reference sequence generation module, configured to read the content of the layout file and generate the content reference sequence;
a content division description parsing module, configured to divide the content reference sequence into multiple content reference subsequences according to the logical structure information and to generate the content division description file.
When the content reference sequence or the multiple content reference subsequences of the logical structure description generation module have already been saved in the storage device, they can be read directly and need not be generated again.
In summary, the method and system of the present invention divide the content reference sequence of a layout file into multiple content reference subsequences, generate a corresponding content division description file, generate a logical unit description file, and then associate the content division description file with the logical unit description file. As a result, the logical structure information is separated from the layout file. Any content in the layout file can have its logical structure described and extracted independently, and can be described according to different document logical structure models, so the scope of description is more precise and the representation of the logical structure information is more flexible. Multiple logical structure information descriptions can also be added to the same layout file; that is, one layout file can have multiple content division description files and logical unit description files. When the logical structure information of a document is added or modified, the content description of the layout file does not need to be changed, which reduces the possibility of error. This flexible representation can also be applied to the large number of layout files that already exist without affecting existing systems, which improves compatibility.
Obviously, those skilled in the art can make various modifications and variations to the present invention without departing from the spirit and scope of the invention. Thus, provided that such modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to cover them.

Claims

1. A method for representing logical structure information of a layout file, characterized by comprising the following steps:
acquiring logical structure information and a content reference sequence of a layout file;
dividing the content reference sequence into a plurality of content reference subsequences according to the logical structure information, and generating a content division description file according to the plurality of content reference subsequences;
generating a logical unit description file according to the logical structure information;
associating the content division description file with the logical unit description file.
2. The method for representing logical structure information of a layout file according to claim 1, characterized in that the step of acquiring the logical structure information of the layout file comprises:
annotating the layout file with a computer application, or acquiring the logical structure information of the layout file by means of a processing system based on document analysis and document understanding.
3. The method for representing logical structure information of a layout file according to claim 1, characterized in that the step of acquiring the content reference sequence of the layout file comprises:
reading the content of the layout file and generating the content reference sequence according to the order in which the primitives of the layout file content appear in the content data stream, or according to the traversal order of the document tree.
4. The method for representing logical structure information of a layout file according to claim 1, characterized in that the step of dividing the content reference sequence into a plurality of content reference subsequences comprises:
dividing the content reference sequence into a plurality of content reference subsequences according to the offset positions of the primitives of the layout file content within the content reference sequence, or according to primitive symbols in the content reference sequence.
5. The method for representing logical structure information of a layout file according to claim 1, characterized in that the method further comprises: assigning a number to each of the plurality of content reference subsequences;
and the step of associating the content division description file with the logical unit description file comprises:
associating the content division description file with the logical unit description file through the numbers of the content reference subsequences.
6. The method for representing logical structure information of a layout file according to claim 1, characterized in that the content division description file or the logical unit description file is a separate file on a storage device or a data block within the layout file.
7. The method for representing logical structure information of a layout file according to claim 1, characterized in that the content division description file or the logical unit description file is described in a structured markup language.
8. A system for representing logical structure information of a layout file, characterized by comprising:
a logical structure information acquisition system, configured to acquire logical structure information of a layout file;
a logical structure description generation module, configured to obtain a content reference sequence, divide the content reference sequence into a plurality of content reference subsequences according to the logical structure information, generate a content division description file according to the plurality of content reference subsequences, and generate a logical unit description file according to the logical structure information;
a logical structure description parsing module, configured to parse and associate the content division description file and the logical unit description file.
9. The system for representing logical structure information of a layout file according to claim 8, characterized in that the logical structure description generation module comprises:
a content reference sequence generation module, configured to read the content of the layout file and generate the content reference sequence;
a content division description generation module, configured to divide the content reference sequence into a plurality of content reference subsequences according to the logical structure information, and to generate the content division description file according to the plurality of content reference subsequences;
a logical unit description generation module, configured to generate the logical unit description file according to the logical structure information.
10. The system for representing logical structure information of a layout file according to claim 9, characterized in that the logical structure description generation module further comprises: a storage device, configured to store the content reference sequence generated by the content reference sequence generation module, or the plurality of content reference subsequences produced by the content division description generation module.
11. The system for representing logical structure information of a layout file according to claim 8, characterized in that the logical structure description parsing module comprises:
a content reference sequence generation module, configured to read the content of the layout file and generate the content reference sequence;
a content division description parsing module, configured to divide the content reference sequence into a plurality of content reference subsequences and to generate the content division description file.
12. The system for representing logical structure information of a layout file according to claim 10 or 11, characterized in that the logical structure description parsing module further comprises:
a logical unit description parsing module, configured to read and parse the data in the logical unit description file;
a mapping module, configured to associate the content division description file with the logical unit description file.
13. A device for representing logical structure information of a layout file, characterized by comprising:
a logical structure information acquisition module, configured to acquire logical structure information of a layout file;
a logical structure description generation module, configured to obtain a content reference sequence, divide the content reference sequence into a plurality of content reference subsequences according to the logical structure information, generate a content division description file according to the plurality of content reference subsequences, and generate a logical unit description file according to the logical structure information;
a logical structure description parsing module, configured to parse and associate the content division description file and the logical unit description file.
PCT/CN2008/000910 2007-06-22 2008-05-08 Representation method, system and device of layout file logical structure information WO2009000141A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200710123338.6 2007-06-22
CN200710123338.6A CN101271463B (en) 2007-06-22 2007-06-22 Structure processing method and system of layout file

Publications (1)

Publication Number Publication Date
WO2009000141A1 true WO2009000141A1 (en) 2008-12-31

Family

ID=40005437

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2008/000910 WO2009000141A1 (en) 2007-06-22 2008-05-08 Representation method, system and device of layout file logical structure information

Country Status (2)

Country Link
CN (1) CN101271463B (en)
WO (1) WO2009000141A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116916047A (en) * 2023-09-12 2023-10-20 北京点聚信息技术有限公司 Intelligent storage method for layout file identification data

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101887413B (en) * 2009-05-14 2012-07-04 北大方正集团有限公司 Structure processing method and system of plate type table
CN102087692B (en) * 2009-12-02 2013-11-06 北大方正集团有限公司 Data replication prevention method and system for layout file
CN102122280B (en) * 2009-12-17 2013-06-05 北大方正集团有限公司 Method and system for intelligently extracting content object
CN102541888A (en) * 2010-12-20 2012-07-04 鸿富锦精密工业(深圳)有限公司 Electronic patent file analysis system and electronic patent file analysis method
CN102567291B (en) * 2010-12-31 2014-09-10 北大方正集团有限公司 Method and device for deleting lace characters in format document
CN102411498A (en) * 2011-07-26 2012-04-11 中兴通讯股份有限公司 Method for realizing data model and graphical designer
CN103186655A (en) * 2011-12-31 2013-07-03 北大方正集团有限公司 Processing method and device for layout file
EP2875468A1 (en) 2012-07-20 2015-05-27 Microsoft Technology Licensing, LLC Color coding of layout structure elements in a flow format document
CN103970799B (en) * 2013-02-04 2019-04-26 百度在线网络技术(北京)有限公司 A kind of generation method of electronic document, device and client
CN104090920A (en) * 2014-06-17 2014-10-08 安徽教育网络出版有限公司 System for realizing digital content cross-terminal publishing
CN104199803B (en) * 2014-07-21 2017-10-13 安徽华贞信息科技有限公司 A kind of text information processing system and method based on combinatorial theory
CN105760358B (en) * 2014-12-19 2019-07-23 阿里巴巴集团控股有限公司 The method and device thereof that the e-book space of a whole page is reset and e-book is shown
CN105279254B (en) * 2015-10-12 2018-10-23 江苏中威科技软件系统有限公司 The implementation method of format data streamed file system and its operating device and its operating device
CN105701073A (en) * 2015-12-31 2016-06-22 北京中科江南信息技术股份有限公司 Layout file generation method and device
CN108287927B (en) * 2018-03-05 2019-10-22 北京百度网讯科技有限公司 For obtaining the method and device of information
CN109815243B (en) * 2019-02-18 2020-03-03 北京仁和汇智信息技术有限公司 Structured storage method and device during document interface modification
CN112612750A (en) * 2020-12-15 2021-04-06 北京天融信网络安全技术有限公司 File content processing method and device, electronic equipment and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6592628B1 (en) * 1999-02-23 2003-07-15 Sun Microsystems, Inc. Modular storage method and apparatus for use with software applications
CN1441929A (en) * 2000-07-10 2003-09-10 佳能株式会社 Delivering multimedia descriptions
CN1604073A (en) * 2004-11-22 2005-04-06 北京北大方正技术研究院有限公司 Method for conducting title and text logic connection for newspaper pages
US20050193327A1 (en) * 2004-02-27 2005-09-01 Hui Chao Method for determining logical components of a document
US20070092140A1 (en) * 2005-10-20 2007-04-26 Xerox Corporation Document analysis systems and methods

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100429643C (en) * 2005-12-07 2008-10-29 段君雷 Production of multi-media network electronic publication
CN100356372C (en) * 2005-12-31 2007-12-19 无锡永中科技有限公司 Generating method of computer format document and opening method


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116916047A (en) * 2023-09-12 2023-10-20 北京点聚信息技术有限公司 Intelligent storage method for layout file identification data
CN116916047B (en) * 2023-09-12 2023-11-10 北京点聚信息技术有限公司 Intelligent storage method for layout file identification data

Also Published As

Publication number Publication date
CN101271463B (en) 2014-03-26
CN101271463A (en) 2008-09-24

Similar Documents

Publication Publication Date Title
WO2009000141A1 (en) Representation method, system and device of layout file logical structure information
CN110083805B (en) Method and system for converting Word file into EPUB file
US7703009B2 (en) Extensible stylesheet designs using meta-tag information
CN101548273B (en) Method for demonstrating file
CN1801149B (en) Systems and methods for converting a formatted document to a web page
CN101937427B (en) Browser-based system and method for content edition and issue
CN101308488B (en) Document stream type information processing method based on format document and device therefor
CN102609400B (en) Method for converting file formats and conversion tool
CN101714133A (en) WEB-based mathematical formula editing system and method
CN103049439A (en) Processing method for markup language documents, browser and network operating system
CN105446946A (en) Format document resetting method and system, electronic reading terminal
CN112527291A (en) Webpage generation method and device, electronic equipment and storage medium
CN111881651A (en) Method for converting UOT streaming document into OFD format document
CN104090920A (en) System for realizing digital content cross-terminal publishing
WO2007081017A1 (en) Document processor
CN112433995B (en) File format conversion method, system, computer device and storage medium
CN102289497A (en) Document preview image generating system and method
US8930808B2 (en) Processing rich text data for storing as legacy data records in a data storage system
WO2001082121A2 (en) Pre-computing and encoding techniques for an electronic document to improve run-time processing
CN107066437B (en) Method and device for labeling digital works
CN107423271B (en) Document generation method and device
CN113239670A (en) Method and device for uploading service template, computer equipment and storage medium
KR20070120965A (en) Determining fields for presentable files and extensible markup language schemas for bibliographies and citations
JP2004529427A (en) Design of extensible style sheet using meta tag information
Hughes et al. Encoding and presenting interlinear text using XML technologies

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application (Ref document number: 08748468; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 08748468; Country of ref document: EP; Kind code of ref document: A1)