Ethnologue Automation

Unicode Conversion Planning

Document last revised: December 5, 2006 by Jim Brase

Acknowledgement

This was adapted from documents prepared by Paul Frank for the Americas Area Guidance for Publication Strategies meeting, Catalina, AZ, May 27-29, 2006. It was adapted by Bill Mayes for general use by Wycliffe and SIL and their partners, and then edited by me for the Unicode Transition Workshop following the 2006 CTC.
Jim Brase

Vision for a desired future

Our vision is that

all computer files which will be needed or useful for ongoing language development or research have been converted to Unicode, and
language teams have Unicode-compliant tools with which to do their ongoing work.

Stakeholders

Determine who your stakeholders are. Some possibilities:

SIL and Wycliffe areas and entities
National language development or Bible translation organizations
Publishers with whom we cooperate
Government organizations
Universities
Other missions in translation

Problem analysis: why our desired future is not a reality today

Virtually all materials created by language programs in the past were created using “legacy fonts” (i.e., fonts created prior to the development of Unicode). There were two types of legacy fonts used:

fonts which were based on older, pre-Unicode encoding standards, which supported only a limited number of major languages, and
fonts which were based on custom encodings, each of which supported one or a small set of minority languages, but which did not follow any recognized international, national, or industry standard.

The second type of legacy font was used by most language programs.

This created a confusing situation in which the same character code often represented different characters in different fonts. Without the correct font, the computer file that was created using that font became unintelligible.

The Unicode standard solves the confusion by assigning a unique numerical character code to every character of every writing system in the world. The characters in a file in Unicode are always intelligible—a given numerical code always refers to the same thing, no matter what specific program or font is used to create the file.
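The ambiguity described above can be reproduced with standard legacy code pages (the first type of legacy font); custom encodings behave the same way. The following is a minimal sketch only. The byte value 0xE9 and the three code pages are chosen purely for illustration and are not taken from any particular language project.

```python
import unicodedata

# Three pre-Unicode encodings that ship with Python (illustrative choice).
CODECS = ("cp1252", "cp437", "iso8859_5")

# A single legacy character code, as it would be stored in a file.
raw = bytes([0xE9])

# The same code decodes to a different character under each encoding.
decoded = {codec: raw.decode(codec) for codec in CODECS}

for codec, char in decoded.items():
    # Print the Unicode code point and name rather than the glyph itself,
    # so the output does not depend on the console font.
    print(f"{codec:10} 0xE9 -> U+{ord(char):04X} {unicodedata.name(char)}")
```

Once the text is in Unicode, each of the three characters has its own code point, so the file no longer depends on knowing which font or code page produced it.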

The computer software and publishing industries are moving to the Unicode standard. It is necessary for SIL and our partner organizations to also change to Unicode. Yesterday’s programs only used legacy fonts. Today’s programs often use both legacy fonts and Unicode fonts. Tomorrow’s programs will only use Unicode fonts. When that time comes, materials created with legacy fonts will no longer be usable.

“Changing to Unicode” involves

determining how data should be encoded in Unicode,
upgrading to software (fonts, keyboards, and applications) which uses Unicode in creating, editing, and processing text, and
converting existing data to Unicode.

Goals: from problems to solutions

The fundamental problems outlined above will be solved and our desired future attained when the following goals are met:

Key entity administrators, archivists, orthography consultants, and teams are aware of Unicode, and entities have personnel able to give leadership to the Unicode transition process.
Computer support and publications personnel are trained in bringing Unicode text to final production.
All the characters in all the languages of the area/entity can be represented with Unicode fonts in a standardized way.
Needs for the Private Use Area have been identified, and a plan has been created to manage the use of PUA characters from birth to retirement.
All the legacy encodings (fonts) in past or current use have been identified.
Mapping tables have been created for each of the fonts identified in goal 5.
All materials that should be converted to Unicode have been identified and prioritized.
All the materials in goal 7 have been converted to Unicode.
Language projects that are ongoing have begun using Unicode-compliant computer programs.
Orthography consultants are able to guide language teams on Unicode-related orthography issues and evaluate how Unicode will impact a proposed orthography.

Objectives: how we will achieve our goals

The following measurable outcomes will lead to the achievement of the goals listed above. Identify parties responsible for each action in square brackets.

Key entity administrators, archivists, orthography consultants, and teams are aware of Unicode, and entities have personnel able to give leadership to the Unicode transition process.

Key entity administrators, archivists, and teams need to understand the issues regarding the need to change to Unicode. Typesetting will be transitioning to Unicode-based tools in the very near future. Archiving repurposable files will require Unicode. Operating systems and new general software will soon function only with Unicode.

Computer consultants communicate this reality to entity language team administrators. [Assigned to: ]

A “Unicode for Administrators” course is under development on DistantCourses.org. (The content is finished, but we still wish to add a feedback mechanism to it.)

Promote and inform field teams about Unicode transition in each workshop. [Assigned to: ]

Include Unicode awareness in orientation and training for archivists. (See Objective 4 regarding archiving and the PUA.) [Assigned to: ]

Include Unicode awareness in knowledge and skills required for orthography consultants. (See Objective 10.) [Assigned to: ]

Computer support and publications personnel are trained in bringing Unicode text to final production.

Computer support and publications personnel need to have the skills to work with text in Unicode and publishing tools that are Unicode-compliant.

Determine whether it is appropriate to develop expertise at the Area level, and help entities who lack adequately trained personnel rather than expecting each entity to develop all the expertise it needs.

Brief computer support and publications personnel on the Unicode conversion methods (see Appendix D) and any resources for the entity. [Assigned to: ]

Work with International Publishing Services (IPub) to identify and implement computer checking and typesetting programs that work with Unicode materials at typesetting centers. [Assigned to: ]

Ensure that publications personnel receive the training they need. (Recurrency training offered by JAARS and IPub will train publications personnel in how to work with text in Unicode and how to use Unicode-compliant programs.) [Assigned to: ]

Train volunteers to do some of the work of conversion.

Progress:

The following entities have personnel who know how to assess Unicode needs within their entity and develop conversion resources for the fonts used in their entity:

The following entities lack trained personnel:

The following steps need to be taken so that all entities will have adequately trained personnel:

All the characters in all the languages of the area/entity can be represented with Unicode fonts in a standardized way.

The Non-Roman Script Initiative has made an effort to identify all of the character needs around the world and to request that they be included in the Unicode standard if they are not in it already. It is important that all of the stakeholders within the area/entity verify that they have made known all of their character needs.

Survey each of the stakeholders to find out whether they have identified all of their character needs. Assist any stakeholder that is uncertain about this issue to carry out the necessary research. [Assigned to: ]

Each character should be classified according to its status in Unicode and its status in the SIL PUA.

Status in Unicode:

already in Unicode
in the Unicode pipeline
a Unicode proposal needs to be initiated
rejected by Unicode

Status in the SIL PUA:

already deprecated in the PUA
currently an active character in the PUA
in the PUA pipeline
needs to be submitted to the PUA
not needed in the PUA (already part of Unicode)

Identify characters that may require alternate glyphs for proper rendering.

Identify who is responsible for making the unmet needs known to NRSI. Work with NRSI to develop the required Unicode and PUA proposals, and to add alternate glyphs to SIL fonts. [Assigned to: ]

Identify problem characters or other potentially ambiguous or confusing encoding issues and establish appropriate entity guidelines or standards. [Assigned to: ]

Example: Many orthographies use some form of apostrophe for a glottal stop. Unicode candidates for this character may include

U+0027 APOSTROPHE
U+02BC MODIFIER LETTER APOSTROPHE
U+02B9 MODIFIER LETTER PRIME
U+02C8 MODIFIER LETTER VERTICAL LINE
U+F21D MODIFIER LETTER STRAIGHT APOSTROPHE

What are the advantages and disadvantages of each of these? Is it appropriate to establish a policy in your entity as to which should be used?
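One way to start weighing these candidates is to look at their Unicode properties, since software uses those properties to decide where words begin and end. The sketch below uses only Python's standard library. The sample word is invented, and the results describe Python's regular-expression behaviour only; other software may treat these characters differently.

```python
import re
import unicodedata

# The five candidates listed above, in the same order.
CANDIDATES = ["\u0027", "\u02BC", "\u02B9", "\u02C8", "\uF21D"]


def describe(ch: str) -> tuple[str, str, str]:
    """Return (code point, general category, name) for one character."""
    # PUA characters have no standard name, so supply a placeholder.
    name = unicodedata.name(ch, "<no name: private use>")
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), name)


def words(text: str) -> list[str]:
    """Split text into words the way a Unicode-aware regex does."""
    return re.findall(r"\w+", text)


for ch in CANDIDATES:
    print(*describe(ch))

# Invented sample word with a glottal stop in the middle.
print(words("ha\u0027a"))   # punctuation (Po): the word is split in two
print(words("ha\u02BCa"))   # modifier letter (Lm): the word stays whole
```

U+0027 is classified as punctuation, the three modifier letters are classified as letters, and U+F21D is classified as private use, which means general-purpose software knows nothing about how it should behave.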

Progress:

The following entities/stakeholders have surveyed all the orthographies of the languages for which legacy fonts have been used and there are existing materials to convert:

The following characters have been identified but are not in Unicode:

Needs for the Private Use Area have been identified, and a plan has been created to manage the use of PUA characters from birth to retirement.

It is sometimes necessary to use the PUA for characters that are not yet in Unicode or which have been rejected by Unicode, but there are negative effects to doing so. Data which uses PUA characters does not conform to any standard, and can only be shared with and read by people who are aware of SIL’s PUA usage and have SIL fonts. In some situations, files may be created with mixed encodings. Careful planning is required to mitigate the repercussions of using the PUA.

Ideally, once a PUA character has been added to the Unicode Standard, the PUA codepoint should be deprecated and all data that used the PUA codepoint should be converted to use the new Unicode codepoint. In practice, it may not be possible to do this with archived data.

Identify and document which PUA characters each language project has used, is using, or will use. [Assigned to: ]

Identify restrictions on font technology and software imposed by the PUA. (See Appendix B.) [Assigned to: ]

Outline the “life cycle” for each PUA character: [Assigned to: ]

When was the PUA character created?
When did each language project start to use it?
What data has been created or converted using the PUA character?
When was (or when do you expect) the PUA character (to be) added to Unicode and deprecated in the PUA?
Who will do the necessary updating of mapping tables and keyboards when the new Unicode character becomes available?
Who will convert data that uses the PUA codepoint to the new Unicode Standard?
Will there be data using the PUA that can no longer be converted (e.g. archived data, data that is no longer under our control)?
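Two of the life-cycle tasks above lend themselves to a small script: finding which PUA characters a file actually contains, and replacing a deprecated PUA codepoint with its Unicode replacement. The following is a rough sketch, not a production tool. The single entry in PUA_TO_UNICODE is an assumption included for illustration; confirm every pairing against the current SIL PUA documentation before converting real data.

```python
def is_pua(ch: str) -> bool:
    """True if the character lies in one of Unicode's Private Use Areas."""
    cp = ord(ch)
    return (
        0xE000 <= cp <= 0xF8FF          # BMP Private Use Area
        or 0xF0000 <= cp <= 0xFFFFD     # Supplementary Private Use Area-A
        or 0x100000 <= cp <= 0x10FFFD   # Supplementary Private Use Area-B
    )


def pua_inventory(text: str) -> dict[str, int]:
    """Count each PUA character found in the text, keyed by code point."""
    counts: dict[str, int] = {}
    for ch in text:
        if is_pua(ch):
            key = f"U+{ord(ch):04X}"
            counts[key] = counts.get(key, 0) + 1
    return counts


# ASSUMPTION (illustrative only): deprecated PUA codepoint -> replacement.
PUA_TO_UNICODE = {0xF21D: 0xA78C}


def migrate(text: str) -> str:
    """Replace deprecated PUA codepoints with their Unicode replacements."""
    return text.translate(PUA_TO_UNICODE)


sample = "ha\uF21Da"                    # invented sample data
print(pua_inventory(sample))            # PUA characters before migration
print(pua_inventory(migrate(sample)))   # PUA characters remaining afterwards
```

Running the inventory before and after migration gives a simple record of which PUA characters remain in a data set, which can feed the Status column of the progress table.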

Ensure that metadata for archived material contains the appropriate documentation regarding PUA usage. [Assigned to: ]

Progress:

The following PUA characters are or have been in use:

Key to Status column:

C In current use.

Dw Deprecated; waiting for conversion of data to new Unicode code point to start.

Dp Deprecated; conversion of data in progress.

Dc Deprecated; conversion of all data complete.

Dx Deprecated; conversion is complete as far as possible. Some data using the PUA code point still exists, but is no longer in our control.

PUA code point	PUA name	Unicode code point	Unicode name	Users	Status

All the legacy encodings (fonts) in past or current use have been identified.

This will include all fonts currently in use by language teams as well as all fonts that are used in materials no longer under development but that we want to preserve for future use. (See also Goal 7).

Note that this differs from Goal 3. Goal 3 surveys orthographies to determine Unicode character needs. This goal surveys legacy encodings to determine what mapping tables will be needed. Two different orthographies may have used the same legacy encoding (if an entity standardized their fonts), or the data of one orthography may be represented by more than one encoding, if a project switched encodings midstream.

Identify the legacy encodings and secure the fonts that represent them. [Assigned to: ]

Progress:

The following entities/stakeholders have identified all of the fonts and encodings that they use:

The following entities/stakeholders still need to inventory their fonts/encodings:

Mapping tables have been created for each of the fonts identified in goal 5.

The mapping table is the essential element for actually converting a given computer file. It changes the numerical code for a character as used in the legacy font to the numerical code for that same character in Unicode.

Identify the personnel who will prepare the mapping tables and get them the appropriate training, as needed. [Assigned to: ]

Write a mapping table, using NRSI’s tools, for each of the fonts and test the conversion. [Assigned to: ]

Post and manage current mapping tables so anyone who needs them can get the current version. NRSI has an internet site to post mapping tables, legacy fonts, and associated files. See http://scripts.sil.org/ConversionMaps . The mapping tables should be sent to NRSI. [Assigned to: ]

Maintain the mapping tables so that when characters are accepted into the Unicode Standard, the mapping tables are updated to reflect the new codepoints. [Assigned to: ]

Write or modify keyboards so that their output is consistent with the character encodings used in the relevant mapping tables. [Assigned to: ]

Progress:

The following entities/stakeholders have created mapping tables for all of their encodings:

The following entities/stakeholders still need to create mapping tables for some of their encodings:

All materials that should be converted to Unicode have been identified and prioritized.

We will not know when we are done with Unicode conversion until we know what needs to be converted. Not all materials are worth converting to Unicode. Therefore, a selection process is necessary. Also, it is very difficult to convert materials created by some programs, as the conversion tools cannot work with files in some proprietary formats.

Establish criteria or guidelines for what materials should be converted. (See Appendix G: “Flowchart for processing materials to archive” for a decision-making process. (What document is this referring to? j.b.) See also Appendix A for more guidelines. Essentially, documents that we would want to repurpose (e.g., lexical data files, paradigms, primary texts, etc.) should be converted. Documents that we only want to read in the future should not be maintained as text documents and therefore do not need Unicode conversion.) [Assigned to: ]

Determine which file formats are feasible to convert and create a short summary for use by field teams. (Note that some obsolete file formats can be updated using file format conversion tools like “Conversions Plus” (http://www.dataviz.com/products/conversionsplus/) and then can be converted to Unicode.) [Assigned to: ]

Establish an official statement from the Language Program Manager that defines expectations on what needs to be done within a given timeframe. [Assigned to: ]

Take an inventory of existing computer materials in each language project that are in acceptable file formats and that match the criteria/guidelines. [Assigned to: ]

Determine how the accuracy of the conversion process will be verified for materials that will be converted only for archiving, and establish appropriate policies. [Assigned to: ]

Do not convert archive materials and put them on the shelf unchecked! If the accuracy of the conversion cannot be certified, the original legacy data, along with metadata defining the legacy encoding, should be retained as a backup.
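One possible automated check, where the mapping is reversible, is a round trip: convert the legacy data to Unicode, convert it back, and compare the result with the original bytes. The sketch below assumes a hypothetical one-to-one mapping with invented byte assignments. A round trip detects lost or unmapped characters, but it cannot detect a code that is mapped consistently to the wrong character, so review by someone who reads the language is still needed.

```python
# HYPOTHETICAL one-to-one mapping for a custom legacy encoding.
FORWARD = {0x80: "\u014B", 0x81: "\u0254"}
REVERSE = {char: byte for byte, char in FORWARD.items()}


def to_unicode(data: bytes) -> str:
    """Legacy bytes -> Unicode; unmapped bytes pass through unchanged."""
    return "".join(FORWARD.get(b, chr(b)) for b in data)


def to_legacy(text: str) -> bytes:
    """Unicode -> legacy bytes; raises if a character has no legacy code."""
    out = bytearray()
    for ch in text:
        if ch in REVERSE:
            out.append(REVERSE[ch])
        elif ord(ch) < 0x80:
            out.append(ord(ch))
        else:
            raise ValueError(f"no legacy code for U+{ord(ch):04X}")
    return bytes(out)


def round_trip_ok(original: bytes) -> bool:
    """True if converting to Unicode and back reproduces the original."""
    try:
        return to_legacy(to_unicode(original)) == original
    except ValueError:
        return False


print(round_trip_ok(b"a\x80b\x81"))  # every code is covered by the mapping
print(round_trip_ok(b"a\x90"))       # 0x90 is not in the mapping
```

Files that fail the check are candidates for keeping the original legacy data and its encoding metadata alongside the converted copy.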

Progress:

The following entities/stakeholders have established criteria and priorities on what to convert:

The following entities/stakeholders still need to establish criteria and priorities on what to convert:

All the materials in Goal 7 have been converted to Unicode.

Who does the actual conversion and how much time it will take will vary greatly. It will depend on the computer aptitude of the language personnel, the availability of computer support, the variety of data formats, and the complexity of the data. Identify resources needed to do the conversions and establish priorities.

Determine the method of Unicode conversion to be used for specific scenarios. (See Appendix D.) [Assigned to: ]

Identify the personnel who will serve as trainers and have them prepare to train others. [Assigned to: ]

Identify who in the area/entity could give consultant leadership to this effort. [Assigned to: ]

Identify the personnel who will do the conversions and get them the appropriate training, as needed. [Assigned to: ]

Some issues to consider:

Will any language teams do their own conversion? Will it be more efficient to train them, or have someone do the conversion for them?

Field teams with a small amount of data to convert may expect to obtain assistance with conversion from entity computer support.
Field teams with large amounts of data should be given the tools and training to convert their own data. Computer support personnel need to be available to assist them.
Allowances need to be made for the computer aptitude of the teams.

Might volunteers be able to help? This can be done remotely if needed. Volunteers could be utilized to convert data to Unicode if they are supplied with the following:

Source files designated for conversion
Fonts used in the source files
Specific mapping tables for the fonts used
Conversion application programs/tools
Outline of how to perform the conversion steps

Establish priorities for conversion at the Area and/or Entity levels. [Assigned to: ]

Initial ideas:

Closed programs should receive the highest priority for evaluation and conversion of selected materials.
Programs on the verge of closing/handing off to another agency should be next.
Active programs should convert all selected material when conversion aligns with the needs of the program. (See Goal 7.)

Do the conversions. [Assigned to: ]

Monitor progress. [Assigned to: ]

Language projects that are ongoing have begun using Unicode-compliant computer programs.

There is a new generation of computer software that is Unicode compliant. These include Paratext 6, Office 2003, FieldWorks, Toolbox, and Open Office 2. Language teams in ongoing projects need to make the transition to these new software tools. Every active team will transition to Unicode before closure of their program. The timing of transition will vary from team to team depending on their circumstances. New field teams should use Unicode from the beginning.

Identify the Unicode-compliant software that will meet the needs of each team. This will be affected by the platform the team uses. (See Appendix C.) [Assigned to: ]

Establish recommendations for when teams should shift to Unicode. [Assigned to: ]

Field teams should convert their working data when one or more of the following occur:

They begin using Paratext 6.
A problem is evident with the Legacy font in any computer program.
The field team needs a modification to their existing Legacy font.
The field team desires to use features only available in Unicode.
The field team purchases a new computer. (While not a requirement, it is often an opportune time to make a transition to Unicode.)

Help teams obtain and learn how to use the new programs and import their converted data into these programs. [Assigned to: ]

Monitor progress. [Assigned to: ]

Orthography consultants are able to guide language teams on Unicode-related orthography issues and evaluate how Unicode will impact a proposed orthography.

Language personnel should not expect Unicode to add ad hoc characters to the standard. In general, new characters are added only when it can be established that they are already in use. Thus, new orthographies will be constrained by factors such as the set of characters available in Unicode for a given script and the properties of the characters. Characters from different scripts should not be mixed. (E.g. Do not borrow a Cyrillic character to use in a Latin orthography, even if it looks nice.) Orthography consultants need to be aware of these restrictions. (This is analogous to restrictions imposed 50 years ago by available typewriters and local print shops.)
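A simple check for mixed scripts can help when reviewing word lists from a proposed orthography. Python's standard library does not expose the Unicode Script property, so the sketch below approximates it from the first word of each character's name. This is a heuristic and an assumption of this example, limited to the three scripts listed in KNOWN_SCRIPTS; it is not an official Unicode algorithm.

```python
import unicodedata

# Scripts this heuristic recognises (extend as needed).
KNOWN_SCRIPTS = {"LATIN", "CYRILLIC", "GREEK"}


def scripts_in(word: str) -> set[str]:
    """Return the known scripts used by the letters of a word."""
    found = set()
    for ch in word:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and combining marks
        name = unicodedata.name(ch, "")
        first = name.split()[0] if name else ""
        if first in KNOWN_SCRIPTS:
            found.add(first)
    return found


def is_mixed(word: str) -> bool:
    """True if the word combines letters from more than one known script."""
    return len(scripts_in(word)) > 1


# "pass" typed with CYRILLIC SMALL LETTER A in place of Latin "a".
print(is_mixed("p\u0430ss"))
print(is_mixed("pass"))
```

Characters whose names do not begin with a listed script, such as the modifier letters, are ignored rather than flagged, so a clean result does not by itself show that an orthography is free of encoding problems.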

Unicode training materials for orthography consultants are under development.

Determine what level of training is required for orthography consultants in your entity. [Assigned to: ]

Determine who needs the training. [Assigned to: ]

Ensure that they receive the required training. [Assigned to: ]

Determine if policies concerning the approval process for an orthography need to be adjusted to allow for Unicode issues. [Assigned to: ]

Appendix A: Identify and prioritize materials to convert

In theory, all materials should eventually be converted to Unicode, as materials in legacy fonts will someday become unusable. In practical terms, however, only some materials are worth converting, only some file formats lend themselves to conversion, and the conversion of some materials is a higher priority than other materials.

Which materials to convert

The primary criterion for deciding whether materials need to be converted is the likelihood or certainty that the computer files will need to be used in the future. One-time materials (calendars and other dated material; provisional editions of publications) do not warrant conversion, as there is no anticipated need to use the files in the future.

Other materials definitely warrant conversion because of the likelihood that they will be revised in the future:

Scripture
Other Christian literature
Educational materials
Vernacular literature
Dictionary files

Between these two extremes are many kinds of materials that may or may not warrant conversion:

Academic papers and monographs
Cultural notes
Resources for language work, such as key terms lists in a particular language

Which file formats lend themselves to conversion?

Some file formats are much easier to convert than others. Conversions should only be done on materials in the following file formats, based on technical issues and practicality:

Plain text files
Standard format files
Word files
Publisher files (Conversion can be done but is difficult. Should this format be on the list?)
Should other formats be added to this list?

If there are important materials which are not in one of these formats, consider putting them into one of the above formats so that they can be converted.

Timing of conversion

If a language project has recently been completed, it is important to consider what computer files should be converted and preserved for future use. Anytime following the completion of a program is a good time for conversion. (And the sooner, the better.)

It is harder to decide when to convert the materials being used in active language projects. We are still in between the era of legacy fonts and the Unicode era. Therefore, Unicode-aware software may not be available for all tasks a team might need to perform, and some of the computers being used in the language project may not be capable of working with Unicode. Therefore, the situation of the language project must be evaluated prior to undertaking conversion.

The following are indicators that it might be best for a team to move their materials to Unicode:

The language team’s software needs can be met with Unicode-aware programs. (See Appendix C.)
Unicode language materials only occasionally need to be converted to legacy fonts for use by others in that form (e.g., for typesetting/layout, or use by others without Unicode-aware software).
The team/project will be active for an estimated ___ or more years.

The following are indicators that it might be best for a team to continue using legacy fonts in their everyday work:

The team needs to do tasks for which there is no adequate Unicode-aware software.
Language materials would frequently need to be converted to legacy fonts.
The team is within ____ years of completion.
The team is in the middle of a major push in some aspect of their program. Examples of such a major push include translation completion, dictionary completion, actively developing the literacy program, or needing to focus exclusively on some language learning or analysis task because of the availability of language help.^¹

A possible strategy for determining when a team should convert is to:

Set a goal for the team to convert now.
If the team cannot convert now, require justification for the delay.
Establish a plan to eliminate the roadblocks identified in step 2.

Priorities in conversion

Even after selecting only some types of materials to convert and limiting the task to certain file types, there is still a huge amount of material to convert, sooner or later.

If language teams are provided with the means for converting their own material and using it in Unicode day to day (conversion tables, Unicode fonts, new software, training), the workload can be distributed and each team can handle its more modest amount of material.

If language projects do not convert their materials while the project is active, the entity will have to convert the materials once the programs are complete.

Appendix B:
The PUA, rendering technology, and software

It is necessary to carefully weigh the benefits and costs of using PUA characters. Among the costs, the use of PUA characters may place some restrictions on what software a project can use. While AAT and Graphite will render PUA characters correctly, the Open Type rendering technology does not work with these characters. This becomes a problem when the PUA character is either a combining mark or the base for a combining mark. For example:

Char sequence	Works in AAT	Works in Graphite	Works in OT	Resulting image with MS Word
U+F210	yes	yes	yes	
U+F25D	yes	yes	yes	
U+F21A	yes	yes	yes	
U+F25D, U+0301	yes	yes	no	́
U+F21A, U+0301	yes	yes	no	́
U+0069, U+F174	yes	yes	no	i
U+0065, U+F174	yes	yes	no	e

Compare this chart with Appendix C: Language-related software that supports Unicode, to see what software may be unavailable to you as a result of using the PUA.

Appendix C:
Language-related software that supports Unicode

The following chart indicates various language-related tasks and some software available for that task that supports Unicode.^²

Language task	Unicode-aware software	Platform	Rendering Technology
Fonts (see Appendix E)	Doulos SIL, Charis SIL	L, M, W	AAT, Gr, OT
Keyboarding	Keyman 6.2	W	NA
Keyboarding	MS Keyboard Layout Creator	W	NA
Keyboarding	SCIM + KMFL	L	NA
Scripture editing	Paratext 6	W	OT
Scripture editing	FieldWorks Translation Editor	W	Gr, OT
Scripture editing	Bible Edit	L	?
Dictionary work	Toolbox	W	OT
Dictionary work	“Flex” (FieldWorks Language Explorer, soon to be released)	W	Gr, OT
Dictionary work	LexiquePro	W	OT
Dictionary work	Shoebox Utils	L, W	NA
General text processing	Microsoft Word/Office 2003	W	OT
General text processing	WordPad	W	OT
General text processing	WorldPad	W	Gr
General text processing	Open Office 2	L, W	Gr, partial OT
Book layout	InDesign	W	partial OT
Book layout	XeTex	L, M, W	AAT, OT
Book layout	Microsoft Publisher 2003	W	OT
Book layout	Scribus	L	?
Anthropology data	FieldWorks Data Notebook	W	?
Scripture Adaptation	Adapt It Unicode	(L), W	OT
Scripture Adaptation	CarlaStudio Unicode	W	?

Appendix D:
Ways that materials may be converted to Unicode

The recommended processes for converting language materials to Unicode use a TECkit map or table. (See http://scripts.sil.org/TECkit for more details.) Once a map has been created for a given legacy font, a number of options are available for converting computer files. Some of these may be easy for an end user to use. Others will be more suitable for computer support personnel to use.

SILConverters includes a Word macro that will convert an open Word file. The changes can be limited to certain Word styles, fonts, or standard format markers. (See http://scripts.sil.org/EncCnvtrs). It will support not only TECKit mappings, but also CC tables and other conversion tools.
SILConverters can also be used to convert Publisher files. (Select a text box in Publisher; edit in Word; run the Word conversion macro.)
SILConverters includes a “Clipboard Converter” that can change all the characters copied to the Windows clipboard. When pasted, the text will be in the Unicode encoding. (See http://scripts.sil.org/EncCnvtrs).
FieldWorks applications include a provision for converting legacy data when importing files.
Paratext 6 can convert data to Unicode when importing it.
Shoebox/Toolbox Utilities (http://scripts.sil.org/SHUtils-manual) can convert legacy Shoebox files into Unicode.

6a. Sh2sh will preserve interlinear formatting:

Run Shoebox. Go to the language settings dialogue. Enter into the comment field:

\codepage=pathfromsettingsfolder/mappingfile.tec

Use forward slashes in the path. Sh2sh will do the rest with the data. You then have to redo your Shoebox language setup to use Unicode.

Use

sh2sh -h

to get help.
6b. Sh2xml will convert Shoebox files to XML.
6c. Sh2odt formats a Shoebox database in Open Office. Handles interlinear text.
6d. Shint + Sh_rtf. Shint processes the interlinear text to create an intermediate file. Sh_rtf converts it to RTF which can be imported into Word.

DropTEC is a graphical, drag-and-drop interface for TECkit which can convert plain text files one at a time. “Plain text file” includes any text file that uses a consistent encoding throughout the file, including SFM files if the format markers use the same encoding as the text.
TxtConv is the command line interface for TECKit. It can be used in batch files for doing mass conversions of similarly encoded plain text files in a fraction of the time required for file-by-file conversions.
SFConv is a command line tool that can convert SFM files in which the encoding varies from field to field.
Other conversion options are located at http://scripts.sil.org/ConversionUtilities.
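The difference between a consistently encoded plain text file and an SFM file whose encoding varies from field to field can be illustrated as follows. This is a conceptual sketch only and does not reproduce how SFConv or TECkit work internally. The marker names, the vernacular mapping, and the choice of cp1252 for the gloss field are all assumptions made for the example.

```python
# HYPOTHETICAL mapping for a custom vernacular legacy encoding.
VERNACULAR_MAP = {0x80: "\u014B"}  # LATIN SMALL LETTER ENG


def decode_vernacular(data: bytes) -> str:
    """Decode bytes written in the hypothetical vernacular encoding."""
    out = []
    for byte in data:
        if byte < 0x80:
            out.append(chr(byte))
        elif byte in VERNACULAR_MAP:
            out.append(VERNACULAR_MAP[byte])
        else:
            raise ValueError(f"unmapped vernacular code 0x{byte:02X}")
    return "".join(out)


def decode_default(data: bytes) -> str:
    """ASSUMPTION: fields without a listed decoder were typed in cp1252."""
    return data.decode("cp1252")


# Marker name -> decoder for that field's content (names are examples).
FIELD_DECODERS = {"lx": decode_vernacular, "ge": decode_default}


def convert_sfm(data: bytes) -> str:
    """Convert a standard format file field by field."""
    decode = decode_default
    out = []
    for line in data.splitlines():
        if line.startswith(b"\\"):
            marker, sep, content = line.partition(b" ")
            decode = FIELD_DECODERS.get(marker[1:].decode("ascii"), decode_default)
            out.append(marker.decode("ascii") + sep.decode("ascii") + decode(content))
        else:
            # Continuation lines belong to the most recent field.
            out.append(decode(line))
    return "\n".join(out)


record = b"\\lx a\x80a\n\\ge caf\xe9"
print(convert_sfm(record).encode("unicode_escape").decode("ascii"))
```

The markers themselves are treated as ASCII and left unchanged; only the field content is passed to the decoder chosen for that marker.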

Appendix E:
Fonts and related tools

Font naming conventions
If the name of an SIL font begins with “SIL”, it is a legacy font.

If the name ends with “SIL”, it is a Unicode font.
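When taking an inventory of fonts (Goal 5), this naming convention can be applied as a first-pass filter. The sketch below is a minimal illustration, with a hypothetical function name. The convention is a rule of thumb: names that fit neither pattern, such as “Charis SIL Literacy” in the table below, have to be checked by hand.

```python
def classify_sil_font(name: str) -> str:
    """Classify an SIL font name as 'legacy', 'unicode', or 'unknown'."""
    words = name.split()
    if not words:
        return "unknown"
    if words[0] == "SIL":
        return "legacy"    # name begins with "SIL"
    if words[-1] == "SIL":
        return "unicode"   # name ends with "SIL"
    return "unknown"       # convention does not decide; check manually


for font in ("SIL Doulos IPA93", "Doulos SIL", "Charis SIL Literacy"):
    print(f"{font}: {classify_sil_font(font)}")
```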

Current SIL Unicode fonts
See http://scripts.sil.org/SILFontList for the primary list of Unicode fonts. In addition, the following Roman fonts are available:

Name	Purpose	Available from
Charis SIL Literacy	The default forms for ‘a’ and ‘g’ have been changed to those often preferred by literacy specialists—single-story ‘a’ and single-bowl ‘g’.	http://scripts.sil.org/CharisSIL_download
Doulos SIL Literacy	See Charis SIL Literacy.	http://scripts.sil.org/CharisSIL_download
Andika	Design review of new literacy font.	http://scripts.sil.org/andika

Font and Mapping tools
The following is a list of tools to help in creating mapping files and converting legacy fonts to Unicode.
Legacy Mapping Workbook (http://scripts.sil.org/UTTLegacyMap)

Encore2Unicode

Editing mapping files:

TECKit Map Unicode Editor (distributed with SIL Converters 2.5)

TECKit Mapping Editor (distributed with TECKit)

any text editor

Unicode reference:

The Unicode Standard 5.0 (book)

Unicode web site

Unibook

http://scripts.sil.org/EncConvRes

Unicode 4.1 Character Properties Excel Workbook

Unicode 4.1 Latin and Cyrillic Characters—sorted

ViewGlyph

TECKit Compiler

UTR22 Compiler

Reprise

Appendix F:
Unicode transition progress chart

The following chart shows where each stakeholder in the area/entity is with regard to the Unicode transition process.

Empty cells show tasks not yet begun. Cells with a check mark show tasks completed.

Task	name of stakeholder	name of stakeholder	name of stakeholder	name of stakeholder	name of stakeholder	name of stakeholder
Unicode awareness has been promoted among administrators and leadership
Trained personnel are available
Character needs have been identified
PUA plan is in place
Fonts have been identified
Mapping tables have been made
Mapping tables have been tested
Keyboards have been made and tested
Materials to convert have been identified
Conversion priorities have been established
New software that each team will need to use has been identified
Data has been converted
Teams have begun using Unicode

Appendix G:
Flowcharts

The flowcharts on the following pages come from my revision of Paul Frank’s work (G.1) and the Americas Area Guidance for Publication Strategies (G.2 and G.3).

G.1. Flowchart for converting materials to Unicode

G.2. Flowchart for processing unpublished manuscripts and books that have been out of print for a long time

(If the decision is to archive, see “Flowchart for processing materials to archive”)

G.3. Flowchart for processing materials to archive

(Takes the output of the “Flowchart for processing unpublished manuscripts”)

Additional explanations

1 Jim Brase has expressed concern about postponing Unicode conversion until a program has completed its work: by then the team is less likely to work with the converted files and so confirm that the conversion worked as it should. Where possible, he suggests that materials be converted to Unicode, published, and then archived, rather than converted after publishing.

2 See http://scripts.sil.org/UnicodeSupport for more details.



Goals: from problems to solutions

Objectives: how we will achieve our goals

Computer consultants communicate this reality to entity language team administrators. [Assigned to: ]

A “Unicode for Administrators” course is under development on DistantCourses.org. (The content is finished, but we still wish to add a feedback mechanism to it.)

Promote the Unicode transition, and inform field teams about it, in each workshop. [Assigned to: ]

Include Unicode awareness in orientation and training for archivists. (See Objective 4 regarding archiving and the PUA.) [Assigned to: ]

Include Unicode awareness in knowledge and skills required for orthography consultants. (See Objective 10.) [Assigned to: ]

Determine whether it is appropriate to develop expertise at the Area level, and help entities who lack adequately trained personnel rather than expecting each entity to develop all the expertise it needs.

Brief computer support and publications personnel on the Unicode conversion methods (see Appendix D) and any resources for the entity. [Assigned to: ]

Work with International Publishing Services (IPub) to identify and implement computer checking and typesetting programs that work with Unicode materials at typesetting centers. [Assigned to: ]

Ensure that publications personnel receive the training they need. (Recurrency training offered by JAARS and IPub will train publications personnel in how to work with text in Unicode and how to use Unicode-compliant programs.) [Assigned to: ]

Train volunteers to do some of the work of conversion.

Progress:

The following entities have personnel who know how to assess Unicode needs within their entity and develop conversion resources for the fonts used in their entity:

The following entities lack trained personnel:

The following steps need to be taken so that all entities will have adequately trained personnel:

Survey each of the stakeholders to find out whether they have identified all of their character needs. Assist any stakeholder that is uncertain about this issue to carry out the necessary research. [Assigned to: ]

Each character should be classified according to its status in Unicode and its status in the SIL PUA.

Status in Unicode:

already in Unicode

in the Unicode pipeline

a Unicode proposal needs to be initiated

rejected by Unicode

Status in the SIL PUA:

already deprecated in the PUA

currently an active character in the PUA

in the PUA pipeline

needs to be submitted to the PUA

not needed in the PUA (already part of Unicode)

Identify characters that may require alternate glyphs for proper rendering.

Identify who is responsible for making the unmet needs known to NRSI. Work with NRSI to develop the required Unicode and PUA proposals, and to add alternate glyphs to SIL fonts. [Assigned to: ]

Identify problem characters or other potentially ambiguous or confusing encoding issues and establish appropriate entity guidelines or standards. [Assigned to: ]

Example: Many orthographies use some form of apostrophe for a glottal stop. Unicode candidates for this character may include:

U+0027 APOSTROPHE

U+02BC MODIFIER LETTER APOSTROPHE

U+02B9 MODIFIER LETTER PRIME

U+02C8 MODIFIER LETTER VERTICAL LINE

U+F21D MODIFIER LETTER STRAIGHT APOSTROPHE

What are the advantages and disadvantages of each of these? Is it appropriate to establish a policy in your entity as to which should be used?
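One way to begin comparing the candidates listed above is to look at their Unicode character properties. The sketch below uses Python's standard unicodedata module; the helper name describe and the placeholder text for unnamed characters are assumptions made for illustration.

```python
import unicodedata

# The five candidate code points from the example above.
CANDIDATES = [0x0027, 0x02BC, 0x02B9, 0x02C8, 0xF21D]


def describe(code_point: int):
    """Return (U+XXXX label, Unicode name, general category)."""
    ch = chr(code_point)
    # Private Use Area characters have no standard name, so supply
    # a placeholder instead of letting the lookup raise ValueError.
    name = unicodedata.name(ch, "<no standard name>")
    return (f"U+{code_point:04X}", name, unicodedata.category(ch))


for cp in CANDIDATES:
    print(*describe(cp), sep="\t")
```

The general category is the point of interest. U+0027 is punctuation (Po), so software may treat it as a word boundary. The three modifier letters are category Lm and are treated as part of a word. U+F21D is a private use character (Co), whose behavior depends entirely on the fonts and software that know about it.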

Progress:

The following entities/stakeholders have surveyed the orthographies of all languages for which legacy fonts have been used and for which existing materials remain to be converted:

The following characters have been identified but are not in Unicode:

Identify and document which PUA characters each language project has used, is using, or will use. [Assigned to: ]

Identify restrictions on font technology and software imposed by the PUA. (See Appendix B.) [Assigned to: ]

Outline the “life cycle” for each PUA character: [Assigned to: ]

When was the PUA character created?

When did each language project start to use it?

What data has been created or converted using the PUA character?

When was the PUA character added to Unicode and deprecated in the PUA (or when do you expect this to happen)?

Who will do the necessary updating of mapping tables and keyboards when the new Unicode character becomes available?

Who will convert data that uses the PUA code point to the new Unicode code point?

Will there be data using the PUA that can no longer be converted (e.g. archived data, data that is no longer under our control)?
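Answering the life-cycle questions above requires knowing which files actually contain PUA characters. The following is a minimal sketch in Python of such a check; the function name is an assumption, and the sketch looks only at the Private Use Area of the Basic Multilingual Plane (U+E000 to U+F8FF), not the supplementary private use planes.

```python
from collections import Counter


def find_pua(text: str) -> Counter:
    """Count Private Use Area code points (U+E000..U+F8FF) in text."""
    return Counter(
        f"U+{ord(ch):04X}"
        for ch in text
        if 0xE000 <= ord(ch) <= 0xF8FF
    )
```

Run over the text of each file in a project, the resulting counts show which PUA code points are in use and how heavily, which can help in filling in the Users and Status information for each character.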

Ensure that metadata for archived material contains the appropriate documentation regarding PUA usage. [Assigned to: ]

Progress:

The following PUA characters are or have been in use:

Key to Status column:

C	In current use.

Dw	Deprecated; waiting for conversion of data to the new Unicode code point to start.

Dp	Deprecated; conversion of data in progress.

Dc	Deprecated; conversion of all data complete.

Dx	Deprecated; conversion is complete as far as possible. Some data using the PUA code point still exists but is no longer in our control.

PUA code point	PUA name	Unicode code point	Unicode name	Users	Status

Identify the legacy encodings and secure the fonts that represent them. [Assigned to: ]

Progress:

The following entities/stakeholders have identified all of the fonts and encodings that they use:

The following entities/stakeholders still need to inventory their fonts/encodings:

Identify the personnel who will prepare the mapping tables and get them the appropriate training, as needed. [Assigned to: ]

Write a mapping table, using NRSI’s tools, for each of the fonts and test the conversion. [Assigned to: ]

Post and manage current mapping tables so that anyone who needs them can get the current version. NRSI has an internet site for posting mapping tables, legacy fonts, and associated files (see http://scripts.sil.org/ConversionMaps). The mapping tables should be sent to NRSI. [Assigned to: ]

Maintain the mapping tables so that when characters are accepted into the Unicode Standard, the mapping tables are updated to reflect the new codepoints. [Assigned to: ]
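Once a PUA character has been accepted into Unicode, data that is already in Unicode but still uses the PUA code point also needs updating. The following is a minimal sketch in Python using the built-in str.translate method. The single table entry is an assumption used for illustration only; take the real pairs from your own PUA records and NRSI's documentation.

```python
# Deprecated PUA code point -> replacement Unicode code point.
# The pair below is an illustrative assumption, not a verified
# assignment; replace it with pairs from your PUA records.
PUA_TO_UNICODE = {
    0xF21D: 0xA78C,
}


def migrate_pua(text: str) -> str:
    """Replace deprecated PUA code points with their Unicode successors."""
    # str.translate maps each code point found in the table and
    # leaves every other character unchanged.
    return text.translate(PUA_TO_UNICODE)
```

Keeping the pairs in one table, separate from the conversion code, means that only the table changes when further characters are accepted into the Unicode Standard.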
