Building a Word Importer Test Bed
Currently, about 2500 Word documents have been successfully imported. This number is somewhat inflated by Word imports from our own staff, which are likely never published.
Determining the quality of the Word imports our authors are experiencing is a very difficult task. By building a Word Importer test bed, we hope to be able to quantify the types of Word documents that are being imported.
In a previous effort (see ticket 4444) we determined the size of the Word docs that were being imported, since we wanted to set a reasonable limit on the Word doc page count in order to prevent "books" from being imported as CNX modules.

The authors were actually better behaved than I expected. A six-page Word document makes a reasonably sized CNX module.
The chart below ranks our authors by the number of successful Word imports (aka our Word Importer power users):
| 231 | cosborne |
| 190 | ncpea |
| 126 | wavelets |
| 089 | fmoody |
| 078 | cbm |
| 071 | hminer |
| 063 | mmichaelides |
| 060 | uccp |
| 056 | KCP |
| 055 | mmchale |
| 049 | allenlu |
| 036 | amp4290 |
| 035 | xiornik |
| 032 | cburrus |
| 029 | southwood |
| 029 | amacneil |
| 027 | lstefanovska |
| 024 | janeslee |
| 022 | nhphuong |
| 019 | doering |
| 018 | TaHongHa |
| 016 | votrunghung |
| 016 | thangnguyentien |
| 016 | rdaniel |
| 014 | thornhillj |
| 014 | jmk1 |
| 014 | ADVANCE |
| 013 | tamhuynhngoc |
| 013 | snevatia |
| 013 | driskila |
| 013 | bpang |
| 012 | nguyenlehongsinh |
| 012 | aitrivra |
Proposal for Test Bed
Examining all 2500 successful Word imports in detail would be ideal, but it is not practical. Thus, we are left with choosing a "representative" subset. Some decisions will need to be made to filter the 2500 imports into a workable subset of, say, 100.
We might want to reward our Word import power users by giving their successful imports a target weight in the test bed. Any benefits derived from having a test bed will thus impact our power users.
Empirically working with a large-ish Word document is hard. Large-ish Word documents are also not quality candidates for CNX modules. Thus, we will concentrate our efforts on Word documents which are 10 pages (as seen in OpenOffice.org) or shorter.
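The selection described above could be sketched roughly as follows. This is only an illustration, not the procedure actually used to pick the test bed: the function name, the (author, filename, pages) tuple layout, and the choice to weight each document by its author's total import count are all assumptions.

```python
import random
from collections import Counter

MAX_PAGES = 10     # page limit from the proposal
TARGET_SIZE = 100  # "workable subset of, say, 100"


def pick_test_bed(imports, target_size=TARGET_SIZE, max_pages=MAX_PAGES, seed=0):
    """Pick a weighted random subset of imports (hypothetical helper).

    `imports` is a list of (author, filename, pages) tuples.
    """
    # Count imports per author over the full set, so power users weigh more.
    per_author = Counter(author for author, _, _ in imports)

    # Drop documents that exceed the page limit.
    candidates = [doc for doc in imports if doc[2] <= max_pages]

    # Weighted sampling without replacement: each document gets the key
    # u ** (1 / weight) with u uniform in [0, 1); the largest keys win.
    rng = random.Random(seed)
    keyed = [(rng.random() ** (1.0 / per_author[doc[0]]), doc) for doc in candidates]
    keyed.sort(key=lambda pair: pair[0], reverse=True)

    return [doc for _, doc in keyed[:target_size]]
```

A fixed seed keeps the selection reproducible, which matters if the same test bed is to be re-run after importer changes. A different weighting (for example, a fixed quota per power user) would fit the proposal equally well.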
Test Bed Documents
First cut ... 165 documents. Documents longer than 10 pages are noted.
cosborne__IMP_Testing_Text.dot
cosborne__IMP_TG_Overland_1.3.2.3_To_Kearney_by_Equation
cosborne__Overland_Trail_Pacing.doc
ncpea__11_Friedland_final.doc (20 pages)
ncpea__#139final.doc (17 pages)
ncpea__3_Elvis_Arterbury_-_Assessing_public_opinion.doc
ncpea__1_Betty_Joe_Monk_-_Working_with_the_media_betty.doc
ncpea__7_tod_allen_farmer_-_Politics_of_local_education_module.doc
ncpea__8_mark_weber_-_The_imporance_of_interpersonal_relationships_module.doc
ncpea__adequacy_litigation_and_student_achievement_6_08.doc (46 pages)
ncpea__Alday.03.doc
ncpea__Assessment_5_School_Culture.doc
ncpea__Berry_vision_draft_3rd.doc
ncpea__boske.doc
ncpea__Copy_of_NCPEAhomebound.doc (16 pages)
ncpea__Culture_audit-_staff.doc
ncpea__frazer_final.doc (14 pages)
ncpea__FreeTrade.doc (22 pages)
ncpea__Gahungu-PrincipalCareer-Survey-Report-6-10-08.doc
ncpea__hickeyfinal.doc (23 pages)
ncpea__Jonesreflection.doc (11 pages)
ncpea__LOGIC_MODELS_final.doc (17 pages)
ncpea__missouri.doc (18 pages)
ncpea__National_Standards_Submission_#168_Final.doc (22 pages)
ncpea__NCPEA_CONNEXIONS_submission_#87_FINAL.doc (33 pages)
ncpea__obesity.doc (29 pages)
ncpea__Online_Simulation_Manuscript.doc (20 pages)
ncpea__papabrown.doc
ncpea__youngfinal.doc (15 pages)
wavelets__101_psm_reports_rev.doc
wavelets__121_Syll_drill_down.doc
wavelets__157PosterAssn.doc
wavelets__160Inventory.doc
wavelets__156NumericDataAssn.doc
wavelets__17GroupLeadbook.doc
wavelets__196stem_cell_debate.doc
wavelets__20xCEVE203PaperAssign_v2.doc
wavelets__215_DiscBWikifinal.doc
wavelets__55WritingMentorManual.doc (19 pages)
wavelets__comp482RubricF07.doc
wavelets__example4.doc
wavelets__TA_Prep_guide.doc
fmoody__02Chap.doc (29 pages)
fmoody__Bolton_Preston.doc
fmoody__Adickes_DavidPryor.doc
fmoody__Case_152.doc
fmoody__ChFourCNX.doc
fmoody__Collins_Lowell.doc
fmoody__Essay.doc
fmoody__Moorepoem.doc
fmoody__Preface.doc
fmoody__Rice_University_Press_Why_Now.doc
cbm__1.2_using_a_CMS.odt
cbm__2.1_promote_active_learning.dot
cbm__2.6_support.dot
cbm__3.5_blended_learning.odt
cbm__Background_?_Author_Bios.doc
cbm__DickMoore_Post.odt
cbm__Mackintosh_article.dot (38 pages)
cbm__oo_dauphin.odt
cbm__OO_Farideh.odt
cbm__OO_Masson.odt
cbm__speakers.odt
hminer__Graphing_Quadratic_Functions_II.dot
hminer__Homework_Multiplying_Binomials.dot
mmchale__Buffers08mm.doc
mmichaelides__Algebraic_Generalizations.doc
mmichaelides__ch5_Continuous_Random_Variables_Teachers_Guide.doc
mmichaelides__ch13_F_Distribution_and_Anova_Teachers_Guide.doc
mmichaelides__Data_Sets.doc
mmichaelides__Graphing.doc
mmichaelides__Homework_Ch.10.doc
mmichaelides__Homework_Ch.11.doc
mmichaelides__Lab_Ch.4_1.doc
mmichaelides__Review_Ch.11.doc
mmichaelides__ch11_Chi-Square_Distribution_Teachers_Guide.doc
mmichaelides__Stats_Book_Homework_Ch.8.doc
uccp__APEnvSci_chapter1.doc
uccp__chp4.doc
uccp__chapter5lessontext.doc
KCP__IMP_TG_Pit_1.4.2.1_Whats_Normal?.doc
KCP__IMP_TG_Pit_1.4.2.10_Standard_Deviation_Basics_
KCP__IMP_TG_Pig_1.2.2.1_Rug_Games.doc
KCP__IMP_TG_Pit_1.4.4.3_So_Little_Data,_So_Many_Rules
amp4290__quad_funtions.doc
xiornik__tragedy.doc (46 pages)
cburrus__FFTm2.doc
cburrus__LF00.doc (47 pages)
southwood__Practice_Final_Exam2.doc
southwood__SUMMARY_OF_FORMULAS.doc
southwood__Lesson_2.doc (14 pages)
amacneil__Frequently_Asked_Questions.doc
amacneil__Set_high_expectation.doc
lstefanovska__27.doc
lstefanovska__31-poim_za_izvod.doc
lstefanovska__Transformacija_pogleda.doc
janeslee__Algebraic.doc
janeslee__Inequalities.doc
janeslee__Piecewise_Functino.doc
janeslee__Domain_and_Range.doc
nhphuong__CHAPTER_7.doc (14 pages)
nhphuong__GIOI_THIEU_CHUYEN_NGANH_DT_SAU_DH.doc (18 pages)
nhphuong__CH2_ANALYSIS_IN_TIME_DOMAIN.doc (28 pages)
nhphuong__connexions_(upload).doc
doering__midi_MP-parseSMF.doc
doering__midi_messages.doc
TaHongHa__Performance_in_Noise.doc
TaHongHa__15_.1_Fading_Channel.doc
votrunghung__SoftwareProcess.doc
thangnguyentien__Bai_1_The_Foundation_of_Biochemistry.doc (11 pages)
thangnguyentien__Bai_2_Water.doc (16 pages)
thangnguyentien__Bai_6_Enzyme.doc (28 pages)
thangnguyentien__Bai_11_BIOSIGNALING.doc (15 pages)
thangnguyentien__Bai_13_Glycolysis,_Gluconeogenesis,_and_the_Pentose_Phosphate_Pathway.doc
thangnguyentien__Bai_14_The_Citric_Acid_Cycle.doc
thangnguyentien__Bai_16_Amino_Acid_Oxidation_and_the_Production_of_Urea.doc
thangnguyentien__Bai_18_Carbohydrate_Biosynthesis_in_Plants_and_Bacteria.doc (12 pages)
thangnguyentien__Bai_24_RNA_Metabolism.doc
rdaniel__Important_Terms.doc
rdaniel__Native_Americans.doc
thornhillj__09_a_rehearsal_guide.doc
thornhillj__ch_08_module_2.doc
jmk1__An_Introduction_to_Stem_Cells.doc
jmk1__Overview.doc
jmk1__World_Human_Cloning_Policies.doc
ADVANCE__Understanding_the_Promotion_and_Tenure_Process.doc
ADVANCE__How_to_Stand_Out_In_an_Interview.dot
tamhuynhngoc__Full_paper-ISEE2007-MicrostripAntenna.doc
tamhuynhngoc__lecture10_updated.doc
snevatia__T2_Controls.doc (19 pages)
driskila__M3_IntroduceV1.doc
driskila__M2_HealProj_V3.dot (12 pages)
bpang__Lab_4.doc (12 pages)
nguyenlehongsinh__Free_Distance_and_Coding_Gain.doc
nguyenlehongsinh__Synchronization_with_Continuous-Phase_Modulation_(CPM).doc
aitrivra__ewrs1.doc
aitrivra__frequency_of_discrete_time_signals.doc
andryn__MSScript1_KJ-SN-AJM.doc (14 pages)
beatriz_pena__Tema8-GestionDeFlujosInternos-PROFESOR.doc (11 pages)
dmaduike__Project.doc
dothingocthanh__AMPLITUDE_QUANTIZING.doc
duybt__Function.doc
eluther__Lab8_Final.doc
farooqmohammad__Collections.doc (18 pages)
jshutch__CDS_Chemical_Bonding_for_CNX_-_v2.doc (18 pages)
larry__BayesianactiveSonarDetectionV2.doc
lichen_games117__200704康城百货系统实施工作总结.doc
lizebeth__wordpress-instructions.doc (11 pages)
marknewlyn__Bishops_Grade_9_Ad_Nov_2007.doc (14 pages)
NASA__foodandwater.dot (16 pages)
okaram__SQL_IntroCNX.odt (15 pages)
pleiku__Chapter_7.doc
prat__Objected_Oriented_Programming.doc (19 pages)
rcarslake__Gr8_Algebra_Revision_May_2007.doc
shuns__Lesson_03[1]._Simultaneous_Equations.doc (15 pages)
slegeny__0104pp.doc
SShearman__What_is_LabVIEW.doc
space64__Chuong_02-NgonNguJava.doc
s_omid__Error_Analysis.doc
sultans__lecture_No_1.doc
TexasInstruments__CSULab4_CNXversion.doc
thon0022__davidthonglyvong_2835134_Final_Module2.doc
wyx1999__1_-_Stepping_Up_Windows_Mobile_Development_with_Visual_Studio_2005.doc (15 pages)
zameno__Session_4_V051107J.doc (23 pages)

~jenn/svn/testing/mechanize/BriansTestbedDocs
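The listing above follows an "author__filename (N pages)" pattern, with the page count present only for documents longer than 10 pages. A small sketch of how such lines might be parsed, for example to separate the over-limit documents; the function names are hypothetical and the pattern is inferred from the listing itself:

```python
import re

# author, then "__", then the filename, then an optional " (N pages)" note.
ENTRY_RE = re.compile(r"^(?P<author>.+?)__(?P<name>.+?)(?: \((?P<pages>\d+) pages?\))?$")


def parse_entry(line):
    """Return (author, filename, pages) for one listing line.

    `pages` is None when no page count is noted, which in this listing
    means the document is 10 pages or shorter.
    """
    match = ENTRY_RE.match(line.strip())
    if match is None:
        raise ValueError("not a test bed entry: %r" % line)
    pages = match.group("pages")
    return match.group("author"), match.group("name"), int(pages) if pages else None


def over_limit(lines):
    """Return the parsed entries that carry a page-count note."""
    entries = [parse_entry(line) for line in lines]
    return [entry for entry in entries if entry[2] is not None]


print(parse_entry("ncpea__FreeTrade.doc (22 pages)"))
# → ('ncpea', 'FreeTrade.doc', 22)
```

Author names containing a single underscore (such as s_omid) are handled because the split happens at the first double underscore.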