The radio scripts were sourced from the Generic Radio Workshop, which offers downloadable plain text (.txt) files containing the body of each script. For the purpose of this research, the metadata that is displayed at the top of each webpage but excluded from the plain text files was added to the files. The metadata followed a regular pattern of "series", "show", "date", and "cast", which led to the decision to have the Relax NG schema strictly require these items in this specific order. There were some discrepancies within the files, most notably in "Murder in Casbah".
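The fixed metadata order can also be spot-checked outside the schema. The following is a minimal Python sketch, not the project's Relax NG code. It assumes the four metadata items are encoded as elements named after the labels above and that they sit inside a hypothetical `metadata` wrapper; the wrapper name, the root element, and the sample values are placeholders.

```python
import xml.etree.ElementTree as ET

# Assumed element names, taken from the four metadata labels in the text.
EXPECTED_ORDER = ["series", "show", "date", "cast"]


def metadata_in_order(xml_text, wrapper="metadata"):
    """Return True if the wrapper's leading children follow EXPECTED_ORDER.

    `wrapper` is a hypothetical container element, not one named by the project.
    """
    root = ET.fromstring(xml_text)
    container = root if root.tag == wrapper else root.find(f".//{wrapper}")
    if container is None:
        return False
    tags = [child.tag for child in container]
    return tags[:len(EXPECTED_ORDER)] == EXPECTED_ORDER


# Placeholder documents: one in the expected order, one with two items swapped.
good = (
    "<script><metadata><series>s</series><show>t</show>"
    "<date>d</date><cast>c</cast></metadata></script>"
)
bad = (
    "<script><metadata><show>t</show><series>s</series>"
    "<date>d</date><cast>c</cast></metadata></script>"
)

print(metadata_in_order(good))
print(metadata_in_order(bad))
```

A check like this only reports whether a file deviates; the schema remains the place where the order is actually enforced.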
The content of the radio scripts had to be regularized during mark-up so that queries could be performed on them easily in the future, with care taken that no change would affect the contextual or structural analysis of the project.
The <lineGrp> element was added to wrap around the split lines.
The <line> element was added to wrap around the lines in <lineGrp> to preserve the original structure of the script.
<ln> was used for consistency.
<sound> and <music> served the same purpose.
The first step of analysis for the corpus was to mark the changes that the radio scripts made to the story as published by Arthur Conan Doyle. The text of Arthur Conan Doyle's original publication was sourced from Project Gutenberg, which allowed me to copy the text needed for the specific chapters into an XML file. For the initial stage of the analysis, I focused on the story "A Scandal in Bohemia", for which a radio recording, the radio script, and the original publication are all available online for comparison purposes. XSLT was used to plant @xml:id attributes in the Arthur Conan Doyle texts.
<!-- Give each <p> an xml:id of the form {story id}-p{n}, numbered in document order -->
<xsl:template match="p">
    <p xml:id="{ancestor::xml/@xml:id}-p{count(preceding::p) + 1}">
        <xsl:apply-templates/>
    </p>
</xsl:template>
These @xml:id values were used to stitch the <ln> elements in the radio script to the source text and so show the correlation between the two files. The XSLT tagged the Arthur Conan Doyle texts automatically, which then gave me pointers to use when manually tagging segments of specific paragraphs from the text.
Manual tagging was used for the segmented portions of the story, where the two source files differed too much to be stitched together at the level of the original paragraphs.
@xml:id values are structured as follows:
<p xml:id="SIB-p4">His manner was not effusive. It seldom was; but he was glad, I think,
to see me. With hardly a word spoken, but with a kindly eye, he waved me to an
armchair, threw across his case of cigars, and indicated a spirit case and a
gasogene in the corner. Then he stood before the fire and looked me over in his
singular introspective fashion.</p>
<seg> tags are structured as follows:
<p xml:id="SIB-p7">“Indeed, I should have thought a little more. Just a trifle more, I
fancy, Watson.
<seg xml:id="SIB-p7-s1">
And in practice again, I observe. You did not tell me that you
intended to go into harness.”</seg>
</p>
Once the <seg> tags are in place, I stitch the radio script <ln> tags to them manually, after comparing the texts side by side, using the @pull attribute, as follows:
<ln pull="#SIB-p7-s1">
<speaker>HOLMES</speaker> And in practice again, I see. You didn't tell me that you'd gone back into harness.</ln>
Some <ln> elements match several <seg> elements. The pointers to the multiple <seg> elements are separated by white space, in the following structure:
<ln pull="#SIB-p44-s1 #SIB-p44-s2 #SIB-p46-s1">
<speaker>HOLMES</speaker>
This is my friend and colleague, <mention ref="watson">Dr.Watson</mention>. You may say anything before him that you can say to me. Whom have I the honor to address?</ln>
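Because the @pull values are typed by hand, a pointer can easily end up referring to an identifier that does not exist in the source text. The following is a minimal Python sketch of a consistency check, not part of the project's own tooling. It assumes both files can be parsed as XML; the root element names `text` and `script` and the shortened sample content are placeholders.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:id attribute under the XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def collect_ids(source_root):
    """Gather every xml:id value found in the source text."""
    return {el.get(XML_ID) for el in source_root.iter() if el.get(XML_ID)}


def unresolved_pulls(script_root, known_ids):
    """Return each @pull pointer whose target id is not in known_ids."""
    missing = []
    for ln in script_root.iter("ln"):
        # A single @pull may hold several pointers separated by white space.
        for pointer in ln.get("pull", "").split():
            if pointer.lstrip("#") not in known_ids:
                missing.append(pointer)
    return missing


# Placeholder inputs: the source defines SIB-p7 and SIB-p7-s1 only.
source = ET.fromstring(
    '<text><p xml:id="SIB-p7">Indeed. '
    '<seg xml:id="SIB-p7-s1">And in practice again.</seg></p></text>'
)
script = ET.fromstring(
    '<script><ln pull="#SIB-p7-s1">first line</ln>'
    '<ln pull="#SIB-p44-s1 #SIB-p44-s2">second line</ln></script>'
)

print(unresolved_pulls(script, collect_ids(source)))
# prints ['#SIB-p44-s1', '#SIB-p44-s2']
```

An empty list means every pointer in the script resolves to a paragraph or segment in the source file.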
Attributes were added to the <ln> tags to signify the nature of the relationship between the two text files. @change is used to show the nature of the change, using the values "wording", "interpel", and "sig".
<ln pull="#SIB-p55" change="wording">
<speaker>HOLMES</speaker>
That had not escaped me either, sir. In fact, if you will state your case, I shall be better able to advise you -- your Majesty.</ln>
@type is used to mark unmatched lines, that is, lines that were significant to the radio script but do not match up to any part of Arthur Conan Doyle's publication, by using the value "unmatched".
<ln type="unmatched" change="interpel sig"><speaker>WATSON</speaker> How do you do, sir?</ln>
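Once the lines carry @change and @type, the attributes can be counted to give a rough picture of how the script departs from the source. The following is a minimal Python sketch of such a tally, offered as one possible query rather than the project's own method. The `script` root element is a placeholder, and the two sample lines are the ones quoted above.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def tally_changes(script_root):
    """Count each @change value and each line marked type="unmatched"."""
    counts = Counter()
    for ln in script_root.iter("ln"):
        # @change may hold several values separated by white space.
        for value in ln.get("change", "").split():
            counts[value] += 1
        if ln.get("type") == "unmatched":
            counts["unmatched"] += 1
    return counts


script = ET.fromstring(
    '<script>'
    '<ln pull="#SIB-p55" change="wording"><speaker>HOLMES</speaker> '
    'That had not escaped me either, sir.</ln>'
    '<ln type="unmatched" change="interpel sig"><speaker>WATSON</speaker> '
    'How do you do, sir?</ln>'
    '</script>'
)

for name, count in sorted(tally_changes(script).items()):
    print(name, count)
```

Keeping "unmatched" in the same counter as the @change values is only a convenience for this sketch; the two attributes could equally be tallied separately.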