Uploaded On May 4, 2012, 5:29 p.m.
Uploaded By matts
Status Production (approved on May 4, 2012, 5:29 p.m. by matts)
<map>
  <entry>
    <string>au_name</string>
    <string>&quot;Virginia Tech ETD&apos;s - %d&quot;, year</string>
  </entry>
  <entry>
    <string>au_start_url</string>
    <string>&quot;%slockss/manifest.html&quot;, base_url</string>
  </entry>
  <entry>
    <string>plugin_name</string>
    <string>ETD&apos;s@VT</string>
  </entry>
  <entry>
    <string>plugin_notes</string>
    <string>The expected base URL is http://scholar.lib.vt.edu/theses/
The configuration parameter is a 4 digit year.

The ETD&apos;s@VT collection is split into two collections within Conspectus because a complete crawl requires two different plugins. The first collection, ETD&apos;s@VT, is divided into AU&apos;s by year and harvests available, restricted, and withheld ETD&apos;s from 1997 on. Each individual ETD has a folder in either theses/available or theses/withheld.  Restricted ETD&apos;s are in the available folder but are readable only by Virginia Tech IP&apos;s and MA nodes. Only IP&apos;s on the MA list of servers are allowed to get a directory listing of theses/withheld; this is updated by a cron job.

The naming convention for the majority of past ETD&apos;s and future ETD&apos;s follows a variant of the format /etd-mmddyyyy-tttttt, based on the timestamp at which they are added to the collection. Unpadded months and days, as well as 2 and 4 digit years, are all harvested into the correct AU&apos;s by year.

Here&apos;s an example of how the &quot;meat&quot; of the crawl rules works:
Rule %savailable/etd-[0-9]+%02d-[0-9]+/.*
%s is the base URL, in this case http://scholar.lib.vt.edu/theses/
available selects the available directory; an identical rule replaces available with withheld
/etd- is the standard prefix for all ETD&apos;s
[0-9]+ means one or more of the digits 0-9, i.e. any group of at least one digit
%02d inserts the two digit version of the AU&apos;s year.  Separate rules for 2 and 4 digit year formats are not needed, because the preceding [0-9]+ is &quot;greedy&quot; and already matches any group of one or more digits, including the century digits of a 4 digit year
- is part of the naming convention
[0-9]+ once again matches any group of at least one digit
/ ends the directory naming convention
.* matches everything in that directory

So to extend this example, say we have the config parameter year=1999
Items like these would be harvested into the AU for the year 1999 (the examples show the variety in the naming conventions; the uppermost is how all new additions are structured):
available/etd-01011999-123456/
available/etd-02031999-12345678/
available/etd-3499-123456/
withheld/etd-01011999-123456/
withheld/etd-02031999-12345678/
withheld/etd-3499-123456/

Not included would be anything like:
available/etd-1234567891234/ (this would go into the early unsorted collection)
available/etd-12232000-123456/ (this would go into the year 2000 AU)
available/etd-122398-123456/ (this would go into the year 1998 AU)
...and so on
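
As a rough illustration only (it is not part of the plugin), the rule above can be approximated in Python to check which folder names land in a given AU. The helper names build_rule and in_au are hypothetical, au_short_year is assumed to be the year modulo 100, and escaping the base URL is likewise an assumption made for this sketch:

```python
import re

# Base URL taken from the notes above.
BASE = "http://scholar.lib.vt.edu/theses/"


def build_rule(base_url, year, folder="available"):
    """Expand the printf-style rule template into a compiled regex.

    Assumption: au_short_year is the last two digits of the year.
    Assumption: the base URL is escaped so its dots match literally.
    """
    short_year = year % 100
    template = "%s" + folder + "/etd-[0-9]+%02d-[0-9]+/.*"
    return re.compile(template % (re.escape(base_url), short_year))


def in_au(rule, path):
    """Return True if the path under BASE matches the rule.

    The match is unanchored (search), mirroring the rule as written,
    which carries no leading ^ of its own.
    """
    return rule.search(BASE + path) is not None


# Rules for the year=1999 AU, one per directory.
available_1999 = build_rule(BASE, 1999)
withheld_1999 = build_rule(BASE, 1999, folder="withheld")
```

With these definitions, in_au(available_1999, "available/etd-3499-123456/") is true, while in_au(available_1999, "available/etd-122398-123456/") is false because the digits before the second hyphen end in 98 rather than 99.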

Anything that does not match the above structure is harvested by a separate collection and plugin, called ETD&apos;s@VT - pre 2000 unsorted and edu.vt.library.thesesearly, respectively. This second collection is static, as no new ETD&apos;s are being added with the old naming conventions. It also harvests the non-ETD content in the /theses directory, as it merely excludes pages that follow the above format.</string>
  </entry>
  <entry>
    <string>plugin_config_props</string>
    <list>
      <org.lockss.daemon.ConfigParamDescr>
        <key>base_url</key>
        <displayName>Base URL</displayName>
        <description>Usually of the form http://&lt;journal-name&gt;.com/</description>
        <type>3</type>
        <size>40</size>
        <definitional>true</definitional>
        <defaultOnly>false</defaultOnly>
      </org.lockss.daemon.ConfigParamDescr>
      <org.lockss.daemon.ConfigParamDescr>
        <key>year</key>
        <displayName>Year</displayName>
        <description>Four digit year (e.g., 2004)</description>
        <type>4</type>
        <size>4</size>
        <definitional>true</definitional>
        <defaultOnly>false</defaultOnly>
      </org.lockss.daemon.ConfigParamDescr>
    </list>
  </entry>
  <entry>
    <string>plugin_version</string>
    <string>7</string>
  </entry>
  <entry>
    <string>au_crawl_depth</string>
    <int>5</int>
  </entry>
  <entry>
    <string>au_def_new_content_crawl</string>
    <long>31536000000</long>
  </entry>
  <entry>
    <string>au_def_pause_time</string>
    <long>6000</long>
  </entry>
  <entry>
    <string>plugin_identifier</string>
    <string>edu.vt.library.theses</string>
  </entry>
  <entry>
    <string>au_crawlrules</string>
    <list>
      <string>4,&quot;^%s&quot;, base_url</string>
      <string>1,&quot;%slockss/manifest.html$&quot;, base_url</string>
      <string>1,&quot;^%swithheld/?$&quot;, base_url</string>
      <string>1,&quot;%sbrowse/by_author/all.html&quot;, base_url</string>
      <string>2,&quot;/\?&quot;</string>
      <string>1,&quot;%savailable/etd-[0-9]+%02d-[0-9]+/.*&quot;, base_url, au_short_year</string>
      <string>1,&quot;%swithheld/etd-[0-9]+%02d-[0-9]+/.*&quot;, base_url, au_short_year</string>
    </list>
  </entry>
  <entry>
    <string>plugin_crawl_type</string>
    <string>HTML Links</string>
  </entry>
</map>