Commit 89eeba6

HTML cleanup
1 parent f7d6935 commit 89eeba6

6 files changed: +51 -28 lines changed

caltech_thesis.py

Lines changed: 7 additions & 2 deletions
@@ -1,6 +1,11 @@
 import xmltodict
 from datacite import schema40
-import glob,json,datetime
+import glob,json,datetime,re
+
+def cleanhtml(raw_html):
+    cleanr = re.compile('<.*?>')
+    cleantext = re.sub(cleanr, '', raw_html)
+    return cleantext
 
 #Parse subjects file
 infile = open('thesis-subjects.txt','r')
@@ -44,7 +49,7 @@
 "Dissertation ("+eprint['thesis_degree']+")",'resourceTypeGeneral':"Text"}
 metadata['identifier'] = {'identifier':eprint['doi'],'identifierType':"DOI"}
 metadata['descriptions'] =[{'descriptionType':"Abstract",\
-'description':eprint['abstract']}]
+'description':cleanhtml(eprint['abstract'])}]
 metadata['formats'] = ['PDF']
 metadata['version'] = 'Final'
 metadata['language'] = 'English'
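
The `cleanhtml` helper added in this commit strips tags with a non-greedy regular expression. A minimal, self-contained sketch of the same approach follows; the sample abstract string is made up for illustration, loosely modelled on the `B<sub>2</sub>(pin)<sub>2</sub>` fragment in examples/10271_datacite.xml.

```python
import re

# Same pattern as the commit: non-greedy match of anything between < and >.
TAG_RE = re.compile('<.*?>')

def cleanhtml(raw_html):
    """Strip HTML tags from a string, as the commit's helper does."""
    return re.sub(TAG_RE, '', raw_html)

# Hypothetical sample input, not taken verbatim from any record.
abstract = '<p>Borylation with B<sub>2</sub>(pin)<sub>2</sub> may be achieved.</p>'
print(cleanhtml(abstract))  # prints: Borylation with B2(pin)2 may be achieved.
```

Because `.` does not match newlines by default, a tag that spans two lines is left in place, and literal `<` ... `>` in running text (for example `a < b and c > d`) is removed as if it were a tag. Whether either case occurs in the thesis abstracts is not shown in this commit.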

examples/10271_datacite.xml

Lines changed: 19 additions & 10 deletions
@@ -16,22 +16,31 @@
 <publicationYear>2017</publicationYear>
 <subjects>
 <subject>Catalysis</subject>
-<subject> Reaction Development</subject>
-<subject> Methodology</subject>
-<subject> Ni catalysis</subject>
-<subject> Cu catalysis</subject>
-<subject> Asymmetric Catalysis</subject>
+<subject>Reaction Development</subject>
+<subject>Methodology</subject>
+<subject>Ni catalysis</subject>
+<subject>Cu catalysis</subject>
+<subject>Asymmetric Catalysis</subject>
+<subject>Chemistry</subject>
 </subjects>
 <dates>
-<date dateType="Issued">2018-04-16</date>
-<date dateType="Accepted">2017-05-25</date>
+<date dateType="Issued">2018-04-19</date>
+<date dateType="Accepted">2017-06-06</date>
 </dates>
+<language>English</language>
 <resourceType resourceTypeGeneral="Text">Dissertation (PHD)</resourceType>
+<formats>
+<format>PDF</format>
+</formats>
+<version>Final</version>
+<rightsList>
+<rights>No commercial reproduction, distribution, display or performance rights in this work are provided.</rights>
+</rightsList>
 <descriptions>
-<description descriptionType="Abstract">&lt;p&gt;Chapters 1 and 2 describe the development of photoinduced, Cu-catalyzed coupling reactions of unactivated secondary alkyl halides with amide and cyanide nucleophiles. These reactions may be conducted at room temperature under operationally simple conditions. Mechanistic studies are consistent with the intermediacy of alkyl radicals in these processes.&lt;/p&gt;
+<description descriptionType="Abstract">Chapters 1 and 2 describe the development of photoinduced, Cu-catalyzed coupling reactions of unactivated secondary alkyl halides with amide and cyanide nucleophiles. These reactions may be conducted at room temperature under operationally simple conditions. Mechanistic studies are consistent with the intermediacy of alkyl radicals in these processes.
 
-&lt;p&gt;Chapter 3 describes progress toward the development of the first enantioselective Ni-catalyzed cross coupling of racemic alkyl halides and heteroatom nucleophiles. Borylation of secondary benzylic chlorides with B&lt;sub&gt;2&lt;/sub&gt;(pin)&lt;sub&gt;2&lt;/sub&gt; may be achieved in good yield and promising levels of enantioselectivity.&lt;/p&gt;
+Chapter 3 describes progress toward the development of the first enantioselective Ni-catalyzed cross coupling of racemic alkyl halides and heteroatom nucleophiles. Borylation of secondary benzylic chlorides with B2(pin)2 may be achieved in good yield and promising levels of enantioselectivity.
 
-&lt;p&gt;Chapter 4 describes enantioselective Ni-catalyzed couplings of α-substituted lactam enolates with benzonitrile derivatives resulting in formal intermolecular C- acylation via in situ hydrolysis of an imine intermediate.&lt;/p&gt;</description>
+Chapter 4 describes enantioselective Ni-catalyzed couplings of α-substituted lactam enolates with benzonitrile derivatives resulting in formal intermolecular C- acylation via in situ hydrolysis of an imine intermediate.</description>
 </descriptions>
 </resource>
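
The subject entries above lose their leading space in this commit, but the code change responsible is not part of the hunks shown. One plausible way to get that result, sketched here as an assumption rather than the script's actual logic, is to trim each keyword after splitting a comma-separated string. The `split_keywords` name and the `keywords` input are hypothetical.

```python
def split_keywords(raw):
    """Split a comma-separated keyword string and trim whitespace from each entry."""
    return [kw.strip() for kw in raw.split(',') if kw.strip()]

# Hypothetical input, modelled on the subjects in the example record.
keywords = 'Catalysis, Reaction Development, Methodology, Ni catalysis'

# One dict per keyword, in the shape of a DataCite-style subjects list.
subjects = [{'subject': kw} for kw in split_keywords(keywords)]
print(subjects)
```

Filtering out empty strings after `strip()` also drops entries produced by doubled or trailing commas.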
File renamed without changes.

examples/10292_datacite.xml

Lines changed: 16 additions & 7 deletions
@@ -16,20 +16,29 @@
 <publicationYear>2017</publicationYear>
 <subjects>
 <subject>Boosting</subject>
+<subject>Electrical Engineering</subject>
 </subjects>
 <dates>
-<date dateType="Issued">2018-04-16</date>
-<date dateType="Accepted">2017-05-22</date>
+<date dateType="Issued">2018-04-19</date>
+<date dateType="Accepted">2107-06-06</date>
 </dates>
+<language>English</language>
 <resourceType resourceTypeGeneral="Text">Dissertation (PHD)</resourceType>
+<formats>
+<format>PDF</format>
+</formats>
+<version>Final</version>
+<rightsList>
+<rights>No commercial reproduction, distribution, display or performance rights in this work are provided.</rights>
+</rightsList>
 <descriptions>
-<description descriptionType="Abstract">&lt;p&gt;Machine learning is becoming prevalent in all aspects of our lives. For some applications, there is a need for simple but accurate white-box systems that are able to train efficiently and with little data.&lt;/p&gt;
+<description descriptionType="Abstract">Machine learning is becoming prevalent in all aspects of our lives. For some applications, there is a need for simple but accurate white-box systems that are able to train efficiently and with little data.
 
-&lt;p&gt;"Boosting" is an intuitive method, combining many simple (possibly inaccurate) predictors to form a powerful, accurate classifier. Boosted classifiers are intuitive, easy to use, and exhibit the fastest speeds at test-time when implemented as a cascade. However, they have a few drawbacks: training decision trees is a relatively slow procedure, and from a theoretical standpoint, no simple unified framework for cost-sensitive multi-class boosting exists. Furthermore, (axis-aligned) decision trees may be inadequate in some situations, thereby stalling training; and even in cases where they are sufficiently useful, they don't capture the intrinsic nature of the data, as they tend to form boundaries that overfit.&lt;/p&gt;
+"Boosting" is an intuitive method, combining many simple (possibly inaccurate) predictors to form a powerful, accurate classifier. Boosted classifiers are intuitive, easy to use, and exhibit the fastest speeds at test-time when implemented as a cascade. However, they have a few drawbacks: training decision trees is a relatively slow procedure, and from a theoretical standpoint, no simple unified framework for cost-sensitive multi-class boosting exists. Furthermore, (axis-aligned) decision trees may be inadequate in some situations, thereby stalling training; and even in cases where they are sufficiently useful, they don't capture the intrinsic nature of the data, as they tend to form boundaries that overfit.
 
-&lt;p&gt;My thesis focuses on remedying these three drawbacks of boosting.
-Ch.III outlines a method (called QuickBoost) that trains identical classifiers at an order of magnitude faster than before, based on a proof of a bound. In Ch.IV, a unified framework for cost-sensitive multi-class boosting (called REBEL) is proposed, both advancing theory and demonstrating empirical gains. Finally, Ch.V describes a novel family of weak learners (called Localized Similarities) that guarantee theoretical bounds and outperform decision trees and Neural Nets (as well as several other commonly used classification methods) on a range of datasets. &lt;/p&gt;
+My thesis focuses on remedying these three drawbacks of boosting.
+Ch.III outlines a method (called QuickBoost) that trains identical classifiers at an order of magnitude faster than before, based on a proof of a bound. In Ch.IV, a unified framework for cost-sensitive multi-class boosting (called REBEL) is proposed, both advancing theory and demonstrating empirical gains. Finally, Ch.V describes a novel family of weak learners (called Localized Similarities) that guarantee theoretical bounds and outperform decision trees and Neural Nets (as well as several other commonly used classification methods) on a range of datasets.
 
-&lt;p&gt;The culmination of my work is an easy-to-use, fast-training, cost-sensitive multi-class boosting framework whose functionality is interpretable (since each weak learner is a simple comparison of similarity), and whose performance is better than Neural Networks and other competing methods. It is the tool that everyone should have in their toolbox and the first one they try.&lt;/p&gt;</description>
+The culmination of my work is an easy-to-use, fast-training, cost-sensitive multi-class boosting framework whose functionality is interpretable (since each weak learner is a simple comparison of similarity), and whose performance is better than Neural Networks and other competing methods. It is the tool that everyone should have in their toolbox and the first one they try.</description>
 </descriptions>
 </resource>
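
The new Accepted date in this record, 2107-06-06, falls after the Issued date of 2018-04-19, which suggests a transposed year in the source data. A small check like the following could flag such records. It is only a sketch: the `dates_in_order` helper is hypothetical and not part of caltech_thesis.py, and it assumes plain `YYYY-MM-DD` strings.

```python
from datetime import date

def dates_in_order(accepted, issued):
    """Return True when the accepted date is not later than the issued date."""
    return date.fromisoformat(accepted) <= date.fromisoformat(issued)

# Date pairs taken from the example records in this commit.
print(dates_in_order('2017-06-06', '2018-04-19'))  # True
print(dates_in_order('2107-06-06', '2018-04-19'))  # False
```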

examples/5410_datacite.xml

Lines changed: 5 additions & 5 deletions
@@ -21,10 +21,10 @@
 <subject>Workman Hill</subject>
 <subject>Whittier Fault</subject>
 <subject>Workman Fault</subject>
-<subject>geol</subject>
+<subject>Geology</subject>
 </subjects>
 <dates>
-<date dateType="Issued">2018-04-18</date>
+<date dateType="Issued">2018-04-19</date>
 <date dateType="Available">2009-12-03 21:23:38</date>
 </dates>
 <language>English</language>
@@ -40,10 +40,10 @@
 <rights>No commercial reproduction, distribution, display or performance rights in this work are provided.</rights>
 </rightsList>
 <descriptions>
-<description descriptionType="Abstract">&lt;p&gt;The rocks of the Tertiary described in this report are in the Puente Hills which are just south of the town of Puente, southeast from the town of El Monte, north of the town of Whittier, and approximately thirteen miles in a southeasternly direction from Los Angeles. The area investigated lies between Turnbull Canyon Road on the west and Hudson Avenue on the east.&lt;/p&gt;
+<description descriptionType="Abstract">The rocks of the Tertiary described in this report are in the Puente Hills which are just south of the town of Puente, southeast from the town of El Monte, north of the town of Whittier, and approximately thirteen miles in a southeasternly direction from Los Angeles. The area investigated lies between Turnbull Canyon Road on the west and Hudson Avenue on the east.
 
-&lt;p&gt;The maps which cover this district are the Los Angeles
+The maps which cover this district are the Los Angeles
 County maps, La Habra and Whittier Quadrangles. The scale is
-1/2400, and the contour interval is five and twenty-five feet. The maps are accurate and adequate in every detail.&lt;/p&gt;</description>
+1/2400, and the contour interval is five and twenty-five feet. The maps are accurate and adequate in every detail.</description>
 </descriptions>
 </resource>

examples/9981_datacite.xml

Lines changed: 4 additions & 4 deletions
@@ -18,12 +18,12 @@
 <subjects>
 <subject>ITCZ</subject>
 <subject>Climate Dynamics</subject>
-<subject>envreng</subject>
-<subject>appliedmath</subject>
-<subject>compsci</subject>
+<subject>Environmental Science and Engineering</subject>
+<subject>Applied And Computational Mathematics</subject>
+<subject>Computer Science</subject>
 </subjects>
 <dates>
-<date dateType="Issued">2018-04-18</date>
+<date dateType="Issued">2018-04-19</date>
 <date dateType="Accepted">2016-12-09</date>
 </dates>
 <language>English</language>
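
The diffs for examples/5410_datacite.xml and examples/9981_datacite.xml replace short option codes with full names. caltech_thesis.py opens thesis-subjects.txt under a "Parse subjects file" comment, which presumably supplies this mapping, but the file's format is not shown in this commit. The sketch below therefore uses a hard-coded dictionary; the `SUBJECT_NAMES` table and `expand_subject` helper are hypothetical, with the code/name pairs copied from the diffs.

```python
# Hypothetical lookup table; pairs copied from the example diffs.
SUBJECT_NAMES = {
    'geol': 'Geology',
    'envreng': 'Environmental Science and Engineering',
    'appliedmath': 'Applied And Computational Mathematics',
    'compsci': 'Computer Science',
}

def expand_subject(code):
    """Return the full subject name for a code, or the input unchanged if unknown."""
    return SUBJECT_NAMES.get(code, code)

# Free-text keywords pass through; known codes are expanded.
print([expand_subject(s) for s in ['ITCZ', 'Climate Dynamics', 'envreng']])
```

Falling back to the input keeps author-supplied keywords such as `ITCZ` intact while still expanding the option codes.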
