<!doctype html>
<html>

<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="chrome=1">
    <title>DfAnalyzer</title>
    <link rel="stylesheet" href="stylesheets/styles.css">
    <link rel="stylesheet" href="stylesheets/github-dark.css">
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js"></script>
    <script src="javascripts/main.js"></script>
    <!--[if lt IE 9]>
      <script src="//html5shiv.googlecode.com/svn/trunk/html5.js"></script>
    <![endif]-->
    <meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no">

</head>

<body>

    <header>
        <h1>DfAnalyzer</h1>
        <p></p>
    </header>

    <div id="banner">
    </div>
    <!-- end banner -->

    <div class="wrapper">
        <nav>
            <ul><a href="index.html"><strong>DfAnalyzer</strong></a></ul>
            <ul><a href="https://github.com/hpcdb/DfAdapter" target="_blank"><strong>DfAdapter</strong></a></ul>
            <ul><a href="prov-df.html"><strong>PROV-Df</strong></a></ul>
            <ul><a href="applications.html"><strong>Applications</strong></a></ul>
            <ul><a href="team.html"><strong>Our team</strong></a></ul>
            <ul><a href="publications.html"><strong>Publications</strong></a></ul>
            <ul><a href="acknowledgments.html"><strong>Acknowledgements</strong></a></ul>
        </nav>
        <section>
            <h3>
                <a id="dfanalyzer" class="anchor" href="#dfanalyzer" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a> DfAnalyzer tool
            </h3>
            <p>
                DfAnalyzer is an instantiation of the <a href="https://hpcdb.github.io/armful/" target="_blank">ARMFUL architecture</a> that provides dataflow analysis for
                scientific applications. By coupling DfAnalyzer components to the source code of a scientific application, scientific data is extracted and related along a dataflow
                that is ready for queries at runtime. DfAnalyzer has features that help with monitoring, debugging, provenance registration, and steering, all at runtime.
                Its lightweight components do not harm the parallel performance of scientific applications. As opposed
                to Scientific Workflow Management Systems (SWMS), DfAnalyzer keeps all parallel execution control in the scientific application code.
                Therefore, applications may use highly efficient libraries or invoke black-box parallel code.
                DfAnalyzer can also be used by computational scientists and computer specialists to incorporate user steering support.
                When coupled with <a href="https://github.com/hpcdb/DfAdapter" target="_blank">DfAdapter</a>, it collects provenance of human interaction data as users adapt a dataflow online (while it is executing), relates this data to other data of interest, and makes the provenance database available for online analytical queries. Examples of possible adaptations include parameter tuning, dataset reduction, and loop changes.
                The following table shows the components of DfAnalyzer:
            </p>
            <table style="width:100%">
                <tr>
                    <th>DfAnalyzer component</th>
                </tr>
                <tr>
                    <td>Provenance Data Extractor</td>
                </tr>
                <tr>
                    <td>Raw Data Extractor</td>
                </tr>
                <tr>
                    <td>Raw Data Indexer</td>
                </tr>
                <tr>
                    <td>Dataflow Viewer</td>
                </tr>
                <tr>
                    <td>Query Interface</td>
                </tr>
            </table>
            <p align="center">
                <b>Table 1. Components of DfAnalyzer</b>
            </p>

            <h3>
                Components
            </h3>
            <h4>
                Provenance Data Extractor
            </h4>
            <p>
                Provenance Data Extractor (PDE) is a RESTful API that manages the dataflow at the physical (i.e., file flow) and logical (i.e., data flow) levels. By receiving HTTP requests, PDE captures provenance and scientific data from scientific application codes
                or from workflows modeled in an SWMS, and generates a JSON object (kept in memory) with the gathered data and their data dependencies. This JSON object has a hierarchical structure that follows the
                dataflow concepts presented in <a href="http://onlinelibrary.wiley.com/doi/10.1002/cpe.3616/abstract" target="_blank">our CCPE paper</a>, published in 2016. Besides capturing provenance and scientific data, PDE also loads
                these data into DfAnalyzer's provenance database to enable online query processing.
            </p>
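            <p>
                As a rough illustration of the idea above, the following Python sketch builds a hierarchical JSON message (dataflow, then transformations, then datasets) and wraps it in an HTTP POST request without sending it. The endpoint URL, the field names, and the helper functions are hypothetical and chosen only for illustration; they are not PDE's actual API.
            </p>

```python
import json
import urllib.request

# Hypothetical endpoint: the real PDE host, port, and route are not given here.
PDE_URL = "http://localhost:22000/pde/dataflow"


def build_dataflow_message(tag, transformations):
    """Return a nested dict: dataflow, then transformations, then datasets."""
    return {
        "dataflow": {
            "tag": tag,
            "transformations": [
                {
                    "tag": t["tag"],
                    "input_datasets": t.get("inputs", []),
                    "output_datasets": t.get("outputs", []),
                }
                for t in transformations
            ],
        }
    }


def build_request(message, url=PDE_URL):
    """Wrap the message in an HTTP POST request without sending it."""
    body = json.dumps(message).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Made-up dataflow with two transformations linked by the "mesh" dataset.
message = build_dataflow_message(
    "simulation",
    [
        {"tag": "mesh_generation", "outputs": ["mesh"]},
        {"tag": "solver", "inputs": ["mesh"], "outputs": ["solution"]},
    ],
)
request = build_request(message)
print(request.get_method(), request.full_url)
print(json.dumps(message, indent=2))
```

            <p>
                The data dependency between the two transformations is expressed only through the shared dataset name, which is one simple way to keep the dependencies inside the hierarchical object.
            </p>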
            <p>
                More information about PDE can be found <a href="https://github.com/hpcdb/armful/tree/gh-pages/src/dfanalyzer/pde" target="_blank">here</a>.
            </p>

            <h4>
                Raw Data Extractor
            </h4>
            <p>
                Before sending retrospective provenance data (scientific data from the execution of a scientific application) through PDE, users have to access and extract scientific data from raw data files. Raw Data Extractor (RDE) is a binary program
                that performs this extraction of raw data from files. To run it, users have to define the following input arguments:
                <ul>
                    <li>
                        Cartridge for raw data extraction;
                    </li>
                    <li>
                        Name of the raw data extractor;
                    </li>
                    <li>
                        Directory where raw data files can be found;
                    </li>
                    <li>
                        Raw data files that have to be investigated; and
                    </li>
                    <li>
                        Attributes (<i>i.e.</i>, scientific data contents) that have to be accessed and extracted from the raw data files.
                    </li>
                </ul>
            </p>
            <p>
                Since DfAnalyzer is based on the component-based architecture ARMFUL, users can develop new cartridges (<i>i.e.,</i> new algorithms) for extracting scientific data from raw data files. Given these input arguments, RDE
                accesses the raw data files one at a time and extracts the values of the attributes specified by the user. All extracted raw data (from all files) are stored in a single output file in Comma-Separated Values (CSV) format, with
                a header (<i>i.e.</i>, the selected attributes).
            </p>
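            <p>
                The extraction flow described above can be sketched in Python as follows. This is a minimal illustration, not the RDE program itself: it assumes the raw data files are themselves CSV files, and the function name, file names, and attribute names are all made up for the example.
            </p>

```python
import csv
import tempfile
from pathlib import Path


def extract_attributes(directory, file_names, attributes, output_path):
    """Copy the selected attributes from each raw file into one CSV file."""
    rows_written = 0
    with open(output_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=attributes)
        writer.writeheader()  # the header holds the selected attributes
        for name in file_names:  # raw data files are visited one at a time
            with open(Path(directory) / name, newline="") as raw:
                for record in csv.DictReader(raw):
                    writer.writerow({a: record[a] for a in attributes})
                    rows_written += 1
    return rows_written


# Tiny demonstration with two made-up raw data files.
workdir = Path(tempfile.mkdtemp())
(workdir / "step1.csv").write_text("time,velocity,pressure\n0.0,1.5,101.3\n")
(workdir / "step2.csv").write_text("time,velocity,pressure\n0.1,1.7,101.1\n")

output_file = workdir / "extracted.csv"
count = extract_attributes(
    workdir, ["step1.csv", "step2.csv"], ["time", "velocity"], output_file
)
print(count)
print(output_file.read_text())
```

            <p>
                In this sketch the parsing logic inside the loop plays the role of a cartridge: supporting another raw file format would mean swapping only that part.
            </p>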
            <p>
                More information about RDE can be found <a href="https://github.com/hpcdb/armful/tree/gh-pages/src/dfanalyzer/rde" target="_blank">here</a>.
            </p>

            <h4>
                Raw Data Indexer
            </h4>
            <p>
                Besides raw data extraction, some approaches index the raw data in files to reduce the volume of data that has to be loaded into an external repository or a database, such as
                <a href="https://sdm.lbl.gov/fastbit/" target="_blank">FastBit</a>,
                <a href="http://stratos.seas.harvard.edu/publications/nodb-efficient-query-execution-raw-data-files-0" target="_blank">NoDB</a>, <a href="https://infoscience.epfl.ch/record/200346/files/raw.pdf" target="_blank">RAW</a>, and <a href="https://sdm.lbl.gov/~sbyna/research/papers/2013-PDSW2013-SDS.pdf"
                    target="_blank">SDS/Q</a>. Raw Data Indexer (RDI) takes the same input arguments as RDE, but it generates indexes for the scientific data extracted from files. The RDI binary program provides cartridges that use indexing
                techniques from existing approaches, such as the FastBit tool. It also has a cartridge based on positional indexing, which follows RAW’s implementation. In addition, users can develop new cartridges to apply other
                indexing techniques to their raw data file formats.
            </p>
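            <p>
                To make the notion of positional indexing more concrete, the Python sketch below records the byte offset at which each row of a raw file starts, and then uses those offsets to jump directly to a row. It is a generic illustration of the technique under simple assumptions (a line-oriented text file with a header), not the code of the RDI cartridge or of RAW; the function and file names are hypothetical.
            </p>

```python
import tempfile
from pathlib import Path


def build_positional_index(path):
    """Return the byte offset at which each data row starts (header skipped)."""
    offsets = []
    with open(path, "rb") as raw:
        raw.readline()  # skip the header line
        while True:
            position = raw.tell()
            line = raw.readline()
            if not line:
                break
            offsets.append(position)
    return offsets


def read_row(path, offsets, row_number):
    """Seek straight to one row using the index instead of scanning the file."""
    with open(path, "rb") as raw:
        raw.seek(offsets[row_number])
        return raw.readline().decode("utf-8").rstrip("\r\n").split(",")


# Made-up raw data file with a header and three rows.
raw_file = Path(tempfile.mkdtemp()) / "step1.csv"
raw_file.write_bytes(b"time,velocity\n0.0,1.5\n0.1,1.7\n0.2,1.9\n")

index = build_positional_index(raw_file)
print(index)
print(read_row(raw_file, index, 2))
```

            <p>
                Only the list of offsets needs to be stored, so the raw data stays in its original file and is read only when a query actually touches a given row.
            </p>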
            <p>
                More information about RDI can be found <a href="https://github.com/hpcdb/armful/tree/gh-pages/src/dfanalyzer/rdi" target="_blank">here</a>.
            </p>
            <h4>
                Dataflow Viewer or DfViewer
            </h4>
            <p>
                DfAnalyzer provides a graphical interface, Dataflow Viewer (DfViewer), that shows a dataset-oriented view of the dataflow specifications registered in DfAnalyzer's database. DfViewer lists the dataflows registered in the provenance
                database, and users choose which dataflow specification they would like to visualize.
            </p>
            <p>
                More details about DfViewer can be found <a href="https://github.com/hpcdb/armful/tree/gh-pages/src/dfanalyzer/dfviewer" target="_blank">here</a>.
            </p>
            <h4>
                Query Interface
            </h4>
            <p>
                Once provenance and raw data are stored in a relational database, Query Interface (QI) can be invoked to query them. QI is a RESTful API that receives HTTP requests with input arguments in their messages. These
                input arguments specify the source and target datasets of the dataflow fragment to be analyzed, the attributes to be retrieved from the visited datasets, the conditions applied to the dataflow fragment to select specific data
                elements, and the datasets to be included in or excluded from the dataflow fragment. After a query specification is submitted, QI automatically generates a SQL-based query for our MonetDB database and runs it. As
                a result, QI returns a CSV file with the results of the query processing.
            </p>
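            <p>
                The translation from a query specification to SQL can be sketched as below. This is a simplified illustration under loud assumptions, not QI's actual generator: it assumes each dataset is stored as one table, that the datasets are listed in order from source to target, and that consecutive datasets can be joined on a hypothetical <code>task_id</code> column. The specification keys are also made up for the example.
            </p>

```python
def build_query(spec):
    """Translate a query specification into one SQL string."""
    datasets = spec["datasets"]  # assumed to be ordered from source to target
    select_list = ", ".join(spec["projections"])
    sql = "SELECT " + select_list + " FROM " + datasets[0]
    # Assumption: consecutive datasets share a hypothetical task_id column.
    for previous, current in zip(datasets, datasets[1:]):
        sql += (
            " JOIN " + current
            + " ON " + previous + ".task_id = " + current + ".task_id"
        )
    if spec.get("selections"):
        sql += " WHERE " + " AND ".join(spec["selections"])
    return sql


# Made-up specification: two datasets, two attributes, one condition.
spec = {
    "datasets": ["mesh", "solution"],
    "projections": ["mesh.cells", "solution.residual"],
    "selections": ["mesh.cells = 1000"],
}
print(build_query(spec))
```

            <p>
                Because the sketch concatenates strings directly, a real service would need to validate dataset and attribute names against the registered dataflow before building the statement.
            </p>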
            <p>
                More details about QI can be found <a href="https://github.com/hpcdb/armful/tree/gh-pages/src/dfanalyzer/qi" target="_blank">here</a>.
            </p>
        </section>
        <footer>
            <p>Project maintained by <a href="https://github.com/vssousa">vssousa</a></p>
            <p><small>Hosted on GitHub Pages &mdash; Theme by <a href="https://twitter.com/michigangraham">mattgraham</a></small></p>
        </footer>
    </div>
    <!--[if !IE]><script>fixScale(document);</script><![endif]-->
</body>

</html>