Log Pre-Processor (in Java) takes DOI resolution logs (e.g. from CNRI) and processes them into aggregates for use elsewhere. It works with plain files (some gzipped) and has no external service or library dependencies. Crossref log files weigh in at roughly 100 GB per year compressed, 500 GB uncompressed. Processing a month's worth of files, which adds up to around 400,000,000 lines, can be done in a few hours. A whole year (5 billion lines) can be done overnight.
LogPPJ has three stages, and once one stage is completed the input files for that stage can be removed.
Output is in one of two formats:
- `.csv` - normal CSV with dates as the first column and other values as the other series.
- `.csv-chunks` - many CSV tables in one file, separated by blank lines. The first line of each chunk is a header line that identifies that chunk. For `day-domain.csv-chunks` the header line would be the domain. (See the reader sketch below.)
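As an illustration of the `.csv-chunks` layout, here is a minimal Java sketch that reads such a file into a map keyed by the chunk header. The `ChunkReader` class is hypothetical and not part of LogPPJ; it simply assumes the format described above (header line, CSV rows, blank-line separator).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of LogPPJ: reads a .csv-chunks file into
// a map of header line -> CSV rows, using the blank-line separator described above.
public class ChunkReader {
    public static Map<String, List<String>> read(String path) throws IOException {
        Map<String, List<String>> chunks = new LinkedHashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(path))) {
            String line;
            String header = null;
            List<String> rows = null;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {          // blank line ends the current chunk
                    header = null;
                } else if (header == null) {   // first non-blank line is the chunk header
                    header = line;
                    rows = new ArrayList<>();
                    chunks.put(header, rows);
                } else {                       // subsequent lines are date,count rows
                    rows.add(line);
                }
            }
        }
        return chunks;
    }
}
```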
Log files come from CNRI and look something like `access_log_201501_ec2.gz`. They correspond roughly to the given month, but the servers are in arbitrary timezones, so there's often some spill into the month before or after. When the `process` command is run the log files are converted into the processed format. Output files are not overwritten, but all the input files are read every time.
Input: Gzipped CNRI log files in /logs, roughly corresponding to one month each. For Crossref, input is roughly 100GB per year. Input file format is one entry per resolution.
Output: Gzipped processed files in /processed corresponding exactly to one month per file. Output is roughly 30GB per year. Output format is one entry per resolution, tab-separated:
```
date        doi                             code  full domain name        subdomain  domain
2015-01-01  10.1016/j.jcrimjus.2010.04.019  H     sfx.carli.illinois.edu  sfx.carli  illinois.edu
```
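For illustration, a minimal sketch of how one processed line could be split into its tab-separated fields. The `ProcessedLine` class and its field names are hypothetical, mirroring the columns documented above, and are not LogPPJ's own types.

```java
// Illustrative only; field names mirror the columns documented above.
public class ProcessedLine {
    public final String date, doi, code, fullDomain, subdomain, domain;

    public ProcessedLine(String line) {
        // Entries are tab-separated: date, DOI, code, full domain, subdomain, domain.
        String[] fields = line.split("\t", -1);
        this.date = fields[0];
        this.doi = fields[1];
        this.code = fields[2];
        this.fullDomain = fields[3];
        this.subdomain = fields[4];
        this.domain = fields[5];
    }
}
```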
Code classifies the referrer type. It is one of the following (a rough sketch follows the list):

- `H` for HTTP protocol
- `S` for HTTPS protocol
- `F` for FTP protocol
- `L` for file protocol
- `U` for unknown protocol, but domain supplied
- `N` for no information
- `W` for weird (e.g. readcube)

The hostname can be `unknown.special` to represent an unknown domain.
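For orientation, here is a rough Java sketch of how a raw referrer string might map onto these codes. The real classification lives in `Parser.java` and handles far more edge cases; `ReferrerCode` is purely illustrative.

```java
// A rough sketch of the code classification described above; the real logic
// lives in Parser.java and handles many more edge cases.
public class ReferrerCode {
    public static char classify(String referrer) {
        if (referrer == null || referrer.isEmpty()) {
            return 'N';                           // no information
        }
        String r = referrer.toLowerCase();
        if (r.startsWith("https://")) return 'S'; // HTTPS protocol
        if (r.startsWith("http://"))  return 'H'; // HTTP protocol
        if (r.startsWith("ftp://"))   return 'F'; // FTP protocol
        if (r.startsWith("file://"))  return 'L'; // file protocol
        if (r.contains("."))          return 'U'; // unknown protocol, but a domain was supplied
        return 'W';                               // weird (e.g. readcube)
    }
}
```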
Weirdnesses:
- Dates are in local timezones, so a log file for February may contain dates from January or March UTC.
- Log files are found in at least 2 different formats, one missing referrers.
- Dates are found in at least 3 different formats.
- Referrers are up to the browser, so all kinds of weird stuff is found in there (take a look at `Parser.java` for examples).
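To make the date weirdness concrete, here is a hedged sketch of parsing timestamps that may arrive in several formats and timezones and normalising them to a UTC date. The format patterns and the `FlexibleDateParser` class are placeholders, not the formats or code the project actually uses.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;

// Sketch only: the real format list lives in the project's parsers. The point is
// that each candidate format is tried in turn and the result is normalised to UTC.
public class FlexibleDateParser {
    // Example formats; the actual log files may use different ones.
    private static final DateTimeFormatter[] FORMATS = {
        DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ssX", Locale.ENGLISH),
        DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH),
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss Z", Locale.ENGLISH)
    };

    public static LocalDate parseUtcDate(String input) {
        for (DateTimeFormatter format : FORMATS) {
            try {
                ZonedDateTime zoned = ZonedDateTime.parse(input, format);
                return zoned.withZoneSameInstant(ZoneOffset.UTC).toLocalDate();
            } catch (DateTimeParseException e) {
                // try the next format
            }
        }
        throw new IllegalArgumentException("Unrecognised date: " + input);
    }
}
```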
The processed log files, one per month, are aggregated to count various features. This creates one output file per month per type.
Input: Gzipped processed files in /processed
Output: CSV chunk files in /aggregated that look like `2015-01-day-code.csv-chunks`. The output is small, on the order of 100 MB for Crossref.
CSV chunk files have a header line followed by `date,count` CSV lines, followed by a blank line (a small writer sketch follows below). The types of data are currently:

- `code`, the referrer code, defined above
- `fulldomain`, the full domain, e.g. `en.wikipedia.org`
- `domain`, the domain under the effective top-level domain, e.g. `wikipedia.org`
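As an example of producing this chunk layout (header line, `date,count` rows, blank-line separator), here is a small hypothetical writer. It is not the project's own aggregation code; the nested-map input is an assumption made for illustration.

```java
import java.io.IOException;
import java.io.Writer;
import java.util.Map;

// Hypothetical writer, not the project's own code: emits counts in the
// .csv-chunks layout described above (header line, date,count rows, blank line).
public class ChunkWriter {
    public static void write(Writer out, Map<String, Map<String, Long>> countsPerChunk)
            throws IOException {
        for (Map.Entry<String, Map<String, Long>> chunk : countsPerChunk.entrySet()) {
            out.write(chunk.getKey());                              // e.g. the domain name
            out.write("\n");
            for (Map.Entry<String, Long> row : chunk.getValue().entrySet()) {
                out.write(row.getKey() + "," + row.getValue() + "\n");  // date,count
            }
            out.write("\n");                                        // blank line separates chunks
        }
    }
}
```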
The aggregated files are combined into usable outputs, stored in /analysis. This is a loosely defined stage, but where aggregated files are per month, analysis output files refer to 'all time'. Files are always overwritten, but this stage is the cheapest (under a minute for a year's input).
The raw data is useful, but sometimes we want to exclude certain domains (e.g. for data fed into Crossref Chronograph). In that case you can put the list of domains and/or subdomains in `filter-domain-names.txt` and `filter-full-domain-names.txt` in `/path/to/base/dir/`. Logppj automatically ignores domains like 'unknown.special' and 'doi.org' in the filtered analyzer output files.
You can grab a copy of the above files for Crossref from http://destinations.labs.crossref.org/data/full-domain-names.txt and http://destinations.labs.crossref.org/data/domain-names.txt.
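Here is a minimal sketch of the filtering idea, assuming the filter file is a plain list with one domain per line. The `DomainFilter` class and its behaviour are illustrative, not LogPPJ's implementation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Sketch of the filtering idea, not the project's own implementation.
public class DomainFilter {
    private final Set<String> excluded = new HashSet<>();

    public DomainFilter(String baseDir) throws IOException {
        // Domains that are always ignored in the filtered output.
        excluded.add("unknown.special");
        excluded.add("doi.org");

        // Assumed layout: one domain per line in the filter file, if it exists.
        Path filterFile = Paths.get(baseDir, "filter-domain-names.txt");
        if (Files.exists(filterFile)) {
            excluded.addAll(Files.readAllLines(filterFile));
        }
    }

    public boolean include(String domain) {
        return !excluded.contains(domain);
    }
}
```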
- `day-code.csv` - a big table of referrer codes per day. Codes are headers. One table because the set of codes is known and small.
- `day-domain-filtered.csv-chunks` - CSV chunks of count per day, one per domain (e.g. `wikipedia.org`). Chunks because the set of domains is unknown and large. Filters excluded domains.
- `day-domain-unfiltered.csv-chunks` - CSV chunks of count per day, one per domain (e.g. `wikipedia.org`). Chunks because the set of domains is unknown and large.
- `day-top-10-unfiltered-domains.csv` - a big table of the top 10 domains per day. More than 10 columns because it takes the union of all domains that were in the top 10 on any day (see the sketch below).
- `day-top-10-filtered-domains.csv` - a big table of the top 10 domains per day. More than 10 columns because it takes the union of all domains that were in the top 10 on any day. Filters excluded domains.
- `month-code.csv` - a big table of referrer codes per month.
- `month-top-10-unfiltered-domains.csv` - as `day-top-10-unfiltered-domains.csv`, per month.
- `month-top-10-filtered-domains.csv` - as `day-top-10-unfiltered-domains.csv`, per month, excluding filtered domains.
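To show why the top-10 tables can have more than 10 columns, here is a short sketch that takes the union of each day's top 10 domains as the column set. The `TopDomains` class and its nested-map input are assumptions for illustration, not the project's own code.

```java
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Sketch of how the top-10 tables can end up with more than 10 columns:
// the column set is the union of every day's top 10 domains.
public class TopDomains {
    public static Set<String> topTenColumns(Map<String, Map<String, Long>> countsPerDay) {
        Set<String> columns = new LinkedHashSet<>();
        for (Map<String, Long> domainCounts : countsPerDay.values()) {
            domainCounts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .limit(10)                                   // that day's top 10 domains
                .forEach(entry -> columns.add(entry.getKey()));
        }
        return columns;                                      // union across all days
    }
}
```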
The analysis files are chopped up for use downstream, for example in Chronograph. This stage doesn't get you anything except chopped-up files. See `Main.java`.
- Make sure Java 1.8 and ant are installed.
- Run `ant jar`.
- Put http://destinations.labs.crossref.org/data/full-domain-names.txt in `/path/to/base/dir/ignore-full-domain-names.txt` if you want.
- Put http://destinations.labs.crossref.org/data/domain-names.txt in `/path/to/base/dir/ignore-domain-names.txt` if you want.
Then:
- Put at least 3 months' log files in `/path/to/base/dir/logs`.
- Run `java -jar dist/Main.jar process /path/to/base/dir` to parse those log files. Takes a couple of hours.
- Run `java -jar dist/Main.jar aggregate /path/to/base/dir`. Takes a couple of hours.
- Run `java -jar dist/Main.jar analyze /path/to/base/dir`. Takes 12 seconds.
- Enjoy the output in `/path/to/base/dir/analyzed`.
- If more log files come in:
  1. Put the new month's log files in `/path/to/base/dir/logs`.
  2. Remove log files you've already processed. Remember to keep one month either side of the one you're interested in.
  3. Remove spill-over files from `/path/to/base/dir/processed` for partial months.
Note:
- Always include a month either side of the month you're interested in because of timezone spillover.
- Processed files aren't overwritten, so delete relevant files in `/processed` if they exist. NB last month's spill-over files.
- Once a month has been processed, you can remove the input log file, but remember that if you want to re-process last month's log files you'll need to have kept the files from the month before and after it.
- Aggregated files aren't overwritten. You can run the `aggregate` command at any time and it will only recalculate the data that hasn't already been calculated.
- Don't delete aggregated files; they're all needed for the next stage.
Referrer domains (when they are known), e.g. "http://en.wikipedia.org", are split into a domain, e.g. "wikipedia.org", and a subdomain, e.g. "en". These are represented in the output as "full domain", e.g. "en.wikipedia.org", and "domain", e.g. "wikipedia.org". This is done using the public_suffix_list.dat which is included here and available at https://publicsuffix.org/list/public_suffix_list.dat. This list is useful rather than precise. It will split "abc.def.co.uk" into "def.co.uk", but it also has some convenience entries, so that e.g. "github.io" and "blogspot.com" are treated as eTLDs. This should be borne in mind.
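Below is a simplified sketch of the eTLD+1 split, under the assumption that the suffix list has already been loaded into a plain set of strings. The `EtldSplitter` class is hypothetical and ignores the wildcard and exception rules of the real public suffix list.

```java
import java.util.Arrays;
import java.util.Set;

// Simplified sketch of eTLD+1 splitting; the real rules in public_suffix_list.dat
// include wildcards and exceptions, which this ignores.
public class EtldSplitter {
    private final Set<String> publicSuffixes;

    public EtldSplitter(Set<String> publicSuffixes) {
        this.publicSuffixes = publicSuffixes;
    }

    // "en.wikipedia.org" -> ["en", "wikipedia.org"]
    public String[] split(String fullDomain) {
        String[] labels = fullDomain.split("\\.");
        // Try the longest possible suffix first, as the public suffix rules prefer it,
        // then take one more label to get the registrable domain.
        for (int i = 1; i < labels.length; i++) {
            String suffix = String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (publicSuffixes.contains(suffix)) {
                String domain = labels[i - 1] + "." + suffix;
                String subdomain = String.join(".", Arrays.copyOfRange(labels, 0, i - 1));
                return new String[] { subdomain, domain };
            }
        }
        return new String[] { "", fullDomain };  // no suffix matched
    }
}
```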
The voracious reader might want to consult the Handle manual, section 3.7.1.
Can be run directly by ant: `time ant run -Darg0=analyze -Darg1=/data-in/logs`.
For both the Aggregator and the Analyzer stages, the input is split into partitions. For every Aggregator/Analyzer Strategy, the input is run through as many times as there are partitions. These passes are done serially rather than in parallel because the partition size is designed to use as much RAM as possible without swapping. It's better to do one chunk at a time that's as large as possible than lots of smaller ones in parallel.
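Here is a sketch of the partitioning idea, under the assumption that keys are assigned to partitions by hash: each pass over the input only counts keys belonging to the current partition, so a single pass's map fits in RAM. The `PartitionedCounter` class is illustrative, not the project's own partitioning code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the partitioning idea: the input is scanned once per partition and only
// keys belonging to the current partition are counted, so each pass fits in RAM.
public class PartitionedCounter {
    public static Map<String, Long> countPartition(List<String> keys, int partition, int numPartitions) {
        Map<String, Long> counts = new HashMap<>();
        for (String key : keys) {
            if (Math.floorMod(key.hashCode(), numPartitions) == partition) {
                counts.merge(key, 1L, Long::sum);
            }
        }
        return counts;
    }
}
```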
These heuristics are based on Crossref's data. For DataCite's data the numbers will be different, but the amount of data will be much lower anyway, so it doesn't much matter.
The process stage is also done serially because, due to the timezones, input files are multiplexed to output files. As initial processing is a low-volume activity there was no need to make everything thread-safe. Many classes are stateful and not thread-safe (because they don't have to be) in order to achieve speedups:
- `LineParser` chooses the best format for parsing the lines and remembers it
- `DateParser` chooses the best time format and remembers it
- `DateParser` remembers the last date in order to avoid having to re-parse each of a series of dates that occur on the same day
- `ETLD` has a cache of domain lookups
- `AggregatorStrategy` is stateful in that it has a counter object and the above parsers. They have a 'reset' function. Parallelizing would require creation of a further layer of `AggregatorStrategyAbstractFactory`s...
Dates are stored as Strings. There is a whole constellation of date representations available in the JRE, but we're only interested in YYYY-MM-DD in UTC.
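Here is a sketch of the `DateParser`-style caching described above, assuming that a fixed-length prefix of the raw timestamp identifies the day. The `CachingDateParser` class, the prefix-length assumption, and the injected full-parse function are all hypothetical; the real class works differently in detail, but the output is likewise a plain `YYYY-MM-DD` string.

```java
import java.util.function.Function;

// Sketch of the caching idea behind DateParser: consecutive log lines usually fall on
// the same day, so the date-identifying prefix of the raw timestamp and its parsed
// result are remembered and reused instead of re-parsing every line.
public class CachingDateParser {
    private final int prefixLength;                     // assumed: how many leading characters identify the day
    private final Function<String, String> slowParse;   // full parse + UTC normalisation, returning yyyy-MM-dd
    private String lastPrefix;
    private String lastUtcDate;

    public CachingDateParser(int prefixLength, Function<String, String> slowParse) {
        this.prefixLength = prefixLength;
        this.slowParse = slowParse;
    }

    public String parse(String rawTimestamp) {
        String prefix = rawTimestamp.length() > prefixLength
                ? rawTimestamp.substring(0, prefixLength)
                : rawTimestamp;
        if (!prefix.equals(lastPrefix)) {
            lastUtcDate = slowParse.apply(rawTimestamp);  // only re-parse when the day changes
            lastPrefix = prefix;
        }
        return lastUtcDate;
    }

    // Mirrors the 'reset' function mentioned above so the parser can be reused between runs.
    public void reset() {
        lastPrefix = null;
        lastUtcDate = null;
    }
}
```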