petro.One is an application that retrieves paper metadata from the OnePetro website. OnePetro hosts thousands of papers on oil and gas. By retrieving the metadata returned by a search query, a summary of the papers that match the query words can be collected for further analysis and text mining. The package provides some statistics and text mining tools, such as word cloud plots, keyword frequencies, conversion to a corpus document, and removal of common usage words. OnePetro link: <https://www.onepetro.org/>.
The goal of petro.One is to provide a reproducible platform for acquiring and analyzing metadata while searching for papers on oil and gas in the OnePetro website.
The standard way of searching for papers in OnePetro is to use a web browser and enter the search terms for the paper we are looking for. The results come back as web pages that may hold dozens, hundreds or thousands of paper titles. We then need to browse all the resulting pages to find the papers that best match the subject we are researching.
By using some of the statistical tools available through R, the search can become far more productive in terms of time, matching quality and selection of the papers.
The search keywords are entered through the R console and the papers are returned as a dataframe, which is laid out like a spreadsheet: rows of paper titles, and columns with details from the metadata extracted from the web page.
With the dataframe already on our computer, we can perform a thorough search and narrow it down to the most suitable papers.
You can install petro.One from GitHub with:
# install from the *master* release branch
devtools::install_github("f0nzie/petro.One")

# install from the *develop* branch
devtools::install_github("f0nzie/petro.One", ref = "develop")
or from CRAN with:
install.packages("petro.One")
A typical OnePetro search URL would look like this:
Its parameters can be explained as follows:
q=: parameter that holds the query words. In the example above, it would be q=neural+network, with the query words joined by a + sign.
peer_reviewed=: switch to return only papers that have been peer reviewed. It is active when it has the value on.
published_between=: switch that is activated when from_year and to_year have numeric entries.
from_year=: parameter to enter the starting year of the search.
to_year=: parameter to enter the end year of the search.
There are additional parameters such as:
start=: parameter to indicate the starting page if the resulting search has several pages.
rows=: parameter to indicate the number of rows (papers) to display per page. In the web browser, the options are 10, 50 and 100. Off-browser it could be a number up to 1000.
sort=: parameter related to the Sort By selector, with options such as Most recent.
dc_type=: parameter that indicates what type of document the paper is. The document types are: chapter, conference-paper, general, journal-paper, presentation, media, other, standard.
There are a few additional parameters, but they will not be used as often as the ones already described.
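To make the parameter list above more concrete, here is a minimal sketch in base R of how such parameters could be joined into a query string. It is not the code behind make_search_url; the helper `encode_value` and the parameter values are assumptions chosen for illustration, and only the query string is built because the host and path of the search URL are not shown here.

```r
# Hypothetical parameter values; the names come from the list above
params <- list(
  q                 = "neural network",
  peer_reviewed     = "on",
  published_between = "on",
  from_year         = 1990,
  to_year           = 1999,
  rows              = 100
)

# Hypothetical helper: percent-encode a value, then use "+" for spaces
encode_value <- function(x) {
  encoded <- utils::URLencode(as.character(x), reserved = TRUE)
  gsub("%20", "+", encoded, fixed = TRUE)
}

# Join every name=value pair with "&"
query_string <- paste0(
  names(params), "=",
  vapply(params, encode_value, character(1)),
  collapse = "&"
)
query_string
```

The resulting string would follow the base search address of the website.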
The key is to build a search URL that OnePetro recognizes. To do that, I wrote a function, make_search_url, that does just that. Instead of entering the search keywords, how they will be searched, the years, and the type of paper in the browser, we enter them from the R console.
Below are some examples:
how = "any" means to search for papers that contain the word neural or the word network.
Let's take a look at the difference in the number of results returned with any and all for the same keywords.
Here we make use of two functions of petro.One: make_search_url and get_papers_count.
library(petro.One)

# search any word like "neural" or "network"
url_any <- make_search_url(query = "neural network", how = "any")
url_any
#>  ""
get_papers_count(url_any)
#>  3400

# search for papers that have "neural" and "network" at the same time
url_all <- make_search_url(query = "neural network", how = "all")
url_all
#>  "\"neural+network\"&peer_reviewed=&published_between=&from_year=&to_year="
get_papers_count(url_all)
#>  3111
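The difference between any and all can also be illustrated locally, without contacting OnePetro. The sketch below is only an analogy: the actual matching happens on the OnePetro server, and the second and third titles are hypothetical ones made up for the illustration. Here any is modeled as "either word appears" and all as "the exact phrase appears".

```r
# Illustrative titles; the second and third are hypothetical
titles <- c(
  "Neural Network Stacking Velocity Picking",
  "Network Modeling of Multiphase Flow",
  "Neural Control of Drilling Dynamics",
  "Seismic Attribute Calibration"
)
lower_titles <- tolower(titles)

# "any": the title contains the word neural OR the word network
match_any <- grepl("neural|network", lower_titles)

# "all": the title contains the exact phrase "neural network"
match_all <- grepl("neural network", lower_titles, fixed = TRUE)

sum(match_any)  # number of titles matched by "any"
sum(match_all)  # number of titles matched by "all"
```

As with the real search, the any condition matches at least as many titles as the all condition.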
We can send a query where we specify the starting year and the end year, using the parameters as in the example below.
In this example, the option how = "all" means to search for papers that contain exactly the words neural network, as opposed to any, which means to search for any occurrence of the words. Of course, using any rather than all will yield many more results.
We use two petro.One functions: make_search_url to build the OnePetro search URL, and onepetro_page_to_dataframe to put the papers in a table.
library(petro.One)

# neural network papers from 1990 to 1999. Exact phrase
my_url <- make_search_url(query = "neural network",
                          from_year = 1990,
                          to_year = 1999,
                          how = "all")
df <- onepetro_page_to_dataframe(my_url)
df
#> # A tibble: 10 x 6
#>    title_data
#>    <chr>
#>  1 Deconvolution Using Neural Networks
#>  2 Neural Network Stacking Velocity Picking
#>  3 Neural Networks And Paper Seismic Interpretation
#>  4 Drill-Bit Diagnosis With Neural Networks
#>  5 Seismic Principal Components Analysis Using Neural Networks
#>  6 First Break Picking Using Neural Networks
#>  7 Reservoir Characterization Using Feedforward Neural Networks
#>  8 Seismic Attribute Calibration Using Neural Networks
#>  9 Neural Networks For Primary Reflection Identification
#> 10 Conductive fracture identification using neural networks
#> # ... with 5 more variables: paper_id <chr>, source <chr>, type <chr>,
#> #   year <int>, author1_data <chr>
And these are the terms that repeat most frequently:
term_frequency(df)
#> # A tibble: 26 x 2
#>    word              freq
#>    <chr>            <int>
#>  1 neural              10
#>  2 networks             9
#>  3 seismic              3
#>  4 identification       2
#>  5 picking              2
#>  6 analysis             1
#>  7 attribute            1
#>  8 break                1
#>  9 calibration          1
#> 10 characterization     1
#> # ... with 16 more rows
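For readers curious about what a term frequency table involves, the following is a rough base R sketch that counts the words in the ten titles listed above. It is not the implementation of term_frequency; the tokenization rule and the short `stop_words` list are assumptions, so the number of distinct words need not match the 26 reported by the package.

```r
# The ten titles returned by the 1990-1999 search above
titles <- c(
  "Deconvolution Using Neural Networks",
  "Neural Network Stacking Velocity Picking",
  "Neural Networks And Paper Seismic Interpretation",
  "Drill-Bit Diagnosis With Neural Networks",
  "Seismic Principal Components Analysis Using Neural Networks",
  "First Break Picking Using Neural Networks",
  "Reservoir Characterization Using Feedforward Neural Networks",
  "Seismic Attribute Calibration Using Neural Networks",
  "Neural Networks For Primary Reflection Identification",
  "Conductive fracture identification using neural networks"
)

# Assumed list of common usage words to drop (hypothetical, kept short)
stop_words <- c("using", "and", "with", "for")

# Lowercase, split on anything that is not a letter, drop stop words
words <- unlist(strsplit(tolower(titles), "[^a-z]+"))
words <- words[nzchar(words) & !words %in% stop_words]

# Count each word and sort from most to least frequent
freq <- sort(table(words), decreasing = TRUE)
term_freq <- data.frame(word = names(freq),
                        freq = as.integer(freq),
                        stringsAsFactors = FALSE)
head(term_freq, 3)
```

The three most frequent words agree with the table above: neural, networks and seismic.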
We can also get papers by the type of document. In OnePetro it is called dc_type. In this example we are requesting only conference-paper documents.
Here we add the parameter dc_type to make_search_url. Note also that we are adding another parameter, rows, to get 1000 rows instead of the 10, 50 or 100 that the browser allows.
# specify document type = "conference-paper", rows = 1000
my_url <- make_search_url(query = "neural network",
                          how = "all",
                          dc_type = "conference-paper",
                          rows = 1000)
get_papers_count(my_url)
#>  2770
df <- onepetro_page_to_dataframe(my_url)
df
#> # A tibble: 1,000 x 6
#>    title_data
#>    <chr>
#>  1 Deconvolution Using Neural Networks
#>  2 Neural Networks And AVO
#>  3 Neural Network Stacking Velocity Picking
#>  4 Neural Networks And Paper Seismic Interpretation
#>  5 Seismic Principal Components Analysis Using Neural Networks
#>  6 Neural networks approach to spectral enhancement
#>  7 Predicting Wax Formation Using Artificial Neural Network
#>  8 Estimation of Welding Distortion Using Neural Network
#>  9 First Break Picking Using Neural Networks
#> 10 Minimum-variance Deconvolution Using Artificial Neural Networks
#> # ... with 990 more rows, and 5 more variables: paper_id <chr>,
#> #   source <chr>, type <chr>, year <int>, author1_data <chr>
plot_wordcloud(df, max.words = 100, min.freq = 10)
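Note that the search above reports 2770 papers while the dataframe holds 1000 rows, which is the most that a single page returns. As a small sketch of the arithmetic involved, the code below works out how many pages of 1000 rows would be needed to cover all the papers and how many rows each page would carry. The variable names are hypothetical, and nothing is assumed here about how the start parameter numbers the pages.

```r
total_papers  <- 2770L   # count reported by get_papers_count above
rows_per_page <- 1000L   # maximum number of rows per page

# Number of pages needed to cover every paper
n_pages <- ceiling(total_papers / rows_per_page)

# Rows carried by each page; the last page holds the remainder
rows_in_page <- pmin(rows_per_page,
                     total_papers - (seq_len(n_pages) - 1L) * rows_per_page)

n_pages
rows_in_page
```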
In this other example we are requesting papers of the journal-paper type. We are also asking for the maximum number of rows that OnePetro permits: 1000.
# specify document type = "journal-paper", rows = 1000
my_url <- make_search_url(query = "neural network",
                          how = "all",
                          dc_type = "journal-paper",
                          rows = 1000)
get_papers_count(my_url)
#>  307
df <- onepetro_page_to_dataframe(my_url)
df
#> # A tibble: 307 x 6
#>    title_data
#>    <chr>
#>  1 Drill-Bit Diagnosis With Neural Networks
#>  2 Artificial Neural Networks Identify Restimulation Candidates
#>  3 Implicit Approximation of Neural Network and Applications
#>  4 Application of Artificial Neural Network to Pump Card Diagnosis
#>  5 Application of Artificial Neural Networks to Downhole Fluid Analysis
#>  6 Pseudodensity Log Generation by Use of Artificial Neural Networks
#>  7 Neural Networks for Predictive Control of Drilling Dynamics
#>  8 Neural Network Approach Predicts U.S. Natural Gas Production
#>  9 An Artificial Neural Network Based Relative Permeability Predictor
#> 10 Characterize Submarine Channel Reservoirs: A Neural- Network-Based Approach
#> # ... with 297 more rows, and 5 more variables: paper_id <chr>,
#> #   source <chr>, type <chr>, year <int>, author1_data <chr>
plot_wordcloud(df, max.words = 100, min.freq = 50)
For this example we want to know about conference papers where the words well and test are found together in the papers.
library(petro.One)

my_url <- make_search_url(query = "well test",
                          dc_type = "conference-paper",
                          how = "all")
get_papers_count(my_url)
#>  9440
df <- read_multidoc(my_url)
term_frequency(df)
#> # A tibble: 9,871 x 2
#>    word        freq
#>    <chr>      <int>
#>  1 reservoir   1817
#>  2 well        1667
#>  3 gas         1447
#>  4 field       1289
#>  5 production  1101
#>  6 analysis    1042
#>  7 pressure     947
#>  8 reservoirs   894
#>  9 wells        881
#> 10 data         825
#> # ... with 9,861 more rows
# plot the most frequent terms (min.freq = 400)
plot_bars(df, min.freq = 400)
It is not enough, however, to know which terms repeat the most; we also want to know how those frequent terms relate to each other.
In the following plot you will see that the strength of the relationship between terms is reflected by the thickness of the connection lines.
plot_relationships(df, min.freq = 400, threshold = 0.075)
We can see that wells and well are strongly connected to horizontal, transient, pressure, flow, testing, reservoirs, fracture, and analysis. The rest of the words are frequent but not very connected.
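One common way to quantify how strongly two terms are related is to correlate their presence across titles. The sketch below shows the idea in base R on four hypothetical titles. It is an assumption that a correlation of this kind is comparable to what plot_relationships draws; the package may compute the relationship differently, and the `threshold` value here is arbitrary.

```r
# Hypothetical titles, made up for the illustration
titles <- c(
  "Well Test Analysis in Horizontal Wells",
  "Pressure Transient Analysis of a Horizontal Well",
  "Gas Field Production Forecast",
  "Transient Pressure Behavior in Fractured Reservoirs"
)

# One vector of lowercase words per title
tokens <- strsplit(tolower(titles), "[^a-z]+")
terms  <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per title, 1 if the term is present
dtm <- t(sapply(tokens, function(tok) as.integer(terms %in% tok)))
colnames(dtm) <- terms

# Correlation between every pair of terms
term_cor <- cor(dtm)

# Keep each pair once (upper triangle) when it reaches the threshold
threshold <- 0.9
strong <- which(term_cor >= threshold & upper.tri(term_cor), arr.ind = TRUE)
pairs <- data.frame(term1 = terms[strong[, "row"]],
                    term2 = terms[strong[, "col"]],
                    cor   = term_cor[strong],
                    stringsAsFactors = FALSE)
pairs
```

In this toy data, pressure and transient always appear in the same titles, so their correlation is 1 and they would be joined by the thickest line.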
For instance, if you are looking for papers with a stronger relationship between well test and permeability, it would be wise to add that term to the search.
library(petro.One)

my_url <- make_search_url(query = "well test permeability",
                          dc_type = "conference-paper",
                          how = "all")
get_papers_count(my_url)
#>  190
df <- read_multidoc(my_url)
term_frequency(df)
#> # A tibble: 697 x 2
#>    word          freq
#>    <chr>        <int>
#>  1 reservoir       86
#>  2 permeability    42
#>  3 well            38
#>  4 field           32
#>  5 carbonate       31
#>  6 fractured       27
#>  7 integrated      21
#>  8 modeling        21
#>  9 simulation      21
#> 10 reservoirs      20
#> # ... with 687 more rows
plot_bars(df, min.freq = 10)
In this example, we can see the effect of refining our search by including the term permeability.
plot_relationships(df, min.freq = 15, threshold = 0.05)
This has the advantage of improving the search and narrowing it down to the papers we are most interested in.
The summary functions allow us to group the papers by a preferred group:
This will give you a summary of the paper counts, not the papers themselves.
Here is an example of summaries. In this case, we want papers that contain the exact words "well test".
library(petro.One)

my_url <- make_search_url(query = "well test",
                          how = "all")
|American Petroleum Institute|42|
|American Rock Mechanics Association|64|
|Carbon Management Technology Conference|1|
|International Petroleum Technology Conference|364|
|International Society for Rock Mechanics|34|
|International Society for Rock Mechanics and Rock Engineering|5|
|International Society of Offshore and Polar Engineers|15|
|National Energy Technology Laboratory|8|

|10th North American Conference on Multiphase Technology|1|
|10th World Petroleum Congress|1|
|11th ISRM Congress|1|
|11th World Petroleum Congress|4|
|12th ISRM Congress|1|
|12th International Conference on Multiphase Production Technology|2|
|12th World Petroleum Congress|3|
|13th ISRM International Congress of Rock Mechanics|1|
|13th International Conference on Multiphase Production Technology|1|
|13th World Petroleum Congress|3|
In this other example, we want papers that contain the word "well" or the word "test".
library(petro.One)

my_url <- make_search_url(query = "well test",
                          how = "any")
by_doctype <- papers_by_type(my_url)
In this example we get the total number of papers by document type.
sum(by_doctype$value)
#>  105032
Or use the base R function summary to get quick statistics on the papers:
# r-base function summary
summary(by_doctype)
#>      name               value
#>  Length:8           Min.   :    9.00
#>  Class :character   1st Qu.:   50.25
#>  Mode  :character   Median :  180.00
#>                     Mean   :13129.00
#>                     3rd Qu.: 4664.00
#>                     Max.   :87790.00
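A summary of counts like by_doctype can also be derived from a dataframe of papers that we already hold locally. The following is a minimal base R sketch, not the code behind papers_by_type: the `papers` dataframe is a hypothetical four-row sample built from titles shown earlier, and the column names name and value are chosen to mirror the summary output above.

```r
# Hypothetical sample: four titles from the earlier searches and their types
papers <- data.frame(
  title = c("Deconvolution Using Neural Networks",
            "Neural Networks And AVO",
            "Drill-Bit Diagnosis With Neural Networks",
            "Artificial Neural Networks Identify Restimulation Candidates"),
  type  = c("conference-paper", "conference-paper",
            "journal-paper", "journal-paper"),
  stringsAsFactors = FALSE
)

# Count the papers in each document type
counts <- as.data.frame(table(papers$type), stringsAsFactors = FALSE)
names(counts) <- c("name", "value")
counts

# Total number of papers across all types
sum(counts$value)
```

The same pattern works for any other grouping column of the dataframe, such as year or source.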