Section 3 Data Processing

3.1 Tidy Data

I followed Tidy Data principles, which are closely tied to those of relational databases and Codd’s relational algebra. I gave the principles set out in [2] and [3] priority over all matters related to performant code and coding style.


Figure 3.1: Data pipeline

3.2 Filling gaps in our data

I use mirrored flows to cover gaps in the raw data. Some countries report zero exports for some products, but I can inspect what their trade partners reported. If country A reported zero exports (imports) of product B to (from) country C, I look up the imports (exports) of product B from (to) country A that country C reported.

This approach has one major drawback: exports are reported FOB (free on board) while imports are reported CIF (cost, insurance and freight), so the two figures are not directly comparable. There are different approaches to this difficulty. In particular, [4], [5] and [6] discuss it in detail and propose that an 8% CIF/FOB margin (a CIF/FOB ratio of 1.08) is suitable to discount costs and compare imports and exports.

Let \(x_{c,c',p}\) represent the exports of country \(c\) to country \(c'\) in product \(p\), and \(m_{c',c,p}\) the imports of country \(c'\) from country \(c\) in product \(p\). Under this notation, I define corrected flows as:

\[\hat{x}_{c,c',p} = \max\left\{x_{c,c',p}, \frac{m_{c',c,p}}{1.08}\right\}\]
\[\hat{m}_{c,c',p} = \max\left\{x_{c',c,p}, \frac{m_{c,c',p}}{1.08}\right\}\]

After symmetrization, all observations are rounded to zero decimal places.
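
As an illustration only, the correction for exports can be sketched in a few lines of Python. This is not the project's R code: the function name `symmetrize_exports`, the dictionary layout and the sample figures are assumptions made for the example. Only the 1.08 ratio, the max rule and the final rounding come from the text.

```python
CIF_FOB_RATIO = 1.08  # 8% CIF/FOB margin taken from the text


def symmetrize_exports(exports, imports, ratio=CIF_FOB_RATIO):
    """Return corrected export flows.

    exports[(c, c2, p)] is what country c reported exporting to c2 in product p (FOB).
    imports[(c2, c, p)] is what country c2 reported importing from c in product p (CIF).
    Both layouts are hypothetical; missing records are treated as zero.
    """
    # Every flow seen from either side, expressed as (exporter, importer, product).
    keys = set(exports) | {(c, c2, p) for (c2, c, p) in imports}
    corrected = {}
    for (c, c2, p) in keys:
        x = exports.get((c, c2, p), 0.0)
        m = imports.get((c2, c, p), 0.0)
        # Keep the larger of the reported export and the discounted mirrored import,
        # then round to zero decimal places.
        corrected[(c, c2, p)] = round(max(x, m / ratio))
    return corrected


# Made-up figures: A reports zero exports of product B to C, but C reports imports of 108.
exports = {("A", "C", "B"): 0.0, ("A", "D", "B"): 500.0}
imports = {("C", "A", "B"): 108.0, ("D", "A", "B"): 432.0}
flows = symmetrize_exports(exports, imports)
```

In this made-up case the gap in A's report is filled with C's mirrored figure after discounting (108 / 1.08), while the flow from A to D keeps A's own report because it is larger than the discounted import.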

3.3 Countries not included in rankings and indicators

The curated data includes all the countries available in UN Comtrade data. However, RCA-based calculations such as ECI, PCI, Proximity and Density, explained in Chapter 4, consider only the 128 countries that, according to [7], account for 99% of world trade, 97% of the world’s total GDP and 95% of the world’s population.

I applied both of the following criteria simultaneously:

  • Countries with a population greater than or equal to 1.2 million
  • Countries whose traded value is greater than or equal to 1 billion

Figure 3.2: Schematic of the procedure used to determine the countries that were included in the Atlas
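
The two thresholds listed in this section can be expressed as a simple filter. The Python sketch below is a hypothetical illustration, not the project's code: the function names, the country codes and the figures are invented, and it covers only the two criteria named in the text, not any further step shown in the schematic.

```python
MIN_POPULATION = 1_200_000        # 1.2 million, from the text
MIN_TRADED_VALUE = 1_000_000_000  # 1 billion, from the text


def is_included(population, traded_value):
    """A country must meet both thresholds at the same time."""
    return population >= MIN_POPULATION and traded_value >= MIN_TRADED_VALUE


def filter_countries(countries):
    """countries maps a country code to a (population, traded_value) pair."""
    return sorted(
        code
        for code, (population, traded_value) in countries.items()
        if is_included(population, traded_value)
    )


# Invented codes and figures, used only to exercise the filter.
sample = {
    "AAA": (5_000_000, 20_000_000_000),  # passes both criteria
    "BBB": (800_000, 3_000_000_000),     # population too small
    "CCC": (2_000_000, 400_000_000),     # traded value too small
    "DDD": (1_200_000, 1_000_000_000),   # exactly on both thresholds, included
}
included = filter_countries(sample)
```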

3.4 GitHub repositories

3.4.1 Getting and cleaning data from UN COMTRADE

3.4.2 Scraping data in The Atlas of Economic Complexity

3.4.3 Product space layouts

3.4.4 Product and country codes

3.4.5 R packages (for reproducibility)

3.5 Software versions

At the moment I am using R 3.4.3 and RStudio Server Pro 1.1 on Ubuntu Server 16.04.

I built R from binaries in order to obtain a setup linked with multi-threaded BLAS/LAPACK libraries. This build is linked to Intel MKL 2017, but my output can be reproduced if your R setup is linked to OpenBLAS; only performance differences should be noticeable on some hardware.

3.6 Hardware information

Our server features an Intel® Xeon 2.27 GHz processor (eight cores) and 32 GB of RAM (four DDR3 modules of 8 GB each).

The functions are executed in parallel on four cores because, empirically, I detected an overhead due to data communication between cores when using more than four.

Please note that running our scripts with parallelization demands more RAM than is available on an average laptop.

3.7 Reproducibility notes

To guarantee reproducibility I provided Packrat snapshots and bundles. These pin the package versions used, which protects the project against later changes in syntax, functions or dependencies.

My R installation is isolated from apt-get to avoid any accidental updates that can alter the data pipeline and/or the output.

The projects are related to each other. To avoid keeping multiple copies of files, some projects read files from other projects. For example, the input of OEC Yearly Indicators is the output of OEC Yearly Datasets and OEC Atlas Data.

The only reproducibility flaw of this project lies in downloading the data. Obtaining raw datasets from UN COMTRADE requires an API key that can only be obtained through institutional access.

This project will not work on Windows without editing parts of the code. Multicore functionality supports multiple workers only on those operating systems that support the fork system call, and this excludes Windows.
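
The kind of edit this implies can be sketched as a guard that falls back to a single worker when fork is unavailable. The Python analogue below is only an illustration of the idea, not the project's R code: `usable_workers` is a hypothetical helper, and the default of four workers mirrors the setup described in Section 3.6.

```python
import multiprocessing as mp


def usable_workers(requested=4, start_methods=None):
    """Return `requested` workers when fork is available, otherwise 1.

    start_methods can be passed explicitly for testing; by default it is
    whatever process start methods the current platform supports.
    """
    if start_methods is None:
        start_methods = mp.get_all_start_methods()
    return requested if "fork" in start_methods else 1


# On Linux this is typically 4; on Windows, which lacks fork, it is 1.
workers = usable_workers()
```

Python itself can start workers on Windows through other mechanisms; the guard only mirrors the fork-based restriction described above.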

3.8 Coding style and performant code

I followed the Tidyverse Style Guide. As cornerstone references for performant code I followed [8] and [9].

Some matrix operations are written in Rcpp to take advantage of C++ speed. To take full advantage of the hardware and numerical libraries, I use sparse matrices as explained in [10].
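
To show why sparse storage helps, the sketch below keeps only the non-zero entries of a matrix and multiplies it by a vector. It is a pure-Python illustration under assumed names (`to_sparse`, `sparse_matvec`) and made-up numbers; it is not the RcppArmadillo implementation described in [10].

```python
def to_sparse(dense):
    """Store only non-zero entries as {(row, column): value}."""
    return {
        (i, j): value
        for i, row in enumerate(dense)
        for j, value in enumerate(row)
        if value != 0
    }


def sparse_matvec(sparse, vector, n_rows):
    """Multiply a sparse matrix by a vector, visiting non-zero entries only."""
    result = [0.0] * n_rows
    for (i, j), value in sparse.items():
        result[i] += value * vector[j]
    return result


# Made-up 3 x 3 matrix with three non-zero entries out of nine.
dense = [
    [1, 0, 0],
    [0, 0, 2],
    [0, 3, 0],
]
sparse = to_sparse(dense)
product = sparse_matvec(sparse, [1, 2, 3], n_rows=3)
```

The work done is proportional to the number of stored entries rather than to the full matrix size, which is where the savings come from when most entries are zero.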

3.9 Materials of interest

UN COMTRADE product classifications may be of interest, as they include all trade classification levels with detailed descriptions of: (i) Harmonized System revisions 1992 (H0/HS92), 1996 (H1), 2002 (H2), 2007 (H3), 2012 (H4) and 2017 (H5); (ii) Standard International Trade Classification revisions 1 (S1), 2 (S2), 3 (S3) and 4 (S4); and (iii) Broad Economic Categories (BEC).

References

[2] H. Wickham, “Tidy data,” Journal of Statistical Software, Articles, vol. 59, no. 10, pp. 1–23, 2014.

[3] H. Wickham and G. Grolemund, R for data science: Import, tidy, transform, visualize, and model data. O’Reilly Media, Inc., 2016.

[4] J. E. Anderson and E. Van Wincoop, “Trade costs,” Journal of Economic Literature, vol. 42, no. 3, pp. 691–751, 2004.

[5] D. Hummels, “Toward a geography of trade costs,” 1999.

[6] G. Gaulier and S. Zignago, “BACI: International trade database at the product-level (the 1994-2007 version),” 2010.

[7] C. Hidalgo, R. Hausmann, S. Bustos, M. Coscia, A. Simoes, and M. Yildirim, The atlas of economic complexity: Mapping paths to prosperity. MIT Press, 2014.

[8] H. Wickham, Advanced R. CRC Press, 2014.

[9] R. Peng, S. Kross, and B. Anderson, Mastering software development in R. Leanpub, 2017.

[10] B. Ni, D. Selivanov, D. Eddelbuettel, and Q. Kou, “RcppArmadillo: Sparse matrix support,” Comprehensive R Archive Network, 2018.