Why Eigen C++ with MKL doesn't use multi-threading for this large matrix multiplication?

I am doing some calculations that include QR decomposition of a large number(~40000 in each execution) of 4x4 matrix with complex double elements (chosen from a random distribution). I started with directly writing the code using Intel MKL functions. But after some research, it seems like working with Eigen will be much simpler and result in easier to maintain code. (Partly because I find it difficult to work with 2d arrays in intel MKL & care needed for memory management).

Before shifting to Eigen, I started with some performance checks. I took a code(from an earlier similar question on SO) for multiplications of a 10000x100000 matrix with another 100000x1000 matrix (large size chosen to have the effect of parallelization ). I run it on a 36 core node . When I checked the stat, Eigen without Intel MKL directive (but compiled with -O3 -fopenmp) used all the cores and completed the task within ~7 sec.

On the other hand with,

#define EIGEN_USE_MKL_ALL
#define EIGEN_VECTORIZE_SSE4_2

the code takes 28 s & uses an only a single core. Here is my compilation instruction

g++ -m64 -std=c++17 -fPIC -c -I. -I/apps/IntelParallelStudio/mkl/include -O2 -DNDEBUG -Wall -Wno-unused-variable -O3 -fopenmp -I /home/bart/work/eigen-3.4.0 -o eigen_test.o eigen_test.cc
g++ -m64 -std=c++17 -fPIC -I.  -I/apps/IntelParallelStudio/mkl/include -O2 -DNDEBUG -Wall -Wno-unused-variable -O3 -fopenmp -I /home/bart/work/eigen-3.4.0 eigen_test.o -o eigen_test  -L/apps/IntelParallelStudio/linux/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_rt -lmkl_core -liomp5 -lpthread

The code is here,

//#define EIGEN_USE_MKL_ALL // Determine if use MKL
//#define EIGEN_VECTORIZE_SSE4_2

#include <iostream>
#include <Eigen/Dense>
using namespace Eigen;
int main()
{

  int n_a_rows = 10000;
  int n_a_cols = 10000;
  int n_b_rows = n_a_cols;
  int n_b_cols = 1000;

  MatrixXi a(n_a_rows, n_a_cols);

  for (int i = 0; i < n_a_rows; ++ i)
      for (int j = 0; j < n_a_cols; ++ j)
        a (i, j) = n_a_cols * i + j;

  MatrixXi b (n_b_rows, n_b_cols);
  for (int i = 0; i < n_b_rows; ++ i)
      for (int j = 0; j < n_b_cols; ++ j)
        b (i, j) = n_b_cols * i + j;

  MatrixXi d (n_a_rows, n_b_cols);

  clock_t begin = clock();

  d = a * b;

  clock_t end = clock();
  double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
  std::cout << "Time taken : " << elapsed_secs << std::endl;

}

In the previous question related to this topic, the difference in speed was found to be turbo boost (& difference was not such huge). I know for small matrices Eigen may work better than MKL. But I can't understand why Eigen+MKL refuses to use multiple cores even when I pass -liomp5 during compilation.

Thank you in advance . (CentOS 7 with GCC 7.4.0 eigen 3.4.0)



from Recent Questions - Stack Overflow https://ift.tt/340T2FL
https://ift.tt/eA8V8J

Comments

Popular posts from this blog

Today Walkin 14th-Sept

Spring Elasticsearch Operations

Hibernate Search - Elasticsearch with JSON manipulation