International Journal of Electrical and Computer Engineering (IJECE) 
Vol. 12, No. 4, August 2022, pp. 4090~4098 
ISSN: 2088-8708, DOI: 10.11591/ijece.v12i4.pp4090-4098

Heterogeneous computing with graphical processing unit:
improvised back-propagation algorithm for water level prediction

Neeru Singh, Supriya Priyabadini Panda 

Department of Computer Science, Manav Rachna International Institute of Research and Studies, Faridabad, India 

Article Info

Article history:

Received Mar 17, 2021
Revised Mar 24, 2021
Accepted Apr 14, 2022

Keywords:

Back-propagation network
Graphical processing unit
Ground water level
Heterogeneous computing
Unified memory

Abstract

A growing body of research uses machine learning models to predict the
behavior of real-world problems. For water level prediction, erratic
behavior and an inadequate prerequisite dataset lead many fundamental
models to show flat or low accuracy. In this paper, a powerful scaling
strategy is proposed for an improvised back-propagation algorithm that uses
parallel computing on a graphical processing unit (GPU) to predict the
groundwater level for the Faridabad region, Haryana, India. The paper aims
to propose a new streamlined form of the back-propagation algorithm for
heterogeneous computing and to examine the combination of an artificial
neural network (ANN) with a GPU for predicting the groundwater level.
Twenty years of data, from 2001 to 2020, are considered, with three input
parameters, namely temperature, rainfall, and water level, for predicting
the groundwater level using a parallelized back-propagation algorithm on
compute unified device architecture (CUDA). The results show the
back-propagation algorithm to be well suited to reinforcing learning and
performance, providing more accurate and faster water level predictions on
GPUs than sequential execution on central processing units (CPUs) alone.

This is an open access article under the CC BY-SA license. 

Corresponding Author: 

Neeru Singh 

Department of Computer Science, Manav Rachna International Institute of Research and Studies 
Faridabad, India 

Email: neeruksingh123 


Water is an essential resource for the survival of life on the planet. Rising demand for water, driven by
population growth, wasteful usage, and the expansion of new commercial industry, is steadily depleting
water levels. To prevent water scarcity, it is crucial for hydrological researchers to measure the quantity
of water available and to act promptly against the forthcoming danger [1]. With the advances in artificial
neural networks (ANNs), the ANN has become a powerful machine learning approach for modeling
water-related activity [2]. Prediction with high precision on an arbitrarily large dataset tends to fail on a
single-core processor, i.e., a central processing unit (CPU) alone; to handle big datasets efficiently,
substantial additional hardware must team up with the CPU. The graphical processing unit (GPU)
comprises thousands of cores, each acting as a computation unit, which enhances the use of parallel
structure and offers a very high level of thread parallelism [3]-[5]. The present computing structure of
CPUs and GPUs does not by itself deliver an adequate performance improvement for heterogeneous
computing. To overcome this issue, a joint approach combines multi-core CPUs with GPUs [6], [7]. Given
the demand for an accelerated, computationally intensive environment, an algorithm is required that
decreases execution time and improves performance [8]. Rather than allocating memory separately on the
host and the device and moving data between them, unified memory allocation provides a single pointer
that both the CPU and the GPU can use [9]. Recent advances in unified memory have added many
features, such as page fault handling for GPUs, transfer of data on request, extra memory allotment for
GPUs, and counters for data access [10]. Previously, two distinct strategies, AutoSwap and SmartPool,
have been applied to minimize GPU memory consumption without any human intervention [11]. In
previous work, different standard algorithms were tested against a parallel version of an ANN on compute
unified device architecture (CUDA), and the results favored the parallel implementation on GPUs [12].
Matrix multiplication is the most time-consuming task when training on a large dataset. To minimize
computing time and accelerate processing, a parallelized matrix multiplication algorithm has been
used [13].

Substantial work has been undertaken to exploit GPUs, rather than CPUs, for computationally
intensive functions. As GPUs are a powerful means of solving complex problems, hardware acceleration
for ANNs is needed to improve training performance. The multicore structure of the GPU helps in
attaining an optimized neural network design with increased throughput [14]. An effective GPU-based
computation has been used to optimize the join-order operation and so decrease the execution time of
complicated queries [15]. Balancing the work allocated to the CPU and the GPU improves the efficiency
of the static system [16]. Although a massive amount of work has been done in the past to improve
matrix multiplication speed, the research community is focusing on implementing new hardware and
pushing past the limit [17]. Training of a deep recurrent network (DRN) has been evaluated with
half-precision floating-point arithmetic on CUDA [18]. A parallelized version of the back-propagation
neural network (BPN) algorithm has been implemented on CUDA to predict the fluctuation rate of the
foreign exchange market on a GPU, and compared with the CPU for overall performance improvement
[19]. This work proposes a new parallel BPN algorithm to predict the groundwater level for the Faridabad
zone from 20 years of data on a GPU using the CUDA framework.


A multilayer back-propagation network works in two phases: forward and backward. In the forward
phase, inputs are propagated from the input layer through the network and the resulting output vector is
produced. This actual result is then compared with the target result; if the two differ, an error is
generated. In the backward phase, the error generated in the forward phase is used to update the weights
until the two outputs match. Machine learning approaches assist a variety of engineering fields [20]. In a
sequential back-propagation network, weight adaptation was built into the framework based on a
spontaneous deviation of the error [21]. BPN algorithms have been applied to many prediction problems
and have become a successful tool for engineers [22]. Traditional or sequential BPN algorithms can
improve the convergence rate for better training [23].

In a parallelized environment, matrix multiplication is executed on the GPU to accelerate it. A
function called a kernel defines the code that runs on the GPU. A kernel is executed by one or more
threads; that is, launching a kernel splits its work across different GPU threads [24]. Each thread in the
kernel has its own thread ID (threadIdx in CUDA), which also determines the data it processes. Each
kernel launch consists of one or more blocks, and each block has one or more threads. Before the
backward pass, the delta function kernel is launched so that it can be used to update the weights and
biases concurrently in the multithreaded environment of the GPU. Figure 1 shows the parallelism in the
GPU grid for artificial neural networks, representing the number of blocks in one grid and the number
of threads in one block [25].
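
The block and thread hierarchy described above can be illustrated with a short CPU-side emulation. The sketch below is written in Python for readability and is not the authors' CUDA code: `global_thread_id`, `launch`, and `scale_kernel` are hypothetical names, and the emulated threads run one after another rather than in parallel. It relies on the usual CUDA indexing expression blockIdx.x * blockDim.x + threadIdx.x and on the configuration of 100 blocks of 16 threads mentioned in the algorithm listing.

```python
def global_thread_id(block_idx, block_dim, thread_idx):
    """Mirror of the usual CUDA expression blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx


def launch(grid_dim, block_dim, kernel, *args):
    """Sequential stand-in for a kernel launch: run the body once per thread."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(global_thread_id(block_idx, block_dim, thread_idx), *args)


def scale_kernel(tid, data, factor, out):
    """Each emulated thread processes the single element selected by its ID."""
    if tid < len(data):  # guard: the grid may hold more threads than elements
        out[tid] = data[tid] * factor


data = [float(i) for i in range(1000)]
out = [0.0] * len(data)
# 100 blocks x 16 threads = 1600 emulated threads for 1000 elements.
launch(100, 16, scale_kernel, data, 2.0, out)
```

The bounds check inside the kernel body matters because the grid is sized in whole blocks, so some threads usually have no element to process.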

2.1. Tiling technique 

The tiling technique is used to solve the square matrix multiplication problem. In the standard
square matrix multiplication algorithm, one thread calculates one element of the result matrix, and both
square matrices are stored in global memory. In the tiling technique, all the threads in a block work
together to copy the two tiles to be multiplied from global memory to shared memory. The matrices are
broken down into tiles, which simplifies complex matrix multiplication and improves the concurrency
rate [26]. Figure 2 portrays an example of tile multiplication.

Here $A_{i,k}$ and $B_{k,j}$ are the two given matrices, $C_{i,j}$ is the product matrix, and w is the
width of one tile. As every cell in matrix C is independent of the others, the cell values can be calculated
in parallel. While multiplying the two tile matrices $A_{i,k}$ and $B_{k,j}$, a __syncthreads() call is
required to synchronize the concurrently executing threads of a block. As the overall result relies on tiles
that the threads of a block load cooperatively into shared memory, all threads of the block must reach this
barrier before the tile is used. A systematic approach has been applied to increase the tile size for matrix
multiplication in different kernels, i.e., sparse matrix-dense matrix multiplication (SpMM) and sampled
dense-dense matrix multiplication (SDDMM) [27].

(Figure 1 diagram: the input, hidden, and output layers of the network mapped onto a GPU grid of blocks 1 to m, each holding threads 1 to n.)

$C_{i,j} = \sum_{k} A_{i,k} \cdot B_{k,j}$

Figure 2. Tile multiplication [28]
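
To make the tiling idea concrete, the following is a minimal CPU emulation in Python, offered as a sketch rather than as the kernel used in this work. The function names `tiled_matmul` and `naive_matmul` are hypothetical, the small local lists only stand in for GPU shared memory, and the loops run sequentially where a GPU would assign one thread block per output tile.

```python
def tiled_matmul(a, b, tile):
    """Multiply square matrices a and b tile by tile (sequential emulation)."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for bi in range(0, n, tile):          # one "thread block" per output tile
        for bj in range(0, n, tile):
            for bk in range(0, n, tile):  # walk the tiles along the shared dimension
                # Stage both tiles in local buffers, standing in for shared memory.
                a_tile = [row[bk:bk + tile] for row in a[bi:bi + tile]]
                b_tile = [row[bj:bj + tile] for row in b[bk:bk + tile]]
                # On a GPU, a __syncthreads() barrier would sit here so that the
                # whole tile is loaded before any thread of the block reads it.
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        acc = 0.0
                        for k in range(len(b_tile)):
                            acc += a_row[k] * b_tile[k][j]
                        c[bi + i][bj + j] += acc
    return c


def naive_matmul(a, b):
    """Reference product: one full dot product per output element."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
b = [[9.0, 8.0, 7.0], [6.0, 5.0, 4.0], [3.0, 2.0, 1.0]]
c = tiled_matmul(a, b, tile=2)
```

Each output element accumulates one partial sum per tile along the shared dimension, so every value staged in the local buffers is reused by a whole tile of outputs instead of being re-read from global memory.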

2.2. Unified memory prefetching 

Another technique used to overcome the overhead of transferring data from host to device is
unified memory prefetching, in which the data is fetched before the kernel is launched.
cudaMemPrefetchAsync() is the function used to prefetch data held in unified memory. Evaluating
unified memory efficiency requires a representative set of parallel applications [29]. The standard
allocation function cudaMalloc() allocates memory on the GPU and returns a pointer to the start of the
allocated GPU memory. In unified memory allocation, the function cudaMallocManaged() is used
instead; it returns a pointer that can be accessed by both the host and the device.

2.3. Coalescing technique 

When computing in parallel, different threads of the same block access dynamic random access
memory (DRAM) at the same time; the coalescing technique combines these accesses to achieve the
highest memory bandwidth [30]. In the coalescing technique, the elements of a matrix are accessed
either row-wise or column-wise, i.e., row after row or column after column. The column-wise method is
the format best suited to the GPU, giving a maximum usage ratio of 100%: when a column is examined,
all the values accessed match the coalesced memory access pattern. Figure 3 illustrates the coalescing
technique. The address space is divided into small burst segments. When a load instruction is executed
by all threads of a warp and all the accessed locations lie in the same burst segment, the memory access
is coalesced and requires only one DRAM request, as shown in Figure 3. In an uncoalesced access, the
locations accessed by the threads lie in different burst segments. In this work, the coalescing technique
is used in addition to the tiling technique on shared memory.

(Figure 3 panels: coalesced first load and coalesced second load, each performed by threads T1 to T4.)

Figure 3. Coalescing technique [31] 
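
The effect of burst segments can be sketched by counting how many distinct segments one warp touches. The Python snippet below is an illustrative model only: the segment size of 32 elements (128 bytes of floats) and the row length of 1024 are assumptions made for the example, and `segments_touched` is a hypothetical helper rather than a CUDA facility.

```python
def segments_touched(addresses, segment_size):
    """Number of distinct burst segments covered by one warp's accesses."""
    return len({addr // segment_size for addr in addresses})


WARP = 32      # threads per warp on NVIDIA GPUs
SEGMENT = 32   # assumed burst size in elements (32 floats = 128 bytes)
WIDTH = 1024   # row length of a hypothetical row-major matrix

# Adjacent threads read adjacent elements: one segment serves the whole warp.
contiguous = [t for t in range(WARP)]
# Adjacent threads read elements one full row apart: each access needs its own segment.
strided = [t * WIDTH for t in range(WARP)]

print(segments_touched(contiguous, SEGMENT))
print(segments_touched(strided, SEGMENT))
```

Under these assumptions the contiguous pattern needs a single segment while the strided pattern needs one segment per thread, which is the difference between a coalesced and an uncoalesced load.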


Figure 4 shows the flow chart of the proposed parallelized back-propagation algorithm. Because the
BPN algorithm is sequential in nature, the whole algorithm needs to be parallelized. A parallelized BPN
algorithm was implemented in this work to produce the groundwater level prediction. The input
variables are the substantial parameters that influence the predicted output: temperature, rainfall, and
groundwater level were used for the input layer. Generally, network training uses one hidden layer. The
depth of groundwater was taken as the output, and all the parameters were normalized to the range
0.1-0.9. The sigmoid activation function was used because its output ranges between 0 and 1, which is
particularly helpful for prediction.
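
The scaling step can be sketched as follows. The paper does not state which normalization formula was used, so min-max scaling into the range 0.1-0.9 is an assumption here, and the function names and sample depths are hypothetical. Keeping the values away from 0 and 1 means the targets stay inside the range that a sigmoid output can actually reach.

```python
def scale_to_range(values, low=0.1, high=0.9):
    """Min-max scale a list of readings into [low, high]."""
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # Constant series: place every value at the middle of the range.
        return [(low + high) / 2.0 for _ in values]
    return [low + (high - low) * (v - v_min) / span for v in values]


def unscale(value, v_min, v_max, low=0.1, high=0.9):
    """Map a scaled network output back to the original units."""
    return v_min + (value - low) * (v_max - v_min) / (high - low)


levels = [12.0, 15.5, 19.0, 26.0]  # hypothetical groundwater depths in meters
scaled = scale_to_range(levels)
restored = [unscale(s, min(levels), max(levels)) for s in scaled]
```

The inverse mapping is needed at prediction time, since the network output is produced on the scaled range and has to be converted back to meters.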

(Figure 4 flow chart boxes, in order: Initialize Weights and Biases; Unified Memory Pre-fetching: Forward Pass; Forward Pass: Call Matrix Multiplication Kernel; Iteration Completed (Yes/No); Unified Memory Pre-fetching: Backward Pass; Backward Pass: Call Matrix Multiplication Kernel; Update Weights and Biases.)

Figure 4. Flow chart for parallelized BPN algorithm 


Xi - Input matrix
Wi,j - Connection weights between layers
Tj - Target output (future groundwater level)
Oj - Actual output
Ej - Calculated error
r - Learning rate
θj - Threshold
Maximum number of epochs (iterations) - 100

Algorithm: proposed algorithm for parallel backward propagation on GPU

Initialize all weights and biases, typically between 0 and 1

Step 1:  for i = 1 to number of iterations do {          // repeat for every iteration
           for j = 1 to number of patterns do {          // for every pattern in the training set
             for each input-layer node j { Oj = Netj; }
Step 2:      for each hidden/output-layer node j {
               cudaMallocManaged(&X, N*sizeof(float));
               cudaMallocManaged(&W, N*sizeof(float));
Step 3:        initialize data on the CPU for the input pattern and weights using the function
               cudaMemAdvise(X, count, advice, CPUdeviceId);
               cudaMemAdvise(W, count, advice, CPUdeviceId);
Step 4:        unified memory prefetching for the forward pass from host to GPU using the functions
               cudaMemPrefetchAsync(X, N*sizeof(float), device, NULL);
               cudaMemPrefetchAsync(W, N*sizeof(float), device, NULL);
Step 5:        define the grid and blocks before calling a kernel
               NetSumj = MatrixMultKernel<<<blocksPerGrid, threadsPerBlock>>>(Oj, Xi);
               // 16 threads per block and 100 blocks per grid were used
Step 6:        calculate the weighted sum of the inputs to the node by launching MatrixMultKernel()
               to multiply the two matrices using the tiling technique with coalesced shared memory access;
Step 7:        add the threshold to the sum and calculate the activation for the node
               Netj = NetSumj + θj;  Oj = 1 / (1 + e^(-Netj)); }
Step 8:      propagate the errors backward through the network
             for every node j in the output layer, calculate the error for the output layer
               Ej = Oj(1 - Oj)(Tj - Oj);
Step 9:      prefetch memory from GPU to host by using the function
               cudaMemPrefetchAsync(X, N*sizeof(float), CPUdeviceId, NULL);
               cudaMemPrefetchAsync(W, N*sizeof(float), CPUdeviceId, NULL);
Step 10:     save results on the GPU by using the function
               cudaMemAdvise(E, count, advice, GPUdeviceId);
Step 11:     repeat step 2 to step 7 for the hidden layer
Step 12:     update weights and biases
             for each weight Wi,j and bias θj {
               ΔWi,j = r·Ej·Xi;  Wi,j = Wi,j + ΔWi,j;
               Δθj = r·Ej;       θj = θj + Δθj; } }
Step 13:   calculate the global error E = 1/2 Σk (Tk - Ok)^2
Step 14:   prefetch memory from GPU to host and save results back on the GPU
             cudaMemPrefetchAsync(E, N*sizeof(float), CPUdeviceId, NULL);
             cudaMemPrefetchAsync(W, N*sizeof(float), CPUdeviceId, NULL);
             cudaMemAdvise(E, count, advice, GPUdeviceId);
Step 15: } continue while (iteration count < specified maximum) AND (E > specified tolerance)
End of Algorithm
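
The numerical core of the listing (forward pass, error terms, weight and bias updates, global error, and the stopping rule) can be sketched sequentially on the CPU. The Python code below is not the authors' GPU implementation: it omits every CUDA call, uses a hypothetical network of three inputs, two hidden nodes, and one output, and assumes the standard back-propagation rule for the hidden-layer error, which the listing only refers to as repeating the steps for the hidden layer. The data, initial weights, and tolerance are made up for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_h, b_h, w_o, b_o):
    """Return the hidden activations and the single output activation."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(w_h, b_h)]
    out = sigmoid(sum(w * h for w, h in zip(w_o, hidden)) + b_o)
    return hidden, out


def global_error(data, w_h, b_h, w_o, b_o):
    """E = 1/2 * sum over patterns of (T - O)^2."""
    return 0.5 * sum((t - forward(x, w_h, b_h, w_o, b_o)[1]) ** 2 for x, t in data)


def train(data, w_h, b_h, w_o, b_o, rate=0.5, max_epochs=100, tol=1e-4):
    """Per-pattern updates until the epoch limit or the error tolerance is reached."""
    epochs = 0
    while epochs < max_epochs and global_error(data, w_h, b_h, w_o, b_o) > tol:
        for x, t in data:
            hidden, out = forward(x, w_h, b_h, w_o, b_o)
            # Output-layer error term: O(1 - O)(T - O).
            e_out = out * (1.0 - out) * (t - out)
            # Hidden-layer error terms, propagated back through the output weights.
            e_hid = [h * (1.0 - h) * e_out * w for h, w in zip(hidden, w_o)]
            # Update the output weights and bias.
            for j, h in enumerate(hidden):
                w_o[j] += rate * e_out * h
            b_o += rate * e_out
            # Update the hidden weights and biases.
            for j, e in enumerate(e_hid):
                for i, xi in enumerate(x):
                    w_h[j][i] += rate * e * xi
                b_h[j] += rate * e
        epochs += 1
    return w_h, b_h, w_o, b_o, epochs


# Hypothetical scaled patterns: (temperature, rainfall, water level) -> target level.
data = [([0.2, 0.3, 0.4], 0.35), ([0.5, 0.1, 0.6], 0.55),
        ([0.8, 0.7, 0.3], 0.30), ([0.4, 0.9, 0.7], 0.65)]
w_h = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
b_h = [0.1, 0.2]
w_o = [0.3, 0.7]
b_o = 0.5

initial = global_error(data, w_h, b_h, w_o, b_o)
w_h, b_h, w_o, b_o, epochs = train(data, w_h, b_h, w_o, b_o)
final = global_error(data, w_h, b_h, w_o, b_o)
print(initial > final, epochs)
```

In the GPU version, the weighted sums computed inside `forward` are the part that would be replaced by the matrix multiplication kernel.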


The parallelized back-propagation algorithm was implemented on CUDA version 10.1 using Google
Colab. The dataset was taken from [32] and comprises 120 rows covering 2001-2020, i.e., six readings
per year with one month skipped between consecutive readings. Of these, 90 rows were used for
training and 30 rows for testing. The prediction was made for the next seven readings. Google Colab is
a data science research tool from Google. It is a free hosted service that offers Jupyter Notebook for
assessment. Users can access a variety of machine learning libraries as well as hardware accelerators
[33]. In this way, Google is lowering the barriers to entry into deep learning. Many researchers who do
not have access to a large quantity of GPU resources can benefit from this tool. It allows GPU access
for 12 hours at a time.

To use a GPU-enabled notebook backend, perform the following steps: open Google Colab, click
Runtime, choose Change runtime type, and set Hardware accelerator to GPU. An NVIDIA Tesla T4 with
2560 CUDA cores and CUDA version 11.2 was used to obtain the results. The NVIDIA system
management interface is depicted in Figure 5.


This section presents the outcomes and interprets the resulting graphs for training execution time,
accuracy, error, model loss, and prediction on the GPU. The GPU results are clearly distinguishable
from the CPU results. Figure 6 shows the plot for the dataset of 120 readings. The X-axis represents the
observed months and the Y-axis the groundwater level in meters. The blue line represents the complete
dataset of 120 readings, the orange line shows the model output on the first 90 (training) readings, and
the green line shows the model predictions for the last 30 (test) readings. Figures 7(a) and 7(b) show the
execution time and mean squared error (MSE) against the number of epochs for both CPU and GPU.
The parallelized algorithm on the GPU performs better, with a lower error rate and execution time.

(Figure 5 screenshot of nvidia-smi output: NVIDIA-SMI 465.27, CUDA Version 11.2; GPU 0: Tesla T4, 55C, 29W / 70W, 104MiB / 15109MiB memory in use, 0% GPU utilization.)

Figure 5. NVIDIA system management interface

(Figure 6 plot, titled "Training and Testing Plot": groundwater level in meters against observed months, 0 to 120.)

Figure 6. Training and prediction

(Figure 7 plots: (a) execution time against epochs, 1 to 100, for CPU and GPU; (b) mean squared error against epochs, 1 to 100, for CPU and GPU.)

Figure 7. Comparing results for CPU vs. GPU (a) execution time comparison and (b) mean squared error


Table 1 lists the errors calculated while predicting the groundwater level, which are used to
evaluate the performance of the learning algorithm. Mean absolute error (MAE) and mean squared error
(MSE) are used to check the quality of the regression. Root mean square error (RMSE) is the square root
of the MSE and represents the standard deviation of the prediction errors over the dataset records.
RMSE is a standard metric for evaluating model performance in meteorology and atmospheric sciences,
while MAE is also well suited to assessing different models [34].

Table 1. Computational error
Error type                  Value
Mean absolute error         0.0696268
Mean squared error          0.0051229
Root mean squared error     0.0715743
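
The three error measures in Table 1 can be computed as in the short Python sketch below. The `actual` and `predicted` lists are hypothetical values chosen for illustration, not the data behind Table 1; only the last lines use the reported figures, to check that the reported RMSE is the square root of the reported MSE.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(actual, predicted))


actual = [0.50, 0.62, 0.71, 0.80]     # hypothetical scaled groundwater levels
predicted = [0.45, 0.70, 0.65, 0.86]  # hypothetical model outputs

# Cross-check of the values reported in Table 1.
table_mse, table_rmse = 0.0051229, 0.0715743
consistent = abs(math.sqrt(table_mse) - table_rmse) < 1e-6
```

Because squaring weights large deviations more heavily, RMSE is never smaller than MAE on the same data, which also holds for the values in Table 1.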

Figure 8 shows the execution time taken by the CPU and the GPU to predict the groundwater level
for the twenty-year dataset. It is clear from the figure that the GPU running the parallelized BPN
algorithm takes less time than the CPU alone for the same dataset. Table 2 compares the CPU and the
GPU in terms of total execution time, average time per step, and memory used.

- 0s 7ms/step - loss: 1.8031e-04 - val_loss:
- 0s 7ms/step - loss: 1.5398e-04 - val_loss:
- 0s 7ms/step - loss: 2.1770e-04 - val_loss:
with CPU time taken in seconds: 10.076101181000013
with GPU time taken in seconds: 0.7781454485542905

Figure 8. CPU vs. GPU execution time 

Table 2. CPU vs. GPU
Metric                   CPU         GPU
Total time (s)           10.07610    0.77814
Average seconds/step     0.014       0.006
Memory used (GB)         0.99        1.54


Based on the results of this research, it can be concluded that the proposed parallelized
back-propagation method on the GPU predicts groundwater levels in the Faridabad region faster than
the CPU alone. The CPU execution time for training and testing the network is approximately 10.08
seconds, whereas the GPU execution time is approximately 0.78 seconds, a reduction of approximately
92.3%. It can be inferred that the parallelized implementation on the GPU delivers improved
performance compared with the CPU, with a mean absolute error of 0.0696268.
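
The relationship between the two reported execution times can be checked with a few lines of arithmetic. The Python sketch below uses the total times from Table 2; the variable names are illustrative, and the two derived quantities (speedup and relative time reduction) are standard definitions rather than figures taken from the paper.

```python
cpu_seconds = 10.07610  # total CPU time from Table 2
gpu_seconds = 0.77814   # total GPU time from Table 2

# Speedup: how many times faster the GPU run is than the CPU run.
speedup = cpu_seconds / gpu_seconds
# Relative reduction in execution time, as a percentage of the CPU time.
reduction_percent = (cpu_seconds - gpu_seconds) / cpu_seconds * 100.0

print(f"speedup: {speedup:.2f}x")
print(f"time reduction: {reduction_percent:.1f}%")
```

Computed this way, the GPU run is close to 13 times faster than the CPU run, which corresponds to a time reduction of a little over 92%.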


Future work includes extending the parallelized back-propagation algorithm to other real-world
problems in order to accelerate ANN research on different hardware. To take advantage of faster GPUs,
the power of various algorithms must also be increased through parallelization.


[1] K. A. N. Adiat, O. F. Ajayi, A. A. Akinlalu, and I. B. Tijani, “Prediction of groundwater level in basement complex terrain using artificial neural network: a case of Ijebu-Jesa, southwestern Nigeria,” Applied Water Science, vol. 10, no. 1, Nov. 2020, doi:

[2] T. Roshni, M. K. Jha, and J. Drisya, “Neural network modeling for groundwater-level forecasting in coastal aquifers,” Neural Computing and Applications, vol. 32, no. 16, pp. 12737-12754, Jan. 2020, doi: 10.1007/s00521-020-04722-z.

[3] Y. Go, M. Jamshed, Y. G. Moon, C. Hwang, and K. S. Park, “Apunet: revitalizing GPU as packet processing accelerator,” in Proceedings of the 14th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2017, 2017, pp. 83-96.

[4] Z. Zheng et al., “Gen: A GPU-accelerated elastic framework for NFV,” in ACM International Conference Proceeding Series, 2018, pp. 57-64, doi: 10.1145/3232565.3234510.

[5] I. M. Coelho, V. N. Coelho, E. J. d. S. Luz, L. S. Ochi, F. G. Guimaraes, and E. Rios, “A GPU deep learning metaheuristic based model for time series forecasting,” Applied Energy, vol. 201, pp. 412-418, Sep. 2017, doi: 10.1016/j.apenergy.2017.01.003.

[6] K. Raju and N. N. Chiplunkar, “A survey on techniques for cooperative CPU-GPU computing,” Sustainable Computing: Informatics and Systems, vol. 19, pp. 72-85, Sep. 2018, doi: 10.1016/j.suscom.2018.07.010.

[7] M. Dashti and A. Fedorova, “Analyzing memory management methods on integrated CPU-GPU systems,” in International Symposium on Memory Management, ISMM, Jun. 2017, vol. Part F1286, pp. 59-69, doi: 10.1145/3092255.3092256.

[8] D. T. V. D. Rao and K. V. Ramana, “Accelerating training of deep neural networks on GPU using CUDA,” International Journal of Intelligent Systems and Applications, vol. 11, no. 5, pp. 18-26, May 2019, doi: 10.5815/ijisa.2019.05.03.

[9] J. Jung, D. Park, Y. Do, J. Park, and J. Lee, “Overlapping host-to-device copy and computation using hidden unified memory,” in Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP, Feb. 2020, pp. 321-335, doi: 10.1145/3332466.3374531.

[10] H. Xu, M. Emani, P.-H. Lin, L. Hu, and C. Liao, “Machine learning guided optimal use of GPU unified memory,” in 2019 IEEE/ACM Workshop on Memory Centric High Performance Computing (MCHPC), Nov. 2019, pp. 64-70, doi:

[11] J. Zhang, S. H. Yeung, Y. Shu, B. He, and W. Wang, “Efficient memory management for GPU-based deep learning systems,” arXiv preprint arXiv:1903.06631, 2019.

[12] X. Sierra-Canto, F. Madera-Ramirez, and V. Uc-Cetina, “Parallel training of a back-propagation neural network using CUDA,” in 2010 Ninth International Conference on Machine Learning and Applications, Dec. 2010, pp. 307-312, doi:

[13] A. O. Jimale, F. Ridzuan, and W. M. N. Wan Zainon, “Square matrix multiplication using CUDA on GP-GU,” Procedia Computer Science, vol. 161, pp. 398-405, 2019, doi: 10.1016/j.procs.2019.11.138.

[14] S. Shi, Q. Wang, and X. Chu, “Performance modeling and evaluation of distributed deep learning frameworks on GPUs,” in 2018 IEEE 16th Intl Conf on Dependable, Autonomic and Secure Computing, 16th Intl Conf on Pervasive Intelligence and Computing, 4th Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCon/CyberSciTech), Nov. 2017, pp. 949-957, doi:

[15] A. Meister and G. Saake, GPU-accelerated dynamic programming for join-order optimization. Fakultät für Informatik, Otto-von-Guericke-Universität Magdeburg, 2020.

[16] C. Yang et al., “Adaptive optimization for petascale heterogeneous CPU/GPU computing,” in 2010 IEEE International Conference on Cluster Computing, Sep. 2010, pp. 19-28, doi: 10.1109/CLUSTER.2010.12.

[17] B. Chen, T. Medini, J. Farwell, C. Tai, and A. Shrivastava, “Slide: in defense of smart algorithms over hardware acceleration for large-scale deep learning systems,” in Proceedings of the 3rd MLSys Conference, 2020, vol. 91, no. 4, doi: 1903.03129.

[18] A. Svyatkovskiy, J. Kates-Harbeck, and W. Tang, “Training distributed deep recurrent neural networks with mixed precision on GPU clusters,” in Proceedings of the Machine Learning on HPC Environments, Nov. 2019, pp. 1-8, doi:

[19] K. Ganeshamoorthy and N. Ratnarajah, “On the performance of parallel back-propagation neural network implementations using CUDA,” in Proceedings of the 32nd International Conference on Computers and Their Applications, CATA 2017, 2017, pp. 85-

[20] Q. H. Nguyen et al., “A novel hybrid model based on a feedforward neural network and one step secant algorithm for prediction of load-bearing capacity of rectangular concrete-filled steel tube columns,” Molecules, vol. 25, no. 15, Jul. 2020, doi: 10.3390/molecules25153486.

[21] V. K. Ojha, P. Dutta, H. Saha, and S. Ghosh, “Detection of proportion of different gas components present in manhole gas mixture using backpropagation neural network,” in International Proceedings Of Computer Science and Information Technology, 2012, vol. 37, no. Icint, pp. 11-15.

[22] T.-A. Nguyen, H.-B. Ly, and B. T. Pham, “Backpropagation neural network-based machine learning model for prediction of soil friction angle,” Mathematical Problems in Engineering, vol. 2020, pp. 1-11, Dec. 2020, doi: 10.1155/2020/8845768.

[23] F. Izhari, M. Zarlis, and Sutarman, “Analysis of backpropagation neural neural network algorithm on student ability based cognitive aspects,” IOP Conference Series: Materials Science and Engineering, vol. 725, no. 1, Jan. 2020, doi: 10.1088/1757-

[24] M. Gupta et al., “Compiler techniques to reduce the synchronization overhead of GPU redundant multithreading,” in Proceedings of the 54th Annual Design Automation Conference 2017, Jun. 2017, pp. 1-6, doi: 10.1145/3061639.3062212.

[25] N. Singh and S. P. Panda, “Enhancing the proficiency of artificial neural network on prediction with GPU,” in Proceedings of the International Conference on Machine Learning, Big Data, Cloud and Parallel Computing: Trends, Prespectives and Prospects, COMITCon 2019, Feb. 2019, pp. 67-71, doi: 10.1109/COMITCon.2019.8862440.

[26] G. Bansal, C. J. Newburn, and P. Besl, “Fast matrix computations on heterogeneous streams,” in High Performance Parallelism Pearls, vol. 2, Elsevier, 2015, pp. 271-304, doi: 10.1016/B978-0-12-803819-2.00011-2.

[27] S. E. Kurt, A. Sukumaran-Rajam, F. Rastello, and P. Sadayyapan, “Efficient tiled sparse matrix multiplication through matrix signatures,” in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, Nov. 2020, pp. 1-14, doi: 10.1109/SC41405.2020.00091.

[28] T. Athil, R. Christian, and Y. B. Reddy, “CUDA memory techniques for matrix multiplication on quadro 4000,” in 2014 11th International Conference on Information Technology: New Generations, Apr. 2014, pp. 419-425, doi: 10.1109/ITNG.2014.24.

[29] M. Knap and P. Czarnul, “Performance evaluation of unified memory with prefetching and oversubscription for selected parallel CUDA applications on NVIDIA pascal and volta GPUs,” The Journal of Supercomputing, vol. 75, no. 11, pp. 7625-7645, Nov. 2019, doi: 10.1007/s11227-019-02966-8.

[30] S. Ashkiani, A. Davidson, U. Meyer, and J. D. Owens, “GPU multisplit: an extended study of a parallel algorithm,” ACM Transactions on Parallel Computing, vol. 4, no. 1, pp. 1-44, Jan. 2017, doi: 10.1145/3108139.

[31] X. Sun, L.-F. Lai, P. Chou, L.-R. Chen, and C.-C. Wu, “On GPU Implementation of the island model genetic algorithm for solving the unequal area facility layout problem,” Applied Sciences, vol. 8, no. 9, Sep. 2018, doi: 10.3390/app8091604.

[32] IWRIS, “India water resources information system,” 2022. Accessed: Sep. 06 2021 [Online]. Available:

[33] T. Carneiro, R. V. M. Da Nobrega, T. Nepomuceno, G. Bin Bian, V. H. C. De Albuquerque, and P. P. R. Filho, “Performance analysis of google colaboratory as a tool for accelerating deep learning applications,” IEEE Access, vol. 6, pp. 61677-61685, 2018, doi: 10.1109/ACCESS.2018.2874767.

[34] T. Chai and R. R. Draxler, “Root mean square error (RMSE) or mean absolute error (MAE)? - arguments against avoiding RMSE in the literature,” Geoscientific Model Development, vol. 7, no. 3, pp. 1247-1250, Jun. 2014, doi: 10.5194/gmd-7-1247-2014.


Neeru Singh is currently pursuing a Ph.D. in the Department of Computer Science at
Manav Rachna International Institute of Research and Studies (MRIIRS), Faridabad,
India. She is working as an assistant professor at Rawal Institute of Engineering and
Technology, Faridabad. She has 5 years of teaching experience. She received her M.Tech.
from Maharishi Dayanand University, Rohtak. Her research areas are artificial neural
networks and heterogeneous computing. Contact her at email: neeruksingh123 @

Supriya Priyabadini Panda is working as a professor and Head of the Department of
Computer Science and Engineering at Manav Rachna International Institute of Research
and Studies (MRIIRS), Faridabad, India. She has 35+ years of experience in the teaching
field. She guides M.Tech. and Ph.D. students in a variety of fields. She received her
Ph.D. from Ohio University, USA. Contact her at email: supriya.fet@
