# Heterogeneous computing with graphical processing unit: improvised back-propagation algorithm for water level prediction

International Journal of Electrical and Computer Engineering (IJECE), Vol. 12, No. 4, August 2022, pp. 4090-4098. ISSN: 2088-8708, DOI: 10.11591/ijece.v12i4.pp4090-4098

Neeru Singh, Supriya Priyabadini Panda
Department of Computer Science, Manav Rachna International Institute of Research and Studies, Faridabad, India

Article history: Received Mar 17, 2021; Revised Mar 24, 2021; Accepted Apr 14, 2022

ABSTRACT

A growing body of research uses machine learning models to predict the behavior of real-world problems. Water level prediction behaves erratically because the required datasets are inadequate, so basic models show flat or low accuracy. In this paper, a scaling strategy is proposed for an improvised back-propagation algorithm that uses parallel computing on a graphical processing unit (GPU) to predict the groundwater level for the Faridabad region, Haryana, India. The paper aims to propose a streamlined form of the back-propagation algorithm for heterogeneous computing and to examine the combination of an artificial neural network (ANN) with a GPU for predicting the groundwater level. A twenty-year data set (2001-2020) with three input parameters, namely temperature, rainfall, and water level, is used to predict the groundwater level with a parallelized back-propagation algorithm on compute unified device architecture (CUDA). The parallelized back-propagation algorithm on GPUs is found to give more accurate and faster water level predictions than the sequential algorithm on central processing units (CPUs) alone.

Keywords: Back-propagation network; Graphical processing unit; Ground water level; Heterogeneous computing; Unified memory

This is an open access article under the CC BY-SA license.

Corresponding Author: Neeru Singh, Department of Computer Science, Manav Rachna International Institute of Research and Studies, Faridabad, India. Email: neeruksingh123@gmail.com

Journal homepage: http://ijece.iaescore.com

1. INTRODUCTION

Water is an essential resource for the survival of life on the planet. Rising demand for water, caused by a growing population, wasteful usage, and the expansion of new commercial industry, is steadily lowering the water level. To prevent a shortage of water, it is crucial for hydrological researchers to measure the quantity of water available and to act promptly against the forthcoming danger [1]. With its recent advances, the artificial neural network (ANN) has become a powerful machine learning approach for modeling water-related activity [2]. High-precision prediction on an arbitrarily large dataset tends to fail on a single-core processor, i.e., a central processing unit (CPU); to process big data sets efficiently, substantial hardware must team up with the CPU. The graphical processing unit (GPU) comprises thousands of cores, each acting as a computation unit, which supports parallel structures and offers a very high level of thread parallelism [3]-[5]. The present computing structures of CPUs and GPUs do not on their own deliver an adequate performance improvement for heterogeneous computing. To overcome this issue, a joint approach that combines multi-core CPUs with GPUs has been used [6], [7]. Because of the demand for an accelerated, computationally intensive environment, an algorithm is required that decreases the execution time and improves performance [8].
Rather than allocating memory separately on the host and the device and shifting data between them, unified memory allocates a single pointer that can be used by both the CPU and the GPU [9]. Recent advances in unified memory have added many features, such as page fault handling for GPUs, transfer of data on request, extra memory allotment for GPUs, and counters for data access [10]. In the past, two distinct strategies, AutoSwap and SmartPool, have been applied to minimize GPU memory consumption without any human intervention [11]. In previous work, different standard algorithms were tested against a parallel version of ANN on compute unified device architecture (CUDA), and the results favored the parallel implementation on GPUs [12]. Matrix multiplication is the most time-consuming task when training on a large dataset. To minimize computing time and accelerate training, a parallelized matrix multiplication algorithm has been used [13].

Substantial work has been undertaken to exploit GPUs, rather than CPUs, for heavy computational tasks. As GPUs are a powerful means of solving complex problems, hardware acceleration is needed for ANNs to improve training performance. The multicore structure of the GPU helps in attaining an optimized neural network design with higher throughput [14]. GPU-based computation has been used to optimize join-order operations and decrease the execution time of complicated queries [15]. Balancing the work assigned to the CPU and the GPU improves the efficiency of a static system [16]. Although a massive amount of work has been done to improve the processing speed of matrix multiplication, the research community is focusing on implementing new hardware and pushing past the limit [17].
Training of deep recurrent networks (DRN) has been evaluated with half-precision floating point on CUDA [18]. A parallelized version of the back-propagation neural network (BPN) algorithm has been implemented on CUDA for GPU to predict the fluctuation rate of the foreign exchange market and compared with the CPU for overall performance improvement [19]. This research work proposes a new parallel BPN algorithm to predict the groundwater level for the Faridabad zone from 20 years of data on a GPU using the CUDA framework.

2. RESEARCH METHOD

A multilayer back-propagation network works in two phases: forward and backward. In the forward phase, the inputs are propagated from the input layer through the network to produce an output vector. This actual output is compared with the target output; if the two differ, an error is generated. In the backward phase, the error generated in the forward phase is used to update the weights until the actual output matches the target. The machine learning approach assists a variety of engineering fields [20]. In a sequential back-propagation network, weight adaptation is applied to the framework based on the deviation of the error [21]. BPN algorithms have been applied to many prediction problems and have become a successful tool for engineers [22]. Traditional or sequential BPN algorithms can improve the convergence rate for better training [23].

In a parallelized environment, matrix multiplication is executed on the GPU for acceleration. A function called a kernel defines the code that runs on the GPU. A kernel is executed by one or more threads, which means that launching a kernel splits its work across different GPU threads [24]. Each thread in the kernel has a unique ID called threadId, which also determines the data the thread processes. Each kernel contains one or more blocks, and each block has one or more threads.
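As a rough illustration of the grid, block, and thread hierarchy described above, the following Python sketch emulates on the CPU how a one-dimensional CUDA launch gives every thread one global index. The function name `global_thread_ids` is hypothetical, and the index formula `block_id * threads_per_block + thread_id` is the usual CUDA convention rather than something stated in this paper. The launch shape of 100 blocks with 16 threads each is the configuration reported with the proposed algorithm in Section 3; the element count of 1500 is an arbitrary assumption.

```python
def global_thread_ids(blocks_per_grid, threads_per_block, n_elements):
    """Emulate a 1-D CUDA launch: map every (block, thread) pair to one element index.

    Threads whose global index falls past the end of the data do no work,
    which mirrors the usual `if (idx < n)` guard inside a kernel.
    """
    assignments = []
    for block_id in range(blocks_per_grid):          # plays the role of blockIdx.x
        for thread_id in range(threads_per_block):   # plays the role of threadIdx.x
            idx = block_id * threads_per_block + thread_id
            if idx < n_elements:
                assignments.append((block_id, thread_id, idx))
    return assignments


if __name__ == "__main__":
    work = global_thread_ids(blocks_per_grid=100, threads_per_block=16, n_elements=1500)
    print(len(work))   # number of threads that received an element
    print(work[17])    # (block, thread, global index) of the 18th active thread
```

On a real GPU the two loops do not exist: all (block, thread) pairs run concurrently, and each thread computes only its own `idx`.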
Before the backward pass, the delta function kernel is launched so that its result can be used to update the weights and biases concurrently in the multithreaded environment of the GPU. Figure 1 shows the parallelism in the GPU grid for artificial neural networks, representing the number of blocks in one grid and the number of threads in one block [25].

[Figure 1 image: input, hidden, and output layers of the network mapped onto a GPU grid of blocks 1 to m, each holding threads 1 to n]

2.1. Tiling technique

The tiling technique is used to solve the square matrix multiplication problem. In the standard square matrix multiplication algorithm, one thread calculates one element of the result matrix, and both input matrices are stored in global memory. In the tiling technique, all the threads in a block work together to copy the two tiles to be multiplied from global memory to shared memory. Breaking the matrices down into tiles simplifies the complex matrix multiplication and improves concurrency [26]. Figure 2 portrays an example of tile multiplication. Here $A_{ik}$ and $B_{kj}$ are tiles of the two given matrices, $C_{ij}$ is the corresponding tile of the product matrix, and $w$ is the width of one tile; the sum runs over the $m/w$ tiles along the shared dimension:

$$C_{ij} = \sum_{k=1}^{m/w} A_{ik} B_{kj}$$

Figure 2. Tile multiplication [28]

As every cell of matrix $C$ is independent of the others, the cell values can be calculated in parallel. While the two tiles $A_{ik}$ and $B_{kj}$ are multiplied, a __syncthreads() call is required to synchronize the threads of a block, so that every thread has finished copying its part of the tiles into shared memory before any thread reads them. __syncthreads() acts as a barrier only for the threads within one block; because the overall result relies on computation done by blocks running in parallel, the results of all blocks must be complete before the product matrix is used. A systematic approach has been applied to tune the tile size for matrix multiplication in different kernels, i.e., sparse matrix-dense matrix multiplication (SpMM) and sampled dense-dense matrix multiplication (SDDMM) [27].

2.2.
Unified memory prefetching

Another technique used to reduce the overhead of transferring data from host to device is unified memory prefetching, in which the data is fetched before the kernel is launched. cudaMemPrefetchAsync() is the function used to prefetch data held in unified memory. Evaluating the efficiency of unified memory requires a representative set of applications with comparable utilization [29]. The cudaMalloc() function performs standard memory allocation on the GPU; it returns a pointer to the start of the allocated GPU memory. In unified memory allocation, cudaMallocManaged() is used instead; it returns a pointer that can be accessed by both host and device.

2.3. Coalescing technique

When computing in parallel, different threads of the same block access the dynamic random access memory (DRAM) at the same time. The coalescing technique combines these accesses to achieve the highest memory bandwidth [30]. The elements of a matrix can be accessed row-wise or column-wise, i.e., row after row or column after column. The column-wise method is reported as the best-suited format for the GPU, providing a maximum usage ratio of 100%: when a column is examined, all accessed values match the coalesced memory access pattern. The coalescing technique is shown in Figure 3. The address space is divided into small burst segments. A load instruction is executed by all threads of a warp; if all the locations accessed by the threads lie in the same burst segment, the memory access is coalesced, as it requires only one DRAM request, as shown in Figure 3. In an uncoalesced access, the locations accessed by the threads lie in different burst segments.
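The burst-segment view of coalescing described above can be sketched in plain Python. The sketch counts how many burst segments one warp touches when its threads read a row-major matrix at adjacent addresses, compared with addresses that are one full matrix row apart. The warp size of 32, the 4-byte element, the 128-byte burst segment, and the matrix width of 1024 are illustrative assumptions, not values taken from the paper, and `segments_touched` and `warp_addresses` are hypothetical helpers.

```python
def segments_touched(addresses, segment_bytes=128):
    """Return the number of distinct burst segments covered by the byte addresses."""
    return len({addr // segment_bytes for addr in addresses})


def warp_addresses(width, warp_size=32, elem_bytes=4, adjacent=True):
    """Byte addresses read by one warp from a row-major matrix with `width` columns.

    adjacent=True : thread t reads element t of the first row (neighbors are adjacent).
    adjacent=False: thread t reads the first element of row t (neighbors are `width` apart).
    """
    stride = 1 if adjacent else width
    return [t * stride * elem_bytes for t in range(warp_size)]


if __name__ == "__main__":
    width = 1024  # assumed matrix width in elements
    print(segments_touched(warp_addresses(width, adjacent=True)))    # coalesced pattern
    print(segments_touched(warp_addresses(width, adjacent=False)))   # strided pattern
```

Under these assumptions the adjacent pattern falls into a single burst segment, while the strided pattern needs a separate segment for every thread of the warp.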
In this work, in addition to the tiling technique on shared memory, the coalescing technique is also used.

[Figure 3 image: first and second coalesced loads issued by threads T1 to T4]
Figure 3. Coalescing technique [31]

3. THE PROPOSED ALGORITHM

Figure 4 shows the flow chart of the proposed parallelized back-propagation algorithm. To overcome the sequential nature of the BPN algorithm, the whole algorithm needs to be parallelized. A parallelized BPN algorithm was implemented in this work to produce the groundwater level prediction. The input variables are the parameters that influence the predicted output; temperature, rainfall, and groundwater level are used for the input layer. Network training generally uses one hidden layer. The depth of groundwater is taken as the output, and all parameters are normalized to the range 0.1-0.9. The sigmoid activation function is used, as its output ranges between 0 and 1 and it is well suited to prediction.

[Figure 4 image, flow chart steps: initialize weights and biases; unified memory prefetching for the forward pass; forward pass (call matrix multiplication kernel); unified memory prefetching for the backward pass; backward pass (call matrix multiplication kernel); update weights and biases; repeat until the iterations are completed]
Figure 4. Flow chart for parallelized BPN algorithm
Notation:
Xi: input matrix
Wi,j: connection weights between layers
Tj: target output (future groundwater level)
Oj: actual output
Ej: calculated error
r: learning rate
θj: threshold
Maximum number of epochs (iterations): 100

Algorithm: proposed algorithm for parallel backward propagation on GPU

Initialize all weights and biases, typically between 0 and 1.
Step 1: do { // repeat for i = 1 to the number of iterations
    for j = 1 to pattern do { // for every pattern in the training set
    for each input layer node j { Oj = Netj; }
Step 2: for each hidden/output layer node j {
    cudaMallocManaged(&X, N*sizeof(float));
    cudaMallocManaged(&W, N*sizeof(float));
Step 3: initialize the data on the CPU for the input pattern and weights using the function
    cudaMemAdvise(X, count, advice, CPUdeviceId);
    cudaMemAdvise(W, count, advice, CPUdeviceId);
Step 4: prefetch unified memory for the forward pass from host to GPU using the function
    cudaMemPrefetchAsync(X, N*sizeof(float), device, NULL);
    cudaMemPrefetchAsync(W, N*sizeof(float), device, NULL);
Step 5: define the grid and blocks before calling a kernel
    NetSumj = MatrixMultKernel<<<blocksPerGrid, threadsPerBlock>>>(Oj, Xi);
    // 16 threads per block and 100 blocks per grid were used
Step 6: calculate the weighted sum of the inputs to the node by launching MatrixMultKernel() to multiply the two matrices using the tiling technique with coalesced shared memory access;
Step 7: add the threshold to the sum and calculate the activation for the node
    Netj = NetSumj + θj;
    Oj = 1 / (1 + e^(-Netj));
    }
Step 8: propagate the errors backward through the network; for every node j in the output layer, calculate the error
    Ej = Oj (1 - Oj) (Tj - Oj);
Step 9: prefetch memory from GPU to host using the function
    cudaMemPrefetchAsync(X, N*sizeof(float), device, NULL);
    cudaMemPrefetchAsync(W, N*sizeof(float), device, NULL);
Step 10: save the results on the GPU using the function
    cudaMemAdvise(E, count, advice, GPUdeviceId);
Step 11: repeat Step 2 to Step 7 for the hidden layer
Step 12: update the weights and biases; for each weight Wi,j and bias θj {
    ΔWi,j = r Ej Xi; Wi,j = Wi,j + ΔWi,j;
    Δθj = r Ej; θj = θj + Δθj;
    }
    } // end of the loop over patterns
Step 13: calculate the global error E = 1/2 Σk (Tk - Ok)^2
Step 14: prefetch memory from GPU to host and save the results back on the GPU
    cudaMemPrefetchAsync(E, N*sizeof(float), device, NULL);
    cudaMemPrefetchAsync(W, N*sizeof(float), device, NULL);
    cudaMemAdvise(E, count, advice, GPUdeviceId);
Step 15: } while ((number of iterations < specified maximum) AND (E > specified threshold))
End of Algorithm

4. RESULTS AND DISCUSSION

The parallelized back-propagation algorithm was implemented on CUDA version 10.1 using Google Colab. The data set was taken from [32]; it comprises 120 rows covering 2001-2020, i.e., six readings per year, skipping one month between two readings. Of these, 90 rows were used for training and 30 rows for testing. The prediction was made for the next seven readings.

Google Colab is a data science research tool from Google. It is a freely available service that offers Jupyter Notebook for experimentation. Users can access a variety of machine learning libraries as well as hardware accelerators [33]. Google is thereby lowering the barriers to entry into deep learning, and researchers who do not have access to a large quantity of GPU resources can benefit from this tool. It allows GPU access for 12 hours at a time. To use a GPU-enabled notebook backend, open Google Colab, click Runtime, choose Change runtime type, and set the hardware accelerator to GPU. An NVIDIA Tesla T4 with 2560 CUDA cores and CUDA version 11.2 was used to obtain the results. The NVIDIA system management interface is depicted in Figure 5.
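A minimal sketch of the data preparation described in this paper: scaling each series into the range 0.1 to 0.9 (Section 3) and splitting the 120 readings chronologically into 90 training rows and 30 test rows. The helper names and the synthetic readings are hypothetical. The paper does not give its scaling formula, so the linear min-max form used here is an assumption.

```python
def scale_to_range(values, low=0.1, high=0.9):
    """Linearly map values so that min(values) -> low and max(values) -> high."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A constant series has no spread; place it at the middle of the range.
        return [(low + high) / 2.0 for _ in values]
    span = v_max - v_min
    return [low + (high - low) * (v - v_min) / span for v in values]


def chronological_split(rows, n_train):
    """Keep time order: the first n_train rows train the model, the rest test it."""
    return rows[:n_train], rows[n_train:]


if __name__ == "__main__":
    # Synthetic stand-in for 120 groundwater readings (not the real data).
    readings = [15.0 + 0.2 * i for i in range(120)]
    scaled = scale_to_range(readings)
    train, test = chronological_split(scaled, n_train=90)
    print(len(train), len(test))
    print(round(min(scaled), 3), round(max(scaled), 3))
```

A chronological split is used in the sketch because the readings form a time series; shuffling before the split would let later readings leak into training.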
This section presents the outcomes and interprets the resulting graphs for training execution time, accuracy, error, model loss, and prediction on the GPU. The GPU results are clearly distinguishable from the CPU results. Figure 6 shows the plot for the dataset of 120 readings. The X-axis represents the observed months and the Y-axis the groundwater level in meters. The blue line represents the complete dataset of 120 readings, the orange line shows the model output on the first 90 readings used for training, and the green line represents the test data predicted by the model for the last 30 readings. Figures 7(a) and 7(b) show the execution time and mean squared error (MSE) with an increasing number of epochs for both CPU and GPU. The parallelized algorithm on the GPU shows better performance, with a lower error rate and execution time.

[Figure 5 image: nvidia-smi output dated Wed Jun 9 2021 showing NVIDIA-SMI 465.27, CUDA Version 11.2, and one Tesla T4 GPU]
Figure 5. NVIDIA system management interface

[Figure 6 image: training and testing plot; X-axis: observed months (0 to 120); Y-axis: groundwater level]
Figure 6.
Training and prediction

[Figure 7 image: (a) execution time versus epochs and (b) mean squared error versus epochs, each plotted for CPU and GPU over epochs 1 to 100]
Figure 7. Comparing results for CPU vs. GPU (a) execution time comparison and (b) mean squared error comparison

Table 1 lists the errors calculated when predicting the groundwater level, which are used to evaluate the performance of the learning algorithm. The mean absolute error (MAE) and mean squared error (MSE) are used to check the quality of the regression. The root mean square error (RMSE) is the square root of the MSE and reflects the standard deviation of the prediction errors over the data set records; the values in Table 1 are consistent with this relation. For evaluating model performance in the weather sciences and in predicting atmospheric conditions, RMSE is the standard metric, while MAE is useful for comparing different models [34].

Table 1. Computational error

| Error type | Value |
|---|---|
| Mean Absolute Error | 0.0696268 |
| Mean Squared Error | 0.0051229 |
| Root Mean Squared Error | 0.0715743 |

Figure 8 shows the execution time taken by both CPU and GPU to predict the groundwater level for the twenty-year dataset. It is clear from the figure that the time taken by the GPU using the parallelized BPN algorithm is less than the time taken by the CPU alone for the same data set. The comparison between CPU and GPU in terms of total execution time, average time per step, and memory used is shown in Table 2.

[Figure 8 image: console output reporting "with CPU time taken in seconds: 10.076101181000013" and "with GPU time taken in seconds: 0.7781454485542905"]
Figure 8. CPU vs. GPU execution time

Table 2. CPU vs.
GPU

| Package | CPU | GPU |
|---|---|---|
| Total time [sec] | 10.07610 | 0.77814 |
| Average seconds/step | 0.014 | 0.006 |
| Memory used | 0.99 GB | 1.54 GB |

5. CONCLUSION

Based on the results of this research, it can be concluded that the proposed parallelized back-propagation method on GPU predicts groundwater levels in the Faridabad region faster than the CPU alone. The CPU execution time for training and testing the network is approximately 10.08 seconds, whereas the GPU execution time is approximately 0.78 seconds, which by the timings in Table 2 is a reduction of roughly 92% (about a 13-fold speedup). The parallelized implementation on the GPU therefore performs better than the CPU alone, with a mean absolute error of 0.0696268.

6. FUTURE WORK

Future work includes extending the parallelized back-propagation algorithm to other real-world problems to boost hardware acceleration for ANN research. To exploit faster GPUs, more algorithms must be strengthened through parallelization.

REFERENCES

[1] K. A. N. Adiat, O. F. Ajayi, A. A. Akinlalu, and I. B. Tijani, “Prediction of groundwater level in basement complex terrain using artificial neural network: a case of Ijebu-Jesa, southwestern Nigeria,” Applied Water Science, vol. 10, no. 1, Nov. 2020, doi: 10.1007/s13201-019-1094-6.
[2] T. Roshni, M. K. Jha, and J. Drisya, “Neural network modeling for groundwater-level forecasting in coastal aquifers,” Neural Computing and Applications, vol. 32, no. 16, pp. 12737-12754, Jan. 2020, doi: 10.1007/s00521-020-04722-z.
[3] Y. Go, M. Jamshed, Y. G. Moon, C. Hwang, and K. S.
Park, “Apunet: revitalizing GPU as packet processing accelerator,” in Proceedings of the 14th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2017, 2017, pp. 83-96.
[4] Z. Zheng et al., “Gen: A GPU-accelerated elastic framework for NFV,” in ACM International Conference Proceeding Series, 2018, pp. 57-64, doi: 10.1145/3232565.3234510.
[5] I. M. Coelho, V. N. Coelho, E. J. d. S. Luz, L. S. Ochi, F. G. Guimaraes, and E. Rios, “A GPU deep learning metaheuristic based model for time series forecasting,” Applied Energy, vol. 201, pp. 412-418, Sep. 2017, doi: 10.1016/j.apenergy.2017.01.003.
[6] K. Raju and N. N. Chiplunkar, “A survey on techniques for cooperative CPU-GPU computing,” Sustainable Computing: Informatics and Systems, vol. 19, pp. 72-85, Sep. 2018, doi: 10.1016/j.suscom.2018.07.010.
[7] M. Dashti and A. Fedorova, “Analyzing memory management methods on integrated CPU-GPU systems,” in International Symposium on Memory Management, ISMM, Jun. 2017, vol. Part F1286, pp. 59-69, doi: 10.1145/3092255.3092256.
[8] D. T. V. D. Rao and K. V. Ramana, “Accelerating training of deep neural networks on GPU using CUDA,” International Journal of Intelligent Systems and Applications, vol. 11, no. 5, pp. 18-26, May 2019, doi: 10.5815/ijisa.2019.05.03.
[9] J. Jung, D. Park, Y. Do, J. Park, and J. Lee, “Overlapping host-to-device copy and computation using hidden unified memory,” in Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP, Feb. 2020, pp. 321-335, doi: 10.1145/3332466.3374531.
[10] H. Xu, M. Emani, P.-H. Lin, L. Hu, and C. Liao, “Machine learning guided optimal use of GPU unified memory,” in 2019 IEEE/ACM Workshop on Memory Centric High Performance Computing (MCHPC), Nov. 2019, pp. 64-70, doi: 10.1109/MCHPC49590.2019.00016.
[11] J. Zhang, S. H. Yeung, Y. Shu, B. He, and W. Wang, “Efficient memory management for GPU-based deep learning systems,” arXiv preprint arXiv:1903.06631, 2019.
[12] X. Sierra-Canto, F. Madera-Ramirez, and V.
Uc-Cetina, “Parallel training of a back-propagation neural network using CUDA,” in 2010 Ninth International Conference on Machine Learning and Applications, Dec. 2010, pp. 307-312, doi: 10.1109/ICMLA.2010.52.
[13] A. O. Jimale, F. Ridzuan, and W. M. N. Wan Zainon, “Square matrix multiplication using CUDA on GP-GPU,” Procedia Computer Science, vol. 161, pp. 398-405, 2019, doi: 10.1016/j.procs.2019.11.138.
[14] S. Shi, Q. Wang, and X. Chu, “Performance modeling and evaluation of distributed deep learning frameworks on GPUs,” in 2018 IEEE 16th Intl Conf on Dependable, Autonomic and Secure Computing, 16th Intl Conf on Pervasive Intelligence and Computing, 4th Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech), Nov. 2017, pp. 949-957, doi: 10.1109/DASC/PiCom/DataCom/CyberSciTec.2018.000-4.
[15] A. Meister and G. Saake, GPU-accelerated dynamic programming for join-order optimization. Fakultät für Informatik, Otto-von-Guericke-Universität Magdeburg, 2020.
[16] C. Yang et al., “Adaptive optimization for petascale heterogeneous CPU/GPU computing,” in 2010 IEEE International Conference on Cluster Computing, Sep. 2010, pp. 19-28, doi: 10.1109/CLUSTER.2010.12.
[17] B. Chen, T. Medini, J. Farwell, C. Tai, and A. Shrivastava, “Slide: in defense of smart algorithms over hardware acceleration for large-scale deep learning systems,” in Proceedings of the 3rd MLSys Conference, 2020, vol. 91, no. 4, doi: 1903.03129.
[18] A. Svyatkovskiy, J. Kates-Harbeck, and W. Tang, “Training distributed deep recurrent neural networks with mixed precision on GPU clusters,” in Proceedings of the Machine Learning on HPC Environments, Nov. 2019, pp. 1-8, doi: 10.1145/3146347.3146358.
[19] K. Ganeshamoorthy and N.
Ratnarajah, “On the performance of parallel back-propagation neural network implementations using CUDA,” in Proceedings of the 32nd International Conference on Computers and Their Applications, CATA 2017, 2017, pp. 85-92.
[20] Q. H. Nguyen et al., “A novel hybrid model based on a feedforward neural network and one step secant algorithm for prediction of load-bearing capacity of rectangular concrete-filled steel tube columns,” Molecules, vol. 25, no. 15, Jul. 2020, doi: 10.3390/molecules25153486.
[21] V. K. Ojha, P. Dutta, H. Saha, and S. Ghosh, “Detection of proportion of different gas components present in manhole gas mixture using backpropagation neural network,” in International Proceedings of Computer Science and Information Technology, 2012, vol. 37, no. Icint, pp. 11-15.
[22] T.-A. Nguyen, H.-B. Ly, and B. T. Pham, “Backpropagation neural network-based machine learning model for prediction of soil friction angle,” Mathematical Problems in Engineering, vol. 2020, pp. 1-11, Dec. 2020, doi: 10.1155/2020/8845768.
[23] F. Izhari, M. Zarlis, and Sutarman, “Analysis of backpropagation neural neural network algorithm on student ability based cognitive aspects,” IOP Conference Series: Materials Science and Engineering, vol. 725, no. 1, Jan. 2020, doi: 10.1088/1757-899X/725/1/012103.
[24] M. Gupta et al., “Compiler techniques to reduce the synchronization overhead of GPU redundant multithreading,” in Proceedings of the 54th Annual Design Automation Conference 2017, Jun. 2017, pp. 1-6, doi: 10.1145/3061639.3062212.
[25] N. Singh and S. P. Panda, “Enhancing the proficiency of artificial neural network on prediction with GPU,” in Proceedings of the International Conference on Machine Learning, Big Data, Cloud and Parallel Computing: Trends, Prespectives and Prospects, COMITCon 2019, Feb. 2019, pp. 67-71, doi: 10.1109/COMITCon.2019.8862440.
[26] G. Bansal, C. J. Newburn, and P. Besl, “Fast matrix computations on heterogeneous streams,” in High Performance Parallelism Pearls, vol. 2, Elsevier, 2015, pp.
271-304, doi: 10.1016/B978-0-12-803819-2.00011-2.
[27] S. E. Kurt, A. Sukumaran-Rajam, F. Rastello, and P. Sadayyapan, “Efficient tiled sparse matrix multiplication through matrix signatures,” in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, Nov. 2020, pp. 1-14, doi: 10.1109/SC41405.2020.00091.
[28] T. Athil, R. Christian, and Y. B. Reddy, “CUDA memory techniques for matrix multiplication on quadro 4000,” in 2014 11th International Conference on Information Technology: New Generations, Apr. 2014, pp. 419-425, doi: 10.1109/ITNG.2014.24.
[29] M. Knap and P. Czarnul, “Performance evaluation of unified memory with prefetching and oversubscription for selected parallel CUDA applications on NVIDIA pascal and volta GPUs,” The Journal of Supercomputing, vol. 75, no. 11, pp. 7625-7645, Nov. 2019, doi: 10.1007/s11227-019-02966-8.
[30] S. Ashkiani, A. Davidson, U. Meyer, and J. D. Owens, “GPU multisplit: an extended study of a parallel algorithm,” ACM Transactions on Parallel Computing, vol. 4, no. 1, pp. 1-44, Jan. 2017, doi: 10.1145/3108139.
[31] X. Sun, L.-F. Lai, P. Chou, L.-R. Chen, and C.-C. Wu, “On GPU Implementation of the island model genetic algorithm for solving the unequal area facility layout problem,” Applied Sciences, vol. 8, no. 9, Sep. 2018, doi: 10.3390/app8091604.
[32] IWRIS, “India water resources information system,” 2022. Accessed: Sep. 06 2021 [Online]. Available: https://indiawris.gov.in/wris/#/groundWater.
[33] T. Carneiro, R. V. M. Da Nobrega, T. Nepomuceno, G. Bin Bian, V. H. C. De Albuquerque, and P. P. R. Filho, “Performance analysis of google colaboratory as a tool for accelerating deep learning applications,” IEEE Access, vol. 6, pp. 61677-61685, 2018, doi: 10.1109/ACCESS.2018.2874767.
[34] T. Chai and R. R. Draxler, “Root mean square error (RMSE) or mean absolute error (MAE)?
– Arguments against avoiding RMSE in the literature,” Geoscientific Model Development, vol. 7, no. 3, pp. 1247-1250, Jun. 2014, doi: 10.5194/gmd-7-1247-2014.

BIOGRAPHIES OF AUTHORS

Neeru Singh is currently pursuing a Ph.D. in the Department of Computer Science at Manav Rachna International Institute of Research and Studies (MRIIRS), Faridabad, India. She is working as an assistant professor at Rawal Institute of Engineering and Technology, Faridabad, and has 5 years of teaching experience. She received her M.Tech. from Maharishi Dayanand University, Rohtak. Her research areas are artificial neural networks and heterogeneous computing. Contact her at email: neeruksingh123@gmail.com.

Supriya Priyabadini Panda is working as a professor and Head of the Department of Computer Science and Engineering at Manav Rachna International Institute of Research and Studies (MRIIRS), Faridabad, India. She has 35+ years of experience in the teaching field. She guides M.Tech. and Ph.D. students in a variety of fields. She received her Ph.D. from Ohio University, USA. Contact her at email: supriya.fet@mriu.edu.in.